Using AI to generate test suites — what works, what doesn't
I've been using AI-assisted coding tools for test generation for a while now, and my take is probably not what you'd expect. It's not "AI will replace test writing" and it's not "AI-generated tests are useless." The truth is messier and more interesting than either extreme.
What actually works well
Boilerplate and scaffolding. Setting up test files, configuring mocks, writing the structural bits that every test needs — this is where AI shines. I can describe a class and get a test file with proper imports, setup/teardown methods, and a reasonable structure in seconds. That used to take me 5-10 minutes of tedious typing.
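To make that concrete, here's the kind of scaffold I mean. This is a hypothetical Jest/TypeScript sketch; UserService and UserRepository are invented names standing in for whatever class you describe.

```typescript
// user-service.test.ts: the kind of structure AI produces in seconds.
// UserService and UserRepository are hypothetical stand-ins.
import { UserService } from "./user-service";
import { UserRepository } from "./user-repository";

jest.mock("./user-repository");

describe("UserService", () => {
  let repository: jest.Mocked<UserRepository>;
  let service: UserService;

  beforeEach(() => {
    repository = new UserRepository() as jest.Mocked<UserRepository>;
    service = new UserService(repository);
  });

  afterEach(() => {
    jest.clearAllMocks();
  });

  it("creates a user with valid input", async () => {
    // happy path body to fill in (or generate) next
  });
});
```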
Happy path tests. Give Copilot or an LLM a function signature and a brief description, and it'll generate solid happy path coverage. CRUD operations, straightforward transformations, simple validation — these come out surprisingly well. The tests are readable, they follow conventions, and they actually catch regressions.
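A representative example (the function is made up, but the shape is typical): describe something like applyDiscount(price, percent) and you'll get tests along these lines.

```typescript
// applyDiscount is a hypothetical pure function: (price: number, percent: number) => number
import { applyDiscount } from "./pricing";

describe("applyDiscount", () => {
  it("applies a 10% discount to a round price", () => {
    expect(applyDiscount(100, 10)).toBe(90);
  });

  it("returns the original price for a 0% discount", () => {
    expect(applyDiscount(50, 0)).toBe(50);
  });
});
```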
Repetitive patterns. If you're testing 15 API endpoints that all follow the same pattern — validate input, call service, return response — AI can crank those out faster than any human. You write the first one, and it extrapolates the rest.
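Here's a sketch of what that looks like in practice, assuming an Express-style app and the supertest library; the paths and expected statuses are placeholders.

```typescript
// Parameterized "same shape, different endpoint" tests.
// app is a hypothetical Express application; the paths are placeholders.
import request from "supertest";
import { app } from "./app";

describe("list endpoints", () => {
  test.each([
    ["/api/users", 200],
    ["/api/orders", 200],
    ["/api/invoices", 200],
  ])("GET %s returns %i", async (path, expectedStatus) => {
    await request(app).get(path).expect(expectedStatus);
  });
});
```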
AI-generated tests are excellent first drafts. The mistake is treating them as final drafts.
What doesn't work
Edge cases. This is the big one. AI-generated tests almost never cover the weird stuff — null values in nested objects, concurrent access patterns, timezone-related bugs, off-by-one errors in pagination. These are the tests that actually save you from production incidents, and they require understanding the domain, not just the code.
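For contrast, this is the kind of edge-case test I end up writing by hand. The paginate helper is hypothetical, but the off-by-one and past-the-end cases are exactly the sort of thing generated suites skip.

```typescript
// paginate is a hypothetical helper: (items: T[], page: number, pageSize: number) => T[]
// (pages assumed to be 1-indexed here).
import { paginate } from "./paginate";

describe("paginate edge cases", () => {
  it("returns the single leftover item on the last page", () => {
    const items = Array.from({ length: 21 }, (_, i) => i);
    expect(paginate(items, 3, 10)).toEqual([20]); // off-by-one check
  });

  it("returns an empty array, not undefined, for a page past the end", () => {
    expect(paginate([1, 2, 3], 5, 10)).toEqual([]);
  });
});
```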
Integration tests. Getting an LLM to write a meaningful integration test that sets up realistic state, exercises a real workflow, and asserts on the right outcomes? I haven't seen it done well yet. The generated tests either mock too much (defeating the purpose) or assume an environment that doesn't exist.
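A caricature of the over-mocked version, with made-up module names. Every collaborator is stubbed, so the test never touches anything real.

```typescript
// createOrder, payment-gateway, and inventory are hypothetical modules.
// With everything mocked, this "integration" test only confirms that the
// mocks return what we configured them to return.
import { createOrder } from "./orders";

jest.mock("./payment-gateway", () => ({
  charge: jest.fn().mockResolvedValue({ ok: true }),
}));
jest.mock("./inventory", () => ({
  reserve: jest.fn().mockResolvedValue(true),
}));

it("creates an order", async () => {
  const order = await createOrder({ sku: "ABC-1", qty: 1 });
  expect(order).toBeDefined(); // passes regardless of any real integration
});
```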
Nuanced business logic. If your settlement calculation has special handling for leap years, partial refunds, and multi-currency rounding rules, no AI tool is going to infer those test cases from the code alone. It needs context that lives in product specs, Slack conversations, and the developer's head.
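If I were writing the leap-year case by hand, it might look like this. dailyAccrual is a made-up function, and the 366-day divisor is exactly the kind of rule you only learn from the spec, not from the code's shape.

```typescript
// dailyAccrual is hypothetical: (principal: number, annualRate: number, year: number) => number
import { dailyAccrual } from "./settlement";

it("divides by 366, not 365, in a leap year", () => {
  // Illustrative numbers: 10,000 at 3.66% annual accrues exactly 1.00 per day
  // in 2024 (366 days); a 365-day divisor would give ~1.0027 instead.
  expect(dailyAccrual(10_000, 0.0366, 2024)).toBeCloseTo(1.0, 4);
});
```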
The false confidence problem
This is what worries me most. I've reviewed PRs where a developer pointed to 90% test coverage as proof their code was solid. When I looked closer, the AI-generated tests were asserting on... basically nothing. Tests that call a function and check that it doesn't throw. Tests that verify a response is "not null" without checking the actual values. Tests that mock every dependency so thoroughly that the test is just exercising the mocking framework.
Coverage numbers lie when the assertions are weak. A test that passes regardless of the implementation is worse than no test at all, because it gives you confidence you haven't earned.
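Side by side, the difference looks roughly like this. getInvoice and the values are invented, but the weak version is a faithful caricature of what I keep finding.

```typescript
// getInvoice is a hypothetical async lookup; the values are illustrative.
import { getInvoice } from "./billing";

// Weak: passes for almost any implementation that returns something.
it("returns an invoice", async () => {
  const invoice = await getInvoice("inv_123");
  expect(invoice).not.toBeNull();
});

// Stronger: pins down the values that actually matter.
it("totals the line items and keeps the invoice open", async () => {
  const invoice = await getInvoice("inv_123");
  expect(invoice.total).toBe(149.97);
  expect(invoice.lineItems).toHaveLength(3);
  expect(invoice.status).toBe("open");
});
```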
How I actually use AI for tests
Here's my workflow:
- Generate the scaffolding. Let AI set up the test file, imports, and basic structure.
- Generate happy path tests. Accept these with minor tweaks — they're usually 80% right.
- Write edge case tests myself. This is where the real thinking happens. I look at the code and ask: what breaks? What are the boundary conditions? What assumptions am I making?
- Use AI to generate variations. Once I have one good edge case test, I'll ask AI to generate similar variations: "Now test this with negative amounts. Now test with amounts exceeding the maximum. Now test with zero." There's a sketch of this step after the list.
- Review every assertion. I read every assert or expect statement the AI wrote. If it's asserting something trivial, I strengthen it or delete the test.
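To make the variation step concrete: a made-up applyRefund function and MAX_REFUND constant, with one hand-written edge case and the AI-suggested variations of the same shape.

```typescript
// applyRefund and MAX_REFUND are hypothetical: (balance: number, amount: number) => number
import { applyRefund, MAX_REFUND } from "./refunds";

describe("applyRefund edge cases", () => {
  // The hand-written edge case:
  it("rejects a negative refund amount", () => {
    expect(() => applyRefund(100, -5)).toThrow();
  });

  // AI-generated variations of the same shape:
  it("rejects an amount above the maximum", () => {
    expect(() => applyRefund(100, MAX_REFUND + 1)).toThrow();
  });

  it("leaves the balance unchanged for a zero refund", () => {
    expect(applyRefund(100, 0)).toBe(100);
  });
});
```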
The bottom line
AI test generation is a productivity multiplier, not a quality multiplier. It makes you faster at producing tests, but it doesn't make the tests better. The thinking — what to test, why it matters, what could go wrong — still has to come from you.
I'd estimate AI saves me about 30-40% of the time I used to spend writing tests. That's significant. But the time I save on typing, I reinvest in thinking about what to test. That's the part that actually prevents bugs, and it's the part no tool can do for you yet.
Use AI-generated tests as a starting point. A good starting point. Just never mistake the starting point for the finish line.