Toward automated verification of unreviewed AI-generated code

[−] jryio 60d ago

This is a naïve approach, not just because it uses FizzBuzz, but because it ignores the fundamental complexity of software as a system of abstractions. Testing often involves understanding these abstractions and testing for/against them.

Those of us with decades of experience who use coding agents for hours per day have learned that, even with extended context engineering, these models do not magically cover more than 50% of the testing space.

If you asked your coding agent to develop a memory allocator, it would not also 'automatically verify' the memory allocator against all failure modes. It is your responsibility as an engineer to have long-term learning and regular contact with the world to inform the testing approach.

[−] spaceywilly 60d ago

Exactly. The challenge isn’t getting the LLMs to make sure they validate their own code. It’s getting the LLMs to write the correct code in the first place. Adding more and more LLM-generated test code just obfuscates the LLM code even further. I have seen some really wild things where the LLM jumps through hoops to get tests to pass, even when they actually should be failing because the logic is wrong.

The core of the issue is that LLMs are sycophants: they want to make the user happy above all. The most important thing is to make sure what you are asking the LLM to do is correct from the beginning. I’ve found the highest-value activity is in the planning phase.

When I have gotten good results with Claude Code, it’s because I spent a lot of time working with it to generate a detailed plan of what I wanted to build. Then by the time it got to the coding step, actually writing the code is trivial because the details have all been worked out in the plan.

It’s probably not a coincidence that when I have worked in safety critical software (DO-178), the process looks very similar. By the time you write a line of code, the requirements for that line have been so thoroughly vetted that writing the code feels like an afterthought.

[−] bisonbear 60d ago

I'm becoming convinced that test pass rate is not a great indicator of model quality - instead we have to look at agent behavior beyond the test gate, such as how aligned it is with human intent and whether it follows the repo's coding standards.

I wrote a short blog about this phenomenon here if you're interested https://www.stet.sh/blog/both-pass

also +1 on placing heavy emphasis on the plan. if you have a good plan, then the code becomes trivial. I have started doing a 70/30 or even 80/20 split of time spent on plan / time implementing & reviewing

[−] mvrckhckr 60d ago

The best way I can describe the approach I take is having the ability to "smell" what the AI might have gotten wrong (or forgotten completely).

It happens all the time, even when I only scan the code or simply run it and use it. It's uncanny how many such "smells" I find even with the most trivial applications. Sometimes its replies in Codex or Claude Code are enough to trigger it.

These are mistakes only a very (very) inexperienced developer would make.

[−] seanmcdirmid 60d ago

If you wrote a spec for a memory allocator and asked the AI to identify edge cases and points that need to be tested first, it could work (I have never asked an AI to do that, but it has worked for other problems I’ve done). But yes, you can’t feed in a garbage prompt and context and expect good tests to magically come out of that.

[−] raw_anon_1111 60d ago

He’s saying you should write or at least have the LLM write the tests and you carefully review the tests and not the code.

[−] skydhash 60d ago

That’s like saying that to trace a spline, you only need to place a few points and carefully verify that the spline passes through those points, without verifying the actual formula for the spline.

Or in other words: tests only guarantee their own results, not the code. A test has value because you know the code is trying to solve the general problem, not just satisfy the test’s assertions.
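
A minimal Python sketch of this point. The names (`fizzbuzz_general`, `fizzbuzz_overfit`, `EXAMPLES`) are hypothetical and not from the linked article. It shows an implementation that satisfies every listed assertion while solving nothing beyond them:

```python
# Hypothetical illustration: both functions pass the same example-based
# tests, but only one of them solves the general problem.

def fizzbuzz_general(n: int) -> str:
    """Standard FizzBuzz for any positive integer."""
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


# The "few points" that the tests pin down.
EXAMPLES = {1: "1", 3: "Fizz", 5: "Buzz", 15: "FizzBuzz"}


def fizzbuzz_overfit(n: int) -> str:
    """Memorizes the tested inputs and falls back to str(n) elsewhere."""
    return EXAMPLES.get(n, str(n))


def passes_examples(impl) -> bool:
    """Run the example-based test suite against an implementation."""
    return all(impl(n) == expected for n, expected in EXAMPLES.items())


def first_disagreement(limit: int = 100):
    """Smallest n in 1..limit where the two implementations differ, else None."""
    for n in range(1, limit + 1):
        if fizzbuzz_overfit(n) != fizzbuzz_general(n):
            return n
    return None
```

Both implementations pass `passes_examples`, yet `first_disagreement()` finds an input where they diverge. Reviewing only the assertions would not reveal which of the two you actually have.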

[−] raw_anon_1111 60d ago

That’s a horrible analogy. He specifically said he was designing and validating the tests based on his knowledge of what the goal of the project was.

[−] maplethorpe 59d ago

Have you tried Claude 4.6 Opus? I think it might be able to do what you're suggesting.

[−] tedivm 60d ago

While I understand why people want to skip code reviews, I think it is an absolute mistake at this point in time. I think AI coding assistants are great, but I've seen them fail or go down the wrong path enough times (even with things like spec driven development) that I don't think it's reasonable to not review code. Everything from development paths in production code to improper implementations and security risks: all of those are just as likely to happen with an AI as with a human, and any team that lets humans push to production without a review would absolutely be ridiculed for it.

Again, I'm not opposed to AI coding. I know a lot of people are. I have multiple open source projects that were 100% created with AI assistants, and wrote a blog post about it you can see in my post history. I'm not anti-ai, but I do think that developers have some responsibility for the code they create with those tools.

[−] fcatalan 60d ago

A couple weeks ago on a lark I asked Claude/Gemini/Codex to hallucinate a language they would like to program in and they always agreed on strong types, contracts, verification, proving and testing. So they ended up brainstorming a weird Forth-like with all those on top. I then kept prodding for an implementation and burned my weekly token budget until a lot of the language worked. They called it Cairn.

So now I prompted this: "can you generate a fizzbuzz implementation in Cairn that showcases as much as possible the TEST/PROVE/VERIFY characteristics of the language? "

Producing this (working) monstrosity (can't paste here, it's 200+ lines of crazy): https://gist.github.com/cairnlang/a7589de126b14e50a53b9bdc28...

[−] pron 60d ago

> The code must pass property-based tests

Who writes the tests? It can be ok to trust code that passes tests if you can trust the tests.

There are, however, other problems. I frequently see agents write code that's functionally correct but that they won't be able to evolve for long. That's also what happened with Anthropic's failed attempt to have agents write a C compiler (not a trivial task, but far from an exceptionally difficult one). They had thousands of good human-written tests, but the agents couldn't get the software to converge. They fixed one bug only to create another.

[−] jghn 60d ago

I do think that GenAI will lead to a rise in mutation testing, property testing, and fuzzing. But it's worth people keeping in mind that there are reasons why these aren't already ubiquitous. Among other issues, they can be computationally expensive, especially mutation testing.
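
To make the cost concrete, here is a hand-rolled sketch in Python. The mutants, the two suites, and all names are hypothetical; real mutation tools derive mutants automatically from the source. Every mutant needs its own run of the suite, so the work grows roughly as mutants times tests:

```python
# Hand-rolled mutation testing sketch. All names here are illustrative
# assumptions, not the API of any real mutation testing tool.
from typing import Callable, Dict, List


def is_fizz(n: int) -> bool:
    """Code under test: True when n is divisible by 3."""
    return n % 3 == 0


# Hypothetical mutants: small deliberate bugs injected into is_fizz.
MUTANTS: Dict[str, Callable[[int], bool]] = {
    "modulus 3 -> 4": lambda n: n % 4 == 0,
    "== -> !=": lambda n: n % 3 != 0,
    "0 -> 1": lambda n: n % 3 == 1,
    "always True": lambda n: True,
}


def weak_suite(f: Callable[[int], bool]) -> bool:
    """A single positive example."""
    return f(12)


def strong_suite(f: Callable[[int], bool]) -> bool:
    """Positive and negative examples."""
    return f(12) and f(3) and not f(4)


def surviving_mutants(suite: Callable[[Callable[[int], bool]], bool]) -> List[str]:
    """Names of mutants the suite fails to kill (the suite still passes on them)."""
    return [name for name, mutant in MUTANTS.items() if suite(mutant)]
```

Here the weak suite passes on the original code but lets two mutants survive, while the strong suite kills all four. Surviving mutants are the signal that a suite is under-constraining the code.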

[−] duskdozer 60d ago

So are we finally past the stage where people pretend they're actually reading any of the code their LLMs are dumping out?

[−] sharkjacobs 60d ago

I'm having a hard time wrapping my head around how this can scale beyond trivial programs like simplified FizzBuzz.

[−] otabdeveloper4 60d ago

This one is pretty easy!

Just write your business requirements in a clear, unambiguous and exhaustive manner using a formal specification language.

Bam, no coding required.

[−] agentultra 60d ago

This might work on small, self contained projects.

No side effects is a hefty constraint.

Systems tend to have multiple processes all using side effects. There are global properties of the system that need specification and tests are hard to write for these situations. Especially when they are temporal properties that you care about (eg: if we enter the A state then eventually we must enter the B state).

When such guarantees involve multiple processes, even property tests aren’t going to cover you sufficiently.
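
As a small illustration of that temporal property, here is a Python sketch that checks "every A is eventually followed by B" against one recorded, finite trace. The function name `responds` and the state labels are assumptions for illustration. Note what it cannot do: it only judges the trace it is given, treating the end of the trace as the deadline, so it says nothing about interleavings that were never observed.

```python
# Sketch of a response-property check over a single finite trace.
# `responds` and the state labels are hypothetical names for illustration.
from typing import Sequence


def responds(trace: Sequence[str], trigger: str = "A", response: str = "B") -> bool:
    """True if every `trigger` in the trace is followed later by a `response`."""
    pending = False  # a trigger has been seen with no response after it yet
    for state in trace:
        if state == trigger:
            pending = True
        elif state == response:
            pending = False
    return not pending
```

For example, `responds(["idle", "A", "work", "B"])` holds, while `responds(["A", "B", "A"])` does not, because the final A never gets its B.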

Worse, when it falls over at 3am and you’ve never read the code… is the plan to vibe code a big fix right there? Will you also remember to modify the specifications first?

Good on the author for trying. Correctness is hard.

[−] keithnz 60d ago

I've been working on a "vibe coded" project to create an open source TUI SQL query tool a bit like DataGrip, with autocomplete, syntax highlighting, schema introspection, vim/non-vim modes, an MCP mode so an agent can help with queries and get results, row editing, etc. It's mostly an experiment in how to build software from scratch via an agent without looking at the code (other than to see what decisions it's making), and I wanted something reasonably complicated so the requirements evolve and change over time. I find a couple of issues: many bugs are unspecified edge cases, especially because many of the features "combo" together, and it's hard for the agent to maintain consistency across the UI. You start setting up a lot more context for cross-cutting concerns, self-review, and testing. The tool itself is actually really useful and it is my main tool for querying our DBs now. Most of the problems I find are due to "sloppy" prompting (or just not thinking through the edge cases), and a lack of project-wide guidance for dealing with the architecture of the system to maintain consistency across the same concerns.

[−] phailhaus 60d ago

Using FizzBuzz as your proxy for "unreviewed code" is extremely misleading. It has practically no complexity, it's completely self-contained and easy to verify. In any codebase of even modest complexity, the challenge shifts from "does this produce the correct outputs" to "is this going to let me grow the way I need it to in the future" and thornier questions like "does this have the performance characteristics that I need".

[−] softwaredoug 60d ago

When you write enough tests to verify AI code, you’re just making the tests the code and compiling an executable from tests

https://softwaredoug.com/blog/2026/03/10/the-tests-are-the-c...

[−] boombapoom 60d ago

production ready "fizz buzz" code. lol. I can't even continue typing this response.

[−] artee_49 60d ago

Unintended side-effects are the biggest problems with AI generated code. I can't think of a proper way to solve that.

[−] eggbrain 60d ago

I find people over-rotate on whether we should be reviewing AI-produced code. "What if bad code gets into production!" some programmers gasp, as if they themselves have never pushed bad code, or had coworkers do the same.

I've worked at places where I've trusted everyone on my team to the extent that most PRs got only a quick glance before getting a "LGTM". On the flipside, I've also worked on teams where every person was a different kind of liability with the code that they pushed, and for those teams I implemented every linting / pre-commit / testing tool possible that all needed to pass inspection (including human review) before any code arrived on production.

A year ago, AI was like that latter team I mentioned -- something I had to check, double check, and correct until I was happy with what it produced. Over the past 6 months, it's gotten closer to (but is still fairly far from) the former team I mentioned -- I have to correct it about 10% of the time, whereas for most things it gets it right.

The fact that AI produces a much _larger_ volume of code than the average engineer is perhaps slightly concerning, but I don't see it much differently than code at large companies. Does every Facebook engineer review every junior engineer's pull request to make sure bad code doesn't slip in?

That isn't to say I'm for letting AI go wild with code -- but I think if at worst we consider AI to be a junior engineer we need to rein in with static analysis tools / linters / testers etc, we will probably be able to mitigate a lot of the downside.

[−] teiferer 60d ago

If that worked reliably then you could apply it to human-produced code too. But nothing like that has been shown to work, so I wouldn't put money on it working for LLM output.

[−] Ancalagon 60d ago

Even with mutation testing, doesn’t this still require review of the testing code?

[−] vemv 60d ago

What is a correct, bug-free program?

...It's one that does what a specific set of humans want. There's no other useful definition. One man's feature is another's bug.

It logically follows that there must be a human review step. How else would you know what the human wants, with sufficient detail?

Otherwise, there's an infinite number of undesired programs with passing test suites that AI can generate for you.

[−] Andrei_dev 60d ago

The testing angle keeps coming up but it's sort of missing the point. I spent a few weeks poking through public repos built with AI tools — about 100 projects. 41% had secrets sitting raw in the source. Not in env files. In the code itself. Supabase service_role keys committed to GitHub, .env.example files with actual credentials, API keys hardcoded in client-side fetch calls.

No test catches any of that. Code works, tests pass, database is wide open.

It's not even a correctness problem. It's that the LLM never thought about rate limiting, CORS headers, CSRF tokens, a sane .gitignore — because nobody asked it to. Those are things devs add from muscle memory, from getting burned. The AI has no scars.
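
For what it's worth, the hardcoded-secret case is one of the few here that a mechanical check can flag. Below is a rough Python sketch of the idea. The two regexes and the names `PATTERNS` and `scan` are illustrative assumptions, not an exhaustive or production-grade rule set, and dedicated secret scanners go much further:

```python
# Rough sketch of a line-based secret scan. The patterns are illustrative
# assumptions and will produce both false positives and false negatives.
import re
from typing import List, Tuple

PATTERNS = {
    # Three dot-separated base64url segments starting with "eyJ" (JWT-shaped).
    "jwt-like token": re.compile(
        r"eyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}"
    ),
    # A credential-sounding name assigned a long quoted literal.
    "hardcoded credential": re.compile(
        r"(?i)(api[_-]?key|secret|token|password)\s*[:=]\s*['\"][^'\"]{16,}['\"]"
    ),
}


def scan(source: str) -> List[Tuple[int, str]]:
    """Return (line number, pattern label) for each suspicious line."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings
```

A check like this only covers the committed-secret part. It does nothing for missing rate limiting, CORS, or CSRF handling, which still depend on someone knowing to ask for them.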

[−] morpheos137 60d ago

I think we need to approach provable code.

[−] jerf 60d ago

"However, I'm starting to think that maintainability and readability aren't relevant in this context. We should treat the output like compiled code."

I would like to put my marker out here as vigorously disagreeing with this. I will quote my post [1] again, which given that this is the third time I've referred to a footnote via link rather suggests this should be lifted out of the footnote:

"It has been lost in AI money-grabbing frenzy but a few years ago we were talking a lot about AIs being “legible”, that they could explain their actions in human-comprehensible terms. “Running code we can examine” is the highest grade of legibility any AI system has produced to date. We should not give that away.

"We will, of course. The Number Must Go Up. We aren’t very good at this sort of thinking.

"But we shouldn’t."

Do not let go of human-readable code. Ask me 20 years ago whether we'd get "unreadable code generation" or "readable code generation" out of AIs and I would have guessed they'd generate completely opaque and unreadable code. Good news! I would have been completely wrong! They in fact produce perfectly readable code. It may be perfectly readable "slop" sometimes, but the slop-ness is a separate issue. Even the slop is still perfectly readable. Don't let go of it.

[1]: https://jerf.org/iri/post/2026/what_value_code_in_ai_era/

Toward automated verification of unreviewed AI-generated code (peterlavigne.com)

83 comments