The peril of laziness lost (bcantrill.dtrace.org)

by gpm 143 comments 480 points

[−] btrettel 33d ago
Similar to bragging about LOC, I have noticed in my own field of computational fluid dynamics that some vibe coders brag about how large or rigorous their test suites are. The problem is that whenever I look more closely, the tests are unremarkable and less rigorous than my own manually created tests. There are often big gaps in vibe-coded tests. I don't care if you have 1 million tests. 1 million easy tests, or 1 million tests that don't cover the right parts of the code, aren't worth much.
[−] CJefferson 32d ago
Yes, I've found tests are the one thing I need to write myself. I then also need to keep 'git diff'ing the tests to make sure Claude doesn't decide to 'fix' the tests when its code doesn't work.

When I am rigorous about the tests, Claude has done an amazing job implementing some tricky algorithms from some difficult academic papers, saving me time overall, but it does require more babysitting than I would like.

[−] Tuna-Fish 32d ago
Give Claude a separate user and make the tests unwritable for it. Generally you should limit Claude to write access to only the specific things it needs to edit; this will save you tokens because it will fail faster when it goes off the rails.
[−] LelouBil 32d ago
You don't even need a separate user if you're on Linux (or WSL): just use the sandbox feature, which lets you specify allowed directories for read and/or write.

The sandbox is powered by bubblewrap (used by Flatpaks) so I trust it.

[−] eru 32d ago
You might want to look into property-based testing, e.g. python-hypothesis if you use Python. It's great, and it even finds minimal counterexamples.
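The core idea is small enough to hand-roll. A rough stdlib-only sketch (Hypothesis's real API is the @given decorator plus strategies, and it also shrinks failures to minimal counterexamples, which this toy does not):

```python
import random

random.seed(0)  # deterministic for the demo

def check_property(prop, gen, trials=200):
    """Return a counterexample to prop, or None if it held on all trials.
    Hypothesis does this with much smarter input generation, plus
    automatic shrinking of any failure down to a minimal counterexample."""
    for _ in range(trials):
        xs = gen()
        if not prop(xs):
            return xs
    return None

# Random lists of small integers as test inputs.
gen = lambda: [random.randint(-50, 50) for _ in range(random.randint(0, 20))]

# Property: sorting never adds or removes elements. Holds for sorted().
assert check_property(lambda xs: len(sorted(xs)) == len(xs), gen) is None

# A deliberately buggy "sort" that drops duplicates is caught quickly.
bad_sort = lambda xs: sorted(set(xs))
counterexample = check_property(lambda xs: len(bad_sort(xs)) == len(xs), gen)
```

Here `counterexample` comes back as a list containing a duplicate, i.e. exactly the kind of input a hand-picked example-based test can easily miss.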
[−] senko 32d ago
“Red/green TDD” (i.e. actual TDD) and mutation testing (which LLMs can help with) are good ways to keep those tests under control.

Not gonna help with the test code quality, but at least the tests are going to be relevant.
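The mutation-testing idea fits in a few lines. A toy sketch only: real tools like mutmut generate mutants from the AST, not via string replacement as here:

```python
def run_tests(clamp):
    """A tiny suite for clamp(x, lo, hi); True if every test passes."""
    try:
        assert clamp(5, 0, 10) == 5     # in range
        assert clamp(-3, 0, 10) == 0    # below lower bound
        assert clamp(99, 0, 10) == 10   # above upper bound
        return True
    except AssertionError:
        return False

original = "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))"

# Each mutant makes one small change, as mutation tools do automatically.
mutants = [
    original.replace("max", "min", 1),             # wrong outer function
    original.replace("min(x, hi)", "max(x, hi)"),  # wrong inner function
    original.replace("lo, min", "hi, min"),        # wrong bound
]

survivors = []
for src in mutants:
    ns = {}
    exec(src, ns)
    if run_tests(ns["clamp"]):   # tests still pass => mutant survived
        survivors.append(src)

# A good suite kills every mutant; each survivor marks untested behavior.
```

With the boundary assertions above, `survivors` ends up empty. A mutant that survives all tests is pointing at behavior no test actually pins down, which is the relevance signal the LLM-written tests are missing.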

[−] eru 32d ago
If you start with the failing tests, you can give them, plus the spec, to another agent (human or silicon) for review.

It's a bit like pre-registering your study in medicine.

[−] colechristensen 33d ago
It's a struggle to get LLMs to generate tests that aren't entirely stupid.

Like grepping source code for a string, or assert(1==1, true).

You have to have a curated list of every kind of test not to write or you get hundreds of pointless-at-best tests.

[−] btrettel 32d ago
What I've observed in computational fluid dynamics is that LLMs seem to grab common validation cases used often in the literature, regardless of the relevance to the problem at hand. "Lid-driven cavity" cases were used by the two vibe coded simulators I commented on at r/cfd, for instance. I never liked the lid-driven cavity problem because it rarely ever resembles an actual use case. A way better validation case would be an experiment on the same type of problem the user intends to solve. I think the lid-driven cavity problem is often picked in the literature because the geometry is easy to set up, not because it's relevant or particularly challenging. I don't know if this problem is due to vibe coders not actually having a particular use case in mind or LLMs overemphasizing what's common.

LLMs seem to also avoid checking the math of the simulator. In CFD, this is called verification. The comparisons are almost exclusively against experiments (validation), but it's possible for a model to be implemented incorrectly and for calibration of the model to hide that fact. It's common to check the order-of-accuracy of the numerical scheme to test whether it was implemented correctly, but I haven't seen any vibe coders do that. (LLMs definitely know about that procedure as I've asked multiple LLMs about it before. It's not an obscure procedure.)
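For readers outside CFD, the order-of-accuracy check is easy to illustrate on a toy problem. A sketch, nothing like a real solver verification, using a central-difference derivative whose theoretical order is 2:

```python
import math

def central_diff(f, x, h):
    # Central difference: error ~ (h**2 / 6) * f'''(x), i.e. 2nd order.
    return (f(x + h) - f(x - h)) / (2 * h)

f, exact, x0 = math.sin, math.cos(1.0), 1.0   # derivative known exactly

hs = [0.1, 0.05, 0.025]
errors = [abs(central_diff(f, x0, h) - exact) for h in hs]

# Observed order between refinements: p = log(e1/e2) / log(h1/h2).
orders = [math.log(errors[i] / errors[i + 1]) / math.log(hs[i] / hs[i + 1])
          for i in range(len(hs) - 1)]

# For a correct implementation, each observed order sits near 2.0; a
# coding mistake typically drops it toward 1 or 0 even when the results
# still "look reasonable" against a calibrated validation case.
```

The same refinement study works on a full simulator with a manufactured solution; it is the observed-vs-theoretical order comparison, not the plot against experiment, that catches an incorrectly implemented scheme.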

[−] colechristensen 32d ago
Both of these points seem like things it would be easy to instruct an LLM with, to shape its testing strategy.
[−] btrettel 32d ago
I think so too. In case that was unclear: I don't use LLMs for coding at the moment and was just commenting on what I've seen from others who do in computational fluid dynamics.

Edit: Let me add that while I think it would be easy to instruct an LLM to do what I'd like, LLMs don't do these things by default despite them being recognized best practices, and I'm not confident in LLMs getting the data or references right for validation tests. My own experience is that LLMs are pretty bad at reproducing citations, and they tend to miss a lot of the literature.

[−] theshrike79 32d ago

> You have to have a curated list of every kind of test not to write

This should be distilled into a tool. Some kind of AST based code analyser/linter that fails if it sees stupid test structures.

Just having it in plain English in a HOW-TO-TEST.md file is hit and miss.
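A rough sketch of what such a tool could look like, using Python's stdlib ast module (the two rules here are just illustrative; a real linter would want many more):

```python
import ast

def is_vacuous(expr):
    # An assert condition containing no names, attribute accesses, or
    # calls can only compare literals (e.g. `assert 1 == 1`).
    return not any(isinstance(n, (ast.Name, ast.Attribute, ast.Call))
                   for n in ast.walk(expr))

def lint_tests(source):
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_"):
            asserts = [n for n in ast.walk(node) if isinstance(n, ast.Assert)]
            if not asserts:
                findings.append(f"{node.name}: no assertions")
            findings += [f"{node.name}: constant-only assert"
                         for a in asserts if is_vacuous(a.test)]
    return findings

suite = '''
def test_math():
    assert 1 == 1          # vacuous

def test_nothing():
    print("ran fine")      # no assertion at all

def test_real():
    assert abs(2 ** 0.5 - 1.41421356) < 1e-6
'''
findings = lint_tests(suite)
# Only the first two functions get flagged; test_real passes the check.
```

Run as a CI step, failing the build on any finding, this is deterministic in a way a prose instruction file never will be.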

[−] gpm 32d ago

> have a curated list of every kind of test not to write

I've seen a lot of people interact with LLMs like this and I'm skeptical.

It's not how you'd "teach" a human (effectively). Teaching (humans) with positive examples is generally much more effective than with negative examples. You'd show them examples of good tests to write, discuss the properties you want, etc...

I try to interact with LLMs the same way. I certainly wouldn't say I've solved "how to interact with LLMs" but it seems to at least mostly work - though I haven't done any (pseudo-)scientific comparison testing or anything.

I'm curious if anyone else has opinions on what the best approach is here? Especially if backed up by actual data.

[−] jerf 32d ago
It's going to be difficult for anyone to have any more "data" than you already do. It's early days for all of us. It's not like there's anyone with 20 years of 2026 AI coding assistant experience.

However we can say based on the architecture of the LLMs and how they work that if you want them to not do something, you really don't want to mention the thing you don't want them to do at all. Eventually the negation gets smeared away and the thing you don't want them to do becomes something they consider. You want to stay as positive as possible and flood them with what you do want them to do, so they're too busy doing that to even consider what you didn't want them to do. You just plain don't want the thing you don't want in their vector space at all, not even with adjectives hanging on them.

[−] TeMPOraL 32d ago
I don't have much data to go on (in accordance with what 'jerf wrote), however I offer a high-level, abstract perspective.

The ideal set of outcomes exists as a tiny subspace of a high-dimensional space of possible solutions. Almost all of those solutions are bad. Giving negative examples removes some specific bits of the possibility space from consideration[0] - not very useful, since almost everything that remains is bad too. Giving positive examples narrows the search down to where the good solutions are likely to be - drastically more effective.

A more humane intuition[1], something I've observed as a parent and also through introspection: when I tell my kid to do something and they don't understand WTF it is that I want, they'll do something weird and entirely undesirable. If I tell them, "don't do that - and also don't do [some other thing they haven't even thought of yet]", it's not going to improve the outcome; even repeated attempts at correction don't seem effective. In contrast, if I tell (or better, show) them what to do, they usually get the idea quickly, and whatever random experiments/play they invent are more likely to still be helpful.

--

[0] - While paradoxically also highlighting them - it's the "don't think of a pink elephant" phenomenon.

[1] - Yes, I love anthropomorphizing LLMs, because it works.

[−] colechristensen 32d ago
It's not a person. Unlike a person, it has a tremendous "memory" of everything ever done that its creators could get access to.

If I tell it what to do, I bias it towards doing those things and limit its ability to think of things I didn't think of myself, which is what I want in testing. Sure, a separate pass where I prescribe types and specific tests is effective. But I also want it to think of things I didn't; a prompt like "write excellent tests that don't break these rules..." is how you get that.

[−] michaelbuckbee 32d ago
Two things:

1. Tests have always been about both the function of the application and communicating what should be occurring, whether to the larger team or to yourself six months down the road.

With automated software development the communication with the LLM itself is a much larger part of it so I feel like it's "ok" to have lots of easy tests that are less about rigor and more about "yes this is how this should work"

2. Ideally we're going to get to the point where the tooling allows for adversarial agents with one writing code and one writing tests. Even for now just popping open a separate terminal window and generating+running tests in it from your main coding terminal is helpful.

[−] WalterBright 32d ago
The trick is crafting the minimal number of tests.
[−] bodegajed 32d ago
It is like reward hacking, where the reward function (in this case the tests) is exploited to achieve the goal. The model wants to declare victory and be rewarded, so the tests it writes are not critical of the code under test. This is probably in the RL training data; I am, of course, merely speculating.
[−] suzzer99 33d ago

> Generally, though, most of us need to think about using more abstraction rather than less.

Maybe this was true when Programming Perl was written, but I see the opposite much more often now. I'm a big fan of WET - Write Everything Twice (stolen from comments here), then the third time think about maybe creating a new abstraction.

[−] badlucklottery 33d ago

>WET - Write Everything Twice

I've always heard this as the "Rule of three": https://en.wikipedia.org/wiki/Rule_of_three_(computer_progra...

[−] hackable_sand 32d ago
Antinomy with DRY
[−] jpfr 32d ago
Probably the second rewrite is really tight with good abstractions and little repetition.

So no, the end result can still be DRY.

[−] marcus_holmes 32d ago
"Duplication is far cheaper than the wrong abstraction"

Sandi Metz https://sandimetz.com/blog/2016/1/20/the-wrong-abstraction

[−] suzzer99 32d ago
And adding an abstraction later is much easier than removing an unneeded one, which can be very hard or even impossible depending on the complexity of the app.
[−] dasil003 33d ago
Totally agree with this; the beauty of software is that the right abstractions have untold impact, spanning many orders of magnitude. I'm talking about the major innovations: things like operating systems, RDBMSes, cloud orchestration. But the majority of code in the world is not like that; it's just simple business logic that represents ideas and processes run by humans for human purposes, which resist abstraction.

That doesn't stop people from trying, though; platform creation is rife within big tech companies as a technical form of empire building and career-driven development. My rule of thumb in tech reviews is that you can't have a platform until you have three proven use cases and have shown that coupling them together is not a net negative, given the autonomy constraint a shared system imposes.

[−] genxy 32d ago
That is where I put systems programmers: they need to extract an abstract algebra out of the domain. If they can accomplish this, the complexity of the problem largely evaporates.

Use the wrong abstraction and you are constantly fighting the same exact bug(s) in the system. Good design makes entire classes of bugs impossible to represent.

I don't believe the trope that you need to make a bunch of bad designs before you can do good. Those lessons are definitely valuable, but not a requirement.

A great example is the evolution from a layered storage stack to a unified one like ZFS. Or compilers, from multipass beasts to interactive query-based compilers and dynamic JITs.

The design and properties of the system were always the problem I loved solving; sometimes the low-level coding puzzles are fun too. Much of programming is a slog, though, and the flow state has been harder and harder to achieve. Super-deep bug hunting can be satisfying, if you manage to find and fix the bug, and it's where you learn an incredible amount. Fixing shallow cross-module bugs is hell.

Don't you have to be really seasoned to attempt, in good faith, to couple two systems and say whether that would be productive? You can't prove this negative. I would imagine a place like that would have to have a very strong culture of building towards the stated goals, keeping politics and personalities out of it as much as possible.

[−] layer8 33d ago
More than twice is a rather low bar, I don’t think that it conflicts with the quote from Programming Perl.
[−] suzzer99 32d ago
I don't think it's a hard rule, more of an ethos. If you know there are going to be a bunch of something, write the abstraction out of the gate. If you have three code entities with a lot of similar properties, but the app is new and you feel like there's a good chance they might diverge in the future, then leave them separate.
[−] raincole 32d ago
I agree. It's crazy how many layers of abstraction have been created since 1991 (when Programming Perl was published.)
[−] nixpulvis 33d ago
I've been advocating for writing everything twice since college.
[−] jimbokun 32d ago
That will still result in more abstraction than the average programmer.
[−] HarHarVeryFunny 32d ago
Writing twice makes sense if time permits, or the opportunity presents itself. The first time may be somewhat exploratory (maybe a throw-away prototype); the second time you better understand the problem and can do a better job.

A third time, with a new abstraction, is where you need to be careful. Fred Brooks ("Mythical Man Month") refers to it as the "second-system effect" where the confidence of having done something once (for real, not just prototype) may lead to an over-engineered and unnecessarily complex "version 2" as you are tempted to "make it better" by adding layers of abstractions and bells and whistles.

[−] wcarss 32d ago
I agree with what you're saying about writing something twice or even three times to really understand it but I think you might have misunderstood the WET idea: as I understand it, it's meant in opposition to DRY, in the sense of "allow a second copy of the same code", and then when you need a third copy, start to consider introducing an abstraction, rather than religiously avoiding repeated code.
[−] HarHarVeryFunny 32d ago
Personally, even for a prototype, I'd be using functions immediately as soon as I saw (or anticipated) I needed to do same thing twice - mainly so that if I want to change it later there is one place to change, not many. It's the same for production code of course, but when prototyping the code structure may be quite fluid and you want to keep making changes easy, not have to remember to update multiple copies of the same code.

I'm really talking about manually writing code, but the same would apply for AI written code. Having a single place to update when something needs changing is always going to be less error prone.

The major concession I make to modularity when developing a prototype is often to put everything into a single source file to make it fast to iteratively refactor etc rather than split it up into modules.

[−] suzzer99 32d ago

> mainly so that if I want to change it later there is one place to change, not many

But what happens when new requirements come in for just one of the things? If you left them separate, it's an easy change of a few lines. If you created an abstraction, now you either have to add a bunch of if statements, or spend time undoing the entire abstraction that you spent X amount of time creating.

If a bunch of other code has built up around that abstraction, undoing it can become a serious chore. I've worked on apps that had way too many premature abstractions, and we just had to live with it because it would be too risky and onerous to try to undo them.

In my experience, it's generally an order of magnitude easier to add an abstraction to a mature app when you get tired of making changes in multiple places, than to remove one when the app evolves and you realize these things aren't actually that similar. Also when you wait to abstract, you might see a better way to do it, or how to reduce the scope so that you're using composition to share a bunch of smaller pieces vs. sharing the entire page/object/interface/endpoint/etc.

Obviously, this isn't a blanket rule. There's an aspect of soothsaying to guess which things might diverge and which are likely to spawn a lot more similar copies.

[−] HarHarVeryFunny 32d ago

> But what happens when new requirements come in for just one of the things?

I guess it could happen, but that depends on your mental model when coding - if you're just pattern matching similar chunks of code (which are not being used in a semantically identical way) then all bets are off, although that seems a very alien concept of how someone might code.

OTOH, if you have a higher level mental model of what you are doing then it's not a matter of "this looks like common code" but rather "i need to do the exact same operation" (same inputs/outputs/semantics) here. Maybe I'm expressing it poorly, but I can't recall ever having to fork a function because requirements at two call sites just diverged.

[−] skydhash 32d ago
The danger with people who claim to follow DRY is that they don't first check that they are actually repeating themselves. As soon as they encounter similarity, they assume equality and rush to abstract it. But if one knows the domain well enough to know that some logic is the same, not just similar, then there's no need to write it twice first.
[−] reenorap 32d ago
As someone who has switched to coding exclusively with AI after 30 years of coding by myself, I find it really weird when people take credit for the lines of code and features that AI generates. Flexing that one "coded" tens or hundreds of thousands of lines per day is a bit cringe, seeing as it's really just the prompt that one typed.
[−] stratts 32d ago
It's a spectrum, isn't it? From targeted edits that you approve manually - which I think you can reasonably take credit for - all the way to full blown vibe-coded apps where you're hardly involved in the design process at all.

And then there's this awkward bit in the middle where you're not necessarily reviewing all the code the AI generates, but you're the one driving the architecture, coming up with feature ideas, pushing for refactors from reading the code, etc. This is where I'm at currently and it's tricky, because while I'd never say that I "wrote" the code, I feel I can claim credit for the app as a whole because I was so heavily involved in the process. The end result I feel is similar to what I would've produced by hand, it just happened a lot faster.

(granted, the end result is only 2000 LoC after a few weeks working on and off)

[−] Centigonal 32d ago
I think LOC and "writing code" are largely irrelevant as metrics of productivity in a world with LLMs that love to churn out overly loquacious code.

I think the right way to explain the work done sounds something like, "I worked with Claude to create an app that does ______. I know it works because ______."

[−] HarHarVeryFunny 32d ago
Meta apparently now has a "leaderboard" for who is using the most AI - consuming the most tokens. Must make Anthropic happy, since Meta is using Claude, and accounts for some significant percentage (10%? 20%?) of their total volume.
[−] WatchDog 32d ago
Token usage is a different and more sympathetic heuristic than LOC produced.

The metric by itself tells you nothing about what value those tokens produced, but to some extent it represents the amount of thinking you are able to offload to the computer.

Wide breadth problems seem to scale well with usage, like scanning millions of LOC of code for vulnerabilities, such as the recent claude mythos results.

[−] HarHarVeryFunny 32d ago
The trouble with rewarding token usage is the same as rewarding LOC written/generated - if that's what you are asking for then that is what you will get. Asking the AI to "scan the entire codebase for vulnerabilities" would certainly be a good way to climb the leaderboard!
[−] WatchDog 32d ago
Absolutely, no one should be rewarded for either tokens used or LOC generated. I just think in the absence of any incentives, token usage is a better heuristic as to value produced than LOC generated.
[−] ignoramous 32d ago
Some argue LoC is irrelevant as a quality/complexity metric, as (in this new software product development lifecycle) implementation + testing + maintenance is wholly overseen by agents.

It has never before been possible to code & deploy software from nothing but specs. The software Garry is building consists of products he couldn't otherwise have built. LoC, in that context, serves as a reminder of the capability of the agents to power/slog through reqs/specs (quite incredibly so).

Besides, critical human review can always be fed back as instructions to agents.

[−] marcus_holmes 32d ago
Yes!

I don't mind it so much when it's a newbie or non-techie who has never actually written code before, because bless their hearts, they did it! They got some code working!

But if you've been developing for decades, you know that counting lines of code means nothing, less than nothing. That you could probably achieve the same result in half the lines if you thought about it a bit longer.

And to claim this as an achievement when it's LLM-generated... that's not a boast. That doesn't mean what you think it means.

But I guess we hit the same old problem that we've always had: how do you measure productivity in software development? If you wanted to boast about how an LLM is making you 100x more productive, what metric could you use? LOC is the most easily measurable (and really, really terrible) measure, the one PMs have been using since we started doing this, because everything else is hard.

[−] bdangubic 32d ago
Here's one thing that somewhat worked for my team: when we first started using LLMs, we decided to run the same process as if they did not exist - same sprint planning meetings, same estimation. We did this for 6 months and saw roughly a 55% increase in output compared to pre-LLM usage. There are biases in what we tried to achieve; it is not easy to estimate that something will take XX hours when you know you won't have to write some portion of it (for example, documentation or parts of the test coverage), but we did our best. After we convinced ourselves of the productivity gains, we stopped doing this.
[−] eucyclos 32d ago
I forget who said it, but I heard the idea floated that if your work can be measured in terms of productivity at all, it can and probably should be done by software. Not sure how that applies here since as you point out, a 10x programmer probably doesn't produce 10x the code.
[−] ofjcihen 32d ago
If anything couldn’t huge amounts of code changes or LoC be a sign of a poor outcome?
[−] njarboe 33d ago
German General Kurt von Hammerstein-Equord (a high-ranking army officer in the Reichswehr/Wehrmacht era):

“I divide my officers into four groups. There are clever, diligent, stupid, and lazy officers. Usually two characteristics are combined.

Some are clever and diligent — their place is the General Staff.

The next lot are stupid and lazy — they make up 90% of every army and are suited to routine duties.

Anyone who is both clever and lazy is qualified for the highest leadership posts, because he possesses the intellectual clarity and the composure necessary for difficult decisions.

One must beware of anyone who is both stupid and diligent — he must not be entrusted with any responsibility because he will always cause only mischief.”

[−] xhrpost 33d ago
I've had this exact sentiment in the past couple of months after seeing a few PRs that were definitely the wrong solution to a problem. One implemented its own parsing functions for a format where well-established solutions like JSON libraries likely existed. I think any non-LLM programmer could have thought this up but would then immediately decide to look elsewhere; their human emotions would kick in and say, "that's way too much (likely redundant) work, there must be a better way". But the LLM has no emotion. It isn't lazy, and that can be a problem, because it makes it a lot easier to do the wrong thing.
[−] juanre 32d ago
Should we be talking about LLMs' taste and proclivities? Because these can also be prompted. You can put your Claude or Codex in the mind of someone who remembers Larry Wall and his three virtues, and it will do a fantastic job at uncovering the lacking abstractions and poor quality _in someone else's code_.

The jury is still out in my mind. Can I use these tools to create software that does not suck? Will the speed at which code can be created and modified lead to a change in our ideas of what good code looks like?

Last week I had a good idea for a change in architecture in my software that will make it much more powerful. I set a team of 12 agents on it, mostly unsupervised, with a pretty weak org structure. After a day and a half, and way too many tokens spent, they managed to build the entirely wrong thing. All tests passed.

The next few days have been spent with a much simpler structure: two teams, each of two agents, one coding (Codex is better at it these days) and one reviewing and keeping things aligned with the docs (Claude). This may have worked, I am still not sure.

My best guess right now of what good software development will look like with these tools: the effort/tokens spent on reviewing needs to be commensurate with the effort spent on coding.

[−] johnfn 33d ago
As dumb as it is to loudly proclaim you wrote 200k loc last week with an LLM, I don’t think it’s much better to look at the code someone else wrote with an LLM and go “hah! Look at how stupid it is!” You’re making exactly the same error as the other guy, just in the opposite direction: you’re judging the profession of software engineering based on code output rather than value generation.

Now, did Garry Tan actually produce anything of value that week? I dunno, you’ll have to ask him.

[−] arthurjj 33d ago
LLMs not being lazy enough definitely feels true. But it's unclear to me whether it's a permanent issue, one that will be fixed in the next model upgrade, or just one your agent framework/CICD pipeline takes care of.

e.g. Right now when using agents after I'm "done" with the feature and I commit I usually prompt "Check for any bugs or refactorings we should do" I could see a CICD step that says "Look at the last N commits and check if the code in them could be simplified or refactored to have a better abstraction"

[−] red_admiral 32d ago

> 37K LoC per day across 5 projects

I remember the days when we talked about mythical man-months and why LoC was not a good metric to measure programmer output. And then Ken Thompson said [1]

> One of my most productive days was throwing away 1000 lines of code.

Or the famous -2K LoC story:

> Bill Atkinson, the author of Quickdraw [...] had completely rewritten the region engine using a simpler, more general algorithm which, after some tweaking, made region operations almost six times faster. As a by-product, the rewrite also saved around 2,000 lines of code [...] it was time to fill out the management form for the first time. When he got to the lines of code part, he thought about it for a second, and then wrote in the number: -2000. [2]

[1] disputed: https://skeptics.stackexchange.com/questions/43800/did-the-c...

[2] https://www.folklore.org/Negative_2000_Lines_Of_Code.html

[−] singron 33d ago
I have noticed LLMs have a propensity to create full single page web applications instead of simpler programs that just print results to the terminal.

I've also struggled with getting LLMs to keep spec.md files succinct. They seem incapable of simplifying documents while doing another task (e.g. "update this doc with xyz and simplify the surrounding content") and really need to be specifically tasked with simplifying/summarizing. If you want something human-readable, you probably just need to write it yourself. Editing LLM output is painful, and writing and understanding something yourself also helps keep you in the loop.

[−] jimbokun 32d ago
Time to teach the LLMs and the vibe coders one of the timeless lessons of software development:

https://www.folklore.org/Negative_2000_Lines_Of_Code.html

[−] pityJuke 33d ago
Man, I cannot imagine how nice it must to be to work with leadership like this, who just gets it.
[−] abcde666777 32d ago
Being a somewhat lazy individual myself, I'm wary of this statement. It feels too... comforting. "It's okay that I wasn't productive today, because laziness has merits".

I consider my laziness a part of who I am, and I don't demonize it, but I also don't consider it my ally - to get the things I care about done I often have to actively push against it.