Speed at the cost of quality: Study of use of Cursor AI in open source projects (2025) (arxiv.org)

by wek 81 comments 148 points
Read article View on HN

81 comments

[−] rfw300 61d ago
Super interesting study. One curious thing I've noticed is that coding agents tend to increase the code complexity of a project, but simultaneously massively reduce the cost of that code complexity.

If a module becomes unsustainably complex, I can ask Claude questions about it, have it write tests and scripts that empirically demonstrate the code's behavior, and worse comes to worst, rip out that code entirely and replace it with something better in a fraction of the time it used to take.

That's not to say complexity isn't bad anymore—the paper's findings on diminishing returns on velocity seem well-grounded and plausible. But while the newest (post-Nov. 2025) models often make inadvisable design decisions, they rarely do things that are outright wrong or hallucinated anymore. That makes them much more useful for cleaning up old messes.

[−] joshribakoff 61d ago
Bad code has real world consequences. Its not limited to having to rewrite it. The cost might also include sanctions, lost users, attrition, and other negative consequences you don’t just measure in dev hours
[−] SR2Z 61d ago
Right, but that cost is also incurred by human-written code that happens to have bugs.

In theory experienced humans introduce less bugs. That sounds reasonable and believable, but anyone who's ever been paid to write software knows that finding reliable humans is not an easy task unless you're at a large established company.

[−] MeetingsBrowser 61d ago
The question then becomes, can LLMs generate code close to the same quality as professionals.

In my experience, they are not even close.

[−] SR2Z 60d ago
Well, if you keep in mind that "professionals" means "people paid to write code" then LLMs have been generating code at the same quality OR BETTER for about a year now. Most code sucks.

If you compare it to beautiful code written by true experts, then obviously not, but that kind of code isn't what makes the world go 'round.

[−] mathgeek 61d ago
We should qualify that kind of statement, as it’s valuable to define just what percentile of “professional developers” the quality falls into. It will likely never replace p90 developers for example, but it’s better than somewhere between there and p10. Arbitrary numbers for examples.
[−] MeetingsBrowser 61d ago
Can you quantify the quality of a p90 or p10 developer?

I would frame it differently. There are developers successfully shipping product X. Those developer are, on average, as skilled as necessary to work on project X. else they would have moved on or the project would have failed.

Can LLMs produce the same level of quality as project X developers? The only projects I know of where this is true are toy and hobby projects.

[−] mathgeek 61d ago

> Can you quantify the quality of a p90 or p10 developer?

Of course not, you have switched “quality” in this statement to modify the developer instead of their work. Regarding the work, each project, as you agree with me on from your reply, has an average quality for its code. Some developers bring that down on the whole, others bring it up. An LLM would have a place somewhere on that spectrum.

[−] vannevar 59d ago
In a one-shot scenario, I agree. But LLMs make iteration much faster. So the comparison is not really between an AI and an experienced dev coding by hand, it's between the dev iterating with an LLM and the dev iterating by hand. And the former can produce high-quality code much faster than the latter.

The question is, what happens when you have a middling dev iterating with an LLM? And in that case, the drop in quality is probably non-linear---it can get pretty bad, pretty fast.

[−] verdverm 61d ago
There was a recent study posted here that showed AI introduces regressions at an alarming rate, all but one above 50%, which indicates they spend a lot of time fixing their own mistakes. You've probably seen them doing this kind of thing, making one change that breaks another, going and adjusting that thing, not realizing that's making things worse.
[−] sanxiyn 61d ago
The study is likely "SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration". Regression rate plot is figure 6.

Read the study to understand what it is measuring and how it was measured. As I understand parent's summary is fine, but you want to understand it first before repeating it to others.

https://arxiv.org/abs/2603.03823

[−] verdverm 60d ago
Observation 3
[−] GorbachevyChase 61d ago
Bentley Software is proof that you can ship products with massive, embarrassing defects and never lose a customer. I can’t explain enterprise software procurement, but I can guarantee you product quality is not part of that equation.
[−] MeetingsBrowser 61d ago
This only helps if you notice the code is bad. Especially in overlay complex code, you have to really be paying attention to notice when a subtle invariant is broken, edge case missed, etc.

Its the same reason a junior + senior engineer is about as fast as a senior + 100 junior engineers. The senior's review time becomes the bottleneck and does not scale.

And even with the latest models and tooling, the quality of the code is below what I expect from a junior. But you sure can get it fast.

[−] phillipclapham 61d ago
This is the most important point in the thread. The study measures code complexity but the REAL bottleneck is cognitive load (and drain) on the reviewer.

I've been doing 10-12 hour days paired with Claude for months. The velocity gains are absolutely real, I am shipping things I would have never attempted solo before AI and shipping them faster then ever. BUT the cognitive cost of reviewing AI output is significantly higher than reviewing human code. It's verbose, plausible-looking, and wrong in ways that require sustained deep attention to catch.

The study found "transient velocity increase" followed by "persistent complexity increase." That matches exactly. The speed feels incredible at first, then the review burden compounds and you're spending more time verifying than you saved generating.

The fix isn't "apply traditional methods" — it's recognizing that AI shifts the bottleneck from production to verification, and that verification under sustained cognitive load degrades in ways nobody's measuring yet. I think I've found some fixes to help me personally with this and for me velocity is still high, but only time will tell if this remains true for long.

[−] tabwidth 61d ago
The part that gets me is when it passes lint, passes tests, and the logic is technically correct, but it quietly changed how something gets called. Rename a parameter. Wrap a return value in a Promise that wasn't there before. Add some intermediate type nobody asked for. None of that shows up as a failure anywhere. You only notice three days later when some other piece of code that depended on the old shape breaks in a way that has nothing to do with the original change.
[−] chrisweekly 61d ago

>

The study found "transient velocity increase" followed by "persistent complexity increase."

Companies facing this reality are of course typically going to use AI to help manage the increased complexity. But that leads quickly to AI becoming a crutch, without which even basic maintenance could pose an insurmountable challenge.

[−] galbar 61d ago

>The fix isn't "apply traditional methods"

I would argue they are. Those traditional methods aim at keeping complexity low so that reading code is easier and requires less effort, which accelerates code review.

[−] i_love_retros 61d ago

> have it write tests

Just make sure it hasn't mocked so many things that nothing is actually being tested. Which I've witnessed.

[−] camdenreslink 60d ago
I find LLMs get much more prone to making mistakes or missing references when the size or complexity of the code increases. I have a “vibe coded” application that is just for personal use, and I’ll usually create a fresh prompt after a large refactor and ask “were all references to the previous approach removed, and has the application been fully migrated to using the new approach?”

It finds spots it missed during the refactor basically every time.

So I partially agree with you, but I think it takes multiple passes and at least enough understanding to challenge the LLM and ask pointed questions.

[−] duskdozer 60d ago
What happens if you or future developers become unable to access Claude, the proprietary product of Anthropic?
[−] AlexandrB 61d ago

> Super interesting study. One curious thing I've noticed is that coding agents tend to increase the code complexity of a project, but simultaneously massively reduce the cost of that code complexity.

This is the same pattern I observed with IDEs. Autocomplete and being able to jump to a definition means spaghetti code can be successfully navigated so there's no "natural" barrier to writing spaghetti code.

[−] jwpapi 61d ago
I think thats a fallacy. As of right now there is a point of no return where the complexity cant be broken by the agent itself without breaking more on other things. I’ve seen it before. Agents cheat on tests, break lint and type rules.

I was hoping for it to work, but It didn’t for me.

Still trying to figure out how to balance it.

[−] IsTom 60d ago

> empirically demonstrate the code's behavior

That is completely insufficient for code of any real complexity. All this does is replacing known bugs with unknown bugs.

[−] FuckButtons 61d ago

> but simultaneously massively reduce the cost of that code complexity.

Citation needed. Until proven otherwise complexity is still public enemy #1. Particularly given that system complexity almost always starts causing most of its problems once a project is further along I don’t think we will know anything meaningful about that statement for at least a year.

[−] matt_heimer 61d ago
Yes, it's not surprising that warnings and complexity increased at a higher rate when paired with increased velocity. Increased velocity == increased lines of code.

Does the study normalize velocity between the groups by adjusting the timeframes so that we could tell if complexity and warnings increased at a greater rate per line of code added in the AI group?

I suspect it would, since I've had to simplify AI generated code on several occasions but right now the study just seems to say that the larger a code base grows the more complex it gets which is obvious.

[−] keeda 61d ago
There are actually quite a few studies out there that look at LLM code quality (e.g. https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=LLM+...) and they mostly have similar findings. This reinforces the idea that LLMs still require expert guidance. Note, some of these studies date back to 2023, which is eons ago in terms of LLM progress.

The conclusion of this paper aligns with the emerging understanding that AI is simply an amplifier of your existing quality assurance processes: Higher discipline results in higher velocity, lower discipline results in lower stability (e.g. https://dora.dev/research/2025/) Having strong feedback and validation loops is more critical than ever.

In this paper, for instance, they collected static analysis warnings using a local SonarQube server, which implies that it was not integrated into the projects they looked at. As such these warnings were not available to the agent. It's highly likely if these warnings were fed back into the agent it would fix them automatically.

Another interesting thing they mention in the conclusion: the metrics we use for humans may not apply to agents. My go-to example for this is code duplication (even though this study finds minimal increase in duplication) -- it may actually be better for agents to rewrite chunks of code from scratch rather than use a dependency whose code is not available forcing it to instead rely on natural language documentation, which may or may not be sufficient or even accurate. What is tech debt for humans may actually be a boon for agents.

[−] mentalgear 61d ago

> We find that the adoption of Cursor leads to a statistically significant, large, but transient increase in project-level development velocity, along with a substantial and persistent increase in static analysis warnings and code complexity. Further panel generalized-method-of-moments estimation reveals that increases in static analysis warnings and code complexity are major factors driving long-term velocity slowdown. Our study identifies quality assurance as a major bottleneck for early Cursor adopters and calls for it to be a first-class citizen in the design of agentic AI coding tools and AI-driven workflows.

So overall seems like the pros and cons of "AI vibe coding" just cancel themselves out.

[−] AstroBen 61d ago
They're measuring development speed through lines of code. To show that's true they'd need to first show that AI and humans use the same number of lines to solve the same problem. That hasn't been my experience at all. AI is incredibly verbose.

Then there's the question of if LoC is a reliable proxy for velocity at all? The common belief amongst developers is that it's not.

[−] bisonbear 61d ago
Really interesting study. One thing I keep coming back to is that tests have no way of catching this sort of tech debt. The agent can introduce something that will make you rip your hair out in 6 months, but tests are green...

My theory is that at least some of this is solvable with prompting / orchestration - the question is how to measure and improve that metric. i.e. how do we know which of Claude/Codex/Cursor/Whoever is going to produce the best, most maintainable code *in our codebase*? And how do we measure how that changes over time, with model/harness updates?

[−] dalemhurley 61d ago
I think the issue is people AI assisted code, test then commit.

Traditional software dev would be build, test, refactor, commit.

Even the Clean Coder recommends starting with messy code then tidying it up.

We just need to apply traditional methods to AI assisted coding.

[−] woeirua 61d ago
This study's cutoff date was August 2025. I don't think this result is surprising given the level of coding agent ability back then. The whole thing just shows how out-of-date academic publishing is on this subject.

>This yields 806 repositories with adoption dates between January 2024 and March 2025 that are still available on GitHub at the time of data analysis (August 2025).

There were very few people who thought that coding agents worked very well back then. I was not one of them, but I _do_ think they work today.

[−] faheembm 60d ago
This matches what I've seen building with AI assistance, velocity goes up fast, but you start accumulating complexity debt you didn't consciously design. The difference is intentionality. When the architecture and systems decisions are yours and you're using AI to execute, the complexity stays manageable. When AI drives both the architecture AND the code, that's when this paper's findings kick in hard.
[−] Slav_fixflex 60d ago
Interesting findings. I use AI agents (Claude, Windsurf) exclusively to build production software without being a developer myself. Speed is real but so is context drift – the AI breaks unrelated things while fixing others. Git became essential for me because of this.
[−] felix9527 60d ago
The study only looks at what lands in the PR. In my experience a single prompt can trigger 20+ tool calls, most of them reads and greps. The final diff is a tiny fraction of what actually happened. Hard to judge quality without seeing the process.
[−] mellosouls 61d ago
Depends on the nature of the tool I would imagine - eg. Claude Code Terminal (say) would have higher entry requirements in terms of engineering experience (Cursor was sold as newbie-friendly) so I would predict higher quality code than Cursor in a similar survey.

ofc that doesn't take into account the useful high-level and other advantages of IDEs that might mitigate against slop during review, but overall Cursor was a more natural fit for vibe-coders.

This is said without judgement - I was a cheerleader for Cursor early on until it became uncompetitive in value.

[−] chris_money202 61d ago
Now someone do a research study where a summary of this research paper is in the AGENTS.md and let’s see if the overall outcomes are better
[−] qcautomation 61d ago
[flagged]
[−] entrustai 60d ago
[dead]
[−] ryguz 61d ago
[flagged]
[−] scka-de 61d ago
[dead]
[−] levelsofself 60d ago
[dead]
[−] productinventor 61d ago
[flagged]
[−] simumw 61d ago
[flagged]
[−] PeterStuer 61d ago
Interesting from an historical perspective. But data from 4/2025? Might as well have been last century.
[−] duendefm 61d ago
AI is not perfect sure, one has to know how to use it. But this study is already flawed since models improved a lot since the beginning of 2026.