$500 GPU outperforms Claude Sonnet on coding benchmarks (github.com)

by yogthos 284 comments 489 points


[−] bloppe 50d ago
Generating big chunks of code is rarely what I want from an agent. They really shine for stuff like combing through logs or scanning dozens of source files to explain a test failure. Which benchmark covers that? I want the debugging benchmark that tests mastery of build systems, CLIs, etc.
[−] bartread 50d ago
I agree. Also good for small changes that need to be applied consistently across an entire codebase.

I recently refactored our whole app from hard deletes to soft deletes. There are obviously various ways to skin this particular cat, but the way I chose needed all our deletions updated, and also needed our queries updated to exclude soft-deleted rows, except in specific circumstances (e.g., admins restoring accidentally deleted data).

Of course, this is not hard to do manually, but it is a bloody chore and tends to be error prone. The agent made short work of it, for which I was very grateful.
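The pattern described above can be sketched in a few lines. This is a minimal, hypothetical schema (not the commenter's actual code), using sqlite3 to show the two halves of the refactor: deletes become updates, and every ordinary query grows a filter.

```python
import sqlite3

# Hypothetical soft-delete sketch: a `deleted_at` stamp replaces hard deletes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice'), ('bob')")

# A hard DELETE becomes an UPDATE that stamps deleted_at...
conn.execute("UPDATE users SET deleted_at = datetime('now') WHERE name = 'bob'")

# ...and every normal query must now exclude soft-deleted rows,
active = conn.execute("SELECT name FROM users WHERE deleted_at IS NULL").fetchall()

# while an admin "restore" path queries without the filter and clears the stamp.
conn.execute("UPDATE users SET deleted_at = NULL WHERE name = 'bob'")
```

The chore the comment mentions is exactly that `WHERE deleted_at IS NULL` clause: it has to be added to every existing query, consistently, which is where an agent doing mechanical repo-wide edits helps.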

[−] CraigJPerry 50d ago
Do you not end up breaking half the value of referential integrity doing it that way? E.g., you had to update all the queries, but now you have a sharp edge in that all future queries need to remember to be soft-delete aware. Not a blocker for sure, just a sharp edge.

You know your system better than me for sure, a random commenter on a website :-D Your comment just shocked me out of my daze enough for my brain to say "but I always move the record to another table rather than soft delete", and I felt compelled to give an unsolicited and likely wrong opinion.
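For contrast, the "move to another table" alternative this commenter prefers looks roughly like the sketch below (again a hypothetical schema): deletes stay real deletes, so live queries need no filter, at the cost of an extra archive table and a two-step transactional move.

```python
import sqlite3

# Hypothetical archive-table sketch: deleted rows are moved, not flagged.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE deleted_users (id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT);
INSERT INTO users (name) VALUES ('alice'), ('bob');
""")

with conn:  # copy + delete in a single transaction
    conn.execute(
        "INSERT INTO deleted_users (id, name, deleted_at) "
        "SELECT id, name, datetime('now') FROM users WHERE name = 'bob'"
    )
    conn.execute("DELETE FROM users WHERE name = 'bob'")

# Live queries need no soft-delete filter: the table only holds live rows.
live = conn.execute("SELECT name FROM users").fetchall()
```

The trade-off is the one raised above: the soft-delete approach leaves a sharp edge on every future query, while the archive approach keeps live queries clean but moves the complexity into the delete/restore paths and foreign-key handling.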

[−] dakolli 49d ago
Must be something incredibly simple that you're making out to be more complicated than it actually is; I've never seen an LLM do these things well.
[−] sigmoid10 50d ago
Probably want to look at SWE-bench Pro or Terminal-Bench 2. They cover these longer-horizon tasks that need more than just writing a bit of code in one file. And SWE-bench Pro in particular is not yet saturated like many other common benchmarks. Normal SWE-bench and LCB are not really useful anymore because they are already being gamed hard so the developers can quote high numbers in a repo README or press release.
[−] jakozaur 50d ago
Build systems are tested by CompileBench (Quesma's benchmark).

Disclaimer: I'm the founder.

[−] slashdev 50d ago
Generating big chunks of code is all I do, all day.

I don't write code by hand any more, neither at work, nor for side projects.

I work mostly in Rust and TypeScript at a developer tools company.

[−] Bombthecat 50d ago
Oh yes! I let my environments now be built by agents via kubectl / helm and let them debug issues.

It's amazing! Saves hours of work!

I create the basic helm configs, settings, etc., and when there is a conflict or something not working I let an agent fix it!

[−] seunosewa 50d ago
Create it!
[−] d0963319287 50d ago
[flagged]
[−] philbitt 49d ago
[dead]
[−] mmaunder 50d ago
I'd encourage devs to use MiniMax, Kimi, etc. for real-world tasks that require intelligence. The downsides emerge pretty fast: much higher reasoning-token use, slower outputs, and degradation that is palpable. Sadly, you do get what you pay for right now. However, that doesn't prevent you from saving tons through smart model routing, being smart about reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens.
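The "smart model routing" idea mentioned here can be sketched as a difficulty-based dispatch table. Everything below is made up for illustration: the model names, prices, thresholds, and the toy difficulty heuristic are all assumptions, not any provider's actual API.

```python
# Hypothetical routing sketch: cheap model for easy prompts, escalate hard ones.
ROUTES = [
    # (max_difficulty, model, reasoning_budget_tokens, max_output_tokens)
    (0.3, "cheap-small-model", 0, 1024),
    (0.7, "mid-tier-model", 2048, 2048),
    (1.0, "frontier-model", 8192, 4096),
]

def estimate_difficulty(prompt: str) -> float:
    """Toy heuristic: longer prompts with hard-task keywords score higher."""
    score = min(len(prompt) / 4000, 0.5)
    if any(k in prompt for k in ("refactor", "debug", "prove", "concurrency")):
        score += 0.4
    return min(score, 1.0)

def route(prompt: str) -> dict:
    d = estimate_difficulty(prompt)
    for threshold, model, budget, max_out in ROUTES:
        if d <= threshold:
            return {"model": model, "reasoning_budget": budget, "max_tokens": max_out}
```

A real router would use a classifier or past success rates rather than keyword matching, but the shape is the same: the reasoning budget and output-token cap ride along with the model choice, which is where most of the savings come from.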
[−] selcuka 50d ago
It's a race to the bottom. DeepSeek beats all others (single-shot), and its API price is ~50% cheaper than even this project's local-electricity-only cost.

> DeepSeek V3.2 Reasoning 86.2% ~$0.002 API, single-shot

> ATLAS V3 (pass@1-v(k=3)) 74.6% ~$0.004 Local electricity only, best-of-3 + repair pipeline
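The "best-of-3 + repair pipeline" behind the quoted ATLAS number can be sketched generically. Everything here is a stand-in: `generate` and `run_tests` are placeholders for a local model call and a sandboxed test run, with coin-flips standing in for whether a candidate passes.

```python
import random

def generate(task, seed):
    # Stand-in for sampling one candidate solution from the local model.
    random.seed(seed)
    return {"code": f"candidate-{seed}", "ok": random.random() < 0.5}

def run_tests(candidate):
    # Stand-in for executing the candidate against tests in a sandbox.
    return candidate["ok"]

def repair(candidate, attempt):
    # Stand-in for feeding failures back to the model for a fix.
    return {"code": candidate["code"] + f"+fix{attempt}", "ok": random.random() < 0.5}

def solve(task, k=3, repairs=2):
    for seed in range(k):                   # best-of-k sampling
        cand = generate(task, seed)
        for attempt in range(repairs + 1):  # iterative repair loop
            if run_tests(cand):
                return cand
            cand = repair(cand, attempt)
    return None
```

The point of contention later in this thread is that a harness like this is doing far more work per task than a single-shot API call, so its score isn't comparable to a pass@1 number.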

[−] memothon 50d ago
I'm always skeptical because you can make it pass the benchmarks, then you use it and it is not practically useful unlike an extremely general model.

Cool work though, really excited for the potential of slimming down models.

[−] DanielHall 50d ago
These small models, having been fine-tuned for the test, achieve frighteningly high scores, yet perform abysmally in real-world scenarios.
[−] b3ing 50d ago
Will open source or local llms kill the big AI providers eventually? If so when? I can see maybe basic chat, not sure about coding and images yet
[−] electroglyph 50d ago
what's with the weird "Geometric Lens routing" ?? sounds like a made up GPTism
[−] tgiba 50d ago
Despite skepticism I love to see experiments like that. If we all are able to run an open source model locally on mid-high end machines I'd be very happy.
[−] emp17344 50d ago
Yet more evidence that the harness matters more than the model.
[−] riidom 50d ago
Not a word about the tok/sec, unfortunately.
[−] superkuh 50d ago
If anyone else was hoping this was using Q8 internally, and that converted to Q4 it could fit in 12GB VRAM: unfortunately it's already at Q4_K_M (~9GB), and the 16GB requirement comes from other parts, not the 14B@8bit + KV cache/etc. you might guess.
[−] 15minutemail 50d ago
74% on LCB from a single 5060 Ti. I've been paying Anthropic per task and this guy is running it on electricity money. 20 minutes per task is rough for anything interactive, though.
[−] alkonaut 50d ago
Great, it became a $1000 gpu while you were reading that.
[−] 0xbadcafebee 50d ago
This is specifically an experiment using ablation and multiple passes to improve the end result. Other techniques have been found that do this (like multiple passes through the same layers). But this technique, for this one specific model, seems to be more performant while also taking much longer and requiring more complexity. It's unlikely most people would use this technique, but it's interesting.
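The related technique mentioned here, multiple passes through the same layers (weight sharing / layer recurrence), can be illustrated with a toy numpy block. This is purely conceptual and not the project's actual architecture: one fixed layer is re-applied to add effective depth without adding parameters.

```python
import numpy as np

# Toy sketch of layer recurrence: one shared layer, applied repeatedly.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) / np.sqrt(8)  # the single shared layer's weights

def layer(x):
    return np.tanh(x @ W)

def forward(x, passes):
    # Re-applying the same layer deepens the computation with zero new params.
    for _ in range(passes):
        x = layer(x)
    return x

x = rng.normal(size=(1, 8))
once, twice = forward(x, 1), forward(x, 2)  # different outputs, same weights
```

The trade the comment describes shows up directly here: extra passes cost extra compute and wall-clock time per token, in exchange for more processing from the same fixed parameter budget.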
[−] rldjbpin 46d ago

> coding benchmarks

> V3 phases were designed and tuned for LiveCodeBench.

only compared on the above benchmark, while this has been identified and is being improved for the next version.

curious to see how it compares across the board against the base model (Qwen3-14B-Q4_K_M)

[−] josefritzishere 50d ago
The core problem of AI remains unresolved, with no conceivable path to solvency. The issue is that AI isn't very good. It's OK, sometimes, under very narrow criteria. But providing AI is in reality very costly. Vague promises of it magically becoming better remain very optimistic at best and still provide no route to solvency.
[−] bdbdbdb 50d ago
This is the kind of innovation I love to see. The big AI companies days are numbered if we can have the same quality in house
[−] bilekas 50d ago
Where is an RTX 5060 Ti 16 GB for $500?

Edit: The 8GB version seems to hit this price, but the 16GB not so much.

[−] Temporary_31337 50d ago
The headline is pretty stupid - it compares a model to a GPU that models run on. Somewhere in that data centre, some part of Sonnet inference runs on a $900 GPU, or maybe an even cheaper Google tensor chip.
[−] dwa3592 50d ago
I wonder if it only works out for the benchmark problems?

One expensive and hard lesson we will learn over time is that you can't compress generality beyond a point.

[−] Aurornis 49d ago
This AI-written project is running its own LiveCodeBench on a completely different methodology. The AI-written notes even admit it:

> ATLAS scores are from 599 LCB tasks using the full V3 pipeline (best-of-3 + Lens selection + iterative repair) on a frozen 14B quantized model or "pass@k-v(k=3)". Competitor scores are single-shot pass@1 (zero-shot, temperature 0) from Artificial Analysis on 315 LCB problems -- not the same task set, so this is not a controlled head-to-head.

Instead of following the LiveCodeBench methodology, it's a harness that spins up a sandbox and spends a long time testing and refining the solution. If you did the same for Sonnet, GPT5.4, or other models, they would also get significantly higher scores, and they'd do it faster.
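The pass@1 vs. "pass@k-v(k=3)" distinction this comment hinges on is worth making concrete. The standard unbiased pass@k estimator (introduced with the HumanEval benchmark) shows why multi-sample scores run systematically higher than single-shot ones:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn from n generations of which c are correct, passes."""
    if n - c < k:
        return 1.0  # every possible draw of k includes a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 1 correct answer out of 10 generations, pass@1 is 0.10 but
# pass@3 is 0.30 -- tripling the samples tripled the headline score.
```

So even before adding a repair loop, quoting a best-of-3 number against competitors' pass@1 numbers inflates the comparison, which is the core of the objection above.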

The AI-coded README is also full of signs of vibecoded slop like the discoveries that some of the complex structures implemented were not actually being used or contributing anything to the output.

[−] negativegate 50d ago
Am I still SOL on AMD (9070 XT) when it comes to this stuff?
[−] limoce 50d ago
The title should be "Adaptive Test-time Learning and Autonomous Specialization".
[−] sznio 50d ago
On that topic, anyone here got a decent local coding AI setup for a 12GB VRAM system? I have a Radeon 6700 XT and would like to run autocomplete on it. I can fit some models in the memory and they run quick but are just a tad too dumb. I have 64GB of system ram so I can run larger models and they are at least coherent, but really slow compared to running from VRAM.
[−] eddie-wang 50d ago
[dead]
[−] itigges22 50d ago
[dead]
[−] johnwhitman 49d ago
[flagged]
[−] paxrel_ai 50d ago
[dead]
[−] wiradikusuma 50d ago
[dead]
[−] LuisvelAI 50d ago
[flagged]
[−] mergeshield 50d ago
[flagged]
[−] felixagentai 50d ago
[flagged]
[−] sayYayToLife 50d ago
[dead]
[−] ozgurozkan 50d ago
[dead]
[−] bustah 50d ago
[flagged]
[−] Razengan 50d ago
Claude Code has been bleh, or meh at best, in my experience. There are so many posts on HN fawning over it lately that it could only be a guerrilla marketing campaign.