Claude Code Found a Linux Vulnerability Hidden for 23 Years (mtlynch.io)

by eichin 268 comments 433 points

[−] mattbee 41d ago
Pasting a big batch of new code and asking Claude "what have I forgotten? Where are the bugs?" is a very persuasive on-ramp for developers new to AI. It spots threading & distributed system bugs that would have taken hours to uncover before, and where there isn't any other easy tooling.

I bet there's loads of cryptocurrency implementations being pored over right now - actual money on the table.

[−] merlindru 41d ago
I like biasing it towards the fact that there is a bug, so it can't just say "no bugs! all good!" without looking into it very hard.

Usually I ask something like this:

"This code has a bug. Can you find it?"

Sometimes I also tell it that "the bug is non-obvious"

Which I've anecdotally found to have a higher rate of success than just asking for a spot check

[−] majormajor 41d ago
Do you not run into too many false positives around "ah, this thing you used here is known to be tricky, the issue is..."

I've seen that when prompting it to look for concurrency issues vs saying something more like "please inspect this rigorously to look for potential issues..."

[−] cmrdporcupine 41d ago
What's more useful is to have it not only find such bugs but prove them with a regression test. In Rust, for concurrency bugs, that means writing e.g. Shuttle or Loom tests.
[−] majormajor 41d ago
It would be generally good if most code made setting up such tests as easy as possible, but in most corporate codebases this second step is gonna require a huge amount of refactoring or boilerplate crap to get the things interacting in the test env in an accurate, well-controlled way. You can quickly end up fighting to understand "is the bug not actually there, or is the attempt to repro it not working correctly?"

(Which isn't to say don't do it: I think this is a huge benefit you can gain from being able to refactor more quickly. Just to say that you're gonna short-term give yourself a lot more homework to make sure you don't fix things that aren't bugs, or break other things in your quest to make them more provable/testable.)

[−] simulator5g 41d ago
That is an unfortunate case you described, but also, git gud and write tests in the first place so you don't need to refactor things down the road.
[−] merlindru 39d ago
yes but i can identify those easily. i know that if it flags something that is obviously a non issue, i can discard it.

...because false positives are good errors. false negatives is what i'm worried about.

i feel massively more sure that something has no big oversights if multiple runs (or even multiple different models) cannot find anything but false positives

[−] Nition 41d ago
Just in case you didn't read the full article, this is how they describe finding the bugs in the Linux kernel as well.

Since it's a large codebase, they go even more specific and hint that the bug is in file A, then try again with a hint that the bug is in file B, and so on.

[−] merlindru 39d ago
very interesting. i think "verbal biasing" and "knowing how to speak" in general is a really important thing with LLMs. it seems to massively affect output. (interestingly, somewhat less with Opus than with GPT-5.4 and Composer 2. Opus seems to intuit a little better. but still important.)

it's like the idea behind the book _The Mom Test_ suddenly got very important for programming

[−] jiggawatts 41d ago
As a meta activity, I like to run different codebases through the same bug-hunt prompt and compare the number found as a barometer of quality.

I was very impressed when the top three AIs all failed to find anything other than minor stylistic nitpicks in a huge blob of what to me looked like “spaghetti code” in LLVM.

Meanwhile at $dayjob the AI reviews all start with “This looks like someone’s failed attempt at…”

[−] kgwxd 41d ago

> so it can't just say "no bugs! all good!"

If anyone, or anything, ever answers a question like that, you should stop asking it questions.

[−] wat10000 41d ago
You just have to be careful, because it will sometimes spot bugs you could never uncover because they're not real. You can really see the pattern matching at work with twisted code. It tends to look at things like lock-free algorithms and declare them full of bugs regardless of whether they are or not.
[−] mattbee 40d ago
I have seen it start on a sentence, get lost and finish it with something like "Scratch that, actually it's fine."

And if it's not giving me a reason I can understand for a bug, I'm not listening to it! Mostly it is showing me I've mixed up two parameters, forgotten to initialise something, or referenced a variable from a thread that I shouldn't have.

The immediate feedback means the bug usually gets a better-quality fix than it would if I had got fatigued hunting it down! So variables get renamed to make sure I can't get them mixed up, a function gets broken out. It puts me in the mind of "well make sure this idiot can't make that mistake again!"

[−] dvfjsdhgfv 41d ago

> Pasting a big batch of new code and asking Claude "what have I forgotten? Where are the bugs?"

It's actually the main way I use CC/codex.

[−] petesergeant 41d ago
I find Codex sufficiently better for it that I’ve taught Claude how to shell out to it for code reviews
[−] linsomniac 41d ago
Ditto, I made a "/codex-review" skill in Claude Code that reviews the last git commit and writes an analysis of it for Claude Code to then work from. I've had very good luck with it.

One particularly striking example: I had CC do some work and then kicked off a "/codex-review" and while it was running went to test the changes. I found a deadlock but when I switched back to CC the Codex review had found the deadlock and Claude Code was already working on a fix.

[−] cmrdporcupine 41d ago
I think OpenAI has actually released an official version of exactly this: https://community.openai.com/t/introducing-codex-plugin-for-...

https://github.com/openai/codex-plugin-cc

I actually work the other way around. I have Codex write "packets" to give to Claude. I have Claude write the code. Then I have Codex review it and find all the problems (there's usually lots of them).

Only because this month I have the $100 Claude Code and the $20 Codex. I did not renew Anthropic though.

[−] motbus3 41d ago
Yeah and it comes with the blood of children included
[−] justinclift 41d ago

> It spots threading & distributed system bugs that would have taken hours to uncover before, and where there isn't any other easy tooling.

Go has a built in race detector which may be useful for this too: https://go.dev/doc/articles/race_detector

Unsure if it's suitable for inclusion in CI, but seems like something worth looking into for people using Go.

[−] 9dev 41d ago
I usually do several passes of "review our work. Look for things to clean up, simplify, or refactor." It does usually improve the quality quite a lot; then I rewind history to before, but keep the changes, and submit the same prompt again, until it reaches the point of diminishing returns.
[−] trueno 40d ago
ive gone down this rabbit hole and i dunno, sometimes claude chases a smoking gun that just isn't a smoking gun at all. if you ask him to help find a vulnerability he's not gonna come back empty-handed even if there's nothing there; he might frame a nice-to-have as a critical problem. in my exp you have to build tests that prove vulnerabilities in some way. otherwise he's just gonna rabbithole while failing to look at everything.

ive had some remarkable successes with claude and quite a few "well that was a total waste of time" efforts with claude. for the most part i think trying to do uncharted/ambitious work with claude is a huge coinflip. he's great for guardrailed and well understood outcomes though, but im a little burnt out and unexcited at hearing about the gigantic-claude exercises.

[−] slig 41d ago

> "Codex wrote this, can you spot anything weird?"

[−] userbinator 41d ago
Not "hidden", but probably more like "no one bothered to look".

> declares a 1024-byte owner ID, which is an unusually long but legal value for the owner ID.

When I'm designing protocols or writing code with variable-length elements, "what is the valid range of lengths?" is always at the front of my mind.

> it uses a memory buffer that’s only 112 bytes. The denial message includes the owner ID, which can be up to 1024 bytes, bringing the total size of the message to 1056 bytes. The kernel writes 1056 bytes into a 112-byte buffer

This is something a lot of static analysers can easily find. Of course asking an LLM to "inspect all fixed-size buffers" may give you a bunch of hallucinations too, but could be a good starting point for further inspection.
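In C terms, the pattern the article describes reads roughly like this; a minimal sketch where every name and constant is illustrative, not the actual kernel code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define REPLY_BUF_SIZE 112   /* fixed-size message buffer */
#define MAX_OWNER_LEN 1024   /* longest owner ID the protocol allows */
#define PREFIX_LEN 7         /* strlen("DENIED:") */

/* Buggy shape: trusts the sender-supplied length. With owner_len up
 * to 1024, the copy runs far past a 112-byte buffer. (Never call.) */
size_t build_denial_unchecked(char *buf, const char *owner, size_t owner_len)
{
    memcpy(buf, "DENIED:", PREFIX_LEN);
    memcpy(buf + PREFIX_LEN, owner, owner_len); /* potential overflow */
    return PREFIX_LEN + owner_len;
}

/* Fixed shape: answer "what is the valid range of lengths?" first.
 * Returns 0 if the message would not fit. */
size_t build_denial_checked(char *buf, size_t buf_size,
                            const char *owner, size_t owner_len)
{
    if (owner_len > MAX_OWNER_LEN || PREFIX_LEN + owner_len > buf_size)
        return 0;
    memcpy(buf, "DENIED:", PREFIX_LEN);
    memcpy(buf + PREFIX_LEN, owner, owner_len);
    return PREFIX_LEN + owner_len;
}
```

The length check is exactly the "valid range" question above, asked before the copy instead of never.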

[−] tptacek 41d ago
"No one bothered to look" is how most vulnerabilities work. Systems development produces code artifacts with compounding complexity; it is extraordinarily difficult to keep up with it manually, as you know. A solution to that problem is big news.

Static analyzers will find all possible copies of unbounded data into smaller buffers (especially when the size of the target buffer is easily deduced). They will then report them whether or not every path to that code clamps the input. Which is why this approach doesn't work well in the Linux kernel in 2026.

[−] rubendev 41d ago
With a capable static analyzer that is not true. In many common cases they can deduce the possible ranges of values based on branching checks along the data flow path, and if that range falls within the buffer then it does not report it.
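A minimal sketch of the kind of data flow meant here; the function and sizes are made up, and this describes the general range-propagation technique rather than any specific tool:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { DST_SIZE = 64 };

/* A range-tracking analyzer sees that past the early return the
 * value of len lies in [0, DST_SIZE], so the memcpy is provably
 * in-bounds on that path and is not reported. A purely syntactic
 * checker would still flag the variable-length copy. */
int copy_if_bounded(char *dst, const char *src, size_t len)
{
    if (len > DST_SIZE)
        return -1;          /* reject: len out of range */
    memcpy(dst, src, len);  /* here len <= DST_SIZE holds */
    return 0;
}
```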
[−] tptacek 41d ago
Be specific. Which analyzer are you talking about and which specific targets are you saying they were successful at?
[−] canucker2016 41d ago
Intrinsa's PREfix static source code analyzer would model the execution of the C/C++ code to determine values which would cause a fault.

IIRC they were using a C/C++ compiler front end from EDG to parse C/C++ code to a form they used for the simulation/analysis.

see https://web.eecs.umich.edu/~weimerw/2006-655/reading/bush-pr... for more info.

Microsoft bought Intrinsa several years ago.

[−] tptacek 41d ago
I'm sure this is very interesting work, but can you tell me what targets they've been successful surfacing exploitable vulnerabilities on, and what the experience of generating that success looked like? I'm aware of the large literature on static analysis; I've spent most of my career in vulnerability research.
[−] canucker2016 41d ago
PREfix wasn't designed specifically for finding exploitable bugs - it was aimed somewhere in between Purify (runtime bug detection) and being a better lint.

One of the articles/papers I recall was that the big problem for PREfix when simulating the behaviour of code was the explosion in complexity if a given function had multiple paths through it (e.g. multiple if's/switch statements). PREfix had strategies to reduce the time spent in these highly complex functions.

Here's a 2004 link that discusses the limitations of PREfix's simulated analysis - https://www.microsoft.com/en-us/research/wp-content/uploads/...

The above article also talks about Microsoft's newer (for 2004) static analysis tools.

There's a Netscape engineer endorsement in a CNet article when they first released PREfix. see https://www.cnet.com/tech/tech-industry/component-bugs-stamp...

[−] mrshadowgoose 41d ago

> Not "hidden", but probably more like "no one bothered to look".

Well yeah. There weren't enough "someones" available to look. There are a finite number of qualified individuals with time available to look for bugs in OSS, resulting in a finite amount of bug finding capacity available in the world.

Or at least there was. That's what's changing as these models become competent enough to spot and validate bugs. That finite global capacity to find bugs is now increasing, and actual bugs are starting to be dredged up. This year will be very very interesting if models continue to increase in capability.

[−] literalAardvark 41d ago
I was just thinking about this and what it means for closed source code.

Many people with skin in the game will be spending tokens on hardening OSS bits they use, maybe even part of their build pipelines, but if the code is closed you have to pay for that review yourself, making you rather uncompetitive.

You could say there's no change there, but the number of people who can run a Claude review and the number of people who can actually review a complicated codebase are several orders of magnitude apart.

Will some of them produce bad PRs? Probably. The battle will be to figure out how to filter them at scale.

[−] dolmen 41d ago
I have no doubt that LLMs can be as good at analyzing binaries as at analyzing source code.

An avalanche of 0-days in proprietary code is coming.

[−] NitpickLawyer 41d ago

> This is something a lot of static analysers can easily find.

And yet they didn't (either no one ran them, or they didn't find it, or they did find it but it was buried in hundreds of false positives) for 20+ years...

I find it funny that every time someone does something cool with LLMs, there's a bunch of takes like this: it was trivial, it's just not important, my dad could have done that in his sleep.

[−] userbinator 41d ago
Remember Heartbleed in OpenSSL? That long predated LLMs, but same story: some bozo forgot how long something should/could be, and no one else bothered to check either.
[−] choeger 41d ago
It's much, much, easier to run an LLM than to use a static or dynamic analyzer correctly. At the very least, the UI has improved massively with "AI".
[−] wat10000 41d ago
There’s the classic case of the Debian OpenSSL vulnerability, where technically illegal but practically secure code was turned into superficially correct but fundamentally insecure code in an attempt to fix a bug identified by a (dynamic, in this case) analyzer.
[−] mrshadowgoose 41d ago
And even if that's true (and it frequently is!), detractors usually miss the underlying and immense impact of "sleeping dad capability" equivalent artificial systems.

Horizontally scaling "sleeping dads" takes decades, but inference capacity for a sleeping dad equivalent model can be scaled instantly, assuming one has the hardware capacity for it. The world isn't really ready for a contraction of skill dissemination going from decades to minutes.

[−] DGAP 41d ago
I replicated this experiment on several production codebases and got several crits. Lots of dupes, lots of false positives, lots of bugs that weren't actually exploitable, lots of accepted/ known risks. But also, crits!
[−] altern8 41d ago
Every time I read these titles, I wonder if people are for some reason pushing the narrative that Claude is way smarter than it really is, or if I'm using it wrong.

They want me to code AI-first, and the amount of hallucinations and weird bugs and inconsistencies that Claude produces is massive.

Lots of code that it pushes would NOT have passed a human/human code review 6 months ago.

[−] dist-epoch 41d ago

> "given enough eyeballs, all bugs are shallow"

Time to update that:

"given 1 million tokens context window, all bugs are shallow"

[−] PeterStuer 41d ago
Those 3 letter agencies are going to see their stash of 0-days dwindle so hard.
[−] fguerraz 41d ago
Interestingly, I think 3 or 4 out of the 5 bugs would have been prevented / mitigated quite well using https://github.com/anthraxx/linux-hardened patches...

(disabled io_uring, would have crashed the kernel on UAF, and made exploitation of the heap overflow very unreliable)

[−] summarity 41d ago
Related work from our security lab:

Stream of vulnerabilities discovered using security agents (23 so far this year): https://securitylab.github.com/ai-agents/

Taskflow harness to run (on your own terms): https://github.blog/security/how-to-scan-for-vulnerabilities...

[−] jazz9k 42d ago
This does sound great, but the cost of tokens will prevent most companies from using agents to secure their code.
[−] cesaref 41d ago
I'm interested in the implications for the open source movement, specifically about security concerns. Anyone know if there has been a study of how well Claude Code works on closed-source (but decompiled) code?
[−] misiek08 41d ago
Do not expect so many more reports. Expect so many more attacks ;)
[−] e12e 41d ago
I wonder about the "video running in the background" during qna of the talk:

https://youtu.be/1sd26pWhfmg?is=XLJX9gg0Zm1BKl_5

Did he write an exploit for the NFS bug that runs via network over USB? Seems to be plugging in a SoC over USB...?

[−] eichin 42d ago
An explanation of the Claude Opus 4.6 linux kernel security findings as presented by Nicholas Carlini at unpromptedcon.
[−] simplesocieties 40d ago
Supposedly humans have become “100x”™ more productive with these AI tools, but nowhere to be seen are the benefits for the wielders of said tools. Is your salary 100x higher? Are you able to spend more time with your family/friends instead of at the office? Why are we still putting up with these outdated work practices if LLMs have made everybody so much more productive?
[−] skeeter2020 41d ago
And with AI generating vulnerabilities at an accelerated pace this business is only getting bigger. Welcome to the new antivirus!
[−] rixrax 41d ago
I hope performance and bloat are next up for the LLMs to try and improve.

Especially on the perf side, I would wager LLMs can go from the meat sacks' "whatever works" to "how do I solve this with the best available algorithm and architecture (that also follows some best practices)".

[−] alsanan2 41d ago
Making it public that AI is capable of finding this kind of vulnerability is a big problem. In this case it's nice that the vulnerability was closed before publishing, but if a cracker had found it first, the result would have been extremely different. This kind of news only opens the crackers' eyes.
[−] jason1cho 41d ago
This isn't surprising. What is not mentioned is that Claude Code also found one thousand false-positive bugs, which developers spent three months ruling out.
[−] _pdp_ 41d ago
The title is a little misleading.

It was Opus 4.6 (the model). You could discover this with some other coding agent harness.

The other thing that bugs me, and frankly I don't have the time to try it out myself, is that they did not compare to see whether the same bug would have been found with GPT 5.4 or perhaps even an open source model.

Without that, and for the reasons I posted above, while I am sure this is not the intention, the post reads like an ad for claude code.

[−] cookiengineer 41d ago

> Nicholas has found hundreds more potential bugs in the Linux kernel, but the bottleneck to fixing them is the manual step of humans sorting through all of Claude’s findings

No, the problem is sorting out thousands of false positives from Claude Code's reports. 5 valid reports out of 1000+ is statistically worse than running a fuzzer on the codebase.

Just sayin'

[−] up2isomorphism 42d ago
But on the other hand, Claude might introduce more vulnerabilities than it discovers.
[−] desireco42 41d ago
A developer using Claude Code found this bug. Claude is a tool. It is used by developers. It should not sign commits. Neovim never tried to sign my commits, and neither did Zed.