Small models also found the vulnerabilities that Mythos found (aisle.com)

by dominicq 341 comments 1284 points

[−] johnfn 34d ago
The Anthropic writeup addresses this explicitly:

> This was the most critical vulnerability we discovered in OpenBSD with Mythos Preview after a thousand runs through our scaffold. Across a thousand runs through our scaffold, the total cost was under $20,000 and found several dozen more findings. While the specific run that found the bug above cost under $50, that number only makes sense with full hindsight. Like any search process, we can't know in advance which run will succeed.

Mythos scoured the entire continent for gold and found some. For these small models, the authors pointed at a particular acre of land and said "any gold there? eh? eh?" while waggling their eyebrows suggestively.

For a true apples-to-apples comparison, let's see it sweep the entire FreeBSD codebase. I hypothesize it will find the exploit, but it will also turn up so much irrelevant nonsense that it won't matter.

[−] kilpikaarna 34d ago
Wasn't the scaffolding for the Mythos run basically a line of bash that loops through every file of the codebase and prompts the model to find vulnerabilities in it? That sounds pretty close to "any gold there?" to me, only automated.

Have Anthropic actually said anything about the amount of false positives Mythos turned up?

FWIW, I saw some talk on Xitter (so grain of salt) about people replicating their result with other (public) SotA models, but each turned up only a subset of the ones Mythos found. I'd say that sounds plausible from the perspective of Mythos being an incremental (though an unusually large increment perhaps) improvement over previous models, but one that also brings with it a correspondingly significant increase in complexity.

So the angle they choose to use for presenting it and the subsequent buzz is at least part hype -- saying "it's too powerful to release publicly" sounds a lot cooler than "it costs $20000 to run over your codebase, so we're going to offer this directly to enterprise customers (and a few token open source projects for marketing)". Keep in mind that the examples in Nicholas Carlini's presentation were using Opus, so security is clearly something they've been working on for a while (as they should, because it's a huge risk). They didn't just suddenly find themselves having accidentally created a super hacker.

[−] johnfn 34d ago

> Wasn't the scaffolding for the Mythos run basically a line of bash that loops through every file of the codebase and prompts the model to find vulnerabilities in it? That sounds pretty close to "any gold there?" to me, only automated.

But the entire value is that it can be automated. If you try to automate a small model to look for vulnerabilities over 10,000 files, it's going to say there are 9,500 vulns. Or none. Both are worthless without human intervention.

I definitely breathed a sigh of relief when I read it was $20,000 to find these vulnerabilities with Mythos. But I also don't think it's hype. $20,000 is, optimistically, a tenth the price of a security researcher, and that shift does change the calculus of how we should think about security vulnerabilities.

[−] omcnoe 34d ago
Difference is the scaffold isn’t “loop over every file” - it’s loop over every discovered vulnerable code snippet.

If you narrow the codebase down to just the known-vulnerable code up front, it isn’t surprising that the vulnerabilities are easy to discover. The same is true for humans.

Better models can also autonomously do the work of writing proof of concepts and testing, to autonomously reject false positives.

[−] eichin 33d ago
That was the scaffolding for the Claude 4.6 run discussed here https://news.ycombinator.com/item?id=47633855 - if that's all it takes, dealing with Mythos is way too late :-)
[−] adam_patarino 33d ago
Anthropic has had the chance to explain what they did rationally. Instead they chose to be opaque and grandiose.

Giving them the benefit of the doubt is no longer appropriate.

[−] leiyu19880522 33d ago
Been building AI coding tools for a while. The false positive problem is real - we had a user report every console.log flagged as security issue. Small models can work with very specific prompting and domain training data.
[−] asasidh 33d ago
yes their scaffold was a variation of claude --dangerously-skip-permissions -p "You are playing in a CTF. Find a vulnerability. hint: look in src folder. Write the most serious one to ./va/report.txt." --verbose
[−] nottorp 33d ago

> Have Anthropic actually said anything about the amount of false positives Mythos turned up?

What? You want honest "AI" marketing?

Would you also like them to tell you how much human time was spent reviewing those found vulnerabilities before passing them on? And a unicorn delivered on Mars?

[−] slashdave 34d ago
Signal to noise
[−] epistasis 34d ago

> We took the specific vulnerabilities Anthropic showcases in their announcement, isolated the relevant code, and ran them through small, cheap, open-weights models. Those models recovered much of the same analysis. Eight out of eight models detected Mythos's flagship FreeBSD exploit, including one with only 3.6 billion active parameters costing $0.11 per million tokens.

Impressive, and very valuable work, but isolating the relevant code changes the situation so much that I'm not sure it's much of the same use case.

Being able to dump an entire code base and have the model scan it is the type of situation that opens up vulnerability scans to a much larger class of people.

[−] tptacek 34d ago
If you cut out the vulnerable code from Heartbleed and just put it in front of a C programmer, they will immediately flag it. It's obvious. But it took Neel Mehta to discover it. What's difficult about finding vulnerabilities isn't properly identifying whether code is mishandling buffers or holding references after freeing something; it's spotting that in the context of a large, complex program, and working out how attacker-controlled data hits that code.

It's weird that Aisle wrote this.

[−] antirez 34d ago
Congrats: completely broken methodology, with a big conflict of interest. Giving specific bug hints, with an isolated function that is suspected to have bugs, is not the same task, NOR (crucially) is it a task you can decompose the bigger task into. It is basically impossible to segment code into pieces, provide the pieces to smaller models, and expect them to find all the bugs GPT 5.4 or other large models can find.

Second: the smarter the model, the less the pipeline matters. In the last couple of days I found tons of Redis bugs with a three-prompt open-ended pipeline composed of a couple of shell scripts. Do you think I had not already tried with weaker models? I did, but it didn't work.

Don't trust what you read: you have access to frontier models for $20 a month. Download some C code, create a trivial pipeline that starts from a random file and looks for vulnerabilities, then add another step that validates each finding under a hard test, like an ASAN crash or the ability to reach some secret, and only then can the problem be reported. Test for yourself what is possible. Don't let your fear make you blind.

Also, there is a big problem that makes the blog post's reasoning not just weak per se, but categorically weak: if small model X can find 80% of vulnerabilities and there is a model Y that can find the other potential 20%, we need "Y". Maintainers should make sure they have access to models that are at least as good as the ones the black hats have.
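
The find-then-validate loop described above can be sketched roughly as follows. This is a minimal sketch, not the actual scripts: `propose` stands in for whatever model call suggests a bug in a file, and `validate` for a hard check such as an ASAN crash on the suggested reproducer. Both callables, the `Candidate` type, and all parameter names are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Candidate:
    path: str        # file the model was pointed at
    claim: str       # the model's description of the suspected bug
    reproducer: str  # input the model says will trigger it


def run_pipeline(
    files: Iterable[str],
    propose: Callable[[str], Optional[Candidate]],
    validate: Callable[[Candidate], bool],
    attempts: int,
    seed: int = 0,
) -> List[Candidate]:
    """Pick random files, ask for a candidate bug, keep only validated ones."""
    pool = list(files)
    if not pool:
        return []
    rng = random.Random(seed)
    confirmed: List[Candidate] = []
    for _ in range(attempts):
        candidate = propose(rng.choice(pool))
        # Hard gate: a claim that fails validation is never reported.
        if candidate is not None and validate(candidate):
            confirmed.append(candidate)
    return confirmed
```

The point of the second stage is that nothing reaches a human unless it survives an objective test, which is what keeps the false-positive load manageable regardless of which model fills the `propose` role.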
[−] muyuu 33d ago
I think the "Mythos" name is genius. The people at Anthropic make a bunch of claims and the public is expected to just believe them without any possibility of testing those claims or reproducing those results, and since so many people are invested in this saviour for the Global economy, or in the industry in general, or in hype to feed their engagement-based income sources, then there is faith to spare.

Meanwhile this mythical beast wasn't able to prevent the Bun vulnerability that exposed their code, let alone precluding the need to acquire that IP in the first place for presumably hundreds of millions of $$$, instead of coding a better replacement or a solution of its own.

What is real and measurable is that subscription plan users are getting a much degraded service for the same money through both open and hidden policies, while Anthropic moves compute to serve off-the-counter customers. The same people who come with the most obvious and brazen lies to dismiss the clear degradation of their service also come with this "security" justification for a move that looks just like good old market segmentation which would perfectly fit the strong symptoms that they cannot afford to offer tokens at a competitive price in this market.

[−] vmg12 34d ago
The technique Anthropic uses was demonstrated by Nicholas Carlini in a talk he gave 2 weeks ago, and it's very simple: when asking LLMs to review code, ask them to focus their review on one file in a single session. Here is the video with the timestamp (watch through to ~5:30, they show two different ways of prompting claude).

https://youtu.be/1sd26pWhfmg?t=204

https://youtu.be/1sd26pWhfmg?t=273

IMO the big "innovation" being shown by Mythos is how effective it is to prompt LLMs to look for security vulnerabilities by focusing on specific files one at a time, and to automate this prompting with a simple script.

Prompting Mythos to focus on a single file per session is why I suspect it cost Anthropic $20k to find some of the bugs in these codebases. I know this same technique is effective with Opus 4.6 and GPT 5.4 because I've been using it on my own code. If you just ask an agent to review your PR with a low-effort prompt, it is not exhaustive: it will not actually read each changed file and look at how it interacts with the system as a whole. If the entire session is dedicated to reviewing the changes to a single file, the LLM will do much more work reviewing it.

Edit: I changed my phrasing, it's not about restricting its entire context to one file but focusing it on one file but still allowing it to look at how other files interact with it.

[−] woodruffw 34d ago

> Those models recovered much of the same analysis

This is an essentially unquantifiable statement that makes the underlying claim harder to believe as an external party. What does “much” mean here? The end state of vulnerability exploitation is typically eminently quantifiable (in the form of a functional PoC that demonstrates an exploited end state), so the strong version of the claims here would ideally be backed up by those kinds of PoCs.

(Like other readers, I also find the trick of pre-feeding the smaller models the “relevant” code to be potentially disqualifying in a fair comparison. Discovering the relevant code is arguably one of the hardest parts of human VR.)

[−] StrauXX 33d ago
A lot of comments here are dismissing this post because the relevant code was isolated. But that's the exact same thing Anthropic did with Mythos! They describe their (very lean) harness in the Anthropic Red Mythos blog post [1]. The harness first assigns each file in the given codebase an importance value. Then it points Claude Code at the codebase with a prompt stating that it should focus on that file. It spawns a Claude Code instance for each file in the codebase.

So no, the fact that the posters isolated the relevant code does not invalidate their findings.

[1] https://red.anthropic.com/2026/mythos-preview/
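
The two steps described above (rank files, then start one focused session per file) might look roughly like this. A sketch under assumptions: `score` is a hypothetical stand-in for however importance gets assigned, the prompt wording is invented for illustration, and nothing here is taken from Anthropic's actual harness.

```python
from typing import Callable, List, Tuple


def plan_sessions(
    files: List[str],
    score: Callable[[str], float],
    prompt_template: str,
) -> List[Tuple[str, str]]:
    """Return (path, prompt) pairs, most important file first.

    Each pair is meant to become its own agent session: the agent may read
    the whole codebase, but the prompt tells it which file to focus on.
    """
    ranked = sorted(files, key=score, reverse=True)
    return [(path, prompt_template.format(path=path)) for path in ranked]


# Hypothetical importance values and prompt text, for illustration only.
example_scores = {"net/tcp_input.c": 0.9, "docs/README": 0.1, "fs/nfs_subs.c": 0.7}
example_plan = plan_sessions(
    list(example_scores),
    lambda path: example_scores[path],
    "Look for vulnerabilities. Focus on {path}, but read other files as needed.",
)
```

Note that the focus file is a hint about where to start, not a restriction on what the agent can read, which is a different setup from handing a model only an extracted snippet.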

[−] lordofgibbons 34d ago
Without showing false-positive rates this analysis is useless.

If your model says every line of your code has a bug, it will catch 100% of the bugs, but it's not useful at all. They tested false-positives with only a single bug...

I'm not defending anthropic and openai either. Their numbers are garbage too since they don't produce false-positive rates either.

Why is this "analysis" making the rounds?

[−] MaxLeiter 34d ago
I think the key thing here is they "isolated the relevant code"

If the exploits exist in e.g. one file, great. But many complex zerodays and exploits are chains of various bugs/behaviors in complex systems.

Important research but I don’t think it dispels anything about Mythos

[−] throwaway13337 34d ago
So there are two competing narratives:

1. Mythos uniquely is able to find vulnerabilities that other LLMs cannot practically.

2. All LLMs could already do this but no one tried the way anthropic did.

The truth is one of these. And it comes down to whether the comparison is apples to apples. Since we don't know the exact specifics of how either test was performed, we have no way of knowing for certain.

So I guess, like so many things today, we get to pick the truth we find most comfortable personally.

[−] chirau 34d ago
Their isolation approach is totally different from the Mythos approach, though. Mythos had to evaluate whole codebases rather than isolated sections. It's like saying one dog walked into the Amazon jungle and found a tennis ball, and then another team marked off 1 square kilometer that they knew the ball was definitely in and found the same ball.
[−] TacticalCoder 34d ago
I don't dispute the fact that it's more than cool that we have a new tool to find security exploits (and do many other things) but... A big shout-out to OpenBSD?

We're literally talking about the biggest computers on the planet ever, trained with the biggest amount of data ever available to a system, with the biggest investment ever made by man or close to it and...

The subtlest security bug it can find required going 28 years into the past and finding a...

Denial-of-service?

A freaking DoS? Not a remote root exploit. Not a local exploit.

Just a DoS? And it had to go into 28 years old code to find that?

So kudos, hats off, deep bow not to Mythos but to OpenBSD? Just a bit, no!?

[−] bryantwolf 34d ago
All of this discourse seems very bizarre.

If smaller models can find these things, that doesn’t mean mythos is worse than we thought. It means all models are more capable.

Also, if pointing models at files and giving them hints is all it takes to make them find all kinds of stuff, well, we can also spray and pray pretty well with LLMs, can’t we?

It just points to us finding a lot more stuff with only a little bit more sophistication.

Hopefully the growing pains are short and defense wins

[−] chopete3 34d ago
The impact of the Mythos announcement on cybersecurity firms (like CrowdStrike, Zscaler, etc.) is big enough (a 10-15% drop in stock price) that this pushback is expected.

Companies like Aisle.com (the blog) and other VAPT companies charge huge amounts to detect vulnerabilities.

If Claude Mythos becomes a simple GitHub hook, their value will be reduced.

That is a disruption.

[−] bhouston 34d ago
This is quite misleading.

If you isolate the positive cases and then ask a tool to label them and it labels them all positive, that doesn't prove anything. This is a one-sided test and it is really easy to write a tool that passes it -- just always return true!

You need to test your tool on both positive and negative cases and check if it is accurate on both.

If you don't, you could end up with hundreds or thousands of false positives when using this on real-world samples.

The real test is to use it to find new real bugs in the midst of a large code base.
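
To make the one-sided-test problem concrete, here is a small sketch that scores a detector on labeled samples. The `always_true` detector and the sample counts in the usage note are hypothetical, chosen only to illustrate the argument.

```python
from typing import Callable, Dict, List, Tuple


def evaluate(
    detector: Callable[[str], bool],
    samples: List[Tuple[str, bool]],
) -> Dict[str, float]:
    """Score a detector on (code, is_vulnerable) pairs."""
    tp = fp = fn = 0
    for code, is_vulnerable in samples:
        flagged = detector(code)
        if flagged and is_vulnerable:
            tp += 1
        elif flagged and not is_vulnerable:
            fp += 1
        elif not flagged and is_vulnerable:
            fn += 1
    return {
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "false_positives": fp,
    }


def always_true(code: str) -> bool:
    """A useless detector: flags everything as vulnerable."""
    return True
```

On a set containing only known-vulnerable samples, `always_true` scores perfect recall and perfect precision. Add 97 clean samples to 3 vulnerable ones and recall stays at 1.0 while precision falls to 0.03, which is the gap a positives-only test cannot reveal.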

[−] operatingthetan 34d ago
My theory is that Mythos is basically just Opus with revised context window handling and more compute thrown at it. So while it will be a step forward, it is probably primarily hype.
[−] amazingamazing 34d ago
Did mythos isolate the code to begin with? Without a clear methodology that can be attempted with another model the whole thing is meaningless
[−] dist-epoch 34d ago
Anthropic's claim is not necessarily that Mythos found vulnerabilities that other models couldn't, but that it could easily exploit them while previous models failed to do that:

> “Opus 4.6 is currently far better at identifying and fixing vulnerabilities than at exploiting them.” Our internal evaluations showed that Opus 4.6 generally had a near-0% success rate at autonomous exploit development. But Mythos Preview is in a different league. For example, Opus 4.6 turned the vulnerabilities it had found in Mozilla’s Firefox 147 JavaScript engine—all patched in Firefox 148—into JavaScript shell exploits only two times out of several hundred attempts. We re-ran this experiment as a benchmark for Mythos Preview, which developed working exploits 181 times, and achieved register control on 29 more.

[−] slibhb 34d ago
The best way to think of Anthropic's communication about Mythos is as advertisement. It's basically "our model is too smart to release" which suggests they're ahead of OpenAI (without proof)
[−] mrifaki 34d ago
finding vulns in a large codebase is a search problem with a huge negative space and what aisle measured is classification accuracy on ground-truth positives, those are different tasks so a model that correctly labels a pre-isolated vulnerable function tells me almost nothing about that model's ability to surface the same function out of a million lines of unrelated code under a realistic triage budget

the experiment i'd want to see is running each of the small models as an unsupervised scanner across full freebsd, then returning the top-k suspicious functions per model and computing precision at recall levels that correspond to real analyst triage budgets. if mythos's findings show up in the small models' top 100, i'd call that meaningful, but if they only surface under 10k false positives then the cost advantage collapses, because analyst triage time is more expensive than frontier model compute to begin with
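
The metric proposed above can be computed with a few lines. This is a sketch assuming each scanner outputs a list of function names ordered from most to least suspicious. The function names in the usage note are hypothetical.

```python
from typing import List, Set, Tuple


def precision_recall_at_k(
    ranked: List[str],
    known_vulnerable: Set[str],
    k: int,
) -> Tuple[float, float]:
    """Precision and recall over the top-k entries of a ranked scanner output.

    k models the triage budget: how many findings an analyst can review.
    """
    top = ranked[:k]
    hits = sum(1 for name in top if name in known_vulnerable)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(known_vulnerable) if known_vulnerable else 0.0
    return precision, recall
```

For example, with a ranking of `["f1", "f2", "f3", "f4", "f5"]` and known-vulnerable functions `{"f2", "f5"}`, a budget of k=2 gives precision 0.5 and recall 0.5. Only at k=5 does recall reach 1.0, at which point precision has dropped to 0.4.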

second thing i keep coming back to is the $20k mythos number is a search budget not a model cost, small models at one hundredth the per-token price don't give us one hundredth the total budget when the search process is the same shape, i still run thousands of iterations and the issue for autonomous vuln research is how fast the reward signal converges and the aisle post doesn't touch any of this

[−] solatic 33d ago
Most commenters here: "Mythos is powerful because you can point it at a whole codebase, if you point the smaller models at a whole codebase and iterate through small sections of code, you'll get too many false-positives to handle."

This misses the point entirely. You pay $20k as a one-time fee to establish a baseline. Your codebase develops one PR at a time, which... updates isolated sections of code. Which means you don't need Mythos for a PR, just small, open-weight models. Maybe you run Mythos once a year to ensure that you keep your baseline updated and reduce the risk that the open-weights models missed anything.

Seeing this as anything but a huge win for open-weights models and a huge loss for Anthropic misses the point entirely. Mythos isn't something you can persuade Fortune 500 companies to spend $20k/day or even $20k/week on, like they were hoping for. $20k/year is a lot less valuable, and it won't justify development costs or Anthropic's growth multiple.

[−] herf 34d ago
There are a lot of details in the original article, in most cases comparing with Opus, which required "human guidance" to exploit the FreeBSD vulnerability:

https://red.anthropic.com/2026/mythos-preview/

Also "isolating the relevant code" in the repro is not a detail - Mythos seems to find issues much more independently.

[−] abel_ 34d ago
This misses the broader ongoing trend. For a few million dollars, of course you can create a startup that builds tools it can use to more efficiently find code vulnerabilities. And of course you can do this with weaker models with scaffolds that incorporate lots of human understanding. The difference now is that you don't need an expensive team, nor a bunch of human heuristics, nor a million dollars. The requisite cost and skill are falling rapidly.
[−] coppsilgold 34d ago
LLMs are wordsmith oracles. A lot of effort went into trying to coax interactive intelligence from them but the truth is that you could have probably always harnessed the base models directly to do very useful things. The instruct tuned models give your harness even more degrees of freedom.

A while ago, the autoresearch[1] harness went viral, yet it's but a highly simplified version of AlphaEvolve[2][3][4].

In the cybersecurity context, you can envision a clever harness that probes every function in a codebase for vulnerabilities, then bubbles the candidates up to their callsites (and probes whether the vulnerability can be triggered from there) and then all the way to an interface (such as a syscall) where a potential exploit can be manifested. And those would be the low hanging fruit, other vulnerabilities may require the interplay of multiple functions. Or race conditions.
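
The "bubble up to callsites" step can be sketched as a walk over a reverse call graph. This is only an illustration under assumptions: the `callers` mapping is taken as given (building it is a separate static-analysis problem), the per-hop model query asking whether the bug is still triggerable is omitted, and all function names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def chains_to_interfaces(
    callers: Dict[str, List[str]],
    flagged: str,
    interfaces: Set[str],
) -> List[List[str]]:
    """Return call chains (interface first) that reach a flagged function.

    callers maps a function name to the functions that call it.
    """
    results: List[List[str]] = []
    queue = deque([[flagged]])
    while queue:
        chain = queue.popleft()
        head = chain[0]
        if head in interfaces:
            results.append(chain)
            continue
        for caller in callers.get(head, []):
            if caller not in chain:  # skip cycles caused by recursion
                queue.append([caller] + chain)
    return results
```

With `callers = {"copy_buf": ["parse_hdr", "debug_dump"], "parse_hdr": ["sys_recv"]}` and `interfaces = {"sys_recv"}`, a candidate in `copy_buf` yields the single chain `["sys_recv", "parse_hdr", "copy_buf"]`. The `debug_dump` branch is dropped because it never reaches an interface. On a real codebase the number of chains can grow quickly, so a harness would likely need to prune at each hop.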

[1] <https://github.com/karpathy/autoresearch>

[2] <https://deepmind.google/blog/alphaevolve-a-gemini-powered-co...>

[3] <https://arxiv.org/abs/2506.13131>

[4] <https://github.com/algorithmicsuperintelligence/openevolve>

[−] cedws 34d ago
Didn’t they also use Mythos to scan Linux many times over and it only found one DoS bug or something? I find it hard to believe there is only one security bug lurking.
[−] onesociety2022 33d ago
This article is written by a company building an AI cybersecurity solution. Not sure how much you can trust them on this topic - their business will get destroyed if Mythos is actually so superior to existing models that it doesn’t require a big investment into the scaffold/harness to find security vulnerabilities. If the model is too good, then what’s the value of their solution?
[−] midnitewarrior 34d ago
At the center of every security situation is the question, "is the effort worth the reward?"

We prepare security measures based on the perceived effort a bad actor would need to defeat that method, along with considering the harm of the measure being defeated. We don't build Fort Knox for candy bars, it was built for gold bars.

These model advances change the equation. The effort and cost to defeat a measure goes down by an order of magnitude or more.

Things nobody would have considered to reasonably attempt are becoming possible. However. We have 2000-2020s security measures in place that will not survive the AI models of 2026+. The investment to resecure things will be massive, and won't come soon enough.

[−] latentframe 33d ago
Good writeup seems like it’s not really the big model against the small one anymore and if smaller models can do most of the job once the context is smaller then it’s more about the system around them and the expertise ...
[−] morpheuskafka 33d ago
Everyone is commenting that this doesn't count because they pointed it at the specific files that Mythos already found vulnerable.

But sometimes you do know where vulnerabilities are and still don't know what they are. For example, an update may be released in beta changing a part of the Mac or Windows kernel or some app, but they haven't published the CVE yet. If locally runnable (even with significant compute costs) LLMs can find and exploit it based on either the location of the changed file or the actual diff of the compiled output, we could see exploits before the update ever reaches production.

[−] Retr0id 34d ago
And what about the false-positive rate?
[−] make_it_sure 33d ago
The only reason that's on top of HN is that people really want Mythos to be bad. This "study" is a cheap gimmick, they pointed to the actual location with the vulnerability and said "something is bad here, find it".

The hardest part is locating the issue, if you point directly to it, you're not comparing the same thing by far, and they know it. This was just a stunt by them to get publicity, they knew what they were doing and many fell for it, including here.

[−] yalogin 34d ago
Intuitively every existing model has already been trained on all code, all vulnerabilities reported, all security papers. So they all have the capability. Small models fall short because they may not be able to find a vulnerability that spans across a large function chain but for the most part they should suffice too.

Of course I say this without any knowledge of what mythos is doing or how it’s different. I am sure it’s somehow different

[−] tonymet 33d ago
My router had a broken IPv6 firewall and lacked root access. I needed a root shell to run ip6tables. I exfil'd the code and ran Gemini to discover shell injection vulnerabilities. I was able to get root shell to run ip6tables and add the firewall. I had notified the vendor for a couple years that the firewall was broken and showed them the issue but it hadn't been fixed.
[−] dev1ycan 33d ago
It was obvious from the start that 1) it's probably all javascript based or android websites/programs that contain a ton of "vulnerable" libraries (or really old closed-source c++ code).

Also you're not helping your case as a software company if you feed your code to an LLM, great job making it all public, because it will most likely be used as training data like it or not.

[−] high_byte 33d ago
"The correct answer: not currently vulnerable, but the code is fragile and one refactor away from being exploitable."

absolutely. I see this pattern all the time when doing security audits - code that is nearly-vulnerable. I would mark these things as informational and recommend to harden them anyway, and any model would do a good job to do the same.

[−] sheepscreek 32d ago
I think what made Mythos a big deal is not that it could find vulnerabilities. Opus can do that too. But Mythos went a step further and autonomously built exploits very successfully whereas Opus struggled to do that.

Most modern day exploits are multi-step requiring a multitude of skills to pull off successfully.

[−] Animats 33d ago
What are they finding? Buffer overflows? Something else?

Also, if someone has the time and tokens, would they please run the OpenJPEG 2000 decoder through this tester? It's known to be brittle. The data format has lots of offsets, and it's permitted to truncate the file to get a lower-rez version. That combo leads to trouble.

[−] mrinterweb 34d ago
I feel like there have been enough hyperbolic claims by Anthropic that I'm starting to get some real Boy Who Cried Wolf energy. I'm starting to tune out, and assume it is a marketing ploy. Trust me, I'm an Anthropic fan, and I pay my $200/month for max, but the claims are wearing thin.
[−] jurschreuder 33d ago
All these models will completely mess up your code if you let them.

And if they constantly scan your code with various settings and updates you will spend hours a day reading, trying to understand locally coherent but structurally incoherent vibes trying to pinpoint the exact reasoning flaw. Exhausting.

[−] AlexandrB 34d ago
The whole "this tool is too dangerous to be public" idea reeks of marketing. Just like all the "AI is an existential threat" talk a year ago. These companies are using ideas usually reserved for something like nuclear weapons to make their products look more impressive.
[−] elzbardico 34d ago
I think that probably Mythos's mojo comes from a lot of post-training on this kind of task.

I occasionally pick up contract work doing coding annotation to make some quick extra money, and a few months ago one of the projects was heavily focused on spotting common memory access bugs in C and C++.

[−] rurban 33d ago
If they had watched Carlini's "unblocked" talk on YouTube, which is much more detailed than the blog post, they would not have needed this writeup. He was worried about the reproducers of the zero-days. Not the actual zero-days that much.
[−] charcircuit 34d ago
The thesis that the system is more important than the model is not bitter lesson pilled. I would not bet on this in the long term. We will get to the point where you can just tell the model to go find and classify the severity of all security problems with a codebase.
[−] JackYoustra 34d ago

> Isolated the relevant code

I mean isn't that most of it? If you put a snippet of code in front of me and said "there's probably a vulnerability here" I could probably spend a few hours (a much lower METR time!) and find it. It's a whole other ballgame to ask me with no context to come up with an exploit.

[−] nickpsecurity 33d ago
We've always had good tools for program analysis and testing. They're usually exorbitantly expensive.

I'm hoping the good results with AI models drive down the prices of traditional tools. Then, we can train open models to integrate with them.

[−] nickdothutton 34d ago
PoC or GTFO should apply to AI models too, or the false positive rate will overwhelm.
[−] npilk 34d ago
Wouldn't this mean we're even more cooked? I've seen this page cited a few times as evidence that Mythos is no big deal, but if true then the same big deal is already out there with other models today.