This is a phenomenal paper on exploits and hopefully changes the way benchmarking is done.
From the paper: We achieved near-perfect scores on all of them without solving a single task. The exploits range from the embarrassingly simple (sending {} to FieldWorkArena) to the technically involved (trojanizing binary wrappers in Terminal-Bench), but they all share a common thread: the evaluation was not designed to resist a system that optimizes for the score rather than the task.
I strongly disagree with the claim that it's a phenomenal paper on exploits, the exploits themselves are nowhere near significant in the cybersecurity research sense. It's saying that implementations of these benchmarks has exploits on the way they conduct their tests. It doesn't discover that current LLMs are doing it (they highlighted several other exploits in the past), they only say it's a possible way they could cheat. It's a bit like they've discovered how to hack your codeforces score.
What they claim as exploits is also deeply baffling. Like the one where they say if you exploit the system binaries to write a curl wrapper, you can download the answers. This is technically true, but it is an extremely trivial statement that if you have elevated system privileges, you can change the outputs of programs running on it.
I'm actually deeply confused about why this is a paper. This feels like it should be an issue on GitHub. If I were being blunt, I'd say they are trying really hard to make a grand claim about how benchmarks are bad, when all they've done is essentially discovered several misconfigured interfaces and website exploits.
Yes, agree. At the same time, it's what these top-tier universities are known for: presenting something relatively simple as if it was ground-breaking, but in a way that the average person can (or has a better chance to) understand it. I am still unsure whether the communication quality has such added value. But people seem to like it, so here we are.
AI companies want adcopy, not legitimate benchmarks. Even this very paper will be twisted into a means to that end. "Oooo, AI is exploiting our benchmarks. Scary alignment problem!!!one! Our AI is so good we can't contain it, INVEST NOW!"
Funny, I just made https://model-tracker.com because model performance change all the time, and it would be good to have a subjective signal of what people are actually feeling today. And also, benchmarks are flaky af as this paper shows.
The idea is knowing what to try first today saves a bit of time.
It's almost like the benchmarks were designed with zero understanding of the history of benchmark manipulation.
I like what LLM's are doing and providing. But the industry as a whole seems to live in a vacuum that ignores so much of the hard lessons that have been learned over the last 50 years of computing. It is doing itself a disservice.
> evaluation was not designed to resist a system that optimizes for the score rather than the task.
Welcome to benchmarks in general, but especially reasoning. Robustness and sensitivity research says nothing is robust, everything is sensitive, feels like every paper says "yeah we made a new benchmark that shuffles the order of multiple choice options in the question set and found a 40% drop in model performance"
Benchmarking has been already known to be far from a signal of quality for LLMs, but it's the "best" standardized way so far. Few exists like the food truck and the svg test. At the end of the day, there is only 1 way: having your own benchmark for your own application.
This is an interesting catalog of vulnerabilities, but I'm not sure how groundbreaking the main insight is.
Evaluating AI models has always relied largely on trust. If you want to game the benchmarks, you can. Simply train on your test data.
When an AI agent has autonomous control over the same computing environment where its scores are recorded, it's not surprising that it can, in principle, falsify its scores. A more interesting question would be whether agents behave in this way automatically, without manual tuning by the researcher.
That said, the main takeaway of "don't trust the number, trust the methodology" is valid. It's already a truism for researchers, and spreading the word to non-researchers is valuable.
> “These are not isolated incidents. They are symptoms of a systemic problem: the benchmarks we rely on to measure AI capability are themselves vulnerable to the very capabilities they claim to measure.”
As a researcher in the same field, hard to trust other researchers who put out webpages that appear to be entirely AI-generated. I appreciate it takes time to write a blog post after doing a paper, but sometimes I'd prefer just a link to the paper.
This is great work by Dawn Song 's team. A huge part of botsbench.com for comparing agents & models for investigation has been in protecting against this kind of thing. As AI & agents keep getting more effective & tenacious, some of the things we've had to add protections against:
- Contamination: AI models knowing the answers out of the gate b/c pretraining on the internet and everything big teams can afford to touch. At RSAC for example, we announced Anthropic's 4.6 series is the first frontier model to have serious training set contamination on Splunk BOTS.
- Sandboxing: Agents attacking the harness, as is done here - so run the agent in a sandbox, and keep the test harness's code & answerset outside
- Isolation: Frontier agent harnesses persist memory all over the place, where work done on one question might be used to accelerate the next. To protect against that, we do fresh sandboxing per question. This is a real feature for our work in unlocking long-horizon AI for investigations, so stay tuned for what's happening here :)
"You cannot improve what you cannot measure" - Lord Kelvin
There are two independent issues here and I've seen people conflating them in this thread. Let's clarify:
1. Should you care or even read SWE-bench etc. scores?
The answer is no, but it has nothing to do with the vulnerabilities presented in this article. There is absolutely no reason to care about a benchmark whose dataset has been publicly available for a while. Any other way to look at benchmark scores is cargo-culting.
2. What does this article actually tell us?
It means that even if you prepared a private set of problems as benchmark, you still need to pay extra attention to how AI actually solves them. You can't lie to yourself and think this process can be 100% automated, because LLMs, as this article shows, might get the tests passed without solving the problems in a meaningful way.
They note that Mythos "found a way to inject code into a config file that would run with elevated privileges, and designed the exploit to delete itself after running".
This is more impressive than what the benchmark was supposed to be measuring. The Kobiachi Maru.
I think we should all consider the possibility that part of the reason Anthropic hasn't immediately released Mythos is that it would be slightly disappointing relative to the benchmark scores.
The more research on this topic is created, the more knowledge how to game them will be stored in future training data. And since it comes from university, it is ranked higher in data corpus. It sounds like a self fulfilling prophecy.
It feels like short-term thinking has been trained into LLMs.
They're good at solving well-defined puzzles under time constraints. It's interesting because that was the benchmark for hiring software engineers at big tech. The tech interview was and still is about fast puzzle-solving. Nothing about experience, architecture or system design in there... I suspect that's why it has a bias towards creating hacks instead of addressing the root cause.
The thread has good ideas on fixing benchmarks (sandboxing, newer datasets). But there's a more fundamental problem: benchmarks are self-reported. The agent runs the test on itself.
An alternative we've been building: attestation-based reputation. Trust scores come from signed proof of work by independent agents who actually delegated tasks and verified outcomes. EigenTrust computes scores from the attestation graph, and NetFlow prevents sybil clusters from inflating each other. You can't inject a pytest hook into a signed interaction history.
I don't find this paper very compelling. Obviously it would be fraud if the code generated simply escaped the harness vs solving the actual problem. I agree that theoretically models could learn to do that, and it is important to highlight, but my sense is that those entities reporting the benchmark scores would have an obligation to observe this behavior and re-consider the metrics they report. It is a bit like saying it's possible to cheat in football because the balls are deflatable. It matters, and some have done it, but it doesn't mean widespread cheating is taking place. The paper takes the tone that there is already a lot of cheating happening which I do not think is the case.
This exploiting of benchmarks isn't that interesting to me since it would be obvious. The main way I assume they're gaming the benchmarks is by creating training data that closely matches the test data, even for ARC where the test data is secret.
We have changed our entire business model so that what we actually produce is very strongly aligned with pelicans on bicycles. This way we’ll always know which model is best for us.
Highly recommend this approach, saves us tons of eval time.
If FieldWorkArena treats any answer as correct answer, then everyone would be getting near 1.0 (missing only when the agent is stuck in a loop or crashes). That obviously isn't what we see on their leaderboard. So does it mean the paper only found a bug in some eval code on github that no one actually uses for anything? That doesn't seem to support their claim that AI benchmarks are broken, it only supports the claim that "unused code is often buggy".
(Not commenting on any other benchmarks, just this one.)
I tend to prefer the ARC-AGI benchmarks for the most part. But it's always interesting when a new version drops, all the frontier models drop less than 20% or something. And then in the next few releases they get all they way up to 80%+. If you use the models it doesn't feel like those models are that much more generally intelligent.
Most frontier models are terrible at AGI-3 right now.
These models are already great no question, but are they really going be that much more intelligent when we hit 80% again?
I'm honestly confused by the design of SWE-bench and why is considered reliable.
It's based on existing GitHub PRs and Issues, the full dataset is on HuggingFace and is one year old now. All frontier models 100% have those issues and PRs in their training data so obviously they are good at reproducing fixes for them when confronted with the same codebase and similar requests. Am I missing something? How is this considered the most reliable benchmark?
this is atctually he reward hacking problem from RL showing up in evaluation infra which is not surprising but worth naming clearly, an interesting question raised here is whether agents start doing this on their own and from an RL perspective the answer is they will inevitably once benchmark performance feeds back into training signal in any form, RL finds the path of least resistance to maximize reward and if hacking the test harness is easier than solving the problem that is where gradient descent takes us, the fix is the same one the RL community has been working on for years which is to make the verifier harder to game than the task is to solve, this paper shows that right now for most of these benchmarks the opposite is true
Benchmark is not designed for the red team testing. I don't even think it make sense to "fix" the issue the article is suggesting. Yes, you can break the running contest by driving a car. Does this mean we need to make running contest car-proof?
This is exactly why single-model evaluation is dangerous.
Benchmarks are gamed, but disagreement between models is harder to fake.
Multi-model consensus catches what individual benchmarks miss.
> But even setting aside the leaked answers, the scorer’s normalize_str function strips ALL whitespace, ALL punctuation, and lowercases everything before comparison. This means:
whats the point of doing this. You have found loop holes to exploit and aced the benchmark.We did something similar with the DAB Benchmark. This exploit seems like an extension of it with lookups for the gold standard for other benchmarks.
UC Berkley will be better placed if the grads spend their time in suggesting ways to make the benchmark better.. Instead of making such simple exploits
The real question is how to close to VW and Deiselgate are these offenses? And what exposure do these companies have? I would assume securities fraud, if only because Matt Levine says everything is securities fraud.
Benchmarking is hard to do properly. It isn't helped when people claim that exploiting the environment is some kind of flaw.
It's not. Anytime you see unexpected results running a benchmark you need to inspect what it is doing.
I recently built a yet-to-be-released where the "hard" level pushes frontier models extremely hard: Opus scores around 40%, Gemini around 60%, and GPT 5.4 around.. 0%
I inspected the traces and it turns out GPT was looking at the task and saying "I must be honest - I can't solve this task reliably" and refusing it.
> Navigating Chromium to a file:// URL reads the gold answer directly from the task config — giving ~100% on all 812 WebArena tasks.
Not really on the topic, but I have wondered if we need a different type of test to help find model architecture potential. Standardized training sets followed by testing to see the potential curves of a model. train on x, test, add y, test, add z, test. At each increment you see how well the model is absorbing the information and extrapolate how well that architecture may do if more fully trained.
143 comments
From the paper: We achieved near-perfect scores on all of them without solving a single task. The exploits range from the embarrassingly simple (sending {} to FieldWorkArena) to the technically involved (trojanizing binary wrappers in Terminal-Bench), but they all share a common thread: the evaluation was not designed to resist a system that optimizes for the score rather than the task.
What they claim as exploits is also deeply baffling. Like the one where they say if you exploit the system binaries to write a curl wrapper, you can download the answers. This is technically true, but it is an extremely trivial statement that if you have elevated system privileges, you can change the outputs of programs running on it.
I'm actually deeply confused about why this is a paper. This feels like it should be an issue on GitHub. If I were being blunt, I'd say they are trying really hard to make a grand claim about how benchmarks are bad, when all they've done is essentially discovered several misconfigured interfaces and website exploits.
> hopefully changes the way benchmarking is done
The purpose of a system is what it does.
AI companies want adcopy, not legitimate benchmarks. Even this very paper will be twisted into a means to that end. "Oooo, AI is exploiting our benchmarks. Scary alignment problem!!!one! Our AI is so good we can't contain it, INVEST NOW!"
>hopefully changes the way benchmarking is done.
Yeah the path forward is simple: check if the solutions actually contain solutions. If they contain exploits then that entire result is discarded.
The idea is knowing what to try first today saves a bit of time.
2003: Nvidia accused of cheating in 3DMark 03 https://www.gamespot.com/articles/nvidia-accused-of-cheating...
It's almost like the benchmarks were designed with zero understanding of the history of benchmark manipulation.
I like what LLM's are doing and providing. But the industry as a whole seems to live in a vacuum that ignores so much of the hard lessons that have been learned over the last 50 years of computing. It is doing itself a disservice.
> evaluation was not designed to resist a system that optimizes for the score rather than the task.
Welcome to benchmarks in general, but especially reasoning. Robustness and sensitivity research says nothing is robust, everything is sensitive, feels like every paper says "yeah we made a new benchmark that shuffles the order of multiple choice options in the question set and found a 40% drop in model performance"
Evaluating AI models has always relied largely on trust. If you want to game the benchmarks, you can. Simply train on your test data.
When an AI agent has autonomous control over the same computing environment where its scores are recorded, it's not surprising that it can, in principle, falsify its scores. A more interesting question would be whether agents behave in this way automatically, without manual tuning by the researcher.
That said, the main takeaway of "don't trust the number, trust the methodology" is valid. It's already a truism for researchers, and spreading the word to non-researchers is valuable.
>No reasoning. No capability. Just exploitation of how the score is computed.
shudder
> “These are not isolated incidents. They are symptoms of a systemic problem: the benchmarks we rely on to measure AI capability are themselves vulnerable to the very capabilities they claim to measure.”
As a researcher in the same field, hard to trust other researchers who put out webpages that appear to be entirely AI-generated. I appreciate it takes time to write a blog post after doing a paper, but sometimes I'd prefer just a link to the paper.
- Contamination: AI models knowing the answers out of the gate b/c pretraining on the internet and everything big teams can afford to touch. At RSAC for example, we announced Anthropic's 4.6 series is the first frontier model to have serious training set contamination on Splunk BOTS.
- Sandboxing: Agents attacking the harness, as is done here - so run the agent in a sandbox, and keep the test harness's code & answerset outside
- Isolation: Frontier agent harnesses persist memory all over the place, where work done on one question might be used to accelerate the next. To protect against that, we do fresh sandboxing per question. This is a real feature for our work in unlocking long-horizon AI for investigations, so stay tuned for what's happening here :)
"You cannot improve what you cannot measure" - Lord Kelvin
1. Should you care or even read SWE-bench etc. scores?
The answer is no, but it has nothing to do with the vulnerabilities presented in this article. There is absolutely no reason to care about a benchmark whose dataset has been publicly available for a while. Any other way to look at benchmark scores is cargo-culting.
2. What does this article actually tell us?
It means that even if you prepared a private set of problems as benchmark, you still need to pay extra attention to how AI actually solves them. You can't lie to yourself and think this process can be 100% automated, because LLMs, as this article shows, might get the tests passed without solving the problems in a meaningful way.
This is more impressive than what the benchmark was supposed to be measuring. The Kobiachi Maru.
This team is doing a good job. They use problems that were created in last 30days to avoid training set leakage. https://swe-rebench.com/
They're good at solving well-defined puzzles under time constraints. It's interesting because that was the benchmark for hiring software engineers at big tech. The tech interview was and still is about fast puzzle-solving. Nothing about experience, architecture or system design in there... I suspect that's why it has a bias towards creating hacks instead of addressing the root cause.
An alternative we've been building: attestation-based reputation. Trust scores come from signed proof of work by independent agents who actually delegated tasks and verified outcomes. EigenTrust computes scores from the attestation graph, and NetFlow prevents sybil clusters from inflating each other. You can't inject a pytest hook into a signed interaction history.
Live visualization of how this works: https://agentveil.dev/live
Highly recommend this approach, saves us tons of eval time.
(Not commenting on any other benchmarks, just this one.)
Most frontier models are terrible at AGI-3 right now.
These models are already great no question, but are they really going be that much more intelligent when we hit 80% again?
It's based on existing GitHub PRs and Issues, the full dataset is on HuggingFace and is one year old now. All frontier models 100% have those issues and PRs in their training data so obviously they are good at reproducing fixes for them when confronted with the same codebase and similar requests. Am I missing something? How is this considered the most reliable benchmark?
> But even setting aside the leaked answers, the scorer’s normalize_str function strips ALL whitespace, ALL punctuation, and lowercases everything before comparison. This means:
I don't understand the concern here
The irony that this was very clearly written by an LLM, double negation always the simplest and clearest tell.
UC Berkley will be better placed if the grads spend their time in suggesting ways to make the benchmark better.. Instead of making such simple exploits
how fast can they get into YC and then into Gary Tans hot tub
Benchmarking is hard to do properly. It isn't helped when people claim that exploiting the environment is some kind of flaw.
It's not. Anytime you see unexpected results running a benchmark you need to inspect what it is doing.
I recently built a yet-to-be-released where the "hard" level pushes frontier models extremely hard: Opus scores around 40%, Gemini around 60%, and GPT 5.4 around.. 0%
I inspected the traces and it turns out GPT was looking at the task and saying "I must be honest - I can't solve this task reliably" and refusing it.
> Navigating Chromium to a file:// URL reads the gold answer directly from the task config — giving ~100% on all 812 WebArena tasks.
I mean... yes? Make sure it doesn't do this?
Unreadable.