Claude Opus 4.7 Model Card

[−] bachittle 29d ago

So Opus 4.7 is measurably worse at long-context retrieval compared to Opus 4.6. Opus 4.6 scores 91.9% and Opus 4.7 scores 59.2%. At least they're transparent about the model degradation. They traded long-context retrieval for better software engineering and math scores.

[−] film42 29d ago

To be honest, I think it's just a more honest score of what Opus 4.6 actually was. Once contexts get sufficiently large, Opus develops pretty bad short-term memory loss.

[−] tomaskafka 28d ago

You can support very long context windows if you don’t mind abysmal recall rate.

[−] enraged_camel 28d ago

No: https://x.com/bcherny/status/2044821690920980626

[−] freedomben 29d ago

Agreed, I appreciate the transparency (and Anthropic isn't normally very transparent). It's also great to know because I will change how I approach long contexts knowing it struggles more with them.

[−] RobinL 29d ago

Could this be because they've found the 1M context uneconomical (i.e. it costs too much to serve, or burns through users' quota too quickly, causing complaints), and so they're no longer targeting it as a goal?

[−] Someone1234 28d ago

Opus 4.7 is also worse at 256K context. Go look at page 195 and page 196. It is an across-the-board regression, not just at 1M context.

[−] RobinL 28d ago

Thanks, interesting. Does this make it more surprising that the other benchmarks have improved? I'm not sure I understand the benchmarks well enough, but I'm wondering whether with agentic workflows it's possible to get away with a smaller, more focussed context (and hence lower cost) whilst achieving the same or better performance, because of agentic models' ability to decide what to put in context as they work.

[−] timvb 28d ago

What does all this mean in real-world use?

[−] teaearlgraycold 28d ago

A year ago it felt like SoTA model developers were not improving so much as moving the dirt around. Maybe we’re in another such rut.

[−] msla 27d ago

Also, just to be clear: This links to a PDF, for some reason.

[−] jzig 29d ago

At what point along the 1M window does context become "long" enough that this degradation occurs?

[−] daemonologist 28d ago

The benchmark GP mentioned is measuring at 128k-256k context (there's another at 524k-1024k, where 4.6 scored 78.3% and 4.7 scored 32.2%).

The longer the context the worse the performance; there isn't really a qualitative step change in capability (if there is imo it happens at like 8k-16k tokens, much sooner than is relevant for multi-turn coding tasks - see e.g. this old benchmark https://github.com/adobe-research/NoLiMa ).
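
As a rough illustration of how large these gaps are, here is a minimal sketch that uses only the four scores quoted in this thread (they are not independently verified, and the bucket labels are just the context ranges mentioned above). The helper name `drop` is made up for this example.

```python
# Scores in percent, as quoted by commenters in this thread.
# Each entry is (Opus 4.6, Opus 4.7) for one context-length bucket.
scores = {
    "128k-256k": (91.9, 59.2),
    "524k-1024k": (78.3, 32.2),
}


def drop(old: float, new: float) -> tuple[float, float]:
    """Return (percentage-point drop, relative drop in percent)."""
    points = old - new
    relative = 100.0 * points / old
    return points, relative


for bucket, (old, new) in scores.items():
    points, relative = drop(old, new)
    print(f"{bucket}: -{points:.1f} points ({relative:.1f}% relative)")
```

On these numbers the drop is about 33 points at 128k-256k and about 46 points at 524k-1024k, so the longer bucket loses more in both absolute and relative terms.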

[−] the13 28d ago

Be brief. No one wants AI boyfriend users who drone on & on about their day.

[−] vessenes 28d ago

This is an interesting document, in that it reads like a Claude Mythos model card that was hastily edited to be an Opus 4.7 model card.

I surmise that someone at the top put the Mythos release on hold, and the product team was told "ship this other interim step model instead. quickly."

I wonder if 4.7 will be seen as a net step-up in quality; there are some regressions noted in the document, and it's clearly substantially worse than Mythos, at least according to its own model card. Should be an interesting few months -- if I were at oAI I'd be rushing to get something out that's clearly better, and pressing for weakness here.

[−] the13 28d ago

What makes you think that? "it reads like a Claude Mythos model card that was hastily edited to be an Opus 4.7 model card"

[−] vessenes 28d ago

There are more mentions of Mythos than 4.6. Mythos results are nearly everywhere, and vastly exceed 4.7's capacity in almost every case. There are sections that report only research on Mythos, none on 4.7. E.g. user surveys about how beneficial Mythos is internally at Anthropic.

[−] barneybooroo 28d ago

Yeah, the section expanding on how they evaluated Mythos internally is a bit baffling considering how irrelevant it is.

[−] koehr 29d ago

This reads more like an advertisement for Mythos, at first glance.

[−] kube-system 28d ago

> Chemical and biological weapons threat model 2 (CB-2): Novel chemical/biological weapons production capabilities. A model has CB-2 capabilities if it has the ability to significantly help threat actors (for example, moderately resourced expert-backed teams) create/obtain and deploy chemical and/or biological weapons with potential for catastrophic damages far beyond those of past catastrophes such as COVID-19.

That's an interesting choice of benchmark for measuring the risk of "Chemical and biological weapons"

[−] Symmetry 29d ago

> The technical error that caused accidental chain-of-thought supervision in some prior models (including Mythos Preview) was also present during the training of Claude Opus 4.7, affecting 7.8% of episodes.

>_>

[−] 100ms 29d ago

    $ pbpaste | wc -w 
    62508
    $ pbpaste | grep -oi mythos|wc -w
    331
    $ pbpaste | grep -oi opus|wc -w
    809

[−] aliljet 29d ago

Have they effectively communicated what a 20x or 10x Claude subscription actually means? And with Claude 4.7 increasing usage by 1.35x, does that mean a 20x plan is now really about a 15x plan (20 / 1.35 ≈ 14.8, no token increase on the subscription) or a 27x plan (more tokens given to compensate for more compute cost) relative to Claude Opus 4.6?
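
The two scenarios in that question can be worked through with a small sketch. It assumes the quoted 1.35x figure applies uniformly to every task, which may not hold in practice, and the function names are hypothetical, not anything from Anthropic's documentation.

```python
PLAN_MULTIPLIER = 20.0   # the "20x" plan from the question
USAGE_FACTOR = 1.35      # token usage relative to Opus 4.6, as quoted; assumed uniform


def effective_work(plan: float, usage_factor: float) -> float:
    """Token budget unchanged: how much 4.6-equivalent work the plan still buys."""
    return plan / usage_factor


def compensated_budget(plan: float, usage_factor: float) -> float:
    """Budget scaled up: token multiple needed to keep the same amount of work."""
    return plan * usage_factor


print(f"unchanged budget: {effective_work(PLAN_MULTIPLIER, USAGE_FACTOR):.1f}x of 4.6-equivalent work")
print(f"compensated budget: {compensated_budget(PLAN_MULTIPLIER, USAGE_FACTOR):.1f}x in 4.6 token terms")
```

Under that assumption, an unchanged token budget buys roughly 14.8x of Opus 4.6 equivalent work, while keeping the full 20x of work would take a 27x token budget.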

[−] joeumn 29d ago

I'm actually surprised at how it performed compared to 4.6 and also compared to mythos. Will be fun to use.

[−] msla 28d ago

PDF, because it isn't marked.

[−] bicepjai 29d ago

This card is a 272 page report. So now we are redefining names :)

[−] nullc 28d ago

The model card doesn't mention if this revision will continue to make up and fan vicious conspiracy theories like the prior one does.

I've been getting a small but steady stream of harassment from mentally ill people who get spun up on crazy conspiracy theories, and Claude is all too willing to tell them they are ABSOLUTELY RIGHT, encourage them to TAKE ACTION, and tell them that people who disagree are IN ON IT.

The other major LLM services will deflect to be less crazy or shut down the conversation entirely, but it seems Claude doesn't. Anthropic is probably the worst about prattling on about safety, but it seems like their concern is mostly centered on insane movie-plot threats and less on things with more potential for real harm.

I've complained to anthropic with no response.

[−] STRiDEX 29d ago

Dumb question but why are chemical weapons always addressed as a risk with llms? Is the idea that they contain how to make chemical weapons or that they would guide someone on how?

Would there not already be websites that contain that information? How is an llm different, i guess, from some sort of anarchist cookbook thing.

[−] jmward01 29d ago

Haiku not getting an update is becoming telling. I suspect we are reaching a point where the low end models are cannibalizing high end and that isn't going to stop. How will these companies make money in a few years when even the smallest models are amazing?

[−] il-b 29d ago

Ironically, the website is down

[−] NickNaraghi 29d ago

232 pages is bullshit. Longer than the Mythos system card? What are you hiding?

Claude Opus 4.7 Model Card (anthropic.com)
