Claude Opus 4.7 Model Card (anthropic.com)

by adocomplete 84 comments 177 points


[−] bachittle 28d ago
So Opus 4.7 is measurably worse at long-context retrieval compared to Opus 4.6. Opus 4.6 scores 91.9% and Opus 4.7 scores 59.2%. At least they're transparent about the model degradation. They traded long-context retrieval for better software engineering and math scores.
[−] film42 28d ago
To be honest, I think it's just a more honest score of what Opus 4.6 actually was. Once contexts get sufficiently large, Opus develops pretty bad short term memory loss.
[−] tomaskafka 28d ago
You can support very long context windows if you don’t mind abysmal recall rate.
[−] freedomben 28d ago
Agreed, I appreciate the transparency (and Anthropic isn't normally very transparent). It's also great to know because I will change how I approach long contexts knowing it struggles more with them.
[−] teaearlgraycold 28d ago
A year ago it felt like SoTA model developers were not improving so much as moving the dirt around. Maybe we’re in another such rut.
[−] msla 27d ago
Also, just to be clear: This links to a PDF, for some reason.
[−] jzig 28d ago
At what point along the 1M window does context become "long" enough that this degradation occurs?
[−] the13 28d ago
Be brief. No one wants AI boyfriend users who drone on & on about their day.
[−] vessenes 28d ago
This is an interesting document, in that it reads like a Claude Mythos model card that was hastily edited to be an Opus 4.7 model card.

I surmise that someone at the top put the Mythos release on hold, and the product team was told "ship this other interim step model instead. quickly."

I wonder if 4.7 will be seen as a net step-up in quality; there are some regressions noted in the document, and it's clearly substantially worse than Mythos, at least according to its own model card. Should be an interesting few months -- if I were at oAI I'd be rushing to get something out that's clearly better, and probing for weaknesses here.

[−] koehr 29d ago
This reads more like an advertisement for Mythos, at first glance
[−] kube-system 28d ago

> Chemical and biological weapons threat model 2 (CB-2): Novel chemical/biological weapons production capabilities. A model has CB-2 capabilities if it has the ability to significantly help threat actors (for example, moderately resourced expert-backed teams) create/obtain and deploy chemical and/or biological weapons with potential for catastrophic damages far beyond those of past catastrophes such as COVID-19.

That's an interesting choice of benchmark for measuring the risk of "Chemical and biological weapons"

[−] Symmetry 28d ago

> The technical error that caused accidental chain-of-thought supervision in some prior models (including Mythos Preview) was also present during the training of Claude Opus 4.7, affecting 7.8% of episodes.

>_>

[−] 100ms 29d ago

    $ pbpaste | wc -w
    62508
    $ pbpaste | grep -oi mythos | wc -w
    331
    $ pbpaste | grep -oi opus | wc -w
    809
[−] aliljet 29d ago
Have they effectively communicated what a 20x or 10x Claude subscription actually means? And with Claude 4.7 increasing usage by 1.35x, does that mean a 20x plan is now really a ~15x plan (no token increase on the subscription) or a 27x plan (more tokens granted to compensate for the higher compute cost) relative to Claude Opus 4.6?
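The two readings above are just arithmetic on the parent's numbers; a minimal sketch, assuming "1.35x" means each equivalent task consumes 1.35x the metered tokens (how Anthropic actually meters plans is not stated here):

```python
# Two readings of "Opus 4.7 uses 1.35x the usage of 4.6" for a 20x plan.
# Both interpretations are assumptions; 1.35 is the figure from the comment.
plan_multiplier = 20.0
usage_factor = 1.35

# Reading 1: the token budget stays fixed, so effective capacity shrinks.
effective_fixed_budget = plan_multiplier / usage_factor   # ~14.8x

# Reading 2: the budget grows to absorb the extra compute cost.
effective_grown_budget = plan_multiplier * usage_factor   # 27x

print(round(effective_fixed_budget, 1))  # 14.8
print(round(effective_grown_budget, 1))  # 27.0
```

Note that under the fixed-budget reading the effective plan comes out closer to 15x than 13x.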
[−] joeumn 29d ago
I'm actually surprised at how it performed compared to 4.6 and also compared to mythos. Will be fun to use.
[−] msla 28d ago
Heads up: this is a PDF, since the link isn't marked as one.
[−] bicepjai 29d ago
This "card" is a 272-page report. So now we are redefining words :)
[−] nullc 28d ago
The model card doesn't mention whether this revision will continue to make up and fan vicious conspiracy theories like the prior one does.

I've been getting a small but steady stream of harassment from mentally ill people who get spun up on crazy conspiracy theories, and Claude is all too willing to tell them they are ABSOLUTELY RIGHT, encourage them to TAKE ACTION, and tell them that people who disagree are IN ON IT.

The other major LLM services will deflect to be less crazy or shut down the conversation entirely -- but it seems Claude doesn't. Anthropic is probably the worst about prattling on about safety, but their concern seems mostly centered on insane movie-plot threats and less on things with more potential for real harm.

I've complained to Anthropic with no response.

[−] STRiDEX 29d ago
Dumb question, but why are chemical weapons always addressed as a risk with LLMs? Is the idea that they contain instructions for making chemical weapons, or that they would guide someone through the process?

Wouldn't there already be websites that contain that information? How is an LLM different, I guess, from some sort of anarchist-cookbook thing?

[−] jmward01 29d ago
Haiku not getting an update is becoming telling. I suspect we are reaching a point where the low-end models are cannibalizing the high end, and that isn't going to stop. How will these companies make money in a few years when even the smallest models are amazing?
[−] il-b 28d ago
Ironically, the website is down
[−] NickNaraghi 28d ago
232 pages is bullshit. Longer than the Mythos system card? What are you hiding?