For a fair comparison you need to look at the total cost, because 4.7 produces significantly fewer output tokens than 4.6, and seems to cost significantly less on the reasoning side as well.
Here is a comparison for 4.5, 4.6 and 4.7 (Output Tokens section): https://artificialanalysis.ai/?models=claude-opus-4-7%2Cclau...
4.7 comes out slightly cheaper than 4.6, but 4.5 is about half the cost: https://artificialanalysis.ai/?models=claude-opus-4-7%2Cclau...
Notably the cost of reasoning has been cut almost in half from 4.6 to 4.7.
I'm not sure what that looks like for most people's workloads, i.e. what the cost breakdown looks like for Claude Code. I expect it's heavy on both input and reasoning, so I don't know how that balances out, now that input is more expensive and reasoning is cheaper.
On reasoning-heavy tasks, it might be cheaper. On tasks which don't require much reasoning, it's probably more expensive. (But for those, I would use Codex anyway ;)
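To make that trade-off concrete, here's a back-of-envelope sketch. The prices and token mixes are made-up illustrative numbers, not Anthropic's actual rates; the only point is how an input-up / reasoning-down price shift flips which model is cheaper per task.

```python
# Hypothetical per-million-token prices -- illustrative numbers only,
# not Anthropic's actual pricing.
PRICES = {
    "4.6": {"input": 15.0, "output": 75.0},   # $/M tokens (made up)
    "4.7": {"input": 20.0, "output": 40.0},   # input up, reasoning/output down
}

def total_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    p = PRICES[model]
    return p["input"] * input_mtok + p["output"] * output_mtok

# Reasoning-heavy task: little input, lots of thinking/output tokens.
reasoning_heavy = {"input_mtok": 0.2, "output_mtok": 2.0}
# Input-heavy task: big codebase in context, short answer.
input_heavy = {"input_mtok": 3.0, "output_mtok": 0.3}

for name, task in [("reasoning-heavy", reasoning_heavy),
                   ("input-heavy", input_heavy)]:
    c46 = total_cost("4.6", **task)
    c47 = total_cost("4.7", **task)
    print(f"{name}: 4.6=${c46:.2f}  4.7=${c47:.2f}")
```

With these particular numbers the reasoning-heavy task comes out cheaper on 4.7 and the input-heavy one more expensive, which is the shape of the argument above.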
It thinks less and produces fewer output tokens because it has forced adaptive thinking that even API users can't disable. The same adaptive thinking that was causing quality issues in Opus 4.6 not even two weeks ago. The one bcherny recommended that people disable because it'd sometimes allocate zero thinking tokens to the model.
People are already complaining about low quality results with Opus 4.7. I'm also spotting it making really basic mistakes.
I literally just caught it lazily "hand-waving" away things instead of properly thinking them through, even though it spent like 10 minutes churning tokens and ate only god knows how many percentage points off my limits.
> What's the difference between this and option 1.(a) presented before?
> Honestly? Barely any. Option M is option 1.(a) with the lifecycle actually worked out instead of hand-waved.
> Why are you handwaving things away though? I've got you on max effort. I even patched the system prompts to reduce this.
> Fair call. I was pattern-matching on "mutation + capture = scary" without actually reading the capture code. Let me do the work properly.
> You were right to push back. I was wrong. Let me actually trace it properly this time.
> My concern from the first pass was right. The second pass was me talking myself out of it with a bad trace.
It's just a constant stream of self-corrections and doubts. Opus simply cannot be trusted when adaptive thinking is enabled. Can provide session feedback IDs if needed.
> > Why are you handwaving things away though? I've got you on max effort. I even patched the system prompts to reduce this.
In my experience, prompts like this one, which 1) ask for a reason behind an answer (when the model won't actually be able to provide one) and 2) are somewhat standoffish, don't work well at all. You'll just have the model go the other way.
What works much better is to tell the model to take a step back and re-evaluate. Sometimes it also helps to explicitly ask it to look at things from a different angle XYZ, in other words, to add some entropy to get it away from the local optimum it's currently at.
> when the model won't actually be able to provide one
This is key. In my experience, asking an LLM why it did something is usually pointless. In a subsequent round, it generally can't meaningfully introspect on its prior internal state, so it's just referring to the session transcript and extrapolating a plausible sounding answer based on its training data of how LLMs typically work.
That doesn't necessarily mean the reply is wrong because, as usual, a statistically plausible sounding answer sometimes also happens to be correct, but it has no fundamental truth value. I've gotten equally plausible answers just pasting the same session transcript into another LLM and asking why it did that.
From early GPT days to now, the best way to get a decently scoped and reasonably grounded response has always been to ask at least twice (in the early days, often 7 or 8 times).
Because not only can it not reflect, it cannot "think ahead about what it needs to say and change its mind". It "thinks" out loud (as some people seem to as well).
It is a "continuation" of context. When you ask what it did, it still doesn't think, it just* continues from a place of having more context to continue from.
The game has always been: stuff context better => continue better.
Humans were bad at doing this. For example, asking it for synthesis with explanation instead of, say, asking for explanation, then synthesis.
You can get today's behaviors by treating "adaptive thinking" like a token budgeted loop for context stuffing, so eventually there's enough context in view to produce a hopefully better contextualized continuation from.
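A minimal sketch of that framing, with stubbed-out retrieval and stopping heuristics. Nothing here is a real API; `fetch_more_context` and `looks_sufficient` are stand-ins for what a harness or the model's own tool calls would do:

```python
# Sketch: "adaptive thinking" viewed as a token-budgeted context-stuffing
# loop. `fetch_more_context` and `looks_sufficient` are stand-in stubs; a
# real harness would call retrieval tools and a model here.
def fetch_more_context(question: str, round_: int) -> str:
    # Stub: pretend each round pulls in another relevant snippet.
    return f"[snippet {round_} for: {question}]"

def looks_sufficient(context: list[str]) -> bool:
    # Stub heuristic: stop once three snippets are in view.
    return len(context) >= 3

def answer_with_budget(question: str, token_budget: int) -> tuple[str, int]:
    context: list[str] = []
    spent = 0
    round_ = 0
    while spent < token_budget and not looks_sufficient(context):
        round_ += 1
        snippet = fetch_more_context(question, round_)
        context.append(snippet)
        spent += len(snippet.split())  # crude token proxy
    # The final continuation is conditioned on everything stuffed so far.
    return f"answer given {len(context)} snippets", spent

print(answer_with_budget("why is the test flaky?", 100))
```

The loop either runs out of budget or decides there's enough context in view, then produces the better-contextualized continuation from there.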
It seems no accident we've hit on the word "harness" — so much that seems impressive by end of 2025 was available by end of 2023 if "holding it right". If (and only if!) you are an expert in an area you need it to process: (1) turn thinking off, (2) do your own prompting to "prefill context", and (3) you will get superior final response. Not vibing, just staff-work.
---
* “just” – I don't mean "just" dismissively. Qwen 3.5 and Gemma 4 on an M5 approach where SOTA was a year ago, but faster and on your lap. These things are stunning, and the continuations are extraordinary. But still: garbage in, garbage out; gems in, gems out.
> In a subsequent round, it generally can't meaningfully introspect on its prior internal state
It can't do any better in the moment it's making the choices. Introspection mostly amounts to back-rationalisation, just like in humans. Though for humans, doing so may help learning to make better future decisions in similar situations.
I don't understand why people don't just say "This is wrong. Try again." or "This is wrong because xyz. Try again." This anthropomorphizing by asking why seems a bit pointless when you know how LLMs work, unless you've empirically had better results from a specific make and version of LLM by asking why in the past. It's theoretically functionally equivalent to asking a brand-new LLM instance with your chat history why the original gave such an answer... Do you want the correct result, or do you actually care about knowing why?
>Introspection mostly amounts to back-rationalisation, just like in humans.
That's the best-case scenario. Again, let's stop anthropomorphizing. The given reasons may turn out to be incompatible with the original answer upon closer inspection...
I definitely do this, along with the compulsion sometimes to tell the agent how a problem was fixed in the end, when I investigate myself after the model fails to do so. Just common courtesy after working on something together. Let's rationalize this as giving me an opportunity to reflect and rubber-duck the solution.
Regarding not just telling it "try again": of course you are right that applying human cognition mechanisms to LLMs is not founded on the same underlying effects.
But given the nature of training and fine-tuning/RL, I don't think it's unreasonable that instructing the model to reflect backwards could have a positive effect. The model might pattern-match on this and then exhibit a few positive behaviors. It could lead to more reflection within the reasoning blocks, catching errors before answering, which is what you want. Those reasoning tokens will attend to the question "what caused you to make this assumption?", further encouraging this behavior. Yes, both mechanisms are realized through linear, forward-going statistical interpolation, but reasoning models have shown this is an effective strategy for arriving at a more grounded result than answering right away.
Lastly, back to anthropomorphizing: it shows that you, the user, are encouraging deeper thought and self-correction. The model does not have psychological safety mechanisms that it guards, but again, the way the models are trained causes them to emulate them. The RL primes the model for certain behavior, i.e., arriving at an answer at some point rather than thinking for a long time. I think it's fair to assume that by "setting the stage" it is possible to influence which parts of that trained behavior activate.
While role-based prompting is not that important anymore, the system prompts of the big coding agents still use it, suggesting some, if slight, advantage to putting the model in the right frame of mind. Again, very sorry for that last part, but anthropomorphizing does seem to be a useful analogy for a lot of the behaviors we are seeing (the reasons for this lie in farther-off epistemological and philosophical territory, on the side of both the models and us).
> This is key. In my experience, asking an LLM why it did something is usually pointless. In a subsequent round, it generally can't meaningfully introspect on its prior internal state, so it's just referring to the session transcript and extrapolating a plausible sounding answer based on its training data of how LLMs typically work.
Yep, I've gotten used to treating the model output as a finished, self-contained thing.
If it needs to be explained, the model will be good at that; if it has an issue, the model will be good at fixing it (and possibly patching any instructions to prevent it in the future). I'm not getting the actual reason why things happened a certain way, but then again, it's just a token-prediction machine. If there's something wrong with my prompt that's not immediately obvious and perhaps doesn't matter that much, I can just run a few sub-agents in a review role, look for a consensus on any problems found, and have the model fix them.
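The sub-agents-in-a-review-role idea can be sketched as a simple quorum over flagged issues. The reviewer outputs here are canned stand-ins; a real setup would spawn actual review agents and parse their findings:

```python
from collections import Counter

# Sketch of "run a few sub-agents in a review role, look for consensus".
# These review outputs are canned stand-ins, not real agent output.
reviews = [
    {"missing null check", "race in cache update"},
    {"race in cache update", "naming nit"},
    {"missing null check", "race in cache update"},
]

def consensus_issues(reviews: list[set[str]], quorum: int = 2) -> set[str]:
    # Keep only issues flagged by at least `quorum` reviewers.
    counts = Counter(issue for review in reviews for issue in review)
    return {issue for issue, n in counts.items() if n >= quorum}

print(consensus_issues(reviews))
```

Here the race condition is flagged by all three reviewers, the null check by two, and the naming nit by only one, so the quorum filter drops the one-off.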
This can work, but it's sort of not the same as providing actual reasoning behind "why did you do/say X?" -- this is basically asking the model to read the conversation, try to understand from it "why" something happened, and add information to prevent it from being wrong next time. "Why" something went wrong is not really the same as "why" the model output something.
> This is key. In my experience, asking an LLM why it did something is usually pointless.
That kind of strikes me as a huge problem. Working backwards from solutions (both correct and wrong) can yield pretty critical information and learning opportunities. Otherwise you’re just veering into “guess and check” territory.
The K/V cache is just an optimization. But yeah, you would expect the attention for the model producing "OK, I'm doing X" and for you asking "Why did you do X?" to be similar, so I don't see a reason why introspection would be impossible. In fact, while adapting a test skill where the agent would write a new test instead of adapting an existing one, I asked it why and it gave the reasoning it used. We then updated the skill to specifically reject that reasoning, and it worked: the agent adapted the existing test instead.
That's good advice. I managed to get the session back on track by doing that a few turns later. I started making it very explicit that I wanted it to really think things through. It kept asking me for permission to do things, I had to explicitly prompt it to trace through and resolve every single edge case it ran into, but it seems to be doing better now. It's running a lot of adversarial tests right now and the results at least seem to be more thorough and acceptable. It's gonna take a while to fully review the output though.
It's just that Opus 4.6 DISABLE_ADAPTIVE_THINKING=1 doesn't seem to require me to do this at all, or at least not as often. It'd fully explore the code and take into account all the edge cases and caveats without any explicit prompting from me. It's a really frustrating experience to watch Anthropic's flagship subscription-only model burn my tokens only to end up lazily hand-waving away hard questions unless I explicitly tell it not to do that.
I have to give it to Opus 4.7 though: it recovered much better than 4.6.
Yeah, for anyone seriously using these models I highly recommend reading the Mythos system card, especially the sections on analyzing its internal non-verbalized states. Saves a lot of banging your head against the wall.
This is frankly one of the most frustrating things about LLMs: sometimes I just want to drive it into a corner. “Why the f** did you do X when I specifically told you not to?”
It never leads to anything helpful. I don’t generally find it necessary to drive humans into a corner. I’m not sure it’s because it’s explicitly not a human so I don’t feel bad for it, though I think it’s more the fact that it’s always so bland and is entirely unable to respond to a slight bit of negative sentiment (both in terms of genuinely not being able to exert more effort into getting it right when someone is frustrated with it, but also in that it is always equally nonchalant and inflexible).
You might be surprised how well 5.3-codex follows your instructions. When it hits a wall with your request, it usually emits the final turn and says it can’t do it.
> What works much better is to tell the model to take a step back and re-evaluate.
I desperately hate that modern tooling relies on “did you perform the correct prayer to the Omnissiah”
> to add some entropy to get it away from the local optimum
Is that what it does? I don't think that's what it does, technically.
I think that's just anthropomorphizing a system that behaves in a non-deterministic way.
A more meaningful solution is almost always "do it multiple times".
That is a solution that sometimes makes sense because the system is probability-based, but even then, when you're hitting an opaque API with multiple hidden caching layers, /shrug, who knows.
This is why I firmly believe prompt engineering and prompt hacking are just fluff.
It's both mostly technically meaningless (observing random variance over a sample so small you can't see actual patterns) and obsolete once models/APIs change.
Just ask Claude to rewrite your request "as a prompt for claude code" and use that. I bet it won't be any worse than the prompt you write by hand.
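For what it's worth, the "do it multiple times" approach is easy to sketch: sample the same prompt N times and take the majority answer. `sample_model` below is a stub standing in for a real, nondeterministic API call:

```python
from collections import Counter
import random

# Sketch of "do it multiple times": best-of-N sampling with a majority vote.
# `sample_model` is a stub standing in for a real, nondeterministic API call.
def sample_model(prompt: str, rng: random.Random) -> str:
    # Stub: right answer ~70% of the time, two distinct wrong ones otherwise.
    return rng.choices(["42", "41", "43"], weights=[0.7, 0.2, 0.1])[0]

def majority_answer(prompt: str, n: int = 9, seed: int = 0) -> str:
    rng = random.Random(seed)
    votes = Counter(sample_model(prompt, rng) for _ in range(n))
    return votes.most_common(1)[0][0]

print(majority_answer("what does the function return?"))
```

The vote washes out a fair amount of per-sample variance, which is the whole argument for re-running over prompt incantations.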
Are the benchmarks being used to measure these models biased towards completing huge and highly complex tasks, rather than ensuring correctness for less complex tasks?
It seems like they're working hard to prioritize wrapping their arms around huge contexts, as opposed to handling small tasks with precision. I prefer to limit the context and the scope of the task and focus on trying to get everything right in incremental steps.
> For a fair comparison you need to look at the total cost, because 4.7 produces significantly fewer output tokens than 4.6
Does it? Anthropic's own announcement says that for the same "effort level" 4.7 does more thinking (i.e. uses more output tokens) than 4.6, and they've also increased the default effort level from 4.6's high to 4.7's xhigh.
I'm not sure what dominates the cost for a typical mix of agentic coding tasks - input tokens or output ones, but if you are working on an existing project rather than a brand new one, then file input has to be a significant factor and preliminary testing says that the new tokenizer is typically generating 40% or so more tokens for the exact same input.
I really have to wonder how much of 4.7's increase in benchmark scores over 4.6 is because the model is actually better trained for these cases, or just because it is using more tokens - more compute and thinking steps - to generate the output. It has to be a mix of the two.
The bump from 4.6 to 4.7 is not very noticeable to me in improved capabilities so far, but the faster consumption of limits is very noticeable.
I hit my 5 hour limit within 2 hours yesterday, initially I was trying the batched mode for a refactor but cancelled after seeing it take 30% of the limit within 5 minutes. Had to cancel and try a serial approach, consumed less (took ~50 minutes, xhigh effort, ~60% of the remaining allocation IIRC), but still very clearly consumed much faster than with 4.6.
It feels like every exchange takes ~5% of the 5 hour limit now, when it used to be maybe ~1-2%. For reference I'm on the Max 5x plan.
For now I can tolerate it since I still have plenty of headroom in my limits (used ~5% of my weekly, I don't use claude heavily every day so this is OK), but I hope they either offer more clarity on this or improve the situation. The effort setting is still a bit too opaque to really help.
I'd be ok with paying more if results were good, but it seems like Anthropic is going for the Tinder/casino intermittent reinforcement strategy: optimized to keep you spending tokens instead of achieving results.
And yes, Claude models are generally more fun to use than GPT/Codex. They have a personality. They have an intuition for design/aesthetics. Vibe-coding with them feels like playing a video game. But the result is almost always some version of cutting corners: tests removed to make the suite pass, duplicate code everywhere, wrong abstraction, type safety disabled, hard requirements ignored, etc.
These issues are not resolved in 4.7, no matter what the benchmarks say, and I don't think there is any interest in resolving them.
AFAICT this uses a token-counting API to count the same prompt under both tokenizers, so it's measuring the tokenizer change in isolation. Smarter models also sometimes produce shorter outputs and therefore fewer output tokens. That doesn't mean Opus 4.7 necessarily nets out cheaper, it might still be more expensive, but this comparison isn't really very useful.
For now, I'm planning to stick with Opus 4.5 as a driver in VSCode Copilot.
My workflow is to give the agent pretty fine-grained instructions, and I'm always fighting agents that insist on doing too much. Opus 4.5 is the best out of all agents I've tried at following the guidance to do only-what-is-needed-and-no-more.
Opus 4.6 takes longer, overthinks things and changes too much; the high-powered GPTs are similarly flawed. Other models such as Sonnet aren't nearly as good at discerning my intentions from less-than-perfectly-crafted prompts as Opus.
Eventually, I quit experimenting and just started using Opus 4.5 exclusively knowing this would all be different in a few months anyway. Opus cost more, but the value was there.
But now I see that 4.7 is going to replace both 4.5 and 4.6 in VSCode Copilot, and with a 7.5x modifier. Based on the description, this is going to be a price hike for slower performance — and if the 4.5 to 4.6 change is any guide, more overthinking targeted at long-running tasks, rather than fine-grained. For me, that seems like a step backwards.
> Opus 4.7 (Adaptive Reasoning, Max Effort) cost ~$4,406 to run the Artificial Analysis Intelligence Index, ~11% less than Opus 4.6 (Adaptive Reasoning, Max Effort, ~$4,970) despite scoring 4 points higher. This is driven by lower output token usage, even after accounting for Opus 4.7's new tokenizer. This metric does not account for cached input token discounts, which we will be incorporating into our cost calculations in the near future.
It's increasingly looking naive to assume scaling LLMs is all you need to get to full white-collar worker replacement. The attention mechanism / hopfield network is fundamentally modeling only a small subset of the full human brain, and all the increasing sustained hype around bolted-on solutions for "agentic memory" is, in my opinion, glaring evidence that these SOTA transformers alone aren't sufficient even when you just limit the space to text. Maybe I'm just parroting Yann LeCun.
My impression is that the quality of the conversation is unexpectedly better: more self-critical, the suggestions consistently critical, the default choices consistently good. I might not have as many harnesses as most people here, so I suspect it's less obvious for them, but I would expect this to make it far more valuable for people who haven't invested as much.
After a few basic operations (retrospective look at the flow of recent reviews, product discussions) I would expect this to act like a senior member of the team, while 4.6 was good, but far more likely to be a foot-gun.
We dropped Claude. It's pretty clear this is a race to the bottom, and we don't want a hard dependency on another multi-billion dollar company just to write software
We'll be keeping an eye on open models (of which we already make good use). I think that's the way forward. Actually, it would be great if everybody put more focus on open models; perhaps we can come up with something like the "linux/postgres/git/http/etc" of LLMs: something we can all benefit from without it being monopolized by a single billionaire company. Wouldn't it be nice if we didn't need to pay for tokens? Paying for infra (servers, electricity) is already expensive enough.
Comments here overall do not reflect my experience -- I'm puzzled how the vast majority are using this technology day to day. 4.7 is absolute fire and an upgrade on 4.6.
My initial experience with Opus 4.7 has been pretty bad and I'm sticking to Codex. But these results are meaningless without comparing outcomes. Whether the extra token burn is bad or not depends on whether it improves some quality / task completion metric. Am I missing something?
I have been seeing this messaging everywhere and I have not noticed this. I have had the inverse with 4.7 over 4.6.
I think people aren’t reading the system cards when they come out. They explicitly explain your workflow needs to change. They added more levels of effort and I see no mention of that in this post.
Did y’all forget Opus 4? That was not that long ago that Claude was essentially unusable then. We are peak wizardry right now and no one is talking positively. It’s all doom and gloom around here these days.
Brutal. I've been noticing that 4.7 eats my Max Subscription like crazy even when I do my best to juggle tasks (or tell 4.7 to use subagents with) Sonnet 4.6 Medium and Haiku. Would love to know if anybody's found ideal token-saving approaches.
It’s really funny how people are surprised or upset about the pricing “anomalies” of these SaaS models. If you’ve been around in tech, you know it’s probably designed to keep you outraged about it to keep engagement up and essentially free ads. The advice, as always, is to not lock yourself into it.
For sure Opus 4.7 is more chatty and talkative, I had to explicitly state a "be concise" preference in the settings. Is anyone experiencing some (very very rare) glitch in the output? Broken words, I mean. I'm using the WEB interface extensively, adaptive thinking ON, PRO plan.
I'm a retired mathematician hoping to finish a second proof of a major theorem before I die. AI needs to understand my math and help me code. What I spend on AI isn't going to deplete my retirement savings.
So far, Opus 4.7 seems a bit smarter than Opus 4.6 for my use case. That's my only concern. Is an $80 bottle of wine a better value than a $20 or $40 bottle of wine? Pretty much never. If there are those of us willing to buy $80 bottles of wine, of course the market will facilitate this.
People can use whatever model they want. I'm too worried about worms crawling through my dead body to waste time on any but the smartest model any moment can offer.
Price is now getting to be more in line with the actual cost. The models are dumber, slower and more expensive than what we've been paying for up until now. OpenAI will do it too, maybe a bit less to avoid pissing people off after seeing the backlash to Anthropic's move here. Or maybe they won't make it dumber, but they'll increase the price while making a dumber mode the baseline so you're encouraged to pay more. The free ride is over. Hope you have 30k burning a hole in your pocket to buy a beefy machine to run your own model. I hear Mac Studios are good for local inference.
Token consumption is huge compared to 4.6 even for smaller tasks. Just by "reasoning" after my first prompt this morning, I burned over 50% of the 5-hour quota.
Is it really unthinkable that another OSS/local model will be released by DeepSeek, Alibaba, or even Meta that once again gives these companies a run for their money?
I wonder if this is like when a restaurant introduces a new menu to increase prices.
Is Opus 4.7 that significantly different in quality that it should use that much more in tokens?
I like Claude and Anthropic a lot, and hope it's just some weird quirk in their tokenizer or whatnot, just seems like something changed in the last few weeks and may be going in a less-value-for-money direction, with not much being said about it. But again, could just be some technical glitch.
I've spent the past 4+ months building an internal multi-agent orchestrator for coding teams. Agents communicate through a coordination protocol we built, and all inter-agent messages plus runtime metrics are logged to a database.
Our default topology is a two-agent pair: one implementer and one reviewer. In practice, that usually means Opus writing code and Codex reviewing it.
I just finished a 10-hour run with 5 of these teams in parallel, plus a Codex run manager. Total swarm: 5 Opus 4.7 agents and 6 Codex/GPT-5.4 agents.
Opus was launched with:
export CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=35
claude --dangerously-skip-permissions --model 'claude-opus-4-7[1M]' --effort high --thinking-display summarized
What surprised me was usage: after 10 hours, both my Claude Code account and my Codex account had consumed 28% of their weekly capacity from that single run.
I expected Claude Code usage to be much higher. Instead, on these settings and for this workload, both platforms burned the same share of weekly budget.
So from this datapoint alone, I do not see an obvious usage-efficiency advantage in switching from Opus 4.7 to Codex/GPT-5.4.
Anthropic is playing a strange game. It's almost like they want you to cancel the subscription if you're an active user and only subscribe if you only use it once per month to ask what the weather in Berlin is.
First they introduce a policy to ban third party clients, but the way it's written, it affects claude -p too, and 3 months later, it's still confusing with no clarification.
Then they hide model's thinking, introduce a new flag which will still show summaries of thinking, which they break again in the next release, with a new flag.
Then they silently cut the usage limits to the point where the exact same usage that you're used to consumes 40% of your weekly quota in 5 hours, but not only do they stay silent for two entire weeks - they actively gaslight users, saying they didn't change anything, only to announce later that they did, indeed, change the limits.
Then they serve a lobotomized model for an entire week before they drop 4.7, again, gaslighting users that they didn't do that.
And then this.
Anthropic has lost all credibility at this point and I will not be renewing my subscription. If they can't provide services under a price point, just increase the price or don't provide them.
EDIT: forgot "adaptive thinking", so add that too. Which essentially means "we decide when we can allocate resources for thinking tokens based on our capacity, or in other words - never".
I spent one day with Opus 4.7 trying to fix a bug. It just ran in circles despite having the problem "in front of its eyes" with all supporting data, a thorough description of the system, a test harness that reproduces the bug, etc. While I still believe 4.7 is much "smarter" than GPT-5.4, I decided to give it a go. It was giving me dumb answers and going off the rails. After accusing it many times of being a fraud and doing it on purpose so that I'd spend more money, it fixed the bug in one shot.
Having a taste of unnerfed Opus 4.6 I think that they have a conflict of interest - if they let models give the right answer first time, person will spend less time with it, spend less money, but if they make model artificially dumber (progressive reasoning if you will), people get frustrated but will spend more money.
It is likely happening because economics doesn't work. Running comparable model at comparable speed for an individual is prohibitively expensive. Now scale that to millions of users - something gotta give.
The whole version naming for models is very misleading. 4 and 4.1 seem to come from a different "line" than 4.5 and 4.6, and likewise 4.7 seems like a new shape of model altogether. They aren't linear stepwise improvements, but I think overall 4.7 is generally "smarter" just based on conversational ability.
If anyone's had 4.7 update any documents so far - notice how concise it is at getting straight to the point. It rewrote some of my existing documentation (using Windsurf as the harness), not sure I liked the decrease in verbosity (removed columns and combined / compressed concepts) but it makes sense in respect to the model outputting less to save cost.
To me this seems more that it's trained to be concise by default which I guess can be countered with preference instructions if required.
What's interesting to me is that they're using a new tokeniser. Does it mean they trained a new model from scratch? Used an existing model and further trained it with a swapped out tokeniser?
The looped model research / speculation is also quite interesting - if done right there's significant speed up / resource savings.
This, the push towards per-token API charging, and the rest are just a sign of things to come when they finally establish a moat and a full monopoly/duopoly, which is also what all the specialized tools like Designer and integrations are about.
It's going to be a very expensive game, and the masses will be left with subpar local versions. It would be like if we reversed the democratization of compilers and coding tooling, done in the 90s and 00s, and the polished more capable tools are again all proprietary.
Opus 4.7 seems smarter not wiser. More knowledge, maybe, but less grit. It often has been asking me to wrap it up or just be happy with current state, instead of working out a problem.
Not a secret, the model is the best in the world. Yet it is crazy expensive, and this 35% is huge for us. $10,000 becomes $13,500. Don't forget, Anthropic's tokenizer also counts way more tokens than other providers'.
We have experimented a lot with GLM 5.1. It is kinda close, but with downsides: no image support, at most ~100K of adequate context, and poor text writing. However, it's a great designer. So there is no replacement. We pray.
One thing I don't see often mentioned - OpenAI API's auto token caching approach results in MASSIVE cost savings on agent stuff. Anthropic's deliberate caching is a pain in comparison. Wish they'd just keep the KV cache hot for 60 seconds or so, so we don't have to pay the input costs over and over again, for every growing conversation turn.
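Rough numbers on why this matters for agent loops, where the whole growing transcript is resent every turn. The price and the 10% cache-read rate below are assumptions for illustration, not any provider's actual billing:

```python
# Back-of-envelope: a growing conversation re-sent each turn, with and
# without prompt caching. Prices are illustrative, not real rates.
PRICE_IN = 5.0             # $/M input tokens (made up)
CACHE_READ_DISCOUNT = 0.1  # cached input billed at 10% (assumption)

def input_cost(turns: int, tokens_per_turn: int, cached: bool) -> float:
    total = 0.0
    history = 0
    for _ in range(turns):
        new = tokens_per_turn
        if cached:
            # Old history is a cache hit; only the new turn is full price.
            total += (history * CACHE_READ_DISCOUNT + new) * PRICE_IN / 1e6
        else:
            # Every turn pays full price for the entire transcript so far.
            total += (history + new) * PRICE_IN / 1e6
        history += new
    return total

print(input_cost(50, 4_000, cached=False))  # resending history every turn
print(input_cost(50, 4_000, cached=True))
```

Because the transcript grows linearly, uncached input cost grows quadratically in the number of turns, which is why cache behavior dominates agent economics.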
the tokenizer change is the real story here imo. same text, same prompt, but 4.7 maps it to 1.0-1.35x more tokens at the same per-token price. that's a stealth price increase that doesn't show up on any pricing page.
what makes it worse is it compounds with two other things: thinking tokens (invisible but counted against limits) and the more verbose output style. so the effective cost delta is closer to 1.5-2x, not just the 1.35x from the tokenizer alone.
practically the only mitigation right now is to keep using 4.6 for tasks where you don't need the reasoning improvements and only use 4.7 when you actually need it. but that means maintaining model selection logic per-task, which most people won't bother with.
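That per-task selection logic doesn't have to be elaborate; a keyword heuristic routing to the cheaper model by default is a start. The model names and hint keywords below are assumptions for illustration:

```python
# Sketch of per-task model selection: default to the cheaper model, escalate
# only when the task description suggests it needs the newer model's
# reasoning. Hints and model names are illustrative assumptions.
REASONING_HINTS = ("prove", "deadlock", "race", "refactor across", "migrate")

def pick_model(task_description: str) -> str:
    text = task_description.lower()
    if any(hint in text for hint in REASONING_HINTS):
        return "claude-opus-4-7"
    return "claude-opus-4-6"

print(pick_model("rename a config key"))               # claude-opus-4-6
print(pick_model("find the race in the cache layer"))  # claude-opus-4-7
```

Crude, but it captures the point: pay the 4.7 premium only where the reasoning improvements plausibly matter.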
I’m trying to understand how this is useful information on its own?
Maybe I missed it, but it doesn’t tell you if it’s more successful for less overall cost?
I can easily make Sonnet 4.6 cost way more than any Opus model, because while it's cheaper per prompt it might take 10x more rounds to solve a problem (or never solve it).
This is key. In my experience, asking an LLM why it did something is usually pointless. In a subsequent round, it generally can't meaningfully introspect on its prior internal state, so it's just referring to the session transcript and extrapolating a plausible sounding answer based on its training data of how LLMs typically work.
That doesn't necessarily mean the reply is wrong because, as usual, a statistically plausible sounding answer sometimes also happens to be correct, but it has no fundamental truth value. I've gotten equally plausible answers just pasting the same session transcript into another LLM and asking why it did that.
From the early GPT days to now, the best way to get a decently scoped and reasonably grounded response has always been to ask at least twice (in the early days, often 7 or 8 times).
Because not only can it not reflect, it cannot "think ahead about what it needs to say and change its mind". It "thinks" out loud (as some people seem to as well).
It is a "continuation" of context. When you ask what it did, it still doesn't think, it just* continues from a place of having more context to continue from.
The game has always been: stuff context better => continue better.
Humans were bad at doing this. For example, asking it for synthesis with explanation instead of, say, asking for explanation, then synthesis.
You can get today's behaviors by treating "adaptive thinking" like a token budgeted loop for context stuffing, so eventually there's enough context in view to produce a hopefully better contextualized continuation from.
It seems no accident we've hit on the word "harness" — so much that seems impressive by end of 2025 was available by end of 2023 if "holding it right". If (and only if!) you are an expert in an area you need it to process: (1) turn thinking off, (2) do your own prompting to "prefill context", and (3) you will get superior final response. Not vibing, just staff-work.
---
* “just” – I don't mean "just" dismissively. Qwen 3.5 and Gemma 4 on M5 approaches where SOTA was a year ago, but faster and on your lap. These things are stunning, and the continuations are extraordinary. But still: Garbage in, garbage out; gems in, gem out.
> In a subsequent round, it generally can't meaningfully introspect on its prior internal state
It can't do any better in the moment it's making the choices. Introspection mostly amounts to back-rationalisation, just like in humans. Though for humans, doing so may help learning to make better future decisions in similar situations.
>Introspection mostly amounts to back-rationalisation, just like in humans.
That's the best case scenario. Again, let's stop anthropologizing. The given reasons why may be incompatible with the original answer upon closer inspection...
Regarding not just saying "try again": of course you're right that applying human cognition mechanisms to LLMs isn't founded on the same underlying effects.
But given the nature of training and fine-tuning/RL, I don't think it's unreasonable that instructing the model to reflect backwards could have a positive effect. The model might pattern-match on this and then exhibit a few positive behaviors. It could lead it to do more reflection within the reasoning blocks and catch errors before answering, which is what you want. These reasoning steps will attend to the question of "what caused you to make this assumption", further encouraging that behavior. Yes, both mechanisms are exhibited through linear, forward-going statistical interpolation, but reasoning has proven that this is an effective strategy for arriving at a more grounded result than answering right away.
Lastly, back to anthropomorphizing: it signals that you, the user, are encouraging of deeper thought and self-correction. The model doesn't have psychological safety mechanisms it guards, but again, the way these models are trained causes them to emulate them. The RL primes the model for certain behavior, i.e. arriving at an answer at some point rather than thinking for a long time. I think it's fair to assume that by "setting the stage" it's possible to influence which parts of that training activate. While role-based prompting isn't that important anymore, I think the system prompts of the big coding agents still use it, suggesting some slight advantage to putting the model in the right frame of mind. Again, very sorry for that last part, but anthropomorphizing does seem to be a useful analogy for a lot of the concepts we're seeing (the reason for this lying in more far-off epistemological and philosophical regions, on the side of both the models and us).
> This is key. In my experience, asking an LLM why it did something is usually pointless. In a subsequent round, it generally can't meaningfully introspect on its prior internal state, so it's just referring to the session transcript and extrapolating a plausible sounding answer based on its training data of how LLMs typically work.
Yep, I've gotten used to treating the model output as a finished, self-contained thing.
If it needs to be explained, the model will be good at that; if it has an issue, the model will be good at fixing it (and possibly patching any instructions to prevent it in the future). I'm not getting the actual reason why things happened a certain way, but then again, it's just a token prediction machine. If there's something wrong with my prompt that isn't immediately obvious and perhaps doesn't matter much, I can just run a few sub-agents in a review role, look for consensus on any problems found, and have the model fix them.
"Why did you guess at the function's signature and get it wrong? What information were you using, and how can we prevent this next time?"
Is that not the right approach?
> This is key. In my experience, asking an LLM why it did something is usually pointless.
That kind of strikes me as a huge problem. Working backwards from solutions (both correct and wrong) can yield pretty critical information and learning opportunities. Otherwise you’re just veering into “guess and check” territory.
> In a subsequent round, it generally can't meaningfully introspect on its prior internal state
It has the K/V cache, no?
It's just that Opus 4.6 DISABLE_ADAPTIVE_THINKING=1 doesn't seem to require me to do this at all, or at least not as often. It'd fully explore the code and take into account all the edge cases and caveats without any explicit prompting from me. It's a really frustrating experience to watch Anthropic's flagship subscription-only model burn my tokens only to end up lazily hand-waving away hard questions unless I explicitly tell it not to do that.
I have to give it to Opus 4.7 though: it recovered much better than 4.6.
> Opus 4.6 DISABLE_ADAPTIVE_THINKING=1
Strangely, this option was not working for many of us on a team plan.
It never leads to anything helpful. I don't generally find it necessary to drive humans into a corner either. I'm not sure whether that's because it's explicitly not a human so I don't feel bad for it; I think it's more that it's always so bland and entirely unable to respond to even a slight bit of negative sentiment (both in that it genuinely can't exert more effort to get things right when someone is frustrated with it, and in that it's always equally nonchalant and inflexible).
If you ask the average human "Why?", they will generally get defensive, especially if you are asking them to justify their own motivation.
However, if you ask them to describe the thinking and actions that led to their result, they often respond very differently.
> What works much better is to tell the model to take a step back and re-evaluate.
I desperately hate that modern tooling relies on “did you perform the correct prayer to the Omnissiah”
> to add some entropy to get it away from the local optimum
Is that what it does? I don't think that's what it does, technically.
I think that's just anthropomorphizing a system that behaves in a non-deterministic way.
A more meaningful solution is almost always "do it multiple times".
That's a solution that makes sense sometimes because the system is probability-based, but even then, when you're hitting an opaque API with multiple hidden caching layers, /shrug, who knows.
This is why I firmly believe prompt engineering and prompt hacking are just fluff.
It's both mostly technically meaningless (observing random variance over a sample so small you can't see actual patterns) and obsolete once models/APIs change.
Just ask Claude to rewrite your request “as a prompt for claude code” and use that.
I bet it won't be any worse than the prompt you write by hand.
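The "do it multiple times" approach above can be made mechanical: sample the same question N times and take the majority answer. A minimal sketch with a stubbed sampler (a real one would call the API with temperature > 0):

```python
from collections import Counter
import itertools

# Majority-vote over N independent samples. `sampler` is any callable
# returning one model answer per call; here it's a stub cycling through
# canned answers so the example is self-contained.

def majority_answer(sampler, n: int = 5) -> str:
    """Sample n answers and return the most common one."""
    votes = Counter(sampler() for _ in range(n))
    answer, _count = votes.most_common(1)[0]
    return answer

fake = itertools.cycle(["A", "A", "B", "A", "C"]).__next__
print(majority_answer(fake, n=5))  # -> A (3 of 5 votes)
```

This only pays off when sample cost is cheap relative to the cost of a wrong answer, which is exactly the trade-off the token-limit complaints elsewhere in this thread are about.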
> Why are you handwaving things away though? I've got you on max effort. I even patched the system prompts to reduce this.
Do you think it knows what max effort or patched system prompts are? It feels really weird to talk to an LLM like it’s a person that understands.
It seems like they're working hard to prioritize wrapping their arms around huge contexts, as opposed to handling small tasks with precision. I prefer to limit the context and the scope of the task and focus on trying to get everything right in incremental steps.
> For a fair comparison you need to look at the total cost, because 4.7 produces significantly fewer output tokens than 4.6
Does it? Anthropic's own announcement says that for the same "effort level" 4.7 does more thinking (i.e uses more output tokens) than 4.6, and they've also increased the default effort level from 4.6 high to 4.7 xhigh.
I'm not sure what dominates the cost for a typical mix of agentic coding tasks - input tokens or output ones, but if you are working on an existing project rather than a brand new one, then file input has to be a significant factor and preliminary testing says that the new tokenizer is typically generating 40% or so more tokens for the exact same input.
I really have to wonder how much of 4.7's increase in benchmark scores over 4.6 is because the model is actually better trained for these cases, or just because it is using more tokens - more compute and thinking steps - to generate the output. It has to be a mix of the two.
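The tokenizer claim is checkable: the inflation factor is just the ratio of token counts over the same input under each model. The stub counters below stand in for real ones (Anthropic's API exposes a token-counting endpoint you could plug in); the 1.4x figure echoes the commenter's ~40% measurement and is not independently verified here:

```python
# Measure tokenizer inflation as a ratio of token counts over identical
# input. The counters below are stubs returning fixed counts so the
# arithmetic is self-contained; swap in real tokenizer calls to measure.

def inflation_ratio(count_old, count_new, text: str) -> float:
    """Tokens under the new tokenizer divided by tokens under the old one."""
    return count_new(text) / count_old(text)

ratio = inflation_ratio(lambda t: 1000, lambda t: 1400, "same source file")
print(f"{ratio:.2f}x")  # 1.40x
```

Run over a representative sample of your actual repo files, not toy prompts, since the inflation apparently varies with content (code vs. prose, short vs. long inputs).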
I hit my 5 hour limit within 2 hours yesterday. I first tried batched mode for a refactor, but cancelled after seeing it take 30% of the limit within 5 minutes, then switched to a serial approach, which consumed less (took ~50 minutes at xhigh effort, ~60% of the remaining allocation IIRC) but still clearly burned much faster than 4.6.
It feels like every exchange takes ~5% of the 5 hour limit now, when it used to be maybe ~1-2%. For reference I'm on the Max 5x plan.
For now I can tolerate it since I still have plenty of headroom in my limits (used ~5% of my weekly, I don't use claude heavily every day so this is OK), but I hope they either offer more clarity on this or improve the situation. The effort setting is still a bit too opaque to really help.
And yes, Claude models are generally more fun to use than GPT/Codex. They have a personality. They have an intuition for design/aesthetics. Vibe-coding with them feels like playing a video game. But the result is almost always some version of cutting corners: tests removed to make the suite pass, duplicate code everywhere, wrong abstraction, type safety disabled, hard requirements ignored, etc.
These issues are not resolved in 4.7, no matter what the benchmarks say, and I don't think there is any interest in resolving them.
My workflow is to give the agent pretty fine-grained instructions, and I'm always fighting agents that insist on doing too much. Opus 4.5 is the best out of all agents I've tried at following the guidance to do only-what-is-needed-and-no-more.
Opus 4.6 takes longer, overthinks things and changes too much; the high-powered GPTs are similarly flawed. Other models such as Sonnet aren't nearly as good at discerning my intentions from less-than-perfectly-crafted prompts as Opus.
Eventually, I quit experimenting and just started using Opus 4.5 exclusively knowing this would all be different in a few months anyway. Opus cost more, but the value was there.
But now I see that 4.7 is going to replace both 4.5 and 4.6 in VSCode Copilot, and with a 7.5x modifier. Based on the description, this is going to be a price hike for slower performance — and if the 4.5 to 4.6 change is any guide, more overthinking targeted at long-running tasks, rather than fine-grained. For me, that seems like a step backwards.
After just ~4 prompts I blew past my daily limit. Another ~7 more prompts & I blew past my weekly limit.
The entire HTML/CSS/JS was less than 300 lines of code.
I was shocked how fast it exhausted my usage limits.
> Opus 4.7 (Adaptive Reasoning, Max Effort) cost ~$4,406 to run the Artificial Analysis Intelligence Index, ~11% less than Opus 4.6 (Adaptive Reasoning, Max Effort, ~$4,970) despite scoring 4 points higher. This is driven by lower output token usage, even after accounting for Opus 4.7's new tokenizer. This metric does not account for cached input token discounts, which we will be incorporating into our cost calculations in the near future.
[Opus 4.6] 3% context | last: 5.2k in / 1.1k out
add this to .claude/settings.json
"statusLine": {
  "type": "command",
  "command": "jq -r '\"[\\(.model.display_name)] \\(.context_window.used_percentage // 0)% context | last: \\(((.context_window.current_usage.input_tokens // 0) / 1000 * 10 | floor / 10))k in / \\(((.context_window.current_usage.output_tokens // 0) / 1000 * 10 | floor / 10))k out\"'"
}
After a few basic operations (retrospective look at the flow of recent reviews, product discussions) I would expect this to act like a senior member of the team, while 4.6 was good, but far more likely to be a foot-gun.
We'll be keeping an eye on open models (of which we already make good use). I think that's the way forward. Actually, it would be great if everybody put more focus on open models; perhaps we could come up with something like the "linux/postgres/git/http/etc" of LLMs: something we can all benefit from without it being monopolized by a single billionaire company. Wouldn't it be nice if we didn't need to pay for tokens? Paying for infra (servers, electricity) is already expensive enough.
I think people aren’t reading the system cards when they come out. They explicitly explain your workflow needs to change. They added more levels of effort and I see no mention of that in this post.
Did y’all forget Opus 4? That was not that long ago that Claude was essentially unusable then. We are peak wizardry right now and no one is talking positively. It’s all doom and gloom around here these days.
I'm surprised that it's 45%. Might go down (?) with longer context answers but still surprising. It can be more than 2x for small prompts.
So far, Opus 4.7 seems a bit smarter than Opus 4.6 for my use case. That's my only concern. Is an $80 bottle of wine a better value than a $20 or $40 bottle of wine? Pretty much never. If there are those of us willing to buy $80 bottles of wine, of course the market will facilitate this.
People can use whatever model they want. I'm too worried about worms crawling through my dead body to waste time on any but the smartest model any moment can offer.
Is Opus 4.7 that significantly different in quality that it should use that much more in tokens?
I like Claude and Anthropic a lot, and hope it's just some weird quirk in their tokenizer or whatnot, just seems like something changed in the last few weeks and may be going in a less-value-for-money direction, with not much being said about it. But again, could just be some technical glitch.
Our default topology is a two-agent pair: one implementer and one reviewer. In practice, that usually means Opus writing code and Codex reviewing it.
I just finished a 10-hour run with 5 of these teams in parallel, plus a Codex run manager. Total swarm: 5 Opus 4.7 agents and 6 Codex/GPT-5.4 agents.
Opus was launched with:
export CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=35
claude --dangerously-skip-permissions --model 'claude-opus-4-7[1M]' --effort high --thinking-display summarized

Codex was launched with:

codex --dangerously-bypass-approvals-and-sandbox --profile gpt-5-4-high

What surprised me was usage: after 10 hours, both my Claude Code account and my Codex account had consumed 28% of their weekly capacity from that single run.
I expected Claude Code usage to be much higher. Instead, on these settings and for this workload, both platforms burned the same share of weekly budget.
So from this datapoint alone, I do not see an obvious usage-efficiency advantage in switching from Opus 4.7 to Codex/GPT-5.4.
First they introduce a policy to ban third party clients, but the way it's written, it affects claude -p too, and 3 months later, it's still confusing with no clarification.
Then they hide model's thinking, introduce a new flag which will still show summaries of thinking, which they break again in the next release, with a new flag.
Then they silently cut the usage limits to the point where the exact same usage that you're used to consumes 40% of your weekly quota in 5 hours, but not only they stay silent for entire 2 weeks - they actively gaslight users saying they didn't change anything, only to announce later that they did, indeed change the limits.
Then they serve a lobotomized model for an entire week before they drop 4.7, again, gaslighting users that they didn't do that.
And then this.
Anthropic has lost all credibility at this point and I will not be renewing my subscription. If they can't provide services under a price point, just increase the price or don't provide them.
EDIT: forgot "adaptive thinking", so add that too. Which essentially means "we decide when we can allocate resources for thinking tokens based on our capacity, or in other words - never".
Having had a taste of unnerfed Opus 4.6, I think they have a conflict of interest: if they let the model give the right answer the first time, people spend less time with it and spend less money; if they make the model artificially dumber ("progressive reasoning", if you will), people get frustrated but spend more.
It's likely happening because the economics don't work. Running a comparable model at comparable speed for an individual is prohibitively expensive. Now scale that to millions of users - something's gotta give.
To me this seems more that it's trained to be concise by default which I guess can be countered with preference instructions if required.
What's interesting to me is that they're using a new tokeniser. Does it mean they trained a new model from scratch? Used an existing model and further trained it with a swapped out tokeniser?
The looped model research / speculation is also quite interesting - if done right there's significant speed up / resource savings.
It's going to be a very expensive game, and the masses will be left with subpar local versions. It would be like if we reversed the democratization of compilers and coding tooling, done in the 90s and 00s, and the polished more capable tools are again all proprietary.
Not a secret: the model is the best in the world. Yet it is crazy expensive, and this 35% is huge for us. $10,000 becomes $13,500. Don't forget, Anthropic's tokenizer also reports way more tokens than other providers'.
We have experimented a lot with GLM 5.1. It is kinda close, but with downsides: no images, max 100K adequate context size and poor text writing. However, a great designer. So there is no replacement. We pray.
It was on the higher end of Anthropics range - closer to 30-40% more tokens
https://www.claudecodecamp.com/p/i-measured-claude-4-7-s-new...