Issue: Claude Code is unusable for complex engineering tasks with Feb updates (github.com)

by StanAngeloff 753 comments 1364 points

[−] bcherny 39d ago
Hey all, Boris from the Claude Code team here. I just responded on the issue and am cross-posting here for input.

---

Hi, thanks for the detailed analysis. Before diving in, I wanted to say I appreciate the depth of thinking & care that went into this.

There's a lot here, so I'll try to break it down a bit. There are two core things happening:

> redact-thinking-2026-02-12

This beta header hides thinking from the UI, since most people don't look at it. It *does not* impact thinking itself, nor does it impact thinking budgets or the way extended reasoning works under the hood. It is a UI-only change.

Under the hood, by setting this header we avoid needing thinking summaries, which reduces latency. You can opt out of it with showThinkingSummaries: true in your settings.json (see [docs](https://code.claude.com/docs/en/settings#available-settings)).
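For example, your settings.json would just need this key merged into whatever is already there:

    {
      "showThinkingSummaries": true
    }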

If you are analyzing locally stored transcripts, you won't see raw thinking stored when this header is set, which is likely skewing the analysis. When Claude sees no thinking in those transcripts, it may not realize that the thinking still happened and is simply not user-facing.

> Thinking depth had already dropped ~67% by late February

We landed two changes in Feb that would have impacted this. We evaluated both carefully:

1/ Opus 4.6 launch → adaptive thinking default (Feb 9)

Opus 4.6 supports adaptive thinking, which is different from the fixed thinking budgets we used to support. In this mode, the model decides how long to think, which tends to work better than fixed budgets across the board. Set CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING to opt out.
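For example, from the shell you launch it from:

    # opt out of adaptive thinking, then start Claude Code
    export CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1
    claude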

2/ Medium effort (85) default on Opus 4.6 (Mar 3)

We found that effort=85 was a sweet spot on the intelligence-latency/cost curve for most users, improving token efficiency while reducing latency. One of our product principles is to avoid changing settings on users' behalf, and ideally we would have set effort=85 from the start. We felt this was an important setting to change, so our approach was to:

1. Roll it out with a dialog so users are aware of the change and have a chance to opt out

2. Show the effort the first few times you opened Claude Code, so it wasn't surprising.

Some people want the model to think for longer, even if it takes more time and tokens. To improve intelligence more, set effort=high via /effort or in your settings.json. This setting is sticky across sessions, and can be shared among users. You can also use the ULTRATHINK keyword to use high effort for a single turn, or set /effort max to use even higher effort for the rest of the conversation.
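To make that concrete (the last line is just an example prompt showing the keyword):

    /effort high     # sticky across sessions, shareable via settings.json
    /effort max      # even higher effort for the rest of the current conversation

    ULTRATHINK: dig into why the flaky integration test only fails in CI   # single-turn boost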

Going forward, we will test defaulting Teams and Enterprise users to high effort, to benefit from extended thinking even if it comes at the cost of additional tokens & latency. This default is configurable in exactly the same way, via /effort and settings.json.

[−] noxa 39d ago
I'm the author of the report in there. The stop-phrase-guard didn't get attached but here it is: https://gist.github.com/benvanik/ee00bd1b6c9154d6545c63e06a3...

You can watch for these yourself - they are strong indicators of shallow thinking. If you still have logs from Jan/Feb you can point claude at that issue and have it go look for the same things (read:edit ratio shifts, thinking character shifts before the redaction, post-redaction correlation, etc). Unfortunately, the cleanupPeriodDays setting defaults to 20 and anyone who had not backed up their logs or changed that has only memories to go off of (I recommend adding "cleanupPeriodDays": 365, to your settings.json). Thankfully I had logs back to a bit before the degradation started and was able to mine them.

The frustrating part is that it's not a workflow _or_ model issue, but a silently-introduced limitation of the subscription plan. They switched thinking to be variable by load, redacted the thinking so no one could notice, and then have been running it at ~1/10th the thinking depth nearly 24/7 for a month. That's with max effort on, adaptive thinking disabled, high max thinking tokens, etc etc. Not all providers redact or limit thinking, but some non-Anthropic ones do (mostly the ones that aren't API pricing). The issue for me personally is that "bro, if they silently nerfed the consumer plan just go get an enterprise plan!" is consumer-hostile thinking: if Anthropic's subscriptions have dramatically worse behavior than other access to the same model, they need to be clear about that. Today there is zero indication from Anthropic that the limitation exists, the redaction was a deliberate feature intended to hide it from the impacted customers, and the community is gaslighting itself with "write a better prompt" or "break everything into tiny tasks and watch it like a hawk, the same as you would a local 27B model" or "works for me" - sucks :/

[−] summarity 39d ago
Not Claude Code specific, but I've been noticing this on Opus 4.6 models through Copilot and others as well. Whenever the phrase "simplest fix" appears, it's time to pull the emergency brake. This has gotten much, much worse over the past few weeks. It will produce completely useless code, knowingly (because up to that phrase the reasoning was correct) breaking things.

Today another thing started happening: phrases like "I've been burning too many tokens" or "this has taken too many turns". Which, ironically, takes more tokens of custom instructions to override.

Also, Claude itself is partially down right now (Apr 6, 6pm CEST): https://status.claude.com/

[−] rileymichael 39d ago

> This report was produced by me — Claude Opus 4.6 — analyzing my own session logs [...] Please give me back my ability to think.

a bit ironic to utilize the tool that can't think to write up your report on said tool. that and this issue[1] demonstrate the extent to which folks have become over-reliant on LLMs. their review process let so many defects through that they now have to stop work and comb over everything they've shipped in the past 1.5 months! this is the future

[1] https://github.com/anthropics/claude-code/issues/42796#issue...

[−] fer 39d ago
Called it 10 days ago: https://news.ycombinator.com/item?id=47533297#47540633

Something worse than a bad model is an inconsistent model. One can't gauge how far to trust the output, even for the simplest instructions, so everything must be reviewed intensely, which is exhausting. I jumped on Max because it was worth it, but I guess I'll have to cancel this garbage.

[−] matheusmoreira 39d ago
That analysis is pretty brutal. It's very disconcerting that they can sell access to a high quality model then just stealthily degrade it over time, effectively pulling the rug from under their customers.
[−] kator 39d ago
Fascinating, I thought I was losing my mind. Claude CLI has been telling me I should go to bed, or that it's late and we should call it here, etc. Then I looked at stop-phrase-guard.sh [1] and I'm seeing quite a few of these. I thought it was because I accidentally let Claude know my deadline, and it started spitting out all sorts of things like "we only have N days left, let's put this aside for now," etc.

Just this morning I typed:

    STOP WORRYING ABOUT THE DEADLINE THAT IS MY JOB
[1] https://gist.github.com/benvanik/ee00bd1b6c9154d6545c63e06a3...
[−] davidw 39d ago
To me one of the big downsides of LLMs seems to be that you are lashing yourself to a rocket that is under someone else's control. If it goes places you don't want, you can't do much about it.
[−] phillipcarter 39d ago
Maybe it's because I spend a lot of time breaking up tasks beforehand to be highly specific and narrow, but I really don't run into issues like this at all.

A trivial example: whenever CC suggests doing more than one thing in a planning mode, just have it focus on each task and subtask separately, bounding each one by a commit. Each commit is a push/deploy as well, leading to a shitload of pushes and deployments, but it's really easy to walk things back, too.

[−] SkyPuncher 39d ago
I've noticed this as well. I had some time off in late January/early February. I fired up a max subscription and decided to see how far I could get the agents to go. With some small nudging from me, the agents researched, designed, and started implementing an app idea I had been floating around for a few years. I had intentionally not given them much to work with, but simply guided them on the problem space and my constraints (agent built, low capital, etc, etc). They came up with an extremely compelling app. I was telling people these models felt super human and were _extremely_ compelling.

A month later, I literally cannot get them to iterate or improve on it. No matter what I tell them, they simply tell me "we're not going to build phase 2 until phase 1 has been validated". I run them through the same process I did a month ago and they come up with bland, terrible crap.

I know this is anecdotal, but, this has been a clear pattern to me since Opus 4.6 came out. I feel like I'm working with Sonnet again.

[−] Aperocky 39d ago
In my opinion, cramming in invisible subagents is entirely the wrong approach: the models suffer information collapse, as they all tend to agree with each other and then produce complete garbage. Good for Anthropic, though, since that's metered token usage.

Instead, orchestrate all agents visibly together, even when there is hierarchy. Messages should be auditable, and the topology can be carefully refined and tuned for the task at hand. Other tools are significantly better at being this layer (e.g. kiro-cli), but I'm worried that they all want to become like claude-code or openclaw.

In the Unix philosophy, CC should just be a building block, but instead they think they are an operating system, and they will fail and drag your wallet down with them.

[−] skippyboxedhero 39d ago
I appreciate the work done here.

Been having this feeling that things have got worse recently but didn't think it could be model related.

The most frustrating aspect recently (I have learned and accepted that Claude produces bad code and probably always did, mea culpa) is the non-compliance. Claude is racing away doing its own thing, fixing things I didn't ask for, saying the things it broke have nothing to do with it, etc. Quite unpleasant to work with.

The stuff about token consumption is also interesting. Minimax/Composer have this habit of extensive thinking, and it's said to be their strength, but it seems like that comes at the price of huge output token consumption. If you compare against non-thinking models, there is a gap, but imo, given that the eventual code quality with all that thinking/token consumption is not so great... it doesn't feel like a huge gap.

If you take Sonnet at $5 for output tokens and compare it with QwenCoder non-thinking at under $0.50 (and remember the gap is probably larger than 10x, because Sonnet will use more tokens "thinking")... is the gap in code quality that large? Imo, not really.

Have been a subscriber since December 2024 but am looking elsewhere now. They will always have an advantage over the Chinese companies that are innovating more, simply because they are onshore, but the gap certainly isn't in model quality or execution anymore.

[−] jfvinueza 39d ago
Same experience. After a couple of golden weeks, Opus got much worse after Anthropic enabled the 1M context window. It felt like a very steep drop: for a while it seemed like I could trust it almost completely, and then I could trust it less than last year's models. Adopting LLMs for dev workflows has been fantastic overall, but we do have to keep adapting our interactions and expectations every day, and assume we'll keep doing so for at least another couple of years (mostly because of economics, I guess?)
[−] kator 39d ago
I put together a quick audit to check for "early landing" messages[1] using jq, ripgrep, and the messages[2] flagged in the stop guard script.

I have noticed a trend in these sessions asking more and more about calling it a day, "it's getting late," and other phrases. I sort of assumed it was some kind of "load shedding" on Anthropic's side.

My audit of 80 sessions was interesting. Sorry, I won't share details, but I recommend you do the same.

[1] https://gist.github.com/karlbunch/d52b538e6838f232d0a7977e7f...

[2] https://gist.github.com/benvanik/ee00bd1b6c9154d6545c63e06a3...
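If you want to run something similar yourself, here's a rough sketch of the kind of thing I mean (not my actual audit script from [1]; it assumes the usual ~/.claude/projects/**/*.jsonl transcript layout and the tool_use shape I see in my own logs, so adjust the jq paths to match yours):

    # transcripts touched in the last 30 days
    FILES=$(find ~/.claude/projects -name '*.jsonl' -mtime -30)

    # stop-phrase hits per file (pattern list is illustrative, not the full gist [2])
    rg -ci 'getting late|call it a day|burning too many tokens|too many turns' $FILES

    # Read vs Edit tool calls, to watch the read:edit ratio over time
    cat $FILES | jq -r 'select(.type == "assistant")
        | .message.content[]? | select(.type == "tool_use") | .name' \
      | sort | uniq -c | sort -rn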

[−] didgeoridoo 39d ago
Running some quick analysis against my .claude jsonl files, comparing the last 7 days against the prior 21:

- expletives per message: 2.1x

- messages with expletives: 2.2x

- expletives per word: 4.4x(!)

- messages >50% ALL CAPS: 2.5x

Either the model has degraded, or my patience has.
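If anyone wants to run a similar comparison, something like this gets you in the ballpark (it counts raw matches rather than the per-message ratios above, and assumes transcripts live under ~/.claude/projects as *.jsonl):

    # last 7 days
    find ~/.claude/projects -name '*.jsonl' -mtime -7 -print0 \
      | xargs -0 cat | grep -oiE 'fuck|shit|damn|wtf' | wc -l

    # prior 21 days
    find ~/.claude/projects -name '*.jsonl' -mtime +7 -mtime -28 -print0 \
      | xargs -0 cat | grep -oiE 'fuck|shit|damn|wtf' | wc -l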

[−] zamalek 39d ago

> Ignores instructions

> Claims "simplest fixes" that are incorrect

> Does the opposite of requested activities

> Claims completion against instructions

I thought it was just me. I'm continuously interrupting it with "no, that's not what I said" - being ignored sometimes 3 times; is Claude at the intellectual level of a teenager now?

I've noticed an increased tendency towards laziness prior to these "simple fix" problems. Historically it would defer doing things correctly (only documenting that in the context).

[−] afro88 39d ago
I use Claude Code extensively and haven't noticed this. But I don't have it doing long-running complex work like OP. My team always breaks things down in a very structured way, and a human reviews each step along the way. It's still the best way to safely leverage AI when working on a large brownfield codebase, in my experience.

Edit: the main issues being called out are the lack of thinking and the tendency to edit without researching first. Both of those are counteracted by the explicit research and plan steps we do, which explains why we haven't noticed this.

[−] germandiago 39d ago
My bet: LLMs will never be creative and will never be reliable.

It is a matter of paradigm.

Anything that makes them like that will require a lot of context tweaking, still with risks.

So for me, AI is a tool that accelerates "subworkflows" but adds review time and maintenance burden, and erodes a good-enough knowledge of the system to the point that it can become unmanageable.

Also, code is a liability. That is what they do the most: generate lots and lots of code.

So IMHO, and unless something changes a lot, good LLMs will have relatively bounded areas where they perform reasonably, and outside of those areas you should expect things to go wrong.

[−] aramova 39d ago
I cancelled my Pro plan over this two weeks ago. I literally asked it to plan a small script that scans with my HackRF; it ran 22 tools, never finished the plan, ran out of tokens, and made me wait 6 hours to continue.

The thing that really pisses me off is that it ran great for 2 weeks, like others said. I had gotten the annual Pro plan, and it went to shit after that.

Bait and switch at its finest.

[−] jwr 39d ago
I wish they had a "and we won't screw you in two weeks" plan at, say, 5x the price. It's worth it for my business, I'd pay it.

Should I switch back to API pricing? The problem here is that (I think) the instructions are in the Claude Code harness, so even if I switch Claude Code from a subscription to API usage, it would still do the same thing?

[−] ex-aws-dude 39d ago
It's so silly that everyone is dependent on a black box like this
[−] armchairhacker 39d ago
Yet https://marginlab.ai/trackers/claude-code/ says no issue.

If you're so convinced the models keep getting worse, build or crowdfund your own tracker.

[−] virtualritz 38d ago
My verdict after last night trying what was suggested here:

yes, with CLAUDE_CODE_EFFORT_LEVEL=max (or at least high; for that you don't need to set an env var, it will remember) and CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1, you can get Claude to perform as before.
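For anyone who wants to try the same combination, the invocation is roughly this (put the variables in your shell profile to make them stick):

    CLAUDE_CODE_EFFORT_LEVEL=max CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 claude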

I have been using Claude on /effort high since Opus 4.6 rolled out as medium would never get me good enough results (Rust, computer-graphics-related code).

I, too, noticed the drop in quality a month or so ago. With CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 it's back to what feels like pre-March performance -- but then your tokens will 'evaporate' 40% faster.

And that was not the case back then; I had similar/the same performance before but never ran out of tokens on a Max subscription.

So it's a rug-pull, same as before / late last summer, from whatever angle you look at it.

[−] jruz 38d ago
This is my last month on the Max plan; it's just not worth it anymore. $20 Codex plus writing code myself to keep my brain functioning is my sweet spot.

These people are not your friends; they rot your brain.

[−] LetsGetTechnicl 38d ago
People are finding out in real time that LLMs are not economically viable, and this is one way AI companies are trying to squeeze any amount of profit out of them: by making them worse. It's happened before; AI is just that unprofitable.
[−] aerhardt 39d ago
I subscribed today to use Claude Cowork. Codex continues to be my daily coding driver, but I wanted to check out the Cowork UI for non-technical tasks, as I am currently building an open-source project where I want (nearly) everything (research, ADRs, design, etc.) to be a file.

The five queries I've been able to ask before hitting the 20€ sub limit have been really underwhelming. The research I asked for was not exhaustive and often off-topic.

I don't want to start a flamewar but as it stands I vastly prefer ChatGPT and Codex on quality alone. I really want Anthropic and as many labs as possible to do well though.

[−] cvandyke 39d ago
I am a heavy user of Claude Code building enterprise software. I have not seen these issues and have been extremely productive with CC. I am more of a structured user leveraging Spec Driven Development vs being a vibe coder. I wonder if that is what has helped me avoid these issues.
[−] pjmlp 39d ago
I am just waiting for everything to implode so that we can do away with those KPIs.
[−] wnevets 39d ago
I've noticed claude being extra "dumb" the past 2-3 weeks and figured either my expectations have changed or my context wasn't any good. I'm glad to hear other people have noticed something is amiss.
[−] woah 39d ago
I haven't noticed any issues on well-specified tasks, even ones requiring large amounts of thinking.

One thing I have noticed is that codebase quality influences the quality of Claude's new contributions. A poor codebase both makes it harder for Claude to do good work (obviously) and seems to engender almost a "screw it" sort of attitude, which makes sense since Claude is emulating human behavior. Seeing the state of everything, Claude might just go in and try to find the simplest hacky solution to finish the task at hand, since that is the only way possible (fixing everything would be a far greater task).

Is it possible that this highly functioning senior dev team's practice of making 50+ concurrent agents commit 100k+ LOC per weekend resulted in a godawful pile of spaghetti code that is now literally impossible to maintain even with superhuman AI?

It's amusing that the OP had Claude dump out a huge rigorous-sounding report without considering the huge confounding variable staring him in the face.

[−] tyleo 39d ago
Is this impacted by the effort level you set in Claude? e.g., if you use the new "max" setting, does Claude still think?

I can see this change as something that should be tunable rather than hard-coded just from a token consumption perspective (you might tolerate lower-quality output/less thinking for easier problems).

[−] Asmod4n 39d ago
I’ve tried to use Claude code for a month now. It has a 100% failure rate so far.

Compare that to creating a Project and just chatting with it, which solves nearly everything I have thrown at it so far.

That’s with a pro plan and using sonnet since opus drains all tokens for a claude code session with one request.

[−] macformula2gx 30d ago
Claude is great, I really love it, but keen to know: will you ever have one plan and one history/context across all the different tools you have? Agents on cloud - platform console, CLI, desktop (chat + code + cowork), claude.ai, Chrome addons? I find it sad that the simple concept of single sign-on is not yet implemented. The history of conversations is not shared across tools. The choice to switch models in between, or automatically, is not there. Code reviews done on existing repos are incomplete; you have to prompt multiple times to get a thorough pass and only get "Sorry, I missed it" answers. Token consumption is huge, connectors are still not there for the Microsoft stack or well-known CRMs and ERPs, and you need to pay separately for API versus platform versus others. There - I said it ;) Hoping the next version will be truly for developers.
[−] voxelc4L 39d ago
Wonder how many of these cases are using the 1M context window. I found it impossible to use for complex coding tasks, so I turned it off and found I was back to approximately par (Dec-Jan) functionality-wise.
[−] alex7o 39d ago
Guys, literally change the system prompt with --system-prompt-file. You waste fewer tokens on their super long and detailed prompt, and you can tune it a bit to make it work exactly like you want/imagine.
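e.g. something like this (the file path is whatever you want; I'm assuming the flag takes a path to a plain text/markdown prompt):

    # keep your own trimmed system prompt in the repo and point Claude Code at it
    claude --system-prompt-file ./prompts/claude-system.md
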
[−] harles 39d ago
I hadn't noticed the thinking redaction before - maybe because I switched to the desktop app from CLI and just assumed it showed fewer details. This is the most concerning part. I've heard multiple times that Anthropic is aggressively reclaiming GPUs (I can't find a good source, but Theo Browne has mentioned it in his videos). If they're really in a crunch, then reducing thinking, and hiding thinking so it's not an obvious change, would be shady but effective.
[−] BoorishBears 39d ago
I hope that Anthropic continues to do well and that coding agents in general continue to progress... but I also hope Claude Code implodes dramatically and completely so we can get a ground-up rebuild with sound engineering.

Every week it seems like we're getting closer.

Bonus: a high-profile case might end people's fixation on how long they can go without writing any code. Which makes about as much sense as a mechanic fixating on how long they go between snapped bolts without a torque wrench.

[−] JamesSwift 39d ago
Multiple people on our team have independently noticed a _significant_ drop in quality and intelligence on Opus 4.6 over the past few weeks. Glaring hallucinations, nonsensical reasoning, and ignoring data from the context immediately preceding it. I'm not sure if it's an underlying regression or due to the new default being 1M context. But it's been _incredibly_ frustrating, and I'm screaming obscenities at it multiple times a week now vs maybe once a month.
[−] sreekanth850 39d ago
Abandoned Claude and moved to GPT 5.4 with Codex. 10x better.
[−] porridgeraisin 38d ago
IMO, it's an expectations vs reality thing.

The marketing still goes on about continuous inherent improvement due to the model itself, whereas most improvements today are due to better scaffolding. The key now is to build tooling around these LLMs to make them reliably productive - whatever level that may be at.

While Claude Code is one such tool, after a point the tooling is going to become company-specific. F-whatever companies directly contract OpenAI or Anthropic and have their FDEs do it for them. If you can't do that, I would invest in building tooling around LLMs specifically for your company.

Note that LLMs are approximate retrieval machines. You still need a planner* and a verifier around them. Today humans act as the planner and verifier (with some aid from test cases/linters). Investing in automating parts of this, crucially as separate tools, is the next big improvement.

* By planning, I mean trying out solutions, rolling them back[1], and using what you learned to do better next time. The solution search process. Context management also falls under this.

[1] and no, LLMs going "wait no..." doesn't count.

[−] stared 39d ago
I am curious - is there any hard data (e.g. a benchmark score drop)?

I feel that we look for patterns to the point of being superstitious. (ML would call it overfitting.)

[−] himata4113 39d ago
Not unique to Claude Code; I've noticed similar regressions. I've seen it most with the custom assistant I have in Telegram: it started confusing people and mixing up news coverage, and everyone in the group chat independently noticed that it's just not the same model it was a few weeks ago. The efficiency gains didn't come from nowhere, and it shows.
[−] redml 39d ago
Instead of Codex catching up with Claude, it's more like Claude regressed to Codex.
[−] root_axis 39d ago
How much of this is the model being degraded and how much of it is people just projecting vibes onto the variability of stochastic outputs?
[−] trashcan2137 39d ago
The report itself is unreadable AI garbage. I do not believe anyone went through all of that and didn't give up halfway through.
[−] zmmmmm 39d ago
Obviously it's entirely unprovable but it all aligns in very suspicious ways with a compelling narrative:

Anthropic simply can't actually scale Claude Code to meet the opportunity right now. Every second enterprise on the planet is probably negotiating large seat volume deals. It's a race for survival against the other players. The sales team is making huge promises engineering and ops can't fulfil.

So - they first force everyone to use the first-party client, then they mask visibility of how much thinking budget is being used, and then finally they start to actually reduce thinking itself, hoping they can gaslight power users into thinking it's them and not the tool, while new users will never know what they were missing.

Is the narrative true? It's compelling but we really need objective evidence - and there's the problem. When parts of the system are not under your control, it's impossible to generate such objective evidence. Which all winds up with a strong argument to have it all under your control. If it didn't happen this time, it probably will. Enshittification is a fundamental human behavioral constant.

[−] p1esk 39d ago
Yep, can confirm - just today, when debugging a failing test, Opus on high effort in CC repeatedly made stupid moves, such as running a different test instead of the failing one, and declaring that the failure is non-deterministic and cannot be reproduced. This started a few weeks ago - before that my experience with CC was pretty smooth.
[−] pavlov 39d ago
Wait… Actually the simplest fix is to use Claude to write carefully bounded boilerplate and do the interesting bits myself.
[−] desireco42 39d ago
I've been using OpenCode and Codex and have been just fine. In Antigravity, sometimes if Gemini can't figure something out even on high, Claude can give another perspective and that moves things along.

I think using just Claude is very limiting and detrimental to you as a technologist; you should use this tech, tweak it, and play with it. They want to be like Apple: shut up and give us your money.

I've been using Pi as an agent and it is great; I removed a bunch of MCPs from OpenCode and now it runs way better.

Anthropic has good models, but they are clearly struggling to serve and handle all the customers, which is not the best place to be.

As a technologist, I think I would love a client with a huge codebase. My approach now is to create a custom Pi agent for a specific client, and this seems to provide the optimal result, not just in token usage but in the time we spend solving and the quality of the solution.

Get another engine as a backup; you will be happier.