Hi, this was my test! The plan-mode prompt has been largely unchanged since the 3.x series models, and now 4.x models are able to succeed with far less direction. My hypothesis was that shortening the plan would decrease rate-limit hits while still helping people achieve similar outcomes. I ran a few variants, with the author (and a few thousand others) getting the most aggressive one, limiting the plan to 40 lines. Early results aren't showing much impact on rate limits, so I've ended the experiment.
Planning serves two purposes - helping the model stay on track and helping the user gain confidence in what the model is about to do. Both sides of that are fuzzy, complex and non-obvious!
The 40-line cap not moving rate limits makes sense - plan text is cheap. The cost is in Phase 1 exploration.
Plan mode spins up to 3 explore subagents before the planner even starts, and the heuristic is "use multiple when scope is uncertain." It won't choose fewer - it's being asked to plan, so scope is always uncertain. Nothing penalizes Claude for over-exploring and nothing rewards restraint.
Plan mode also ignores session state. A cold start gets the same fanout as a warm session where the relevant files are already in context. In a warm session the explore pass is pure waste - it re-reads loaded files and feeds the planner lossy summaries that conflict with what it already knows.
More tokens, worse plan.
If exploration were conditional on what's already in context - skip it for warm sessions, keep it for cold starts - that would do more for both rate limits and plan quality than a hard 40-line cap. A rough sketch of that gate is below.
Note: plan mode didn't always have this 3-subagent fan-out behavior attached to it; it was introduced around the Opus 4.6 launch.
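A minimal sketch of that gate, in TypeScript. Everything here (SessionState, contextFiles, exploreAgentCount, the divisor of 5) is a hypothetical illustration of the idea, not Claude Code's actual internals:

    // Hypothetical: scale explore fan-out by what the planner doesn't already have.
    interface SessionState {
      contextFiles: Set<string>; // files already loaded into the context window
    }

    function exploreAgentCount(session: SessionState, planTargets: string[]): number {
      // Warm session: every file the plan touches is already in context, so an
      // explore pass would only re-read them and feed back lossy summaries.
      const missing = planTargets.filter((f) => !session.contextFiles.has(f));
      if (missing.length === 0) return 0;

      // Cold start or partial coverage: scale fan-out with the gap,
      // capped at the current maximum of three subagents.
      return Math.min(3, Math.ceil(missing.length / 5));
    }

The point is just the gate: fan-out becomes a function of what the planner doesn't already know, rather than a constant.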
How can we opt out of these tests? The behavior foibles I've been experiencing over the past month might be directly attributable to these experiments! It can be extremely frustrating. I don't want to be in the beta channel. Please change this to be opt-in.
I think I'd be okay with a smaller, more narrative plan - not so much about verbosity, more about me understanding what is about to happen and why. There hasn't been much back-and-forth since plan mode was introduced (i.e., no Q&A): it jumps into its own planning and idles until all I see is a set of projected code changes.
As a divergent thinker with extensive hard constraints in claude.mds and onboarding commands that force Claude to internalize my constraints, the idea that you or some other Anthropic employee could randomly select me for testing is horrifying. Each unexpected behavior, and my corresponding reaction to it, can wipe me out completely for hours, days, even weeks. Over the last year I have spent an estimated 400 hours establishing and reestablishing a system to protect myself from psychological and financial harm. It is twisted that you Anthropic employees do not consider the impact your work has on divergent-thinking Claude users, let alone that real work is severely impacted by it. Totally irresponsible. Offensively so.
And how does one address the fragility of probabilities? Engineering. Study weaknesses and harden them. Control the probabilities. It is NOT "completely" probabilistic.
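As a concrete (and entirely hypothetical) illustration of hardening a probabilistic call - callModel and the INV-\d{6} ID format here are stand-ins, not any real vendor's API:

    // Hypothetical: wrap a nondeterministic model call in deterministic validation.
    async function extractInvoiceId(
      callModel: (prompt: string) => Promise<string>,
      document: string,
      maxAttempts = 3,
    ): Promise<string> {
      for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        const raw = await callModel(
          `Return ONLY the invoice ID from this document:\n${document}`,
        );
        // Accept only output matching the expected shape; retry otherwise.
        const match = raw.trim().match(/^INV-\d{6}$/);
        if (match) return match[0];
      }
      throw new Error(`No valid invoice ID after ${maxAttempts} attempts`);
    }

The model stays probabilistic; the surrounding system decides what output is acceptable.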
I don't mind you testing stuff out - it's the only sensible way to make the app better - but you need to give people choices to switch to different behaviours if the behaviour you're testing on them isn't working out well for them.
In other news, Claude Code login is down, so if you have time it would be sensible to prioritise fixing that:
Authorization failed
Redirect URI http:/localhost:53025/callback is not supported by client.
macOS Sequoia, VS Code 1.111.0, Firefox 147.0.4 (although it also fails on Chrome 145.0.7632.160).
This just started happening as of this evening. I've tried restarting everything, and it doesn't help.
A professional tool is something that provides reliable and replicable results; LLMs offer neither, and A/B testing is just further proof.
The author's complaint doesn't really have anything to do with the LLM aspect of it though. They're complaining that the app silently changes what it's doing. In this case it's the injection of a prompt in a specific mode, but it could be anything really. Companies could use A/B tests on users to make Photoshop silently change the hue a user selects to be a little brighter, or Word could change the look of document titles, or a game could make enemies a bit stronger (fyi, this does actually happen - players get boosts on their first few rounds in online games to stop them being put off playing).
The complaint is about A/B tests with no visible warnings, not AI.
There's a distinction worth making here. A/B testing the interface (button placement, hue of a UI element, title styling) is one thing. But you wouldn't accept Photoshop silently changing your #000000 to #333333 in the actual file. That's your output, not the UI around it. And that's what LLMs do: the randomness isn't in the wrapper, it's in the result you take away.
It’s an assistant, answering your question and running some errands for you. If you give it blind permission to do a task, then you’re not worrying about what it does.
That's not what they're doing, they are trying to use plan mode to plan out a task. I don't know where you could have got the idea that they were blindly doing anything.
Plan mode uses subtasks to read, find, list, or grep out certain information. I would imagine those tasks would be covered by the dubiously discovered A/B testing.
Honestly I find it kind of surprising that anyone finds this surprising. This is standard practice for proprietary software. LLMs are very much not replicable anyway.
Anthropic have done a lot of things that would give me pause about trusting them in a professional context. They are anything but transparent - for example, about the quota limits. Their vibe-coded Claude Code CLI releases are a buggy mess too. There's also the model-quality inconsistency: before a new model release, there's a week or two where their previous model is garbage.
A/B testing is fine in itself - you need to learn about improvements somehow - but this seems to be A/B testing cost-saving optimisations rather than improvements to the user experience. Less transparency is rarely good.
This isn’t what I want from a professional tool. For business, we need consistency and reliability.
Any tool that auto-updates carries the implication that behavior will change over time. And one criterion for being a skilled professional is having expert understanding of one's tools. That includes understanding their strengths and weaknesses (including variability of output) and making appropriate choices as a result. If you don't feel you can produce professional code with LLMs, then certainly you shouldn't use them. That doesn't mean others can't leverage LLMs as part of their process and produce professional results. Blindly accepting LLM output and vibe coding clearly doesn't consistently produce professional results. But that's different from saying professionals can't use LLMs in ways that are productive.
Yeah, I've been using Copilot to process scans of invoices and checks (with a pen laid across the account information), converted to PDF 20 at a time. It's pretty rare for it to get all 20, but it's still much faster than opening them in batches of 50, re-saving with the Invoice ID, and then using a .bat file to rename them (while remembering to quit Adobe Acrobat after each batch to avoid the bug where it stops saving files after a couple hundred have been opened and re-saved).
Even without Anthropic's experimentation, anything in the context is completely probabilistic.
You cannot rely on it no matter how, or how much, you prompt the model.
Perhaps I approach this from a different perspective than you do, so I’m interested to understand other viewpoints.
I review everything that my models produce the same way I review work from my coworkers: Trust but verify.
How often have cloud services changed or deactivated features?