Please do not A/B test my workflow (backnotprop.com)

by ramoz 210 comments 169 points

[−] chrislloyd 63d ago
Hi, this was my test! The plan-mode prompt has been largely unchanged since the 3.x series models, and the 4.x models are now able to succeed with far less direction. My hypothesis was that shortening the plan would decrease rate-limit hits while still helping people achieve similar outcomes. I ran a few variants, with the author (and a few thousand others) getting the most aggressive one, limiting the plan to 40 lines. Early results aren't showing much impact on rate limits, so I've ended the experiment.

Planning serves two purposes - helping the model stay on track and helping the user gain confidence in what the model is about to do. Both sides of that are fuzzy, complex and non-obvious!

[−] nextzck 63d ago
The 40-line cap not moving rate limits makes sense - plan text is cheap. The cost is in Phase 1 exploration.

Plan mode spins up to 3 explore subagents before the planner even starts, and the heuristic is "use multiple when scope is uncertain." It won't choose fewer - it's being asked to plan, so scope is always uncertain. Nothing penalizes Claude for over-exploring and nothing rewards restraint.

Plan mode also ignores session state. A cold start gets the same fanout as a warm session where the relevant files are already in context. In a warm session the explore pass is pure waste - it re-reads loaded files and feeds the planner lossy summaries that conflict with what it already knows.

More tokens, worse plan.

If exploration were conditional on what's already in context (skip it for warm sessions, keep it for cold starts), that would do more for both rate limits and plan quality than a hard 40-line cap.

Note: plan mode didn't always have this three-subagent fan-out behavior attached to it; it was introduced around the Opus 4.6 launch.
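The conditional-exploration idea above can be sketched as a simple gate. This is hypothetical (Claude Code's internals aren't public, and all names here are illustrative): spawn explore subagents only for files the session hasn't already loaded, instead of always fanning out.

```python
# Hypothetical sketch of the commenter's proposal: gate plan-mode
# exploration on session state instead of a fixed fan-out. The
# function names and signature are illustrative, not a real API.

def plan_exploration(loaded_files, task_files, max_subagents=3):
    """Return how many explore subagents to spawn for a planning task.

    Warm session (every relevant file already in context): spawn none,
    since re-reading loaded files only adds tokens and lossy summaries.
    Cold or partially warm session: explore only the missing files,
    capped at max_subagents rather than always using the maximum.
    """
    loaded = set(loaded_files)
    missing = [f for f in task_files if f not in loaded]
    return min(max_subagents, len(missing))
```

Under this sketch, a warm session gets zero exploration passes and a cold start still gets the full fan-out, which is the behavior the comment argues for.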

[−] okwhateverdude 63d ago
How can we opt out of these tests? The behavior foibles I've been experiencing over the past month might be directly attributable to these experiments! It can be extremely frustrating. I don't want to be in the beta channel. Please change this to be opt-in.
[−] ramoz 63d ago
Thanks for the transparency. Sorry for the noise.

I think I'd be okay with a smaller, more narrative-detailed plan - not so much about verbosity, more about me understanding what is about to happen and why. There hadn't been much back-and-forth once planning mode kicked in (i.e. QA): it would jump into its own planning and idle until all I saw was a set of projected code changes.

[−] BAM-DevCrew 63d ago
As a divergent thinker with extensive hard constraints in claude.mds and onboarding commands that force Claude to internalize my constraints, the idea that you or some other employee of Anthropic could randomly select me for testing is horrifying. Each unexpected behavior, and my corresponding reaction to it, can wipe me out completely for hours, days, even weeks. In the last year I have spent hundreds of hours (estimating around 400) establishing and reestablishing a system to protect myself from psychological harm and financial harm. It is twisted that you Anthropic employees do not consider the impact your work has on divergent-thinking Claude users, let alone that real work is severely impacted by your work. Totally irresponsible. Offensively so.
[−] shepherdjerred 63d ago
What?

Even without Anthropic's experimentation, anything in the context is completely probabilistic.

You cannot rely on it no matter how/how much you prompt the model

[−] BAM-DevCrew 63d ago
And how does one address the fragility of probabilities? Engineering. Study weaknesses and harden them. Control the probabilities. It is NOT "completely" probabilistic.
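The "engineering around probabilities" stance above is commonly implemented as a validate-and-retry wrapper: a generic sketch (not tied to any vendor's API; `generate` stands in for any LLM call) where only outputs passing hard checks ever reach the rest of the system.

```python
# Generic validate-and-retry wrapper around a nondeterministic
# generator. The validator encodes the hard constraints the
# surrounding system relies on; retries are bounded so failures
# surface instead of looping forever.

def constrained(generate, validate, max_attempts=3):
    """Call generate() until validate() accepts the output.

    validate(candidate) returns (ok, error_message).
    Raises ValueError if no attempt passes within max_attempts.
    """
    last_error = None
    for _ in range(max_attempts):
        candidate = generate()
        ok, error = validate(candidate)
        if ok:
            return candidate
        last_error = error
    raise ValueError(
        f"no valid output in {max_attempts} attempts: {last_error}"
    )
```

This doesn't make the model deterministic, but it does bound what the rest of the pipeline can observe, which is the practical sense in which the output is "not completely probabilistic."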
[−] PufPufPuf 63d ago
I can't tell whether something is satire anymore.
[−] oakwhiz 63d ago
Shouldn't you be giving people their tokens back when you used their tokens to test on their environment?
[−] bartread 63d ago
I don't mind you testing stuff out - it's the only sensible way to make the app better - but you need to give people choices to switch to different behaviours if the behaviour you're testing on them isn't working out well for them.

In other news, Claude Code login is down, so if you have time it would be sensible to prioritise fixing that:

Authorization failed Redirect URI http:/localhost:53025/callback is not supported by client.

MacOS Sequoia, VS Code 1.111.0, Firefox 147.0.4 (although also fails on Chrome 145.0.7632.160).

This just started happening as of this evening. I've tried restarting everything, and it doesn't help.

[−] reconnecting 63d ago
A professional tool is something that provides reliable and replicable results; LLMs offer none of this, and A/B testing is just further proof.
[−] onion2k 63d ago
A professional tool is something that provides reliable and replicable results; LLMs offer none of this, and A/B testing is just further proof.

The author's complaint doesn't really have anything to do with the LLM aspect of it though. They're complaining that the app silently changes what it's doing. In this case it's the injection of a prompt in a specific mode, but it could be anything really. Companies could use A/B tests on users to make Photoshop silently change the hue a user selects to be a little brighter, or Word could change the look of document titles, or a game could make enemies a bit stronger (fyi, this does actually happen - players get boosts on their first few rounds in online games to stop them being put off playing).

The complaint is about A/B tests with no visible warnings, not AI.

[−] reconnecting 63d ago
There's a distinction worth making here. A/B testing the interface (button placement, hue of a UI element, title styling) is one thing. But you wouldn't accept Photoshop silently changing your #000000 to #333333 in the actual file. That's your output, not the UI around it. That's what LLMs do. The randomness isn't in the wrapper, it's in the result you take away.
[−] doc_ick 63d ago
It’s an assistant, answering your question and running some errands for you. If you give it blind permission to do a task, then you’re not worrying about what it does.
[−] some_random 63d ago
That's not what they're doing, they are trying to use plan mode to plan out a task. I don't know where you could have got the idea that they were blindly doing anything.
[−] doc_ick 62d ago
Plan mode uses subtasks to read, find, list, or pull out certain information. I would imagine those tasks would be covered by the dubiously discovered A/B testing.
[−] duskdozer 63d ago
Honestly I find it kind of surprising that anyone finds this surprising. This is standard practice for proprietary software. LLMs are very much not replicable anyway.
[−] dkersten 63d ago
Anthropic have done a lot of things that would give me pause about trusting them in a professional context. They are anything but transparent, for example about the quota limits. Their vibe-coded Claude Code CLI releases are a buggy mess too. There's also the model-quality inconsistency: before a new model release, there's a week or two where their previous model is garbage.

A/B testing is fine in itself, since you need to learn about improvements somehow, but this seems to be A/B testing cost-saving optimisations rather than improvements to the user experience. Less transparency is rarely good.

This isn’t what I want from a professional tool. For business, we need consistency and reliability.

[−] ordersofmag 63d ago
Any tool that auto-updates carries the implication that behavior will change over time. And one criterion for being a skilled professional is having expert understanding of one's tools. That includes understanding the strengths and weaknesses of the tools (including variability of output) and making appropriate choices as a result. If you don't feel you can produce professional code with LLMs then certainly you shouldn't use them. That doesn't mean others can't leverage LLMs as part of their process and produce professional results. Blindly accepting LLM output and vibe coding clearly doesn't consistently produce professional results. But that's different from saying professionals can't use LLMs in ways that are productive.
[−] Mtinie 63d ago
What would you do differently if LLM outputs were deterministic?

Perhaps I approach this from a different perspective than you do, so I’m interested to understand other viewpoints.

I review everything that my models produce the same way I review work from my coworkers: Trust but verify.

[−] WillAdams 63d ago
Yeah, I've been using Copilot to process scans of invoices and checks (with a pen laid across the account information), converted to a PDF 20 at a time, and it's pretty rare for it to get all 20. But it's sufficiently faster than the alternative: opening them in batches of 50, re-saving each using the invoice ID, then using a .bat file to rename them (and remembering to quit Adobe Acrobat after each batch to avoid the bug where it stops saving files after a couple of hundred have been opened and re-saved).
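For what it's worth, the rename step in a workflow like that is scriptable. A small sketch, assuming you have (however obtained) a CSV mapping each scanned filename to its invoice ID; the directory layout and column names are assumptions, not part of the original comment:

```python
# Rename scanned PDFs to their invoice IDs in one pass, driven by a
# CSV with columns "filename" and "invoice_id", instead of a
# hand-maintained .bat file. Paths/columns are illustrative.

import csv
import os

def rename_from_csv(mapping_csv, pdf_dir):
    """Rename pdf_dir/<filename> to pdf_dir/<invoice_id>.pdf.

    Returns the number of files actually renamed; rows whose source
    file is missing are skipped rather than raising.
    """
    renamed = 0
    with open(mapping_csv, newline="") as f:
        for row in csv.DictReader(f):
            src = os.path.join(pdf_dir, row["filename"])
            dst = os.path.join(pdf_dir, row["invoice_id"] + ".pdf")
            if os.path.exists(src):
                os.rename(src, dst)
                renamed += 1
    return renamed
```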
[−] danielbln 63d ago
I don't get your point. Web tools have been doing A/B feature testing all the time, way before we had LLMs.
[−] _heimdall 63d ago
LLMs are nondeterministic by design, but that has nothing to do with A/B testing.
[−] croes 63d ago
That’s not a problem of LLMs but of using services provided by others.

How often were features changed or deactivated by cloud services?

[−] NotGMan 63d ago
By that definition humans are not professional since we hallucinate and make mistakes all the time.
[−] hrmtst93837 63d ago
[flagged]