Please do not A/B test my workflow (backnotprop.com)

by ramoz • 210 comments • 169 points

210 comments

[−] chrislloyd 63d ago

Hi, this was my test! The plan-mode prompt has been largely unchanged since the 3.x series models and now 4.x get models are able to be successful with far less direction. My hypothesis was that shortening the plan would decrease rate-limit hits while helping people still achieve similar outcomes. I ran a few variants, with the author (and few thousand others) getting the most aggressive, limiting the plan to 40 lines. Early results aren't showing much impact on rate limits so I've ended the experiment.

Planning serves two purposes - helping the model stay on track and helping the user gain confidence in what the model is about to do. Both sides of that are fuzzy, complex and non-obvious!

[−] nextzck 63d ago

The 40-line cap not moving rate limits makes sense - plan text is cheap. The cost is in Phase 1 exploration.

Plan mode spins up to 3 explore subagents before the planner even starts, and the heuristic is "use multiple when scope is uncertain." It won't choose fewer - it's being asked to plan, so scope is always uncertain. Nothing penalizes claude for over-exploring and nothing rewards restraint.

Plan mode also ignores session state. A cold start gets the same fanout as a warm session where the relevant files are already in context. In a warm session the explore pass is pure waste - it re-reads loaded files and feeds the planner lossy summaries that conflict with what it already knows.

More tokens, worse plan.

If exploration was conditional on what's already in context..skip it for warm sessions, keep it for cold starts - that does more for both rate limits and plan quality than a hard 40-line cap.

Note: plan mode didn’t always have this 3 subagent fan out behavior attached to it, it was introduced around opus 4.6 launch.

[−] okwhateverdude 63d ago

How can we opt-out of these tests? The behavior foibles I've been experiencing over the past month might be directly attributable to these experiments! It can be extreme frustrating. I don't want to be in the beta channel. Please change this to be opt-in.

[−] ramoz 63d ago

Thanks for the transparency. Sorry for the noise.

I think I'd be okay with a smaller, more narrative-detailed plan - not so much about verbosity, more about me understanding what is about to happen & why. There hadn't been much discourse once planning mode entered (ie QA). It would jump into its own planning and idle until I saw only a set of projected code changes.

[−] BAM-DevCrew 63d ago

As a divergent thinker with extensive hard constraints in claude.mds and on-boarding commands that force claude to internalize my constraints, that you or some other employee of Anthropic could randomly select me for testing is horrifying. Each unexpected behavior and my corresponding reaction to it can wipe me out, my brain out, completely for hours, days, even weeks. I have in the last year spend tens (estimating around 400) of hours establishing and reestablishing a system to protect myself from psychological harm and financial harm. It is twisted that you Anthropic employees do not consider the impact your work has on divergent thinking Claude users, let alone that real work is severly impacted by your work. Totally irresponsible. Offensively so.

[−] shepherdjerred 63d ago

What?

Even without Anthropic's experimentation, anything in the context is completely probabilistic.

You cannot rely on it no matter how/how much you prompt the model

[−] BAM-DevCrew 63d ago

And how does one address the fragility of probabilities? Engineering. Study weaknesses and harden them. Control the probabilities. It is NOT "completely" probabilistic.

[−] PufPufPuf 63d ago

I can't tell whether something is satire anymore.

[−] oakwhiz 63d ago

Shouldn't you be giving people their tokens back when you used their tokens to test on their environment?

[−] bartread 63d ago

I don't mind you testing stuff out - it's the only sensible way to make the app better - but you need to give people choices to switch to different behaviours if the behaviour you're testing on them isn't working out well for them.

In other news, Claude Code login is down, so if you have time it would be sensible to proiritise fixing that:

Authorization failed Redirect URI http:/localhost:53025/callback is not supported by client.

MacOS Sequoia, VS Code 1.111.0, Firefox 147.0.4 (although also fails on Chrome 145.0.7632.160).

This just started happening as of this evening. I've tried restarting everything, and it doesn't help.

[−] reconnecting 63d ago

A professional tool is something that provides reliable and replicable results, LLMs offer none of this, and A/B testing is just further proof.

[−] onion2k 63d ago

A professional tool is something that provides reliable and replicable results, LLMs offer none of this, and A/B testing is just further proof.

The author's complaint doesn't really have anything to do with the LLM aspect of it though. They're complaining that the app silently changes what it's doing. In this case it's the injection of a prompt in a specific mode, but it could be anything really. Companies could use A/B tests on users to make Photoshop silently change the hue a user selects to be a little brighter, or Word could change the look of document titles, or a game could make enemies a bit stronger (fyi, this does actually happen - players get boosts on their first few rounds in online games to stop them being put off playing).

The complaint is about A/B tests with no visible warnings, not AI.

[−] reconnecting 63d ago

There's a distinction worth making here. A/B testing the interface button placement, hue of a UI element, title styling — is one thing. But you wouldn't accept Photoshop silently changing your #000000 to #333333 in the actual file. That's your output, not the UI around it. That's what LLMs do. The randomness isn't in the wrapper, it's in the result you take away.

[−] doc_ick 63d ago

It’s an assistant, answering your question and running some errands for you. If you give it blind permission to do a task, then you’re not worrying about what it does.

[−] some_random 63d ago

That's not what they're doing, they are trying to use plan mode to plan out a task. I don't know where you could have got the idea that they were blindly doing anything.

[−] doc_ick 62d ago

The plan mode uses sub tasks to read, find, list, or grow out certain information. I would imagine those tasks would be covered by the dubiously discovered a/b testing

[−] duskdozer 63d ago

Honestly I find it kind of surprising that anyone finds this surprising. This is standard practice for proprietary software. LLMs are very much not replicable anyway.

[−] applfanboysbgon 63d ago

This is in no way standard practice for proprietary software, WTF is with you dystopian weirdos trying to gaslight people? Adobe's suite incl. Photoshop does not do this, Microsoft Office incl. Excel does not do this, professional video editing software does not do this, professional music production software does not do this, game engines do not do this. That short list probably covers 80-90% of professional software usage alone. People do this when serving two versions of a website, but doing this on software that runs on my machine is frankly completely unacceptable and in no way normal.

[−] duskdozer 63d ago

Maybe then, it's just my expectation of what they would be doing. What else is all the telemetry for? As a side note, my impression is that this is less of a photoshop and more of a website situation in that most of the functionality is input and response to/from their servers.

[−] applfanboysbgon 63d ago

Telemetry is, ideally, collected with the intention of improving software, but that doesn't necessitate doing live A/B tests. A typical example: report hardware specs whenever the software crashes. Use that to identify some model of GPU or driver version that is incompatible with your software and figure out why. Ship a fix in the next update. What you don't do with telemetry is randomly do live experiments on your user's machines and possibly induce more crashing.

Regarding the latter point, the Claude Code software controls what is being injected into your own prompt before it is sent to their servers. That is indeed the only reason the OP could discover it -- if the prompt injection was happening on their servers, it would not be visible to you. To be clear, the prompt injection is fine and part of what makes the software useful; it's natural the company does research into what prompts get desirable output for their users without making users experiment[1]. But that should really not be changing without warning as part of experiments, and I think this does fall closer to a professional tool like Photoshop than a website given how it is marketed and the fact that people are being charged $20~200/mo or more for the privilege of using it. API users especially are paying for every prompt, so being sabotaged by a live experiment is incredibly unethical.

[1] That said, I think it's an extremely bad product. A reasonable product would allow power users to config their own prompt injections, so they have control over it and can tune it for their own circumstances. Having worked for an LLM startup, our software allowed exactly that. But our software was crafted with care by human devs, while by all accounts Claude Code is vibe coded slop.

[−] doc_ick 63d ago

Telemetry is “ideally” this. What makes you think other hosted llms (grok/xAI) don’t do this?

You also got the information from asking Claude questions about its prompt, maybe it hallucinated this?

[−] applfanboysbgon 63d ago

I have no idea what you're talking about or why you think I got any information from asking Claude anything. The telemetry comment was about software in general, Photoshop etc., since the person I was replying to was asking what telemetry could be for if not A/B tests. That things are injected into your prompt before sending it to their servers is trivially verified by inspecting your own outgoing packets.

[−] dkersten 63d ago

Anthropic have done a lot of things that would give me pause about trusting them in a professional context. They are anything but transparent, for example about the quota limits. Their vibe coded Claude code cli releases are a buggy mess too. Also the model quality inconsistency: before a new model release, there’s a week or two where their previous model is garbage.

A/B testing is fine in itself, you need to learn about improvements somehow, but this seems to be A/B testing cost saving optimisations rather than to provide the user with a better experience. Less transparency is rarely good.

This isn’t what I want from a professional tool. For business, we need consistency and reliability.

[−] r_lee 63d ago

> vibe coded Claude code cli releases are a buggy mess too

this is what gets me.

are they out of money? are so desperate to penny pinch that they can't just do it properly?

what's going on in this industry?

[−] dkersten 63d ago

I get the value of dogfooding, but I feel that in this case, a solid trustworthy foundation is much more important than dogfooding.

[−] macNchz 63d ago

I’m a huge user of AI coding tools but I feel like there has been some kind of a zeitgeist shift in what is acceptable to release across the industry. Obviously it’s a time of incredibly rapid change and competition, but man there is some absolute garbage coming out of companies that I’d expect could do better without much effort. I find myself asking, like, did anyone even do 5 minutes of QA on this thing?? How has this major bug been around for so long?

“It’s kind of broken, maybe they will fix it at some point,” has become a common theme across products from all different players, from both a software defect and service reliability point of view.

[−] r_lee 63d ago

I mean it's like, really they don't even need agentic AI or whatever, they could literally just employ devs and it wouldn't make a difference

like, they'll drop $100 billion on compute, but when it comes to devs who make their products, all of a sudden they must desperately cut costs and hire as little as possible

to me it makes no sense from a business perspective. Same with Google, e.g. YouTube is utterly broken, slow and laggy, but I guess because you're forced to use it, it doesn't matter. But still, if you have these huge money stockpiles, why not deploy it to improve things? It wouldn't matter anyways, it's only upside

[−] dkersten 63d ago

I don’t think they’re even saving much on vibe coding it, given how many tokens they claim they’re using. I know the token cost to them is much, much lower than the token cost to us, but it still has a cost in terms of gpus running.

Plus it’s not something we can replicate since we don’t have access to infinite tokens, so it’s not even a good dogfooding case study.

[−] ordersofmag 63d ago

Any tool that auto-updates carries the implication that behavior will change over time. And one criteria for being a skilled professional is having expert understanding of ones tools. That includes understanding the strengths and weaknesses of the tools (including variability of output) and making appropriate choices as a result. If you don't feel you can produce professional code with LLM's then certainly you shouldn't use them. That doesn't mean others can't leverage LLM's as part of their process and produce professional results. Blindly accepting LLM output and vibe coding clearly doesn't consistently product professional results. But that's different than saying professionals can't use LLM in ways that are productive.

[−] johnisgood 63d ago

Well put. I would upvote this many times if I could.

[−] Mtinie 63d ago

What would you do differently if LLM outputs were deterministic?

Perhaps I approach this from a different perspective than you do, so I’m interested to understand other viewpoints.

I review everything that my models produce the same way I review work from my coworkers: Trust but verify.

[−] WillAdams 63d ago

Yeah, I've been using Copilot to process scans of invoices and checks (w/ a pen laid across the account information) converted to a PDF 20 at a time and it's pretty rare for it to get all 20, but it's sufficiently faster than opening them up in batches of 50 and re-saving using the Invoice ID and then using a .bat file to rename them (and remembering to quite Adobe Acrobat after each batch so that I don't run into the bug in it where it stops saving files after a couple of hundred have been so opened and re-saved).

[−] danielbln 63d ago

I don't get your point. Web tools have been doing A/B feature testing all the time, way before we had LLMs.

[−] reconnecting 63d ago

This is very different from the A/B interface testing you're referring to, what LLMs enable is A/B testing the tool's own output — same input, different result.

Your compiler doesn't do that. Your keyboard doesn't do that. The randomness is inside the tool itself, not around it. That's a fundamental reliability problem for any professional context where you need to know that input X produces output X, every time.

[−] orf 63d ago

It’s exactly the same as A/B testing an interface. This is just testing 4 variants of a “page” (the plan), measuring how many people pressed “continue”.

[−] stavros 63d ago

You've groupped LLMs into the wrong set. LLMs are closer to people than to machines. This argument is like saying "I want my tools to be reliable, like my light switch, and my personal assistant wasn't, so I fired him".

Not to mention that of course everyone A/B tests their output the whole time. You've never seen (or implemented) an A/B test where the test was whether to improve the way e.g. the invoicing software generates PDFs?

[−] applfanboysbgon 63d ago

> LLMs are closer to people than to machines.

jfc. I don't have anything to say to this other than that it deserves calling out.

> You've never seen (or implemented) an A/B test where the test was whether to improve the way e.g. the invoicing software generates PDFs?

I have never in my life seen or implemented an a/b test on a tool used by professionals. I see consumer-facing tests on websites all the time, but nothing silently changing the software on your computer. I mean, there are mandatory updates, which I do already consider to be malware, but those are, at least, not silent.

[−] doc_ick 63d ago

As far as I can tell, llms never give the exact same output every time.

[−] johnisgood 63d ago

> same input, different result.

What is your point? You get this from LLMs. It does not mean that it is not useful.

[−] huflungdung 63d ago

[dead]

[−] freeone3000 63d ago

Yes! And it was bad then too!!

I want software that does a specific list of things, doesn’t change, and preferentially costs a known amount.

[−] _heimdall 63d ago

LLMs are nondeterministic by design, but that has nothing to do with A/B testing.

[−] croes 63d ago

That’s not a problem of LLMs but of using services provided by others.

How often were features changed or deactivated by cloud services?

[−] NotGMan 63d ago

By that definition humans are not professional since we hallucinate and make mistakes all the time.

[−] hrmtst93837 63d ago

[flagged]