The debugging part at this scale is harder than you would expect - behavioral drift between parallel agent instances is nearly invisible without something aggregating what they are actually doing across runs. We hit this ourselves: two agents completing the same task successfully via completely different paths, one of which quietly broke edge cases in prod. The only thing that caught it was treating the conversation traces as a dataset, not just logs.
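A minimal sketch of what "treating traces as a dataset" can look like, assuming each run's conversation trace is a JSON file of events; the directory layout and field names here are hypothetical:

```python
import json
from collections import Counter
from pathlib import Path

def tool_signature(trace_path: Path) -> tuple:
    """Reduce one agent run to the ordered sequence of tools it invoked."""
    events = json.loads(trace_path.read_text())
    return tuple(e["tool"] for e in events if e.get("type") == "tool_call")

# Group runs of the same task by behavioral path rather than by pass/fail.
signatures = Counter(tool_signature(p) for p in Path("traces/task-123").glob("*.json"))

# Two "successful" runs with different signatures are exactly the kind of
# drift that per-run logs hide; surface them for human review.
if len(signatures) > 1:
    for sig, count in signatures.most_common():
        print(f"{count} runs took path: {' -> '.join(sig)}")
```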
Imbue team member here - that's an interesting problem in general, but we haven't really run into it much. Each testing agent is asked to work on a single issue and, to our slight surprise, most of the changes merge cleanly.
When they don't merge cleanly, it's time for human intervention, and the integration step leaves a record of which branches failed to merge.
Finally, when you do need to debug individual agents:
- Because mngr is, at the low level, just managed tmux sessions (local and remote), it's very easy to attach to those sessions (mngr connect). This works even if the agent has been stopped, because mngr remembers enough about an agent to resurrect it.
- mngr message also allows you to batch-message a bunch of agents. So if you do need to resume a lot of agents, you can experiment on one agent, figure out a good prompt, and then batch-message every other agent (see the sketch below).
In this testing scenario, most agents don't actually require human intervention, and we've found that just connecting to a few individual agents to resolve problems is smooth and easy enough.
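For illustration, you could script that batch step from Python along these lines. Note that the exact mngr message invocation below is an assumption on my part, not documented syntax; check mngr's own help for the real argument order:

```python
import subprocess

# The prompt you refined interactively on one agent (hypothetical example).
prompt = "Re-run the failing test and fix only the regression."

# Hypothetical agent names; in practice you'd get this list from mngr itself.
agents = ["tester-01", "tester-02", "tester-03"]

for agent in agents:
    # Assumed invocation shape: `mngr message <agent> <text>`.
    subprocess.run(["mngr", "message", agent, prompt], check=True)
```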
Me: babysits every feature for hours in Claude Code, building a good plan but then still iterating many, many times over things that need to be fixed and tweaked before the feature can be called done.
Bloggers: Here's how we use 3,000 parallel agents to write, test, and ship a new feature to production every 17 minutes in an 8M-LOC codebase (all agent-generated!).
... Am I doing something wrong, or are other people doing something wrong?
> 8M-LOC codebase

I think this is the difference. These toy examples of using parallel agents are *not* running against large codebases, allowing them to iterate more effectively. Once you are in real codebases (>1M LoC), these systems break down.
(author here) I strongly agree that these systems start to break down once the codebase gets larger (we've seen that with our own projects).
But our reaction to it has been to say "ok, well the best practice in software engineering is to make small, well-isolated components anyway, so what if we did that?"
We've been trying to really break things apart into smaller pieces (and that's even evident in mngr, where much of the code is split out into separate plugins), and have been having a ton of success with it.
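As a toy sketch of that shape (a generic plugin registry, not mngr's actual plugin API):

```python
from typing import Callable, Dict

# Keep the core tiny: each capability registers itself and can be
# developed, tested, and reasoned about in isolation.
PLUGINS: Dict[str, Callable[[str], str]] = {}

def plugin(name: str):
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        PLUGINS[name] = fn
        return fn
    return register

@plugin("summarize")
def summarize(task: str) -> str:
    return f"summary of {task}"

def run(name: str, task: str) -> str:
    return PLUGINS[name](task)
```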
I realize that might not be an option for more brownfield / existing / legacy projects, but when making something new, I've really been enjoying this way of building things.
If this is the future of software, in 20 years nobody will understand what the hell software actually does. And if nobody understands it, things will implode quickly.
the thing that actually burns token budget at scale isn't the agent count itself, it's how the agents are orchestrated. 100 agents running in parallel is fine if they're short-lived queries. but once you start running them on a schedule (hourly checks, overnight batch work), the math changes fast.
each agent run against a real codebase probably spends 20-50k tokens just on context: repo structure, relevant files, recent changes. multiply that by 100 agents running every hour across 10-20 repos, and you're already hitting millions of tokens a day before any actual work happens. add in re-runs for failures or retries, and the cost curve gets steep quickly.
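rough version of that math (the 35k midpoint, the 20% retry factor, and the $3 per million input tokens are placeholder assumptions, not real rates):

```python
# Context overhead per agent run, before any useful work happens.
tokens_per_run = 35_000   # midpoint of the 20-50k estimate above
agents = 100
runs_per_day = 24         # hourly schedule
retry_factor = 1.2        # assume ~20% of runs are retried

daily_tokens = tokens_per_run * agents * runs_per_day * retry_factor
print(f"{daily_tokens / 1e6:.0f}M tokens/day")  # ~101M tokens/day

# At a placeholder input price of $3 per million tokens:
print(f"~${daily_tokens / 1e6 * 3:,.0f}/day on context alone")  # ~$302/day
```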
the harder problem is observability. with one agent you can read logs and understand what went wrong. with 100 agents you need aggregation, pattern detection, alerting on the common failure modes. if 3 agents fail silently but identically, was that a real issue or just rate limiting? if 40 agents all timeout at the same step, was it a dependency problem or infrastructure saturation? at scale you're debugging distributions, not individual runs.
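a sketch of what that aggregation can look like, assuming each run emits a structured record (the field names here are made up):

```python
from collections import Counter

# Hypothetical per-run records; in practice these come from your logs.
runs = [
    {"agent": "a1", "step": "fetch_deps", "error": "timeout"},
    {"agent": "a2", "step": "fetch_deps", "error": "timeout"},
    {"agent": "a3", "step": "apply_patch", "error": None},
    # ... one record per agent run
]

# Debug the distribution: cluster failures by (step, error) instead of
# reading individual logs one at a time.
failures = Counter((r["step"], r["error"]) for r in runs if r["error"])
for (step, error), count in failures.most_common():
    print(f"{count} agents failed at {step} with {error}")
    # many identical timeouts at one step points at infrastructure or a
    # shared dependency, not that many independent agent bugs.
```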
also helps to be ruthless about concurrency. the async pattern isn't "run as many as possible at once"—it's "run exactly as many as the API and your budget can support without making the failure modes harder to diagnose." for claude api work that's usually smaller than people expect.
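in asyncio terms, that means a hard semaphore cap instead of an unbounded gather (the limit of 8 is an arbitrary example, tune it to your API tier and budget):

```python
import asyncio

MAX_CONCURRENT = 8  # deliberately small; raise only while diagnosis stays easy

async def run_agent(task: str, sem: asyncio.Semaphore) -> str:
    async with sem:
        await asyncio.sleep(1)  # placeholder for the actual agent/API call
        return f"done: {task}"

async def main(tasks: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    return await asyncio.gather(*(run_agent(t, sem) for t in tasks))

results = asyncio.run(main([f"task-{i}" for i in range(100)]))
```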
Curious how people and companies like this are approaching matters of intellectual property, now that the courts have ruled that basically no part of AI-generated content or code is copyrightable, and it is therefore impossible to claim ownership of.
Are people just not going to open source anything anymore since licenses don't matter? Might as well just keep the code secret, right?
I understand that the natural instinct is to correct the output when you see your agent doing something wrong.
That is not productive.
The instinct should be to tweak the agent to do it right.
At this point I'm writing almost no code in an enterprise codebase.
> Finally, remember that mngr runs your agent in a tmux session
what the hell?