The hooks performance finding matches what I've seen. I run multiple Claude Code agents in parallel on a remote VM and the first thing I learned was that anything blocking in the agent's critical path kills throughput. Even a few hundred milliseconds per hook call compounds fast when you have agents making dozens of tool calls per minute.
The docker-based service pattern is smart too. I went a different direction for my own setup -- tmux sessions with worktree isolation per agent, which keeps things lightweight but means I have zero observability into what each agent is actually doing beyond tailing logs manually. This solves that gap in a way that doesn't add overhead to the agent itself, which is the right tradeoff.
Curious about one thing -- how does the dashboard handle the case where a sub-agent spawns its own sub-agents? Does it track the full tree or just one level deep?
Sub-agent trees are fully tracked by the dashboard. When an agent is spawned, it always has a parent agent id - Claude sends this in the hook payload. When you mouse over an agent in the dashboard, it shows which agent spawned it. There currently isn't a tree view of agents in the UI, but it would be easy to add. The data is all there.
[Edit] When Claude spawns sub-agents, they inherit the parent's hooks, so all sub-agent activity gets logged by default.
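If you want a tree view before it lands in the UI, the flat event log is enough to rebuild it yourself. A rough sketch - the field names (`agent_id`, `parent_agent_id`, `name`) and the `events.jsonl` filename are placeholders for whatever the log actually uses, not the real schema:

```python
# Rebuild the sub-agent spawn tree from a flat jsonl event log.
# Field names are assumed placeholders -- adjust to the actual payload keys.
import json
from collections import defaultdict

def load_events(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def print_tree(events):
    children = defaultdict(list)  # parent id -> ordered child ids
    names = {}
    for e in events:
        agent = e.get("agent_id")
        if agent is None:
            continue
        names.setdefault(agent, e.get("name", agent))
        parent = e.get("parent_agent_id")  # None for the top-level session
        if agent not in children[parent]:
            children[parent].append(agent)

    def walk(agent, depth):
        print("  " * depth + str(names.get(agent, agent)))
        for child in children.get(agent, []):
            walk(child, depth + 1)

    for root in children.get(None, []):
        walk(root, 0)

if __name__ == "__main__":
    print_tree(load_events("events.jsonl"))
```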
Are you guys spending hundreds (or thousands) of dollars a day on Claude tokens? Holy crap. I can't get more than one or two agents to do anything useful for very long before I'm hitting my usage limits.
I'm in a great situation where I've been piloting Claude for the company among a small group of others. I've been obsessed with pushing the limits of how many sessions and agents I can have working at a time. We threw some work at Gas Town and another orchestrator, but they felt too rigid and opinionated for my liking. But I'm biased - I want to build my own eventually.
When I go home to my $20 plan I am sad and annoyed, but I don't want to pay more for what is good enough for me to work a bit at a time - it works as a good pomodoro timer for me personally.
Something like this is perfect for some of the issues I've wanted to solve - a command and control tool with malleable visuals.
I hit a lot of limits on the Pro plan. Upgraded to the Max $200/mo plan and haven't hit limits in a while.
OP: This is cool, thank you for sharing.
It's super important to check your plugins or use a proxy to inspect raw prompts. If you have a lot of skills and plugins installed, you'll burn through tokens 5-10x faster than normal.
Also have Claude use sub-agents and agent teams. They're significantly lighter on token usage when they're spawned with fresh context windows. You can see in the Agents Observe dashboard exactly what prompt and response Claude uses when spawning sub-agents.
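On the proxy route: something like a mitmproxy addon is enough to eyeball how heavy each request is (body bytes, not exact tokens, so treat it as a rough signal). You'd still have to point the CLI at the proxy and trust mitmproxy's CA - setup details I'm assuming, not anything specific to this tool:

```python
# prompt_size.py -- run with: mitmdump -s prompt_size.py
# Logs the request body size of each Anthropic API call so you can see
# how much prompt weight installed skills/plugins are adding.
from mitmproxy import http

def request(flow: http.HTTPFlow) -> None:
    if "anthropic.com" in flow.request.pretty_host:
        size = len(flow.request.content or b"")
        print(f"{flow.request.method} {flow.request.path}  body: {size / 1024:.1f} KiB")
```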
I tried using hooks to set up my DIY version of what Channels is now in Claude. I had Claude writing them without really looking at the results, because the vibes were strong, and it struggled with odd behaviors around them. Nice to see some of the possible reasons - I ended up killing that branch of work, so I never figured out exactly what was happening.
Now I'm regretting not going deeper on these. This is the type of interface that I think will be perfect for some things I want to demonstrate to a greater audience.
Now that we have the actual internals I have so many things I want to dig through.
This is exactly what I needed. Running 4 autonomous marketing agents (content, engagement, learning, strategy), and the hardest part is visibility into what they're doing. I currently have a custom daily activity summary, but it's basic.
How do you handle the case where agents are running fine but producing bad outputs? We had an issue where the quality scorer's centroids went stale and the agent kept posting content that scored "ok" internally but got zero real engagement.
This is what I've been missing running multi-agent ops through OpenClaw.
The opacity problem is the one I hit hard: when a coordinator spawns 3-4 agents in parallel (builder, reviewer, tester, each with their own tool calls), the only visibility you have is what they choose to report back. Which is often sanitised and … dangerously optimistic.
The role separation / independent verification structure I run helps catch bad outputs, but it doesn't give me the live timeline of HOW an agent got to a conclusion. That's why I find this genuinely useful.
Noticed OpenClaw is already on the roadmap - had my hands tingling to fork and adapt it. Starred it for now and added it to my watchlist. The hook architecture should translate … OpenClaw fires session events that could feed the same pipeline. Looking forward to seeing that happen.
This looks to solve something I've been struggling with in my project, Sugar (1). Using the SDK and having sub-agents running, I found it difficult to have real-time insight into exactly what they were doing.
You can create a huge task list and Ralph mode can crank through it and also store persistent memory.
Interested in trying them together.
1. https://github.com/roboticforce/sugar
> Claude code hooks are blocking - performance degrades rapidly if you have a lot of plugins that use hooks
can confirm. ended up being really careful about what runs synchronously vs in the background.
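the shape that works is a hook that does almost nothing synchronously: slurp the event JSON from stdin, hand it to a detached worker, exit 0. roughly this - the worker path is made up, and the assumption is that your hook is wired up as a command that gets the event on stdin:

```python
#!/usr/bin/env python3
# Hook entry point: keep the synchronous path tiny so the agent isn't
# blocked on network calls or disk churn. The slow work happens in a
# detached worker process. Worker path is illustrative, not a real tool.
import subprocess
import sys

def main() -> int:
    event = sys.stdin.buffer.read()
    worker = subprocess.Popen(
        [sys.executable, "/opt/hooks/slow_worker.py"],  # hypothetical worker script
        stdin=subprocess.PIPE,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # detach so the hook can exit immediately
    )
    worker.stdin.write(event)
    worker.stdin.close()
    return 0  # always exit cleanly; a failing hook shouldn't stall the agent

if __name__ == "__main__":
    sys.exit(main())
```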
IMHO the "sanitised optimism" thing others mention here is real too. had to add explicit verification steps because Claude kept reporting success when it just silenced the error. now I always make it prove things actually worked before moving on.
The "sanitised optimism" problem is real. I've seen agents report "fixed!" when they just suppressed the error.
Role separation (builder/reviewer/tester) helps but the reviewer agent also tends to be too polite. Making the reviewer explicitly output PASS/FAIL/UNKNOWN with no room for "looks good overall" is the only thing that worked for me.
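Concretely, the contract is easy to enforce on the consuming side: scan the review for an explicit verdict token and default to UNKNOWN, so "looks good overall" never counts as a pass. Toy version, just to illustrate the rule (not from any of these tools):

```python
import re

VERDICT_RE = re.compile(r"\b(PASS|FAIL|UNKNOWN)\b")

def parse_verdict(review_text: str) -> str:
    """Return the reviewer's last explicit verdict; default to UNKNOWN.

    UNKNOWN is deliberately the fallback: polite prose with no verdict
    token should never count as a pass.
    """
    hits = VERDICT_RE.findall(review_text.upper())
    return hits[-1] if hits else "UNKNOWN"

assert parse_verdict("Looks good overall!") == "UNKNOWN"
assert parse_verdict("Tests added and passing. Verdict: PASS") == "PASS"
```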
The sanitised optimism problem mentioned upthread is the real gap here. Event stream logging tells you what tools were called and in what order, but it doesn't tell you whether the agent's self-reported outcome matches reality.
Good to know background hooks make that much of a difference. How are you handling the case where multiple agent teams are writing to the same jsonl files simultaneously?
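Not the author, but the usual answers for shared jsonl files are either one file per agent, or appending each record as a single write while holding an advisory lock so concurrent writers can't interleave partial lines. Unix-only sketch (fcntl), with made-up paths and field names:

```python
# Interleave-safe append to a shared jsonl file across multiple agent
# processes: one lock, one write, one line per record.
import fcntl
import json
import os

def append_event(path: str, event: dict) -> None:
    line = json.dumps(event, separators=(",", ":")) + "\n"
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # advisory lock: all writers must cooperate
        os.write(fd, line.encode())
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

append_event("/tmp/agents/events.jsonl", {"agent_id": "demo", "tool": "Bash"})
```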