I’ve been developing an open-source version of something similar[1] and have used it quite extensively (well over 1k PRs)[2]. I’m definitely a believer in the “prompt to PR” model. It’s very liberating not to have to think about managing the agent sessions. It seems you have built a lot of useful tooling (e.g., session videos) around this core idea.
A couple of learnings to share that I hope could be of use:
1) Execution sandboxing is just the start. For any enterprise usage you also want fairly tight network egress control, to limit the chances of accidental leaks or malicious exfiltration if there's any risk of untrusted material getting into the model context. Speaking as a decision maker at a tech company, we do actually review stuff like this when evaluating tools.
2) Once you have proper network sandboxing, you can secure credentials much better: give the agent only dummy surrogates and swap them for the real creds on the way out.
3) Sandboxed agents with automatic provisioning of a workspace from git can be used for more than just development tasks. In fact, it might be easier to find initial traction with more constrained and thus more predictable tasks, e.g., “ask my codebase” or “debug CI failures”.
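Points 1 and 2 can be illustrated together. The following is only a minimal sketch of the idea, not how any particular product does it: an egress proxy checks the destination against an allowlist first and only then swaps the surrogate token for the real one. The host names, the `DUMMY_GH_TOKEN` surrogate, and all function names are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist and surrogate mapping; all values are illustrative.
ALLOWED_HOSTS = {"api.github.com", "pypi.org"}
SURROGATES = {"DUMMY_GH_TOKEN": "real-gh-token-from-vault"}


def egress_allowed(url: str) -> bool:
    """Allow only HTTPS requests to hosts on the allowlist."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS


def swap_credentials(headers: dict[str, str]) -> dict[str, str]:
    """Replace surrogate tokens in header values with the real secrets."""
    swapped = {}
    for name, value in headers.items():
        for dummy, real in SURROGATES.items():
            value = value.replace(dummy, real)
        swapped[name] = value
    return swapped


def proxy_outbound(url: str, headers: dict[str, str]) -> dict[str, str]:
    """Return the headers to forward, or raise if the destination is blocked."""
    if not egress_allowed(url):
        raise PermissionError(f"egress blocked: {url}")
    # The swap happens only after the allowlist check, so a surrogate sent
    # to an unapproved host is never turned into a real credential.
    return swap_credentials(headers)
```

The agent inside the sandbox only ever sees `DUMMY_GH_TOKEN`, so even if untrusted material talks it into leaking its environment, the leaked value is useless outside the proxy.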
24/7 running coding agents are pretty clearly the direction the industry is going now. I think we'll need either on-premises or cloud solutions, since obviously if you need an agent to run 24/7 then it can't live on your laptop.
Obviously cloud is better for making money, and some kind of VPC or local cloud solution is best for enterprise, but perhaps for individual devs, a self-hosted system on a home desktop computer running 24/7 (hybrid desktop / server) would be the best solution?
Edit: just noticed this is a semi-duplicate of https://news.ycombinator.com/item?id=47723506, so rephrasing my question: will you have computer use, and will you have a self-hosted runner option? (You would be just the control plane / task orchestrator, which is apparently the hardest problem...)
Additional question: what types of sandboxes do you use? (Just Docker, or also Firecracker, etc.?)
Original comment:
Congrats on the launch!
What's the benefit over Cursor cloud agents with computer use? (Other than preventing vendor lock-in?)
Claude Code has a version that runs in the cloud. You grant it access to GitHub, and then you can tell it to make changes and create PRs from your phone, tablet, or desktop. I'm curious: what makes this different from that?
Great timing as I'm exploring the space to get rid of Cursor in our stack.
For local dev everyone is switching to Claude Code or Codex.
The state of the art for cloud agents right now is, in my opinion, Cursor. But their per-user pricing model doesn't make sense when what I want is to enable anyone in the company to fix things in the product.
2 things not immediately clear from your homepage:
- Do you support full computer use? Again, Cursor is the best I've tried there.
- What kind of triggers do you support? In particular, we have one automation built with Cursor to auto-approve low-risk PRs. It triggers on a specific comment on a PR.
Finally, some advice from a user's POV: you need to invest a lot in the onboarding experience. I tried Devin today and couldn't get it to work after an hour of fiddling. How do you store the repo's setup scripts? Cursor cloud is pretty opaque and annoying to configure on that side.
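For context, a comment-based trigger like the one described can be sketched as a predicate over a GitHub webhook delivery. This is an illustrative sketch, not how Cursor or the launched product implements it; the `/auto-approve` phrase and the function name are hypothetical. It relies on GitHub delivering PR comments as `issue_comment` events whose `issue` object carries a `pull_request` key.

```python
TRIGGER_PHRASE = "/auto-approve"  # hypothetical trigger comment


def should_trigger(event_name: str, payload: dict) -> bool:
    """True for a newly created PR comment containing the trigger phrase."""
    if event_name != "issue_comment" or payload.get("action") != "created":
        return False
    # GitHub delivers PR comments as issue comments; the issue object has a
    # "pull_request" key only when the issue is actually a pull request.
    if "pull_request" not in payload.get("issue", {}):
        return False
    body = payload.get("comment", {}).get("body", "")
    return TRIGGER_PHRASE in body
```

In a real setup the risk assessment itself would run after this check; the predicate only decides whether the automation wakes up.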
Anyway I'll try it!
Running agents on a home server (Claude Code via systemd timers, not cloud sandboxes) has been interesting. 700+ autonomous sessions so far.
The biggest practical difference from cloud solutions: agents that run on your own machine can interact with your actual environment. Our agents browse the web, manage a Discord bot, push to git repos, and read email. They share a filesystem so one agent's output is another agent's input.
The tradeoff is obvious: no isolation, no scaling, and if your home server goes down, everything stops. But for a single developer who wants an AI that actually does things (not just produces PRs), local gives you reach that sandboxed cloud agents cannot.
The "prompt to PR" model is clean for dev work. For everything else (marketing, monitoring, data collection, content creation), agents need to touch the real world, and that is harder to sandbox.
I think Cloud Agents are the future, but I’ll be honest I don’t see how a third party provider survives in this space.
1. It’s really not that hard to stand this up on your own. GitHub agentic workflows gets you 95% of the way there already.
2. Anthropic and Cursor are already playing in this space and likely will eat your lunch.
IMO, the only way you can survive is to make this deployable behind the firewall. If you could do that then I would seriously consider using your product.
"The agent can't skip steps" is doing a lot of work in that sentence. What happens when the plan itself is wrong? Curious whether the approval gate is genuinely blocking or if teams end up rubber-stamping to avoid being the bottleneck.
Cloud sandboxes for persistence and parallelization are a smart move. Local setups hit those exact walls very quickly. That feels like the right long-term bet as the underlying models keep improving.
Just checked the demo, it looks super interesting. How can I make sure it doesn't burn through endless tokens / credits when I let it work independently? Thanks
I strongly believe all of these projects are unnecessary.
Install LXC on a server
Start a container called dev
Add 25 lines to your zshrc
I say dev1 and it spins up fresh
Dev2 copies from that and is a fresh container.
Auto uses tmux.
Claude Code with bypass mode. Do anything. Close laptop. Come back later.
Even have a lock mode blocking all internet access except to the LLM provider.
SSH key agent forwarding through the 1pw CLI, so it can't even push to GitHub unless I reconnect.
I feel like the Dropbox quote from years ago, but it's a lot easier than people think, and it's weird to delegate to another service something that a dev should already understand how to do.
All I have to do to have the same issue to PR flow is open dev1. Open Claude. Have GH CLI and task system MCP.
Do /loop watch for new ticket assigned to me and complete it and push it up.
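The shell helper itself isn't shown, so the following is only a guess at its shape, written in Python rather than zsh. It assumes the LXD-style `lxc` client (`lxc copy`, `lxc start`, `lxc exec`) and a stopped template container named `dev`; classic LXC tooling uses different commands. The function name and the `main` tmux session name are hypothetical.

```python
BASE_CONTAINER = "dev"  # assumed name of the template container


def dev_commands(name: str, base: str = BASE_CONTAINER) -> list[list[str]]:
    """Commands to clone `base` into `name`, start it, and attach via tmux."""
    return [
        ["lxc", "copy", base, name],
        ["lxc", "start", name],
        # `tmux new-session -A` attaches to the session if it already exists,
        # so reconnecting after closing the laptop lands in the same session.
        ["lxc", "exec", name, "--", "tmux", "new-session", "-A", "-s", "main"],
    ]
```

Passing each list to `subprocess.run(cmd, check=True)` in order would give roughly the "say dev1 and it spins up fresh" behavior described above.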
I built an internal version of this for my workplace.
Something very useful that will be harder for you most likely is code search. Having a proper index over hundreds of code repos so the agent can find where code is called from or work out what the user means when they use an acronym or slightly incorrect name.
It's quite nice to use and I'm sure someone will make a strong commercial offering. Good luck
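To make the code search point concrete, here is a toy sketch of an identifier index with fuzzy lookup, using only the standard library. It is nowhere near a real multi-repo index (no parsing, no call graph) and is not the internal tool described above; the function names and the sample files in the usage note are hypothetical.

```python
import difflib
import re
from collections import defaultdict
from typing import Optional

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_index(files: dict[str, str]) -> dict[str, list[tuple[str, int]]]:
    """Map each identifier to the (path, line number) pairs where it appears."""
    index = defaultdict(list)
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            # Deduplicate within a line so each location is recorded once.
            for ident in set(IDENT.findall(line)):
                index[ident].append((path, lineno))
    return dict(index)


def lookup(
    index: dict[str, list[tuple[str, int]]], name: str, cutoff: float = 0.7
) -> tuple[Optional[str], list[tuple[str, int]]]:
    """Return locations for `name`, falling back to the closest identifier."""
    if name in index:
        return name, index[name]
    close = difflib.get_close_matches(name, list(index), n=1, cutoff=cutoff)
    if not close:
        return None, []
    return close[0], index[close[0]]
```

With an index built from `{"billing/api.py": "def charge_customer(cid): ..."}`, a lookup for the misspelled `charge_custmer` resolves to `charge_customer` and its locations, which is the "slightly incorrect name" case.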
[1] https://airut.org [2] https://haulos.com/blog/building-agents-over-email/
What's the benefit over Cursor cloud agents with computer use? (Other than preventing vendor lock-in?)
https://cursor.com/blog/agent-computer-use
Or the existing Claude Code Web?
One question, do you have plans for any other forms of sandboxing that are a little more "lightweight"?
Also how do you add more agent types, do you support just ACP?
The analysis request failed.
Hosted shell completed without parseable score_repo.py JSON output. 11 command(s), 11 output(s). (rest redacted)
> Run the same agent n times to increase success rate.
Are there benchmarks out there that back this claim?
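This is not a benchmark, but the arithmetic behind the claim is easy to state. If each run succeeds independently with probability p, the chance that at least one of n runs succeeds is 1 - (1 - p)^n. The sketch below assumes independence (runs of the same agent on the same task are likely correlated, so real gains would be smaller) and assumes something like a test suite can pick out the successful run. The function name is hypothetical.

```python
def p_any_success(p: float, n: int) -> float:
    """Probability that at least one of n independent runs succeeds."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n
```

For example, `p_any_success(0.5, 3)` is 0.875 under these assumptions: three coin-flip attempts give a 7-in-8 chance that at least one works.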