Launch HN: Twill.ai (YC S25) – Delegate to cloud agents, get back PRs

[−] hardsnow 35d ago

I’ve been developing an open-source version of something similar[1] and used it quite extensively (well over 1k PRs)[2]. I’m definitely believer of the “prompt to PR model”. Very liberating to not have to think about managing the agent sessions. Seems that you have built a lot of useful tooling (e.g., session videos) around this core idea.

Couple of learnings to share that I hope could be of use:

1) Execution sandboxing is just the start. For any enterprise usage you want fairly tight network egress control as well to limit chances of accidental leaks or malicious exfiltration if theres any risk of untrusted material getting into model context. Speaking as a decision maker at a tech company we do actually review stuff like this when evaluating tools.

2) Once you have proper network sandboxing, you could secure credentials much better: give agent only dummy surrogates and swap them to real creds on the way out.

3) Sandboxed agents with automatic provisioning of workspace from git can be used for more than just development tasks. In fact, it might be easier to find initial traction with a more constrained and thus predictable tasks. E.g., “ask my codebase” or “debug CI failures”.

[1] https://airut.org [2] https://haulos.com/blog/building-agents-over-email/

[−] willydouhard 35d ago

Willy from Twill here.

I love the idea of emailing agents like we email humans! Thank you for sharing your learnings:

1. Network constraints vary quite a bit from one enterprise customer to another, so right now this is something we handle on a case-by-case basis with them.

2. We came to the same conclusion. For sensitive credentials like LLM API keys, we generate ephemeral keys so the real keys never touch the sandbox.

3. Totally right, we support constrained tasks too (ask mode, automated CI fixes). We've gone back and forth on whether to go vertical-first or stay generic. We're still figuring out where the sweet spot is. The constrained tasks are more reliable today, but the open-ended ones are where teams get the most leverage.

[−] mike_hearn 34d ago

Very cool. I've been putting together something very similar, although mine only does email and not Slack. Also it uses Codex not Claude Code, and just relies on ordinary UNIX user isolation rather than containers that are created/destroyed for every request. I just issue it with restricted API keys and rely on the fact that most products already allow humans to be 'sandboxed' via ordinary permissions.

I've also (separately) got a tool for local dev that sets up containers and does SSL interception on traffic from the agent, so it could also swap creds and similar.

The reason they're separate is that in a corp environment the expectation is very strongly that an email account = a human. You can't easily provision full employee accounts for AIs, HR doesn't know anything about that :) In my own company I am HR, so that's not a problem.

[−] rachel_rig 33d ago

[flagged]

[−] cowmanOG 31d ago

Very cool indeed. I'm building something similar with a more minimal setup: https://github.com/lkoelman/githueber . The goal is to have a local daemon watching your github projects and spawning resumable sessions in OpenCode|Claude|Codex , so you can drop in and pick up. Will check out your repo for the Claude side.

[−] 2001zhaozhao 35d ago

24/7 running coding agents are pretty clearly the direction the industry is going now. I think we'll need either on-premises or cloud solutions, since obviously if you need an agent to run 24/7 then it can't live on your laptop.

Obviously cloud is better for making money, and some kind of VPC or local cloud solution is best for enterprise, but perhaps for individual devs, a self-hosted system on a home desktop computer running 24/7 (hybrid desktop / server) would be the best solution?

[−] zingar 34d ago

Optimising to keep the coding going 24/7 feels like a local optimisation trap. The amount of code that can be written by coding agents in normal working hours dwarfs what humans can productively describe and assess.

My efforts will be in improving agentic requirements gathering and assessment.

[−] piker 35d ago

> 24/7 running coding agents are pretty clearly the direction the industry is going now.

This assertion needs some support for those of us that don't have a macro insight into the industry. Are you seeing this from within FAANG shops? As a solo developer? What? Honest question.

[−] 2001zhaozhao 35d ago

I'm speaking from my daily experience. Sometimes i don't want to close my laptop before going to bed because there are still 1-2 tasks ongoing in my AI kanban board, so I just leave my laptop open (lock but not suspend it) so that the agents keep working for a while. I don't even have things all that automated.

I anticipate that once I have some more complex agentic scaffolds set up to do things like automatically explore promising directions for the project, then leaving the AI system on overnight becomes a necessity.

[−] gabriel-uribe 34d ago

100% this.

I also have Claude Cowork automations running constantly. As-is, I can't shut down my laptop, and it gets frustrating when my laptop is in my backpack all day because of commutes or travel.

[−] willydouhard 34d ago

Yes or when you get good feedback/idea talking to someone, being able to spawn tasks from your phone makes everything much faster

[−] danoandco 35d ago

For a solo dev running one task at a time, a beefy desktop overnight is totally viable. We see a lot of this with the Mac Mini hype

Cloud starts to matter when you want to (a) run a swarm of agents on multiple independent tasks in parallel, (b) share agents across a team, or (c) not worry about keeping a machine online

[−] 2001zhaozhao 35d ago

I would point out that a beefy desktop is probably faster at compiling code than a typical cloud instance simply due to more CPU performance. So maybe up to 10-ish concurrent agents it's faster to use a local desktop than a cloud instance, and then you start to get into the territory where multiple agents are compiling code at the same time, and the cloud setup starts to win. (That's assuming the codebase takes a while to compile and pegs your CPU at 100% while doing so. If the codebase is faster to compile or uses fewer threads, then the breakeven agent count is even higher.)

Other than that, I agree with what you said. I don't know what the tradeoffs for local on-premises and cloud agents are in terms of other areas like convenience, but I do think that scalability in the cloud is a big advantage.

[−] danoandco 35d ago

Totally right on the compile time. CIs have the same bottleneck, and the ecosystem is working on fixing this (faster cpus, better caching) in both coding agents and CI to improve overall velocity

[−] dbbk 32d ago

I've just set up an old MacBook with tmux and ssh/mosh, you can run as many parallel Claudes in the 'cloud' as you want

[−] zingar 34d ago

And (d) not worry about toddler agents wrecking your single point of failure beefy desktop

[−] ragelink 35d ago

The core issue for me is, I don't want to trust someone else with my code, or run my stuff on their computers. I don't see serious enterprise organizations offloading something as critical to security outside their own network perimeter.

[−] _pdp_ 34d ago

To do what exactly? If the work requires 24/7 then fair enough.

[−] eranation 35d ago

Edit: just noticed this is a semi duplicate question to https://news.ycombinator.com/item?id=47723506 so rephrasing my question - will you have computer use and will you have self-hosted runners option? (you being just the controlplane / task orchestrator, which is the hardest problem apparently...)

Additional question - what types of sandboxes you use? (just docker or also firecracker etc...)

Original comment:

Congrats on the launch!

What's the benefit over cursor cloud agents with computer use? (other than preventing vendor lock in?)

https://cursor.com/blog/agent-computer-use

Or the existing Claude Code Web?

[−] willydouhard 35d ago

We already support computer use out of the box (linux sandboxes). Self hosted runners are not available yet, but Twill is built on a runtime agnostic layer (see https://github.com/TwillAI/agentbox-sdk) so it is feasible!

[−] eranation 35d ago

You definitely got my interest... will try it out!

[−] willydouhard 35d ago

Let us know how it goes!

[−] blacksoil 34d ago

Claude code has a version that runs on the cloud. You grant it access to Github then and you can tell it to make changes and create PR from your phone, tablet, or desktop. I'm curious, but what makes this difference than that?

[−] willydouhard 34d ago

The main difference is that you can pick/combine coding agent CLIs (claude code, codex, open code). There is no vendor lock-in.

[−] dbbk 33d ago

That's literally just GitHub agents?

[−] danoandco 32d ago

Yes, broadly. The main structural difference is that we’re agent-agnostic, so we can combine lab-native CLIs in one workflow. GitHub will likely struggle there because they have direct partnerships with Anthropic and OpenAI.

On the features themselves, we have a better UX across integrations, and more advanced features like video recording.

[−] dbbk 32d ago

This seems like a weak argument. GitHub is already agent (not just model) agnostic, they have Copilot and Claude Code. I just don't see how this is a business, sorry.

[−] cocoflunchy 34d ago

Great timing as I'm exploring the space to get rid of Cursor in our stack. For local dev everyone is switching to Claude Code or Codex. The state of the art for cloud agents in my opinion right now is Cursor. But their pricing model per-user doesn't make sense when what I want is to enable anyone in the company to fix things in the product. 2 things not immediately clear from your homepage: - do you support full computer use? Again Cursor is the best I've tried there - what kind of triggers do you support? We have in particular one automation built with cursor to auto approve PRs that are low-risk. It triggers on a specific comment on a PR Finally some advice from a user's pov: you need to invest a lot in the onboarding experience. I tried Devin today and it couldn't get it to work after one hour of fiddling. How do you store the repo's setup scripts? Cursor cloud is pretty opaque and annoying to configure on that side. Anyway I'll try it!

[−] danoandco 34d ago

On computer use: Yes. Sandboxes come with a computer-use CLI for driving Linux GUI apps via X11.

On triggers: Cron, GitHub (PRs, issues, @twill mentions in review comments), Slack, Linear, Notion, Asana webhooks, plus CLI and web. Our PR-comment workflow is you would have to tag @twill with an instruction. That being said, you can also setup a daily cron on Twill that checks PRs with a specific label like Confidence Score : x/5 and tell it to auto-approve when 5/5 for example.

On setup scripts: Per-repo entrypoint script, env vars, and ports, all accessible on the UI. There is a dedicated Dev Environment agent mode that you start with to setup the infra. You can steer the agent into how to setup if it gets stuck. So this should be smooth. The agent can also rewrite the entrypoint mid-task.

There is also a Twill skill you can add to your local agents to dispatch tasks to Twill. Meaning you can research and plan locally using your CLI and delegate the implementation to a sandbox on Twill.

[−] wcdolphin 34d ago

I’ve been hacking on something in this vein and would love your feedback. What if you could reuse your CI env by using Github Actions as your sandbox. You can reuse the caching, any oidc based roles and self host via runs-on.com for cost and performance. We expose a claude code web experience of interactive low latency chat. I have a working prototype I’m happy to share if you think it would be interesting.

[−] willydouhard 34d ago

This is very convenient but has limitations. GitHub actions are not built to resume state (conversations in our case) and handle multi player experiences.

However reusing the GitHub workflows out of the box feels really nice

[−] crohr 34d ago

I’m the founder of runs-on.com, we should talk!

[−] cocoflunchy 34d ago

Sent you some feedback from the app, I can't get GitHub to connect. Feel free to contact me over email to troubleshoot!

[−] danoandco 34d ago

Mmh this works on my end. Sending you an email. Ty

[−] glompylabs 34d ago

Running agents on a home server (Claude Code via systemd timers, not cloud sandboxes) has been interesting. Over 700+ autonomous sessions so far.

The biggest practical difference from cloud solutions: agents that run on your own machine can interact with your actual environment. Our agents browse the web, manage a Discord bot, push to git repos, and read email. They share a filesystem so one agent's output is another agent's input.

The tradeoff is obvious: no isolation, no scaling, and if your home server goes down, everything stops. But for a single developer who wants an AI that actually does things (not just produces PRs), local gives you reach that sandboxed cloud agents cannot.

The "prompt to PR" model is clean for dev work. For everything else (marketing, monitoring, data collection, content creation), agents need to touch the real world, and that is harder to sandbox.

[−] danoandco 32d ago

Definitely and Twill is for SWE delegation first, not so much the “general agent on my machine.”

[−] dennisy 35d ago

Congrats on the launch, the agentbox-sdk looks interesting, but seeing as the first commit was 3 days ago - I feel a little wary to use it just yet!

One question, do you have plans for any other forms of sandboxing that are a little more "lightweight"?

Also how do you add more agent types, do you support just ACP?

[−] willydouhard 35d ago

Thank you! agentbox-sdk is very recent so it is not stable just yet indeed!

For the lightweight sandbox, can you give an example?

Currently we support main coding CLIs, ACP support is not shipped yet.

[−] dennisy 34d ago

I was thinking something that runs in the same process, and does not require docker or a third party API.

For example Monty by the pydantic team, or the Anthropic sandbox which I believe uses OS level primitives.

[−] willydouhard 32d ago

Got it! Yes this is definitely doable, added in the roadmap. We will embed the roadmap in the readme soon.

[−] woeirua 34d ago

I think Cloud Agents are the future, but I’ll be honest I don’t see how a third party provider survives in this space.

1. It’s really not that hard to stand this up on your own. GitHub agentic workflows gets you 95% of the way there already. 2. Anthropic and Cursor are already playing in this space and likely will eat your lunch.

IMO, the only way you can survive is to make this deployable behind the firewall. If you could do that then I would seriously consider using your product.

[−] qainsights 34d ago

Cool. Tried my side project ai.dosa.dev to create an utility; it did good. PR https://github.com/QAInsights/awesome-ai-tools/pull/23

[−] senordevnyc 35d ago

How does this compare to something like Cursor Cloud Agents with a solid set of skills and tools?

[−] kuzivaai 34d ago

"The agent can't skip steps" is doing a lot of work in that sentence. What happens when the plan itself is wrong? Curious whether the approval gate is genuinely blocking or if teams end up rubber-stamping to avoid being the bottleneck.

[−] ibrahimhossain 34d ago

Cloud sandboxes for persistence and parallelization is a smart move. Local setups hit those exact walls very quickly. That feels like the right long term bet as the underlying models keep improving

[−] sschlegel 34d ago

Just checked the demo, it looks super interesting. how can i make sure, it doesnt burn through endless tokens / credits when i let it work independently? Thanks

[−] eranation 35d ago

HN hug of death probably, but your scorecard returns an error :(

The analysis request failed.

Hosted shell completed without parseable score_repo.py JSON output. 11 command(s), 11 output(s). (rest redacted)

[−] Chrisszz 34d ago

It's cool, and I completely get what you mean, I am curious to know how it differs from cloud agents by others like Cursor, Anthropic and Warp.

[−] Mr_P 35d ago

How does this compare to Claude Managed Agents?

[−] hmokiguess 35d ago

> Run the same agent n times to increase success rate.

Are there benchmarks out there that back this claim?

[−] dbbk 33d ago

There are billions of these. What's different?

[−] a_t48 35d ago

Does it support running Docker images inside the sandbox?

[−] zackify 34d ago

I strongly believe all of these projects are unnecessary.

Install LXC on a server Start a container called dev

Add 25 lines to your zshrc

I say dev1 and it spins up fresh

Dev2 copies from that and is a fresh container.

Auto uses tmux.

Claude code with bypass mode. Do anything. Close laptop. Come back later.

Even have a lock mode blocking all internet access except to the llm provider.

Ssh key agent forwarding through 1pw CLI so it can't even push to github unless I reconnect.

I feel like the Dropbox quote years ago but its a lot easier than people think, and its weird to delegate to another service something that a dev should already understand how to do.

All I have to do to have the same issue to PR flow is open dev1. Open Claude. Have GH CLI and task system MCP.

Do /loop watch for new ticket assigned to me and complete it and push it up.

[−] wordpad 35d ago

How does this compare to Jules from Google?

[−] auszeph 35d ago

I built an internal version of this for my workplace.

Something very useful that will be harder for you most likely is code search. Having a proper index over hundreds of code repos so the agent can find where code is called from or work out what the user means when they use an acronym or slightly incorrect name.

It's quite nice to use and I'm sure someone will make a strong commercial offering. Good luck

[−] gbnwl 35d ago

So instead of using my Claude Code subscription, I can pay the vastly higher API rates to you so you can run Claude Code for me?

[−] mc-serious 34d ago

[flagged]

[−] Jorda-dev 35d ago

[flagged]

[−] j_gonzalez 35d ago

[flagged]

Launch HN: Twill.ai (YC S25) – Delegate to cloud agents, get back PRs (twill.ai)

95 comments