Launch HN: Freestyle – Sandboxes for Coding Agents (freestyle.sh)

by benswerd 158 comments 322 points

[−] TheTaytay 39d ago
Wow, forking memory along with disk space this quickly is fascinating! That's something that I haven't seen from your competitors.

If the machine can fork itself, it could allow for some really neat auto-forking workflows where you fuzz the UI testing of a website by forking at every decision point. I forget the name of the recent model that used only video as its latent space to control computers and cars, but they had an impressive demo where they fuzzed a bank interface by doing this, and it ended up with an impressive number of permutations of reachable UI states.

[−] benswerd 39d ago
That’s what I’m hoping for!
[−] _pdp_ 39d ago
Nice work.

However, 50 concurrent VMs is not a lot. Similar limits exist on all cloud providers, except perhaps in AWS where the cost is prohibitive and it is slow.

Earlier this year, we ended up rolling our own. It is nothing special. We keep X number of machines in a warm pool. Everything is backed by a cluster of Firecracker VMs. There is no boot time that we care about. Every new sandbox gets a VM instantaneously as long as the pool is healthy.
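
For illustration only, a minimal sketch of the warm-pool idea described above. The WarmPool class and the boot callable are hypothetical names, not anyone's actual implementation, and a real pool would refill in the background rather than inline.

```python
from collections import deque
from itertools import count


class WarmPool:
    """Keeps `target` pre-booted sandboxes ready so acquire() rarely waits on boot."""

    def __init__(self, target, boot):
        self.target = target
        self.boot = boot      # hypothetical callable: boots one VM, returns a handle
        self.ready = deque()
        self.refill()

    def refill(self):
        # Top the pool back up to the target size (done inline here for simplicity).
        while len(self.ready) < self.target:
            self.ready.append(self.boot())

    def acquire(self):
        # Healthy pool: hand out a warm VM. Empty pool: fall back to a cold boot.
        vm = self.ready.popleft() if self.ready else self.boot()
        self.refill()
        return vm


ids = count(1)
pool = WarmPool(target=3, boot=lambda: f"vm-{next(ids)}")
first = pool.acquire()
```

The trade-off is that idle machines in the pool cost money whether or not anyone acquires them, which is the unit-economics question raised in the reply below.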

[−] kjok 39d ago
Thanks for sharing your approach!

> It is nothing special. We keep X number of machines in a warm pool.

I'd love to better understand the unit economics here. Specifically, whether cost is a meaningful factor.

The reason I ask is that many startups we've seen focus heavily on optimizing their technology to reduce cold/boot startup times. As you pointed out, perceived latency can also be improved by maintaining a warm pool of VMs.

Given that, I'm trying to determine whether it's more effective to invest in deeper technical optimizations, or to address the cold start problem by keeping a warm pool.

[−] benswerd 39d ago
50 is not heavy; what is heavy is 1000 VMs that can be paused or brought back 50 at a time within 1 second.

Though generally ya, handrolling this stuff can work at the scale of 50 VMs, it becomes a lot harder once you hit hundreds/thousands.

[−] stingraycharles 39d ago
I’m super interested since it seems like you have given everything a lot of thought and effort but I am not sure I understand it.

When I’m thinking of sandboxes, I’m thinking of isolated execution environments.

What does forking sandboxes bring me? What do your sandboxes in general bring me?

Please take this in the best possible way: I’m missing a use case example that’s not abstract and/or small. What’s the end goal here?

[−] benswerd 39d ago
So isolation is correct. Forking a sandbox gives you multiple exact duplicates of isolated environments.

When your coding agent has 10 ideas for what to do, to evaluate them correctly it needs to be able to evaluate them in isolation.

If you're building a website testing agent and it is halfway through a website, with a form half filled out, a session ongoing, etc., and it realizes it wants to test 2 things in isolation, forking is the only way.

We also envision this powering the next generation of devcycles "AI Agent, go try these 10 things and tell me which works best". AI forks the environment 10 times, gets 10 exact copies, does the thing in each of them, evaluates it, then takes the best option.
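
A rough sketch of that fork-and-pick-the-best loop, assuming the environment can be modelled as a plain Python dict. Here copy.deepcopy stands in for a real VM fork, and fork, explore, apply_patch and the patch list are all made-up names for illustration.

```python
import copy


def fork(env, n):
    # Stand-in for a real VM fork: n independent deep copies of the state.
    return [copy.deepcopy(env) for _ in range(n)]


def explore(env, candidates, apply, score):
    """Apply each candidate to its own fork and return (best_candidate, best_env)."""
    forks = fork(env, len(candidates))
    results = []
    for candidate, child in zip(candidates, forks):
        apply(child, candidate)  # mutates only this fork, never the base
        results.append((score(child), candidate, child))
    best = max(results, key=lambda r: r[0])
    return best[1], best[2]


base = {"files": {"app.py": "v0"}, "latency_ms": 120}


def apply_patch(env, patch):
    env["files"]["app.py"] = patch["name"]
    env["latency_ms"] += patch["delta_ms"]


patches = [
    {"name": "cache", "delta_ms": -40},
    {"name": "index", "delta_ms": -70},
    {"name": "rewrite", "delta_ms": 15},
]
winner, winner_env = explore(
    base, patches, apply_patch, score=lambda e: -e["latency_ms"]
)
```

The base environment is never touched, so a losing idea cannot contaminate the evaluation of the others.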

[−] vasco 39d ago

> and it realizes it wants to test 2 things in isolation, forking is the only way

Why would forking be the only way, when humans don't work like that? You can easily try one thing, undo, try the second thing. Your way is a faster way potentially, but also uses more compute.

[−] benswerd 39d ago
This assumes you can retain the same state after an operation.

> "I wonder if this is slow because we have 100k database rows"
> DELETE FROM TABLE;
> "Woah it's way faster now"
> But was it the 100k rows or was it a specific row?

That's a great example of where drilling into bugs and recreating exact issues can be a real problem, and testing the issues themselves can be destructive to the environment, leading to the need for snapshots and forks.
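
To make the scenario concrete, a small sketch using an in-memory SQLite database as a stand-in for the environment. The table name, row contents and fork_db helper are invented for the example; copying a database with sqlite3's Connection.backup is only an analogy for forking a whole VM.

```python
import sqlite3


def fork_db(src):
    # Copy the whole database into a fresh in-memory connection.
    dst = sqlite3.connect(":memory:")
    src.backup(dst)
    return dst


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM rows").fetchone()[0]


base = sqlite3.connect(":memory:")
base.execute("CREATE TABLE rows (id INTEGER PRIMARY KEY, payload TEXT)")
base.executemany(
    "INSERT INTO rows (payload) VALUES (?)", [("ok",)] * 999 + [("huge",)]
)
base.commit()

# Hypothesis A: it is slow because of sheer size, so delete everything.
a = fork_db(base)
a.execute("DELETE FROM rows")

# Hypothesis B: one specific row is the culprit, so delete only that row.
b = fork_db(base)
b.execute("DELETE FROM rows WHERE payload = 'huge'")
```

Both destructive experiments run against their own copy, and the original 1000 rows are still there for a third hypothesis.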

[−] vasco 39d ago
Again, that is a problem of approach, not of compute. Compute just makes that faster, it doesn't make it possible. It's like you saying the only way to do something is with threads. It's good for some use cases, bad for others, and makes most faster, but it doesn't unlock much
[−] stingraycharles 38d ago
You should focus much more on this aspect, this makes so much more sense but it’s a very specific, narrow use case: multiple solution spaces must be explored in parallel, and then reconciled.

I can also see this being more of a framework / library that integrates into existing LLM frameworks than a SaaS; I wouldn’t switch my whole application to a different framework / runtime just for this.

[−] benswerd 38d ago
This is a good note. We've never been great at explaining what we're doing and plan to do a lot more work on making it accessible/make sense.
[−] indigodaddy 39d ago
Yep I can see this especially when the agent is spinning up test servers/smokes and you don't want those conflicting. How do we reconcile all the potential different git hashes though, upstream I guess etc (this might be an easy answer and I'm not super proficient with git so forgive)
[−] benswerd 39d ago
So we recommend branch per fork, merge what you like.

You have to change the branch on each fork individually currently, and that's unlikely to change in the short term due to the complexity of git internals, but it's not that hard to do yourself: git checkout -b fork-{whateverDiscriminator}
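
A small sketch of scripting that per-fork branch step. The fork_branch_command helper and the discriminator values are hypothetical, and the character filter is a deliberately conservative subset of what git allows in branch names.

```python
import re


def fork_branch_command(discriminator):
    """Return the argv for: git checkout -b fork-<discriminator>."""
    # Replace anything outside a conservative safe set with a dash.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "-", str(discriminator)).strip("-.")
    if not safe:
        raise ValueError("discriminator has no usable characters")
    return ["git", "checkout", "-b", f"fork-{safe}"]


commands = [fork_branch_command(d) for d in ("idea 1", "idea/2", "retry#3")]
```

Each list can be passed to subprocess.run(cmd, cwd=repo_path, check=True) inside the corresponding fork, where repo_path is wherever the repository lives in that sandbox.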

[−] chrisweekly 39d ago
Have you considered git worktree?
[−] mememememememo 39d ago
The other way might be testing VMs vs agent VMs but that would be slower as to "fork" it would need to run the test again to that point. But wouldn't need agent context.

The forking you provided adds a lot more speed.

[−] wsve 39d ago
Agreed, the thing I'd be most interested in is the isolated execution environment you mentioned. Agents running autopilot are powerful. Agents running unsupervised on a machine with developer permissions and certificates where anything could influence the agent to act on an attacker's behalf is terrifying
[−] shubhamintech 39d ago
I think you're one of the very few who actually support eBPF & XDP, which you do need when you're building low-level stuff. Plus, the bare metal setup is out of this world lol.
[−] etse 38d ago
The memory forking is really interesting. I wonder whether copy-on-write at the VM level, while O(1) with respect to machine size, will still scale in cost with the number of forks taken, but a 320ms median seems good for the branch-and-explore pattern without reprovisioning every time.

One gap I'm noticing in these comments and in the current sandbox landscape is Windows. Every platform mentioned in these comments like E2B, Daytona, Fly Sprites, Sandflare appears Linux-native. Makes sense for coding agents targeting Debian environments, but a real category exists to automate Windows-specific workflows: enterprise software, ERP systems, anything that runs only on Windows.

If anyone wants to run agents on Mac or Linux and needs to access Windows for computer use, Dexbox could be helpful. [github.com/getdexbox/dexbox]

I launched an open source developer tool called Dexbox to run agent workloads that quickly provision and run Windows desktops. It's a CLI and MCP experience that's different from Freestyle, but slightly closer to our Windows-specific production infra, Nen. I like Freestyle's cool UI that shows off the unique technical approach and developer friendliness. Nen's a bit closer to that experience.

[−] qainsights 39d ago
Is this similar to https://instavm.io/?
[−] vimota 39d ago
This is awesome - the snapshotting especially is critical for long running agents. Since we run agents in a durable execution harness (similar to Temporal / DBOS) we needed a sandboxing approach that would snapshot the state after every execution in order to be able to restore and replay on any failure.

We ended up creating localsandbox [0] with that in mind by using AgentFS for filesystem snapshotting, but our solution is meant for a different use case than Freestyle - simpler FS + code execution for agents all done locally. Since we're not running a full OS it's much less capable but also simpler for lots of use cases where we want the agent execution to happen locally.

The ability to fork is really interesting - the main use case I could imagine is for conversations that the user forks or parallel sub-agents. Have you seen other use cases?

[0] https://github.com/coplane/localsandbox

[−] stocktech 39d ago
I built something like this at work using plain Docker images. Can you help me understand your value prop a little better?

The memory forking seems like a cool technical achievement, but I don't understand how it benefits me as a user. If I'm delegating the whole thing to the AI anyway, I care more about deterministic builds so that the AI can tackle the problem.

[−] _jayhack_ 39d ago
Would love to understand how you compare to other providers like Modal, Daytona, Blaxel, E2B and Vercel. I think most other agent builders will have the same question. Can you provide a feature/performance comparison matrix to make this easier?
[−] MarcelinoGMX3C 39d ago
The technical challenges in getting memory forking to deliver those sub-second start and fork times are significant. I've seen the pain of trying to achieve that level of state transfer and rapid provisioning. While "EC2-like" gets the point across for many, going bare metal reveals the practical limits of cloud virtualization for high-performance, complex workloads like these. It shows a real understanding of where cloud abstraction helps and where it just adds overhead.

The cost argument for owning the hardware for this specific use case also makes sense, considering the scale these agent environments will demand. Also worth noting, sandboxes are effectively an open attack surface; architecting them not to be in your main VPC is a sound security decision from the start.

[−] cheema33 39d ago
I currently use lightweight VMs (Proxmox containers) and git worktrees. I can fork an existing VM in seconds. It is not entirely clear to me what I would gain from using your solution.
[−] sonink 38d ago
Congratulations on the launch!

We run upwards of a thousand sandboxes for coding agents - but these are all standard VMs that we buy off the shelf from Azure, GCP, Akamai and AWS. I am not sure why we should use this instead of the standard VMs? Pricing could be one part, but not sure if the other features resonate.

Forking is interesting, but I would need to know how it works and if it is in the blast radius of the agent execution. If we need to modify the agent to be cognizant of forking, then that is a complexity which could be very expensive to handle in terms of context. If not, then I am not sure what is the use for it.

Sandbox start time at 500ms is definitely interesting. But it's something we are already on track to reproduce with a pooled batch of VMs. So not sure if that in itself is worth paying the premium for.

My two cents on the space is that agents are rapidly becoming more capable of just using the tooling developed for humans. All clouds provide a CLI which agents can already use to orchestrate - they should just use the VMs designed for humans through the CLI. Our agent can already 'login' to any VM on the cloud and use the shell exactly like a human would. No software harness is required for this capability. The agent working on a VM is indistinguishable from humans.

[−] benatkin 39d ago
It's hard to tell what this is or how it compares to other things that are out there, but what I latched onto is this:

> Freestyle is the only sandbox provider with built-in multi-tenant git hosting — create thousands of repos via API and pair them directly with sandboxes for seamless code management. On top of that, Freestyle VMs are full Linux virtual machines with nested virtualization, systemd, and a complete networking stack, not containers.

It makes me think of the git automation around rigs in Gas Town: https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16d...

Edit: I realize the Loom is a way to look at it. Loom interrupted me twice and I almost skipped it. However it gave me a better idea of what it does, it "invents" snapshotting and restoring of VMs in a way that appears faster. That actually makes sense and I know it isn't that hard to do with how VMs work and that it greatly benefits from having only part of the VM writable and having little memory used (maybe it has read-only memory too?).

[−] rasengan 39d ago
Interesting!

We're working on a similar solution at UnixShells.com [1]. We built a VMM that forks, and boots, in < 20ms and is live, serving customers! We have a lot of great tools available, via MIT, on our github repo [2] as well!

[1] https://unixshells.com

[2] https://github.com/unixshells

[−] n2d4 39d ago
Cool! I've been using your API for running sandboxed JS. Nice to see you also support VMs now.

> we mean forking the whole memory of it

How does this work? Are you copying the entire snapshot, or is this something fancy like copy-on-write memory? If it's the former, doesn't the fork time depend on the size of the machine?
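
For readers unfamiliar with the distinction, a toy model of copy-on-write forking. This is not a claim about how Freestyle works: the CowMemory class is invented for illustration, and real hypervisors do this with page tables and write faults rather than Python dicts.

```python
class CowMemory:
    """Toy copy-on-write memory: forks share pages until one side writes."""

    def __init__(self, parent=None):
        self.parent = parent  # frozen, shared layer (or None)
        self.pages = {}       # only the pages written since the last fork

    def read(self, page):
        # Walk from the newest layer down to the oldest shared one.
        node = self
        while node is not None:
            if page in node.pages:
                return node.pages[page]
            node = node.parent
        return 0  # untouched pages read as zero

    def write(self, page, value):
        self.pages[page] = value  # lands in the private layer only

    def fork(self):
        # Freeze the current pages into a shared layer; no page is copied,
        # so the cost does not depend on how much memory is in use.
        frozen = CowMemory(self.parent)
        frozen.pages = self.pages
        self.parent = frozen
        self.pages = {}
        return CowMemory(frozen)


vm = CowMemory()
for page in range(1000):
    vm.write(page, page)

child = vm.fork()
child.write(0, "child")
vm.write(1, "parent")
```

In this model fork time is constant, while the cost shows up later: each side pays per page it actually modifies, and reads get slower as layers stack up.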
[−] skybrian 39d ago
It doesn't seem very easy to calculate how much it would cost per month to keep a mostly-idle VM running (for example, with a personal web app). The $20/month plan from exe.dev seems more hobbyist-friendly for that. Maybe that's not the intended use, though?
[−] alasano 39d ago
Just want to say that even if alternatives exist (not necessarily exact capabilities obviously), I appreciate what seems to be genuine excitement on your part of having built something cool / best in class.

So best of luck with your vision for it!

[−] nyellin 39d ago
Is it possible to run a Kubernetes cluster inside one? (E.g. via KIND.)

If so, we'd very much like to test this. We make extensive use of Claude Code web but it can't effectively test our product inside the sandbox without running a K8s cluster

[−] csomar 38d ago
I was intrigued to try but your web app is so extremely slow, it takes up to 30+ seconds to move from one tab to the next. Not exactly selling your point of being a super fast provisioning service. Another thing I am wondering. You seem to be selling this as VMs configurable from node/bun. Wouldn't a CLI make more sense here?

Another question: How hard do you think it'll be to integrate this with something like Claude Code? I.e., /resume in Claude Code both returns your session and wakes up your VM. Or even better, /resume from Freestyle and have your Claude Code session open where you left it.

[−] umarcyber 39d ago
Your UI design is really nice.
[−] Bnjoroge 39d ago
Looks cool - would be great to see a PR with some benchmarks on this repo if you can: https://github.com/computesdk/benchmarks

edit: just saw the pr for freestyle. something seems to be blocking, but curious how it compares: https://github.com/computesdk/benchmarks/pull/41

[−] BlueRock-Jake 38d ago
A ton of people have mentioned this but what you're doing with memory forking is pretty unique. Most sandboxes seem to just fork the filesystem and call it a day. Forking full VM memory mid-exec is taking it to another level entirely. Would be very interested to hear how the implementation looks under the hood, specifically how you handle dirty memory pages across forks without the pause ballooning.
[−] esseph 39d ago

> In order to make this possible, we’ve moved to our own bare metal racks. Early in our testing we realized that moving VMs across cloud nodes would not have acceptable performance properties. We asked Google Cloud and AWS for a quote on their bare metal nodes and found that the monthly cost was equivalent to the total cost of the hardware so we did that.

Yes! And good on you, well-tuned bare metal performance is hard to beat.

[−] jFriedensreich 39d ago
Non-open-source and non-local SaaS sandboxes are offensive to even try to launch. No one needs this and the only customers will be vibe coders who just don't know any better. There are teams building actual sandboxes like smolmachines, podman, colima and mre. At least be honest and put the virtualisation tech you are using, as well as the fact that it's closed-source SaaS, on the landing page to save people time.
[−] brap 38d ago
Very nice, congrats!

One thing:

>Freestyle is the only sandbox provider with built-in multi-tenant git hosting — create thousands of repos via API and pair them directly with sandboxes for seamless code management.

Maybe I’m just stupid, but I don’t know what this means. I initially thought I’m your target audience but after failing to understand this part I’m thinking maybe I’m not? I honestly don’t know.

[−] lukebaze 38d ago
The observability point is real but honestly the loop detection problem is more about how you structure your agent than the sandbox. When I've had agents go rogue, the issue was always the outer loop logic, not visibility into the VM. What does your current loop controller look like?
[−] holoduke 39d ago
The problem with agents is that they are currently way too expensive. 100 times too expensive, maybe. Another big issue is the lack of interactivity with an agent. Therefore, for now, agentic development is only viable from your own machine. And there, isolation is less of an issue and easier to manage.
[−] orliesaurus 38d ago
There are many providers popping up every day offering sandboxes, I think Cloudflare is ahead of the game for pricing and performance, that being said it would be super nice to see a huge competitor analysis: Cloudflare vs e2b vs daytona vs freestyle vs whatever else
[−] ianberdin 39d ago
Congrats guys! Would love it if you shared some technical details, I bet you have great stories to tell. Let’s see, what is forking? Do you completely copy the disk, make a RAM snapshot and run it? If it's CoW, what about RAM? You mentioned 8GB RAM VMs. It sounds impossible to copy 8GB in under 500ms, and the disk too?
[−] skybrian 39d ago
Any ideas for locking down remote access from an untrusted VM? Cloudflare has object-based capabilities and some similar thing might be useful to let a VM make remote requests without giving it API keys. (Keys could be exfiltrated via prompt injection.)
[−] k38f 39d ago
500ms fork of a running VM with full memory state is the kind of thing I'd assume wasn't possible until I saw it work. What does failure look like — does the fork just not happen, or can you get partial state?
[−] randomtoast 38d ago
Do you have any recommendations for CLI-based microVM solutions that support running multiple instances of Claude Code with "--yolo sandboxing" on Linux?
[−] CompuIves 38d ago
This is really cool to see, reminds me of the early days of CodeSandbox. Though this API looks _fantastic_. I love that you do VM configuration using with.
[−] siscia 39d ago
It is not clear to me how much CPU I get.

"Unlimited" as in 8vCPU and then I am billed for it on consumption?

[−] jbethune 39d ago
Congratulations on the launch! Will definitely test this out.
[−] bhaktatejas922 38d ago
do you think the industry is overfixated on startup times? what are better metrics people building with sandboxes should pay attention to
[−] jnstrdm05 39d ago
how many seconds to provision are we talking about here? 1 sec vs 60 is a dealbreaker for me, some clarity on that would be nice.
[−] lawrencechen 39d ago
Can you develop freestyle in freestyle vms?
[−] zhdhdjfdhsbs 38d ago
Wqq wwiq and hdhddjdbnzzs S
[−] Fraaaank 39d ago
Your pricing page is broken
[−] danielhanchen 37d ago
Congrats on the launch!
[−] dominotw 39d ago
dumb question. none of these protect you from prompt injection. yes?
[−] maxmaio 39d ago
Congrats Ben and Jacob!
[−] messh 39d ago
Check out shellbox.dev, you can do pretty much the same, automating it all via ssh
[−] schopra909 39d ago
Honestly never considered the forking use case; but it makes a ton of sense when explained

Congrats on the launch. This is cool tech

[−] fawabc 39d ago
how does this differ from daytona or e2b?
[−] siva7 39d ago
I have so many interesting problems in AI, sandboxing isn't one of them. It's a pointless exercise, yet disproportionately many people love to do this. Probably because sandboxing doesn't feel as magic as agents themselves and more like the old times of "traditional" software development.