Show HN: Optio – Orchestrate AI coding agents in K8s to go from ticket to PR (github.com)

by jawiggins 59 comments 88 points

[−] stingraycharles 51d ago
I’ve come to the realization that these kinds of systems don’t work, and that a human in the loop is crucial for task planning. The LLM’s role is to identify issues, communicate the design/architecture, etc. before the task is handed off; otherwise the LLM always ends up doing not quite the right thing.

How is this part tackled when all you have is GH issues? Doesn’t this only work for the most trivial issues?

[−] vidarh 51d ago
I've come to the opposite conclusion: the big limitation of systems like this is starting and ending with human involvement at the same level, instead of directing at a higher level. You end up quibbling over details the agents can handle themselves with sufficient guardrails and process, instead of setting higher-level requirements, reviewing higher-level decisions and outcomes, and dealing with exceptions.

You can afford a lot of extra guardrails and process to ensure sufficient quality when the result is a system that gets improved autonomously 24/7.

I'm on my way home from a client, and meanwhile another project has spent the last 10 hours improving with no involvement from me. I spent a few minutes reviewing things this morning, after it had spent the whole night improving unattended.

[−] tim-projects 49d ago
I don't believe comments like this. Sure, it worked for ten hours, but if you didn't review it, you will sooner or later, when it breaks. And it will. I run agents all day and that's what happens: they do unwanted things that you aren't aware of.
[−] stingraycharles 50d ago
I find that doesn’t work in the long run. Software agents are not yet capable of maintaining a decently active repository for extended periods of time.

I am all for delegating everything to AI agents, but it just becomes a mess over time if you don’t steer things often enough.

[−] vidarh 50d ago
Not my experience at all. If anything, they make it cheap enough to deal with tech debt that it is far easier to justify being strict.

EDIT: I'll add that you can't expect it to guess what you want, but you can let it manage how it delivers it. We don't expect e.g. a product manager to dictate how developers deliver the code, just what the acceptance criteria are, and that's where I'm headed.

[−] mshark 51d ago
Had the same realization, which inspired eforge (shameless plug): https://github.com/eforge-build/eforge - planning stays in the developer’s control, with all engineering (agent orchestration) handed off to eforge. This has been working well for a solo or siloed developer (me) who is free to plan independently. It lets the developer confidently stay in the planning plane while eforge handles the rest, using a methodology that in my experience works well. Of course, garbage in, garbage out - thorough human planning (AI-assisted, not autonomous) is key.
[−] berkay 50d ago
I like the separation of planning and execution. I think the right set of artifacts to pass on to execution will evolve, and maybe it's different for different types of work.

From the project: "The plugin enqueues the input and a daemon picks it up - planning, building, reviewing, and validating autonomously."

The part that is not clear to me (and causes the most problems for me) is the "validating". The agent makes a mistake, or decides mocking an interface is fine, etc., declares success, and moves on to the next task. The bigger the project, the more the small mistakes compound. It sounds like the agent is doing the validation. What's the approach here for validation?

[−] stingraycharles 51d ago
To me that doesn't do enough yet in terms of up-front planning and visualization, but it's a step in the right direction. I prefer Traycer myself.
[−] mshark 51d ago
Hadn’t seen Traycer, that looks really polished. An important difference is that eforge is open source (Apache 2.0). I purposefully left out planning features from eforge because I don’t want the same tool that builds my code to force me into a planning methodology. Our role as developers has shifted heavily into planning (offloading implementation), and I’m still getting comfortable with that and want to be free to explore the planning space. Maybe I’ll change my mind after my planning opinions evolve.
[−] jawiggins 51d ago
Maybe - I do think that as the models get better, they'll be able to handle more and more difficult tasks. And yet, even if they can only solve the simplest issues now, why not let them, so you can focus on the more important things?
[−] denysvitali 51d ago
FWIW, a "cheaper" version of this is triggering Claude via GitHub Actions and @claude-ing your agents like that. If you run your CI on Kubernetes (ARC), it sounds pretty much the same.
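For the curious, the setup described above can be sketched as a workflow file. This is a hypothetical minimal example, not optio's code: the action name and input (`anthropics/claude-code-action`, `anthropic_api_key`) are assumptions about that action's current release, and `arc-runner-set` is a placeholder for whatever ARC runner scale set you've deployed on your cluster.

```yaml
# Hypothetical workflow: reacts when someone writes "@claude" in an issue comment.
name: claude-agent
on:
  issue_comment:
    types: [created]

jobs:
  claude:
    # Only fire on comments that actually mention the bot.
    if: contains(github.event.comment.body, '@claude')
    # Self-hosted runner scale set managed by Actions Runner Controller (ARC);
    # the label is a placeholder for your own.
    runs-on: arc-runner-set
    steps:
      - uses: actions/checkout@v4
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
```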
[−] olegbk 45d ago
The feedback loop is what most people miss when they build these systems. You spin up the agent, it submits a PR, CI goes red, and suddenly you're back to being the bottleneck you were trying to eliminate.

One thing I ran into building something similar: agents are surprisingly good at fixing the exact error message they're given, but terrible at recognizing when they're going in circles. After the third retry on the same failing test, you're not getting a fix; you're getting increasingly creative excuses for why the test is wrong.

How deep does the self-healing go? Is there a retry limit before it escalates, or does it just keep going until you manually intervene?
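The circle-detection described above is cheap to guard against in the orchestrator itself. A minimal sketch, with hypothetical names (nothing here is from optio): `run_agent` stands in for whatever resumes the agent with the CI error text and returns the next error, or `None` once CI goes green.

```python
from typing import Callable, Optional

def fix_until_green(
    run_agent: Callable[[str], Optional[str]],  # returns next CI error, or None when green
    first_error: str,
    max_retries: int = 3,
) -> str:
    """Resume the agent on each CI failure, but stop early if it is clearly
    going in circles (identical error twice in a row) or the retry budget
    is spent. Returns "fixed" or "escalate" (i.e. page a human)."""
    error = first_error
    previous = None
    for _ in range(max_retries):
        if error == previous:
            return "escalate"      # same failure twice: stop burning tokens
        previous = error
        error = run_agent(error)
        if error is None:
            return "fixed"         # CI green
    return "escalate"              # retry budget exhausted
```

The same-error-twice check is deliberately crude; a fuzzier comparison (normalized stack traces, failing test names) catches more loops at the cost of some false positives.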

[−] naultic 51d ago
I'm working on something a little similar, though mine's more of a dev tool than process automation, but I love where yours is headed. The biggest issue I've run into is handling retries with agents. My current solution is to have them set checkpoints so they can revert easily; when they can't make an edit or can't get a test passing, they just restart from an earlier state. The problem is this uses up lots of tokens on retries. How did you handle this issue in your app?
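The checkpoint/revert scheme described above can be sketched with the workspace modeled as a plain dict of path -> contents. This is an illustration only (the class and its methods are hypothetical); a real implementation would checkpoint with git, e.g. `git commit` or `git stash` in the agent's worktree and `git reset --hard` to revert.

```python
import copy

class CheckpointedWorkspace:
    """Toy checkpoint stack: snapshot the whole workspace before a risky
    edit, pop back to the last known-good state when the agent gets stuck."""

    def __init__(self, files: dict[str, str]):
        self.files = files
        self._stack: list[dict[str, str]] = []

    def checkpoint(self) -> None:
        # Deep-copy so later edits don't mutate the saved snapshot.
        self._stack.append(copy.deepcopy(self.files))

    def revert(self) -> None:
        # Restore (and consume) the most recent checkpoint.
        self.files = self._stack.pop()
```

On the token cost: reverting the files is free, but the agent's context is not; re-feeding only a short summary of what failed, rather than the whole transcript, is the usual way to keep retries from multiplying token spend.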
[−] jawiggins 51d ago
Generally I've found agents are capable of self-correcting as long as they can bash up against a guardrail and see the errors. So in optio, the agent is resumed and told to fix any CI failures or address review feedback.
[−] MrDarcy 51d ago
Looks cool, congrats on the launch. Is there any sandbox isolation from the k8s platform layer? Wondering if this is suitable for multiple tenants or customers.
[−] jawiggins 51d ago
Oh good question, I haven't thought deeply about this.

Right now nothing special happens, so claude/codex can access their normal tools and make web calls. I suppose that also means they could figure out they're running in a k8s pod and do service discovery and start calling things.

What kind of features would you be interested in seeing around this? Maybe a toggle to disable internet connections or other connections outside of the container?

[−] nevon 51d ago
Network policies controlling egress would be one thing. I haven't seen how you make secrets available to the agent, but I would imagine you would need to proxy calls through a mitm proxy to replace tokens with real secrets, or some other way to make sure the agent cannot access the secrets themselves. Specifically for an agent that works with code, I could imagine being able to run docker-in-docker will probably be requested at some point, which means you'll need gvisor or something.
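A default-deny egress policy on the agent pods is the usual starting point for the lockdown described above. A sketch only, not anything optio ships: the namespace and the `app: optio-agent` label are placeholders, and real deployments would add explicit allow rules for the git remote and the secret-injecting egress proxy.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agent-egress-lockdown
  namespace: agents              # placeholder namespace
spec:
  podSelector:
    matchLabels:
      app: optio-agent           # placeholder label for the agent pods
  policyTypes:
    - Egress
  egress:
    # Allow DNS so the pod can still resolve explicitly allowed endpoints.
    - ports:
        - protocol: UDP
          port: 53
    # All other egress (cluster services, cloud metadata endpoints, the
    # open internet) is denied unless an allow rule is added here.
```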
[−] jordanedev 50d ago
That's exactly what I did personally on my OSS repo: https://github.com/ysa-ai/ysa

I want to run my agents fully isolated in headless mode. To achieve that safely, you have to run a proxy.

[−] saltpath 51d ago
The parallel execution model makes sense for independent tickets but I'm wondering what happens when agent A is halfway through a PR touching shared/utils.py and agent B gets assigned a ticket that needs the same file. Does the orchestrator do any upfront dependency analysis to detect that, or do you just let them both run and deal with the conflict at merge time?
[−] vidarh 51d ago
It's generally not worth worrying about it too much beyond a very high level, vs. letting them fight it out, as long as your test suite is good enough and your orchestrator is even moderately prepared to handle retries.
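If you do want the upfront dependency analysis asked about above, a crude version is just set intersection over each ticket's predicted touch set, greedily claiming files as tickets are admitted. A sketch with hypothetical names (nothing here is from optio); in practice the predicted file sets would come from the planner or from paths mentioned in the ticket.

```python
def schedulable(ticket_files: dict[str, set[str]], in_flight: set[str]) -> list[str]:
    """Return tickets whose predicted file sets don't overlap with files
    already claimed by running agents, claiming greedily in ticket order.
    Deferred tickets are retried on the next scheduling pass."""
    claimed = set(in_flight)
    ready = []
    for ticket, files in ticket_files.items():
        if files & claimed:
            continue               # would race an in-flight PR; defer
        ready.append(ticket)
        claimed |= files
    return ready
```

This errs conservative: two tickets that touch the same file but different functions get serialized even though they would likely merge cleanly, which is the trade-off vs. just resolving conflicts at merge time.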
[−] verdverm 51d ago
I love k8s, but having it as a requirement for my agent setup is a non-starter. Kubernetes is one way to run this, not the centerpiece.
[−] pianopatrick 51d ago
I wonder, based on your experience: how hard would it be to improve your system so an AI agent reviews the software and suggests tickets?

Like, can an AI agent use a browser, attempt to use the software, find bugs and create a ticket? Can an AI agent use a browser, try to use the software and suggest new features?

[−] ramon156 51d ago
I think it's more important to pin down where a human must be in order for this not to become a mess. Or have we skipped that step entirely?
[−] vidarh 51d ago
Yes, they can, and they do a reasonably good job of it. Hand them Playwright or similar and point them at it. The caveat is that they're often "lazy", and it takes some practice to coax them into being thorough (hot tip: have one write a list of things to probe and test, and tell it to use sub-agents to address each; otherwise they tend to decide very quickly that it's too tedious and start taking shortcuts).
[−] mlsu 51d ago
perhaps we can give the AI a bit of money, make it the customer, then we can all safely get off the computer and go outside :)
[−] stingraycharles 51d ago
AI agents can absolutely use web browsers to do these things, but the hard part is accurately defining the acceptance criteria.
[−] smokeyfish 51d ago
Datadog has a feature like that.
[−] pistoriusp 51d ago
Hey @jawiggins, would you consider using https://github.com/redwoodjs/agent-ci?
[−] raised_hand 51d ago
Why K8s? Is there a way I could run it without it?
[−] maxdo 51d ago
Is the pod per repo or per task ?
[−] fhouser 51d ago
Hot take: You should want to review your agents' output and progress.
[−] conception 51d ago
What’s the most complicated, finished project you’ve done with this?
[−] antihero 51d ago
And what stops it making total garbage that wrecks your codebase?
[−] abybaddi009 51d ago
Does this support skills and MCP?
[−] hmokiguess 51d ago
the misaligned columns in the Claude-made ASCII diagrams in the README really throw me off - why not fix them?

| | | |
