I’ve come to the realization that these kinds of systems don’t work on their own, and that a human in the loop is crucial for task planning; the LLM’s role is to identify issues, communicate the design / architecture, etc. before the task is handed off. Otherwise the LLM always ends up doing something that’s not quite right.
How is this part tackled when all that you have is GH issues? Doesn’t this work only for the most trivial issues?
FWIW, a "cheaper" version of this is triggering Claude via GitHub Actions and @claudeing your agents like that. If you run your CI on Kubernetes (ARC), it sounds pretty much the same.
The feedback loop is what most people miss when they build these systems. You spin up the agent, it submits a PR, CI goes red, and suddenly you're back to being the bottleneck you were trying to eliminate.
One thing I ran into building something similar: agents are surprisingly good at fixing the exact error message they're given, but terrible at recognizing when they're going in circles. After the third retry on the same failing test, you're not getting a fix, you're getting increasingly creative excuses for why the test is wrong.
How deep does the self-healing go? Is there a retry limit before it escalates, or does it just keep going until you manually intervene?
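To make the question concrete, the shape I'd expect is roughly this sketch (all the names are hypothetical, not claiming this is how yours works):

    # Hypothetical sketch of a bounded retry loop that escalates instead of
    # letting the agent argue with the same failing test forever.
    MAX_ATTEMPTS = 3

    def run_with_escalation(run_agent, run_ci, escalate, task):
        last = None
        for attempt in range(MAX_ATTEMPTS):
            patch = run_agent(task, feedback=last)
            result = run_ci(patch)          # assumed to expose .passed and .signature
            if result.passed:
                return patch
            stuck = last is not None and result.signature == last.signature
            last = result
            if stuck:                       # same failure twice in a row: going in circles
                break
        escalate(task, last)                # hand it back to a human with context
        return None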
I'm working on something a little similar, but mine's more a dev tool vs. process automation, and I love where yours is headed. The biggest issue I've run into is handling retries with agents. My current solution is to have them set checkpoints so they can revert easily; when they can't make an edit or can't get a test passing, they just restart from an earlier state. The problem is this uses up lots of tokens on retries. How did you handle this issue in your app?
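For concreteness, my checkpoints are just plain git under the hood, roughly this sketch (the function names are my own, nothing clever):

    # Git-based checkpoints: commit before each risky step, then hard-reset
    # back to the last good commit when the agent gets stuck.
    import subprocess

    def git(*args, cwd):
        out = subprocess.run(["git", *args], cwd=cwd, check=True,
                             capture_output=True, text=True)
        return out.stdout.strip()

    def checkpoint(repo):
        git("add", "-A", cwd=repo)
        git("commit", "--allow-empty", "-m", "agent checkpoint", cwd=repo)
        return git("rev-parse", "HEAD", cwd=repo)

    def rollback(repo, sha):
        git("reset", "--hard", sha, cwd=repo)
        git("clean", "-fd", cwd=repo)   # drop any untracked files the agent created

It works, it just doesn't save any tokens on the retries themselves, which is why I'm asking.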
Looks cool, congrats on the launch. Is there any sandbox isolation from the k8s platform layer? Wondering if this is suitable for multiple tenants or customers.
The parallel execution model makes sense for independent tickets but I'm wondering what happens when agent A is halfway through a PR touching shared/utils.py and agent B gets assigned a ticket that needs the same file.
Does the orchestrator do any upfront dependency analysis to detect that, or do you just let them both run and deal with the conflict at merge time?
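To be concrete about what I mean by upfront analysis, something like this sketch, where predict_files is a stand-in for however you'd guess the touched files (an LLM call, grepping the issue text, whatever):

    # Hypothetical sketch: predict the files each ticket will touch and only
    # run tickets in parallel when their file sets don't overlap.
    def plan_waves(tickets, predict_files):
        waves, claimed = [], []                  # waves run one after another
        for ticket in tickets:
            files = set(predict_files(ticket))
            for wave_files, wave in zip(claimed, waves):
                if not files & wave_files:       # no shared files: safe to parallelize
                    wave.append(ticket)
                    wave_files |= files
                    break
            else:                                # conflicts with every wave so far
                waves.append([ticket])
                claimed.append(files)
        return waves                             # each inner list runs in parallel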
I wonder, based on your experience, how hard it would be to improve your system so an AI agent reviews the software and suggests tickets?
Like, can an AI agent use a browser, attempt to use the software, find bugs and create a ticket? Can an AI agent use a browser, try to use the software and suggest new features?
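The plumbing side seems easy enough, something like this sketch with Playwright plus the GitHub issues API (the URL and repo are placeholders); the hard part is the judgment about what's actually a bug or a feature worth suggesting:

    # Plumbing only: drive the app with Playwright, collect console errors,
    # and file each one as a GitHub issue. Deciding what counts as a bug or a
    # worthwhile feature is the part an LLM would have to do, and is left out.
    import os, requests
    from playwright.sync_api import sync_playwright

    APP_URL = "https://staging.example.com"     # placeholder
    REPO = "your-org/your-repo"                 # placeholder

    def file_issue(title, body):
        resp = requests.post(
            f"https://api.github.com/repos/{REPO}/issues",
            headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
            json={"title": title, "body": body},
            timeout=30,
        )
        resp.raise_for_status()

    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        errors = []
        page.on("console", lambda m: errors.append(m.text) if m.type == "error" else None)
        page.goto(APP_URL)
        page.wait_for_timeout(2000)             # give the app a moment to misbehave
        for err in errors:
            file_issue(f"Console error on {APP_URL}", err)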