How I write software with LLMs (stavros.io)

by indigodaddy 528 comments 544 points

[−] akhrail1996 61d ago
Genuine question: what's the evidence that the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?

The author uses different models for each role, which I get. But I run production agents on Opus daily and in my experience, if you give it good context and clear direction in a single conversation, the output is already solid. The ceremony of splitting into "architect" and "developer" feels like it gives you a sense of control and legibility, but I'm not convinced it catches errors that a single model wouldn't catch on its own with a good prompt.

[−] christofosho 62d ago
I like reading these types of breakdowns. They really give you ideas and insight into how others are approaching development with agents. I'm surprised the author hasn't broken the developer agent persona down into smaller subagents. A lot of context gets used when your agent needs to write across a large breadth of code areas (e.g. database queries, tests, business logic, infrastructure, the general code skeleton). I've also read[1] that having a researcher and then a planner helps with context management in the pre-dev stage as well. I like his use of multiple reviewers, and am similarly surprised that they aren't refined into specialized roles.
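For concreteness, here's the kind of split I have in mind; a rough sketch where the role names, prompts, and the run_agent() helper are all made up, not from any particular harness:

```python
from dataclasses import dataclass

@dataclass
class SubAgent:
    name: str
    system_prompt: str
    allowed_tools: list[str]  # hard-scope what each role may touch

# Hypothetical role split: each agent gets a narrow prompt and tool set,
# so a task only pays the context cost of its own code area.
SUBAGENTS = {
    "db": SubAgent("db",
        "You only write and review database queries and migrations.",
        ["read_file", "write_file:src/db/**"]),
    "tests": SubAgent("tests",
        "You only write tests. Never modify production code.",
        ["read_file", "write_file:tests/**", "run_tests"]),
    "logic": SubAgent("logic",
        "You implement business logic against the agreed design doc.",
        ["read_file", "write_file:src/app/**"]),
}

def dispatch(task: str, role: str) -> str:
    """Run a task in a fresh context containing only one role's prompt/tools."""
    agent = SUBAGENTS[role]
    # run_agent() is a stand-in for whatever your harness actually exposes.
    return run_agent(system=agent.system_prompt,
                     tools=agent.allowed_tools,
                     user=task)
```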

I'll admit to being a "one prompt to rule them all" developer: I won't let a chat go past my first message. If mistakes are made, I fix the system prompt or the input prompt and try again. And I make sure the work is broken down as much as possible, which means taking the time to do some discovery before I hit send.

Is anyone else using many smaller specific agents? What types of patterns are you employing? TIA

1. https://github.com/humanlayer/advanced-context-engineering-f...

[−] lbreakjai 61d ago
It's interesting to see some patterns starting to emerge. Over time, I ended up with a similar workflow, except that instead of using plan files within the repository, I'm using Notion as the memory and source of truth.

My "thinker" agent will ask questions, explore, and refine. It will write a feature page in Notion and split the implementation into tasks on a kanban board, for an "executor" to pick up, implement, and pass to a QA agent, which will either flag it or move it to human review.

I really love it. All of our other documentation lives in Notion, so I can easily reference and link business requirements. I also find it much easier to make sense of the steps by checking the tickets on the board rather than in a file.

Reviewing is simpler too. I can pick the ticket in the human review column, read the requirements again, check the QA comments, and then look at the code. Had a lot of fun playing with it yesterday, and I shared it here:

https://github.com/marcosloic/notion-agent-hive
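To give an idea of the glue involved, here's a minimal sketch of the executor side of the loop against the official notion-client SDK; the property names and the run_executor/run_qa calls are placeholders, the real thing is in the repo:

```python
from notion_client import Client  # official Notion SDK

notion = Client(auth="secret_token")
BOARD_ID = "your-kanban-database-id"

def process_board():
    # Pull the tickets the "thinker" left in the To do column.
    todo = notion.databases.query(
        database_id=BOARD_ID,
        filter={"property": "Status", "status": {"equals": "To do"}},
    )
    for page in todo["results"]:
        result = run_executor(page)    # placeholder: implement the ticket
        passed = run_qa(page, result)  # placeholder: QA agent's verdict
        next_status = "Human review" if passed else "Flagged"
        notion.pages.update(
            page_id=page["id"],
            properties={"Status": {"status": {"name": next_status}}},
        )
```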

[−] highfrequency 61d ago

> I’ll tell the LLM my main goal (which will be a very specific feature or bugfix e.g. “I want to add retries with exponential backoff to Stavrobot so that it can retry if the LLM provider is down”), and talk to it until I’m sure it understands what I want. This step takes the most time, sometimes even up to half an hour of back-and-forth until we finalize all the goals, limitations, and tradeoffs of the approach, and agree on what the end architecture should look like.

This sounds sensible, but it also makes me wonder how much time is actually being saved if implementing a "very specific feature or bugfix" still takes an hour of back-and-forth with an LLM.

I can't help but think that this is still just an awkward intermediate phase of development with adolescent LLMs, one where we still need to think about implementation choices at all.
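For scale, the quoted feature is small enough to sketch in full (a generic version, not the author's actual code), which is what makes half an hour of design discussion feel steep:

```python
import random
import time

def call_with_retries(fn, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Retry fn() with exponential backoff plus jitter when the provider is down."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:  # stand-in for the provider's outage errors
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay + random.uniform(0, delay / 2))
```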

[−] miguelgrinberg 61d ago

> One thing I’ve noticed is that different people get wildly different results with LLMs, so I suspect there’s some element of how you’re talking to them that affects the results.

It's always easier to blame the prompt and convince yourself that you have some sort of talent in how you talk to LLMs that others don't.

In my experience the differences are mostly in how the code produced by the LLM is reviewed. Developers who have experience reviewing code are more likely to find problems immediately and to complain that they aren't getting great results without a lot of hand-holding. And those who rarely or never reviewed code from other developers are invariably going to miss stuff and rate the output they get higher.

[−] plastic041 62d ago
I wanted to know how to make software with LLMs "without losing the benefit of knowing how the entire system works" while staying "intimately familiar with each project’s architecture and inner workings", despite having "never even read most of their code". (Because obviously, you can't.) But OP didn't explain that.

You tell an LLM to create something, and then use another LLM to review it. That might make the result safer, but it doesn't mean that YOU understand the architecture. No one does.

[−] jbergqvist 61d ago
I've found that spending most of my time on design before any code gets written makes the biggest difference.

The way I think about it: the model has a probability distribution over all possible implementations, shaped by its training data. Given a vague prompt, that distribution is wide and you're likely to get something generic. As you iterate on a design with the model (really just refining the context), the distribution narrows towards a subset of implementations. By the time the model writes code, you've constrained the space enough that most of what it produces is actually what you want.
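A toy illustration of that narrowing (the axes and numbers are purely made up): each design decision filters the candidate set, and the entropy of what's left keeps dropping.

```python
import math

def entropy_bits(candidates):
    """Shannon entropy of a uniform distribution over the remaining candidates."""
    return math.log2(len(candidates)) if candidates else 0.0

# A toy "space of implementations" with a few design axes.
candidates = [
    {"storage": s, "retries": r, "api": a}
    for s in ("sqlite", "postgres", "flat-file")
    for r in (True, False)
    for a in ("rest", "grpc")
]
print(entropy_bits(candidates))  # ~3.58 bits: vague prompt, wide distribution

# Each round of design discussion pins down another axis.
candidates = [c for c in candidates if c["storage"] == "sqlite" and c["retries"]]
print(entropy_bits(candidates))  # 1.0 bit: the model can hardly miss now
```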

[−] jumploops 62d ago
This is similar to how I use LLMs (architect/plan -> implement -> debug/review), but after getting bitten a few times I've added a few extra things to my process.

The main difference between my workflow and the author's is that I have the LLM "write" the design/plan/open questions/debug/etc. into markdown files, for almost every step.

This is mostly helpful because it "anchors" decisions in timestamped files, rather than leaving them as loose back-and-forth specs in the context window.

Before the current round of models, I would religiously clear context and rely on these files for truth; even with the newest models/agentic harnesses, I find this still helps avoid regressions as the software evolves over time.
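The anchoring itself is trivial to script; a minimal sketch of what I mean, where the layout and naming are just my own convention:

```python
from datetime import datetime
from pathlib import Path

def write_note(slug: str, body: str, kind: str = "plan") -> Path:
    """Anchor a design/plan/debug note into a timestamped markdown file."""
    stamp = datetime.now().strftime("%Y-%m-%d-%H%M")
    path = Path("docs") / kind / f"{stamp}-{slug}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"# {slug} ({kind}, {stamp})\n\n{body}\n")
    return path

# e.g. write_note("retry-backoff", design_doc_from_llm, kind="design")
```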

A minor difference between myself and the author is that I don't rely on specific sub-agents (beyond what the agentic harness has built in for e.g. file exploration).

I say it's minor because, in practice, the actual calls to the LLMs undoubtedly look quite similar (clean context window, different task/model, etc.).

One tip, if you have access: do the initial design/architecture with GPT-5.x Pro, and then take the output "spec" from that chat/iteration to kick off a Codex/Claude Code session. This can also be helpful for hard-to-reason-about bugs, but I've only done that a handful of times so far (e.g. a funky dynamic SVG-based animation snafu).

[−] thenthenthen 61d ago
Haha, love the sleight-of-hand irregular wall clock idea. I once had a wall clock where the second hand would sometimes jump backwards; it was extremely unsettling, somehow, because it was random. It really did make me question my sanity.

[−] gehsty 61d ago
When I use Claude code to work on a hobby project it feels like doom scrolling…

I can't get my head around whether the hobby is the making or the having, but it's fair to say I've felt quite dissatisfied at the end of my hobby sessions lately, so I'm leaning towards the former.

[−] cpt_sobel 61d ago
In the plethora of articles explaining the process of building projects with LLMs, one thing I've never understood is why the authors seem to write their prompts as if talking to a human who cares how good their grammar or syntax is, e.g.:

> I'd like to add email support to this bot. Let's think through how we would do this.

and I'm not even talking about the usage of "please" or "thanks" (which this particular author doesn't seem to be doing).

Is there any evidence that suggests the models do a better job if I write my prompt like this instead of "wanna add email support, think how to do this"? In my personal experience (mostly with Junie) I haven't seen any advantage to being "polite", for lack of a better word, and I feel like I'm saving seconds and tokens :)

[−] takwatanabe 61d ago
We build and run a multi-agent system, and today Cursor won: a log analysis task took Cursor 5 minutes and our pipeline 30.

There's still a case for it:

1. Isolated contexts per role (CS vs. engineering), so agents don't bleed into each other
2. Hard permission boundaries per agent (rough sketch below)
3. Local models (Qwen) for cheap routine tasks

Multi-agent loses at debugging. But the structure has value.
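The permission-boundary idea, roughly (simplified; the role names, models, and paths here are placeholders, not our actual config):

```python
from dataclasses import dataclass

@dataclass
class AgentPolicy:
    role: str
    model: str                 # cheap local model vs. frontier model
    allowed_paths: tuple = ()  # hard filesystem boundary per role
    allowed_tools: tuple = ()

POLICIES = {
    "cs": AgentPolicy(
        role="cs",
        model="qwen-local",    # routine triage stays on the local model
        allowed_paths=("tickets/",),
        allowed_tools=("read_file", "search_logs"),
    ),
    "eng": AgentPolicy(
        role="eng",
        model="frontier-model",
        allowed_paths=("src/", "tests/"),
        allowed_tools=("read_file", "write_file", "run_tests"),
    ),
}

def check(policy: AgentPolicy, tool: str, path: str) -> bool:
    """Deny anything outside the agent's role, regardless of what it asks for."""
    return tool in policy.allowed_tools and path.startswith(policy.allowed_paths)
```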

[−] xhale 61d ago
Hi, does anyone have a simple example/scaffold for how to set up agents/skills like this? I've looked at the stavrobots repo and only saw an AGENTS.md. Where do these skills live, then?

(I have seen obra/superpowers mentioned in the comments, but that's already too complex, and it has a UI focus.)

[−] oytis 61d ago
I find the same problem applies to coding too. Even with everyone acting in good faith and reviewing everything themselves before pushing, you essentially have two reviewers instead of a writer and a reviewer, and there is no etiquette yet mandating how thoroughly the "author" should review their own PR. It doesn't help that the amount of code to review gets larger (why else would you go into agentic coding?).

[−] zingar 61d ago
Big +1 for opencode, which for my purposes is interchangeable with or better than Claude Code, and can even use Anthropic models via my GitHub Copilot Pro plan. I use it and Claude when one or the other hits token limits.

Edit: a comment below reminded me why I prefer opencode: a few pages into a Claude session, it's scrolling through the entire conversation history on every output character. No such problem in OC.

[−] zingar 61d ago
On using different models: GitHub Copilot has an API that gives you access to many different models from many different providers. They are very transparent about how they use your data[1]; in some cases it's safer to use a model through them than through the original provider.

You can point Claude Code at the Copilot models with some hackery[2], and opencode supports Copilot models out of the box.

Finally, Copilot is quite generous with the amount of usage you get from a GitHub Pro plan (it goes really far with Sonnet 4.6, which feels pretty close to Opus 4.5), and they're generous with free Pro licenses for open source, etc.

They stuck to autocomplete as their main feature for too long, but this aspect of their service is outstanding.

[1]: https://docs.github.com/en/copilot/reference/ai-models/model...

[2]: https://github.com/ericc-ch/copilot-api

[−] shell0x 61d ago
Isn't that essentially what Amp does? I've used Codex and Claude Max, but I keep going back to Amp: https://ampcode.com/models

It uses different models for different modes.

I just find it faster, and it often gets things right on the first attempt, but YMMV.