Get Shit Done: A meta-prompting, context engineering and spec-driven dev system (github.com)

by stefankuehnel 254 comments 473 points

[−] gtirloni 60d ago
I was using this and superpowers, but eventually Plan mode became enough and I prefer to steer Claude Code myself. These frameworks are great for fire-and-forget tasks, especially when there is some research involved, but they burn 10x more tokens in my experience. I was always hitting the Max plan limits for no discernible benefit in the outcomes I was getting. But this will vary a lot depending on how people prefer to work.
[−] marcus_holmes 60d ago
I ended up grafting the brainstorm, design, and implementation planning skills from Superpowers onto a Ralph-based implementation layer that doesn't ask for my input once the implementation plan is complete. I have to run it in a Docker sandbox because of the dangerously set permissions but that is probably a good idea anyway.

It's working, and I'm enjoying how productive it is, but it feels like a step on a journey rather than the actual destination. I'm looking forward to seeing where this journey ends up.

[−] jghn 60d ago
I've gone the other way recently, shifting from pure plan mode to superpowers. I was reminded of it due to the announcement of the latest version.

It is perhaps confirmation bias on my part but I've been finding it's doing a better job with similar problems than I was getting with base plan mode. I've been attributing this to its multiple layers of cross checks and self-reviews. Yes, I could do that by hand of course, but I find superpowers is automating what I was already trying to accomplish in this regard.

[−] healsdata 59d ago
Just tried GSD and Plan Mode on the exact same task (prompt in an MD file). Plan Mode produced a plan and then a basic implementation in twenty minutes. GSD ran for hours to achieve the same thing.

I reviewed the code from both and the GSD code was definitely written with the rest of the project and possibilities in mind, while the Claude Plan was just enough for the MVP.

I can see both having their pros and cons depending on your workflow and size of the task.

[−] Rapzid 60d ago
I use GitHub Copilot and unfortunately there has been a weird regression in the bundled Plan mode. When they added the new plan memory, it suddenly started getting both VERY verbose in the plan output and also vague in the details. It's adding a lot of steps that are like "design" and "figure out", and it railroads you into implementation without asking follow-up questions.
[−] whalesalad 59d ago
Same experience. Superpowers is a little too overzealous at times. For coding especially, I don't like seeing a comprehensive design spec written (good) and then having it turned into effectively the same doc but macro-expanded into a complete implementation, with the literal code for the entire thing in a second doc (bad). Even for trivial changes I'd end up with a good and succinct -design.md, then an -implementation.md, then end up with a swarm of subagents getting into races while more or less just grabbing a block from the implementation file and writing it.

A mess. I still enjoy superpowers brainstorming but will pull the chute towards the end and then deliver myself.

[−] sigbottle 59d ago
Yup yup yup. I burned literally a week's worth of the $20 Claude subscription and then $20 worth of API credits on gsdv2. To get like 500 LOC.

And that was AFTER literally burning a week's worth of the Codex and Claude $20 plans and $50 of API credits and getting completely bumfucked - the AI was faking out tests, etc.

I had better experiences just guiding the thing myself. It definitely was not a set-and-forget experience (6 hours of constant monitoring), but I was able to get a full research MVP that informed the next iteration with only 75% of a Codex weekly plan.

[−] SayThatSh 60d ago
I've played around a bit with the plugins and, as you've said, plan mode really handles things fine for the most part. I've got various workflows I run through in Claude, and I've found that having CC create custom skills/agents for them gets me 80% of the way there. Letting the Claude file refer to them, rather than trying to define entire workflows within it, also goes a long way. It'll still forget things here and there, leading to wasted tokens as it realizes it's being dumb and corrects itself, but nothing too crazy. At least, it's more than enough to let me continue using it naturally rather than memorizing a million slash commands to manually invoke.
[−] abhisek 60d ago
I have been using superpowers for Gryph development for a while. Love the brainstorming and exploration that it brings. Haven't really compared token usage, but that's on my list.
[−] locknitpicker 59d ago

> I was using this and superpowers but eventually, Plan mode became enough and I prefer to steer Claude Code myself.

Plan mode is great, but to me that's just prompting your LLM agent of choice to generate an ad-hoc, imprecise, and incomplete spec.

The downside of specs is that they can consume a lot of context window with things that are not needed for the task. When that is a concern, passing the spec to plan mode tends to mitigate the issue.

[−] hatmanstack 60d ago
Why are we using CLI wrappers if you're using Claude Code? I get it if you need something like Codex, but they released subagents today, so maybe not even that. For Claude Code it's an unnecessary wrapper.
[−] joegaebel 59d ago
In my view, spec-driven systems are doomed to fail. There's nothing that couples the English-language specs you've written to the actual code and behaviour of the system - unless your agent is being insanely diligent and constantly checking whether the entire system aligns with your specs.

This has been solved already - automated testing. They encode behaviour of the system into executables which actually tell you if your system aligns or not.

Better to encode the behaviour of your system into real, executable, scalable specs (aka automated tests), otherwise your app's behaviour is going to spiral out of control after the Nth AI generated feature.

The way to ensure this actually scales with the firepower LLMs have for writing implementations is to make the agent follow a workflow where it knows how to test, writes the tests first, and uses mutation testing to ensure the tests actually reflect the behaviour of the system.
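
As a toy sketch of what I mean (the discount feature and names are hypothetical; assuming pytest), the spec becomes an executable test that exists before the implementation does:

```python
# test_pricing.py -- written BEFORE pricing.py exists.
# The spec "orders over 100 get 10% off, capped at 50" lives here as
# executable behaviour rather than as prose in a markdown file.
import pytest

from pricing import apply_discount  # the module the agent is asked to create


@pytest.mark.parametrize(
    ("subtotal", "expected"),
    [
        (99.0, 99.0),      # below threshold: no discount
        (100.0, 100.0),    # boundary: threshold is exclusive
        (200.0, 180.0),    # 10% off
        (1000.0, 950.0),   # discount capped at 50
    ],
)
def test_apply_discount(subtotal: float, expected: float) -> None:
    assert apply_discount(subtotal) == pytest.approx(expected)
```

Then run a mutation-testing tool (e.g. `mutmut run`) over the implementation; surviving mutants tell you which parts of the "spec" the tests don't actually pin down.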

I've scoped this out here [1] and here [2].

[1] https://www.joegaebel.com/articles/principled-agentic-softwa... [2] https://github.com/JoeGaebel/outside-in-tdd-starter

[−] coopykins 59d ago
There are so many of these "meta" frameworks going around. I have yet to see one that proves in any meaningful way that it improves anything. I have a hard time believing they accomplish anything other than burning tokens and poisoning the context window with too much information. What works best IME is keeping things simple and clear, providing only the essential information for the task at hand, and iterating in manageable slices rather than trying to one-shot complex tasks. Just Plan, Code and Verify, simple as that.
[−] AndyNemmity 60d ago
I have an AI system I use. I'd like to release it so others can benefit, but at the same time it's all custom to myself and what I do and work on.

If I fork out a version for others that is public, then I have to maintain that variation as well.

Is anyone in a similar situation? I think most of the ones I see released are not particularly complex compared to my system, but at the same time I don't know how to convey how to use my system, as someone who just uses it alone.

It feels like I don't want anyone to run my system; I just want people to point their AI system at mine and ask it what there is of value to potentially add to their own system.

I don't want to maintain one for people. I don't want to market it as some magic cure. Just show patterns that others can use.

[−] maccam912 60d ago
I've had a good experience with https://github.com/obra/superpowers. At first glance this looks similar. Has anyone tried both who can offer a comparison?
[−] yoaviram 60d ago
I've been using GSD extensively over the past 3 months. I previously used speckit, which I found lacking. GSD consistently gets me 95% of the way there on complex tasks. That's amazing. The last 5% is mostly "manual" testing. We've used GSD to build and launch a SaaS product including an agent-first CMS (whiteboar.it).

It's hard to say why GSD worked so much better for us than other similar frameworks, because the underlying models also improved considerably during the same period. What is clear is that it's a huge productivity boost over vanilla Claude Code.

[−] Frannky 60d ago
I tried it once; it was incredibly verbose, generating an insane number of files. I stopped using it because I was worried it would not be possible to rapidly, cheaply, and robustly update things as interaction with users generated new requirements.

The best approach I have today is to start with a project requirements document, then ask for a step-by-step implementation plan, and then have it do the thing at each step, but only after I greenlight the strategy for the current step. I also specify minimal, modular, functional stateless code.

[−] gbrindisi 60d ago
I like openspec, it lets you tune the workflow to your liking and doesn’t get in the way.

I started with all the standard spec flow and as I got more confident and opinionated I simplified it to my liking.

I think the point of any spec-driven framework is that you eventually want to own the workflow yourself, so that you can constrain code generation on your own terms.

[−] btiwaree 59d ago
I used this for a team hackathon and it took way too much time to build an understanding of the codebase, wrote too many agent transcripts, and spent way too many tokens during generation. It also failed multiple times when either generating an agent transcript or extracting things from one - once citing "The agent transcripts are too complex to extract from", which is quite confounding considering it's the transcript it created. For what we were trying to build - a few small sets of features - using GSD was overkill. The idea was to learn whether GSD could be useful; for our case it was a strong no. Learning for me: don't overcomplicate - write better specs, use Claude plan mode, iterate.
[−] toastal 59d ago
This pile of Markdown files has the most cringe-inducing name I have seen in weeks.
[−] DamienB 60d ago
I've compared this to superpowers and the classic PRD -> task generator, and I came away convinced that less is more, at least at the moment. GSD performed well, but took hours instead of minutes. Having a simple explanation of how to create a PRD, followed by a slightly more technical task list, performed much better. It wasn't that GSD or superpowers couldn't find a solution; it's just that they did it much slower and with a lot more help. For me, the lesson was that the workflow has changed and that we can't apply old project-dev paradigms to this new/alien technology. There's a new instruction manual and it doesn't build on the old one.
[−] recroad 60d ago
I use openspec and love it. I'm doing 5-7x with close to 100% of the code AI-generated, and shipping to production multiple times a day. I work on a large SaaS app with hundreds of customers. Wrote something here:

https://zarar.dev/spec-driven-development-from-vibe-coding-t...

[−] vinnymac 60d ago
I tried this for a week and gave up. Required far too much back and forth. Ate too many tokens, and required too much human in the loop.

For this reason I don't think it's actually a good name. It should be called planning-shit instead, since that's seemingly 80%+ of what I did while interacting with this tool. And when it came to getting things done, I didn't need this at all, and the plans were just alright.

[−] galexyending 60d ago
I gave it a shot, but won't be using it going forward. It requires a waterfall process, and I found it difficult, and in some cases impossible, to adjust phases/plans when bugs or feature changes arise. The execution prompts didn't do a good job of steering the code to be verified while coding, and it relies on the user to manually test at the end of each phase.
[−] obsidianbases1 60d ago

> If you know clearly what you want

This is the real challenge. The people I know who jump around to new tools have a tough time explaining what they want, and thus how the new tool is better than the last one.

[−] paddy_m 59d ago
Has anything like this been built?

I want a system that enforces planning, tests, and adversarial review (preferably by a different company's model). This is more for features, less for overall planning, but a similar workflow could be built for planning.

1. Prompt
2. Research
3. Plan (including the tests that will be written to verify the feature)
4. Adversarial review of the plan
5. Implementation of the tests; CI must fail on the tests
6. Adversarial review verifying that the tests match the plan
7. Implementation to make the tests pass
8. Adversarial PR review of the implementation

I want to be able to check on the status of PRs based on how far along they are, read the plans, suggest changes, read the tests, suggest changes. I want a web UI for that, I don't want to be doing all of this in multiple terminal windows.

A key feature that I want is that if a step fails, especially because of adversarial review, the whole PR branch is force-pushed back to the previous state. So say #6 fails: #5 is re-invoked with the review information. Or if I come to the system and a PR is at #8 and I don't like the plan, then I make some edits to the plan (#3), the PR is reset to the git commit after the original plan, and the LLM is re-invoked with either my new plan or, more likely, my edits to the plan; then everything flows through again.

I want to be able to sit down, tend to a bunch of issues, then come back in a couple of hours and see progress.

I have a design for this of course. I haven't implemented it yet.
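
Roughly, the control loop looks something like this (a sketch only, with hypothetical names; in practice you'd force-push the branch rather than just reset locally). Each stage records the commit it started from, and a failed adversarial review rewinds the branch to the commit before the stage it was reviewing and re-runs from there:

```python
# Sketch of the review pipeline: stages run in order; each records the commit
# it started from; when an adversarial review fails, the branch is reset to
# the state before the reviewed stage and that stage is redone with the
# review feedback. Illustrative only, not a real tool.
import subprocess
from dataclasses import dataclass, field
from typing import Callable


def git(*args: str) -> str:
    return subprocess.run(
        ["git", *args], check=True, capture_output=True, text=True
    ).stdout.strip()


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]      # returns True if the stage (or its review) passed
    reviews: int | None = None   # index of the stage this one adversarially reviews


@dataclass
class Pipeline:
    stages: list[Stage]
    start_commit: dict[int, str] = field(default_factory=dict)

    def execute(self) -> None:
        i = 0
        while i < len(self.stages):
            self.start_commit[i] = git("rev-parse", "HEAD")
            if self.stages[i].run():
                i += 1
                continue
            # Failure: rewind to the state before the reviewed stage (or this
            # stage itself) and redo from there, carrying the review feedback.
            redo = self.stages[i].reviews if self.stages[i].reviews is not None else i
            git("reset", "--hard", self.start_commit[redo])
            i = redo


# e.g. Pipeline([prompt, research, plan, plan_review, write_tests,
#                tests_review, implement, pr_review]), where plan_review.reviews
# points at plan, tests_review.reviews at write_tests, and so on.
```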

[−] giancarlostoro 59d ago
Nice, I like the UI more than mine. I built a similar tool out of minor frustrations with some design choices in Beads; mine uses SQLite exclusively instead of git or hard files. I've been using it for all my personal projects, but haven't gone back to try and refine what I have a little more. One thing a lot of these don't do, which I added to mine, is syncing to and from GitHub. I want people to see exactly what my local tasks are and, if they need to, pull one down to work on.

I think the secret sauce is to talk to the model about what you want first and make the plan; then, when you feel good about the spec, regardless of tooling (you can even just use a simple markdown file!), you have it work on it. Since it always has a file to go back to, it can never 'forget'; it just needs to remember to review the file. The more detail in the file, the more powerful the output.

Tell your coding model how you want it, what you want, and why you want it. It also helps to ask it to poke holes and raise concerns (bypass its overly agreeable nature so you don't waste time on things that are too complex).

I love using Claude to prototype ideas that have been in my brain for years, and they wind up coming out better than I ever envisioned.

[−] seneca 60d ago
I've tried several of these sorts of things, and I keep coming away with the feeling that they are a lot of ceremony and complication for not much value. I appreciate that people are experimenting with how to work with AI and get actual value, but I think pretty much all of these approaches are adding complexity without much, or often any, gain.

That's not a reason to stop trying. This is the iterative process of figuring out what works.

[−] dfltr 60d ago
GSD has a reputation for being a token burner compared to something like Superpowers. Has that changed lately? Always open to revisiting things as they improve.
[−] melvinroest 60d ago
If you want some context about spec-driven development and how it could be used with LLMs, I recommend [1]. Having some background like that helps me understand tools like this a bit more.

[1] https://www.riaanzoetmulder.com/articles/ai-assisted-program...

[−] anentropic 59d ago
I have been using this a lot lately and ... it's good.

Sometimes annoying - you can't really fire and forget (I tend to regret skipping discussion on any complex tasks). It asks a lot of questions. But I think that's partly why the results are pretty good.

The new /gsd:list-phase-assumptions command added recently has been a big help there to avoid needing a Q&A discussion on every phase - you can review and clear up any misapprehensions in one go and then tell it to plan -> execute without intervention.

It burns quite a lot of tokens reading and re-reading its own planning files at various times, but it manages context effectively.

Been using the Claude version mostly. Tried it in OpenCode too, but it's a bit buggy.

They are working on a standalone version built on pi.dev: https://github.com/gsd-build/gsd-2 ...the rationale is good, I guess, but it's unfortunate that you can't then use your Claude Max credits with it, as it has to use the API.

[−] arjie 60d ago
I could not produce useful output from this. It was useful as a rubber duck because it asks good motivating questions during the plan phase, but the actual implementation was lacklustre and not worth the effort. In the end, I just have Claude Opus create plans, and then I have it write them to memory and update it as it goes along and the output is better.
[−] visarga 60d ago
I built a similar system myself, then ran evals on it and found that the planning ceremony is mostly useless: Claude can deal with simple prose, item lists, checkbox todos - anything works. The agent won't be a better coder because of how you deliver your intent.

But what makes a difference is running a plan-review agent and a work-review agent; they fix issues before and after the work. Both pull their weight, but the most surprising is the plan-review one. The work-review judge reliably finds bugs to fix, but isn't as surprising in its insights. They should run as separate subagents, not in the main one, because they need a fresh perspective.

Other things that matter are (1) testing enforcement and (2) cross-task project memory. My implementation for memory is a combination of capturing user messages with a hook, an append-only log, and a compressed memory state of the project, which gets read before work and updated after each task.
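
A simplified sketch of the memory part (the file names are mine, and the hook wiring is whatever your harness supports for capturing user messages):

```python
# memory.py -- append-only capture plus a compressed project state.
# A hook (whatever the agent exposes for "user sent a message") calls
# append_event(); the agent reads state.md before work and update_state()
# rewrites it after each task.
import json
import time
from pathlib import Path

LOG = Path(".memory/events.jsonl")   # append-only, never edited
STATE = Path(".memory/state.md")     # small, compressed, re-read each task


def append_event(role: str, text: str) -> None:
    LOG.parent.mkdir(exist_ok=True)
    with LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"ts": time.time(), "role": role, "text": text}) + "\n")


def read_state() -> str:
    return STATE.read_text(encoding="utf-8") if STATE.exists() else ""


def update_state(summarise) -> None:
    # `summarise` is a callable (in practice, an LLM call) that folds the raw
    # log into a short, current picture of the project.
    events = [json.loads(line) for line in LOG.read_text(encoding="utf-8").splitlines()]
    STATE.write_text(summarise(read_state(), events), encoding="utf-8")
```

The important property is that the log only ever grows, while the state file stays small enough to read at the start of every task.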

[−] jankhg 60d ago
Apart from GSD and superpowers, there's another system called PAUL [1]. It apparently requires fewer tokens than GSD, as it does not use subagents but keeps everything in one session. A detailed comparison with GSD is part of the repo [2].

[1] https://github.com/ChristopherKahler/paul

[2] https://github.com/ChristopherKahler/paul/blob/main/PAUL-VS-...

[−] theodorewiles 60d ago
I think the research / plan / execute idea is good but feels like you would be outsourcing your thinking. Gotta review the plan and spend your own thinking tokens!
[−] smusamashah 60d ago
There should be an "Examples" section in projects like this to show what has actually been made using it. I scrolled to the end and was really expecting an example, given the way it's being advertised.

If it were a game engine or a new web framework, for example, there would be demos or example projects linked somewhere.

[−] jdwyah 59d ago
I used GSD for a bit. It was helpful for a side project where I constantly forgot where I was in implementation. Helpful to be able to just say "Do the next thing"

I would imagine that for a non-engineer trying to code it would be quite useful: it delivers a better result and is less liable to end up in a total mess. But for experienced engineers it quickly felt like overkill, and Claude itself just gets better and better. Particularly once we got agent swarms, I left GSD and don't think I'll be back. But I would recommend it to non-coders trying to code.

[−] randomthought12 59d ago
I tried this but it creates a lot of content inside the repository and I don't like that. I understand these tools need to organize their context somewhere to be efficient but I feel that it just pollutes my space.

If multiple people work with different AI tools on the same project, they will all add their own stuff in the project and it will become messy real quick.

I'll keep superpowers, claude-mem, context7 for the moment. This combination produces good results for me.

[−] chrisss395 60d ago
I'm curious if anyone has used this (or similar) to build a production system?

I'm facing increasing pressure from senior executives who think we can avoid the $$$ B2B SaaS by using AI to vibe code a custom solution. I love the idea of experimenting with this but am horrified by the first-ever case being a production system that is critical to the annual strategic plan. :-/

[−] tomkaczocha 56d ago
I’ve been testing GSD on several of my projects recently. I also looked at GSD-2. Both of these have some interesting features but they are so very slow. Before learning about GSD I built my own framework. That’s so much quicker. It’s on GitHub.
[−] rdtsc 59d ago

> GSD is designed for frictionless automation. Run Claude Code with: claude --dangerously-skip-permissions

Is this supposed to run in a VM?

[−] Andrei_dev 60d ago
250K lines in a month — okay, but what does review actually look like at that volume?

I've been poking at security issues in AI-generated repos and it's the same thing: more generation means less review. Not just logic — checking what's in your .env, whether API routes have auth middleware, whether debug endpoints made it to prod.

You can move that fast. But "review" means something different now. Humans make human mistakes. AI writes clean-looking code that ships with hardcoded credentials because some template had them and nobody caught it.

All these frameworks are racing to generate faster. Nobody's solving the verification side at that speed.
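
Even a dumb mechanical pass catches some of this. A toy sketch of the kind of check I mean (the patterns and exit behaviour are illustrative, not a real scanner):

```python
# scan.py -- toy pre-merge check: flag likely hardcoded credentials and
# debug settings before they ship. Not a substitute for review, just the
# cheap mechanical layer of it.
import re
import sys
from pathlib import Path

PATTERNS = {
    "hardcoded secret": re.compile(
        r"(api[_-]?key|secret|password)\s*[:=]\s*['\"][^'\"]{8,}", re.I
    ),
    "AWS access key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "debug left on": re.compile(r"DEBUG\s*=\s*True"),
}


def scan(root: str = ".") -> int:
    hits = 0
    for path in Path(root).rglob("*"):
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue
        for label, pattern in PATTERNS.items():
            for match in pattern.finditer(text):
                line = text.count("\n", 0, match.start()) + 1
                print(f"{path}:{line}: {label}")
                hits += 1
    return hits


if __name__ == "__main__":
    sys.exit(1 if scan() else 0)
```

None of this replaces a human reading the diff, but it is the kind of check that can keep pace with the generation rate.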

[−] yoavsha1 60d ago
How come we have all these benchmarks for models, but none whatsoever for harnesses / whatever you'd call this? While I understand assigning "scores" is more nuanced, I'd love to see a website with a catalog of prompts and the outputs produced by different configurations of model + harness in a single attempt.
[−] jcmontx 59d ago
My experience with this library has been underwhelming, sadly. I have a better experience going raw with any CLI agent.
[−] lemax 59d ago
I'm still stuck on superpowers. I can't seem to get better plans out of native Claude planning - superpowers ensures I have a reviewed design that actually matches my mental model. Typical Claude planning doesn't confirm assumptions sufficiently for my weak brain dumps / poorly spec'd tickets.
[−] MeetingsBrowser 60d ago
I've tried it, and I'm not convinced I got measurably better results than just prompting claude code directly.

It absolutely tore through tokens though. I don't normally hit my session limits, but hit the 5-hour limits in ~30 minutes and my weekly limits by Tuesday with GSD.

[−] BTAQA 59d ago
Built my first SaaS as a frontend dev with no backend experience using a similar approach. The key shift was treating Claude Code as a senior developer who needs clear specs, not a magic box. The more precise the context and requirements, the better the output. Vague prompts produce vague code.
[−] LoganDark 60d ago
This seems like something I'd want to try but I am wholly opposed to npx being the sole installation mechanism. Let me install it as a plugin in Claude Code. I don't want npx to stomp all over my home directory / system configuration for this, or auto-find directories or anything like that.
[−] ricardo_lien 59d ago
I tried it after watching the video demo from the repo creator, and it looked quite impressive at first, so I decided to rebuild my side project with it. But after a few days I realized it was not for me. It's way too much of a black box for me as an engineer, not a prompter.