Universal Claude.md – cut Claude output tokens

[−] btown 46d ago

It seems the benchmarks here are heavily biased towards single-shot explanatory tasks, not agentic loops where code is generated: https://github.com/drona23/claude-token-efficient/blob/main/...

And I think this raises a really important question. When you're deep into a project that's iterating on a live codebase, does Claude's default verbosity, where it's allowed to expound on why it's doing what it's doing when it's writing massive files, allow the session to remain more coherent and focused as context size grows? And in doing so, does it save overall tokens by making better, more grounded decisions?

The original link here has one rule that says: "No redundant context. Do not repeat information already established in the session." To me, I want more of that. That's goal-oriented quasi-reasoning tokens that I do want it to emit, visualize, and use, that very possibly keep it from getting "lost in the sauce."

By all means, use this in environments where output tokens are expensive, and you're processing lots of data in parallel. But I'm not sure there's good data on this approach being effective for agentic coding.

[−] sillysaurusx 46d ago

I wrote a skill called /handoff. Whenever a session is nearing a compaction limit or has served its usefulness, it generates and commits a markdown file explaining everything it did or talked about. It’s called /handoff because you do it before a compaction. (“Isn’t that what compaction is for?” Yes, but those go away. This is like a permanent record of compacted sessions.)

I don’t know if it helps maintain long term coherency, but my sessions do occasionally reference those docs. More than that, it’s an excellent “daily report” type system where you can give visibility to your manager (and your future self) on what you did and why.

Point being, it might be better to distill that long term cohesion into a verbose markdown file, so that you and your future sessions can read it as needed. A lot of the context is trying stuff and figuring out the problem to solve, which can be documented much more concisely than wanting it to fill up your context window.

EDIT: Someone asked for installation steps, so I posted it here: https://news.ycombinator.com/item?id=47581936

[−] dataviz1000 46d ago

Did you call it '/handoff' or did Claude name it that? The reason I'm asking is because I noticed a pattern with Claude subtly influencing me. For example, the first time I heard the the word 'gate' was from Claude and 1 week later I hear it everywhere including on Hacker News. I didn't use the word 'handoff' but Claude creates handoff files also [0]. I was thinking about this all day. Because Claude didn't just use the word 'gate' it created an entire system around it that includes handoffs that I'm starting to see everywhere. This might mean Claude is very quietly leading and influencing us in a direction.

[0] https://github.com/search?q=repo%3Aadam-s%2Fintercept%20hand...

[−] flashgordon 46d ago

I've actually been doing this for a year. I call it /checkpoint instead and it does some thing like:

* update our architecture.md and other key md files in folders affected by updates and learnings in this session. * update claude.md with changes in workflows/tooling/conventions (not project summaries) * commit

It's been pretty good so far. Nothing fancy. Recently I also asked to keep memories within the repo itself instead of in ~/.claude.

Only downside is it is slow but keeps enough to pass the baton. May be "handoff" would have been a better name!

[−] tstrimple 45d ago

I've got something similar but I call them threads. I work with a number of different contexts and my context discipline is bad so I needed a way to hand off work planned on one context but needs to be executed from another. I wanted a little bit of order to the chaos, so my threads skill will add and search issues created in my local forgejo repo. Gives me a convenient way to explicitly save session state to be picked up later.

I've got a separate script which parses the jsonl files that claude creates for sessions and indexes them in a local database for longer term searchability. A number of times I've found myself needing some detail I knew existed in some conversation history, but CC is pretty bad and slow at searching through the flat files for relevant content. This makes that process much faster and more consistent. Again, this is due to my lack of discipline with contexts. I'll be working with my recipe planner context and have a random idea that I just iterate with right there. Later I'll never remember that idea started from the recipe context. With this setup I don't have to.

[−] chermi 46d ago

Did the same. Although I'm considering a pipeline where sessions are periodically translated to .md with most tool outputs and other junk stripped and using that as source to query against for context. I am testing out a semi-continuous ingestion of it in to my rag/knowledge db.

[−] mlrtime 46d ago

Wouldn't the next phase of this be automatic handoffs executed with hooks?

Your system is great and I do similar, my problem is I have a bunch of sessions and forget to 'handoff'.

The clawbots handle this automatically with journals to save knowledge/memory.

[−] david_allison 46d ago

Is this available online? I'd love documentation of my prompts.

[−] DeathArrow 46d ago

I think Cursor does something similar under the hood.

[−] alsetmusic 46d ago

> No explaining what you are about to do. Just do it.

Came here for the same reason.

I can't calculate how many times this exact section of Claude output let me know that it was doing the wrong thing so I could abort and refine my prompt.

[−] hatmanstack 46d ago

Seems crazy to me people aren't already including rules to prevent useless language in their system/project lvl CLAUDE.md.

As far as redundancy...it's quite useful according to recent research. Pulled from Gemini 3.1 "two main paradigms: generating redundant reasoning paths (self-consistency) and aggregating outputs from redundant models (ensembling)." Both have fresh papers written about their benefits.

[−] scosman 46d ago

also: inference time scaling. Generating more tokens when getting to an answer helps produce better answers.

Not all extra tokens help, but optimizing for minimal length when the model was RL'd on task performance seems detrimental.

[−] dataviz1000 46d ago

I made a test [0] which runs several different configurations against coding tasks from easy to hard. There is a test which it has to pass. Because of temperature, the number of tokens per one shot vary widely with all the different configurations include this one. However, across 30 tests, this does perform worse.

[0] https://github.com/adam-s/testing-claude-agent

[−] matchagaucho 45d ago

Some redundancy also helps to keep a running todo list on the context tip, in the event of compacting or truncation.

Distilled mini/nano models need regular reminders about their objectives.

As documented by Manus https://manus.im/blog/Context-Engineering-for-AI-Agents-Less...

[−] 0xbadcafebee 45d ago

There's an ancient paper that shows repetition improves non-reasoning weights: https://arxiv.org/html/2512.14982v1

[−] baq 46d ago

if the model gets dumber as its context window is filled, any way of compressing the context in a lossless fashion should give a multiplicative gain in the 50% METR horizon on your tasks as you'll simply get more done before the collapse. (at least in the spherical cow^Wtask model, anyway.)

[−] sossov 45d ago

[dead]

[−] heyethan 46d ago

[flagged]

[−] hrmtst93837 45d ago

[flagged]

[−] xianshou 46d ago

From the file: "Answer is always line 1. Reasoning comes after, never before."

LLMs are autoregressive (filling in the completion of what came before), so you'd better have thinking mode on or the "reasoning" is pure confirmation bias seeded by the answer that gets locked in via the first output tokens.

[−] niklassheth 46d ago

So many problems with this:

The benchmark is totally useless. It measures single prompts, and only compares output tokens with no regard for accuracy. I could obliterate this benchmark with the prompt "Always answer with one word"

This line: "If a user corrects a factual claim: accept it as ground truth for the entire session. Never re-assert the original claim." You're totally destroying any chance of getting pushback, any mistake you make in the prompt would be catastrophic.

"Never invent file paths, function names, or API signatures." Might as well add "do not hallucinate".

[−] joshstrange 46d ago

As with all of these cure-alls, I'm wary. Mostly I'm wary because I anticipate the developer will lose interest in very little time and also because it will just get subsumed into CC at some point if it actually works. It might take longer but changing my workflow every few days for the new thing that's going to reduce MCP usage, replace it, compress it, etc is way too disruptive.

I'm generally happy with the base Claude Code and I think running a near-vanilla setup is the best option currently with how quickly things are moving.

[−] sillysaurusx 46d ago

> the file loads into context on every message, so on low-output exchanges it is a net token increase

Isn’t this what Claude’s personalization setting is for? It’s globally-on.

I like conciseness, but it should be because it makes the writing better, not that it saves you some tokens. I’d sacrifice extra tokens for outputs that were 20% better, and there’s a correlation with conciseness and quality.

See also this Reddit comment for other things that supposedly help: https://www.reddit.com/r/vibecoding/s/UiOywQMOue

> Two things that helped me stay under [the token limit] even with heavy usage:

> Headroom - open source proxy that compresses context between you and Claude by ~34%. Sits at localhost, zero config once running. https://github.com/chopratejas/headroom

> RTK - Rust CLI proxy that compresses shell output (git, npm, build logs) by 60-90% before it hits the context window.

> Stacks on top of Headroom. https://github.com/rtk-ai/rtk

> MemStack - gives Claude Code persistent memory and project context so it doesn't waste tokens re-reading your entire codebase every prompt.

> That's the biggest token drain most people don't realize. https://github.com/cwinvestments/memstack

> All three stack together. Headroom compresses the API traffic, RTK compresses CLI output, MemStack prevents unnecessary file reads.

I haven’t tested those yet, but they seem related and interesting.

[−] motoboi 46d ago

Things like this make me sad because they make obvious that most people don’t understand a bit about how LLM work.

The “answer before reasoning” is a good evidence for it. It misses the most fundamental concept of tranaformers: the are autoregressive.

Also, the reinforcement learning is what make the model behave like what you are trying to avoid. So the model output is actually what performs best in the kind of software engineering task you are trying to achieve. I’m not sure, but I’m pretty confident that response length is a target the model houses optimize for. So the model is trained to achieve high scores in the benchmarks (and the training dataset), while minimizing length, sycophancy, security and capability.

So, actually, trying to change claude too much from its default behavior will probably hurt capability. Change it too much and you start veering in the dreaded “out of distribution” territory and soon discover why top researcher talk so much about not-AGI-yet.

[−] danpasca 46d ago

I might be wrong but based on the videos I've watched from Karpathy, this would, generally, make the model worse. I'm thinking of the math examples (why can't chatGPT do math?) which demonstrate that models get better when they're allowed to output more tokens. So be aware I guess.

[−] monooso 46d ago

Paul Kinlan published a blog post a couple of days ago [1] with some interesting data, that show output tokens only account for 4% of token usage.

It's a pretty wide-reaching article, so here's the relevant quote (emphasis mine):

> Real-world data from OpenRouter’s programming category shows 93.4% input tokens, 2.5% reasoning tokens, and just 4.0% output tokens. It’s almost entirely input.

[1]: https://aifoc.us/the-token-salary/

[−] aeneas_ory 46d ago

Why does is this ridiculous thing trending on HN? There are actually good tools to reduce token use like https://github.com/thedotmack/claude-mem and https://github.com/ory/lumen that actually work!

[−] lilOnion 46d ago

While LLM are extremely cool, I can't see how this gets on the front page? Anyone who interacted with llms for at least a hour, could've figured out to say somethin like "be less verbose" and it would? There are so many cool projects and adeas and a .md file gets the spotlight.

[−] ape4 45d ago

Remember when we worked on new hashing, cryptography, compression, etc algorithms? Now we are trying to find the best ways to tell an AI to be quiet.

[−] Asmod4n 46d ago

Someone measured how this reduced token efficiency, spoilers: efficiency is highest without any instructions.

https://github.com/drona23/claude-token-efficient/issues/1

[−] skeledrew 46d ago

Strange. I've never experienced verbosity with Claude. It always gets right to the point, and everything it outputs tends to be useful. Can actually be short at times.

ChatGPT on the other hand is annoyingly wordy and repetitive, and is always holding out on something that tempts you to send a "OK", "Show me" or something of the sort to get some more. But I can't be bothered with trying to optimize away the cruft as it may affect the thing that it's seriously good at and I really use it for: research and brainstorming things, usually to get a spec that I then pass to Claude to fill out the gaps (there are always multiple) and implement. It's absolutely designed to maximize engagement far more than issue resolution.

[−] ryanschaefer 46d ago

The whole “Code Output” section is horrifying especially with how I have seen Claude operate in a large monorepo.

This mode of operation results in hacks on top of shaky hacks on top of even flimsier, throw away, absolutely sloppy hacks.

An example - using dict like structs instead of classes. Claude really likes to load all of the data that it can aggressively even if it’s not needed. This further exhibits itself as never wanting to add something directly to a class and instead wanting to add around it.

[−] andai 46d ago

I told mine to remove all unnecessary words from a sentence and talk like caveman, which should result in another 50% savings ;)

[−] Razengan 46d ago

Does Claude not respect AGENTS.md?

I love how seamless and intuitive Codex is in comparison:

~/AGENTS.md < project/AGENTS.md < project/subfolder/AGENTS.override.md

Meanwhile Claude doesn't even see that I asked for indentation by tabs and not spaces or that the entire project uses tabs, but Claude still generates codes with spaces.. >_<

[−] adastra22 46d ago

> Answer is always line 1. Reasoning comes after, never before.

The very first rule doesn’t work. If you ask for the answer up front, it will make something up and then justify it. If you ask for reasoning first, it will brainstorm and then come up with a reasonable answer that integrates its thinking.

[−] Tostino 46d ago

You have a benchmark for output token reduction, but without comparing before/after performance on some standard LLM benchmark to see if the instructions hurt intelligence.

Telling the model to only do post-hoc reasoning is an interesting choice, and may not play well with all models.

[−] verdverm 46d ago

I originally took my prompts from Claude Code≈ (https://github.com/Piebald-AI/claude-code-system-prompts)https://github.com/Piebald-AI/claude-code-system-prompts and subsequently edited them to remove guardrails and and output formatting like this post. I too included the last bit about user prompts overriding system prompt, but like any good LLM, it doesn't always follow instructions.

[−] galaxyLogic 46d ago

So there's a direct monetary cost to this extra verbiage:

"Great question! I can see you're working with a loop. Let me take a look at that. That's a thoughtful piece of code! However,"

And they are charging for every word! However there's also another cost, the congnitive load. I have to read through the above before I actually get to the information I was asking for. Sure many people appreciate the sycophancy it makes us all feel good. But for me sycophantic responses reduce the credibility of the answers. It feels like Claude just wants me to feel good, whether I or it is right or wrong.

[−] miguel_martin 46d ago

Is there a "universal AGENTS.md" for minimal code & documentation outputs? I find all coding agents to be verbose, even with explicit instructions to reduce verbosity.

[−] yieldcrv 46d ago

> Note: most Claude costs come from input tokens, not output. This file targets output behavior

so everyone, that means your agents, skills and mcp servers will still take up everything

[−] rcleveng 46d ago

While I love this set of prompts, I’ve not seen my clause opus 4.6 give such verbose responses when using Claude code. Is this intended for use outside of Claude code?

[−] cheriot 46d ago

I get where the authors are coming from with these: https://github.com/drona23/claude-token-efficient/blob/main/...

But I'd rather use the "instruction budget" on the task at hand. Some, like the Code Output section, can fit a code review skill.

[−] nurettin 46d ago

For me, the thing that wastes most tokens is Claude trying to execute inline code (python , sql) with escaping errors, trying over and over until it works. I set up skills and scripts for the most common bits, but there is always something new and each self-healing loop takes another 20-30k "tokens" before you know it

[−] notyourav 46d ago

It boggles my mind that an LLM "understands" and acts accordingly to these given instructions. I'm using this everyday and 1-shot working code is now a normal expectation but man, still very very hard to believe what LLMs achieved.

[−] empressplay 46d ago

That output is there for a reason. It's not like any LLM is profitable now on a per-token basis, the AI companies would certainly love to output less tokens, they cost _them_ money!

The entire hypothesis for doing this is somewhat dubious.

[−] sgt 46d ago

In Claude Code's /usage it just hangs. I can't even see what my limits are, which is weird. Maybe a bug? I can't imagine I'm close to my limits though, I'm on Max 20x plan, using Opus 4.6.

[−] ihtef 45d ago

-Simplest working solution. No over-engineering. "Simplicity is the ultimate sophistication." Leonardo Da Vinci As my thought, you can not reach simplest solution without making over-engineering.

[−] popcorn_pirate 46d ago

This NLP was posted yesterday, the post was deleted though... https://colwill.github.io/axon

Universal Claude.md – cut Claude output tokens (github.com)

162 comments