From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem (news.future-shock.ai)

by future-shock-ai 10 comments 157 points


[−] coppsilgold 45d ago
There are also interesting approaches to more directly compress a large document or an entire codebase into a smaller set of tokens without getting the LLM to wing it. For example, Cartridges: <https://hazyresearch.stanford.edu/blog/2025-06-08-cartridges>

They basically get gradient descent to optimize the KV cache while freezing the network.
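In toy form (my sketch of the idea, not the Cartridges code): freeze a tiny attention "layer," then gradient-descend on a small set of KV pairs so that attending over the compressed cache approximates attending over the full context. All the sizes and the finite-difference gradient are stand-ins to keep it dependency-free beyond numpy.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                     # head dimension (made up)
full_kv = rng.normal(size=(32, 2, d))     # 32 context tokens: (K, V) each
cache = rng.normal(size=(4, 2, d))        # learned cache: only 4 tokens
queries = rng.normal(size=(16, d))        # probe queries

def attend(q, kv):
    # standard scaled dot-product attention over a KV set
    k, v = kv[:, 0], kv[:, 1]
    s = q @ k.T / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

target = attend(queries, full_kv)         # frozen "network's" reference output

def loss(kv):
    return float(np.mean((attend(queries, kv) - target) ** 2))

start = loss(cache)
lr, eps = 0.2, 1e-4
flat = cache.ravel()                      # view: updates write into cache
for _ in range(200):
    grad = np.empty_like(flat)
    for i in range(flat.size):            # finite-difference gradient
        orig = flat[i]
        flat[i] = orig + eps; hi = loss(cache)
        flat[i] = orig - eps; lo = loss(cache)
        flat[i] = orig
        grad[i] = (hi - lo) / (2 * eps)
    flat -= lr * grad                     # only the cache moves; model is frozen
```

After 200 steps the 4-token cache reconstructs the 32-token context's attention outputs noticeably better than the random init, which is the whole trick: the cache, not the weights, is the trainable object.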

[−] refulgentis 45d ago
Good prose, but it keeps collapsing distinct layers of the stack into one poetic notion of “memory.” KV cache, prompt caching, product-level saved memory, transcript storage, retrieval, summarization, and long-context failure modes are different mechanisms with different failure modes. Once those boundaries disappear, you get lines like “API pricing is the price of remembering.” Evocative, sure. Explanatory, not really.

Same thing in the technical bits.

“Computation drops from quadratic to linear” is only narrowly true for incremental decoding after the prefix is already processed.
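To put numbers on that (my back-of-envelope, counting attention score computations): processing an n-token prefix from scratch is quadratic, while generating one more token against an already-built KV cache is linear in n.

```python
def prefill_scores(n):
    # token i attends to itself and all i earlier tokens
    return sum(i + 1 for i in range(n))   # n*(n+1)/2, ~quadratic

def decode_scores(n):
    # one new token attends to n cached tokens plus itself
    return n + 1                          # linear

print(prefill_scores(4096))   # 8,390,656
print(decode_scores(4096))    # 4,097
```

The "drops to linear" claim only describes the second function; the first one still has to run once.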

“When the KV cache gets too large, the standard solution is compaction” is worse: the standard responses are boring systems tricks like limits, eviction, paging/offload, compression, etc. Summarization is usually an application workaround where you throw away old text and replace it with a shorter prompt. The cache never became a summary; the prompt did.
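For concreteness, the most boring of the boring tricks looks like this (illustrative sketch only; real runtimes also page to CPU/disk and compress): a sliding-window cache that evicts the oldest entries once a token budget is hit, without summarizing anything.

```python
from collections import deque

class SlidingKVCache:
    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.entries = deque()          # one (key, value) pair per token

    def append(self, k, v):
        self.entries.append((k, v))
        while len(self.entries) > self.max_tokens:
            self.entries.popleft()      # evict the oldest token's KV
```

The cache stays a cache; nothing in it ever turns into a summary.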

So I wouldn’t call the piece wrong so much as aggressively smooth. It knows the vocabulary, but it keeps letting metaphor outrun mechanism.

[−] nstj 45d ago
concur - a lot of the article was useful but a lot of it was "sorta the right stuff in sorta the wrong place"
[−] LuxBennu 45d ago
good overview of the architecture side but worth mentioning there's another axis that stacks on top of all of this: you can quantize the kv cache itself at inference time. in llama.cpp you can run q8 for keys and q4 for values and it cuts cache memory roughly in half again on top of whatever gqa or mla already saves you. i run qwen 70b 4-bit on m2 max 96gb and the kv quant is what actually made longer contexts fit without running out of unified memory. keys need more precision because they drive attention scores but values are way more tolerant of lossy compression, so the asymmetry works out.
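rough sketch of why the asymmetry works (simple absmax scaling, not llama.cpp's actual q8_0/q4_0 block formats): the 8-bit grid is just much finer than the 4-bit one, so the same values round-trip with far less error.

```python
def quantize(xs, bits):
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit
    scale = (max(abs(x) for x in xs) / qmax) or 1.0
    return [round(x / scale) for x in xs], scale

def dequantize(qs, scale):
    return [q * scale for q in qs]

keys = [0.12, -0.87, 0.33, 0.95]          # made-up key activations
q8, s8 = quantize(keys, 8)
q4, s4 = quantize(keys, 4)
err8 = max(abs(a - b) for a, b in zip(keys, dequantize(q8, s8)))
err4 = max(abs(a - b) for a, b in zip(keys, dequantize(q4, s4)))
```

keys go through a softmax, so small errors in them get amplified into attention-weight errors, while value errors just average out. hence q8 keys, q4 values.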
[−] suprjami 45d ago
Some models really suffer badly from KV quantisation. You can also take a speed hit using dissimilar K and V types.

TurboQuant seems to be the next big thing in context memory usage. Polar coordinates achieving ~5x reduction in memory usage with minimal/no quality loss, and even a slight speedup in some cases.

[−] Ecko123 45d ago
[dead]
[−] LuxBennu 45d ago
yeah fair point, it's definitely model dependent. i've had good results with qwen but tried it on a smaller mistral variant once and the output quality dropped noticeably even at q8 for both. the speed hit from mixed types hasn't been bad on apple silicon in my experience but i can see it mattering more on cuda.
[−] hrmtst93837 45d ago
[flagged]
[−] az09mugen 45d ago
Unrelated, but 69KB is how much RAM Voyager 1 has.
[−] gregman1 45d ago
Voyager as a token of curiosity
[−] sachamorard 45d ago
The compaction problem described here is worse than it looks because of the asymmetry between the compactor and the reader. The model doing the compaction has full access to everything, it can see all six rules in the policy, the exact budget figure, every constraint. The model reading the summary has no reference point to notice what's missing. There's no checksum on memory.

The article mentions the void between volatile KV cache and permanent weights. One thing that lives in that void: compression results. At Edgee we cache prompt compression outputs in a globally distributed KV store specifically to avoid recomputing them on every request. It maps naturally to the architecture: the cache is already the right abstraction; you're just caching one layer higher.

The interesting property is that compression results for similar contexts are often reusable across sessions, which the KV cache itself never is. The Greg Egan framing is apt. The trajectory from MHA to GQA to MLA reads exactly like a series of decisions about what's worth remembering in full fidelity vs. what can be abstracted. The difference is Egan's citizens chose their own compression ratios.
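The shape of it, hypothetically (our real store is distributed; a dict stands in here): key compression outputs by a content hash of the input, so a repeated context in any session reuses the stored result instead of recomputing it.

```python
import hashlib

class CompressionCache:
    def __init__(self, compress):
        self.compress = compress       # the expensive compression function
        self.store = {}                # stand-in for a distributed KV store
        self.hits = 0

    def get(self, context: str) -> str:
        key = hashlib.sha256(context.encode()).hexdigest()
        if key in self.store:
            self.hits += 1             # cross-session reuse happens here
        else:
            self.store[key] = self.compress(context)
        return self.store[key]
```

A raw KV cache can never be shared this way because it's position- and model-state dependent; the compressed text is just text.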

[−] jasonjmcghee 45d ago

> OpenAI applies it automatically and charges 50% less for cache hits

This is incorrect. It's 90% cheaper.

https://developers.openai.com/api/docs/pricing
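The difference matters a lot at scale. With made-up per-token prices (only to show the shape of the discount), cached input tokens billing at 10% of the fresh rate:

```python
def input_cost_usd(total_tokens, cached_tokens, usd_per_mtok):
    fresh = total_tokens - cached_tokens
    # cached tokens bill at 10% of the normal input rate
    return (fresh + 0.1 * cached_tokens) * usd_per_mtok / 1e6

full = input_cost_usd(1_000_000, 0, 2.50)          # no cache hits: $2.50
warm = input_cost_usd(1_000_000, 1_000_000, 2.50)  # fully cached prefix: $0.25
```

At 50% off, a chat app with a long system prompt halves its input bill; at 90% off, the input bill nearly disappears.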

[−] algolint 45d ago
[flagged]
[−] childrapst 45d ago
[dead]