TurboQuant: Redefining AI efficiency with extreme compression (research.google)

by ray__ 166 comments 576 points
[−] amitport 52d ago
This is a great development for KV cache compression. I did notice a missing citation in the related works regarding the core mathematical mechanism, though. The foundational technique of applying a geometric rotation prior to extreme quantization, specifically for managing the high-dimensional geometry and enabling proper bias correction, was introduced in our NeurIPS 2021 paper, "DRIVE" (https://proceedings.neurips.cc/paper/2021/hash/0397758f8990c...). We used this exact rotational approach and a similar bias correction mechanism to achieve optimal distributed mean estimation. I also presented this work and subsequent papers in a private invited talk at Google shortly after publication. Given the strong theoretical overlap with the mechanisms in TurboQuant and PolarQuant, I hope to see this prior art acknowledged in the upcoming camera-ready versions.
[−] gavinray 52d ago
Can someone ELI5 these two concepts please, which make no sense to me:

  > "TurboQuant starts by randomly rotating the data vectors. This clever step simplifies the data's geometry"
I don't understand how taking a set of data vectors and applying a random rotation could mathematically lead, every time, to "simpler" geometry.

If I throw a bunch of shapes on the ground, tightly packed and touching each other, then rotate all of them, you can't guarantee that the new conglomerate shape is any more/less "simple" than before, right?

  > "Johnson-Lindenstrauss Transform to shrink complex, high-dimensional data while preserving the essential distances and relationships between data points. It reduces each resulting vector number to a single sign bit (+1 or -1)."
How can a boolean value preserve all of the relational and positional information between data points?
[−] akhenakh 52d ago
Someone is already implementing it in llama.cpp: https://github.com/mudler/llama.cpp/commit/dee102db1bfd723c9...
[−] parsimo2010 51d ago
This blog post sucks. It does not make me want to read the papers.

Look at this figure: https://storage.googleapis.com/gweb-research2023-media/image...

The speedup labels on the vertical axis are 0, 2, 2, 4, 6, 8... Why is 2 repeated? Did they just have nano-banana make them some charts? Can they not be bothered to use matplotlib or bokeh and directly render a graph? Maybe there is some legitimate reason I don't know about for a value appearing twice on a graph axis, but if so, they need to explain it in the figure caption. So it's either a "GenAI special" or poor communication about how to read the graph.

Look at this video visualization: https://storage.googleapis.com/gweb-research2023-media/media...

Do you have literally any clue what Polar Quantization is after watching it? Does it make you think, "I kind of have a high-level understanding of that; let me go get the details from the paper"?

Look at this figure: https://storage.googleapis.com/gweb-research2023-media/image...

The left-hand side of the graph, which readers normally assume starts at 0, starts at 48. Those MASSIVE differences you see in the figure? Only a few percent. That's deceptive, and that's assuming the figure is even accurate, because we saw earlier that they can't even get figure axes right.

[−] pstoll 52d ago
And a group has published an independent working implementation today, nice to see:

https://github.com/tonbistudio/turboquant-pytorch

[−] benob 52d ago
This is the worst lay-people explanation of an AI component I have seen in a long time. It doesn't even seem AI generated.
[−] mesuvash 51d ago
TurboQuant explained with an easy to understand (no-math) animation https://mesuvash.github.io/blog/2026/turboquant-interactive/
[−] wbsun 51d ago
The blog is new but the paper was submitted almost one year ago: https://arxiv.org/abs/2504.19874. Does anyone know whether this is already implemented in production models (at least Gemini, I'd guess)? If so, can I expect cheaper RAM for my computer :D
[−] mskkm 47d ago
Seems to be a scam:

"The TurboQuant paper (ICLR 2026) contains serious issues in how it describes RaBitQ, including incorrect technical claims and misleading theory/experiment comparisons. We flagged these issues to the authors before submission. They acknowledged them, but chose not to fix them. The paper was later accepted and widely promoted by Google, reaching tens of millions of views.

We’re speaking up now because once a misleading narrative spreads, it becomes much harder to correct. We’ve written a public comment on openreview (https://openreview.net/forum?id=tO3ASKZlok).

We would greatly appreciate your attention and help in sharing it."

https://x.com/gaoj0017/status/2037532673812443214

[−] bdcs 51d ago
Here's my attempt at an undergrad-level summary (corrections welcome!):

The core idea is to quantize the KV cache in a way that destroys minimal information, in this case the similarity scores between vectors. The simplest approach is to drop every element from 16 bits of precision to, say, 4 bits (scalar quantization). These papers improve on that by exploiting concentration of measure: almost all the energy sits near the equator of the hypersphere (each coordinate is roughly normally distributed with variance 1/d, where d is the vector dimensionality). The curse/blessing of high dimensionality strikes again. So if we quantize the elements naively (think "latitudes", e.g. to the nearest degree), we destroy a lot of information, because basically all the vectors crowd around the equator: some latitudes hold a lot of vectors and some hold very few.

The fix is to rotate the vectors away from the equator so they're more uniformly distributed, which better preserves the entropy during quantization (which I guess was amitport's DRIVE idea). PolarQuant does a hyperpolar coordinate transform, which superficially seems neat for preserving entropy given this equator/pole framing (and is ultimately unnecessary, as TurboQuant shows). They also realized the resulting vectors give biased similarity estimates, so they wrote the QJL paper to fix the bias. TurboQuant then took PolarQuant + QJL, removed the hyperpolar coords, and added some gross / highly pragmatic extra bits for important channels (i.e. elements of the vectors), which is sort of a pathology of LLMs these days, but it is what it is. Et voila: a highly compressed KV cache.

If you're curious why you can randomly rotate the input: all the vectors are rotated by the same matrix, so similarity works out. You could always un-rotate to get the originals back, but there's no need, because similarity on rotated vectors matches similarity on unrotated ones as long as you compare apples to apples (with the QJL debiasing). Why was PolarQuant even published? Insu Han is solo on that paper and demanded/deserved credit/promotion, would be my guess.
The blog post is chock-full of errors and confusions.
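A tiny numpy sketch of the rotate-then-sign-quantize idea, in case it helps. This is my own toy illustration using a SimHash-style debiased estimator, not the papers' exact scheme: rotate two unit vectors by the same random orthogonal matrix, keep only one sign bit per coordinate, and recover their inner product from the fraction of agreeing bits.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024

# Build two unit vectors with inner product exactly 0.8.
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
g = rng.standard_normal(d)
g -= (g @ x) * x            # make the noise orthogonal to x
g /= np.linalg.norm(g)
y = 0.8 * x + 0.6 * g       # unit norm, <x, y> = 0.8

# Random rotation: an orthogonal matrix from the QR of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Extreme quantization: keep only one sign bit per rotated coordinate.
bx = np.sign(Q @ x)
by = np.sign(Q @ y)

# Debiased similarity estimate: the fraction of agreeing sign bits
# encodes the angle between x and y (P(agree) = 1 - theta/pi).
agreement = np.mean(bx == by)
est = np.cos(np.pi * (1.0 - agreement))

print(f"true inner product {x @ y:.3f}, 1-bit estimate {est:.3f}")
```

Both vectors shrink from 16 bits per coordinate to 1, yet the estimated inner product stays within a few percent of the truth, and the accuracy improves as d grows.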

[−] zeeshana07x 52d ago
The gap between how this is described in the paper vs. the blog post is pretty wide. Would be nice to see more accessible writing from research teams; not everyone reading is an ML engineer.
[−] bluequbit 52d ago
I did not understand what polarQuant is.

Is it something like pattern-based compression, where the algorithm finds repeating patterns and creates an index of those common symbols or numbers?

[−] htrp 51d ago
The actual paper from April 2025

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

https://arxiv.org/abs/2504.19874

[−] bilsbie 52d ago
It seems like most breakthroughs I see are for efficiency. What are the most important breakthroughs from the past two or three years for intelligence?
[−] antiresonant 51d ago
At this rate, the current AI era is going to clear the queue of all mathematics that's ever been created but not yet applied.
[−] naasking 52d ago
This sounds great! TurboQuant does KV cache compression using quantization via rotations, and ParoQuant [1] does weight compression using quantization via rotations. So we can get 4-bit weights that match bf16 precision, and the KV cache goes down to 3 bits per key. This brings larger models and long contexts into the range of "possibly runnable" on beefy consumer hardware.

[1] https://github.com/z-lab/paroquant

[−] mrbonner 49d ago
I feel like I'm not the only one who feels excited about all these "compression" tricks that maintain fidelity in our AI era. In a way, it has a vibe similar to the early 2000s, when digital music became popular and the need for lossless compression was paramount. Sort of a Pied Piper moment for us now. Someone please make a Weissman score for this stuff.
[−] ssijak 52d ago
For my grug brain can somebody translate this to ELIgrug terms?

Does this mean I would be able to run a 500B model on my 48GB MacBook without losing quality?

[−] maurelius2 52d ago
I'm somewhat at a loss here beyond the fundamentals. Can someone tell me how the compression impacts performance?
[−] iddan 52d ago
I am guessing that since Google is vertically integrated and "actually pays" for AI infra (compared to OpenAI & Anthropic, which receive hardware through partnerships), they have a more urgent incentive to reduce model sizes. Also, Google and Apple will be the first to gain from running models on-device.
[−] macleginn 52d ago
"TurboQuant proved it can quantize the key-value cache to just 3 bits without requiring training or fine-tuning and causing any compromise in model accuracy" -- what does each group of 3 bits correspond to? Hardly individual keys or values, since that would limit each of them to 8 different vectors.
[−] mmastrac 52d ago
Is this a tradeoff between GPU-computation-expense vs accuracy? ie: you could quantize into segments or grids on the unit circle/sphere/etc, but that's too expensive so it's better to just quantize to a Cartesian grid because the GPU can decompress cheaper?
[−] lwhi 52d ago
Will this help us run models locally?
[−] antoniuschan99 51d ago
It could turn a 1M-context system into a 4M-context system. TurboQuant-style KV-cache compression makes longer context windows cheaper to serve. Not exactly sure how much of an increase in context size, though.
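Back-of-the-envelope arithmetic, with made-up model dimensions (the layer/head/dim numbers below are my assumptions, not from the paper): going from bf16 to 3 bits per KV-cache element fits roughly 5x more tokens in the same memory, before accounting for quantization metadata.

```python
# Rough KV-cache sizing with assumed (hypothetical) model dimensions.
layers, kv_heads, head_dim = 48, 8, 128
elems_per_token = 2 * layers * kv_heads * head_dim  # keys + values

bf16_bytes = elems_per_token * 2       # 16 bits per element
q3_bytes = elems_per_token * 3 / 8     # 3 bits per element, ignoring scales/metadata

ratio = bf16_bytes / q3_bytes
print(f"{ratio:.2f}x more tokens in the same memory")  # 5.33x
```

So a 16/3 ≈ 5.3x raw ratio, which lands around 4x once per-channel scales and other overhead are paid for, consistent with a 1M to 4M context bump.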
[−] moktonar 52d ago
Aren’t polar coordinates still n-1 angles plus 1 for the radius for an n-dim vector? If so, I understand that the angles can be quantized well, but when the radius r is big, the error is large for highly quantized angles, right? What am I missing?
[−] lucrbvi 52d ago
Sounds like Multi-Head Latent Attention (MLA) from DeepSeek
[−] _s_a_m_ 52d ago
has the word "advanced", gotta be good
[−] alkenrinnstet 50d ago
This article is AI-generated slop.

> This clever step simplifies the data's geometry

No self-respecting researcher talks about their work in this way. But it is characteristic of these chatbots' tendency to over-use superlatives and sycophantic language.
