"The TurboQuant paper (ICLR 2026) contains serious issues in how it describes RaBitQ, including incorrect technical claims and misleading theory/experiment comparisons.
We flagged these issues to the authors before submission. They acknowledged them, but chose not to fix them. The paper was later accepted and widely promoted by Google, reaching tens of millions of views.
We’re speaking up now because once a misleading narrative spreads, it becomes much harder to correct. We’ve written a public comment on OpenReview (https://openreview.net/forum?id=tO3ASKZlok).
We would greatly appreciate your attention and help in sharing it."
I guess I'm trying to understand. I'm hearing this paper has been around for a year -- I would think that many companies would have already implemented and measured its performance in production by now... is that not the case?
Okay, I spent about half an hour reading about this and asking Gemini. I guess my best understanding is this:
The main breakthrough [rotating by an orthogonal matrix so that important outliers get averaged across more dimensions] comes from RaBitQ. It sounds like the RaBitQ team was much more involved, and earlier, and the TurboQuant paper very deliberately tries to avoid crediting and acknowledging RaBitQ.
My understanding is that the efficacy of these methods isn't in dispute. What TurboQuant did was take the method that was already being used in vector databases, adapt it for transformers, and pass it off more as a new invention than an adaptation.
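To make the rotation trick concrete, here's a minimal sketch of the general idea (my own toy illustration, not RaBitQ's or TurboQuant's actual algorithm): multiply a vector by a random orthogonal matrix so no single coordinate dominates the quantization scale, quantize the rotated coordinates at a low bit width, and apply the inverse rotation when decoding.

```python
import numpy as np

def random_orthogonal(d, seed=0):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def quantize(v, bits):
    """Uniform symmetric quantization with a single per-vector scale."""
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(v)) / levels
    return np.round(v / scale).astype(np.int8), scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

d = 128
x = np.random.default_rng(1).standard_normal(d).astype(np.float32)
x[3] = 25.0  # one outlier coordinate that dominates the quantization scale

R = random_orthogonal(d)

# Quantize directly: the outlier forces a huge scale, so the other coords lose precision.
codes, scale = quantize(x, bits=3)
err_plain = np.linalg.norm(x - dequantize(codes, scale))

# Rotate first: the outlier's energy is spread evenly over all 128 coordinates.
codes, scale = quantize(R @ x, bits=3)
err_rotated = np.linalg.norm(x - R.T @ dequantize(codes, scale))

print(f"3-bit error without rotation: {err_plain:.2f}")
print(f"3-bit error with rotation:    {err_rotated:.2f}")
```

In this sketch the only things stored per vector are the low-bit codes plus one scale; the rotation matrix is shared (or regenerated from a seed), so the memory savings come almost entirely from the reduced bit width.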
> applying this compression algorithm at scale may significantly relax the memory bottleneck issue.
I don’t think they’re going to downsize, though. I think the big players are just going to use the freed-up memory for more workflows or larger models, because the big players want to scale up. It’s a cat-and-mouse race for the best models.
Is there a size cutoff you would say where diminishing returns really kick in?
My experience doesn't disagree, at least. I've been using Qwen for coding locally a bit. It is much better than I thought it would be. But also still falls short in some obvious ways compared to the frontiers.
Well, when companies have 100-billion-dollar incentives to make discoveries like this, I don't know if we should assume this is the only optimization that will happen.
Given that increasing model size doesn't yield proportional increases in intelligence, there is a world where these datacenters don't have a positive ROI if we make these models even a fraction as effective as the human brain.
I think that either investors were extremely skittish that the stocks might crash and jumped at the first sign of trouble (creating a self-fulfilling prophecy) or they were trading on non-public information and analysts who don't have access to said information are reading too much into the temporal coincidence of the Google Research blog highlighting this paper.
Well, considering basically the entire market was down these past few days, Google included, it's unlikely this is attributable to this paper alone. It's most likely correlated with general war/trade-route restrictions/potential recession fears, or at least more correlated with those than with this paper.
This paper was released a year ago and was probably part of how Google got to 1M context before other labs.
> The obvious one outside of KV caches as mentioned above is vector databases. Any RAG pipeline that stores embedding vectors for retrieval benefits from the same compression. TurboQuant reduces indexing time to “virtually zero” on vector search tasks and outperforms product quantisation and RaBitQ on recall benchmarks using GloVe vectors.
This part sounds especially cool. I did not think about this application when reading the other articles about TurboQuant. It would be cool to have access to this performance optimization for local RAG.
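For local RAG, the general shape of binary-quantized retrieval is easy to play with. The sketch below is my own toy version (not TurboQuant's actual method): store only the sign of each mean-centered embedding dimension, and score with an asymmetric dot product between the full-precision query and the {-1, +1} document codes.

```python
import numpy as np

def binarize(embs):
    """Keep only the sign of each mean-centered dimension: 1 bit of signal per dim."""
    mean = embs.mean(axis=0)
    return (embs - mean) > 0, mean

def search(query, codes, mean, k=3):
    """Asymmetric scoring: full-precision query against {-1, +1} document codes."""
    signs = np.where(codes, 1.0, -1.0)
    scores = signs @ (query - mean)
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
docs = rng.standard_normal((10_000, 384)).astype(np.float32)  # stand-in for text embeddings
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

codes, mean = binarize(docs)  # np.packbits(codes) would store ~48 bytes/vector vs 1536 for fp32
query = docs[42] + 0.1 * rng.standard_normal(384).astype(np.float32)
print(search(query, codes, mean))  # doc 42 should rank at or near the top
```

Real systems typically re-rank the top candidates with the full-precision vectors to recover most of the recall lost to the 1-bit codes.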
We will not see memory demand decrease because this will simply allow AI companies to run more instances. They still want an infinite amount of memory at the moment, no matter how AI improves.
There's a bunch of research showing that more/better information doesn't reliably improve judgement, but better feedback on your existing predictions does. Makes me think of Soros and his whole thing about reflexivity.
Unfortunately, nobody at the big companies knows exactly which math will win, so the competition won't end.
So researchers will try one solution, then another, and so on, until they find something perfect, or until semiconductor production (Moore's Law) delivers enough chips to run current models fast enough.
I believe somebody already has the silver bullet, the ideal AI algorithm that will lead us all to AGI once some big company scales it up, but that knowledge is not obvious at the moment.
Compute, bytes of RAM used, bytes in the model, bytes accessed per iteration, bytes of data used for training.
You can trade off the balance if you can find another way to do things; extreme quantisation is but one direction to try. KANs were aiming for more compute and fewer parameters. The recent optimisation projects have been pushing at these various properties. Sometimes gains in one come at the cost of another, but that needn't always be the case.
There are techniques which already achieve great compression of the cache at 4 bits, e.g. using Hadamard transforms. Going from 4 bits to 3 bits isn’t the great leap people expect this to be. It’s actually slower to run and is generally worse in practice.
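For anyone curious, the Hadamard idea is simple to sketch (a simplified illustration under my own assumptions, not any particular paper's kernel): apply a fast Walsh-Hadamard transform to each key/value vector so outlier channels get smeared across all dimensions, quantize to 4 bits with a single scale, and invert the transform after dequantizing.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two.
    Applying it twice returns the original vector, so it is its own inverse."""
    x = x.astype(np.float64).copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(len(x))

def quant4(v):
    """Uniform 4-bit quantization: signed codes in [-7, 7] plus one float scale."""
    scale = np.abs(v).max() / 7
    return np.clip(np.round(v / scale), -7, 7).astype(np.int8), scale

head_dim = 128                     # typical attention head dimension
k = np.random.default_rng(0).standard_normal(head_dim)
k[5] = 30.0                        # an outlier channel, common in KV activations

codes, scale = quant4(fwht(k))                  # rotate, then store 4-bit codes + scale
k_hat = fwht(codes.astype(np.float64) * scale)  # dequantize, rotate back

print("relative error:", np.linalg.norm(k - k_hat) / np.linalg.norm(k))
```

Because the transform is orthogonal, norms and dot products are preserved, which is why the dequantized vector can simply be rotated straight back.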
I mean, since GPT-4, I believe RAM is no longer what creates the miracle of LLM performance scaling directly with model size. At least ChatGPT itself convinced me that any decent-sized company can create a GPT-4 equivalent in terms of model size, but is limited by serving options, like memory caching and hallucination handling. Companies buy RAM simply to ride the stock hype.
I am no expert, so this is a shallow take, but I think LLMs as a whole have already reached their limit, and general AGI will only be possible if the model is living in the moment, i.e., retraining every minute or so, and is paired with a much smaller device that can observe its surroundings, like a robot.
Instead of a KV cache, I have an idea of using LoRAs: keep a central LLM unchanged by learning, surrounded by dozens or thousands of LoRAs made orthogonal to each other, each competing via weights that get retrained every minute or so. The LLM, since it's an RNN anyway, provides a "summarize what your state and goal are at this moment" and trains the LoRAs with that summary along with all the observations and, say, inputs from the users. The output of the LoRAs feeds back into the LLM for it to decide the weights for further LoRA training.
Anyway, I am just thinking there needs to be a structural change of some kind.
I've thought for a while that the real gains now will not come from throwing more hardware at the problem, but from advances in mathematical techniques to make things far more efficient.
The TurboQuant paper is from April 2025. I’m sure the major labs knew about it on, or even before, the day it was published. Any impact it had would have been felt a year ago. Yet I keep seeing these posts and discussions completely ignoring this.
Can we please start talking about this in that context? We already know what TurboQuant will do to DRAM demand. We already know what it will do to context windows. There is no need to speculate. There is no need to panic sell stocks.
Does the KV cache really grow to use more memory than the model weights? The reduction in overall RAM relies on the KV cache being a substantial proportion of the memory usage but with very large models I can't see how that holds true.
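It can, at long contexts or large batch sizes, at least for classic multi-head attention. A rough back-of-the-envelope calculation (illustrative numbers for a ~70B-parameter model with full MHA; real deployments use GQA/MQA, which cuts the cache by roughly the ratio of query heads to KV heads):

```python
# Rough KV-cache sizing in fp16, assuming full multi-head attention.
# All numbers are illustrative; GQA/MQA and cache quantization shrink this a lot.
layers, heads, head_dim = 80, 64, 128   # a ~70B-class configuration
bytes_per_elem = 2                       # fp16
params = 70e9

kv_per_token = 2 * layers * heads * head_dim * bytes_per_elem   # keys and values
weights = params * bytes_per_elem

print(f"weights: {weights / 2**30:.0f} GiB, KV cache: {kv_per_token / 2**20:.1f} MiB per token")
for context in (8_192, 128_000, 1_000_000):
    print(f"{context:>9,} tokens -> {kv_per_token * context / 2**30:7.1f} GiB of KV cache")
```

Under these assumptions the cache passes the ~130 GiB of weights somewhere in the tens of thousands of tokens for a single sequence, and batching multiplies the cache but not the weights. In practice GQA, shorter effective contexts, and cache quantization are what keep this manageable.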
> If I were Google, I wouldn’t release research that exposes a competitive advantage.
Isn't that a classic tit-for-tat decision that heads for a loss?
Excellence and prestige are valuable too. You get that expensive ML talent at a small discount, better public/professional perception, etc. Considering the public communication from Google, which isn't completely sociopathic, they know this war isn't won in one night, and they are the only sustainably funded company in the competition. Sure, their business is at risk, but they can either go rampant or focus. They decided to focus.
https://x.com/gaoj0017/status/2037532673812443214
> Is there a size cutoff you would say where diminishing returns really kick in?
No idea yet. But also it's obvious that making LLMs without MoE is stupid.
The demand for memory isn't going to go down, we'll just be able to do more with the same amount of memory.