MSA: Memory Sparse Attention

kingstnap 53d ago

I do wonder about the usefulness of this massive context-dumping exercise. 100M tokens is a ridiculous amount. Usually, to get good results on practical tasks, you need to actually think about what you are dumping into context.

I also have my gripes about the way 2-hop is presented here. Figure 3 is the canonical example of what I would consider too trivial or misleading: the exact text "Eric Watts" appears both in the question and in the context. That leads to the natural question of how it compares to an LLM with a grep tool.
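To make the grep comparison concrete, here is a minimal sketch of a two-hop lookup done with plain string search. Only the name "Eric Watts" comes from the comment above; the corpus, the facts in it, and the `grep` helper are all invented for illustration and are not taken from the paper or its Figure 3.

```python
import re

def grep(docs, pattern):
    """Return (doc_id, line) pairs whose line contains the literal pattern (case-insensitive)."""
    rx = re.compile(re.escape(pattern), re.IGNORECASE)
    return [(doc_id, line)
            for doc_id, text in docs.items()
            for line in text.splitlines()
            if rx.search(line)]

# Hypothetical toy corpus: every fact below is made up for this example.
docs = {
    "doc_a": "Eric Watts directed the film Harbor Lights.\nThe film premiered in 1998.",
    "doc_b": "Harbor Lights was shot in Lisbon.\nLisbon is the capital of Portugal.",
    "doc_c": "An unrelated note about x86 addressing modes.",
}

# Hop 1: the entity named in the question appears verbatim in the context.
hop1 = grep(docs, "Eric Watts")
# Hop 2: search again for the bridging entity surfaced by hop 1.
hop2 = grep(docs, "Harbor Lights")

print(hop1)
print(hop2)
```

If both hops resolve through exact string matches like this, the benchmark is arguably measuring lookup rather than reasoning over a long context.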

What I would consider more interesting is practical synthesis over such a large context, where you can't just look up answers by string matching. For example, maybe dumping all of Intel's x86 manuals into context and then asking an LLM to try to write assembly or something.

sosodev 52d ago

I spent some time trying to understand this paper, and I think calling this a new attention mechanism is a bit misleading. As a dead comment pointed out, this is much closer to RAG. It's not exposing all 100M tokens directly to the model for each prediction. However, the RAG mechanisms have been integrated directly into the model architecture, which means it can have higher accuracy and lower latency. The higher accuracy comes from storing not text but the actual in-memory representations of each document (K/V, compressed tensor representations, routing keys, etc.), so it can search and use them more effectively. Given that it's handling up to 100x the context space, it, like RAG, cannot process that volume in real time. They explicitly state that the model needs to do offline encoding before handling inference. So you shouldn't expect to just send 100M tokens over an API and start getting a response.
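Here is a rough sketch of the shape described above: an offline pass stores a routing key and K/V for each document, and the online pass routes a query to the top-scoring documents and attends only over their K/V. This is not the paper's implementation. The function names, the `top_k` parameter, and the choice of a mean-pooled routing key are all assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def encode_offline(docs):
    """Offline step: keep per-token K/V plus one routing key per document.
    Assumption: the routing key is simply the mean of the token keys."""
    memory = []
    for keys, values in docs:
        dim = len(keys[0])
        routing_key = [sum(k[i] for k in keys) / len(keys) for i in range(dim)]
        memory.append({"routing_key": routing_key, "keys": keys, "values": values})
    return memory

def sparse_attend(query, memory, top_k=1):
    """Online step: pick the top_k documents by routing score,
    then run scaled dot-product attention over only their K/V."""
    ranked = sorted(range(len(memory)),
                    key=lambda i: dot(query, memory[i]["routing_key"]),
                    reverse=True)
    selected = ranked[:top_k]
    keys = [k for i in selected for k in memory[i]["keys"]]
    values = [v for i in selected for v in memory[i]["values"]]
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return selected, out

# Toy 2-dimensional data: each document is (token keys, token values).
docs = [
    ([[1.0, 0.0], [0.9, 0.1]], [[1.0, 0.0], [1.0, 0.0]]),  # document 0
    ([[0.0, 1.0], [0.1, 0.9]], [[0.0, 1.0], [0.0, 1.0]]),  # document 1
]
memory = encode_offline(docs)                 # done once, ahead of inference
selected, out = sparse_attend([0.0, 2.0], memory, top_k=1)
print(selected, out)
```

The point of the sketch is that the query never touches the K/V of unselected documents, which is why it reads more like retrieval than like attention over the full 100M tokens.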

I also think some of the benchmarks are misleading. Getting a RAG system to do an attention benchmark and then comparing it against a model without RAG just isn't fair. It is obviously better, but it's not apples to apples. Some of the benchmarks compare against model+RAG, and there the delta in performance is much smaller.

ting0 53d ago

So basically there's no excuse for ChatGPT and Claude not to release 10M -> 100M models within 6 months or so. <9% degradation is crazy. Hopefully DeepSeek and Qwen4 can implement this.

cyanydeez 53d ago

Neat. Can't wait for our language- and framework-specific tools for models. I don't need my models writing Shakespeare, unless I'm working on Shakespeare.
