Mamba-3 (together.ai)

by matt_d 55 comments 300 points

[−] nl 56d ago
I'm looking forward to comparing this to Inception 2 (the text diffusion model) which in my experience is very fast and reasonably high quality.
[−] jychang 56d ago
I'm not sure that I buy their conclusion that more compute during inference is good.

Yes, batch=1 inference is mostly memory bandwidth bound, not GPU compute bound. But no provider does batch=1 inference. Everyone groups all the requests into a batch, and the GPU computes them together.

With a fused kernel, that means the GPU streams the weight tensors from VRAM once and does a bunch of compute on different conversations in the batch at the same time.

If they increase the amount of compute required per token, that just reduces the maximum batch size a GPU can handle. So in practice, yes, this does mean each GPU can serve fewer users: providers aren't normally leaving GPU cores idle during inference.
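
A back-of-the-envelope version of that crossover (all numbers are illustrative assumptions, roughly H100-class, and KV-cache traffic is ignored to keep the sketch simple):

```python
# Rough roofline sketch for batched decoding. Numbers are assumptions,
# not measurements.
BW = 3.3e12        # bytes/s of HBM bandwidth (assumed)
PEAK = 1.0e15      # FLOP/s of dense BF16 compute (assumed)

PARAMS = 7e9                   # a 7B-parameter dense model (assumed)
WEIGHT_BYTES = PARAMS * 2      # BF16 weights
FLOPS_PER_TOKEN = 2 * PARAMS   # ~2 FLOPs per weight per generated token

def step_time(batch: int) -> float:
    """One decode step: the weights stream from VRAM once (shared across
    the whole batch), while compute scales with batch size."""
    t_mem = WEIGHT_BYTES / BW
    t_compute = batch * FLOPS_PER_TOKEN / PEAK
    return max(t_mem, t_compute)   # whichever resource saturates first

for batch in (1, 32, 256, 1024):
    t = step_time(batch)
    bound = "memory" if WEIGHT_BYTES / BW > batch * FLOPS_PER_TOKEN / PEAK else "compute"
    print(f"batch={batch:5d}: {batch / t:12,.0f} tok/s total ({bound}-bound)")
```

With these assumed numbers the crossover sits around batch ~300; raising the per-token FLOPs moves that crossover down to a smaller batch, which is exactly the "fewer users per GPU" effect described above.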

[−] Havoc 56d ago
Is there a reason we don't switch halfway through? I.e., start with a classic LLM and switch to something linear like Mamba as the context grows?
[−] lambda 56d ago
Because something linear like Mamba doesn't perform as well, so you'd have a performance cliff where the model suddenly gets dumber and forgets a lot of what was going on.

Instead, you can get the benefits of both by doing both in parallel. This lets you shrink the O(n^2) attention mechanism: it's still quadratic, but the constant drops quite a bit while retaining a lot of the performance, since the linear context mechanism handles the tasks it's well suited for while attention plays to its strengths.

The recent Nemotron 3 Nano and Super models from NVIDIA are hybrid architectures in this way, with most of their layers being Mamba while retaining enough attention layers to stay competitive on the more complex tasks that genuinely need quadratic attention; a toy sketch of the interleaving is below.

See https://magazine.sebastianraschka.com/i/168650848/18-nemotro... for some discussion on this architecture
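
A toy sketch of that interleaving (stand-in modules only; real hybrids like Nemotron use selective-scan SSM blocks, norms, and MLPs that are elided here):

```python
import torch
import torch.nn as nn

class ToySSMLayer(nn.Module):
    """Stand-in for a Mamba-style layer: constant-size state, O(n) in
    sequence length. A real layer runs a selective scan, not a Linear."""
    def __init__(self, d_model):
        super().__init__()
        self.mix = nn.Linear(d_model, d_model)

    def forward(self, x):                        # x: (batch, seq, d_model)
        return self.mix(x)

class HybridStack(nn.Module):
    """Mostly linear-time layers, with full attention every `attn_every`
    layers, in the spirit of the hybrids described above."""
    def __init__(self, n_layers=12, attn_every=4, d_model=256, n_heads=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            if (i + 1) % attn_every == 0 else ToySSMLayer(d_model)
            for i in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            if isinstance(layer, nn.MultiheadAttention):
                out, _ = layer(x, x, x, need_weights=False)  # causal mask elided
            else:
                out = layer(x)
            x = x + out                          # residual; norms elided
        return x

# HybridStack()(torch.randn(2, 128, 256)).shape -> torch.Size([2, 128, 256])
```

Only the attention layers accumulate a KV cache that grows with context; the SSM layers carry a fixed-size state, which is where the long-context savings come from.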

[−] 3abiton 55d ago
I am curious about the tradeoffs of hybrid approaches; it sounds too good to be true.
[−] lambda 54d ago
It mostly trades away some potential performance for speed, especially at longer contexts.

Nemotron 3 Super doesn't perform quite as well on benchmarks as the similarly sized Qwen3.5 122B A10B model, but it goes faster and is cheaper to run.

https://artificialanalysis.ai/?models=gpt-oss-120b%2Cmistral...

Now, you're not exactly comparing apples to apples there, since the training process (the mix of data for pre-training, plus the fine-tuning stages of instruction tuning, RLVR, etc.) could have as much impact on how well it does as the architecture itself, or more. Still, Nemotron 3 Super does get better scores than GPT-OSS 120B and Mistral Small 4, both also similarly sized open-weights models.

[−] 0xbadcafebee 56d ago
They did do that, two years ago. The problems are that 1) Mamba's accuracy gets worse as the context grows, 2) Nvidia GPUs are designed for transformers, and 3) all the software out there is also designed for transformers. It's still useful in some applications, but it doesn't beat regular transformers if you have the gear.
[−] cubefox 56d ago
Linear-time-complexity models are bad at in-context retrieval, which limits their performance on various tasks, so a pure linear model isn't currently feasible anyway, at least for language models. Instead they recommend mixing linear and attention layers. Presumably this mostly solves the performance problem (at least on benchmarks), but it also means the mixed architecture is no longer linear. It will still be faster and less RAM-hungry at long context than a pure transformer, though.
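
To put rough numbers on the RAM claim (an illustrative config, not any specific model):

```python
# KV-cache arithmetic for a hybrid vs. a pure transformer at long context.
# Assumed config: 48 layers, 8 KV heads, head_dim 128, BF16, 128k tokens.
kv_heads, head_dim, dtype_bytes, seq = 8, 128, 2, 128_000

def kv_cache_bytes(attn_layers: int) -> int:
    # Each attention layer stores K and V per token; Mamba-style layers
    # keep a fixed-size recurrent state instead (small enough to ignore).
    return attn_layers * 2 * kv_heads * head_dim * dtype_bytes * seq

pure = kv_cache_bytes(48)      # every layer is attention
hybrid = kv_cache_bytes(6)     # e.g. only 6 of 48 layers are attention
print(f"pure transformer: {pure / 2**30:.1f} GiB per sequence")    # ~23.4 GiB
print(f"hybrid:           {hybrid / 2**30:.1f} GiB per sequence")  # ~2.9 GiB
```

The hybrid's attention layers still pay the quadratic compute cost, but only on a small slice of the stack, so both the memory and the constant in front of the n^2 term shrink.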
[−] energy123 56d ago
Probably best achieved by model routing: either an indirection behind the chat UI, or the API user does it themselves by calling a different API for long-context queries.
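
A bare-bones sketch of that routing (the model names and token threshold are placeholders, not real endpoints):

```python
# Hypothetical router: pick a backend by prompt length.
LONG_CONTEXT_THRESHOLD = 32_000  # tokens; assumed cutoff

def route(prompt_tokens: int) -> str:
    """Long-context requests go to a linear/hybrid model; everything else
    goes to the (stronger but quadratic) dense transformer."""
    if prompt_tokens > LONG_CONTEXT_THRESHOLD:
        return "hybrid-ssm-128k"
    return "dense-transformer-8k"

assert route(1_000) == "dense-transformer-8k"
assert route(100_000) == "hybrid-ssm-128k"
```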
[−] mountainriver 56d ago
We kinda do do this with hybrid Mamba transformers
[−] roger_ 56d ago
Can anyone explain why Mamba models start with a continuous time SSM (and discretize) vs discrete time?

I know the step size isn't fixed, but I'm also not sure why that's important. Is that the only reason? There also seems to be a parameterization advantage to the continuous formulation.
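
For reference, the usual S4/Mamba-style zero-order-hold discretization looks like this (a sketch with diagonal A, following the Mamba-1 convention; not necessarily what Mamba-3 does):

```python
import numpy as np

def discretize_zoh(A, B, delta):
    """Zero-order hold: with diagonal A (a vector here), the continuous
    system h'(t) = A h(t) + B x(t) becomes
        h_t = exp(delta*A) * h_{t-1} + ((exp(delta*A) - 1) / A) * B * x_t"""
    dA = np.exp(delta * A)
    dB = (dA - 1.0) / A * B
    return dA, dB

def selective_scan(A, B, C, deltas, xs):
    """The step `delta` is per token (input-dependent in Mamba), so each
    step is re-discretized. That's one answer to "why continuous":
    delta -> 0 leaves the state untouched (dA -> 1, dB -> 0), while a
    large delta mostly overwrites it, so delta doubles as a learned gate.
    A fixed-step discrete parameterization wouldn't give you that for free."""
    h = np.zeros_like(A)
    ys = []
    for delta, x in zip(deltas, xs):
        dA, dB = discretize_zoh(A, B, delta)
        h = dA * h + dB * x
        ys.append(float(C @ h))
    return np.array(ys)

# Toy usage: 4-dim state, scalar input per step.
rng = np.random.default_rng(0)
A = -np.abs(rng.standard_normal(4))  # Re(A) < 0 keeps exp(delta*A) stable
B, C = rng.standard_normal(4), rng.standard_normal(4)
y = selective_scan(A, B, C, deltas=[0.1, 0.01, 1.0], xs=[1.0, -0.5, 2.0])
```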

[−] jeffhwang 56d ago
I'm glad I clicked through, because I thought the article was about Mamba, the package manager I associate with Python (similar to conda).

https://github.com/mamba-org/mamba

[−] manlymuppet 56d ago
I'm looking forward to the fifth iteration of this model.
[−] fudged71 56d ago
This is really promising. Are they now going to scale this up to hundreds of billions of parameters? Why stop at 1.5B if they found a potentially SOTA architecture?