Mamba-3 (together.ai)

300 points by matt_d | 55 comments

[−] nl 56d ago
I'm looking forward to comparing this to Inception 2 (the text diffusion model), which in my experience is very fast and reasonably high quality.
[−] jychang 56d ago
I'm not sure that I buy their conclusion that more compute during inference is good.

Yes, batch=1 inference is mostly memory bandwidth bound, not GPU compute bound. But no provider does batch=1 inference. Everyone groups all the requests into a batch, and the GPU computes them together.

With a fused kernel, that means the GPU streams the tensors from VRAM and does a bunch of compute on the different conversations in the batch at the same time.

If they increase the amount of compute required per token, that just reduces the maximum batch size a GPU can handle. In practice, yes, this does mean each GPU can serve fewer users. Providers normally aren't leaving GPU cores idle during inference.
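
To put rough numbers on it, here's a back-of-the-envelope roofline sketch (all hardware and model figures are made-up assumptions, not measurements):

    # When does decode flip from memory-bound to compute-bound?
    # All numbers below are illustrative, not real benchmarks.
    PARAMS = 70e9          # model parameters
    BYTES_PER_PARAM = 2    # bf16 weights
    PEAK_FLOPS = 1e15      # ~1 PFLOP/s tensor throughput (hypothetical GPU)
    BANDWIDTH = 3e12       # ~3 TB/s HBM bandwidth (hypothetical GPU)

    def decode_step_times(batch_size):
        # Streaming the weights once costs the same regardless of batch size;
        # compute grows linearly with the batch (~2 FLOPs per param per token).
        mem_time = PARAMS * BYTES_PER_PARAM / BANDWIDTH
        compute_time = 2 * PARAMS * batch_size / PEAK_FLOPS
        return mem_time, compute_time

    for b in (1, 8, 64, 256, 512):
        mem, comp = decode_step_times(b)
        bound = "memory" if mem > comp else "compute"
        print(f"batch={b:4d}  mem={mem*1e3:5.1f} ms  compute={comp*1e3:5.1f} ms  -> {bound}-bound")

Past the crossover batch size, any extra per-token compute directly shrinks how many users one GPU can serve, which is the point above.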

[−] Havoc 56d ago
Is there a reason we don't switch halfway through? i.e. start with a classic LLM and switch to something linear like Mamba as the context grows
[−] lambda 56d ago
Because something linear like Mamba doesn't perform as well, so you'd have a performance cliff where the model suddenly gets dumber and forgets a lot of what was going on.

Instead, you can get the benefits of both by using both in parallel. This lets you shrink the O(n^2) attention mechanism, so while it's still quadratic, the constant drops quite a bit while retaining most of the performance: the linear context mechanism handles the tasks it's well suited for, while attention plays to its strengths.

The recent Nemotron 3 Nano and Super models from NVIDIA are hybrid architectures this way, with most of their context layers as Mamba while retaining enough attention to continue to be competitive on the more complex tasks that require the quadratic attention.

See https://magazine.sebastianraschka.com/i/168650848/18-nemotro... for some discussion on this architecture
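
The layer pattern is roughly like this (a toy PyTorch sketch of my own, not NVIDIA's actual code; the names, ratios, and the simplified recurrence are all made up for illustration):

    import torch
    import torch.nn as nn

    class LinearMixer(nn.Module):
        # Stand-in for a Mamba-style linear-time mixer: a toy gated recurrence.
        def __init__(self, d):
            super().__init__()
            self.in_proj = nn.Linear(d, d)
            self.gate = nn.Linear(d, d)
            self.decay = nn.Parameter(torch.full((d,), 0.9))

        def forward(self, x):                      # x: (batch, seq, d)
            u = self.in_proj(x)
            h = torch.zeros_like(x[:, 0])
            outs = []
            for t in range(x.size(1)):             # O(n) scan; real kernels parallelize this
                h = self.decay * h + u[:, t]
                outs.append(h)
            return torch.stack(outs, dim=1) * torch.sigmoid(self.gate(x))

    class AttnMixer(nn.Module):
        # Ordinary O(n^2) self-attention layer (causal masking omitted in this toy).
        def __init__(self, d, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

        def forward(self, x):
            y, _ = self.attn(x, x, x, need_weights=False)
            return y

    def build_hybrid(d, n_layers, attn_every=4):
        # Mostly linear-time layers, with full attention interleaved every few
        # layers; hybrids like Nemotron use this kind of pattern (exact ratios differ).
        return nn.Sequential(*[
            AttnMixer(d) if (i + 1) % attn_every == 0 else LinearMixer(d)
            for i in range(n_layers)
        ])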

[−] 3abiton 55d ago
I am curious about the tradeoffs of hybrid approaches; it sounds too good to be true.
[−] lambda 54d ago
It mostly trades away some potential performance for speed, especially at longer contexts.

Nemotron 3 Super doesn't perform quite as well on benchmarks as the similarly sized Qwen3.5 122B A10B model, but it goes faster and is cheaper to run.

https://artificialanalysis.ai/?models=gpt-oss-120b%2Cmistral...

Now, you're not exactly comparing apples to apples there, since the training process (the mix of data for pre-training, and the fine-tuning stages of instruction tuning, RLVR, etc.) could have as much of an impact on how well it does as the architecture itself, or more. Nemotron 3 Super does get better benchmark scores than GPT-OSS 120B and Mistral Small 4, both also similarly sized open-weights models.

[−] 0xbadcafebee 56d ago
They did do that, 2 years ago. The problems are that 1) Mamba makes accuracy worse as context size grows, 2) Nvidia GPUs are designed for transformers, and 3) all the software out there is also designed for transformers. It's still useful in some applications, but it doesn't beat regular transformers if you have the gear.
[−] cubefox 56d ago
Linear time complexity models are bad at in-context retrieval, which limits their performance on various tasks, so a pure linear model isn't currently feasible anyway, at least for language models. Instead they recommend mixing linear and attention layers. Presumably this mostly solves the performance problem (at least on benchmarks), but it also means the mixed architecture is no longer linear. It will still be faster and less RAM hungry at long context than a pure transformer, though.
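
A rough way to see the cost (my own sketch, not from the article): with L total layers of which only k use attention, prefill compute scales like

    (L − k)·O(n) + k·O(n²)

so it's still quadratic asymptotically, but the quadratic constant is cut by roughly k/L, and the KV-cache memory shrinks by about the same factor.
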
[−] energy123 56d ago
Probably best achieved by model routing: either an indirection behind the chat UI, or an API user doing it themselves by calling a different API for long-context queries.
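
Something like this on the caller's side (hypothetical model names and a crude token estimate, just to illustrate):

    # Length-based router sketch; the names and the chars-per-token
    # heuristic are made up.
    def route(prompt: str, threshold_tokens: int = 32_000) -> str:
        est_tokens = len(prompt) // 4          # rough chars-per-token estimate
        if est_tokens > threshold_tokens:
            return "hybrid-ssm-long-context"   # linear-ish model for long inputs
        return "dense-transformer"             # full attention for short inputs
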
[−] mountainriver 56d ago
We kinda do do this with hybrid Mamba transformers
[−] roger_ 56d ago
Can anyone explain why Mamba models start with a continuous time SSM (and discretize) vs discrete time?

I know the step size isn't fixed, but I'm not sure why that's important. Is that the only reason? There also seems to be a parameterization advantage with the continuous formulation.
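
For reference, the setup I mean is the standard Mamba-style formulation (from the earlier Mamba papers, not specific to Mamba-3): the continuous-time SSM

    h'(t) = A h(t) + B x(t),   y(t) = C h(t)

discretized with a learned, input-dependent step Δ via zero-order hold:

    A_bar = exp(Δ A)
    B_bar = (Δ A)^(-1) (exp(Δ A) − I) Δ B
    h_t   = A_bar h_{t-1} + B_bar x_t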

[−] jeffhwang 56d ago
I'm glad I clicked through because I thought the article was about Mamba, the package manager I associate with Python (similar to conda).

https://github.com/mamba-org/mamba

[−] manlymuppet 56d ago
I'm looking forward to the fifth iteration of this model.
[−] breadsniffer 55d ago
Mamboooo no. #5
[−] fudged71 56d ago
This is really promising. Are they now going to scale this up to hundreds of billions of parameters? Why stop at 1.5B if they found a potentially SOTA architecture?
[−] snek_case 56d ago
Probably constrained by training resources. It's much easier to experiment with a smaller architecture. You may need many training runs to figure out hyperparameters, for example. If each run needs multiple GPUs for a week, the cost adds up quickly. I think it makes a lot of sense to start small.
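
For scale (made-up numbers): one run on 64 GPUs for a week at $2/GPU-hour is 64 × 168 × 2 ≈ $21,500, so a sweep of a few dozen such runs is quickly into the high six figures.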
[−] diablevv 56d ago
[flagged]
[−] daliliu 56d ago
[dead]
[−] robofanatic 56d ago

> Mamba-3 is a new state space model (SSM) designed with inference efficiency as the primary goal — a departure from Mamba-2, which optimized for training speed. The key upgrades are a more expressive recurrence formula, complex-valued state tracking, and a MIMO (multi-input, multi-output) variant that boosts accuracy without slowing down decoding.

Why can’t they simply say -

Mamba-3 focuses on being faster and more efficient when making predictions, rather than just being fast to train like Mamba-2.

[−] esquire_900 56d ago
This is sort of what their first sentence states? Except your line implies that they are fast at both training and inference, while they imply they are focusing on inference and trading away training speed for it.

It's a nice opening as it is imo

[−] cubefox 56d ago
They don't say anything about dropping training speed.
[−] estearum 56d ago

> a departure from Mamba-2, which optimized for training speed.

?

[−] i000 56d ago
Agreed. What you wrote was probably the input; what we see is the LLM output with the directive to "make us sound smart, put gratuitous em-dash"
[−] E-Reverance 56d ago
The first sentence basically does though, no?
[−] mufasachan 56d ago
The blog is technical; technical terms in the TL;DR seem relevant to me.
[−] renewiltord 56d ago
Found the guy who made the Windows error messages say “Your computer did an oopsie :(” instead of including any useful information.
[−] arendtio 56d ago
I don't get the downvotes, as I had trouble understanding the intro as well. It seems it was written for a very specific audience.
[−] camillomiller 56d ago
I don’t know why you’re being downvoted. As a longtime editor your version is immensely better. Looks like the original was probably not human-written.