I'm not sure that I buy their conclusion that more compute during inference is good.
Yes, batch=1 inference is mostly memory bandwidth bound, not GPU compute bound. But no provider does batch=1 inference. Everyone groups all the requests into a batch, and the GPU computes them together.
With a fused kernel, that means the GPU streams the tensors from VRAM, and does a bunch of compute on different conversations in the batch, at the same time.
If they increase the amount of compute required per token, that just reduces the maximum batch size a GPU can handle. In practice, yes this does mean each GPU can serve less users. Providers aren't leaving GPU cores idle normally during inference.
Because something linear like Mamba doesn't perform as well; so you'd have a performance cliff, where suddenly the model would get more dumb and forget a lot of what was going on.
Instead, you can get benefits from both by doing both in parallel. This can let you reduce the size of the O(n^2) attention mechanism, so while it's still quadratic, it reduces the constant quite a bit while still retaining a lot of performance, as the linear context mechanism can work for the tasks its well suited for while allowing attention to play to its strengths.
The recent Nemotron 3 Nano and Super models from NVIDIA are hybrid architectures this way, with most of their context layers as Mamba while retaining enough attention to continue to be competitive on the more complex tasks that require the quadratic attention.
Now, you're not exactly comparing apples to apples there, since the training process (mix of data for pre-training, and the fine tuning stages of instruction turning, RLVR, etc) could have as much or more impact on how well it does as the architecture itself. Nemotron 3 Super does get better scores on performance than GPT-OSS 120B and Mistral Small 4, both also similarly sized open weights models.
They did do that, 2 years ago. The problems are that 1) mamba makes accuracy worse as context size grows, 2) Nvidia GPUs are designed for transformers, and 3) all the software out there is also designed for transformers. It's still useful in some applications but it doesn't beat regular transformers if you have the gear
Linear time complexity models are bad at in-context retrieval, which limits their performance on various tasks, so a pure linear model isn't currently feasible anyway, at least for language models. Instead they recommend mixing linear and attention layers. Presumably this mostly solves the performance problem (at least n benchmarks), but it also means the mixed architecture is no longer linear. It will still be faster and less RAM hungry in long context than a pure transformer though.
Probably best achieved by model routing, either an indirection behind the chat UI or an API user does it themselves by calling a different API for long context queries.
Can anyone explain why Mamba models start with a continuous time SSM (and discretize) vs discrete time?
I know the step isn’t fixed, also not sure why that’s important. Is that the only reason? There also seems to be a parameterization advantage too with the continuous formulation.
This is really promising. Are they now going to scale this up to hundreds of billions of parameters? Why stop at 1.5B if they found a potentially SOTA architecture?
Probably constrained by training resources. It's much easier to experiment with a smaller architecture. You may need many training runs to figure out hyperparameters for example. If each run needs multiple GPUs for a week the cost adds up quickly. I think it makes a lot of sense to start small.
> Mamba-3 is a new state space model (SSM) designed with inference efficiency as the primary goal — a departure from Mamba-2, which optimized for training speed. The key upgrades are a more expressive recurrence formula, complex-valued state tracking, and a MIMO (multi-input, multi-output) variant that boosts accuracy without slowing down decoding.
Why can’t they simply say -
Mamba-3 focuses on being faster and more efficient when making predictions, rather than just being fast to train like Mamba-2.
This is sort of what their first sentence states? Except your line implies that they are fast in training and inference, they imply they are focusing on inference and are dropping training speed for it.
I don’t know why you’re being downvoted. As a longtime editor your version is immensely better.
Looks like the original was probably not human-written.
55 comments
Yes, batch=1 inference is mostly memory bandwidth bound, not GPU compute bound. But no provider does batch=1 inference. Everyone groups all the requests into a batch, and the GPU computes them together.
With a fused kernel, that means the GPU streams the tensors from VRAM, and does a bunch of compute on different conversations in the batch, at the same time.
If they increase the amount of compute required per token, that just reduces the maximum batch size a GPU can handle. In practice, yes this does mean each GPU can serve less users. Providers aren't leaving GPU cores idle normally during inference.
Instead, you can get benefits from both by doing both in parallel. This can let you reduce the size of the O(n^2) attention mechanism, so while it's still quadratic, it reduces the constant quite a bit while still retaining a lot of performance, as the linear context mechanism can work for the tasks its well suited for while allowing attention to play to its strengths.
The recent Nemotron 3 Nano and Super models from NVIDIA are hybrid architectures this way, with most of their context layers as Mamba while retaining enough attention to continue to be competitive on the more complex tasks that require the quadratic attention.
See https://magazine.sebastianraschka.com/i/168650848/18-nemotro... for some discussion on this architecture
Nemotron 3 Super doesn't perform quite as well on benchmarks as the similarly sized Qwen3.5 122B A10B model, but it goes faster and is cheaper to run.
https://artificialanalysis.ai/?models=gpt-oss-120b%2Cmistral...
Now, you're not exactly comparing apples to apples there, since the training process (mix of data for pre-training, and the fine tuning stages of instruction turning, RLVR, etc) could have as much or more impact on how well it does as the architecture itself. Nemotron 3 Super does get better scores on performance than GPT-OSS 120B and Mistral Small 4, both also similarly sized open weights models.
I know the step isn’t fixed, also not sure why that’s important. Is that the only reason? There also seems to be a parameterization advantage too with the continuous formulation.
https://arxiv.org/abs/2603.15569
https://github.com/mamba-org/mamba
> Mamba-3 is a new state space model (SSM) designed with inference efficiency as the primary goal — a departure from Mamba-2, which optimized for training speed. The key upgrades are a more expressive recurrence formula, complex-valued state tracking, and a MIMO (multi-input, multi-output) variant that boosts accuracy without slowing down decoding.
Why can’t they simply say -
Mamba-3 focuses on being faster and more efficient when making predictions, rather than just being fast to train like Mamba-2.
It's a nice opening as it is imo
> a departure from Mamba-2, which optimized for training speed.
?