I'm not sure that I buy their conclusion that more compute during inference is good.
Yes, batch=1 inference is mostly memory-bandwidth-bound, not compute-bound. But no provider runs batch=1 inference. Everyone groups incoming requests into a batch, and the GPU computes them together.
With a fused kernel, that means the GPU streams the weight tensors from VRAM once and does a bunch of compute on the different conversations in the batch at the same time.
If they increase the amount of compute required per token, that just reduces the maximum batch size a GPU can handle. In practice, yes, this does mean each GPU can serve fewer users: providers aren't normally leaving GPU cores idle during inference.
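To make that concrete, here's a rough back-of-the-envelope sketch of where the bandwidth/compute crossover sits for batched decoding. The hardware numbers and the 70B dense model are assumptions on my part, and it ignores KV-cache traffic and attention FLOPs, so treat it as illustrative only:

    # Toy model of batched decode: weights are streamed from VRAM once
    # per step and reused across the whole batch. All numbers assumed.
    PARAMS      = 70e9     # assumed 70B-parameter dense model
    BYTES_PER_W = 2        # bf16 weights
    MEM_BW      = 3.35e12  # bytes/s, roughly an H100's HBM bandwidth
    COMPUTE     = 990e12   # FLOP/s, roughly an H100's dense bf16 rate

    def step_time(batch):
        t_mem  = PARAMS * BYTES_PER_W / MEM_BW   # stream weights once
        t_flop = 2 * PARAMS * batch / COMPUTE    # ~2 FLOPs/param/token
        return max(t_mem, t_flop)

    for b in (1, 8, 64, 256, 512):
        print(f"batch={b:4d}  step={step_time(b)*1e3:.1f} ms")

    # Crossover where compute overtakes the weight stream:
    # batch = COMPUTE * BYTES_PER_W / (2 * MEM_BW) ~= 295 here.
    # Past that point, extra per-token compute directly costs throughput.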
Because something linear like Mamba doesn't perform as well on its own: you'd hit a performance cliff where the model suddenly gets dumber and forgets a lot of what was going on.
Instead, you can get the benefits of both by running both in parallel. That lets you shrink the O(n^2) attention mechanism: it's still quadratic, but the constant drops quite a bit while retaining most of the performance, since the linear context mechanism handles the tasks it's well suited for and attention gets to play to its strengths.
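As a sketch of what "both in parallel" can look like, here's a toy block where a cheap O(n) gated cumulative-sum path (a crude stand-in for an SSM scan, not any specific paper's design) runs alongside attention projected down to a quarter of the model width. All dimensions are arbitrary choices of mine:

    # Toy "hybrid in parallel" block. The gated cumulative sum is a
    # crude stand-in for a linear-time SSM scan; the attention path is
    # projected down to attn_dim, shrinking the O(n^2) constant by
    # roughly attn_dim / d_model.
    import torch
    import torch.nn as nn

    class HybridBlock(nn.Module):
        def __init__(self, d_model=512, attn_dim=128, n_heads=2):
            super().__init__()
            self.in_proj = nn.Linear(d_model, d_model)   # linear path
            self.gate    = nn.Linear(d_model, d_model)
            self.q = nn.Linear(d_model, attn_dim)        # narrow attention
            self.k = nn.Linear(d_model, attn_dim)
            self.v = nn.Linear(d_model, attn_dim)
            self.attn = nn.MultiheadAttention(attn_dim, n_heads,
                                              batch_first=True)
            self.out = nn.Linear(d_model + attn_dim, d_model)

        def forward(self, x):  # x: (batch, seq, d_model)
            # O(n) path: gated running sum (a real SSM adds learned decay).
            g = torch.sigmoid(self.gate(x))
            linear_path = torch.cumsum(g * self.in_proj(x), dim=1)
            # O(n^2) path at reduced width (causal mask omitted for brevity).
            attn_path, _ = self.attn(self.q(x), self.k(x), self.v(x),
                                     need_weights=False)
            return self.out(torch.cat([linear_path, attn_path], dim=-1))

    x = torch.randn(2, 16, 512)
    print(HybridBlock()(x).shape)  # torch.Size([2, 16, 512])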
The recent Nemotron 3 Nano and Super models from NVIDIA are hybrid architectures in this way, with most of their context layers being Mamba while retaining enough attention layers to stay competitive on the more complex tasks that require quadratic attention.
See https://magazine.sebastianraschka.com/i/168650848/18-nemotro... for some discussion of this architecture.
Nemotron 3 Super doesn't perform quite as well on benchmarks as the similarly sized Qwen3.5 122B A10B model, but it runs faster and is cheaper to serve.
https://artificialanalysis.ai/?models=gpt-oss-120b%2Cmistral...
Now, that's not exactly an apples-to-apples comparison, since the training process (the pre-training data mix, and the fine-tuning stages of instruction tuning, RLVR, etc.) could have as much or more impact on how well it does as the architecture itself. Nemotron 3 Super does score better than GPT-OSS 120B and Mistral Small 4, both also similarly sized open-weights models.
They did do that, two years ago. The problems are that 1) Mamba's accuracy gets worse as context size grows, 2) NVIDIA GPUs are designed for transformers, and 3) all the software out there is also designed for transformers. It's still useful in some applications, but it doesn't beat regular transformers if you have the gear.
Linear-time-complexity models are bad at in-context retrieval, which limits their performance on various tasks, so a pure linear model isn't currently feasible anyway, at least for language models. Instead they recommend mixing linear and attention layers. Presumably this mostly solves the performance problem (at least on benchmarks), but it also means the mixed architecture is no longer linear. It will still be faster and less RAM-hungry at long context than a pure transformer, though.
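The memory side is easy to see with rough numbers. Everything below (layer counts, head sizes, dtype) is assumed for illustration:

    # A KV cache grows with context; an SSM state does not. A hybrid
    # pays the KV cost only for its few attention layers. Sizes assumed.
    def kv_cache_bytes(ctx, layers, kv_heads=8, head_dim=128, b=2):
        return ctx * layers * 2 * kv_heads * head_dim * b   # K and V

    def ssm_state_bytes(layers, d_model=4096, state_dim=16, b=2):
        return layers * d_model * state_dim * b             # fixed size

    ctx = 128_000
    print("pure transformer, 48 layers:",
          kv_cache_bytes(ctx, 48) / 1e9, "GB")              # ~25 GB
    print("hybrid, 6 attn + 42 SSM    :",
          (kv_cache_bytes(ctx, 6) + ssm_state_bytes(42)) / 1e9, "GB")  # ~3 GB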
Probably best achieved by model routing, either as an indirection behind the chat UI, or with an API user doing it themselves by calling a different model for long-context queries.
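A minimal sketch of that routing shim; the model names and the threshold here are made-up placeholders, not any real provider's API:

    # Hypothetical length-based router sitting in front of two models.
    def pick_model(prompt_tokens: int) -> str:
        if prompt_tokens > 32_000:              # threshold is arbitrary
            return "hybrid-long-context-model"  # cheap at long context
        return "full-attention-model"           # strongest short-context

    def route(messages, count_tokens) -> str:
        n = sum(count_tokens(m["content"]) for m in messages)
        return pick_model(n)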
Can anyone explain why Mamba models start with a continuous-time SSM (and then discretize) rather than starting from a discrete-time one?
I know the step size isn't fixed, but I'm also not sure why that's important. Is that the only reason? There also seems to be a parameterization advantage with the continuous formulation.
https://arxiv.org/abs/2603.15569
https://github.com/mamba-org/mamba
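Not a full answer, but the discretization itself is standard zero-order hold, and working it through shows one concrete payoff: the same continuous (a, b) works for any step size, and making the step input-dependent turns it into a per-token gate. The toy values below are mine:

    # Zero-order-hold discretization of a diagonal continuous-time SSM
    #   x'(t) = a*x(t) + b*u(t)   ->   x[k+1] = a_bar*x[k] + b_bar*u[k]
    # Standard ZOH formulas; the scalar parameters are toy values.
    import numpy as np

    def discretize(a, b, delta):
        a_bar = np.exp(delta * a)        # state decay over one step
        b_bar = (a_bar - 1.0) / a * b    # exact ZOH input weighting
        return a_bar, b_bar

    a, b = -0.5, 1.0
    for delta in (0.01, 1.0, 8.0):
        print(delta, discretize(a, b, delta))
    # delta -> 0:    a_bar -> 1, b_bar -> 0 (keep state, ignore the token)
    # delta large:   a_bar -> 0, b_bar -> -b/a (reset toward the input)
    # An input-dependent delta therefore acts like a learned gate, which
    # is one reason the continuous parameterization is convenient.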
This is really promising. Are they now going to scale this up to hundreds of billions of parameters? Why stop at 1.5B if they found a potentially SOTA architecture?