The scaling of inference-time compute has become a primary driver of Large Language Model (LLM) performance, shifting architectural focus toward inference efficiency alongside model quality. While Transformer-based architectures remain the standard, their quadratic computational complexity and linear memory requirements create significant deployment bottlenecks. A team of researchers from Carnegie Mellon University (CMU), Princeton University, Together AI, and Cartesia AI has introduced Mamba-3, a model that addresses these constraints through an ‘inference-first’ design.
Mamba-3 builds upon the State Space Model (SSM) framework, introducing three core methodological updates: exponential-trapezoidal discretization, complex-valued state updates, and a Multi-Input Multi-Output (MIMO) formulation.
1. Exponential-Trapezoidal Discretization
State space models are continuous-time systems that must be discretized to process discrete sequences. Previous iterations like Mamba-1 and Mamba-2 utilized a first-order heuristic known as ‘exponential-Euler’ discretization. Mamba-3 replaces this with exponential-trapezoidal discretization, which provides a second-order accurate approximation of the state-input integral.
Technically, this update changes the discrete recurrence from a two-term update to a three-term update that mixes the current input with the previous one, schematically h_t = a_t h_{t-1} + c_t B_t x_t + d_t B_{t-1} x_{t-1}, where the coefficients follow from the trapezoidal rule (notation here is schematic, not the paper’s exact parameterization).
This three-term update is equivalent to applying a data-dependent, width-2 convolution on the state-input B_t x_t within the core recurrence. In empirical testing, this implicit convolution, combined with learnable B and C biases, allows Mamba-3 to function effectively without the external short causal convolutions typically required by recurrent models.
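To make the two-term vs. three-term distinction concrete, here is a minimal scalar sketch. The weight names `w_curr`/`w_prev` and the exact placement of the decay factor on the lagged term are illustrative assumptions, not the paper’s derivation:

```python
def euler_step(h, a, B, x):
    """Two-term exponential-Euler update used by Mamba-1/2 (schematic):
    h_t = a_t * h_{t-1} + B_t * x_t."""
    return a * h + B * x

def trapezoidal_step(h, a, B_prev, x_prev, B, x, w_prev, w_curr):
    """Three-term exponential-trapezoidal update (schematic):
    h_t = a_t * h_{t-1} + w_curr * B_t * x_t + w_prev * a * B_{t-1} * x_{t-1}.
    The extra lagged term acts as a data-dependent width-2 convolution
    over the state-input B_t x_t."""
    return a * h + w_curr * (B * x) + w_prev * a * (B_prev * x_prev)

# Setting w_prev = 0 and w_curr = 1 recovers the Euler step exactly.
h = 0.0
h_euler = euler_step(h, 0.9, 1.0, 2.0)
h_trap = trapezoidal_step(h, 0.9, B_prev=1.0, x_prev=0.0,
                          B=1.0, x=2.0, w_prev=0.0, w_curr=1.0)
```

The point of the sketch is only the shape of the recurrence: the trapezoidal form touches two consecutive inputs per step, which is what makes the external short causal convolution redundant.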
2. Complex-Valued State Space Models and the ‘RoPE Trick’
A limitation of real-valued linear models is their inability to solve ‘state-tracking’ tasks, such as determining the parity of bit sequences. This failure stems from restricting the eigenvalues of the transition matrix to real numbers, which cannot represent the ‘rotational’ dynamics required for such tasks.
Mamba-3 incorporates complex-valued SSMs to resolve this. The research team established a theoretical equivalence between discretized complex SSMs and real-valued SSMs that utilize data-dependent Rotary Positional Embeddings (RoPE) on the B and C projections.
By using the ‘RoPE trick,’ the model applies aggregated data-dependent rotations across time steps. This enables Mamba-3 to solve synthetic tasks like Parity and Modular Arithmetic, where Mamba-2 and real-valued variants perform no better than random guessing.
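A toy example illustrates why rotational (complex) eigenvalues unlock parity. This is not the Mamba-3 mechanism itself, which applies aggregated data-dependent RoPE rotations to the B and C projections; it only shows that a transition eigenvalue on the unit circle can count modulo 2, where a real nonnegative eigenvalue cannot:

```python
import cmath
import math

def parity_via_rotation(bits):
    """Track parity with a complex scalar state: each 1-bit rotates the
    state by pi radians (eigenvalue e^{i*pi}), each 0-bit leaves it fixed
    (eigenvalue e^{0} = 1). The state ends at +1 for even parity and -1
    for odd parity."""
    h = 1 + 0j
    for b in bits:
        h *= cmath.exp(1j * math.pi * b)
    return 0 if h.real > 0 else 1

parity_via_rotation([1, 0, 1, 1])  # three 1-bits -> odd parity (1)
```

A purely real, contractive recurrence can only decay or grow the state monotonically, so no linear readout of it can oscillate with the running parity; the rotation sidesteps that restriction.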
3. Multi-Input, Multi-Output (MIMO) Formulation
To address the hardware inefficiency of memory-bound decoding, Mamba-3 transitions from a Single-Input Single-Output (SISO) recurrence to a Multi-Input, Multi-Output (MIMO) structure.
In standard SSM decoding, the arithmetic intensity is approximately 2.5 ops per byte, far below the compute-bound regime of modern GPUs like the H100. MIMO increases the rank R of the input and output projections (B_t ∈ ℝ^(N×R) and x_t ∈ ℝ^(P×R)), transforming the state update from an outer product into a matrix-matrix multiplication.
This shift increases decoding FLOPs by up to 4x relative to Mamba-2 at a fixed state size. Because the additional computation is overlaid with the existing memory I/O required for the state update, MIMO improves modeling quality and perplexity while maintaining similar wall-clock decode latency.
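The SISO-to-MIMO shift can be sketched with NumPy. The dimensions below (state dim N, channel dim P, rank R) are illustrative choices, not the paper’s configuration; the sketch shows only how the rank-1 outer-product update becomes a matrix-matrix multiply whose FLOPs grow with R while the state read/write volume stays fixed:

```python
import numpy as np

N, P, R = 64, 16, 4  # state dim, channel dim, MIMO rank (illustrative)
rng = np.random.default_rng(0)
B_t = rng.standard_normal((N, R))  # rank-R input projection
X_t = rng.standard_normal((R, P))  # rank-R block of inputs

# SISO (R = 1): the state update is a rank-1 outer product, ~N*P FLOPs.
siso_update = np.outer(B_t[:, 0], X_t[0])

# MIMO (R = 4): the update is a matmul, ~N*P*R FLOPs, but the state that
# must be read and written is still only N*P values, so arithmetic
# intensity rises without extra memory traffic.
mimo_update = B_t @ X_t

# The MIMO update is exactly the sum of R rank-1 (SISO-style) updates.
decomposed = sum(np.outer(B_t[:, r], X_t[r]) for r in range(R))
```

Because decoding is memory-bound, those extra FLOPs ride along with the state I/O that must happen anyway, which is why quality improves at similar wall-clock latency.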
Architecture and Normalization
The Mamba-3 block follows a Llama-style layout, alternating sequence-mixing (SSM) blocks with SwiGLU blocks. Key refinements include:
- BC/QK Normalization: RMS normalization is applied to the B and C projections, mirroring QKNorm in Transformers. This stabilizes training and enables the removal of the post-gate RMSNorm used in previous versions.
- Head-Specific Biases: Learnable, channel-wise biases are added to B and C components after normalization to induce convolution-like behavior.
- Hybrid Integration: When used in hybrid architectures—interleaving linear layers with self-attention—the addition of a pre-gate, grouped RMSNorm was found to improve length generalization in retrieval tasks.
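The BC normalization step above is straightforward to sketch. The shapes and the bias placement are illustrative assumptions consistent with the description (RMS-normalize the projection, then add a learnable channel-wise bias):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMS normalization over the last axis: rescale so the root mean
    square of each row is ~1 (no mean subtraction, unlike LayerNorm)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

# Hypothetical per-token B projection with state dim N = 8.
B = np.array([[3.0, -4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
B_normed = rms_norm(B)

# A learnable channel-wise bias (one value per state channel) would then
# be added, per the head-specific-bias refinement: B_out = B_normed + b.
b = np.zeros(8)
B_out = B_normed + b
```

Normalizing B and C plays the same stabilizing role that QKNorm plays for attention logits: it bounds the scale of the state-input and readout interactions regardless of what the projections produce.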
Results and Efficiency
Evaluations were conducted on the FineWeb-Edu dataset across four model scales (180M to 1.5B parameters).
- Downstream Performance: At the 1.5B scale, the Mamba-3 SISO variant outperforms Mamba-2 and Gated DeltaNet (GDN). The MIMO variant (R=4) further improves average downstream accuracy by 1.2 points over the SISO baseline.
- Pareto Frontier: Mamba-3 achieves comparable pretraining perplexity to Mamba-2 while using only half the state size (e.g., Mamba-3 with state size 64 matches Mamba-2 with 128).
- Kernel Performance: Optimized Triton (for prefill) and CuTe DSL (for decode) kernels ensure that the additional mathematical components remain lightweight. SISO Mamba-3 kernels demonstrate lower latency than released Mamba-2 and GDN kernels at standard BF16 settings.
| Model (1.5B) | Avg. Downstream Acc % ↑ | FW-Edu Ppl ↓ |
|---|---|---|
| Transformer | 55.4 | 10.51 |
| Mamba-2 | 55.7 | 10.47 |
| Mamba-3 SISO | 56.4 | 10.35 |
| Mamba-3 MIMO (R=4) | 57.6 | 10.24 |
Mamba-3 demonstrates that principled adjustments to the state space model formulation can bridge the gap between theoretical sub-quadratic efficiency and practical modeling capability.
The post Meet Mamba-3: A New State Space Model Frontier with 2x Smaller States and Enhanced MIMO Decoding Hardware Efficiency appeared first on MarkTechPost.
