20.5 C
New York
Thursday, June 18, 2026

MiniMax Sparse Consideration (MSA): a Two-Department Block-Sparse Consideration Educated on a 109B-Parameter MoE With a 3T-Token Finances


MiniMax launched MSA (MiniMax Sparse Consideration), a sparse consideration technique constructed straight on Grouped Question Consideration (GQA). It targets one bottleneck: the quadratic price of softmax consideration at lengthy context. The MiniMax analysis group examined it inside a 109B-parameter Combination-of-Consultants mannequin educated with native multimodal information. In addition they open-sourced an inference kernel and shipped a manufacturing mannequin, MiniMax-M3.

What’s MSA (MiniMax Sparse Consideration)

MSA (MiniMax Sparse Consideration) components consideration into two phases: an Index Department and a Principal Department. The Index Department decides which key-value blocks every question ought to learn. The Principal Department then runs precise softmax consideration over solely these blocks.

Choice occurs at block granularity, not per token. The default block measurement is Bokay = 128 tokens. Every question and GQA group retains okay = 16 blocks. That fixes the per-query finances at kBokay = 2,048 key-value tokens.

The 2 price constructions differ. Dense GQA consideration scales per question as O(N), the total context. MSA scales as O(kBokay), which stays mounted as N grows. The compute hole due to this fact widens as context size will increase.

Choice is shared inside every GQA group however unbiased throughout teams. One key-value head serves a number of question heads, they usually share one block set. Completely different teams can attend to totally different long-range areas.

How the Two Branches Work

The Index Department provides solely two projection matrices to a regular GQA layer. It defines one index question head per GQA group and one shared index key head. It scores seen key tokens, then max-pools these scores to the block stage.

A High-k operator then selects the highest-scoring blocks per question and group. The native block containing the question is at all times included. This prevents the selector from dropping the question’s instant neighborhood.

The Principal Department gathers causally seen tokens from the chosen blocks. It applies scaled dot-product softmax consideration restricted to these tokens. Every question head retains its personal question projection however shares the group’s block set.

A visualization within the report exhibits what the realized indexer selects. Heads think about the native diagonal and the primary block. They reserve the remainder of the finances for a couple of long-range stripes.

https://arxiv.org/pdf/2606.13392v1
https://arxiv.org/pdf/2606.13392v1

How MSA is Educated

High-k choice is non-differentiable, so the language-modeling loss can’t practice the index projections. MSA solves this with a KL alignment loss. The loss matches the Index Department distribution to the Principal Department consideration sample. The trainer is the group-averaged Principal Department distribution over the chosen tokens.

Three mechanisms stabilize sparse coaching. Gradient Detach applies stop-gradient to the Index Department enter. This confines the KL loss to the index projections, not the spine. With out it, bigger KL coefficients brought about gradient spikes and loss divergence.

Indexer Warmup runs full consideration in each branches for the primary iterations. The indexer learns from the KL loss earlier than it controls routing. The compelled Native Block reserves one slot for close by context.

Ablations formed the ultimate recipe. An early variant added an Index Department worth head with its personal output. As soon as warmup is used, that worth head is now not crucial. The ultimate design drops it on effectivity grounds.

MSA helps two coaching routes. MSA-PT trains from scratch after a 40B-token indexer warmup. MSA-CPT converts a dense GQA checkpoint educated on 2.6T tokens. It then continues for 400B tokens, together with 40B tokens of warmup.

The Kernel Co-Design

Theoretical sparsity doesn’t turn out to be pace with out a matching GPU path. MSA pairs the algorithm with two kernel concepts.

The primary is exp-free High-k choice. Softmax preserves order, so rating uncooked scores yields similar indices. The kernel skips the max, exp, and sum steps earlier than choice. At 128K context with okay = 16, it ran 5.1× quicker than torch.topk. It additionally beat the TileLang radix-select kernel by 3.7×.

The second is KV-outer sparse consideration with question collect. Iterating over KV blocks raises arithmetic depth versus iterating over queries. The kernel packs ⌈128/G⌉ question positions into one 128×128 rating MMA. A two-phase ahead splits the eye and mix steps throughout CTAs.

The open-source kernel, fmha_sm100, targets NVIDIA SM100 GPUs. It ships dense FlashAttention plus sparse High-k kernels underneath an MIT license. It helps BF16, FP8, NVFP4, and FP4 precision.

How MSA Compares To Different Sparse Strategies

The analysis group positions MSA in opposition to 4 natively educated sparse designs.

The desk under summarizes the variations it describes.

Methodology Spine Choice granularity Indexer / choice sign
MSA GQA Block-level (B_k = 128), per-GQA-group High-k KL alignment loss
NSA MQA / MHA Compressed + chosen blocks + sliding window Native (end-to-end) coaching
InfLLM-V2 Dense↔sparse switchable Parameter-free block choice + sliding window Parameter-free (no educated indexer)
MoBA GQA Very massive KV blocks (block-averaged keys) LM gradient solely
DSA MLA (MQA mode) Token-level; single High-k shared throughout heads ReLU lightning indexer

MSA’s distinguishing pair is per-GQA-group High-k sharing mixed with block-level choice. This retains KV reads contiguous whereas giving every group its personal retrieval.

The standard aspect holds up. Each sparse fashions keep broadly aggressive with the Full-Consideration baseline.

The desk under exhibits consultant outcomes underneath the 3T-token finances.

Benchmark Full MSA-PT MSA-CPT
MMLU 67.0 67.2 66.8
GSM8K 76.2 77.7 73.7
HumanEval 61.0 64.0 57.9
RULER-8K 79.8 84.2 77.2
RULER-32K 75.0 77.5 75.7
VideoMME 41.11 45.48 39.65

After long-context extension, MSA-CPT stayed near Full on HELMET-128K and RULER-128K. Every question nonetheless attends to solely 2,048 key-value tokens.

Explainer Playground


Related Articles

Latest Articles