Artificial Intelligence

MiniMax Sparse Consideration (MSA): a Two-Department Block-Sparse Consideration Educated on a 109B-Parameter MoE With a 3T-Token Finances

June 17, 2026

[ad_1]

MiniMax launched MSA (MiniMax Sparse Consideration), a sparse consideration technique constructed straight on Grouped Question Consideration (GQA). It targets one bottleneck: the quadratic price of softmax consideration at lengthy context. The MiniMax analysis group examined it inside a 109B-parameter Combination-of-Consultants mannequin educated with native multimodal information. In addition they open-sourced an inference kernel and shipped a manufacturing mannequin, MiniMax-M3.

What’s MSA (MiniMax Sparse Consideration)

MSA (MiniMax Sparse Consideration) components consideration into two phases: an Index Department and a Principal Department. The Index Department decides which key-value blocks every question ought to learn. The Principal Department then runs precise softmax consideration over solely these blocks.

Choice occurs at block granularity, not per token. The default block measurement is B_okay = 128 tokens. Every question and GQA group retains okay = 16 blocks. That fixes the per-query finances at kB_okay = 2,048 key-value tokens.

The 2 price constructions differ. Dense GQA consideration scales per question as O(N), the total context. MSA scales as O(kB_okay), which stays mounted as N grows. The compute hole due to this fact widens as context size will increase.

Choice is shared inside every GQA group however unbiased throughout teams. One key-value head serves a number of question heads, they usually share one block set. Completely different teams can attend to totally different long-range areas.

How the Two Branches Work

The Index Department provides solely two projection matrices to a regular GQA layer. It defines one index question head per GQA group and one shared index key head. It scores seen key tokens, then max-pools these scores to the block stage.

A High-k operator then selects the highest-scoring blocks per question and group. The native block containing the question is at all times included. This prevents the selector from dropping the question’s instant neighborhood.

The Principal Department gathers causally seen tokens from the chosen blocks. It applies scaled dot-product softmax consideration restricted to these tokens. Every question head retains its personal question projection however shares the group’s block set.

A visualization within the report exhibits what the realized indexer selects. Heads think about the native diagonal and the primary block. They reserve the remainder of the finances for a couple of long-range stripes.

How MSA is Educated

High-k choice is non-differentiable, so the language-modeling loss can’t practice the index projections. MSA solves this with a KL alignment loss. The loss matches the Index Department distribution to the Principal Department consideration sample. The trainer is the group-averaged Principal Department distribution over the chosen tokens.

Three mechanisms stabilize sparse coaching. Gradient Detach applies stop-gradient to the Index Department enter. This confines the KL loss to the index projections, not the spine. With out it, bigger KL coefficients brought about gradient spikes and loss divergence.

Indexer Warmup runs full consideration in each branches for the primary iterations. The indexer learns from the KL loss earlier than it controls routing. The compelled Native Block reserves one slot for close by context.

Ablations formed the ultimate recipe. An early variant added an Index Department worth head with its personal output. As soon as warmup is used, that worth head is now not crucial. The ultimate design drops it on effectivity grounds.

MSA helps two coaching routes. MSA-PT trains from scratch after a 40B-token indexer warmup. MSA-CPT converts a dense GQA checkpoint educated on 2.6T tokens. It then continues for 400B tokens, together with 40B tokens of warmup.

The Kernel Co-Design

Theoretical sparsity doesn’t turn out to be pace with out a matching GPU path. MSA pairs the algorithm with two kernel concepts.

The primary is exp-free High-k choice. Softmax preserves order, so rating uncooked scores yields similar indices. The kernel skips the max, exp, and sum steps earlier than choice. At 128K context with okay = 16, it ran 5.1× quicker than torch.topk. It additionally beat the TileLang radix-select kernel by 3.7×.

The second is KV-outer sparse consideration with question collect. Iterating over KV blocks raises arithmetic depth versus iterating over queries. The kernel packs ⌈128/G⌉ question positions into one 128×128 rating MMA. A two-phase ahead splits the eye and mix steps throughout CTAs.

The open-source kernel, fmha_sm100, targets NVIDIA SM100 GPUs. It ships dense FlashAttention plus sparse High-k kernels underneath an MIT license. It helps BF16, FP8, NVFP4, and FP4 precision.

How MSA Compares To Different Sparse Strategies

The analysis group positions MSA in opposition to 4 natively educated sparse designs.

The desk under summarizes the variations it describes.

Methodology	Spine	Choice granularity	Indexer / choice sign
MSA	GQA	Block-level (`B_k = 128`), per-GQA-group High-k	KL alignment loss
NSA	MQA / MHA	Compressed + chosen blocks + sliding window	Native (end-to-end) coaching
InfLLM-V2	Dense↔sparse switchable	Parameter-free block choice + sliding window	Parameter-free (no educated indexer)
MoBA	GQA	Very massive KV blocks (block-averaged keys)	LM gradient solely
DSA	MLA (MQA mode)	Token-level; single High-k shared throughout heads	ReLU lightning indexer

MSA’s distinguishing pair is per-GQA-group High-k sharing mixed with block-level choice. This retains KV reads contiguous whereas giving every group its personal retrieval.

The standard aspect holds up. Each sparse fashions keep broadly aggressive with the Full-Consideration baseline.

The desk under exhibits consultant outcomes underneath the 3T-token finances.

Benchmark	Full	MSA-PT	MSA-CPT
MMLU	67.0	67.2	66.8
GSM8K	76.2	77.7	73.7
HumanEval	61.0	64.0	57.9
RULER-8K	79.8	84.2	77.2
RULER-32K	75.0	77.5	75.7
VideoMME	41.11	45.48	39.65

After long-context extension, MSA-CPT stayed near Full on HELMET-128K and RULER-128K. Every question nonetheless attends to solely 2,048 key-value tokens.

Explainer Playground

<tbody>“; for(var q=0;q0;c++) html+=’ <tr>‘; for(var col=0;colq){ shade=”#0d0d0d”; } // future (causal masks) else if(chosen[col]===’sink’){ shade=”#76B900″; } else if(chosen[col]===’native’){ shade=”#4a8f00″; } else if(chosen[col]===’lengthy’){ shade=”#2e6fd6″; } else { shade=”#1d1d1d”; } var bd = (col>q)?’#0d0d0d’:’#0e0e0e’; html+=’ <td style=""background:'+color+';border-color:'+bd+'""/>‘; } html+=’</tr> ‘; } html+=’</tbody> ‘; root.querySelector(‘#masks’).innerHTML=html; } perform fmtFlops(x){ if(x>=1e9) return (x/1e9).toFixed(1)+’ GFLOP’; if(x>=1e6) return (x/1e6).toFixed(1)+’ MFLOP’; if(x>=1e3) return (x/1e3).toFixed(1)+’ kFLOP’; return Math.spherical(x)+’ FLOP’; } perform commas(n){ return n.toString().substitute(/B(?=(d{3})+(?!d))/g, ‘,’); } perform updateCompute(idx){ var N=Ns[idx]; // per-token consideration FLOPs (divide eq.12 totals by N) var gqa = 2*Hq*dh*N; // dense GQA per token var msa = didx*Hkv*N + 4*Hq*dh*k_budget*Bk; // index + essential per token var ratio = gqa/msa; root.querySelector(‘#nval’).textContent=Nlabels[idx]; root.querySelector(‘#redux’).innerHTML=ratio.toFixed(1)+’×’; root.querySelector(‘#gqaf’).textContent=fmtFlops(gqa); root.querySelector(‘#msaf’).textContent=fmtFlops(msa); var msaPct=Math.max(4, (msa/gqa)*100); root.querySelector(‘#gqabar’).model.width=”100%”; root.querySelector(‘#msabar’).model.width=msaPct.toFixed(2)+’%’; root.querySelector(‘#finances’).textContent=commas(k_budget*Bk); root.querySelector(‘#densetok’).textContent=commas(N); postHeight(); } // —- wiring —- bslide.addEventListener(‘enter’,perform(){ B=+this.worth; bval.textContent=B; if(okay>B) { okay=B; kslide.worth=B; kval.textContent=B; } kslide.max=Math.min(8,B); buildMask(); postHeight(); }); kslide.addEventListener(‘enter’,perform(){ okay=+this.worth; kval.textContent=okay; buildMask(); postHeight(); }); tabs.forEach(perform(t){ t.addEventListener(‘click on’,perform(){ tabs.forEach(perform(x){x.classList.take away(‘on’)}); t.classList.add(‘on’); group=+t.getAttribute(‘data-g’); buildMask(); }); }); root.querySelector(‘#nslide’).addEventListener(‘enter’,perform(){ updateCompute(+this.worth); }); // —- auto-resize for WordPress iframe —- perform postHeight(){ strive{ var h=doc.physique.offsetHeight+40; if(window.mother or father && window.mother or father!==window){ window.mother or father.postMessage({msaDemoHeight:h},’*’); } }catch(e){} } buildMask(); updateCompute(5); window.addEventListener(‘load’,postHeight); window.addEventListener(‘resize’,postHeight); })();

[ad_2]

What’s MSA (MiniMax Sparse Consideration)

How the Two Branches Work

How MSA is Educated

The Kernel Co-Design

How MSA Compares To Different Sparse Strategies

The desk under summarizes the variations it describes.

The desk under exhibits consultant outcomes underneath the 3T-token finances.

Explainer Playground

RELATED ARTICLESMORE FROM AUTHOR

Context Graph vs RAG vs Uncooked Context

Sensible SQL Methods Each Knowledge Scientist Ought to Know

The Obtain: AI bottleneck debates, and BCI trials take off

The Milky Approach Was Rewired by a Cataclysmic Collision Billions of...

RELATED ARTICLES MORE FROM AUTHOR