Multi-head Latent Attention (MLA): The KV-Cache Compression Behind DeepSeek-V3

May 23, 2026 · 283 words · 2 minutes reading time

TL;DR

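MLA, introduced in DeepSeek-V2 and carried over to V3, compresses each token's keys and values into a single low-rank latent vector of dimension d_c, much smaller than the 2·d_h·H elements that standard multi-head attention caches per layer (d_h is the per-head dimension, H the number of heads). This shrinks the KV cache per token dramatically.
Unlike GQA, which shrinks the cache by sharing each K/V head across a group of query heads, MLA caches only the compressed latent and recovers per-head K and V from it by up-projection at attention time. In practice those up-projections can be absorbed into the query and output projections, so full K/V need not be materialized.
A decoupled RoPE branch carries positional information on a small dedicated subspace, so the latent itself is never rotated and the up-projection trick still works.
The ablations in the DeepSeek-V2 paper report MLA matching or slightly beating MHA at a fraction of the KV memory.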

Key facts

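Variant	KV cache per token (elements, summed over L layers)	Quality vs MHA
MHA	2 · d_h · H · L	baseline
GQA (g groups)	2 · d_h · g · L	usually slightly worse
MLA	(d_c + d_r) · L	matches or beats MHA

Here d_h is the per-head dimension, H the number of attention heads, g the number of KV groups, L the number of layers, d_c the latent dimension, and d_r the dimension of the decoupled RoPE key.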
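To make the table concrete, here is a small worked calculation. It plugs in the dimensions reported for DeepSeek-V2 (d_h = 128, H = 128, L = 60, d_c = 512, d_r = 64); treat these as assumptions and substitute your own model's values. The GQA group count g = 8 is purely hypothetical, chosen for comparison.

```python
def mha_cache(d_h: int, n_heads: int, n_layers: int) -> int:
    """Cached elements per token for standard multi-head attention."""
    return 2 * d_h * n_heads * n_layers


def gqa_cache(d_h: int, n_groups: int, n_layers: int) -> int:
    """Cached elements per token for grouped-query attention."""
    return 2 * d_h * n_groups * n_layers


def mla_cache(d_c: int, d_r: int, n_layers: int) -> int:
    """Cached elements per token for MLA: latent plus decoupled RoPE key."""
    return (d_c + d_r) * n_layers


# Assumed DeepSeek-V2-style dimensions; g = 8 is a hypothetical GQA setting.
d_h, H, L = 128, 128, 60
d_c, d_r = 512, 64

mha = mha_cache(d_h, H, L)   # 1,966,080 elements
gqa = gqa_cache(d_h, 8, L)   # 122,880 elements
mla = mla_cache(d_c, d_r, L) # 34,560 elements

print(f"MHA: {mha:,}  GQA(g=8): {gqa:,}  MLA: {mla:,}")
print(f"MHA / MLA ratio: {mha / mla:.1f}x")
# MLA's cache equals that of GQA with this many groups:
print(f"Equivalent GQA groups: {(d_c + d_r) / (2 * d_h):.2f}")
```

Under these assumptions MLA caches roughly 57 times fewer elements than MHA, about the same as GQA with 2.25 groups. Multiply by the bytes per element of your cache dtype to get memory per token.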

How does MLA work mathematically?

For input (x_n \in \mathbb{R}^d), per layer:

Down-project to latent: (c^{KV}_n = W^{DKV} x_n \in \mathbb{R}^{d_c}) ← cached
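Up-project for attention: (k^{C}_n = W^{UK} c^{KV}_n), (v^{C}_n = W^{UV} c^{KV}_n), split across the H heads
Decoupled RoPE: a small key (k^{rope}_n = \mathrm{RoPE}(W^{KR} x_n) \in \mathbb{R}^{d_r}), shared by all heads, is computed separately and concatenated to each head's key for the attention dot product ← also cached. Each head's query gets its own matching RoPE part of dimension d_r.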

Queries are similarly down/up-projected. Query compression does not shrink the KV cache, since queries are never cached; its purpose is to reduce activation memory during training.
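The steps above can be sketched in NumPy as follows. This is a minimal, illustrative sketch rather than DeepSeek's implementation: weights are random, dimensions are tiny, query compression and the output projection are omitted, and the names `W_Q`, `W_QR`, `rope`, and `mla_attention` are hypothetical. It assumes the usual interleaved-pair RoPE formulation and causal masking.

```python
import numpy as np


def rope(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x (..., n, dim) by position-dependent angles."""
    dim = x.shape[-1]
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    ang = positions[:, None] * inv_freq[None, :]       # (n, dim/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


rng = np.random.default_rng(0)
d, H, d_h, d_c, d_r, n = 64, 4, 16, 8, 4, 5   # toy sizes (assumptions)

W_DKV = rng.normal(size=(d_c, d)) / np.sqrt(d)          # down-projection to latent
W_UK = rng.normal(size=(H, d_h, d_c)) / np.sqrt(d_c)    # per-head key up-projection
W_UV = rng.normal(size=(H, d_h, d_c)) / np.sqrt(d_c)    # per-head value up-projection
W_KR = rng.normal(size=(d_r, d)) / np.sqrt(d)           # shared RoPE key projection
W_Q = rng.normal(size=(H, d_h, d)) / np.sqrt(d)         # hypothetical: uncompressed queries
W_QR = rng.normal(size=(H, d_r, d)) / np.sqrt(d)        # hypothetical: per-head RoPE queries


def mla_attention(X):
    """Causal MLA over a sequence X of shape (n, d); returns outputs and the cache."""
    n_tok = X.shape[0]
    pos = np.arange(n_tok)

    # What gets cached: the latent and the shared RoPE key.
    c_kv = X @ W_DKV.T                      # (n, d_c)
    k_rope = rope(X @ W_KR.T, pos)          # (n, d_r)

    # Recovered from the cache at attention time.
    K = np.einsum("hkc,nc->hnk", W_UK, c_kv)   # (H, n, d_h)
    V = np.einsum("hkc,nc->hnk", W_UV, c_kv)   # (H, n, d_h)

    Q = np.einsum("hkd,nd->hnk", W_Q, X)                    # (H, n, d_h)
    q_rope = rope(np.einsum("hrd,nd->hnr", W_QR, X), pos)   # (H, n, d_r)

    # Dot product of concatenated [content; rope] vectors = sum of the two parts.
    scores = Q @ K.transpose(0, 2, 1) + q_rope @ k_rope.T   # (H, n, n)
    scores = scores / np.sqrt(d_h + d_r)

    future = np.triu(np.ones((n_tok, n_tok), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, (c_kv, k_rope)


X = rng.normal(size=(n, d))
out, (c_kv, k_rope) = mla_attention(X)
print(out.shape, c_kv.shape, k_rope.shape)
```

In this toy setting each token caches d_c + d_r = 12 elements per layer, against 2·d_h·H = 128 for plain MHA. Note that the shared `k_rope` is broadcast across heads, while `q_rope` is per-head.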

Why decouple RoPE?

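RoPE multiplies each key by a position-dependent rotation matrix. Without it, the up-projection (W^{UK}) can be folded into the query projection, so attention scores are computed directly against the cached latent (c^{KV}). If RoPE were applied to the up-projected keys, that rotation would sit between the query projection and (W^{UK}), and because matrix multiplication does not commute, the two could no longer be merged: you would have to recompute full-size keys for every cached token at each decoding step, erasing the efficiency gain. The decoupled RoPE branch avoids this. Position information lives on a small, separately cached vector of dimension d_r, while the latent stays free of rotations.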
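The non-commutation argument can be checked numerically. The sketch below is a toy single-head example with random weights; `rotation` and `absorbed_matrix` are hypothetical helpers, and the query vector stands in for an already-projected query. It first shows that, without RoPE, the score against the latent is the same whether or not (W^{UK}) is absorbed into the query side, then shows that with RoPE the matrix you would need to absorb changes with the relative position.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_c = 8, 4
W_UK = rng.normal(size=(d_h, d_c))   # key up-projection (random, illustrative)
q = rng.normal(size=d_h)             # an already-projected query
c = rng.normal(size=d_c)             # a cached latent


def rotation(pos, dim, base=10000.0):
    """Block-diagonal RoPE rotation matrix for one position (dim must be even)."""
    R = np.zeros((dim, dim))
    for i in range(dim // 2):
        theta = pos * base ** (-2 * i / dim)
        cs, sn = np.cos(theta), np.sin(theta)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[cs, -sn], [sn, cs]]
    return R


# Without RoPE: q . (W_UK c) == (W_UK^T q) . c, so W_UK folds into the query.
explicit = q @ (W_UK @ c)
absorbed = (W_UK.T @ q) @ c
assert np.isclose(explicit, absorbed)


def absorbed_matrix(i, j):
    """Matrix M with (R_i q) . (R_j W_UK c) == q . (M c)."""
    return rotation(i, d_h).T @ rotation(j, d_h) @ W_UK


# Sanity check that M really reproduces the rotated score.
rotated_score = (rotation(3, d_h) @ q) @ (rotation(5, d_h) @ W_UK @ c)
assert np.isclose(rotated_score, q @ (absorbed_matrix(3, 5) @ c))

# With RoPE, M depends on the offset j - i, so no single fixed matrix works.
fixed_absorbable = np.allclose(absorbed_matrix(3, 5), absorbed_matrix(3, 9))
same_offset = np.allclose(absorbed_matrix(3, 5), absorbed_matrix(10, 12))
print("one fixed matrix for all offsets:", fixed_absorbable)
print("same matrix for equal offsets:", same_offset)
```

Because the matrix to absorb differs for every query/key offset, it cannot be precomputed once and baked into the query projection. Keeping rotations on a separate d_r-dimensional key leaves the content path rotation-free, so the absorption in the first half of the sketch stays valid.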
