Multi-head Latent Attention (MLA): The KV-Cache Compression Behind DeepSeek-V3
TL;DR
- MLA, introduced in DeepSeek-V2 (and refined in V3), compresses each token's KV state into a low-rank latent vector of dimension d_c much smaller than 2·d_h·H, dramatically shrinking the KV cache per token.
- Unlike GQA, which shrinks the cache by sharing each K/V head across a group of query heads, MLA caches a compressed latent and reconstructs per-head K/V from it at attention time.
- A decoupled RoPE branch carries positional information on a small dedicated subspace, so the latent cache remains position-agnostic and reusable.
- DeepSeek-V2 ablations show MLA matches or slightly beats MHA at a fraction of the KV memory.
Key facts
| Variant | Cache per token (elements) | Quality vs MHA |
|---|---|---|
| MHA | 2 · d_h · H · L | baseline |
| GQA (g groups) | 2 · d_h · g · L | usually slightly worse |
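| MLA | (d_c + d_r) · L | matches or beats MHA |

Here d_h is the per-head dimension, H the number of attention heads, g the number of K/V groups in GQA, L the number of layers, d_c the MLA latent dimension, and d_r the decoupled RoPE dimension.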
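To make the table concrete, the sketch below evaluates the three formulas for one set of sizes. The numbers (128 heads of dimension 128, 60 layers, 8 GQA groups, d_c = 512, d_r = 64) are illustrative assumptions rather than values stated above, and the function names are hypothetical.

```python
def mha_cache(d_h, H, L):
    """Elements cached per token by MHA: full K and V for every head and layer."""
    return 2 * d_h * H * L

def gqa_cache(d_h, g, L):
    """Elements cached per token by GQA: K and V for g shared groups per layer."""
    return 2 * d_h * g * L

def mla_cache(d_c, d_r, L):
    """Elements cached per token by MLA: latent plus decoupled RoPE key per layer."""
    return (d_c + d_r) * L

# Illustrative sizes (assumptions, not taken from the text above).
d_h, H, L = 128, 128, 60
g = 8
d_c, d_r = 512, 64

mha = mha_cache(d_h, H, L)    # 1_966_080 elements
gqa = gqa_cache(d_h, g, L)    # 122_880 elements
mla = mla_cache(d_c, d_r, L)  # 34_560 elements

print(f"MHA: {mha:>9,} elements per token")
print(f"GQA: {gqa:>9,} elements per token")
print(f"MLA: {mla:>9,} elements per token")
print(f"MHA / MLA = {mha / mla:.1f}x, GQA / MLA = {gqa / mla:.1f}x")
```

Under these assumed sizes the MLA cache is roughly 57 times smaller than the MHA cache. Multiply the element counts by the bytes per element (2 for fp16 or bf16) to get memory per token.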
How does MLA work mathematically?
For input \(x_n \in \mathbb{R}^d\), per layer:
- Down-project to latent: \(c^{KV}_n = W^{DKV} x_n \in \mathbb{R}^{d_c}\) ← cached
- Up-project for attention: \(k_n = W^{UK} c^{KV}_n\), \(v_n = W^{UV} c^{KV}_n\), each split across the H heads
- Decoupled RoPE: a small key \(k^{rope}_n = \mathrm{RoPE}(W^{KR} x_n) \in \mathbb{R}^{d_r}\) is computed separately, shared across all heads, and concatenated to each head's slice of \(k_n\) for the attention dot product (also cached, dimension \(d_r\)).
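Queries are similarly down/up-projected, and each query head gets its own small RoPE component to pair with the shared RoPE key. Query compression does not shrink the KV cache, since queries are never cached; its purpose is to reduce activation memory during training.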
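The steps above can be sketched in NumPy for a single layer. This is a minimal illustration under stated assumptions, not a reference implementation: the query-side weight names (`DQ`, `UQ`, `QR`), the query latent size `d_q`, the helper names, and the toy sizes are all hypothetical; the output projection is omitted; and K/V are materialized explicitly for readability rather than speed.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x (..., T, d) by position-dependent angles."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)       # (d/2,)
    angles = positions[:, None] * inv_freq[None, :]    # (T, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def split_heads(x, H):
    """Reshape (T, H*d) into (H, T, d)."""
    return x.reshape(x.shape[0], H, -1).transpose(1, 0, 2)

def mla_forward(x, W, H):
    """Causal MLA over a sequence x of shape (T, d). Returns (output, cache)."""
    T = x.shape[0]
    pos = np.arange(T)

    # The only per-token state that needs caching.
    c_kv = x @ W["DKV"].T                  # (T, d_c) latent
    k_rope = rope(x @ W["KR"].T, pos)      # (T, d_r) RoPE key, shared by all heads

    # Keys and values reconstructed from the latent.
    k_c = split_heads(c_kv @ W["UK"].T, H)  # (H, T, d_h)
    v = split_heads(c_kv @ W["UV"].T, H)    # (H, T, d_h)

    # Queries: down-project, up-project, plus a per-head RoPE part (assumed names).
    c_q = x @ W["DQ"].T
    q_c = split_heads(c_q @ W["UQ"].T, H)                # (H, T, d_h)
    q_rope = rope(split_heads(c_q @ W["QR"].T, H), pos)  # (H, T, d_r)

    q = np.concatenate([q_c, q_rope], axis=-1)
    k = np.concatenate([k_c, np.broadcast_to(k_rope, (H,) + k_rope.shape)], axis=-1)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (H, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)        # causal mask
    scores = np.where(future, -np.inf, scores)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    out = (weights @ v).transpose(1, 0, 2).reshape(T, -1)     # (T, H*d_h)
    return out, (c_kv, k_rope)

# Toy sizes, chosen only so the example runs quickly.
rng = np.random.default_rng(0)
d, d_c, d_q, d_r, H, d_h, T = 32, 8, 12, 4, 4, 8, 5
shapes = {"DKV": (d_c, d), "UK": (H * d_h, d_c), "UV": (H * d_h, d_c),
          "KR": (d_r, d), "DQ": (d_q, d), "UQ": (H * d_h, d_q), "QR": (H * d_r, d_q)}
W = {name: rng.standard_normal(s) / np.sqrt(s[1]) for name, s in shapes.items()}
x = rng.standard_normal((T, d))

out, (c_kv, k_rope) = mla_forward(x, W, H)
cached_per_token = c_kv.shape[1] + k_rope.shape[1]  # d_c + d_r = 12
full_kv_per_token = 2 * H * d_h                     # 64 for plain MHA
print(out.shape, cached_per_token, full_kv_per_token)
```

With these toy sizes the layer caches 12 elements per token instead of the 64 that full K and V would need, while the attention itself still runs over H full-width heads.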
Why decouple RoPE?
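If you applied RoPE to the keys reconstructed from the latent \(c^{KV}\), a position-dependent rotation would sit between the query projection and the up-projection \(W^{UK}\) in every attention score. Matrix multiplication does not commute, so \(W^{UK}\) could no longer be folded into the query side once and for all; you would have to re-materialize the full-size K for every cached token at each decoding step, which defeats the purpose of caching only the latent. The decoupled RoPE subspace avoids this: position information lives on a small, separately cached vector, and the latent path stays rotation-free.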
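The argument can be checked numerically. The sketch below, using toy sizes and hypothetical names (`W_uq` for a query up-projection, `rotation` for an explicit RoPE matrix), first confirms that without RoPE the score can be computed straight from the latent through one fixed matrix, then shows that with RoPE on the up-projected query and key the equivalent matrix changes with the relative position.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_c = 8, 4  # toy sizes (illustrative assumptions)

W_uq = rng.standard_normal((d_h, d_c))  # query up-projection (assumed name)
W_uk = rng.standard_normal((d_h, d_c))  # key up-projection
c_q = rng.standard_normal(d_c)          # query-side latent
c_kv = rng.standard_normal(d_c)         # cached key/value latent

def rotation(pos, dim, base=10000.0):
    """Explicit block-diagonal RoPE rotation matrix for one position."""
    R = np.zeros((dim, dim))
    for i in range(dim // 2):
        theta = pos * base ** (-2 * i / dim)
        cos_t, sin_t = np.cos(theta), np.sin(theta)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[cos_t, -sin_t], [sin_t, cos_t]]
    return R

# Without RoPE: q.k = c_q^T (W_uq^T W_uk) c_kv, so one fixed matrix suffices
# and the full key never has to be materialized.
absorbed = W_uq.T @ W_uk
score_full = (W_uq @ c_q) @ (W_uk @ c_kv)
score_latent = c_q @ absorbed @ c_kv
print(np.isclose(score_full, score_latent))

# With RoPE on the up-projected query (position n) and key (position m), the
# matrix between the two latents is W_uq^T R_n^T R_m W_uk = W_uq^T R_{m-n} W_uk.
def effective_matrix(n, m):
    return W_uq.T @ rotation(n, d_h).T @ rotation(m, d_h) @ W_uk

same_offset = np.allclose(effective_matrix(5, 3), effective_matrix(9, 7))
different_offset = np.allclose(effective_matrix(5, 3), effective_matrix(5, 4))
print(same_offset, different_offset)
```

The matrix depends only on the offset m - n, but it is a different matrix for every offset, so no single precomputed product can stand in for it. Moving the rotation onto a separate d_r-dimensional key sidesteps the problem.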