Multi-head Latent Attention (MLA): The KV-Cache Compression Behind DeepSeek-V3

Multi-head Latent Attention (MLA)

TL;DR

  • MLA, introduced in DeepSeek-V2 and carried over into DeepSeek-V3, compresses each token's keys and values into a single low-rank latent vector of dimension d_c, much smaller than the 2·d_h·H elements that MHA caches per layer, which shrinks the per-token KV cache accordingly.
  • Unlike GQA, which shrinks the cache by sharing each K/V head across a group of query heads, MLA caches only the compressed latent and re-projects it to full K/V at attention time.
  • A decoupled RoPE branch carries positional information on a small dedicated subspace, so the latent cache remains position-agnostic and reusable.
  • DeepSeek-V2 ablations show MLA matches or slightly beats MHA at a fraction of the KV memory.

Key facts

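| Variant        | KV cache per token (elements) | Quality vs MHA         |
|----------------|-------------------------------|------------------------|
| MHA            | 2 · d_h · H · L               | baseline               |
| GQA (g groups) | 2 · d_h · g · L               | usually slightly worse |
| MLA            | (d_c + d_r) · L               | matches or beats MHA   |

Here H is the number of attention heads, d_h the per-head dimension, L the number of layers, g the number of K/V groups, d_c the latent dimension, and d_r the dimension of the decoupled RoPE key.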
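
To make the table concrete, the short sketch below evaluates the three formulas for one configuration. The sizes (128 heads of dimension 128, 60 layers, 8 GQA groups, d_c = 512, d_r = 64) are illustrative assumptions chosen for this example, and `kv_cache_elements` is a hypothetical helper, not part of any library.

```python
def kv_cache_elements(variant, *, d_h, n_heads, n_layers,
                      n_groups=None, d_c=None, d_r=None):
    """Cached elements per token, summed over all layers (formulas from the table)."""
    if variant == "MHA":
        return 2 * d_h * n_heads * n_layers
    if variant == "GQA":
        return 2 * d_h * n_groups * n_layers
    if variant == "MLA":
        return (d_c + d_r) * n_layers
    raise ValueError(f"unknown variant: {variant}")

# Illustrative configuration (assumed values, not taken from the text above).
cfg = dict(d_h=128, n_heads=128, n_layers=60, n_groups=8, d_c=512, d_r=64)

mha = kv_cache_elements("MHA", **cfg)
gqa = kv_cache_elements("GQA", **cfg)
mla = kv_cache_elements("MLA", **cfg)

print(f"MHA: {mha:>9,} elements/token")   # 1,966,080
print(f"GQA: {gqa:>9,} elements/token")   #   122,880
print(f"MLA: {mla:>9,} elements/token")   #    34,560
print(f"MHA/MLA = {mha / mla:.1f}x, GQA/MLA = {gqa / mla:.1f}x")
```

Under these assumed sizes the MLA cache is roughly 57 times smaller than MHA and about 3.6 times smaller than 8-group GQA. To convert elements to bytes, multiply by the width of the cache dtype.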

How does MLA work mathematically?

For the input \(x_n \in \mathbb{R}^d\) (the hidden state of token n), per layer:

  1. Down-project to latent: \(c^{KV}_n = W^{DKV} x_n \in \mathbb{R}^{d_c}\) (cached)
  2. Up-project for attention: \(k_n = W^{UK} c^{KV}_n\), \(v_n = W^{UV} c^{KV}_n\) (not cached; each is split across the H heads)
  3. Decoupled RoPE: a small key \(k^{rope}_n = \mathrm{RoPE}(W^{KR} x_n) \in \mathbb{R}^{d_r}\), shared across heads, is computed separately and concatenated to each head's key for the attention dot product (also cached).

Queries are similarly down/up-projected. Query compression does not shrink the KV cache; its purpose is to reduce activation memory during training.
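
Putting the three steps together, the attention logit between query position n and key position m in head i can be written out as below. The head index i on \(W^{UK}_i\) (the slice of \(W^{UK}\) belonging to head i), the names \(q^{C}_{n,i}\) and \(q^{rope}_{n,i}\) for the content and RoPE parts of the query, and the \(\sqrt{d_h + d_r}\) scaling are notational assumptions added here, following the usual scaled dot-product convention.

```latex
\begin{aligned}
k_{m,i} &= \begin{bmatrix} W^{UK}_i \, c^{KV}_m \\ k^{rope}_m \end{bmatrix},
\qquad
q_{n,i} = \begin{bmatrix} q^{C}_{n,i} \\ q^{rope}_{n,i} \end{bmatrix}, \\[4pt]
s_{n,m,i} &= \frac{q_{n,i}^{\top} k_{m,i}}{\sqrt{d_h + d_r}}
           = \frac{\left(q^{C}_{n,i}\right)^{\top} W^{UK}_i \, c^{KV}_m
                   + \left(q^{rope}_{n,i}\right)^{\top} k^{rope}_m}{\sqrt{d_h + d_r}}.
\end{aligned}
```

The first term depends on the cached token only through \(c^{KV}_m\), and the second only through \(k^{rope}_m\), which is why those two vectors are all that needs to be cached for the keys.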

Why decouple RoPE?

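RoPE cannot simply be applied to the up-projected keys. Without RoPE, \(W^{UK}\) can be absorbed into the query projection at inference, so attention scores are computed directly against the cached latents \(c^{KV}\). With RoPE applied to the keys, a position-dependent rotation matrix sits between the query projection and \(W^{UK}\). Because matrix multiplication does not commute, the absorption no longer works, and K would have to be recomputed at full size for every cached token at each step, killing the cache savings. The decoupled RoPE subspace avoids this: position information lives on a small, separately cached vector.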
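
The commutation argument can be checked numerically. The NumPy sketch below uses toy sizes and a single head. It first shows that, without RoPE, one query-side vector `W_uk.T @ q` scores every cached latent. It then shows that once the keys are rotated, the vector applied to each latent depends on the relative offset m - n. The `rotation` helper and all sizes are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_c, T = 8, 4, 5                     # head dim, latent dim, cached tokens (toy sizes)

W_uk = rng.standard_normal((d_h, d_c))    # key up-projection for one head
c_kv = rng.standard_normal((T, d_c))      # cached latents, one row per token
q = rng.standard_normal(d_h)              # content query of the current token

# Without RoPE: fold W_uk into the query once, then score against the latents.
scores_materialized = (c_kv @ W_uk.T) @ q     # builds full keys, shape (T, d_h)
scores_absorbed = c_kv @ (W_uk.T @ q)         # never builds the keys
assert np.allclose(scores_materialized, scores_absorbed)

def rotation(pos, d, base=10000.0):
    """Block-diagonal RoPE rotation matrix for position `pos` (d must be even)."""
    R = np.zeros((d, d))
    for j in range(d // 2):
        theta = pos * base ** (-2 * j / d)
        c, s = np.cos(theta), np.sin(theta)
        R[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, -s], [s, c]]
    return R

# With RoPE on the up-projected keys: score(n, m) = q^T R(m - n) W_uk c_m.
n = T - 1                                      # position of the current query
q_rot = rotation(n, d_h) @ q
k_rot = np.stack([rotation(m, d_h) @ (W_uk @ c_kv[m]) for m in range(T)])
scores_rope = k_rot @ q_rot

# The "absorbed" vector now has to be rebuilt for every offset m - n.
absorbed = np.stack([W_uk.T @ rotation(m - n, d_h).T @ q for m in range(T)])
scores_rope_via_latent = np.einsum("td,td->t", c_kv, absorbed)
assert np.allclose(scores_rope, scores_rope_via_latent)

position_dependent = not np.allclose(absorbed[0], absorbed[1])
print("absorbed vector differs per cached position:", position_dependent)
```

Since the offset m - n changes for every cached token at every decoding step, there is no fixed matrix to precompute, which is the cost the decoupled RoPE key avoids.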

Code (MLA core, illustrative)
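
What follows is a minimal NumPy sketch of one MLA layer during token-by-token decoding, written for this article rather than taken from any DeepSeek codebase. All function and parameter names (`rope`, `init_params`, `mla_step`, `W_q`, `W_qr`, `W_o`) and the toy sizes are assumptions. For brevity it omits query compression and materializes K and V explicitly instead of absorbing the up-projections.

```python
import numpy as np

def rope(v, pos, base=10000.0):
    """Rotate consecutive pairs along the last axis of `v` by position-dependent angles."""
    d = v.shape[-1]
    theta = pos * base ** (-np.arange(0, d, 2) / d)     # (d/2,)
    cos, sin = np.cos(theta), np.sin(theta)
    even, odd = v[..., 0::2], v[..., 1::2]
    out = np.empty_like(v)
    out[..., 0::2] = even * cos - odd * sin
    out[..., 1::2] = even * sin + odd * cos
    return out

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def init_params(d, H, d_h, d_c, d_r, seed=0):
    rng = np.random.default_rng(seed)
    g = lambda *shape: rng.standard_normal(shape) / np.sqrt(shape[-1])
    return dict(
        W_dkv=g(d_c, d),       # down-projection to the KV latent
        W_uk=g(H, d_h, d_c),   # per-head key up-projection
        W_uv=g(H, d_h, d_c),   # per-head value up-projection
        W_kr=g(d_r, d),        # RoPE key projection, shared across heads
        W_q=g(H, d_h, d),      # content queries (query compression omitted)
        W_qr=g(H, d_r, d),     # per-head RoPE queries
        W_o=g(d, H * d_h),     # output projection
    )

def mla_step(x, pos, cache, p):
    """Attend from one token `x` (shape (d,)) at position `pos`, updating `cache`."""
    # Only these two small vectors are stored per token.
    cache["c_kv"].append(p["W_dkv"] @ x)                 # (d_c,)
    cache["k_rope"].append(rope(p["W_kr"] @ x, pos))     # (d_r,)
    C = np.stack(cache["c_kv"])                          # (T, d_c)
    KR = np.stack(cache["k_rope"])                       # (T, d_r)

    # Up-project the cached latents to per-head keys and values.
    K = np.einsum("hkc,tc->htk", p["W_uk"], C)           # (H, T, d_h)
    V = np.einsum("hkc,tc->htk", p["W_uv"], C)           # (H, T, d_h)

    q_c = p["W_q"] @ x                                   # (H, d_h)
    q_r = rope(p["W_qr"] @ x, pos)                       # (H, d_r)

    # Content logits plus RoPE logits, i.e. the dot product of the concatenated q and k.
    d_h, d_r = q_c.shape[-1], q_r.shape[-1]
    logits = (np.einsum("hk,htk->ht", q_c, K) + q_r @ KR.T) / np.sqrt(d_h + d_r)
    attn = softmax(logits)                               # (H, T)
    heads = np.einsum("ht,htk->hk", attn, V)             # (H, d_h)
    return p["W_o"] @ heads.reshape(-1)                  # (d,)

# Usage: decode five random tokens and inspect the cache footprint.
d, H, d_h, d_c, d_r = 32, 4, 8, 6, 4
p = init_params(d, H, d_h, d_c, d_r)
cache = {"c_kv": [], "k_rope": []}
xs = np.random.default_rng(1).standard_normal((5, d))
outs = [mla_step(x, t, cache, p) for t, x in enumerate(xs)]
cached_per_token = cache["c_kv"][0].size + cache["k_rope"][0].size
print("cached elements per token per layer:", cached_per_token)   # d_c + d_r
```

Because each step appends only `c_kv` and `k_rope`, the cache grows by d_c + d_r elements per token per layer, regardless of the number of heads. A production implementation would more likely fold `W_uk` into the query path and `W_uv` into the output projection so that `K` and `V` are never formed.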
