1 post with tag attention

llm
attention
architecture
deepseek-v3
kv-cache
mla

Multi-head Latent Attention (MLA): The KV-Cache Compression Behind DeepSeek-V3

How DeepSeek's Multi-head Latent Attention compresses the KV cache with low-rank joint key/value projections and a decoupled RoPE key, caching far fewer values per token than standard multi-head attention (MHA) while matching or exceeding its quality.
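To make the "fewer values per token" claim concrete, here is a back-of-the-envelope sketch comparing per-token KV-cache size for MHA and MLA. The dimensions are assumptions that roughly follow DeepSeek-V3's published configuration (128 heads, head dim 128, latent KV dim 512, decoupled RoPE key dim 64, 61 layers), and BF16 storage (2 bytes per element) is also assumed. The helper names are hypothetical, and the MHA baseline ignores any extra RoPE dimensions, so treat the ratio as illustrative rather than exact.

```python
# Back-of-the-envelope KV-cache comparison (illustrative sketch).
# ASSUMED dims, roughly matching DeepSeek-V3's public config:
N_HEADS = 128        # attention heads
HEAD_DIM = 128       # per-head key/value dimension for the MHA baseline
D_C = 512            # MLA latent (compressed) KV dimension
D_ROPE = 64          # MLA decoupled RoPE key dimension, shared across heads
N_LAYERS = 61        # transformer layers
BYTES_PER_ELEM = 2   # ASSUMED BF16 storage


def mha_elems_per_token_per_layer(n_heads: int, head_dim: int) -> int:
    """Standard MHA caches a full key and a full value vector for every head."""
    return 2 * n_heads * head_dim


def mla_elems_per_token_per_layer(d_c: int, d_rope: int) -> int:
    """MLA caches one latent vector c_KV plus one shared RoPE key k_R."""
    return d_c + d_rope


mha = mha_elems_per_token_per_layer(N_HEADS, HEAD_DIM)
mla = mla_elems_per_token_per_layer(D_C, D_ROPE)

print(f"MHA elements/token/layer: {mha}")
print(f"MLA elements/token/layer: {mla}")
print(f"Reduction factor: {mha / mla:.1f}x")
print(f"MHA bytes/token (all layers): {mha * N_LAYERS * BYTES_PER_ELEM}")
print(f"MLA bytes/token (all layers): {mla * N_LAYERS * BYTES_PER_ELEM}")
```

Under these assumptions MLA stores 576 elements per token per layer versus 32,768 for MHA, about a 57x reduction. The saving comes from caching only the shared latent and RoPE key; per-head keys and values are reconstructed (or absorbed into the query/output projections) at attention time.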

May 23, 2026 · 2 minutes reading time