2 posts tagged "architecture"

llm
attention
architecture
deepseek-v3
kv-cache
mla

Multi-head Latent Attention (MLA): The KV-Cache Compression Behind DeepSeek-V3

How DeepSeek's Multi-head Latent Attention compresses the KV cache with low-rank projections and decoupled RoPE, cutting memory use substantially relative to standard multi-head attention (MHA) at equal or better quality.

May 23, 2026 · 2 minutes reading time

llm
mixture-of-experts
architecture
deepseek-v3
qwen
llama

Mixture-of-Experts Design: DeepSeek-V3, Qwen 3, Llama 4 Compared

Fine-grained experts, shared experts, and auxiliary-loss-free routing: the modern MoE recipe in 2026, with a side-by-side comparison of DeepSeek-V3, Qwen 3, and Llama 4.

May 22, 2026 · 1 minute reading time