Mixture-of-Experts Design: DeepSeek-V3, Qwen 3, Llama 4 Compared

May 22, 2026

Mixture-of-Experts Design in 2026

TL;DR

Fine-grained experts, a shared expert, and auxiliary-loss-free routing make up the de-facto 2026 MoE recipe. DeepSeek-V3 uses all three; Llama 4 and the Qwen 3 MoE variants adopt parts of it (see Key facts).
Sparsity ratios have widened: DeepSeek-V3 activates 5.5% (37B/671B); Llama 4 Maverick activates 4.25% (17B/400B); Behemoth activates 14% (288B/2T).
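Auxiliary-loss-free balancing (a per-expert bias, updated each training step) replaces the classic load-balancing loss in DeepSeek-V3. It keeps expert load balanced without the interfering gradients that an auxiliary loss adds to the language-modeling objective.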
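The activation percentages above are simply active parameters divided by total parameters. A minimal sketch of that arithmetic, using the figures quoted in this TL;DR (the `models` dictionary and `activation_ratio` helper are illustrative names, and Behemoth's total is taken as a round 2,000B for the approximate ~2T figure):

```python
# Activation ratio = active parameters / total parameters (both in billions).
# Figures are the ones quoted in the TL;DR; Behemoth's total is approximate.
models = {
    "DeepSeek-V3": (37, 671),
    "Llama 4 Maverick": (17, 400),
    "Llama 4 Behemoth": (288, 2000),
}


def activation_ratio(active_b, total_b):
    """Fraction of the model's parameters used for each token."""
    return active_b / total_b


ratios = {name: activation_ratio(a, t) for name, (a, t) in models.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2%}")
```

The ratios come out at roughly 5.5%, 4.25%, and 14%, matching the rounded values in the bullet.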

Key facts

Model	Total params	Active params	Experts	Top-k	Shared expert	Routing
DeepSeek-V3	671B	37B	256 routed + 1 shared (per MoE layer)	8	yes	aux-loss-free
Qwen 3 235B-A22B	235B	22B	128	8	no	softmax + small aux
Qwen 3 30B-A3B	30B	3B	128	8	no	softmax + small aux
Llama 4 Scout	109B	17B	16	1	yes	softmax
Llama 4 Maverick	400B	17B	128	1	yes	softmax
Llama 4 Behemoth	~2T	288B	16	1	yes	softmax
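To make the Top-k and Shared columns concrete, here is a toy sketch of one MoE layer routing a single token. It assumes plain softmax routing, gates renormalized over the selected experts, and linear maps standing in for the expert feed-forward networks. All names (`moe_forward`, `rand_matrix`, the tiny dimensions) are hypothetical and not taken from any of the models above.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def moe_forward(x, router_w, expert_ws, shared_w, top_k):
    """Route one token through a shared expert plus its top_k routed experts."""
    probs = softmax(matvec(router_w, x))          # one probability per routed expert
    chosen = sorted(range(len(probs)), key=lambda e: -probs[e])[:top_k]
    norm = sum(probs[e] for e in chosen)
    gates = {e: probs[e] / norm for e in chosen}  # renormalize over the chosen experts
    out = matvec(shared_w, x)                     # the shared expert sees every token
    for e, g in gates.items():
        y = matvec(expert_ws[e], x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, gates


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d, n_experts, top_k = 4, 16, 2                    # toy sizes, far smaller than real models
x = [random.gauss(0, 1) for _ in range(d)]
out, gates = moe_forward(
    x,
    rand_matrix(n_experts, d),                      # router: one row per routed expert
    [rand_matrix(d, d) for _ in range(n_experts)],  # toy linear routed experts
    rand_matrix(d, d),                              # toy linear shared expert
    top_k,
)
print(len(out), sorted(gates))
```

With `top_k = 1`, as in the Llama 4 rows, the single selected gate renormalizes to 1.0 in this sketch, so each token's output is the shared expert plus exactly one routed expert.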

How does auxiliary-loss-free routing work?

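Classic MoE training adds an auxiliary loss that penalizes expert imbalance. DeepSeek-V3 instead maintains a per-expert bias (b_e) that is added to the routing scores when the top-k experts are selected. After each training step, b_e is decreased by a fixed amount for experts that received more than their share of tokens and increased for experts that received less. The bias affects only which experts are selected; the gating weights that scale each expert's output are still computed from the unbiased scores.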
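A minimal simulation of the bias update, under stated assumptions: routing scores are random numbers with a fixed hypothetical skew toward higher-index experts (standing in for a learned router), "overloaded" means above the mean load for the step, and the step size `gamma` is an arbitrary illustrative value. This sketches the idea and is not DeepSeek-V3's training code.

```python
import random


def route_top_k(scores, bias, top_k):
    """Select top_k experts by biased score; the bias only affects selection."""
    order = sorted(range(len(scores)), key=lambda e: -(scores[e] + bias[e]))
    return order[:top_k]


def update_bias(bias, load, gamma):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, count in zip(bias, load):
        if count > mean_load:
            new_bias.append(b - gamma)
        elif count < mean_load:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias


random.seed(0)
n_experts, top_k, tokens_per_step, gamma = 8, 2, 512, 0.01
# Hypothetical router preference: higher-index experts get systematically higher scores.
skew = [0.5 * e / n_experts for e in range(n_experts)]


def step_load(bias):
    """Route one batch of tokens and count how many land on each expert."""
    load = [0] * n_experts
    for _ in range(tokens_per_step):
        scores = [random.random() + skew[e] for e in range(n_experts)]
        for e in route_top_k(scores, bias, top_k):
            load[e] += 1
    return load


bias = [0.0] * n_experts
first_load = step_load(bias)          # load before any bias correction
for _ in range(200):                  # one bias update per simulated training step
    bias = update_bias(bias, step_load(bias), gamma)
last_load = step_load(bias)           # load after the biases have settled

print("load spread before:", max(first_load) - min(first_load))
print("load spread after: ", max(last_load) - min(last_load))
```

Because the update uses only the sign of each expert's load error, it adds no gradient term to the training loss; the biases settle near values that offset the router's skew and then oscillate by about `gamma` around them.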
