Mixture-of-Experts Design: DeepSeek-V3, Qwen 3, Llama 4 Compared
Mixture-of-Experts Design in 2026
TL;DR
- Sparse MoE with many routed experts is the de-facto 2026 recipe (DeepSeek-V3, Llama 4, Qwen 3 MoE variants). DeepSeek-V3 and Llama 4 add a shared expert, and DeepSeek-V3 pairs it with auxiliary-loss-free routing.
- Activation ratios are low and vary by model: DeepSeek-V3 activates about 5.5% of its parameters per token (37B/671B); Llama 4 Maverick activates 4.25% (17B/400B); Llama 4 Behemoth activates about 14% (288B/~2T).
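- Auxiliary-loss-free balancing (a per-expert bias, updated each step) replaces the classic load-balancing loss in DeepSeek-V3. The stated benefit is better model quality at comparable expert balance, because no balancing term competes with the language-modeling loss.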
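
The bias-based balancing in the last bullet can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not DeepSeek-V3's implementation: the router scores are synthetic, the skew toward low-index experts is invented for the demo, the update is assumed to be a fixed-size step `gamma` in the direction that reduces imbalance, and the bias is assumed to affect only which experts are selected. The names `route_top_k`, `update_bias`, and `simulate` are hypothetical.

```python
import random


def route_top_k(scores, bias, k):
    """Return the ids of the k experts with the highest (score + bias)."""
    biased = [s + b for s, b in zip(scores, bias)]
    return sorted(range(len(scores)), key=lambda e: biased[e], reverse=True)[:k]


def update_bias(bias, load, gamma):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - gamma)   # overloaded: make it less attractive
        elif n < mean_load:
            new_bias.append(b + gamma)   # underloaded: make it more attractive
        else:
            new_bias.append(b)
    return new_bias


def simulate(num_experts=8, k=2, tokens=256, steps=100, gamma=0.02, seed=0):
    """Route synthetic tokens for several steps; return per-step loads and final bias."""
    rng = random.Random(seed)
    # Assumed skew: low-index experts get systematically higher scores.
    skew = [0.5 - 0.1 * e for e in range(num_experts)]
    bias = [0.0] * num_experts
    history = []
    for _ in range(steps):
        load = [0] * num_experts
        for _ in range(tokens):
            scores = [skew[e] + rng.gauss(0.0, 0.3) for e in range(num_experts)]
            for e in route_top_k(scores, bias, k):
                load[e] += 1
        history.append(load)
        bias = update_bias(bias, load, gamma)  # one bias update per step
    return history, bias


if __name__ == "__main__":
    history, bias = simulate()
    print("first step load:", history[0])
    print("last step load: ", history[-1])
    print("final bias:     ", [round(b, 2) for b in bias])
```

In this toy setup the favored experts end up with negative bias and the neglected ones with positive bias, so the per-expert load evens out without adding any term to the training loss.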
Key facts
| Model | Total | Active | Experts | Top-k | Shared | Routing |
|---|---|---|---|---|---|---|
| DeepSeek-V3 | 671B | 37B | 256 routed + 1 shared (per MoE layer) | 8 | yes | aux-loss-free |
| Qwen 3 235B-A22B | 235B | 22B | 128 | 8 | no | softmax + small aux |
| Qwen 3 30B-A3B | 30B | 3B | 128 | 8 | no | softmax + small aux |
| Llama 4 Scout | 109B | 17B | 16 | 1 | yes | softmax |
| Llama 4 Maverick | 400B | 17B | 128 | 1 | yes | softmax |
| Llama 4 Behemoth | ~2T | 288B | 16 | 1 | yes | softmax |
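
The activation ratios quoted in the TL;DR follow directly from the Total and Active columns. The short script below recomputes them for every row as a sanity check. The parameter counts are copied from the table (in billions), Behemoth's total is taken as 2000B because the table gives only "~2T", and the helper names are illustrative.

```python
# (total, active) parameter counts in billions, copied from the table above.
# Behemoth's total is approximate (~2T), so its ratio is approximate too.
MODELS = {
    "DeepSeek-V3": (671, 37),
    "Qwen 3 235B-A22B": (235, 22),
    "Qwen 3 30B-A3B": (30, 3),
    "Llama 4 Scout": (109, 17),
    "Llama 4 Maverick": (400, 17),
    "Llama 4 Behemoth": (2000, 288),
}


def active_fraction(total_b, active_b):
    """Fraction of parameters used per token (active / total)."""
    return active_b / total_b


def activation_percentages(models=MODELS):
    """Map each model name to its active share in percent, rounded to 2 places."""
    return {
        name: round(100 * active_fraction(total, active), 2)
        for name, (total, active) in models.items()
    }


if __name__ == "__main__":
    for name, pct in activation_percentages().items():
        print(f"{name:<18} {pct:6.2f}%")
```

Run on these figures, Llama 4 Maverick comes out as the sparsest model (4.25%), with DeepSeek-V3 next at roughly 5.5%.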
How does auxiliary-loss-free routing work?
Classic MoE training adds an auxiliary loss that penalizes expert imbalance. DeepSeek-V3 instead maintains a per-expert bias (b_e) that is added to the routing scores when the top-k experts are selected and is adjusted after each training step: