Mixture-of-Experts Design: DeepSeek-V3, Qwen 3, Llama 4 Compared

Mixture-of-Experts Design in 2026

TL;DR

  • Fine-grained experts, usually paired with a shared expert, are the de-facto 2026 recipe. Of the three families compared here, only DeepSeek-V3 uses auxiliary-loss-free routing; Qwen 3 and Llama 4 keep softmax routing.
  • Sparsity ratios span a wide range: DeepSeek-V3 activates about 5.5% of its parameters per token (37B of 671B), Llama 4 Maverick 4.25% (17B of 400B), and Llama 4 Behemoth about 14% (288B of ~2T).
  • Auxiliary-loss-free balancing (a per-expert bias, updated each step) replaces the classic load-balancing loss as DeepSeek-V3's main balancing mechanism. DeepSeek reports better model quality at comparable expert balance.

Key facts

Model             Total params  Active params  Experts                                Top-k  Shared expert  Routing
DeepSeek-V3       671B          37B            256 routed + 1 shared (per MoE layer)  8      yes            aux-loss-free
Qwen 3 235B-A22B  235B          22B            128                                    8      no             softmax + small aux
Qwen 3 30B-A3B    30B           3B             128                                    8      no             softmax + small aux
Llama 4 Scout     109B          17B            16                                     1      yes            softmax
Llama 4 Maverick  400B          17B            128                                    1      yes            softmax
Llama 4 Behemoth  ~2T           288B           16                                     1      yes            softmax
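
The fraction of parameters active per token follows directly from the Total and Active figures listed above. The short Python sketch below recomputes it for each model; the helper name `active_fraction` is made up for illustration, and Behemoth's "~2T" is treated as exactly 2000B, which is an assumption.

```python
def active_fraction(active_b: float, total_b: float) -> float:
    """Return the share of parameters used per token (both sizes in billions)."""
    return active_b / total_b


# (active, total) in billions of parameters, taken from the table above.
# Behemoth's total is only given as "~2T"; 2000 is an assumed round value.
MODELS = {
    "DeepSeek-V3": (37, 671),
    "Qwen 3 235B-A22B": (22, 235),
    "Qwen 3 30B-A3B": (3, 30),
    "Llama 4 Scout": (17, 109),
    "Llama 4 Maverick": (17, 400),
    "Llama 4 Behemoth": (288, 2000),
}

if __name__ == "__main__":
    for name, (active, total) in MODELS.items():
        print(f"{name}: {100 * active_fraction(active, total):.2f}% active")
```

On these numbers Maverick is the sparsest model (4.25%) and Scout the densest (about 15.6%), even though both activate the same 17B parameters.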

How does auxiliary-loss-free routing work?

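Classic MoE adds an auxiliary loss that penalizes expert imbalance. DeepSeek-V3 instead maintains a per-expert bias b_e that is added to the routing scores when the top-k experts are selected. After each training step the bias is lowered for overloaded experts and raised for underloaded ones. The gating weights that scale each expert's output are still computed from the unbiased scores, so the bias steers which experts are chosen without changing how their outputs are mixed.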
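
As a rough illustration of the bias mechanism, here is a minimal NumPy sketch. It is not DeepSeek's code: the function names `route_with_bias` and `update_bias`, the step size `gamma`, and the use of the mean load as the overload threshold are assumptions made for this example. Scores are assumed to be positive (for example sigmoid outputs).

```python
import numpy as np


def route_with_bias(scores, bias, top_k):
    """Select top_k experts per token from biased scores; gate with unbiased scores.

    scores: (num_tokens, num_experts) positive affinities.
    bias:   (num_experts,) balancing bias b_e, used for selection only.
    """
    biased = scores + bias
    # Indices of the top_k largest biased scores for every token.
    top_idx = np.argsort(-biased, axis=-1)[:, :top_k]
    # Gating weights come from the original scores, normalized over the selection.
    gates = np.take_along_axis(scores, top_idx, axis=-1)
    gates = gates / gates.sum(axis=-1, keepdims=True)
    return top_idx, gates


def update_bias(bias, top_idx, num_experts, gamma=0.001):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    load = np.bincount(top_idx.ravel(), minlength=num_experts)
    # sign() is +1 above the mean load, -1 below it, and 0 when exactly balanced.
    return bias - gamma * np.sign(load - load.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(512, 8))
    logits[:, 0] += 2.0                      # make expert 0 artificially popular
    scores = 1.0 / (1.0 + np.exp(-logits))   # sigmoid affinities
    bias = np.zeros(8)
    for _ in range(200):
        top_idx, _ = route_with_bias(scores, bias, top_k=2)
        bias = update_bias(bias, top_idx, num_experts=8, gamma=0.01)
    print("per-expert load:", np.bincount(top_idx.ravel(), minlength=8))
    print("bias:", np.round(bias, 2))
```

Because the bias only shifts selection, no balancing gradient flows into the model weights, which is the point of dropping the auxiliary loss.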

Series: LLM Pre-Training 2026