Optimizers in 2026: AdamW, Muon, Shampoo/SOAP
TL;DR
- AdamW remains the production default at frontier scale (Llama 3.1, Qwen 3, GPT-5-era models).
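- Muon (Jordan et al., 2024; Liu et al. arXiv:2502.16982) applies Newton–Schulz orthogonalization to the momentum of each weight matrix. In Moonshot's Moonlight scaling study (an MoE with 3B activated / 16B total parameters, trained on 5.7T tokens), it shows ~2× compute efficiency vs AdamW under compute-optimal training, i.e. it reaches a matched loss with roughly half the training FLOPs.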
- However, Wen et al. (arXiv:2512.05620) find that once AdamW is well tuned the gap shrinks as models grow, from ~1.4× at the smaller scales they test to ~1.1× at 1.2B. This raises the question of whether the headline 2× gain transfers to multi-billion-parameter frontier runs.
Why Muon works
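Muon orthogonalizes the momentum update of each weight matrix: the update is replaced by an approximation of the nearest semi-orthogonal matrix, so all of its singular directions contribute at a similar magnitude. This keeps the update from collapsing into a few dominant directions. Newton–Schulz iteration approximates the orthogonalization using only matrix multiplications, which makes it ~5–10× cheaper than a full SVD and keeps per-step overhead negligible on GPUs.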
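The orthogonalization step can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation. The quintic coefficients are the ones used in the public Muon code (Jordan et al., 2024) and are treated here as an assumption. The names `newton_schulz_orthogonalize` and `muon_step`, and the default `lr` and `beta` values, are hypothetical. The sketch also leaves out details that real implementations typically add, such as Nesterov momentum, shape-dependent update scaling, and low-precision matmuls.

```python
import numpy as np

# Quintic Newton-Schulz coefficients as used in the public Muon code
# (assumption: taken from Jordan et al., 2024, not derived here).
NS_COEFFS = (3.4445, -4.7750, 2.0315)


def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximate U @ V.T (from the SVD G = U S V.T) using only matmuls."""
    a, b, c = NS_COEFFS
    # Dividing by the Frobenius norm puts every singular value in [0, 1],
    # which is the range the iteration is designed for.
    X = G / (np.linalg.norm(G) + eps)
    # Work on the wide orientation so X @ X.T is the smaller Gram matrix.
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X  # applies a*s + b*s^3 + c*s^5 to each singular value
    return X.T if transposed else X


def muon_step(W, grad, momentum_buf, lr=0.02, beta=0.95):
    """One simplified Muon-style update for a single 2-D weight matrix."""
    momentum_buf = beta * momentum_buf + grad
    update = newton_schulz_orthogonalize(momentum_buf)
    return W - lr * update, momentum_buf


# Illustration on synthetic data: the singular values of the output
# cluster near 1 instead of spanning the input's wide range.
rng = np.random.default_rng(0)
G = rng.standard_normal((64, 128))
O = newton_schulz_orthogonalize(G)
s_in = np.linalg.svd(G, compute_uv=False)
s_out = np.linalg.svd(O, compute_uv=False)
print("input  singular value range:", s_in.min(), s_in.max())
print("output singular value range:", s_out.min(), s_out.max())
```

With these coefficients the iteration does not converge to singular values of exactly 1. It pushes them quickly into a band around 1, which is the accuracy-for-speed trade that keeps the step count small.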
Recommendation
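- Frontier dense / MoE runs: AdamW remains the lowest-risk choice; switch to Muon only after a μP-style hyperparameter sweep at small scale.
- Sub-1B research runs: the reported 1.4–2× gain from Muon holds up in this regime and is cheap to adopt.
- SOAP: a Shampoo variant that runs Adam in the eigenbasis of Shampoo's preconditioner; competitive with Muon at 100M–1B, but with heavier per-step compute.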
References
- Jordan et al. 2024 (Muon original)
- Liu et al. arXiv:2502.16982
- Wen et al. arXiv:2512.05620
- Du & Su arXiv:2604.01472 (Newton-Muon)