Optimizers in 2026: AdamW, Muon, Shampoo/SOAP

TL;DR

  • AdamW remains the production default at frontier scale (Llama 3.1, Qwen 3, GPT-5-era).
  • Muon (Jordan et al., 2024; Liu et al., arXiv:2502.16982) orthogonalizes the momentum update with Newton–Schulz iterations and shows ~2× compute efficiency vs. AdamW under compute-optimal training in Moonshot's Moonlight scaling study (3B-activated/16B-total MoE, 5.7T tokens).
  • However, Wen et al. (arXiv:2512.05620) find the gap shrinks to ~1.4× at 400M–1.5B and ~1.1× at 1.2B once AdamW is well-tuned, raising the question of whether headline 2× gains transfer to multi-billion-scale frontier runs.
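To make the headline ratios concrete, here is a small arithmetic sketch. It assumes "N× compute efficiency" means AdamW needs N times the training compute of Muon to reach the same loss; that reading is an assumption here, and `compute_fraction` is a hypothetical helper, not something from the cited papers.

```python
def compute_fraction(speedup: float) -> float:
    """Fraction of the baseline (AdamW) compute needed to match its loss,
    assuming speedup = baseline_compute / new_compute at equal loss."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


# The three ratios quoted in the TL;DR above.
for s in (2.0, 1.4, 1.1):
    frac = compute_fraction(s)
    print(f"{s:.1f}x -> {frac:.0%} of AdamW compute ({1 - frac:.0%} saved)")
```

Under that reading, the difference between 2× and 1.1× is the difference between halving the compute bill and trimming it by under a tenth, which is why the tuning of the AdamW baseline matters so much.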

Why Muon works

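Muon replaces each weight matrix's momentum update with an approximately orthogonal matrix, pushing all of its singular values toward 1. This prevents the update from being dominated by a few large-singular-value directions. Newton–Schulz iteration approximates the orthogonalization using only matrix multiplications, which makes the operation ~5–10× cheaper than a full SVD and keeps per-step overhead negligible on GPUs.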
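The orthogonalization step can be sketched in a few lines of NumPy. This is a minimal illustration, not a production optimizer: the quintic coefficients and the five-step default follow the public Muon reference implementation, while `muon_step` is a simplified, hypothetical update that leaves out Nesterov momentum, weight decay, and any shape-dependent learning-rate scaling.

```python
import numpy as np


def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D matrix with a quintic
    Newton-Schulz iteration (matrix multiplications only, no SVD)."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the Muon reference code
    # Frobenius normalization puts every singular value in (0, 1].
    X = G / (np.linalg.norm(G) + eps)
    # Work on the wide orientation so that X @ X.T is the smaller Gram matrix.
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X  # X <- a*X + b*(XX^T)X + c*(XX^T)^2 X
    return X.T if transposed else X


def muon_step(W, grad, momentum, lr=0.02, beta=0.95):
    """One simplified Muon-style update for a single weight matrix
    (hypothetical helper; hyperparameter defaults are illustrative)."""
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(momentum)
    return W - lr * update, momentum


rng = np.random.default_rng(0)
G = rng.standard_normal((64, 32))
O = newton_schulz_orthogonalize(G)
s_in = np.linalg.svd(G, compute_uv=False)
s_out = np.linalg.svd(O, compute_uv=False)
print("input  singular values: min %.2f, max %.2f" % (s_in.min(), s_in.max()))
print("output singular values: min %.2f, max %.2f" % (s_out.min(), s_out.max()))
```

With these coefficients the iteration does not converge to singular values of exactly 1. It trades precision for speed and lands them in a band around 1, which appears to be good enough for the optimizer in practice.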

Recommendation

  • Frontier dense / MoE runs: AdamW remains the lowest-risk choice; switch to Muon only after a μP-style hyperparameter sweep at small scale.
  • Sub-1B research: Muon's 1.4–2× gain is real and easy to pick up.
  • SOAP: a Shampoo variant; competitive with Muon at 100M–1B but heavier compute.

References

  • Jordan et al., 2024 (original Muon).
  • Liu et al., arXiv:2502.16982.
  • Wen et al., arXiv:2512.05620.
  • Du & Su, arXiv:2604.01472 (Newton-Muon).

Series

LLM Pre-Training 2026
