Optimizers in 2026: AdamW, Muon, Shampoo/SOAP

May 23, 2026

TL;DR

AdamW remains the production default at frontier scale (Llama 3.1, Qwen 3, and GPT-5-era models).
Muon (Jordan et al., 2024; Liu et al. arXiv:2502.16982) orthogonalizes the momentum with a Newton–Schulz iteration. In Moonshot's Moonlight scaling study (3B/16B MoE, 5.7T tokens) it shows ~2× compute efficiency vs AdamW under compute-optimal training.
However, Wen et al. (arXiv:2512.05620) find the gap shrinks to ~1.4× at 400M–1.5B and ~1.1× at 1.2B once AdamW is well-tuned, raising the question of whether headline 2× gains transfer to multi-billion-scale frontier runs.
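
To put the ratios above in perspective, here is a small sketch that converts a compute-efficiency ratio into the fraction of training FLOPs saved. It assumes "k× compute efficiency" means AdamW needs k× the FLOPs to reach the same loss; the helper name `flops_saved` is made up for illustration.

```python
def flops_saved(efficiency_ratio: float) -> float:
    """Fraction of the baseline's training FLOPs saved at equal loss.

    Assumes a ratio of k means the baseline needs k times the FLOPs.
    """
    if efficiency_ratio <= 0:
        raise ValueError("efficiency_ratio must be positive")
    return 1.0 - 1.0 / efficiency_ratio


if __name__ == "__main__":
    # Ratios quoted in the TL;DR: 2x, 1.4x, and 1.1x.
    for ratio in (2.0, 1.4, 1.1):
        print(f"{ratio:.1f}x -> {flops_saved(ratio):.0%} fewer FLOPs")
```

Under that reading, 2× halves the compute bill, while 1.1× saves roughly 9%, which is why the shrinking gap matters so much for the frontier-scale decision.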

Why Muon works

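Muon replaces each weight matrix's momentum M = UΣVᵀ with (approximately) its nearest semi-orthogonal matrix UVᵀ, pushing every singular value toward 1. This prevents the update from collapsing into a few dominant directions. The Newton–Schulz iteration approximates UVᵀ using only matrix multiplications, which makes the operation ~5–10× cheaper than a full SVD, with negligible per-step overhead at GPU scale.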
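
The idea can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the quintic coefficients and the 5-step default follow Jordan's public Muon reference code as I understand it, while the function names, the learning rate, the momentum value, and the plain (non-Nesterov) momentum are assumptions made for the example. Real implementations typically run in bfloat16 on GPU and add update scaling and weight decay.

```python
import numpy as np


def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximate U @ V.T for G = U S V.T using only matmuls.

    Quintic iteration; coefficients assumed from the public Muon
    reference code. Singular values land near 1, not exactly at 1.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    # Frobenius normalization puts every singular value in [0, 1].
    X = G / (np.linalg.norm(G) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation so X @ X.T is small
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X  # a*X + b*(XX^T)X + c*(XX^T)^2 X
    return X.T if transposed else X


def muon_step(W, grad, buf, lr=0.02, momentum=0.95):
    """One simplified Muon-style update for a single weight matrix.

    lr and momentum are illustrative values, not tuned settings.
    """
    buf = momentum * buf + grad
    W = W - lr * newton_schulz_orthogonalize(buf)
    return W, buf


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.standard_normal((64, 128))
    s_in = np.linalg.svd(G, compute_uv=False)
    s_out = np.linalg.svd(newton_schulz_orthogonalize(G), compute_uv=False)
    print("input singular value spread: ", s_in.min(), s_in.max())
    print("output singular value spread:", s_out.min(), s_out.max())
```

Running the script shows the wide spread of singular values in the raw matrix being compressed into a narrow band around 1, which is the "no dominant directions" property described above.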

Recommendation

Frontier dense / MoE runs: AdamW remains the lowest-risk choice; switch to Muon only after validating it with a μP-style hyperparameter sweep at small scale.
Sub-1B research: Muon's 1.4–2× gain is real and easy to pick up.
SOAP: a Shampoo variant; competitive with Muon at 100M–1B, but at a higher compute cost.
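
To make the "heavier compute" point about the Shampoo family concrete, here is a hedged sketch of the basic Shampoo preconditioning step for one weight matrix. It is a simplification: the helper names are invented for the example, and practical implementations add damping, infrequent preconditioner refreshes, and grafting. SOAP, as I understand it, goes further and runs Adam-style updates in the eigenbasis of these preconditioners. The point to notice is the eigendecomposition of two dense matrices, where Muon needs only a handful of matmuls.

```python
import numpy as np


def sym_matrix_power(M, power, eps=1e-12):
    """Power of a symmetric PSD matrix via eigendecomposition.

    Eigenvalues are clamped at eps so negative powers stay finite.
    """
    w, Q = np.linalg.eigh(M)
    w = np.maximum(w, eps)
    return (Q * w**power) @ Q.T  # Q @ diag(w**power) @ Q.T


def shampoo_direction(G, L, R):
    """Accumulate Shampoo statistics and return the preconditioned update.

    L tracks G @ G.T (rows), R tracks G.T @ G (columns).
    """
    L = L + G @ G.T
    R = R + G.T @ G
    update = sym_matrix_power(L, -0.25) @ G @ sym_matrix_power(R, -0.25)
    return update, L, R


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.standard_normal((6, 10))
    update, L, R = shampoo_direction(G, np.zeros((6, 6)), np.zeros((10, 10)))
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    # With no accumulated history, the update matches the orthogonalized gradient.
    print(np.allclose(update, U @ Vt, atol=1e-6))
```

With empty statistics, the Shampoo direction reduces to UVᵀ, the same orthogonalized matrix Muon targets; the two methods differ in how they accumulate history and in what each step costs.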

References

Jordan et al., 2024 (original Muon); Liu et al., arXiv:2502.16982; Wen et al., arXiv:2512.05620; Du & Su, arXiv:2604.01472 (Newton-Muon).
