LLM Pre-Training in 2026: The Frontier in Numbers
TL;DR
- Frontier 2026 LLMs are pre-trained on 14.8T–36T tokens, with 235B to ~2T total parameters (17B–288B active for MoE models), on clusters ranging from about 2K accelerators (DeepSeek-V3) to 200K+.
- A single final training run now costs $5.6M (DeepSeek-V3 rental cost) to ~$490M (Grok 4 median, Epoch AI), while OpenAI's 2025 R&D compute is projected at ~$9B.
- Architecture has converged: fine-grained MoE + MLA or GQA + RoPE + RMSNorm + SwiGLU, with FP8 mixed precision now production-validated by DeepSeek-V3.
Key facts (citable)
| Lab / Model | Total params | Active | Pre-train tokens | Hardware | Headline cost |
|---|---|---|---|---|---|
| DeepSeek-V3 | 671B | 37B | 14.8T | 2,048 H800 | 2.788M H800-hours = $5.576M rental |
| Llama 3.1 405B | 405B (dense) | 405B | 15.6T | 16,384 H100 | ~$170M est. (3.8e25 FLOP) |
| Qwen 3 235B-A22B | 235B | 22B | 36T | undisclosed | — |
| Llama 4 Maverick | 400B | 17B (128 experts) | >30T multimodal | undisclosed | — |
| Llama 4 Behemoth | ~2T | 288B (16 experts) | undisclosed | undisclosed | — |
| GPT-5 (Epoch AI estimate) | undisclosed | undisclosed | "≥30T likely" | undisclosed | ~5e25 FLOP total |
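The compute and cost figures in the table can be sanity-checked with two rules of thumb. The sketch below assumes the common approximation of 6 FLOP per parameter per token for training compute (an approximation, not a figure from any of the labs; for MoE it is applied to active parameters), and a rental rate of $2 per H800-hour, which is the rate implied by the table's 2.788M hours and $5.576M. The function names are illustrative.

```python
def training_flop(active_params: float, tokens: float) -> float:
    """Approximate training compute using the 6 * N * D rule of thumb."""
    return 6.0 * active_params * tokens


def rental_cost_usd(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Rental cost of a run: GPU-hours times the hourly rate."""
    return gpu_hours * usd_per_gpu_hour


# Llama 3.1 405B: dense, so all 405B parameters are active for every token.
llama_flop = training_flop(405e9, 15.6e12)

# DeepSeek-V3: MoE, so only the 37B active parameters count per token.
deepseek_flop = training_flop(37e9, 14.8e12)

# DeepSeek-V3 rental cost; the $2/hour rate is an assumption implied by the table.
deepseek_cost = rental_cost_usd(2.788e6, 2.0)

print(f"Llama 3.1 405B: {llama_flop:.2e} FLOP")
print(f"DeepSeek-V3:    {deepseek_flop:.2e} FLOP")
print(f"DeepSeek-V3 rental: ${deepseek_cost / 1e6:.3f}M")
```

The Llama estimate lands close to the 3.8e25 FLOP quoted in the table, and the DeepSeek-V3 estimate is roughly an order of magnitude smaller despite a similar token count, which is the MoE effect the rest of this article describes.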
Glossary — "active parameters." In MoE, "active" is the per-token compute footprint: e.g., DeepSeek-V3 routes each token through 37B of its 671B parameters.
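To make the glossary entry concrete, the following sketch computes the fraction of parameters that are active per token for the MoE models in the table. The dictionary and function names are illustrative, and the Behemoth total is the approximate ~2T figure from the table.

```python
# (total parameters, active parameters per token), taken from the table above.
MOE_MODELS = {
    "DeepSeek-V3": (671e9, 37e9),
    "Qwen 3 235B-A22B": (235e9, 22e9),
    "Llama 4 Maverick": (400e9, 17e9),
    "Llama 4 Behemoth": (2e12, 288e9),  # total is approximate (~2T)
}


def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the model's parameters that each token is routed through."""
    return active_params / total_params


for name, (total, active) in MOE_MODELS.items():
    share = active_fraction(total, active)
    print(f"{name}: {share:.1%} of parameters active per token")
```

For DeepSeek-V3 this works out to about 5.5%, which is why its per-token compute resembles that of a 37B dense model even though it stores 671B parameters.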
What changed from 2024 to 2026?
1) Tokens went up; FLOPs per token went down
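Qwen 3's 36T-token corpus is far beyond the Chinchilla-optimal budget for Qwen 3 235B-A22B (235B total, 22B active). At the common rule of thumb of ~20 tokens per parameter, it is roughly 8× the optimal budget counted on total parameters and roughly 80× counted on active parameters, reflecting the new economics of inference-aware overtraining (covered in article 21).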
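The overtraining claim can be checked with a few lines of arithmetic. This sketch assumes the widely used approximation of 20 training tokens per parameter for a Chinchilla-optimal run; that constant is a rule of thumb rather than an exact value, and whether to count total or active parameters for an MoE model is itself a judgment call, so both are shown.

```python
# Assumption: ~20 tokens per parameter approximates the Chinchilla optimum.
TOKENS_PER_PARAM = 20.0


def overtraining_ratio(tokens: float, params: float,
                       tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """How many times larger the corpus is than the Chinchilla-optimal budget."""
    optimal_tokens = tokens_per_param * params
    return tokens / optimal_tokens


QWEN3_TOKENS = 36e12

ratio_total = overtraining_ratio(QWEN3_TOKENS, 235e9)   # counted on total params
ratio_active = overtraining_ratio(QWEN3_TOKENS, 22e9)   # counted on active params

print(f"vs. 235B total:  {ratio_total:.1f}x")
print(f"vs. 22B active: {ratio_active:.1f}x")
```

Either way the corpus is several times larger than the compute-optimal budget, which is the point of inference-aware overtraining: spend more on training once to get a model that is cheaper to serve.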
2) MoE became the default at the frontier
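Most of the models with disclosed architectures in the table are MoE (total/active parameters): DeepSeek-V3 (671B/37B), Qwen 3 235B-A22B (235B/22B), Llama 4 Maverick (400B/17B), and Llama 4 Behemoth (~2T/288B). The only dense frontier holdout is Llama 3.1 405B.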
3) FP8 is no longer experimental
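DeepSeek-V3 was trained with an FP8 mixed-precision framework (Section 3.3 of the V3 paper): most compute-heavy GEMMs run in FP8, using the E4M3 format for all FP8 tensors instead of the usual E4M3-forward/E5M2-backward split, while sensitive components such as the embedding, output head, MoE gating, normalization, and attention operators stay in BF16 or FP32. Outliers are controlled with fine-grained scaling: 1×128 tiles for activations and 128×128 blocks for weights.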
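The idea behind tile-wise scaling can be illustrated in plain Python. This is a minimal sketch, not DeepSeek's kernel: it simulates E4M3 rounding in software (3 mantissa bits, maximum finite value 448), covers only the 1×128 activation case, and all function names are hypothetical. The weight side would apply the same idea to 128×128 blocks.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the E4M3 format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value on a simulated E4M3 grid (saturating)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    _, e = math.frexp(mag)        # mag = m * 2**e with m in [0.5, 1)
    exp = max(e - 1, -6)          # clamp to the minimum normal exponent
    step = 2.0 ** (exp - 3)       # 3 mantissa bits: 8 steps per binade
    return sign * min(round(mag / step) * step, E4M3_MAX)


def quantize_tile(tile):
    """Scale one tile so its largest magnitude maps to E4M3_MAX, then round."""
    amax = max(abs(v) for v in tile)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in tile], scale


def quantize_row(row, tile_size=128):
    """Split a row of activations into 1 x tile_size tiles, one scale each."""
    return [quantize_tile(row[i:i + tile_size])
            for i in range(0, len(row), tile_size)]


def dequantize(tiles):
    """Undo the per-tile scaling to recover approximate original values."""
    return [q * scale for qs, scale in tiles for q in qs]


# Two tiles of small activations, with a single outlier in the second tile.
row = [0.01 * ((i % 7) - 3) for i in range(256)]
row[200] = 50.0

tiles = quantize_row(row)
restored = dequantize(tiles)
print("per-tile scales:", [scale for _, scale in tiles])
```

Because each tile carries its own scale, the outlier inflates the scale of the second tile only; the first tile keeps a small scale and therefore finer resolution for its values. With a single per-tensor scale, every value would share the outlier's coarse grid.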
4) Cluster scale broke the 200K-GPU barrier
xAI's Colossus in Memphis grew from 100K H100s to 200K H100/H200s in 92 days, then added 30K GB200s. AWS's Project Rainier (~500K Trainium2 chips, scaling to 1M) hosts Anthropic's Claude training. OpenAI's Stargate is a $500B, 10 GW commitment, with ~7 GW under contract as of late 2025.
5) Power, not silicon, is now the bottleneck
The big four hyperscalers' combined 2026 AI capex is ~$725B (Microsoft $190B, Amazon ~$200B, Alphabet up to $190B, Meta up to $145B, per each company's Q1 2026 guidance).
How to read this series
Each article is self-contained but cross-linked. We recommend: