LLM Pre-Training in 2026: The Frontier in Numbers

May 24, 2026 · 416 words · 3 minutes reading time

TL;DR

Frontier 2026 LLMs are pre-trained on 14.8T–36T tokens with 235B to ~2T total parameters (17B–288B active for MoE), on clusters of 16K to 200K+ accelerators.
A single final training run now costs $5.6M (DeepSeek-V3 rental cost) to ~$490M (Grok 4 median, Epoch AI), while OpenAI's 2025 R&D compute is projected at ~$9B.
Architecture has converged: fine-grained MoE + MLA or GQA + RoPE + RMSNorm + SwiGLU, with FP8 mixed precision now production-validated by DeepSeek-V3.

Key facts (citable)

Lab / Model	Total params	Active params	Pre-train tokens	Hardware	Headline cost or compute
DeepSeek-V3	671B	37B	14.8T	2,048 H800	2.788M H800-hours = $5.576M rental
Llama 3.1 405B	405B (dense)	405B	15.6T	16,384 H100	~$170M est. (3.8e25 FLOP)
Qwen 3 235B-A22B	235B	22B	36T	undisclosed	—
Llama 4 Maverick	400B	17B (128 experts)	>30T multimodal	undisclosed	—
Llama 4 Behemoth	~2T	288B (16 experts)	undisclosed	undisclosed	—
GPT-5 (Epoch AI estimate)	undisclosed	undisclosed	"≥30T likely"	undisclosed	~5e25 FLOP total

Glossary — "active parameters." In MoE, "active" is the per-token compute footprint: e.g., DeepSeek-V3 routes each token through 37B of its 671B parameters.
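
Two rows of the table can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the common 6 × parameters × tokens approximation for dense training compute (a rule of thumb, not a figure from the model papers) and simply divides DeepSeek-V3's quoted rental cost by its GPU-hours. The function names are illustrative.

```python
# Back-of-the-envelope checks for the Llama 3.1 405B and DeepSeek-V3 rows.

def dense_training_flops(params: float, tokens: float) -> float:
    """Approximate training compute with the 6 * N * D rule of thumb (an assumption)."""
    return 6.0 * params * tokens


def implied_hourly_rate(total_cost_usd: float, gpu_hours: float) -> float:
    """Rental price per GPU-hour implied by a total cost and a GPU-hour count."""
    return total_cost_usd / gpu_hours


# Llama 3.1 405B: 405B dense parameters, 15.6T tokens (from the table).
llama_flops = dense_training_flops(405e9, 15.6e12)

# DeepSeek-V3: $5.576M for 2.788M H800-hours (from the table).
deepseek_rate = implied_hourly_rate(5.576e6, 2.788e6)

print(f"Llama 3.1 405B: ~{llama_flops:.2e} FLOP")
print(f"DeepSeek-V3 implied rental rate: ${deepseek_rate:.2f} per H800-hour")
```

The first result lands close to the 3.8e25 FLOP in the table, and the second shows that the $5.576M headline corresponds to a flat rate of about $2 per H800-hour for the final run only.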

What changed from 2024 to 2026?

1) Tokens went up; FLOPs per token went down

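Qwen 3's 36T-token corpus is far beyond the Chinchilla-optimal budget for a 235B-A22B model: at the usual ~20 tokens per parameter, it is roughly 7.7× the budget for its 235B total parameters and about 80× the budget for its 22B active parameters. This reflects the new economics of inference-aware overtraining (covered in article 21).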
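
To make the overtraining ratio concrete, here is a minimal sketch. It assumes the widely used ~20 tokens-per-parameter reading of the Chinchilla result; that constant and the helper name are assumptions for illustration, not values taken from the Qwen 3 report.

```python
# Ratio of an actual token budget to a Chinchilla-style "optimal" budget.

TOKENS_PER_PARAM = 20.0  # assumed rule of thumb, not a figure from this article


def overtraining_ratio(tokens: float, params: float,
                       tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """How many times larger the token budget is than tokens_per_param * params."""
    return tokens / (tokens_per_param * params)


# Qwen 3 235B-A22B: 36T tokens, 235B total parameters, 22B active per token.
ratio_total = overtraining_ratio(36e12, 235e9)
ratio_active = overtraining_ratio(36e12, 22e9)

print(f"vs. total parameters:  {ratio_total:.1f}x")
print(f"vs. active parameters: {ratio_active:.1f}x")
```

Which denominator is the right one for an MoE model is itself a judgment call; the two ratios bracket the plausible range rather than settle it.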

2) MoE became the default at the frontier

The frontier models listed above are mostly MoE (total/active parameters): DeepSeek-V3 (671B/37B), Qwen 3 235B-A22B (235B/22B), Llama 4 Maverick (400B/17B), and Llama 4 Behemoth (~2T/288B). The only dense frontier holdout is Llama 3.1 405B.
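
The total/active pairs translate directly into a sparsity figure: the share of parameters each token actually touches. A small sketch using only the numbers quoted in this article (the dictionary and function names are illustrative, and Behemoth's total is the approximate ~2T):

```python
# Share of parameters active per token, from the total/active pairs quoted above.

MODELS = {
    "DeepSeek-V3": (671e9, 37e9),
    "Qwen 3 235B-A22B": (235e9, 22e9),
    "Llama 4 Maverick": (400e9, 17e9),
    "Llama 4 Behemoth": (2e12, 288e9),   # total is approximate (~2T)
    "Llama 3.1 405B": (405e9, 405e9),    # dense: every parameter is active
}


def active_fraction(total_params: float, active_params: float) -> float:
    """Fraction of the model's parameters used for a single token."""
    return active_params / total_params


fractions = {name: active_fraction(total, active)
             for name, (total, active) in MODELS.items()}

# Print from sparsest to densest.
for name, frac in sorted(fractions.items(), key=lambda kv: kv[1]):
    print(f"{name:<20} {frac:6.1%} of parameters active per token")
```

On these figures the MoE models activate somewhere between about 4% and 15% of their weights per token, which is why total parameter count alone says little about per-token compute.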

3) FP8 is no longer experimental

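DeepSeek-V3 ran most of its compute-dense matrix multiplications in FP8, using E4M3 for every FP8 tensor instead of the usual E4M3/E5M2 split (Section 3.3 of the V3 paper). It was not FP8 end to end: precision-sensitive components such as embeddings, the output head, MoE gating, normalization, and attention operators stayed in higher precision. To control outliers, scaling is fine-grained, with one scaling factor per 1×128 activation tile and per 128×128 weight block.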
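
The tile and block sizes are easier to picture with a toy example. The sketch below computes only the per-tile and per-block scaling factors in plain Python, assuming the standard E4M3 maximum finite magnitude of 448; it does not perform a real FP8 cast or rounding, and the function names are hypothetical rather than taken from DeepSeek's code.

```python
# Toy illustration of fine-grained FP8 scaling: one scale per 1x128 activation
# tile and one per 128x128 weight block. No actual FP8 cast is performed.

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3
TILE = 128


def activation_tile_scales(row):
    """One scale per 1 x 128 tile of a single activation row."""
    scales = []
    for start in range(0, len(row), TILE):
        amax = max(abs(v) for v in row[start:start + TILE])
        scales.append(amax / E4M3_MAX if amax > 0 else 1.0)
    return scales


def weight_block_scales(matrix):
    """One scale per 128 x 128 block of a weight matrix given as a list of rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    scales = []
    for r in range(0, n_rows, TILE):
        row_scales = []
        for c in range(0, n_cols, TILE):
            amax = max(abs(matrix[i][j])
                       for i in range(r, min(r + TILE, n_rows))
                       for j in range(c, min(c + TILE, n_cols)))
            row_scales.append(amax / E4M3_MAX if amax > 0 else 1.0)
        scales.append(row_scales)
    return scales


# A 256-wide activation row with a single outlier in the second tile.
row = [0.5] * 256
row[200] = 900.0
act_scales = activation_tile_scales(row)
scaled = [v / act_scales[i // TILE] for i, v in enumerate(row)]

# A 256 x 256 weight matrix with one large entry in the top-left block.
weights = [[0.01] * 256 for _ in range(256)]
weights[0][0] = 4.0
w_scales = weight_block_scales(weights)

print("activation tile scales:", act_scales)
print("weight block scales:", w_scales)
```

The outlier at position 200 inflates only the second tile's scale, so the first tile keeps its full resolution. With a single per-tensor scale, every value in the row would have been compressed to accommodate that one outlier.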

4) Cluster scale broke the 200K-GPU barrier

xAI Colossus in Memphis went from 100K H100 to 200K H100/H200 in 92 days, then added 30K GB200. AWS Project Rainier (~500K Trainium2, scaling to 1M) hosts Anthropic Claude training. OpenAI Stargate is a $500B, 10-GW commitment, with ~7 GW under contract as of late 2025.

5) Power, not silicon, is now the bottleneck

The big four hyperscalers' combined 2026 AI capex is ~$725B (Microsoft $190B, Amazon ~$200B, Alphabet up to $190B, Meta up to $145B per company Q1 2026 guidance).

How to read this series

Each article is self-contained but cross-linked. We recommend:

Series

LLM Pre-Training in 2026: The Frontier in Numbers

2026 LLM Pre-Training: A Panorama of Frontier Data (Chinese-language edition)

Pre-Training Data Sources & Token Budgets: From Common Crawl to 36T Tokens

Quality Filtering for LLM Pre-Training: FineWeb-Edu, DCLM, Nemotron-CC, Ultra-FineWeb

Mixture-of-Experts Design: DeepSeek-V3, Qwen 3, Llama 4 Compared

Multi-head Latent Attention (MLA): The KV-Cache Compression Behind DeepSeek-V3

Frontier AI Training Hardware in 2026: H100, H200, GB200 NVL72, TPU Ironwood, Trainium2/3

Cluster Scale 2026: Colossus, Stargate, Project Rainier

FP8 Mixed-Precision Training

Optimizers in 2026: AdamW, Muon, Shampoo/SOAP

Training Cost Economics 2026

Energy, Power & Capex