LLM Pre-Training in 2026: The Frontier in Numbers

TL;DR

  • Frontier 2026 LLMs are pre-trained on 14.8T–36T tokens, with 235B to ~2T total parameters (17B–288B active per token for MoE models), on clusters of roughly 2K to 200K+ accelerators.
  • A single final training run now costs $5.6M (DeepSeek-V3 rental cost) to ~$490M (Grok 4 median, Epoch AI), while OpenAI's 2025 R&D compute is projected at ~$9B.
  • Architecture has converged: fine-grained MoE + MLA or GQA + RoPE + RMSNorm + SwiGLU, with FP8 mixed precision now production-validated by DeepSeek-V3.

Key facts (citable)

| Lab / Model | Total params | Active | Pre-train tokens | Hardware | Headline cost |
|---|---|---|---|---|---|
| DeepSeek-V3 | 671B | 37B | 14.8T | 2,048 H800 | 2.788M H800-hours = $5.576M rental |
| Llama 3.1 405B | 405B (dense) | 405B | 15.6T | 16,384 H100 | ~$170M est. (3.8e25 FLOP) |
| Qwen 3 235B-A22B | 235B | 22B | 36T | undisclosed | |
| Llama 4 Maverick | 400B | 17B (128 experts) | >30T multimodal | undisclosed | |
| Llama 4 Behemoth | ~2T | 288B (16 experts) | undisclosed | undisclosed | |
| GPT-5 (Epoch AI estimate) | undisclosed | undisclosed | "≥30T likely" | undisclosed | ~5e25 FLOP total |
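
The compute and cost figures in the table can be sanity-checked with a little arithmetic. The sketch below assumes the common approximation of about 6 FLOP per active parameter per training token. That rule of thumb is an assumption brought in for illustration, not a method this article attributes to any lab, and the function name `approx_train_flop` is hypothetical.

```python
# Figures copied from the table above. The 6 * N * D rule is an approximation
# assumed for illustration, not a reported methodology.

def approx_train_flop(active_params: float, tokens: float) -> float:
    """Approximate training compute as 6 FLOP per active parameter per token."""
    return 6.0 * active_params * tokens

llama_flop = approx_train_flop(405e9, 15.6e12)    # dense: every parameter is active
deepseek_flop = approx_train_flop(37e9, 14.8e12)  # MoE: 37B active per token

# Rental price implied by the DeepSeek-V3 headline cost and GPU-hour count.
implied_usd_per_gpu_hour = 5.576e6 / 2.788e6

print(f"Llama 3.1 405B: {llama_flop:.2e} FLOP")
print(f"DeepSeek-V3:    {deepseek_flop:.2e} FLOP")
print(f"Implied H800 rate: ${implied_usd_per_gpu_hour:.2f}/GPU-hour")
```

Under this approximation the Llama 3.1 405B estimate lands close to the 3.8e25 FLOP quoted in the table, and the DeepSeek-V3 cost corresponds to a rate of $2 per H800-hour.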

Glossary — "active parameters." In MoE, "active" is the per-token compute footprint: e.g., DeepSeek-V3 routes each token through 37B of its 671B parameters.
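
To make the glossary entry concrete, here is a minimal sketch that computes the active-parameter share for the MoE models listed in the table. The parameter counts are copied from the table (Behemoth's total is the approximate ~2T figure); the helper name `active_fraction` is hypothetical.

```python
# (total, active) parameter counts in billions, copied from the table above.
models = {
    "DeepSeek-V3": (671, 37),
    "Qwen 3 235B-A22B": (235, 22),
    "Llama 4 Maverick": (400, 17),
    "Llama 4 Behemoth": (2000, 288),  # total is approximate (~2T)
}

def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that each token is routed through."""
    return active_b / total_b

fractions = {name: active_fraction(t, a) for name, (t, a) in models.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of parameters active per token")
```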

What changed from 2024 to 2026?

1) Tokens went up; FLOPs per token went down

Qwen 3's 36T-token corpus is more than 2× the Chinchilla-optimal budget even when measured against the model's 235B total parameters (only 22B are active per token), reflecting the new economics of inference-aware overtraining (covered in article 21).
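
The size of that overshoot can be estimated with a short calculation. This sketch assumes the widely used rule of thumb of roughly 20 training tokens per parameter as the Chinchilla-optimal budget. That constant is an assumption for illustration and does not come from this article; `overtraining_ratio` is a hypothetical helper.

```python
TOKENS_PER_PARAM = 20.0  # assumed Chinchilla rule of thumb, not a figure from this article

def overtraining_ratio(params: float, tokens: float) -> float:
    """How many times the assumed compute-optimal token budget was consumed."""
    optimal_tokens = TOKENS_PER_PARAM * params
    return tokens / optimal_tokens

qwen_tokens = 36e12
ratio_vs_total = overtraining_ratio(235e9, qwen_tokens)   # against 235B total params
ratio_vs_active = overtraining_ratio(22e9, qwen_tokens)   # against 22B active params

print(f"vs. total params:  {ratio_vs_total:.1f}x the assumed optimal budget")
print(f"vs. active params: {ratio_vs_active:.1f}x the assumed optimal budget")
```

Which parameter count is the right baseline for an MoE model is itself a judgment call, which is why both ratios are shown.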

2) MoE became the default at the frontier

The MoE roster, given as total/active parameters, now includes DeepSeek-V3 (671B/37B), Qwen 3 235B-A22B (235B/22B), Llama 4 Maverick (400B/17B), and Llama 4 Behemoth (~2T/288B). The only dense frontier holdout in the table is Llama 3.1 405B.

3) FP8 is no longer experimental

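DeepSeek-V3 trained with FP8 mixed precision, using the E4M3 format for every FP8 tensor rather than mixing E4M3 with E5M2 (Section 3.3 of the V3 paper). It quantizes activations in 1×128 tiles and weights in 128×128 blocks so that an outlier only affects the scale of its own tile, and it keeps precision-sensitive components (embedding, output head, MoE gating, normalization, and attention operators) in higher precision.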
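
The effect of tile-wise scaling on outliers can be illustrated in plain Python. This is a minimal sketch, not DeepSeek's kernel: it only computes one scaling factor per tile and performs no actual FP8 cast. It assumes 448 as the largest finite E4M3 magnitude, and `tile_scales` is a hypothetical helper.

```python
E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3 (assumed format constant)

def tile_scales(matrix, tile_rows, tile_cols):
    """Return one scale per tile so each tile's largest magnitude maps to E4M3_MAX."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    scales = []
    for r0 in range(0, n_rows, tile_rows):
        row_scales = []
        for c0 in range(0, n_cols, tile_cols):
            amax = max(
                abs(matrix[r][c])
                for r in range(r0, min(r0 + tile_rows, n_rows))
                for c in range(c0, min(c0 + tile_cols, n_cols))
            )
            # Guard against all-zero tiles, which need no scaling.
            row_scales.append(E4M3_MAX / amax if amax > 0 else 1.0)
        scales.append(row_scales)
    return scales

# Toy activations: 4 tokens x 256 channels, with a single large outlier.
activations = [[1.0] * 256 for _ in range(4)]
activations[0][5] = 1000.0
act_scales = tile_scales(activations, 1, 128)   # 1x128 tiles -> 4 x 2 scales

# Toy weights: 256 x 256, quantized in 128x128 blocks -> 2 x 2 scales.
weights = [[0.5] * 256 for _ in range(256)]
w_scales = tile_scales(weights, 128, 128)

print(act_scales[0])  # only the first tile of row 0 gets the small scale
print(act_scales[1])  # rows without outliers keep the full dynamic range
```

With a single per-tensor scale, the outlier would force every activation onto the small scale; with 1×128 tiles, only the 128 values sharing the outlier's tile lose resolution.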

4) Cluster scale broke the 200K-GPU barrier

xAI Colossus in Memphis went from 100K H100 to 200K H100/H200 in 92 days, then added 30K GB200. AWS Project Rainier (~500K Trainium2, scaling to 1M) hosts Anthropic Claude training. OpenAI Stargate is a $500B, 10-GW commitment, with ~7 GW under contract as of late 2025.

5) Power, not silicon, is now the bottleneck

The big four hyperscalers' combined 2026 AI capex is ~$725B (Microsoft $190B, Amazon ~$200B, Alphabet up to $190B, Meta up to $145B per company Q1 2026 guidance).

How to read this series

Each article is self-contained but cross-linked. We recommend:
