Pre-Training Data Sources & Token Budgets: From Common Crawl to 36T Tokens

May 20, 2026

TL;DR

2026 frontier corpora range from 14.8T tokens (DeepSeek-V3) to 36T tokens (Qwen 3).
The web is the substrate: Common Crawl, FineWeb (15T), and DCLM form the open-data backbone; private corpora add books, code (StackV2, ~900B tokens, ~3T expanded), PDFs (Qwen 3 ran Qwen2.5-VL as an OCR model over PDF-like documents), and multilingual scrapes.
Mix recipes have shifted: ~5–30% of frontier tokens are now synthetic (textbooks, Q&A, code) generated by stronger teacher models.
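To get a feel for what a 5–30% synthetic share means in absolute terms, the range can be applied to the two corpus sizes quoted above. This is a purely illustrative sketch: the article gives the share for frontier corpora in general, not for any specific model, so the per-model numbers below are hypothetical bounds and not reported figures. The names `SYNTHETIC_SHARE` and `synthetic_range` are made up for this example.

```python
# Illustrative arithmetic only: applies the article's generic 5-30% synthetic
# share to two corpus sizes. These are NOT reported per-model figures.

SYNTHETIC_SHARE = (0.05, 0.30)  # low and high share, from the TL;DR above

corpora = {
    "DeepSeek-V3": 14.8e12,  # total pre-training tokens
    "Qwen 3": 36e12,
}


def synthetic_range(total_tokens, share=SYNTHETIC_SHARE):
    """Return (low, high) synthetic token counts implied by a share range."""
    low, high = share
    return total_tokens * low, total_tokens * high


ranges = {name: synthetic_range(total) for name, total in corpora.items()}

for name, (low, high) in ranges.items():
    print(f"{name}: {low / 1e12:.2f}T to {high / 1e12:.2f}T synthetic tokens")
```

For a 36T-token corpus the range works out to roughly 1.8T to 10.8T tokens, which at the upper end is larger than several of the open web corpora in the table below.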

Key facts

Corpus	Tokens	Source	Used by
FineWeb	15T	96 Common Crawl snapshots	open replications
FineWeb-Edu	1.3T	classifier-filtered subset of FineWeb	open models
DCLM-baseline	3.8T	Common Crawl + DCLM-pool filter	DCLM-7B
Nemotron-CC	6.3T (4.4T original + 1.9T synthetic)	CC + classifier ensembling + synthetic	NVIDIA models
Ultra-FineWeb-en	~1T	fastText-filtered FineWeb	MiniCPM
StackV2	~900B (3T expanded)	GitHub	code mixes
Llama 3.1 pre-train	15.6T	mixed, undisclosed	Llama 3.1 405B
DeepSeek-V3 pre-train	14.8T	"diverse and high-quality"	DeepSeek-V3
Qwen 3 pre-train	~36T	web + PDFs + synthetic	Qwen 3 (119 languages)
Llama 4 pre-train	>30T multimodal	text + image + video	Llama 4 family

How big is the frontier corpus and why?

Qwen 3's 36T tokens is roughly twice Qwen 2.5's 18T. The team writes: "we collected twice as many pre-training tokens — covering three times more languages." At that scale the corpus sits deep in the overtraining regime, far beyond the compute-optimal token count for the model sizes involved (article 21).

Llama 3.1 trained on 15.6T tokens for the 405B model, which Meta describes as roughly compute-optimal under their FLOP budget of 3.8×10²⁵.
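The Llama 3.1 numbers can be sanity-checked with the common rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token. The factor of 6 is an assumption brought in for this sketch, not something stated in the article, and the function name `approx_train_flops` is hypothetical.

```python
# Rule-of-thumb training compute for a dense transformer: C ~= 6 * N * D,
# where N is the parameter count and D is the number of training tokens.
# The factor 6 is an approximation and an assumption of this sketch.


def approx_train_flops(params: float, tokens: float) -> float:
    """Approximate training FLOPs under the 6 * N * D rule of thumb."""
    return 6.0 * params * tokens


llama_params = 405e9      # Llama 3.1 405B
llama_tokens = 15.6e12    # 15.6T tokens, as quoted above
reported_budget = 3.8e25  # FLOP budget quoted above

estimate = approx_train_flops(llama_params, llama_tokens)

print(f"estimated compute: {estimate:.2e} FLOPs")
print(f"ratio to quoted budget: {estimate / reported_budget:.3f}")
```

The estimate comes out at about 3.79×10²⁵ FLOPs, within 1% of the quoted budget, so the token count and the FLOP budget are mutually consistent under this approximation.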

DeepSeek-V3's 14.8T is the smallest of the three budgets, which fits a 671B-parameter MoE that activates only 37B parameters per token.
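One way to compare these budgets is tokens per parameter. The sketch below computes that ratio for Llama 3.1 405B and for DeepSeek-V3, counted against both its 37B active parameters and its 671B total parameters. The 671B total is not given in the table above and is added here; which of the two counts is the right denominator for an MoE is a judgment call, so both are shown. The helper name `tokens_per_param` is hypothetical.

```python
# Tokens-per-parameter ratios for the budgets discussed above.
# For the MoE model, both active and total parameter counts are shown,
# since it is debatable which one is the better denominator.


def tokens_per_param(tokens: float, params: float) -> float:
    """Training tokens divided by parameter count."""
    return tokens / params


models = {
    "Llama 3.1 405B (dense)": (15.6e12, 405e9),
    "DeepSeek-V3 (37B active)": (14.8e12, 37e9),
    "DeepSeek-V3 (671B total)": (14.8e12, 671e9),  # total count: added assumption
}

ratios = {name: tokens_per_param(t, p) for name, (t, p) in models.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} tokens per parameter")
```

The ratios are roughly 38.5 for Llama 3.1, 400 for DeepSeek-V3 per active parameter, and 22 per total parameter. Measured per active parameter, the smaller DeepSeek-V3 corpus is still a very heavy token budget.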
