Pre-Training Data Sources & Token Budgets: From Common Crawl to 36T Tokens
TL;DR
- As of 2026, frontier pre-training corpora range from 14.8T tokens (DeepSeek-V3) to 36T tokens (Qwen 3).
- The web is the substrate: Common Crawl, FineWeb (15T), and DCLM form the open-data backbone; private corpora add books, code (StackV2, ~900B tokens, ~3T expanded), PDFs (Qwen 3 extracted their text with Qwen2.5-VL), and multilingual scrapes.
- Mix recipes have shifted: ~5–30% of frontier tokens are now synthetic (textbooks, Q&A, code) generated by stronger teacher models.
Key facts
| Corpus | Tokens | Source | Used by |
|---|---|---|---|
| FineWeb | 15T | 96 Common Crawl snapshots | open replications |
| FineWeb-Edu | 1.3T | classifier-filtered subset of FineWeb | open models |
| DCLM-baseline | 3.8T | Common Crawl + DCLM-pool filter | DCLM-7B |
| Nemotron-CC | 6.3T (4.4T deduplicated original + 1.9T synthetic) | CC + classifier ensembling + synthetic | NVIDIA models |
| Ultra-FineWeb-en | ~1T | fastText-filtered FineWeb | MiniCPM |
| StackV2 | ~900B (3T expanded) | Software Heritage archive (largely GitHub-hosted code) | code mixes |
| Llama 3.1 pre-train | 15.6T | mixed, undisclosed | Llama 3.1 405B |
| DeepSeek-V3 pre-train | 14.8T | "diverse and high-quality" | DeepSeek-V3 |
| Qwen 3 pre-train | ~36T | web + PDFs + synthetic | Qwen 3 (119 languages) |
| Llama 4 pre-train | >30T multimodal | text + image + video | Llama 4 family |
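To make the filtering rows concrete, the following is a minimal sketch of how much of FineWeb survives each quality filter, using only the token counts in the table. It assumes the counts are directly comparable (same tokenizer and counting method), which the table does not guarantee; the names `retention` and `retention_by_subset` are illustrative, not from any library.

```python
# Token counts in trillions, copied from the table above.
FINEWEB_T = 15.0
subsets_t = {
    "FineWeb-Edu": 1.3,
    "Ultra-FineWeb-en": 1.0,  # the table says "~1T", so this is approximate
}


def retention(subset_tokens_t: float, parent_tokens_t: float) -> float:
    """Fraction of the parent corpus's tokens that a filtered subset keeps."""
    return subset_tokens_t / parent_tokens_t


# Assumption: both subsets are measured against the full 15T FineWeb corpus.
retention_by_subset = {
    name: retention(tokens_t, FINEWEB_T) for name, tokens_t in subsets_t.items()
}

for name, frac in retention_by_subset.items():
    print(f"{name}: keeps about {frac:.1%} of FineWeb tokens")
```

Under these assumptions both filters keep under a tenth of FineWeb (roughly 8.7% and 6.7%), which shows how aggressively quality classifiers cut a web corpus.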
How big is the frontier corpus and why?
Qwen 3's 36T tokens is roughly 2× Qwen 2.5's 18T — the team writes: "we collected twice as many pre-training tokens — covering three times more languages." This pushes deep into the overtraining regime (article 21).
Llama 3.1 trained on 15.6T tokens for the 405B model, which Meta describes as roughly compute-optimal under their FLOP budget of 3.8×10²⁵.
DeepSeek-V3's 14.8T is smaller, which fits a mixture-of-experts model that activates only 37B parameters per token.
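The figures above can be sanity-checked with a short calculation. This sketch assumes the common approximation that training compute is about 6 × parameters × tokens, which is a rule of thumb and not something the cited teams state in this article. The function names are illustrative, and DeepSeek-V3's 671B total parameter count is a figure from outside this article.

```python
def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute with the C ~ 6 * N * D rule of thumb."""
    return 6.0 * params * tokens


def tokens_per_param(params: float, tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


# Llama 3.1 405B: 15.6T tokens (from the text above).
llama_flops = train_flops(405e9, 15.6e12)
llama_ratio = tokens_per_param(405e9, 15.6e12)

# DeepSeek-V3: 14.8T tokens, 37B active parameters (from the text above).
deepseek_active_ratio = tokens_per_param(37e9, 14.8e12)

# Assumption: 671B total parameters, a figure not given in this article.
deepseek_total_ratio = tokens_per_param(671e9, 14.8e12)

print(f"Llama 3.1 405B compute:       {llama_flops:.2e} FLOPs")
print(f"Llama 3.1 405B tokens/param:  {llama_ratio:.1f}")
print(f"DeepSeek-V3 tokens/active:    {deepseek_active_ratio:.0f}")
print(f"DeepSeek-V3 tokens/total:     {deepseek_total_ratio:.1f}")
```

The 6ND estimate for Llama 3.1 comes out at about 3.79×10²⁵ FLOPs, in line with the 3.8×10²⁵ budget quoted above. For DeepSeek-V3 the tokens-per-parameter ratio depends heavily on whether active or total parameters are counted, so comparisons between MoE and dense models should state which one they use.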