Pre-Training Data Sources & Token Budgets: From Common Crawl to 36T Tokens

TL;DR

  • 2026 frontier corpora span 14.8T (DeepSeek-V3) → 36T (Qwen 3) tokens.
  • The web is the substrate — Common Crawl + FineWeb (15T) + DCLM form the open-data backbone; private corpora add books, code (StackV2 ~3T), PDFs (Qwen 3 used Qwen2.5-VL OCR), and multilingual scrapes.
  • Mix recipes have shifted: ~5–30% of frontier tokens are now synthetic (textbooks, Q&A, code) generated by stronger teacher models.

Key facts

Corpus                | Tokens              | Source                                 | Used by
FineWeb               | 15T                 | 96 Common Crawl snapshots              | open replications
FineWeb-Edu           | 1.3T                | classifier-filtered subset of FineWeb  | open models
DCLM-baseline         | 3.8T                | Common Crawl + DCLM-pool filter        | DCLM-7B
Nemotron-CC           | 6.3T unique         | CC + classifier ensembling + synthetic | NVIDIA models
Ultra-FineWeb-en      | ~1T                 | fastText-filtered FineWeb              | MiniCPM
StackV2               | ~900B (3T expanded) | GitHub                                 | code mixes
Llama 3.1 pre-train   | 15.6T               | mixed, undisclosed                     | Llama 3.1 405B
DeepSeek-V3 pre-train | 14.8T               | "diverse and high-quality"             | DeepSeek-V3
Qwen 3 pre-train      | ~36T                | web + PDFs + synthetic                 | Qwen 3 (119 languages)
Llama 4 pre-train     | >30T multimodal     | text + image + video                   | Llama 4 family

How big is the frontier corpus and why?

Qwen 3's 36T tokens is roughly 2× Qwen 2.5's 18T — the team writes: "we collected twice as many pre-training tokens — covering three times more languages." This pushes deep into the overtraining regime (article 21).

Meta trained the Llama 3.1 405B model on 15.6T tokens, a size it describes as roughly compute-optimal under its training budget of 3.8×10²⁵ FLOPs.

DeepSeek-V3's 14.8T corpus is smaller, which suits an MoE that activates only 37B parameters per token.
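
The figures above can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the common rule of thumb that training compute is about 6 × parameters × tokens, which is an approximation and not something the article or the model reports state in this form. The function names are hypothetical, and for DeepSeek-V3 it counts only the 37B active parameters mentioned above.

```python
# Rough sizing arithmetic for the token budgets quoted in this section.
# Assumption (not from the article): training FLOPs ~= 6 * N * D, where
# N is the number of parameters used per token and D is the token count.

def tokens_per_param(tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


def approx_train_flops(tokens: float, params: float) -> float:
    """Approximate training compute under the 6 * N * D rule of thumb."""
    return 6.0 * params * tokens


# (parameters, training tokens) taken from the text above.
# For DeepSeek-V3 the parameter count is the active-per-token figure.
models = {
    "Llama 3.1 405B": (405e9, 15.6e12),
    "DeepSeek-V3 (37B active)": (37e9, 14.8e12),
}

for name, (params, tokens) in models.items():
    ratio = tokens_per_param(tokens, params)
    flops = approx_train_flops(tokens, params)
    print(f"{name}: {ratio:.0f} tokens per parameter, ~{flops:.2e} FLOPs")
```

Under this approximation the Llama 3.1 405B estimate comes out close to the 3.8×10²⁵ FLOPs quoted above, which suggests the two numbers are consistent. For DeepSeek-V3 the ratio depends heavily on whether active or total parameters are used as the denominator, so the tokens-per-parameter value for an MoE should be read with that caveat in mind.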

Where do the tokens come from?
