1 post with tag fineweb

Pre-Training Data Sources & Token Budgets: From Common Crawl to 36T Tokens

Where 2026 frontier LLMs get their pre-training data (Common Crawl, FineWeb, DCLM, StackV2, multilingual text, PDFs) and how Qwen 3, DeepSeek-V3, and Llama 3.1 sized their corpora.

2 minutes reading time