3 posts with tag pre-training
Pre-Training Data Sources & Token Budgets: From Common Crawl to 36T Tokens
Where 2026 frontier LLMs get their pre-training data — Common Crawl, FineWeb, DCLM, StackV2, multilingual, PDFs — and how Qwen 3, DeepSeek-V3, Llama 3.1 sized their corpora.
·
2 minutes reading time
LLM Pre-Training in 2026: The Frontier in Numbers
The state of frontier LLM pre-training in 2026 — token counts, parameter counts, cluster sizes, costs, and what it all means for CTOs and ML leads.
·
3 minutes reading time
Quality Filtering for LLM Pre-Training: FineWeb-Edu, DCLM, Nemotron-CC, Ultra-FineWeb
How frontier labs filter trillions of web tokens — heuristic, perplexity-based, and classifier-based filtering, with concrete recipes from FineWeb-Edu and DCLM.
·
2 minutes reading time