12 posts in series LLM Pre-Training 2026
LLM Pre-Training in 2026: The Frontier in Numbers
The state of frontier LLM pre-training in 2026 — token counts, parameter counts, cluster sizes, costs, and what it all means for CTOs and ML leads.
Pre-Training Data Sources & Token Budgets: From Common Crawl to 36T Tokens
Where 2026 frontier LLMs get their pre-training data (Common Crawl, FineWeb, DCLM, StackV2, multilingual, PDFs) and how Qwen 3, DeepSeek-V3, and Llama 3.1 sized their corpora.
Quality Filtering for LLM Pre-Training: FineWeb-Edu, DCLM, Nemotron-CC, Ultra-FineWeb
How frontier labs filter trillions of web tokens — heuristic, perplexity-based, and classifier-based filtering, with concrete recipes from FineWeb-Edu and DCLM.
Mixture-of-Experts Design: DeepSeek-V3, Qwen 3, Llama 4 Compared
Fine-grained experts, shared experts, auxiliary-loss-free routing: the modern MoE recipe in 2026, with a side-by-side comparison of DeepSeek-V3, Qwen 3, and Llama 4.
Multi-head Latent Attention (MLA): The KV-Cache Compression Behind DeepSeek-V3
How DeepSeek's Multi-head Latent Attention compresses the KV cache via low-rank projections and decoupled RoPE, achieving large memory reductions versus standard multi-head attention (MHA) at equal or better quality.
Frontier AI Training Hardware in 2026: H100, H200, GB200 NVL72, TPU Ironwood, Trainium2/3
Side-by-side specifications of every chip and pod system used at the 2026 LLM frontier, with primary-source numbers from NVIDIA, Google Cloud, and AWS.
Cluster Scale 2026: Colossus, Stargate, Project Rainier
From 16K to 1M-GPU systems — how xAI built Colossus in 122 days, how AWS deployed 500K Trainium2 chips for Anthropic, and what OpenAI's $500B Stargate commitment actually covers.