Training Cost Economics 2026

May 25, 2026

TL;DR

DeepSeek-V3 official training run: 2.788M H800-hours (2.664M pre-training + 0.119M context extension + 0.005M post-training) = $5.576M at $2/hr rental (paper, Table 1). Excludes R&D, salaries, ~$1B of owned H800 hardware, and prior ablations and experiments.
Llama 3.1 405B: ~$170M est., 3.8e25 FLOP on 16K H100s over ~54 days.
Grok 4: $490M median estimate (Epoch AI, 2025). Two independent methods (H100 rental pricing, and amortized hardware plus power) both yielded ~$490M.
GPT-5 total compute: ~5e25 FLOP (Epoch AI estimate; less than GPT-4.5 at >1e26).
OpenAI 2024 cloud spend: ~$7B (~$5B R&D + ~$2B inference); 2025 projected ~$9B R&D.
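The DeepSeek-V3 headline figure is plain rental arithmetic: GPU-hours times an assumed hourly rate. A minimal sketch of that calculation follows; the function name `rental_cost_usd` is illustrative and not taken from any of the cited sources.

```python
def rental_cost_usd(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Rental-priced cost of a training run: GPU-hours times an hourly rate."""
    if gpu_hours < 0 or usd_per_gpu_hour < 0:
        raise ValueError("gpu_hours and usd_per_gpu_hour must be non-negative")
    return gpu_hours * usd_per_gpu_hour


# Inputs quoted in the TL;DR for DeepSeek-V3.
DEEPSEEK_V3_H800_HOURS = 2.788e6   # H800-hours
ASSUMED_RENTAL_RATE = 2.0          # USD per H800-hour, the rate assumed in the paper

deepseek_v3_cost = rental_cost_usd(DEEPSEEK_V3_H800_HOURS, ASSUMED_RENTAL_RATE)
print(f"DeepSeek-V3 rental-priced cost: ${deepseek_v3_cost / 1e6:.3f}M")
```

The result moves linearly with the assumed rate, so the same GPU-hours priced at a different hourly rate give a proportionally different headline.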

What is and isn't in the headlines

Cost category	DeepSeek-V3 ($5.576M)	Llama 3.1 405B (~$170M est.)	GPT-5 (Epoch est.)
Final pre-training run	✅	✅	✅
Hardware capex	❌ (rental price)	partial	❌
Ablations / experiments	❌	❌	❌
R&D salaries	❌	❌	❌
Failed runs	❌	❌	❌

The CFO interpretation

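Headline numbers understate true cost by 5–20×, because they leave out the categories marked ❌ above: hardware capex, ablations and experiments, R&D salaries, and failed runs. Use Epoch AI's amortized-hardware methodology for "all-in" comparisons.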
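To make the 5–20× rule of thumb concrete, the sketch below turns a headline figure into an all-in range. It also includes a deliberately simplified amortized-cost helper. Both function names, the linear depreciation, and every parameter are assumptions made for illustration; this is not Epoch AI's actual model, which should be taken from their published methodology.

```python
def all_in_range_usd(headline_usd: float,
                     low_mult: float = 5.0,
                     high_mult: float = 20.0) -> tuple[float, float]:
    """Apply the article's 5-20x rule of thumb to a headline training cost."""
    return headline_usd * low_mult, headline_usd * high_mult


def amortized_run_cost_usd(hardware_capex_usd: float,
                           lifetime_years: float,
                           run_days: float,
                           energy_kwh: float,
                           usd_per_kwh: float) -> float:
    """Simplified amortized cost: the share of hardware capex consumed by the
    run (linear depreciation, an assumption) plus the energy bill."""
    hardware_share = hardware_capex_usd * run_days / (lifetime_years * 365.0)
    return hardware_share + energy_kwh * usd_per_kwh


# DeepSeek-V3 headline from the TL;DR, scaled by the rule of thumb.
low, high = all_in_range_usd(5.576e6)
print(f"All-in range: ${low / 1e6:.2f}M to ${high / 1e6:.2f}M")
```

Applied to the DeepSeek-V3 headline, the rule of thumb gives roughly $28M to $112M. The amortized helper needs real capex, lifetime, and energy inputs, which the article does not supply, so it is left without an example.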

References

arXiv:2412.19437; arXiv:2407.21783; epoch.ai/gradient-updates/why-gpt5-used-less-training-compute-than-gpt45-but-gpt6-probably-wont; epoch.ai/data-insights/grok-4-training-resources; epoch.ai/data-insights/openai-compute-spend
