FP8 Mixed-Precision Training

May 21, 2026 · 210 words · 2 minutes reading time

FP8 Mixed-Precision Training (DeepSeek-V3 recipe)

TL;DR

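DeepSeek-V3 (671B total parameters, 37B active per token) is presented by its authors as the first validation of FP8 training on an extremely large-scale model.
The team uses E4M3 on all tensors, rather than the hybrid scheme of E4M3 for the forward pass and E5M2 for gradients, which is enabled by fine-grained scaling over 1×128 activation tiles and 128×128 weight blocks.
The AdamW first and second moments are kept in BF16 instead of FP32 to save memory, while master weights and gradients stay in FP32; GEMM accumulation is periodically promoted to FP32 on CUDA cores (the NVIDIA CUTLASS promotion pattern).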
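
The promoted accumulation mentioned above can be illustrated with a small pure-Python simulation. This is a rough sketch under stated assumptions, not DeepSeek's kernel: the 14-bit accumulator width and the 128-element promotion interval are illustrative parameters, `round_mantissa`, `dot_naive`, and `dot_promoted` are hypothetical helper names, and a Python float stands in for the FP32 accumulator.

```python
import math

def round_mantissa(x, bits):
    """Round x to `bits` significant binary digits (simulates a narrow accumulator)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x = m * 2**e with 0.5 <= |m| < 1
    q = 2.0 ** bits
    return math.ldexp(round(m * q) / q, e)

def dot_naive(a, b, bits=14):
    """Accumulate every product in the narrow accumulator (assumed 14 bits)."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_mantissa(acc + x * y, bits)
    return acc

def dot_promoted(a, b, bits=14, interval=128):
    """Accumulate narrow partial sums, flushing to a wide accumulator every `interval` products."""
    total = 0.0      # wide accumulator (Python float standing in for FP32)
    partial = 0.0    # narrow accumulator
    for i, (x, y) in enumerate(zip(a, b), start=1):
        partial = round_mantissa(partial + x * y, bits)
        if i % interval == 0:
            total += partial
            partial = 0.0
    return total + partial          # add any leftover partial sum

a = [0.1] * 4096
b = [0.1] * 4096
exact = math.fsum(x * y for x, y in zip(a, b))
print("naive error:   ", abs(dot_naive(a, b) - exact))
print("promoted error:", abs(dot_promoted(a, b) - exact))
```

The narrow accumulator loses precision once the running sum is much larger than each new product, so keeping partial sums short and adding them in a wider type limits the drift.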

Recipe (verbatim from DeepSeek-V3, §3.3)

"we adopt the E4M3 format on all tensors for higher precision. We attribute the feasibility of this approach to our fine-grained quantization strategy, i.e., tile and block-wise scaling."

"we calculate the maximum absolute value online for each 1x128 activation tile or 128x128 weight block."
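
To make the quoted step concrete, here is a minimal pure-Python sketch of computing one scale per 1x128 activation tile and per 128x128 weight block. The helper names `tile_scales` and `block_scales` are hypothetical, and the convention `scale = amax / 448` (448 being the largest finite E4M3 value) is an assumption for illustration; a real kernel would do this on the GPU.

```python
E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format

def _scale_from_amax(amax):
    # Map the group's largest magnitude onto the FP8 maximum; guard all-zero groups.
    return amax / E4M3_MAX if amax > 0 else 1.0

def tile_scales(row, tile=128):
    """One scale per 1 x `tile` slice of a single activation row."""
    return [
        _scale_from_amax(max(abs(v) for v in row[start:start + tile]))
        for start in range(0, len(row), tile)
    ]

def block_scales(weight, block=128):
    """A 2D grid of scales, one per `block` x `block` block of a weight matrix."""
    rows, cols = len(weight), len(weight[0])
    grid = []
    for r0 in range(0, rows, block):
        grid_row = []
        for c0 in range(0, cols, block):
            amax = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            grid_row.append(_scale_from_amax(amax))
        grid.append(grid_row)
    return grid

# Example: a 256-element activation row yields two tile scales.
activations = [448.0] + [1.0] * 127 + [4.48] + [0.5] * 127
print(tile_scales(activations))
```

Dividing each element by its group's scale before casting keeps the group's largest value at the top of the FP8 range; the scale is stored alongside the data so the values can be rescaled later.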

Why it works

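Tile- and block-wise scaling gives each small group of elements its own scaling factor, so the group effectively shares exponent bits and covers dynamic range that a single tensor-wide scale cannot. An outlier inflates only the scale of its own tile or block instead of the scale of the whole tensor.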
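
The outlier argument can be checked numerically with a small simulation. The sketch below is illustrative only: `round_e4m3` is a hypothetical helper that emulates E4M3 rounding in pure Python (3 mantissa bits, smallest normal exponent -6, saturation at 448), and the input data is synthetic.

```python
import math

E4M3_MAX = 448.0  # largest finite E4M3 magnitude

def round_e4m3(x):
    """Round x to the nearest E4M3-representable value (simulation, saturating)."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    _, exp = math.frexp(a)       # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)         # below 2**-6 the format uses subnormal spacing
    step = 2.0 ** (e - 3)        # 3 mantissa bits give 8 steps per binade
    return math.copysign(min(round(a / step) * step, E4M3_MAX), x)

def max_rel_error(values, scale):
    """Quantize values / scale to E4M3, dequantize, and return the worst relative error."""
    return max(abs(round_e4m3(v / scale) * scale - v) / abs(v) for v in values)

# Synthetic row: one tile of small values, one tile containing a large outlier.
small_tile = [0.01 * (1 + i / 128) for i in range(128)]
outlier_tile = [1.0] * 127 + [1000.0]
row = small_tile + outlier_tile

tensor_scale = max(abs(v) for v in row) / E4M3_MAX        # one scale for the whole row
tile_scale = max(abs(v) for v in small_tile) / E4M3_MAX   # scale for the small tile only

err_tensor = max_rel_error(small_tile, tensor_scale)
err_tile = max_rel_error(small_tile, tile_scale)
print("worst relative error, tensor-wide scale:", err_tensor)
print("worst relative error, per-tile scale:   ", err_tile)
```

With the tensor-wide scale, the outlier pushes the small values into the subnormal range of E4M3, where spacing is coarse; with a per-tile scale they stay in the normal range, where the rounding error is bounded by 1/16 of the value.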

FAQ

Q: Is FP4 next?
A: Yes. Blackwell supports it in hardware, and DeepSeek's hardware companion paper (arXiv:2505.09343) discusses low-precision directions.

Q: Does FP8 hurt loss?
A: DeepSeek-V3 reports a relative loss error consistently below 0.25% compared with a BF16 baseline, which the authors consider well within the range of training randomness.

References

arXiv:2412.19437 (V3)
arXiv:2505.09343 (V3/R1 hardware insights)
Colfax: "DeepSeek-R1 and FP8 Mixed-Precision Training"

