FP8 Mixed-Precision Training (DeepSeek-V3 recipe)
TL;DR
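- DeepSeek-V3 (671B total parameters, 37B activated per token) is the first extremely large-scale model reported to validate FP8 mixed-precision pre-training. Precision-sensitive components (embeddings, the output head, MoE gating, normalization, and attention operators) stay in BF16 or FP32.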
- The team uses E4M3 for all tensors rather than the hybrid scheme of NVIDIA's recipe (E4M3 in the forward pass, E5M2 for gradients). Fine-grained scaling over 1×128 activation tiles and 128×128 weight blocks makes this feasible.
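- The AdamW first and second moments are kept in BF16 instead of FP32 to save memory, while master weights and gradients remain in FP32. FP8 GEMM partial sums are periodically promoted to FP32 accumulators on CUDA cores (a pattern the paper credits to NVIDIA CUTLASS).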
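The effect of promoting partial sums to FP32 can be illustrated with a small NumPy sketch. NumPy has no type that matches a tensor-core accumulator, so `np.float16` is used here purely as a stand-in for a limited-precision accumulator. The 128-element promotion interval and the function names are assumptions chosen for illustration, not the actual kernel.

```python
import numpy as np

def low_precision_sum(values):
    """Accumulate sequentially in float16 (stand-in for a limited-precision accumulator)."""
    acc = np.float16(0.0)
    for v in values:
        acc = np.float16(acc + np.float16(v))
    return acc

def promoted_sum(values, interval=128):
    """Accumulate `interval` elements in low precision, then add each partial sum to an FP32 total."""
    total = np.float32(0.0)
    for start in range(0, len(values), interval):
        partial = low_precision_sum(values[start:start + interval])
        total = np.float32(total + np.float32(partial))
    return total

# Pretend these are the elementwise products along the inner (K) dimension of a GEMM.
products = np.ones(4096, dtype=np.float32)

naive = low_precision_sum(products)   # stalls once the running sum reaches 2048
promoted = promoted_sum(products)     # each 128-element partial sum is exact
print(float(naive), float(promoted))  # prints: 2048.0 4096.0
```

The float16 running sum stalls at 2048 because 2049 is not representable at that magnitude. The promoted version keeps every partial sum small enough to stay exact and does the long accumulation in FP32.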
Recipe (verbatim from DeepSeek-V3, §3.3)
"we adopt the E4M3 format on all tensors for higher precision. We attribute the feasibility of this approach to our fine-grained quantization strategy, i.e., tile and block-wise scaling." "we calculate the maximum absolute value online for each 1x128 activation tile or 128x128 weight block."
Why it works
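Tile- and block-wise scaling gives each small group of elements its own scaling factor, so the limited dynamic range of E4M3 is fitted to local magnitudes rather than to the whole tensor. An outlier therefore inflates only the scale of its own tile or block, instead of pushing every other value in the tensor toward underflow.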
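A minimal NumPy sketch of this effect follows. NumPy has no FP8 dtype, so `fake_quant_e4m3` simulates E4M3 rounding in float64. Mapping each group's maximum absolute value to 448 (the largest finite E4M3 magnitude) is an assumed scaling convention. The helper names and the synthetic data are hypothetical.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite E4M3 magnitude

def fake_quant_e4m3(x):
    """Simulate rounding to the nearest E4M3 value (4 exponent bits, 3 mantissa bits)."""
    x = np.clip(np.asarray(x, dtype=np.float64), -E4M3_MAX, E4M3_MAX)
    _, exp = np.frexp(np.abs(x))   # |x| = m * 2**exp with m in [0.5, 1)
    e = np.maximum(exp - 1, -6)    # binade exponent; -6 is the smallest normal one
    quantum = np.exp2(e - 3.0)     # spacing between neighbours with 3 mantissa bits
    return np.round(x / quantum) * quantum

def quant_dequant(x, block):
    """Quantize a 2-D array with one scale per `block` (rows, cols), then dequantize."""
    rows, cols = x.shape
    br, bc = block
    g = x.reshape(rows // br, br, cols // bc, bc)
    amax = np.abs(g).max(axis=(1, 3), keepdims=True)  # online max-abs per block
    scale = np.maximum(amax, 1e-30) / E4M3_MAX         # assumed convention: amax maps to 448
    return (fake_quant_e4m3(g / scale) * scale).reshape(rows, cols)

rng = np.random.default_rng(0)
x = 1e-3 * rng.standard_normal((1, 256))  # one token, two 1x128 tiles
x[0, 0] = 1e3                              # a single outlier in the first tile

per_tensor = quant_dequant(x, block=(1, 256))  # one scale for the whole tensor
per_tile = quant_dequant(x, block=(1, 128))    # one scale per 1x128 tile

clean = np.s_[0, 128:]                         # the tile without the outlier
err_tensor = np.abs(per_tensor[clean] - x[clean]).mean()
err_tile = np.abs(per_tile[clean] - x[clean]).mean()
print(f"clean-tile mean abs error: per-tensor={err_tensor:.2e}, per-tile={err_tile:.2e}")
```

With a single tensor-wide scale, the outlier forces most of the small values in the second tile to round to zero. With per-tile scales, the second tile is unaffected. For weights, the same helper would be called with `block=(128, 128)`.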
FAQ
Q: Is FP4 next? Plausibly. Blackwell adds hardware support for FP4, and DeepSeek's hardware companion paper (arXiv:2505.09343) discusses low-precision directions.
Q: Does FP8 hurt loss? DeepSeek-V3 reports a relative loss error that stays below 0.25% against a BF16 baseline, which the paper treats as within the range of training randomness.
References
- arXiv:2412.19437 (DeepSeek-V3 technical report)
- arXiv:2505.09343 (V3/R1 hardware insights)
- Colfax: "DeepSeek-R1 and FP8 Mixed-Precision Training"