FP8 Mixed-Precision Training

FP8 Mixed-Precision Training (DeepSeek-V3 recipe)

TL;DR

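  • DeepSeek-V3 (671B total parameters, 37B active per token) is the first extremely large-scale model reported to validate FP8 mixed-precision pre-training.
  • The team uses E4M3 on all tensors rather than the hybrid scheme used by NVIDIA and other prior work (E4M3 in the forward pass, E5M2 in the backward pass). Fine-grained scaling over 1×128 activation tiles and 128×128 weight blocks makes this possible.
  • The AdamW moments are stored in BF16 instead of FP32, which saves memory, while master weights stay in FP32. FP8 GEMM partial sums are promoted to FP32 accumulators at regular intervals (a pattern also used in NVIDIA CUTLASS).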
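To put the optimizer-state saving in perspective, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate, assuming the state in question is AdamW's two moment tensors (one value each per parameter) and ignoring how that state is sharded across devices. The function name `adamw_moment_bytes` is hypothetical.

```python
TOTAL_PARAMS = 671_000_000_000  # total parameter count quoted in the TL;DR


def adamw_moment_bytes(n_params: int, bytes_per_value: int) -> int:
    """Bytes needed for AdamW's first and second moments (two values per parameter)."""
    return 2 * n_params * bytes_per_value


fp32_bytes = adamw_moment_bytes(TOTAL_PARAMS, 4)  # FP32: 4 bytes per value
bf16_bytes = adamw_moment_bytes(TOTAL_PARAMS, 2)  # BF16: 2 bytes per value

print(f"FP32 moments: {fp32_bytes / 1e12:.2f} TB")  # FP32 moments: 5.37 TB
print(f"BF16 moments: {bf16_bytes / 1e12:.2f} TB")  # BF16 moments: 2.68 TB
```

These are totals summed over the whole cluster, so the per-device saving depends on the sharding strategy, which this sketch does not model.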

Recipe (verbatim from DeepSeek-V3, §3.3)

"we adopt the E4M3 format on all tensors for higher precision. We attribute the feasibility of this approach to our fine-grained quantization strategy, i.e., tile and block-wise scaling."

"we calculate the maximum absolute value online for each 1x128 activation tile or 128x128 weight block."
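The grouping described in the quotes can be sketched in NumPy as follows. This is an illustration, not DeepSeek's kernel code. The function names are hypothetical, and the convention that a scale equals the group's maximum absolute value divided by 448 (the largest finite E4M3 value) is an assumption made here; the quotes specify only the group shapes and the online maximum.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value


def activation_tile_scales(x: np.ndarray, tile: int = 128) -> np.ndarray:
    """One scale per 1 x `tile` slice along the last axis of a 2-D activation."""
    rows, cols = x.shape
    assert cols % tile == 0, "last dimension must be a multiple of the tile size"
    amax = np.abs(x.reshape(rows, cols // tile, tile)).max(axis=-1)
    # Guard against all-zero tiles so the scale is never zero (assumed convention).
    return np.maximum(amax, 1e-12) / FP8_E4M3_MAX


def weight_block_scales(w: np.ndarray, block: int = 128) -> np.ndarray:
    """One scale per `block` x `block` square of a 2-D weight matrix."""
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    blocks = w.reshape(rows // block, block, cols // block, block)
    amax = np.abs(blocks).max(axis=(1, 3))
    return np.maximum(amax, 1e-12) / FP8_E4M3_MAX


rng = np.random.default_rng(0)
activations = rng.standard_normal((4, 256))
weights = rng.standard_normal((256, 384))
act_scales = activation_tile_scales(activations)  # shape (4, 2)
w_scales = weight_block_scales(weights)           # shape (2, 3)
print(act_scales.shape, w_scales.shape)
```

Dividing each tile or block by its scale maps the largest element in that group to ±448, so every group uses the full E4M3 range.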

Why it works

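Tile/block-wise scaling shares one scaling factor across each small group of elements (128 activations or a 128×128 weight block) instead of across the whole tensor. FP8 has a narrow dynamic range, so with a single tensor-wide scale one large outlier forces every other element toward the bottom of that range, where small values lose precision or underflow to zero. With fine-grained scaling, an outlier only affects the scale of its own tile or block.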
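A small simulation can make the outlier effect concrete. The sketch below emulates E4M3 rounding in ordinary floating point (3 mantissa bits, minimum normal exponent of -6, saturation at 448) and compares one tensor-wide scale against one scale per 128 elements. The helper names `round_to_e4m3` and `fake_quantize` are hypothetical, the emulation is an approximation rather than a bit-exact FP8 cast, and the outlier is deliberately extreme (about 10^5 times the typical magnitude) so that the effect is visible on a tiny input.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value


def round_to_e4m3(x):
    """Round to the nearest E4M3-representable value (emulated, saturating)."""
    x = np.clip(np.asarray(x, dtype=np.float64), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    _, e = np.frexp(np.abs(x))          # |x| = m * 2**e with m in [0.5, 1)
    exp = np.maximum(e - 1, -6)         # clamp to the subnormal exponent
    step = 2.0 ** (exp - 3)             # spacing given 3 mantissa bits
    return np.round(x / step) * step


def fake_quantize(x, group):
    """Quantize then dequantize, sharing one scale per `group` elements of the last axis."""
    rows, cols = x.shape
    g = x.reshape(rows, cols // group, group)
    scale = np.maximum(np.abs(g).max(axis=-1, keepdims=True), 1e-12) / FP8_E4M3_MAX
    return (round_to_e4m3(g / scale) * scale).reshape(rows, cols)


rng = np.random.default_rng(0)
x = rng.standard_normal((1, 1024))
x[0, 0] = 1e5  # a single extreme outlier in the first tile

per_tensor = fake_quantize(x, group=1024)  # one scale for everything
per_tile = fake_quantize(x, group=128)     # one scale per 1x128 tile

# Measure the error only on elements outside the outlier's tile.
clean = slice(128, None)
err_tensor = np.abs(per_tensor[0, clean] - x[0, clean]).mean()
err_tile = np.abs(per_tile[0, clean] - x[0, clean]).mean()
print(f"per-tensor mean abs error: {err_tensor:.5f}")
print(f"per-tile   mean abs error: {err_tile:.5f}")
```

With the tensor-wide scale, most ordinary values land in the E4M3 subnormal range and are rounded coarsely. With per-tile scales, the tiles that do not contain the outlier keep their full relative precision.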

FAQ

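Q: Is FP4 next? Likely. NVIDIA's Blackwell GPUs support FP4 in hardware, and DeepSeek's hardware companion paper (arXiv:2505.09343) discusses low-precision directions.

Q: Does FP8 hurt loss? Barely. DeepSeek-V3 reports that the relative loss error against a BF16 baseline stays below 0.25%, which the paper considers within the range of training randomness.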

References

  • arXiv:2412.19437 (DeepSeek-V3)
  • arXiv:2505.09343 (DeepSeek-V3/R1 hardware insights)
  • Colfax: "DeepSeek-R1 and FP8 Mixed-Precision Training"

Series

LLM Pre-Training 2026
