Mixed-precision inference runs parts of a model in lower numerical precision (FP16, BF16, or FP8) while keeping numerically sensitive layers (e.g., normalization, softmax) in FP32, increasing throughput and reducing memory use.
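As a concrete illustration, here is a minimal PyTorch sketch (the toy model and shapes are invented for the example, and a CUDA GPU is required): `torch.autocast` dispatches matmul-heavy ops to FP16 while keeping precision-sensitive ops such as layer normalization in FP32.

```python
import torch
import torch.nn as nn

# Hypothetical toy model: two linear layers with a LayerNorm in between.
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.LayerNorm(512),  # autocast keeps layer_norm in FP32 on CUDA
    nn.Linear(512, 10),
).eval().cuda()

x = torch.randn(8, 512, device="cuda")

# autocast runs the matmuls in FP16; sensitive ops stay in FP32.
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    logits = model(x)

print(logits.dtype)  # torch.float16: the final matmul ran in half precision
```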
Precision Trade-offs
Calibration Methods
Current Trends (2025)
- NVIDIA's FP8 transformer kernels (Transformer Engine) hit roughly 500 TFLOPS on H100 GPUs.
- Intel AMX (Advanced Matrix Extensions) brings BF16 mixed precision to Xeon CPUs.
- Quantization-aware fine-tuning combined with FP8 inference achieves near-FP16 quality.
Implementation Tips
- Use your framework's automatic mixed precision (AMP) support instead of hand-casting tensors (see the CPU BF16 sketch after this list).
- Validate that the quality drop on a held-out dev set stays below 0.5% after any precision change (see the accuracy-comparison sketch below).
- Monitor overflow/underflow counters during inference (see the forward-hook sketch below).
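For the AMP tip, a hedged sketch of framework-level mixed precision on CPU, assuming PyTorch and a hypothetical toy model; on AMX-capable Xeon CPUs, the BF16 matmuls can be executed on the AMX tile units.

```python
import torch
import torch.nn as nn

# Hypothetical CPU model; shapes are placeholders.
cpu_model = nn.Sequential(
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 64),
).eval()

x = torch.randn(4, 256)

# Framework-level AMP: one context manager, no manual .to(torch.bfloat16) casts.
with torch.inference_mode(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = cpu_model(x)

print(out.dtype)  # torch.bfloat16
```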
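For the validation tip, one way to check the quality budget is to run the same dev set under FP32 and under reduced precision and compare; the model, loader, and data below are illustrative placeholders.

```python
import contextlib
import torch
import torch.nn as nn

def dev_set_accuracy(model, loader, precision_ctx):
    """Accuracy over a dev set under a given precision context."""
    correct = total = 0
    with torch.inference_mode(), precision_ctx:
        for x, y in loader:
            preds = model(x).argmax(dim=-1)
            correct += (preds == y).sum().item()
            total += y.numel()
    return correct / total

# Hypothetical toy dev set; substitute your real model and loader.
model = nn.Linear(32, 5).eval()
loader = [(torch.randn(16, 32), torch.randint(0, 5, (16,))) for _ in range(4)]

acc_fp32 = dev_set_accuracy(model, loader, contextlib.nullcontext())
acc_bf16 = dev_set_accuracy(
    model, loader, torch.autocast(device_type="cpu", dtype=torch.bfloat16))

print(f"FP32 acc {acc_fp32:.4f}  BF16 acc {acc_bf16:.4f}")
# In practice, gate deployment on acc_fp32 - acc_bf16 < 0.005 (the 0.5% budget).
```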
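For the monitoring tip, one way to approximate overflow counters in PyTorch is a forward hook that counts non-finite activations per module. The model and names here are hypothetical, and this only catches overflow to inf/NaN; underflow to zero would need a separate check.

```python
import torch
import torch.nn as nn
from collections import defaultdict

nonfinite_counts = defaultdict(int)  # per-module count of inf/NaN outputs

def make_hook(name):
    def hook(module, inputs, output):
        if torch.is_tensor(output):
            n = int((~torch.isfinite(output)).sum())
            if n:
                nonfinite_counts[name] += n
    return hook

# Hypothetical model; attach a hook to every leaf module.
model = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 10)).eval()
for name, module in model.named_modules():
    if not list(module.children()):
        module.register_forward_hook(make_hook(name))

with torch.inference_mode(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    model(torch.randn(8, 64))

print(dict(nonfinite_counts))  # empty unless some layer produced inf/NaN
```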