Mixed-Precision Inference

Benched.ai Editorial Team

Mixed-precision inference runs parts of a model in lower numerical precision (FP16, BF16, FP8, or INT8) while keeping numerically sensitive layers in FP32, increasing throughput and reducing memory use.
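
As a concrete illustration, the sketch below (assuming PyTorch on a CUDA device; the article does not prescribe a framework) uses torch.autocast so matrix multiplications run in FP16 while numerically sensitive operations stay in FP32:

  import torch
  import torch.nn as nn

  # Stand-in model: two linear layers around a GELU, with FP32 master weights.
  model = nn.Sequential(
      nn.Linear(1024, 4096),
      nn.GELU(),
      nn.Linear(4096, 1024),
  ).cuda().eval()

  x = torch.randn(8, 1024, device="cuda")

  # autocast casts eligible ops (matmuls, linear layers) to FP16 on the fly while
  # keeping numerically sensitive ops (reductions, softmax, norms) in FP32.
  with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
      y = model(x)
  print(y.dtype)  # torch.float16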

  Precision Trade-offs

Precision    Memory vs FP32    Throughput Gain    Example Layers
FP16/BF16    2× smaller        1.7×               MLP, Attention
FP8          4× smaller        2.3×               MatMul
INT8         4× smaller        2.5×               Linear projections
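
The memory column can be checked directly by comparing the byte footprint of the same weight matrix in each dtype. The sketch below assumes PyTorch, with a hypothetical 4096×4096 layer chosen only for illustration (the FP8 dtype requires PyTorch 2.1 or later):

  import torch

  w32 = torch.randn(4096, 4096, dtype=torch.float32)  # FP32 reference weight
  w16 = w32.to(torch.bfloat16)                         # half the storage
  w8  = w32.to(torch.float8_e4m3fn)                    # quarter the storage (PyTorch 2.1+)

  def mib(t: torch.Tensor) -> float:
      return t.element_size() * t.nelement() / 2**20

  print(f"FP32: {mib(w32):.0f} MiB")  # 64 MiB
  print(f"BF16: {mib(w16):.0f} MiB")  # 32 MiB
  print(f"FP8:  {mib(w8):.0f} MiB")   # 16 MiB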

  Calibration Methods

Method                  Description
Static calibration      Collect activation ranges offline on a calibration set
Dynamic calibration     Estimate activation ranges on the fly at inference time
Per-channel scaling     Separate scale per weight row
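
The sketch below illustrates static, per-channel calibration with a hypothetical helper (calibrate_per_channel is not a library API): activation ranges are collected offline over a calibration set and turned into one INT8 scale per channel:

  import torch

  def calibrate_per_channel(batches, num_channels):
      """Static calibration: per-channel absolute maxima collected offline."""
      amax = torch.zeros(num_channels)
      for a in batches:                      # a: [batch, channels] FP32 activations
          amax = torch.maximum(amax, a.abs().amax(dim=0))
      return amax / 127.0                    # one INT8 scale per channel

  # Stand-in calibration set: 16 batches of 768-dim activations.
  calib_batches = [torch.randn(32, 768) for _ in range(16)]
  scales = calibrate_per_channel(calib_batches, num_channels=768)

  # Quantize a new activation tensor with the precomputed (static) scales.
  a_int8 = torch.clamp((calib_batches[0] / scales).round(), -127, 127).to(torch.int8)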

  Current Trends (2025)

  • NVIDIA FP8 transformer kernels hit 500 TFLOPs on H100.
  • Intel AMX BF16 extends mixed precision to CPUs.
  • Quantization-aware finetuning plus FP8 achieves near-FP16 quality.

  Implementation Tips

  1. Enable automatic mixed precision (AMP) via the framework's built-in flags or context managers (see the sketch after this list).
  2. Validate that the quality drop stays below 0.5 % on a dev set after each precision change.
  3. Monitor overflow/underflow counters during inference.
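
A minimal sketch of tips 1 and 3, assuming PyTorch on a CUDA device: AMP is enabled through torch.autocast, and a forward hook counts non-finite activations as an overflow/underflow signal:

  import torch
  import torch.nn as nn

  model = nn.Linear(1024, 1024).cuda().eval()
  overflow_events = 0

  def count_nonfinite(module, inputs, output):
      # Tip 3: count Inf/NaN activations as an overflow/underflow signal.
      global overflow_events
      overflow_events += int((~torch.isfinite(output)).sum().item())

  model.register_forward_hook(count_nonfinite)

  # Tip 1: AMP via the framework's autocast context manager.
  with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
      _ = model(torch.randn(64, 1024, device="cuda"))

  print(f"Non-finite activations this batch: {overflow_events}")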