Mixed-Precision Inference

Benched.ai Editorial Team

Mixed-precision inference runs parts of a model in lower numerical precision (FP16, BF16, FP8, or INT8) while keeping numerically sensitive layers in FP32, increasing throughput and reducing memory use.
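
As a concrete illustration, the sketch below (assuming PyTorch on a CUDA device; the article does not prescribe a framework) uses torch.autocast so matrix multiplications run in FP16 while numerically sensitive operations stay in FP32:

  import torch
  import torch.nn as nn

  # Stand-in model: two linear layers around a GELU, with FP32 master weights.
  model = nn.Sequential(
      nn.Linear(1024, 4096),
      nn.GELU(),
      nn.Linear(4096, 1024),
  ).cuda().eval()

  x = torch.randn(8, 1024, device="cuda")

  # autocast casts eligible ops (matmuls, linear layers) to FP16 on the fly while
  # keeping numerically sensitive ops (reductions, softmax, norms) in FP32.
  with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
      y = model(x)
  print(y.dtype)  # torch.float16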

  Precision Trade-offs

Precision    Memory vs FP32    Throughput Gain    Example Layers
FP16/BF16    2× smaller        1.7×               MLP, Attention
FP8          4× smaller        2.3×               MatMul
INT8         4× smaller        2.5×               Linear projections
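
The memory column can be checked directly by comparing the byte footprint of the same weight matrix in each dtype. The sketch below assumes PyTorch, with a hypothetical 4096×4096 layer chosen only for illustration (the FP8 dtype requires PyTorch 2.1 or later):

  import torch

  w32 = torch.randn(4096, 4096, dtype=torch.float32)  # FP32 reference weight
  w16 = w32.to(torch.bfloat16)                         # half the storage
  w8  = w32.to(torch.float8_e4m3fn)                    # quarter the storage (PyTorch 2.1+)

  def mib(t: torch.Tensor) -> float:
      return t.element_size() * t.nelement() / 2**20

  print(f"FP32: {mib(w32):.0f} MiB")  # 64 MiB
  print(f"BF16: {mib(w16):.0f} MiB")  # 32 MiB
  print(f"FP8:  {mib(w8):.0f} MiB")   # 16 MiB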

  Calibration Methods

Method                  Description
Static calibration      Collect activation ranges offline on a calibration set
Dynamic calibration     Estimate activation ranges on the fly at inference time
Per-channel scaling     Separate scale per weight row
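
The sketch below illustrates static, per-channel calibration with a hypothetical helper (calibrate_per_channel is not a library API): activation ranges are collected offline over a calibration set and turned into one INT8 scale per channel:

  import torch

  def calibrate_per_channel(batches, num_channels):
      """Static calibration: per-channel absolute maxima collected offline."""
      amax = torch.zeros(num_channels)
      for a in batches:                      # a: [batch, channels] FP32 activations
          amax = torch.maximum(amax, a.abs().amax(dim=0))
      return amax / 127.0                    # one INT8 scale per channel

  # Stand-in calibration set: 16 batches of 768-dim activations.
  calib_batches = [torch.randn(32, 768) for _ in range(16)]
  scales = calibrate_per_channel(calib_batches, num_channels=768)

  # Quantize a new activation tensor with the precomputed (static) scales.
  a_int8 = torch.clamp((calib_batches[0] / scales).round(), -127, 127).to(torch.int8)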

  Current Trends (2025)

  • NVIDIA FP8 transformer kernels hit 500 TFLOPs on H100.
  • Intel AMX BF16 extends mixed precision to CPUs.
  • Quantization-aware finetuning plus FP8 achieves near-FP16 quality.

  Implementation Tips

  1. Enable automatic mixed precision (AMP) via the framework's built-in flags or context managers (see the sketch after this list).
  2. Validate that the quality drop stays below 0.5 % on a dev set after each precision change.
  3. Monitor overflow/underflow counters during inference.
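
A minimal sketch of tips 1 and 3, assuming PyTorch on a CUDA device: AMP is enabled through torch.autocast, and a forward hook counts non-finite activations as an overflow/underflow signal:

  import torch
  import torch.nn as nn

  model = nn.Linear(1024, 1024).cuda().eval()
  overflow_events = 0

  def count_nonfinite(module, inputs, output):
      # Tip 3: count Inf/NaN activations as an overflow/underflow signal.
      global overflow_events
      overflow_events += int((~torch.isfinite(output)).sum().item())

  model.register_forward_hook(count_nonfinite)

  # Tip 1: AMP via the framework's autocast context manager.
  with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
      _ = model(torch.randn(64, 1024, device="cuda"))

  print(f"Non-finite activations this batch: {overflow_events}")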