Model quantization compresses neural network weights and activations from full- or half-precision formats (FP32/FP16) into lower-bit representations such as INT8, INT4, or FP4. The reduced numerical precision slashes memory footprint and bandwidth, enabling cheaper inference on CPUs, GPUs, and specialized accelerators with minimal accuracy loss when applied carefully.
Definition and Scope
Quantization maps a real-valued tensor \(x\) to a discrete set of representable levels \(Q(x)\) via an affine transformation \(x \approx s \cdot (q - z)\), where \(s\) is the scale and \(z\) is the zero-point. Approaches differ in how they estimate the scale, whether they retrain the model (quantization-aware training) or leave it untouched (post-training quantization), and whether the mapping is symmetric (no zero-point) or asymmetric.
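A minimal NumPy sketch of this affine mapping, assuming a per-tensor scale and zero-point with unsigned INT8 storage; the function names are illustrative rather than any particular library's API:

```python
import numpy as np

def quantize(x: np.ndarray, num_bits: int = 8):
    """Affine quantization: x ≈ s * (q - z), with q stored as unsigned integers."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Include 0 in the clipping range so that zero stays exactly representable.
    x_min, x_max = min(float(x.min()), 0.0), max(float(x.max()), 0.0)
    s = (x_max - x_min) / (qmax - qmin) or 1.0   # scale; guard against all-zero tensors
    z = int(round(qmin - x_min / s))             # zero-point
    q = np.clip(np.round(x / s) + z, qmin, qmax).astype(np.uint8)
    return q, s, z

def dequantize(q: np.ndarray, s: float, z: int) -> np.ndarray:
    return s * (q.astype(np.float32) - z)

x = np.random.randn(4, 4).astype(np.float32)
q, s, z = quantize(x)
print(np.abs(x - dequantize(q, s, z)).max())  # reconstruction error is on the order of s / 2
```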
Techniques at a Glance
Post-training approaches such as GPTQ, AWQ, and SmoothQuant quantize an already-trained model using only a small calibration set, while quantization-aware training (QAT) simulates low precision during fine-tuning and typically preserves more accuracy at the cost of extra training compute.
Design Trade-offs: Accuracy vs. Efficiency
- Precision vs. Speed: INT4 roughly doubles throughput over INT8 on memory-bound workloads, but rounding error grows as bit width shrinks.
- Layer Sensitivity: Embedding layers and attention projection matrices tolerate quantization error better than layer norms; keeping the sensitive layers at higher precision (selective mixed precision) mitigates accuracy drops.
- Activation Quantization: Quantizing activations as well as weights yields an extra 25-40 % memory reduction but requires calibration data or quantization-aware training (QAT); see the calibration sketch after this list.
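To make the calibration requirement concrete, here is a hedged sketch of per-tensor activation range collection using PyTorch forward hooks; the MinMaxObserver class, the layer-name list, and the batch iterable are assumptions for illustration, not a library API.

```python
import torch

class MinMaxObserver:
    """Tracks the running min/max of a tensor stream (simplified observer)."""
    def __init__(self):
        self.lo, self.hi = float("inf"), float("-inf")

    def update(self, t: torch.Tensor):
        self.lo = min(self.lo, t.min().item())
        self.hi = max(self.hi, t.max().item())

    def scale_zero_point(self, num_bits: int = 8):
        qmin, qmax = 0, 2 ** num_bits - 1
        lo, hi = min(self.lo, 0.0), max(self.hi, 0.0)   # keep 0 representable
        s = (hi - lo) / (qmax - qmin) or 1.0
        z = int(round(qmin - lo / s))
        return s, z

def calibrate(model, layer_names, batches):
    """Run representative batches and record activation ranges per layer."""
    observers = {name: MinMaxObserver() for name in layer_names}
    modules = dict(model.named_modules())
    hooks = [modules[name].register_forward_hook(
                 lambda mod, inp, out, n=name: observers[n].update(out.detach()))
             for name in layer_names]
    with torch.no_grad():
        for x in batches:
            model(x)
    for h in hooks:
        h.remove()
    return {name: obs.scale_zero_point() for name, obs in observers.items()}
```

Only layers whose outputs are plain tensors (e.g. linear projections) are assumed here; modules that return tuples would need a slightly different hook.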
Current Trends (2025)
- FP4/FP3 formats with per-channel scales show a <0.5 % accuracy delta on Llama-3 405B.
- Hardware Support: NVIDIA B200 and AMD MI350 ship tensor cores with native 4-bit (FP4) matrix math that accumulates into higher precision.
- Logarithmic Quantization: Power-of-two (base-2 logarithmic) bins compress speech models to 2 bits with a negligible change in word error rate (WER); a toy power-of-two sketch follows this list.
- Elastic Quantization: Runtime selects precision per sequence based on uncertainty, saving 18 % energy on average.
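For the logarithmic-quantization trend above, the toy NumPy sketch below snaps each weight to a signed power of two so that multiplications can become bit shifts; the bit allocation (one sign bit plus a small exponent window) is a simplifying assumption, not the exact scheme of any published speech model.

```python
import numpy as np

def log2_quantize(w: np.ndarray, num_bits: int = 2) -> np.ndarray:
    """Snap each value to sign * 2^k within a small exponent window; tiny values go to 0."""
    sign = np.sign(w)
    k = np.round(np.log2(np.abs(w) + 1e-12))             # nearest power-of-two exponent
    k_max = np.round(np.log2(np.abs(w).max() + 1e-12))    # anchor on the largest weight
    levels = 2 ** (num_bits - 1)                          # exponent levels available
    k_min = k_max - levels + 1
    q = sign * np.exp2(np.clip(k, k_min, k_max))          # clamp exponents into the window
    return np.where(k < k_min, 0.0, q)                    # values below the window flush to zero

w = np.random.randn(8).astype(np.float32)
print(w)
print(log2_quantize(w))
```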
Implementation Tips
- Calibrate with representative prompts; random data skews activation histograms.
- Use per-channel scales for weight tensors and per-tensor scales for activations to keep kernels simple; see the sketch after this list.
- Insert quantize/de-quantize stubs only at input and output boundaries to avoid repeated format conversions inside the graph.
- Benchmark end-to-end latency rather than kernel micro-benchmarks; work outside the quantized kernels, such as CPU-side GEMMs or dequantization, may dominate.
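A sketch of the per-channel-vs.-per-tensor tip, assuming a (out_channels, in_channels) weight layout and symmetric signed INT8; the helper names are hypothetical:

```python
import numpy as np

def per_channel_weight_scales(w: np.ndarray, num_bits: int = 8):
    """One symmetric scale per output channel (row) of a weight matrix."""
    qmax = 2 ** (num_bits - 1) - 1                                   # 127 for INT8
    s = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-12) / qmax
    q = np.clip(np.round(w / s), -qmax - 1, qmax).astype(np.int8)
    return q, s

def per_tensor_activation_scale(x: np.ndarray, num_bits: int = 8):
    """A single symmetric scale for the whole activation tensor."""
    qmax = 2 ** (num_bits - 1) - 1
    s = max(float(np.abs(x).max()), 1e-12) / qmax
    q = np.clip(np.round(x / s), -qmax - 1, qmax).astype(np.int8)
    return q, s

w = np.random.randn(16, 64).astype(np.float32)
q_w, s_w = per_channel_weight_scales(w)
print(np.abs(w - q_w * s_w).max())  # error is bounded by each row's own scale
```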
References
- Frantar et al., GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, 2023.
- Lin et al., AWQ: Activation-aware Weight Quantization, 2023.
- Xiao et al., SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models, 2022.
- Internal measurements on AWS g5.2xlarge, batch=1, sequence=512.