Model quantization compresses neural network weights and activations from full- or half-precision formats (FP32/FP16) into lower-bit representations such as INT8, INT4, or FP4. The reduced numerical precision slashes memory footprint and bandwidth, enabling cheaper inference on CPUs, GPUs, and specialized accelerators with minimal accuracy loss when applied carefully.
Definition and Scope
Quantization maps a real-valued tensor \(x\) to a discrete set of representable levels \(Q(x)\) via an affine transformation \(x \approx s \cdot (q - z)\), where \(s\) is the scale and \(z\) is the zero-point. Approaches differ in how they estimate the scale, whether they retrain the model (quantization-aware training) or leave it untouched (post-training quantization), and whether the mapping is symmetric (no zero-point) or asymmetric.
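A minimal NumPy sketch of this affine mapping, assuming a per-tensor scale and zero-point with unsigned INT8 storage; the function names are illustrative rather than any particular library's API:

```python
import numpy as np

def quantize(x: np.ndarray, num_bits: int = 8):
    """Affine quantization: x ≈ s * (q - z), with q stored as unsigned integers."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Include 0 in the clipping range so that zero stays exactly representable.
    x_min, x_max = min(float(x.min()), 0.0), max(float(x.max()), 0.0)
    s = (x_max - x_min) / (qmax - qmin) or 1.0   # scale; guard against all-zero tensors
    z = int(round(qmin - x_min / s))             # zero-point
    q = np.clip(np.round(x / s) + z, qmin, qmax).astype(np.uint8)
    return q, s, z

def dequantize(q: np.ndarray, s: float, z: int) -> np.ndarray:
    return s * (q.astype(np.float32) - z)

x = np.random.randn(4, 4).astype(np.float32)
q, s, z = quantize(x)
print(np.abs(x - dequantize(q, s, z)).max())  # reconstruction error is on the order of s / 2
```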
Techniques at a Glance
Post-training approaches such as GPTQ, AWQ, and SmoothQuant quantize an already-trained model using only a small calibration set, while quantization-aware training (QAT) simulates low precision during fine-tuning and typically preserves more accuracy at the cost of extra training compute.
Design Trade-offs: Accuracy vs. Efficiency
- Precision vs. Speed: INT4 roughly doubles throughput over INT8 on memory-bound workloads, but rounding error grows as bit width shrinks.
- Layer Sensitivity: Embedding layers and attention projection matrices tolerate quantization error better than layer norms; keeping the sensitive layers at higher precision (selective mixed precision) mitigates accuracy drops.
- Activation Quantization: Quantizing activations as well as weights yields an extra 25-40 % memory reduction but requires calibration data or quantization-aware training (QAT); see the calibration sketch after this list.
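To make the calibration requirement concrete, here is a hedged sketch of per-tensor activation range collection using PyTorch forward hooks; the MinMaxObserver class, the layer-name list, and the batch iterable are assumptions for illustration, not a library API.

```python
import torch

class MinMaxObserver:
    """Tracks the running min/max of a tensor stream (simplified observer)."""
    def __init__(self):
        self.lo, self.hi = float("inf"), float("-inf")

    def update(self, t: torch.Tensor):
        self.lo = min(self.lo, t.min().item())
        self.hi = max(self.hi, t.max().item())

    def scale_zero_point(self, num_bits: int = 8):
        qmin, qmax = 0, 2 ** num_bits - 1
        lo, hi = min(self.lo, 0.0), max(self.hi, 0.0)   # keep 0 representable
        s = (hi - lo) / (qmax - qmin) or 1.0
        z = int(round(qmin - lo / s))
        return s, z

def calibrate(model, layer_names, batches):
    """Run representative batches and record activation ranges per layer."""
    observers = {name: MinMaxObserver() for name in layer_names}
    modules = dict(model.named_modules())
    hooks = [modules[name].register_forward_hook(
                 lambda mod, inp, out, n=name: observers[n].update(out.detach()))
             for name in layer_names]
    with torch.no_grad():
        for x in batches:
            model(x)
    for h in hooks:
        h.remove()
    return {name: obs.scale_zero_point() for name, obs in observers.items()}
```

Only layers whose outputs are plain tensors (e.g. linear projections) are assumed here; modules that return tuples would need a slightly different hook.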
Current Trends (2025)
- FP4/FP3 formats with per-channel scales show a <0.5 % accuracy delta on Llama-3 405B.
- Hardware Support: NVIDIA B200 and AMD MI350 ship tensor cores with native 4-bit (FP4) matrix math that accumulates into higher precision.
- Logarithmic Quantization: Power-of-two (base-2 logarithmic) bins compress speech models to 2 bits with a negligible change in word error rate (WER); a toy power-of-two sketch follows this list.
- Elastic Quantization: Runtime selects precision per sequence based on uncertainty, saving 18 % energy on average.
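For the logarithmic-quantization trend above, the toy NumPy sketch below snaps each weight to a signed power of two so that multiplications can become bit shifts; the bit allocation (one sign bit plus a small exponent window) is a simplifying assumption, not the exact scheme of any published speech model.

```python
import numpy as np

def log2_quantize(w: np.ndarray, num_bits: int = 2) -> np.ndarray:
    """Snap each value to sign * 2^k within a small exponent window; tiny values go to 0."""
    sign = np.sign(w)
    k = np.round(np.log2(np.abs(w) + 1e-12))             # nearest power-of-two exponent
    k_max = np.round(np.log2(np.abs(w).max() + 1e-12))    # anchor on the largest weight
    levels = 2 ** (num_bits - 1)                          # exponent levels available
    k_min = k_max - levels + 1
    q = sign * np.exp2(np.clip(k, k_min, k_max))          # clamp exponents into the window
    return np.where(k < k_min, 0.0, q)                    # values below the window flush to zero

w = np.random.randn(8).astype(np.float32)
print(w)
print(log2_quantize(w))
```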
Implementation Tips
- Calibrate with representative prompts; random data skews activation histograms.
- Use per-channel scales for weight tensors and per-tensor scales for activations to keep kernels simple; see the sketch after this list.
- Insert quantize/de-quantize stubs only at input and output boundaries to avoid repeated format conversions inside the graph.
- Benchmark end-to-end latency rather than kernel micro-benchmarks; work outside the quantized kernels, such as CPU-side GEMMs or dequantization, may dominate.
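A sketch of the per-channel-vs.-per-tensor tip, assuming a (out_channels, in_channels) weight layout and symmetric signed INT8; the helper names are hypothetical:

```python
import numpy as np

def per_channel_weight_scales(w: np.ndarray, num_bits: int = 8):
    """One symmetric scale per output channel (row) of a weight matrix."""
    qmax = 2 ** (num_bits - 1) - 1                                   # 127 for INT8
    s = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-12) / qmax
    q = np.clip(np.round(w / s), -qmax - 1, qmax).astype(np.int8)
    return q, s

def per_tensor_activation_scale(x: np.ndarray, num_bits: int = 8):
    """A single symmetric scale for the whole activation tensor."""
    qmax = 2 ** (num_bits - 1) - 1
    s = max(float(np.abs(x).max()), 1e-12) / qmax
    q = np.clip(np.round(x / s), -qmax - 1, qmax).astype(np.int8)
    return q, s

w = np.random.randn(16, 64).astype(np.float32)
q_w, s_w = per_channel_weight_scales(w)
print(np.abs(w - q_w * s_w).max())  # error is bounded by each row's own scale
```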
References
- Frantar et al., GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, 2023.
- Lin et al., AWQ: Activation-aware Weight Quantization, 2023.
- Xiao et al., SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models, 2022.
- Internal measurements on AWS g5.2xlarge, batch=1, sequence=512.