Throughput measures how much work an AI system completes per unit time.
Key Throughput Metrics
- Tokens generated per second (tokens/s) for LLM inference.
- Requests served per minute for API endpoints.
- Images classified per second for vision models.
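In practice, tokens/s is measured by timing a generation call end to end. The snippet below is a minimal sketch of that pattern; `generate` is a hypothetical stand-in for any model call that returns generated token IDs.

```python
import time

def generate(prompt: str) -> list[int]:
    """Hypothetical stand-in for a model call returning generated token IDs."""
    return list(range(256))  # placeholder output; swap in a real model call

start = time.perf_counter()
tokens = generate("Explain throughput in one paragraph.")
elapsed = time.perf_counter() - start
print(f"{len(tokens) / elapsed:.1f} tokens/s")
```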
Factors Influencing Throughput
- Batch size and padding efficiency (see the padding sketch after this list).
- Model size and numerical precision.
- GPU utilization and memory bandwidth.
- Network latencies in distributed setups.
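To make the padding factor concrete, here is a minimal sketch of padding efficiency: the fraction of a right-padded batch occupied by real tokens rather than padding. The function name and example lengths are illustrative.

```python
def padding_efficiency(seq_lens: list[int]) -> float:
    """Fraction of a right-padded batch occupied by real tokens."""
    max_len = max(seq_lens)
    return sum(seq_lens) / (len(seq_lens) * max_len)

# One long sequence forces heavy padding on the short ones.
print(padding_efficiency([512, 60, 75, 90]))  # ≈ 0.36
```

Low efficiency means compute is spent on padding; sorting requests by length or batching continuously raises it.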
Design Trade-offs
- Increasing batch size raises throughput but can inflate latency and memory usage (see the sweep sketch after this list).
- Mixed precision raises achievable FLOPS but may introduce rounding error if not carefully calibrated.
- Token streaming improves user-perceived speed yet reduces aggregate tokens/s due to per-token flush overhead.
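The batch-size trade-off is easy to observe empirically. Below is a minimal sketch, assuming PyTorch and a toy linear layer (not any model discussed here), that sweeps batch size and reports per-batch latency against samples/s.

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device).eval()
iters = 50

for batch in (1, 8, 32, 128):
    x = torch.randn(batch, 1024, device=device)
    if device == "cuda":
        torch.cuda.synchronize()  # ensure prior GPU work is done before timing
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(iters):
            model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f"batch={batch:4d}  latency={elapsed / iters * 1e3:6.2f} ms/iter  "
          f"throughput={batch * iters / elapsed:10.0f} samples/s")
```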
Current Trends (2025)
- Speculative decoding (a small draft model proposes tokens that the target model verifies in parallel) is reported to yield roughly 1.7× tokens/s on GPT-4-Turbo.
- Kernel fusion libraries such as TorchInductor auto-merge small ops to reach roughly 80 % of theoretical peak FLOPS [1].
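For context on the fusion point: TorchInductor is the default backend of `torch.compile` in PyTorch 2.x, and a minimal invocation looks like the sketch below. The toy model is illustrative; realized FLOPS utilization depends on hardware and tensor shapes.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.GELU(),
    torch.nn.LayerNorm(512),
)

compiled = torch.compile(model)       # TorchInductor backend fuses eligible small ops
out = compiled(torch.randn(8, 512))   # first call triggers compilation; later calls reuse it
```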
Implementation Tips
- Report throughput alongside p99 latency to avoid optimizing one at the cost of the other (see the sketch after this list).
- Use synthetic prompts at maximum context length to stress-test the tokens/s ceiling.
- Monitor tokens/Joule for energy-aware tuning.
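A minimal sketch of the first tip, reporting aggregate tokens/s next to p99 latency; the timing and token counts are hypothetical placeholders.

```python
import statistics

latencies_s = [0.12, 0.15, 0.11, 0.95, 0.13]  # hypothetical per-request wall times
tokens_out = [128, 140, 120, 135, 130]        # hypothetical tokens generated per request

p99 = statistics.quantiles(latencies_s, n=100)[98]  # 99th-percentile latency
tokens_per_s = sum(tokens_out) / sum(latencies_s)   # aggregate rate, assuming serial requests

print(f"throughput={tokens_per_s:.0f} tokens/s  p99={p99 * 1e3:.0f} ms")
```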
References
1. PyTorch DevCon 2025, High-Performance Inference with Inductor.