Throughput measures how much work an AI system completes per unit time.
Key Throughput Metrics
- Tokens generated per second (tokens/s) for LLM inference.
- Requests served per minute for API endpoints.
- Images classified per second for vision models.
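In practice, tokens/s is measured by timing a generation call end to end. The snippet below is a minimal sketch of that pattern; `generate` is a hypothetical stand-in for any model call that returns generated token IDs.

```python
import time

def generate(prompt: str) -> list[int]:
    """Hypothetical stand-in for a model call returning generated token IDs."""
    return list(range(256))  # placeholder output; swap in a real model call

start = time.perf_counter()
tokens = generate("Explain throughput in one paragraph.")
elapsed = time.perf_counter() - start
print(f"{len(tokens) / elapsed:.1f} tokens/s")
```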
Factors Influencing Throughput
- Batch size and padding efficiency (see the padding sketch after this list).
- Model size and numerical precision.
- GPU utilization and memory bandwidth.
- Network latencies in distributed setups.
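To make the padding factor concrete, here is a minimal sketch of padding efficiency: the fraction of a right-padded batch occupied by real tokens rather than padding. The function name and example lengths are illustrative.

```python
def padding_efficiency(seq_lens: list[int]) -> float:
    """Fraction of a right-padded batch occupied by real tokens."""
    max_len = max(seq_lens)
    return sum(seq_lens) / (len(seq_lens) * max_len)

# One long sequence forces heavy padding on the short ones.
print(padding_efficiency([512, 60, 75, 90]))  # ≈ 0.36
```

Low efficiency means compute is spent on padding; sorting requests by length or batching continuously raises it.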
Design Trade-offs
- Increasing batch size raises throughput but can inflate latency and memory usage (see the sweep sketch after this list).
- Mixed precision raises achievable FLOPS but may introduce rounding error if not carefully calibrated.
- Token streaming improves user-perceived speed yet reduces aggregate tokens/s due to per-token flush overhead.
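The batch-size trade-off is easy to observe empirically. Below is a minimal sketch, assuming PyTorch and a toy linear layer (not any model discussed here), that sweeps batch size and reports per-batch latency against samples/s.

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device).eval()
iters = 50

for batch in (1, 8, 32, 128):
    x = torch.randn(batch, 1024, device=device)
    if device == "cuda":
        torch.cuda.synchronize()  # ensure prior GPU work is done before timing
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(iters):
            model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f"batch={batch:4d}  latency={elapsed / iters * 1e3:6.2f} ms/iter  "
          f"throughput={batch * iters / elapsed:10.0f} samples/s")
```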
Current Trends (2025)
- Speculative decoding (a small draft model proposes tokens that the target model verifies in parallel) is reported to yield roughly 1.7× tokens/s on GPT-4-Turbo.
- Kernel fusion libraries such as TorchInductor auto-merge small ops to reach roughly 80 % of theoretical peak FLOPS [1].
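For context on the fusion point: TorchInductor is the default backend of `torch.compile` in PyTorch 2.x, and a minimal invocation looks like the sketch below. The toy model is illustrative; realized FLOPS utilization depends on hardware and tensor shapes.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.GELU(),
    torch.nn.LayerNorm(512),
)

compiled = torch.compile(model)       # TorchInductor backend fuses eligible small ops
out = compiled(torch.randn(8, 512))   # first call triggers compilation; later calls reuse it
```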
Implementation Tips
- Report throughput alongside p99 latency to avoid optimizing one at the cost of the other (see the sketch after this list).
- Use synthetic prompts at maximum context length to stress-test the tokens/s ceiling.
- Monitor tokens/Joule for energy-aware tuning.
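A minimal sketch of the first tip, reporting aggregate tokens/s next to p99 latency; the timing and token counts are hypothetical placeholders.

```python
import statistics

latencies_s = [0.12, 0.15, 0.11, 0.95, 0.13]  # hypothetical per-request wall times
tokens_out = [128, 140, 120, 135, 130]        # hypothetical tokens generated per request

p99 = statistics.quantiles(latencies_s, n=100)[98]  # 99th-percentile latency
tokens_per_s = sum(tokens_out) / sum(latencies_s)   # aggregate rate, assuming serial requests

print(f"throughput={tokens_per_s:.0f} tokens/s  p99={p99 * 1e3:.0f} ms")
```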
References
1. PyTorch DevCon 2025, High-Performance Inference with Inductor.