Throughput

Benched.ai Editorial Team

Throughput measures how much work an AI system completes per unit time—commonly tokens generated per second, requests per minute, or images classified per second.
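
A quick way to ground the definition: time one generation call and divide the tokens produced by elapsed seconds. The sketch below assumes a hypothetical `generate` callable that returns a list of output tokens.

```python
import time

def tokens_per_second(generate, prompt: str) -> float:
    """Measure single-request throughput.

    `generate` is a placeholder for your model's generation call;
    it is assumed to return the list of tokens it produced.
    """
    start = time.perf_counter()
    output_tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(output_tokens) / elapsed
```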

  Key Throughput Metrics

  Metric                Training                       Inference
  ------                --------                       ---------
  Tokens / GPU second   Primary for LLM pre-training   Less relevant
  Samples / second      Vision, speech networks        Batch-size dependent
  Requests / minute     —                              Chat assistant APIs
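
Tokens per GPU-second normalizes training throughput by accelerator count, so runs on different cluster sizes stay comparable. A minimal sketch with illustrative (not measured) numbers:

```python
def tokens_per_gpu_second(global_batch_tokens: int, num_gpus: int,
                          step_seconds: float) -> float:
    """Training throughput normalized by GPU count."""
    return global_batch_tokens / (num_gpus * step_seconds)

# Illustrative numbers: a 4M-token global batch on 64 GPUs at
# 12 s per step works out to ~5,200 tokens per GPU-second.
print(tokens_per_gpu_second(4_000_000, 64, 12.0))
```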

  Factors Influencing Throughput

  1. Batch size and padding efficiency (quantified in the sketch after this list).
  2. Model size and numerical precision.
  3. GPU utilization and memory bandwidth.
  4. Network latency in distributed setups.
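
On factor 1: when a batch is padded to its longest sequence, every pad slot consumes compute without producing useful output. A small sketch (function name and numbers are illustrative):

```python
def padding_efficiency(seq_lengths: list[int]) -> float:
    """Fraction of batch compute spent on real (non-pad) tokens
    when every sequence is padded to the longest one."""
    slots = max(seq_lengths) * len(seq_lengths)
    return sum(seq_lengths) / slots

# One long sequence drags the whole batch: here only ~32 % of the
# padded slots hold real tokens.
print(padding_efficiency([512, 60, 45, 30]))
```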

  Design Trade-offs

  • Increasing batch size raises throughput but can inflate latency and memory (see the toy cost model after this list).
  • Mixed precision improves FLOPS but may introduce rounding error if not calibrated.
  • Token streaming improves user-perceived speed yet reduces aggregate tokens/s due to flush overhead.
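
A toy cost model makes the first trade-off visible. Assume, purely for illustration, that each batch pays a fixed launch cost plus a per-item cost: throughput climbs with batch size, but so does per-request latency.

```python
def batch_tradeoff(batch_size: int, fixed_s: float = 0.050,
                   per_item_s: float = 0.002) -> tuple[float, float]:
    """Toy model with assumed costs (not measured): returns
    (throughput in requests/s, batch latency in seconds)."""
    latency = fixed_s + batch_size * per_item_s
    return batch_size / latency, latency

for b in (1, 8, 64):
    tput, lat = batch_tradeoff(b)
    print(f"batch={b:3d}  throughput={tput:6.1f} req/s  latency={lat * 1e3:4.0f} ms")
```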

  Current Trends (2025)

  • Speculative decoding combined with diffusion cache yields 1.7× tokens/s on GPT-4-Turbo (a minimal sketch of the speculative loop follows this list).
  • Kernel-fusion compilers such as TorchInductor auto-merge small ops to reach 80 % of theoretical FLOPS¹.
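
For orientation, here is a heavily simplified greedy version of the speculative loop: a cheap draft model proposes k tokens and the target model keeps the longest prefix it agrees with. `draft_next` and `target_next` are placeholder callables; a real system verifies all k positions in one batched forward pass rather than token by token.

```python
def speculative_step(draft_next, target_next,
                     context: list[int], k: int = 4) -> list[int]:
    """One greedy speculative-decoding round (simplified sketch)."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    ctx = list(context)
    proposal = []
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # Verify phase: keep the longest prefix the target agrees with.
    ctx = list(context)
    accepted = []
    for tok in proposal:
        if target_next(ctx) != tok:
            break
        accepted.append(tok)
        ctx.append(tok)

    # On a mismatch, emit the target's own token so decoding
    # always makes progress.
    if len(accepted) < k:
        accepted.append(target_next(ctx))
    return accepted
```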

  Implementation Tips

  1. Report throughput alongside p99 latency to avoid optimizing one at the cost of the other (see the reporting sketch after this list).
  2. Use synthetic prompts at maximum context to stress-test tokens/s ceiling.
  3. Monitor tokens/Joule for energy-aware tuning.
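
Tips 1 and 3 combine naturally into one report. A sketch with illustrative inputs: per-request latencies and token counts over a measured wall-clock window, plus optional energy in Joules.

```python
import math

def summarize(latencies_s: list[float], tokens: list[int],
              wall_seconds: float, joules: float | None = None) -> dict:
    """Aggregate tokens/s next to p99 latency (nearest-rank),
    plus tokens/Joule when energy was measured."""
    ordered = sorted(latencies_s)
    p99 = ordered[max(0, math.ceil(0.99 * len(ordered)) - 1)]
    report = {"tokens_per_s": sum(tokens) / wall_seconds,
              "p99_latency_s": p99}
    if joules is not None:
        report["tokens_per_joule"] = sum(tokens) / joules
    return report

# Illustrative numbers: one slow outlier dominates p99 even though
# aggregate throughput looks healthy.
print(summarize([0.21, 0.25, 0.23, 0.90], [64, 70, 61, 66],
                wall_seconds=2.0, joules=350.0))
```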

  References

  1. PyTorch DevCon 2025, High-Performance Inference with Inductor.