Batch inference processes multiple inputs in a single model forward pass, amortizing overhead and improving throughput.
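As a minimal sketch of that idea (assuming PyTorch and a model that accepts padded `input_ids` plus an `attention_mask`; `model` and `tokenized_requests` are placeholders, not any specific framework's API), several requests are stacked into one tensor and served by a single forward pass:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def batched_forward(model, tokenized_requests, pad_id=0):
    """Run one forward pass over several requests instead of one pass each.

    tokenized_requests: list of 1-D LongTensors of token IDs (variable length).
    Returns logits of shape (batch, max_len, vocab).
    """
    # Pad variable-length requests into a single (batch, max_len) tensor.
    input_ids = pad_sequence(tokenized_requests, batch_first=True,
                             padding_value=pad_id)
    # Mask so the model can ignore the padded positions.
    attention_mask = (input_ids != pad_id).long()
    with torch.no_grad():
        # One pass over the whole batch amortizes per-call overhead
        # (dispatch, weight reads, kernel launches) across all requests.
        logits = model(input_ids, attention_mask=attention_mask)
    return logits
```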
Throughput vs Latency
Batching raises aggregate throughput (tokens processed per second across all requests) at the cost of per-request latency, because each request waits for a batch to form before it is scheduled.
Scheduling Strategies
Most servers use dynamic batching: requests accumulate in a queue until either a maximum batch size or a time window is reached, then the whole group is dispatched together, as sketched below.
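A minimal sketch of such a size- and time-bounded batcher, assuming an asyncio-based server; `run_batch` is a hypothetical async callable supplied by the caller, not part of any particular framework:

```python
import asyncio
import time

class DynamicBatcher:
    """Collect requests until a size cap or a time window closes the batch."""

    def __init__(self, run_batch, max_batch_size=16, max_wait_s=0.05):
        self.run_batch = run_batch          # async callable: list of requests -> list of results
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s        # e.g. the 50 ms window from the tips below
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, request):
        """Called by request handlers; resolves once the batch containing it finishes."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((request, future))
        return await future

    async def run(self):
        while True:
            # Wait for the first request, then keep gathering until either
            # the batch is full or the time window expires.
            batch = [await self.queue.get()]
            deadline = time.monotonic() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            requests, futures = zip(*batch)
            results = await self.run_batch(list(requests))
            for future, result in zip(futures, results):
                future.set_result(result)
```

Request handlers call `submit()` concurrently, while a single `run()` task drains the queue and dispatches batches to the model.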
Design Trade-offs
- Larger batches yield lower cost per token but lengthen service time.
- Mixing user tenants risks one slow request delaying others (head-of-line blocking).
- Batching prompts of very different lengths wastes compute on padding tokens; grouping requests of similar length mitigates this (see the sketch after this list).
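A rough sketch of length bucketing, assuming requests arrive already tokenized; the bucket edges are illustrative, not taken from any particular serving stack:

```python
from collections import defaultdict

def bucket_by_length(requests, bucket_edges=(128, 256, 512, 1024, 2048)):
    """Group tokenized requests so each batch pads to a similar length.

    requests: iterable of (request_id, token_ids) pairs.
    Returns {bucket_upper_bound: [(request_id, token_ids), ...]}.
    """
    buckets = defaultdict(list)
    for request_id, token_ids in requests:
        # Place the request in the smallest bucket that fits it.
        bound = next((edge for edge in bucket_edges if len(token_ids) <= edge),
                     bucket_edges[-1])
        buckets[bound].append((request_id, token_ids))
    return buckets

def padding_waste(batch):
    """Fraction of positions in a padded batch that would hold padding."""
    lengths = [len(token_ids) for _, token_ids in batch]
    padded_positions = max(lengths) * len(lengths)
    return 1.0 - sum(lengths) / padded_positions
```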
Current Trends (2025)
- Model-serving frameworks auto-tune batch size per GPU, adjusting it to keep recently observed latency within target.
- Continuous (iteration-level) batching lets servers interleave generation for different requests without large padding (see the sketch after this list).
- Multi-model batching (M2B) co-hosts small models so that their requests share the same kernel launches.
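A conceptual sketch of that interleaving, where `decode_step` is a hypothetical stand-in for the engine call that advances every active sequence by one token, and request objects are assumed to expose `is_finished()`:

```python
def continuous_batching_loop(waiting, decode_step, max_active=32):
    """Interleave decoding across requests at token granularity.

    waiting: list of pending request objects (treated as a FIFO queue).
    decode_step: hypothetical engine call that advances every sequence in
        `active` by one token in a single fused forward pass.
    """
    active = []
    while waiting or active:
        # Admit new requests the moment a slot frees up, rather than
        # waiting for an entire static batch to drain.
        while waiting and len(active) < max_active:
            active.append(waiting.pop(0))
        decode_step(active)  # one token for every active sequence
        # Retire sequences that emitted end-of-sequence or hit their length limit.
        active = [seq for seq in active if not seq.is_finished()]
```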
Implementation Tips
- Start with a 50 ms batching window; tune it empirically per model.
- Pad sequences to the nearest multiple of 8 tokens to align tensor core tiles.
- Drop to batch size 1 when the p99 latency SLA has been violated for more than 30 s.
- Emit per-batch metrics: size, average token count, compute time (see the sketch after this list).
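A small sketch combining the padding and metrics tips; the metric names, the `run_batch` callable, and the use of the standard logging module are illustrative assumptions, not a specific framework's API:

```python
import logging
import time

logger = logging.getLogger("batch_inference")

def pad_to_multiple(length, multiple=8):
    """Round a sequence length up so padded tensors align with tensor-core tiles."""
    return ((length + multiple - 1) // multiple) * multiple

def run_and_record(run_batch, batch_token_ids):
    """Execute one batch and emit the per-batch metrics suggested above.

    run_batch: hypothetical callable taking (token_id_lists, padded_len).
    batch_token_ids: list of token-ID lists, one per request.
    """
    padded_len = pad_to_multiple(max(len(ids) for ids in batch_token_ids))
    start = time.perf_counter()
    outputs = run_batch(batch_token_ids, padded_len)
    compute_time = time.perf_counter() - start
    logger.info(
        "batch_size=%d avg_tokens=%.1f padded_len=%d compute_time_ms=%.1f",
        len(batch_token_ids),
        sum(len(ids) for ids in batch_token_ids) / len(batch_token_ids),
        padded_len,
        compute_time * 1000,
    )
    return outputs
```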