An inference pipeline is the sequence of steps that turns a user request into a model output and returns that output to the client.
Reference Architecture
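As a rough illustration of the request path this section refers to, here is a minimal sketch in Python. The stage names (preprocess, schedule, generate, postprocess) and their bodies are illustrative assumptions drawn from the bullets later in this section, not a prescribed implementation.

```python
# Minimal sketch of a request path through an inference pipeline.
# Stage names and bodies are illustrative placeholders, not a real serving stack.
from dataclasses import dataclass, field
from typing import Iterator
import uuid


@dataclass
class Request:
    prompt: str
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def preprocess(req: Request) -> list[int]:
    """CPU-side tokenization (placeholder: whitespace 'tokens' hashed to ids)."""
    return [hash(tok) % 50_000 for tok in req.prompt.split()]


def schedule(token_ids: list[int]) -> list[int]:
    """Placeholder for the scheduler that batches requests onto a GPU worker."""
    return token_ids  # a real scheduler would enqueue, batch, and dispatch here


def generate(token_ids: list[int], max_new_tokens: int = 8) -> Iterator[int]:
    """Placeholder decode loop; yields one 'token id' at a time (streaming)."""
    for i in range(max_new_tokens):
        yield (sum(token_ids) + i) % 50_000


def postprocess(token_id: int) -> str:
    """Detokenize a single generated id back to a text chunk (placeholder)."""
    return f"<tok{token_id}>"


def handle(req: Request) -> Iterator[str]:
    """End-to-end path: request in, streamed text chunks out."""
    ids = preprocess(req)
    scheduled = schedule(ids)
    for tok in generate(scheduled):
        yield postprocess(tok)


if __name__ == "__main__":
    for chunk in handle(Request(prompt="hello inference pipeline")):
        print(chunk, end=" ")
```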
Latency Budget Example (Chat, 1k tokens)
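No measured numbers accompany this heading, so the breakdown below uses purely illustrative placeholder figures to show how a budget for a roughly 1k-token chat response can be decomposed into queueing, prefill, per-token decode, and network time; none of the values are benchmarks of any particular system.

```python
# Illustrative latency-budget arithmetic for a chat response of ~1k output tokens.
# All numbers are hypothetical placeholders chosen to show the decomposition,
# not measurements.
queueing_ms = 20            # time waiting in the scheduler queue
prefill_ms = 150            # prompt processing (roughly, time to first token)
per_token_decode_ms = 30    # incremental decode cost per generated token
output_tokens = 1_000
network_ms = 40             # request/response transport overhead

time_to_first_token_ms = queueing_ms + prefill_ms + network_ms
total_ms = time_to_first_token_ms + output_tokens * per_token_decode_ms

print(f"time to first token: {time_to_first_token_ms} ms")
print(f"total generation:    {total_ms / 1000:.1f} s")
# With streaming, the user perceives the first number; without it, the second.
```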
Design Trade-offs
- A centralized scheduler maximizes GPU utilization but is a single point of failure.
- Pre-processing on CPU saves GPU cycles but adds host-to-device data movement.
- Streaming reduces perceived latency but complicates retry logic (see the resume sketch after this list).
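To make the streaming trade-off concrete, here is a small sketch of why retries get harder once partial output has been delivered: the client must track a cursor so a retry resumes rather than restarts. The function names and the simulated failure are illustrative only; a real system also needs the server to resume or regenerate deterministically.

```python
# Sketch of streaming plus resume-on-retry. Once chunks have been sent, a retry
# must either continue from the last delivered token or discard partial output.
from typing import Iterator


def stream_tokens(prompt: str, start: int = 0, fail_at: int | None = None) -> Iterator[str]:
    """Yield pseudo-tokens one at a time, optionally failing mid-stream."""
    tokens = [f"tok{i}" for i in range(8)]
    for i, tok in enumerate(tokens[start:], start=start):
        if fail_at is not None and i == fail_at:
            raise ConnectionError("stream dropped mid-response")
        yield tok


def client_with_resume(prompt: str) -> list[str]:
    """Client keeps a cursor of delivered tokens so a retry resumes, not restarts."""
    received: list[str] = []
    fail_at: int | None = 5          # simulate one mid-stream failure
    while True:
        try:
            for tok in stream_tokens(prompt, start=len(received), fail_at=fail_at):
                received.append(tok)
            return received
        except ConnectionError:
            fail_at = None           # retry, resuming from len(received)


print(client_with_resume("hello"))
```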
Current Trends (2025)
- vLLM's KV-cache multiplexing (paged attention) allows roughly 10× more concurrent streams per GPU (a usage sketch follows this list).
- gRPC over HTTP/3 reduces tail latency compared with REST.
- WASM pre-processing modules run inside Envoy filters with low overhead.
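As one concrete data point for the first bullet, vLLM's offline API batches many prompts over a shared, paged KV-cache. The model name, prompt set, and sampling settings below are placeholders, and the exact API surface may differ between vLLM versions, so treat this as a sketch rather than a reference usage.

```python
# Sketch of serving many concurrent prompts through vLLM's paged KV-cache.
# Model name and sampling parameters are placeholders; check the vLLM docs
# for the API of the version you run.
from vllm import LLM, SamplingParams

prompts = [f"Summarize document {i} in one sentence." for i in range(64)]
sampling = SamplingParams(temperature=0.7, max_tokens=128)

llm = LLM(model="facebook/opt-1.3b")        # placeholder model
outputs = llm.generate(prompts, sampling)   # continuous batching over shared KV blocks

for out in outputs:
    print(out.outputs[0].text[:80])
```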
Implementation Tips
- Tag each request with a trace ID that propagates through every stage (see the sketch after this list).
- Use batched tokenization to amortize per-request CPU overhead.
- Monitor queue depth and generated tokens/s, and feed both into the autoscaler as back-pressure signals.
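The sketch below combines the first two tips: every record carries a trace_id that is logged at each stage, and tokenization runs over a whole batch rather than per request. The log format and the whitespace "tokenizer" are illustrative placeholders; a real stack would call a fast tokenizer (for example, a Hugging Face tokenizer over the full list of prompts) in place of the split.

```python
# Sketch of trace-ID propagation plus batched tokenization.
# Log format and the placeholder tokenizer are assumptions for illustration.
import logging
import uuid
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("pipeline")


@dataclass
class TracedRequest:
    prompt: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def batch_tokenize(batch: list[TracedRequest]) -> dict[str, list[str]]:
    """Tokenize a whole batch at once; returns tokens keyed by trace_id."""
    result: dict[str, list[str]] = {}
    for req in batch:
        log.info("trace=%s stage=tokenize prompt_chars=%d", req.trace_id, len(req.prompt))
        result[req.trace_id] = req.prompt.split()   # placeholder tokenizer
    return result


batch = [TracedRequest("hello world"), TracedRequest("latency budgets matter")]
tokens_by_trace = batch_tokenize(batch)
for trace_id, toks in tokens_by_trace.items():
    log.info("trace=%s stage=schedule tokens=%d", trace_id, len(toks))
```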