End-to-end response time (E2E RT) is the wall-clock duration from the moment a user submits a request until the client receives the last byte of the response.
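The definition above can be sketched as a small measurement harness. The streaming source here is simulated (a generator with artificial per-chunk delays) so the example is self-contained; `stream_response` is a hypothetical stand-in for a real streaming client call, not an API from any particular SDK.

```python
import time

def stream_response(chunks, delay_s=0.01):
    """Hypothetical stand-in for a streaming client call:
    yields response chunks with an artificial per-chunk delay."""
    for chunk in chunks:
        time.sleep(delay_s)
        yield chunk

def measure_e2e(chunks):
    """Measure time-to-first-byte and end-to-end response time."""
    t_submit = time.monotonic()
    ttfb = None
    for chunk in stream_response(chunks):
        if ttfb is None:
            ttfb = time.monotonic() - t_submit  # first byte arrives
    e2e_rt = time.monotonic() - t_submit        # last byte arrives
    return ttfb, e2e_rt

ttfb, e2e = measure_e2e(["Hello", ", ", "world"])
print(f"TTFB={ttfb*1000:.1f} ms, E2E RT={e2e*1000:.1f} ms")
```

Recording both numbers matters: with streaming, time-to-first-byte and E2E RT can diverge widely, and each drives a different part of the user experience.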
Latency Component Breakdown
Target Budgets (chat, 1k output tokens)
Design Trade-offs
- Larger batches lower cost per token but raise queue delay.
- Streaming roughly halves perceived latency by delivering tokens as they are generated, but adds per-chunk socket overhead.
- Edge inference reduces transit time yet limits GPU choice.
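The batching trade-off in the first bullet can be made concrete with some back-of-the-envelope arithmetic. All numbers below (arrival rate, per-step cost) are illustrative assumptions, not figures from the text: per-request cost falls roughly as 1/batch_size, while average queue delay grows with the time needed to accumulate a full batch.

```python
# Illustrative batching arithmetic. ARRIVAL_RATE and STEP_COST_USD are
# assumed values for the sketch, not measurements.
ARRIVAL_RATE = 50.0      # requests/second (assumed)
STEP_COST_USD = 0.002    # cost of one batched forward pass (assumed)

def trade_off(batch_size):
    # Fixed step cost amortized over the batch.
    cost_per_request = STEP_COST_USD / batch_size
    # Average wait for a batch to fill: requests arrive uniformly, so a
    # request waits (batch_size - 1) / (2 * arrival_rate) on average.
    queue_delay_s = (batch_size - 1) / (2 * ARRIVAL_RATE)
    return cost_per_request, queue_delay_s

for b in (1, 8, 32):
    cost, delay = trade_off(b)
    print(f"batch={b:3d}  cost/request=${cost:.5f}  queue delay={delay*1000:.0f} ms")
```

The crossover point depends on traffic: at high arrival rates batches fill quickly and the queue-delay penalty shrinks, which is why the same batch size can be a good choice at peak and a poor one off-peak.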
Current Trends (2025)
- Kernel fusion and speculative decoding shrink compute latency by 30–50%.
- gRPC over HTTP/3 (QUIC) avoids transport-level head-of-line blocking, trimming tail latencies.
- Client SDKs pre-allocate chat bubbles based on token rate predictions.
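The last bullet can be sketched as a minimal prediction routine: estimate the token rate from the arrival times of tokens received so far, then extrapolate to the expected output length. The function name and the expected-length input are assumptions for illustration; real SDKs will differ.

```python
# Sketch of client-side token-rate prediction, as a hypothetical SDK
# might use it to pre-size a chat bubble.

def predict_remaining_seconds(timestamps, expected_total_tokens):
    """timestamps: arrival times (seconds) of tokens received so far."""
    received = len(timestamps)
    if received < 2 or received >= expected_total_tokens:
        return 0.0  # too little data to estimate, or already done
    elapsed = timestamps[-1] - timestamps[0]
    rate = (received - 1) / elapsed  # tokens/second over the observed prefix
    return (expected_total_tokens - received) / rate

# 5 tokens arrived at a steady 20 tokens/s; 95 more are expected.
ts = [0.0, 0.05, 0.10, 0.15, 0.20]
print(predict_remaining_seconds(ts, 100))  # 4.75 seconds at 20 tokens/s
```

A production client would smooth the rate estimate (e.g. an exponential moving average) since decode speed varies with batch composition on the server.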
Implementation Tips
- Use distributed tracing to attribute latency per segment.
- Alert on p95, not the average: users notice tail delays.
- Budget network and compute separately when negotiating provider SLOs.
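The first two tips above can be combined in a short sketch: given traces that record a duration per segment, compute a per-segment p95 and alert on it. The segment names, sample values, and the 500 ms threshold are all illustrative assumptions, not figures from the text.

```python
import math

# Hypothetical traces: duration in ms attributed to each segment.
traces = [
    {"network": 40, "queue": 10,  "compute": 300},
    {"network": 45, "queue": 200, "compute": 310},
    {"network": 42, "queue": 15,  "compute": 305},
    {"network": 41, "queue": 12,  "compute": 900},  # tail outlier
]

def p95(samples):
    """Nearest-rank p95: smallest value covering 95% of samples."""
    s = sorted(samples)
    k = max(0, math.ceil(0.95 * len(s)) - 1)
    return s[k]

ALERT_MS = 500  # assumed per-segment budget for the sketch

for segment in ("network", "queue", "compute"):
    value = p95([t[segment] for t in traces])
    flag = "ALERT" if value > ALERT_MS else "ok"
    print(f"{segment:8s} p95={value:5.0f} ms  {flag}")
```

Attributing the p95 per segment, rather than only end-to-end, shows which party owns the regression: here the compute segment trips the alert while network and queue stay within budget, which is exactly the separation the SLO-negotiation tip relies on.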