Latency metrics quantify the delay an AI system introduces between a user's request and the final response. Accurate measurement and interpretation guide capacity planning, SLA definitions, and user-experience improvements.
Definition and Scope
Latency is typically decomposed into several sequential components (a per-stage timing sketch follows the list):
- Network transit (client ↔ edge POP)
- Load-balancer queuing
- Model orchestration overhead (auth, routing, feature fetching)
- Inference runtime (GPU/CPU execution)
- Post-processing and serialization
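A minimal per-stage timing sketch, assuming a single-process service; the stage names, the sleep placeholders, and the `StageTimer` helper are illustrative rather than a fixed schema. Client↔edge network transit can only be measured at the client or edge, so it is omitted here.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates per-stage wall-clock durations (milliseconds) for one request."""

    def __init__(self):
        self.stages = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.stages[name] = (time.perf_counter() - start) * 1000.0


def handle_request(timer):
    # Placeholder bodies stand in for real pipeline work; each block maps to
    # one latency component from the list above.
    with timer.stage("lb_queue"):
        time.sleep(0.002)   # load-balancer queuing
    with timer.stage("orchestration"):
        time.sleep(0.003)   # auth, routing, feature fetching
    with timer.stage("inference"):
        time.sleep(0.050)   # model execution on GPU/CPU
    with timer.stage("postprocess"):
        time.sleep(0.001)   # detokenization, serialization


timer = StageTimer()
handle_request(timer)
print(timer.stages)  # e.g. {'lb_queue': 2.1, 'orchestration': 3.0, ...}
```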
Common Metrics
- Time to first token (TTFT): request arrival to the first streamed token; dominates perceived responsiveness for interactive workloads.
- Time per output token (inter-token latency): average gap between streamed tokens during decode.
- End-to-end latency: request arrival to the last byte of the response.
- Tail percentiles (p50/p95/p99): distribution summaries used for SLAs, since averages hide outliers.
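A minimal computation sketch, assuming per-request arrival, first-token, and completion timestamps are already recorded; the sample records and the nearest-rank percentile helper are illustrative.

```python
import statistics


def percentile(samples, q):
    """Nearest-rank percentile (q in [0, 100]) of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(q / 100.0 * (len(ordered) - 1))))
    return ordered[rank]


# Hypothetical per-request records: timestamps in seconds plus output length.
requests = [
    {"arrival": 0.000, "first_token": 0.180, "done": 1.450, "output_tokens": 64},
    {"arrival": 0.010, "first_token": 0.320, "done": 2.900, "output_tokens": 128},
    {"arrival": 0.020, "first_token": 0.150, "done": 0.900, "output_tokens": 32},
]

ttft = [r["first_token"] - r["arrival"] for r in requests]
e2e = [r["done"] - r["arrival"] for r in requests]
# Time per output token: the decode span divided by the tokens after the first.
tpot = [(r["done"] - r["first_token"]) / max(1, r["output_tokens"] - 1) for r in requests]

print(f"TTFT p50={percentile(ttft, 50):.3f}s p99={percentile(ttft, 99):.3f}s")
print(f"E2E  mean={statistics.mean(e2e):.3f}s p99={percentile(e2e, 99):.3f}s")
print(f"TPOT mean={statistics.mean(tpot) * 1000:.1f} ms/token")
```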
Measurement Best Practices
- Collect server-side timestamps for every pipeline stage; client-side probes alone mis-attribute network jitter to the service.
- Bucket metrics by model, region, and request shape (context length) to expose hotspots; a labeled-histogram sketch follows this list.
- Record cold vs warm starts separately—mixing hides provisioning bugs.
- Align clock sources with NTP to <1 ms skew.
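A sketch of the bucketing advice above using labeled histograms, assuming the prometheus_client package is available; the metric name, label names, bucket edges, and context-length buckets are illustrative choices, not a standard schema.

```python
from prometheus_client import Histogram, start_http_server

INFER_LATENCY = Histogram(
    "inference_latency_seconds",
    "Server-side inference latency by model, region, start type, and context bucket",
    ["model", "region", "start", "ctx_bucket"],
    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)


def ctx_bucket(prompt_tokens: int) -> str:
    """Coarse request-shape bucket so hotspots by context length stay visible."""
    for edge in (512, 2048, 8192, 32768):
        if prompt_tokens <= edge:
            return f"<={edge}"
    return ">32768"


def record(model: str, region: str, cold_start: bool, prompt_tokens: int, seconds: float):
    INFER_LATENCY.labels(
        model=model,
        region=region,
        start="cold" if cold_start else "warm",  # keep cold/warm starts separate
        ctx_bucket=ctx_bucket(prompt_tokens),
    ).observe(seconds)


if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for scraping
    record("example-model", "eu-west-1", cold_start=False, prompt_tokens=1800, seconds=0.42)
```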
Design Trade-offs
- Aggressive batching improves throughput but increases p99 queue latency (see the batching sketch after this list).
- Compression shrinks payloads at the cost of CPU time.
- Early-exit streaming cuts perceived latency to roughly the TTFT, yet may under-utilize GPUs when outputs are short.
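A minimal dynamic-batching sketch illustrating the first trade-off above; `max_batch` and `max_wait_ms` are hypothetical tuning knobs, and raising `max_wait_ms` buys throughput at the direct cost of p99 queue latency.

```python
import queue
import time


def collect_batch(q: "queue.Queue", max_batch: int = 8, max_wait_ms: float = 10.0):
    """Block for one request, then gather more until the batch fills or the deadline passes."""
    batch = [q.get()]  # first item: wait as long as needed
    deadline = time.perf_counter() + max_wait_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.perf_counter()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # deadline hit: run a partial batch rather than keep waiting
    return batch
```

Sweeping `max_wait_ms` against the observed p99 queue time is a cheap way to choose an operating point for a given traffic mix.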
Current Trends (2025)
- End-to-end distributed tracing (OpenTelemetry) adopted across inference stacks (see the tracing sketch after this list).
- GPU direct-reply: kernels stream tokens over RDMA, shaving ~35 ms off Host↔Device hops.
- Predictive autoscaling driven by quantile-regression forecasts lowers p99 by 22% in production [1].
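A stage-level tracing sketch for the OpenTelemetry trend above, assuming the opentelemetry-api and opentelemetry-sdk packages; the console exporter and the sleep placeholders stand in for whatever OTLP/collector wiring and real pipeline work a deployment has.

```python
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire a tracer that prints finished spans to stdout (replace with an OTLP exporter in production).
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("inference-service")

# One span per pipeline stage, nested under a root request span.
with tracer.start_as_current_span("request") as root:
    root.set_attribute("model", "example-model")
    with tracer.start_as_current_span("orchestration"):
        time.sleep(0.003)   # auth, routing, feature fetch
    with tracer.start_as_current_span("inference"):
        time.sleep(0.050)   # model execution
    with tracer.start_as_current_span("postprocess"):
        time.sleep(0.001)   # serialization
```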
Implementation Tips
- Expose latency histograms, not just averages, on dashboards.
- Alert only on p95 SLA breaches sustained for more than 5 minutes, so transient spikes are filtered out (sketched below).
- Include prompt token count in logs to correlate with decode time.
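A sketch tying the last two tips together: a sustained-breach check for the 5-minute rule and a structured log line that keeps prompt token count next to decode time. The SLA threshold, window length, and field names are illustrative, and the p95 samples are assumed to come from an external metrics pipeline.

```python
import json
import time
from collections import deque

P95_SLA_SECONDS = 1.0
BREACH_WINDOW_SECONDS = 5 * 60

# (timestamp, p95) samples from the last few minutes, newest last,
# e.g. appended by the metrics pipeline: recent_p95.append((time.time(), 1.3))
recent_p95 = deque(maxlen=600)


def should_alert(now: float) -> bool:
    """Fire only if p95 has breached the SLA continuously for the full window."""
    window = [(t, v) for t, v in recent_p95 if now - t <= BREACH_WINDOW_SECONDS]
    if not window or now - window[0][0] < BREACH_WINDOW_SECONDS:
        return False  # not enough history yet to call the breach sustained
    return all(v > P95_SLA_SECONDS for _, v in window)


def log_request(prompt_tokens: int, output_tokens: int, ttft_s: float, decode_s: float):
    """Structured log line that keeps prompt size alongside decode time for correlation."""
    print(json.dumps({
        "ts": time.time(),
        "prompt_tokens": prompt_tokens,
        "output_tokens": output_tokens,
        "ttft_s": round(ttft_s, 4),
        "decode_s": round(decode_s, 4),
    }))
```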