Time to First Token (TTFT) is the elapsed time between sending a request to a text-generation model and receiving the first output token. It is a critical UX metric for conversational interfaces.
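Measured concretely, TTFT is just the wall-clock delay before the first chunk of a streamed response arrives. A minimal sketch, where `fake_model_stream` is a hypothetical stand-in for a real streaming API (the 50 ms prefill delay is illustrative, not a benchmark):

```python
import time

def measure_ttft(stream):
    """Return (ttft_seconds, tokens) for a token stream.

    `stream` is any iterable that yields tokens as they are generated;
    TTFT is the delay until the first one is observed.
    """
    start = time.monotonic()
    tokens = []
    ttft = None
    for token in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token observed
        tokens.append(token)
    return ttft, tokens

def fake_model_stream(prefill_s=0.05, tokens=("Hello", " world")):
    """Stand-in for a real streaming API: pause for 'prefill', then emit."""
    time.sleep(prefill_s)  # simulates queueing + prompt prefill
    yield from tokens

ttft, toks = measure_ttft(fake_model_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, tokens: {toks}")
```

Using `time.monotonic()` rather than `time.time()` avoids skew from wall-clock adjustments during the measurement.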
Components of TTFT
- Network transit: request and response travel time between client and server.
- Queueing: time the request waits for a batch slot or a free accelerator.
- Prompt prefill: processing the entire input prompt before the first token can be sampled.
- First decode step: sampling and serializing the first output token.
Design Trade-offs
- Caching prompt prefixes reduces prefill work but consumes cache memory.
- Larger batch sizes improve throughput but extend TTFT due to queueing.
- Speculative decoding can cut overall generation latency substantially (gains of roughly 40% are commonly reported) but wastes compute when drafted tokens are rejected; TTFT itself sees less benefit, since it is dominated by prefill.
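The batching trade-off above can be illustrated with a toy model (all numbers hypothetical): if the server holds requests until a batch of size B is full before running prefill, earlier arrivals wait for later ones, so average TTFT grows with B even as per-request throughput improves.

```python
def avg_batch_wait_ms(batch_size, arrival_interval_ms, prefill_ms):
    """Average TTFT under a fill-the-batch policy (toy model).

    Requests arrive every `arrival_interval_ms`; the server holds them
    until `batch_size` have arrived, then runs one prefill pass.
    Request i (0-based) waits for the remaining arrivals plus prefill.
    """
    waits = [
        (batch_size - 1 - i) * arrival_interval_ms + prefill_ms
        for i in range(batch_size)
    ]
    return sum(waits) / batch_size

# Average TTFT climbs as the batch fills more slowly than it drains:
for b in (1, 4, 16):
    print(f"batch={b:2d}  avg TTFT={avg_batch_wait_ms(b, 25, 80):.1f} ms")
```

Real schedulers (e.g. continuous batching) avoid the worst of this by admitting requests into in-flight batches instead of waiting for a full one.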
Current Trends (2025)
- Compile-once CUDA graphs amortize launch overhead, shaving 15 ms per request.
- Edge POPs terminate TLS and relay to the back end over gRPC, saving two round trips.
- Token streaming CLIs show a typing animation synced to measured TTFT for transparency.[^1]
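The streaming-CLI idea can be sketched with a hypothetical `stream_to_terminal` helper: nothing is printed until the first token actually arrives, so the "typing" reflects measured TTFT rather than a canned delay.

```python
import sys
import time

def stream_to_terminal(token_iter, out=sys.stdout):
    """Print tokens as they arrive and return measured TTFT in ms.

    The first write happens only when the first real token lands,
    so the on-screen animation is honest about the model's latency.
    """
    start = time.monotonic()
    ttft_ms = None
    for tok in token_iter:
        if ttft_ms is None:
            ttft_ms = (time.monotonic() - start) * 1000
        out.write(tok)
        out.flush()  # show each token immediately, no buffering
    out.write("\n")
    return ttft_ms

# usage with any iterable of tokens:
ttft = stream_to_terminal(iter(["Hi", " there"]))
print(f"(first token after {ttft:.1f} ms)")
```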
Implementation Tips
- Measure TTFT separately from total latency in dashboards.
- Alert when TTFT p95 exceeds 500 ms; users perceive lag above half a second.
- For SSE streams, send headers immediately so the browser can start listening before the first token arrives.
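As a sketch of the SSE tip, a stdlib-only handler (endpoint and token source are illustrative) that flushes its headers before any token is ready, so the client's `EventSource` connection is open while the model is still in prefill:

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class SSEHandler(BaseHTTPRequestHandler):
    """Minimal Server-Sent Events endpoint (stdlib-only sketch)."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/event-stream")
        self.send_header("Cache-Control", "no-cache")
        self.end_headers()   # headers go on the wire immediately
        self.wfile.flush()
        for tok in ("Hello", " world"):  # stand-in for model tokens
            self.wfile.write(f"data: {json.dumps(tok)}\n\n".encode())
            self.wfile.flush()           # one event per token

    def log_message(self, *args):        # silence per-request logging
        pass
```

To serve: `ThreadingHTTPServer(("", 8000), SSEHandler).serve_forever()`. In a production stack the same principle applies: emit the response head as soon as the request is accepted, then stream events as tokens decode.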
References
[^1]: Stripe Dev Blog, "Designing Low-Latency Chat Interfaces," 2025.