Time to First Token (TTFT) is the elapsed time between sending a request to a text-generation model and receiving the first output token. It is a critical UX metric for conversational interfaces.
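Measured concretely, TTFT is just the wall-clock delay before the first chunk of a streamed response arrives. A minimal sketch, where `fake_model_stream` is a hypothetical stand-in for a real streaming API (the 50 ms prefill delay is illustrative, not a benchmark):

```python
import time

def measure_ttft(stream):
    """Return (ttft_seconds, tokens) for a token stream.

    `stream` is any iterable that yields tokens as they are generated;
    TTFT is the delay until the first one is observed.
    """
    start = time.monotonic()
    tokens = []
    ttft = None
    for token in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token observed
        tokens.append(token)
    return ttft, tokens

def fake_model_stream(prefill_s=0.05, tokens=("Hello", " world")):
    """Stand-in for a real streaming API: pause for 'prefill', then emit."""
    time.sleep(prefill_s)  # simulates queueing + prompt prefill
    yield from tokens

ttft, toks = measure_ttft(fake_model_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, tokens: {toks}")
```

Using `time.monotonic()` rather than `time.time()` avoids skew from wall-clock adjustments during the measurement.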
Components of TTFT
- Network transit: request and response travel time between client and server.
- Queueing: time the request waits for a batch slot or a free accelerator.
- Prompt prefill: processing the entire input prompt before the first token can be sampled.
- First decode step: sampling and serializing the first output token.
Design Trade-offs
- Caching prompt prefixes reduces prefill work but consumes cache memory.
- Larger batch sizes improve throughput but extend TTFT due to queueing.
- Speculative decoding can cut overall generation latency substantially (gains of roughly 40% are commonly reported) but wastes compute when drafted tokens are rejected; TTFT itself sees less benefit, since it is dominated by prefill.
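The batching trade-off above can be illustrated with a toy model (all numbers hypothetical): if the server holds requests until a batch of size B is full before running prefill, earlier arrivals wait for later ones, so average TTFT grows with B even as per-request throughput improves.

```python
def avg_batch_wait_ms(batch_size, arrival_interval_ms, prefill_ms):
    """Average TTFT under a fill-the-batch policy (toy model).

    Requests arrive every `arrival_interval_ms`; the server holds them
    until `batch_size` have arrived, then runs one prefill pass.
    Request i (0-based) waits for the remaining arrivals plus prefill.
    """
    waits = [
        (batch_size - 1 - i) * arrival_interval_ms + prefill_ms
        for i in range(batch_size)
    ]
    return sum(waits) / batch_size

# Average TTFT climbs as the batch fills more slowly than it drains:
for b in (1, 4, 16):
    print(f"batch={b:2d}  avg TTFT={avg_batch_wait_ms(b, 25, 80):.1f} ms")
```

Real schedulers (e.g. continuous batching) avoid the worst of this by admitting requests into in-flight batches instead of waiting for a full one.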
Current Trends (2025)
- Compile-once CUDA graphs amortize launch overhead, shaving 15 ms per request.
- Edge POPs terminate TLS and relay to the back end over gRPC, saving two round trips.
- Token streaming CLIs show a typing animation synced to measured TTFT for transparency.[^1]
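The streaming-CLI idea can be sketched with a hypothetical `stream_to_terminal` helper: nothing is printed until the first token actually arrives, so the "typing" reflects measured TTFT rather than a canned delay.

```python
import sys
import time

def stream_to_terminal(token_iter, out=sys.stdout):
    """Print tokens as they arrive and return measured TTFT in ms.

    The first write happens only when the first real token lands,
    so the on-screen animation is honest about the model's latency.
    """
    start = time.monotonic()
    ttft_ms = None
    for tok in token_iter:
        if ttft_ms is None:
            ttft_ms = (time.monotonic() - start) * 1000
        out.write(tok)
        out.flush()  # show each token immediately, no buffering
    out.write("\n")
    return ttft_ms

# usage with any iterable of tokens:
ttft = stream_to_terminal(iter(["Hi", " there"]))
print(f"(first token after {ttft:.1f} ms)")
```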
Implementation Tips
- Measure TTFT separately from total latency in dashboards.
- Alert when TTFT p95 exceeds 500 ms; users perceive lag above half a second.
- For SSE streams, send headers immediately so the browser can start listening before the first token arrives.
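As a sketch of the SSE tip, a stdlib-only handler (endpoint and token source are illustrative) that flushes its headers before any token is ready, so the client's `EventSource` connection is open while the model is still in prefill:

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class SSEHandler(BaseHTTPRequestHandler):
    """Minimal Server-Sent Events endpoint (stdlib-only sketch)."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/event-stream")
        self.send_header("Cache-Control", "no-cache")
        self.end_headers()   # headers go on the wire immediately
        self.wfile.flush()
        for tok in ("Hello", " world"):  # stand-in for model tokens
            self.wfile.write(f"data: {json.dumps(tok)}\n\n".encode())
            self.wfile.flush()           # one event per token

    def log_message(self, *args):        # silence per-request logging
        pass
```

To serve: `ThreadingHTTPServer(("", 8000), SSEHandler).serve_forever()`. In a production stack the same principle applies: emit the response head as soon as the request is accepted, then stream events as tokens decode.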
References
[^1]: Stripe Dev Blog, "Designing Low-Latency Chat Interfaces," 2025.