Time to First Token

Benched.ai Editorial Team

Time to First Token (TTFT) is the elapsed time between sending a request to a text-generation model and receiving the first output token. It is a critical UX metric for conversational interfaces.
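
In practice, TTFT is measured on the client by timing the gap between issuing a streaming request and the arrival of the first content-bearing chunk. The minimal Python sketch below assumes an OpenAI-compatible streaming endpoint reached through the openai client package; the model name and prompt are placeholders.

  import time

  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment

  def measure_ttft(prompt: str, model: str = "gpt-4o-mini") -> tuple[float, float]:
      """Return (ttft_s, total_s) for one streamed chat completion."""
      start = time.perf_counter()
      first_token_at = None
      stream = client.chat.completions.create(
          model=model,
          messages=[{"role": "user", "content": prompt}],
          stream=True,
      )
      for chunk in stream:
          delta = chunk.choices[0].delta if chunk.choices else None
          # The first chunk that carries content marks the end of the TTFT window.
          if first_token_at is None and delta and delta.content:
              first_token_at = time.perf_counter()
      total = time.perf_counter() - start
      ttft = (first_token_at - start) if first_token_at else float("nan")
      return ttft, total

  ttft_s, total_s = measure_ttft("Explain TTFT in one sentence.")
  print(f"TTFT: {ttft_s * 1000:.0f} ms, total latency: {total_s * 1000:.0f} ms")

Timing the first content-bearing delta, rather than the first chunk of any kind, avoids counting role-only or empty deltas as the "first token."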

  Components of TTFT

  Stage                    Typical Contribution
  Network RTT              20–70 ms
  Queue wait               0–200 ms
  Prompt preprocessing     5–30 ms
  First forward pass       80–400 ms
  Serialization & flush    5–15 ms

  Design Trade-offs

  • Caching prompts reduces preprocessing time but increases the memory held by the cache.
  • Larger batch sizes improve throughput but extend TTFT due to queueing (see the back-of-envelope sketch after this list).
  • Speculative decoding cuts TTFT by ~40% but may waste work when speculative guesses are discarded.
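
To see why batching stretches TTFT, a back-of-envelope model helps: if requests arrive at rate λ and the server waits to fill a batch of size B, an average request spends about (B − 1)/(2λ) waiting before its forward pass even starts. The Python sketch below is an illustrative toy model, not a measurement; the arrival rate, forward-pass time, and RTT figures are assumptions.

  def expected_ttft_ms(batch_size: int,
                       arrival_rate_per_s: float,
                       forward_pass_ms: float,
                       network_rtt_ms: float = 40.0) -> float:
      """Rough expected TTFT under a fill-the-batch-then-run serving model."""
      # Average wait for the batch to fill: half of the (B - 1) inter-arrival gaps.
      fill_wait_ms = (batch_size - 1) / (2.0 * arrival_rate_per_s) * 1000.0
      return network_rtt_ms + fill_wait_ms + forward_pass_ms

  for batch in (1, 4, 16, 64):
      print(f"batch={batch:3d}  expected TTFT ≈ "
            f"{expected_ttft_ms(batch, arrival_rate_per_s=100.0, forward_pass_ms=200.0):.0f} ms")

At 100 requests per second, batch size 64 adds roughly 315 ms of queue wait in this model, well beyond the typical queue-wait range in the table above.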

  Current Trends (2025)

  • Compile-once CUDA graphs amortize kernel-launch overhead, shaving 15 ms per request (see the capture/replay sketch after this list).
  • Edge POPs terminate TLS and relay to the back end over gRPC, saving two RTTs.
  • Token-streaming CLIs show a typing animation synced to measured TTFT for transparency [1].
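
One way the compile-once idea plays out is CUDA graph capture: the kernels of a forward pass are recorded once and later replayed with a single launch. The sketch below uses PyTorch's torch.cuda.CUDAGraph on a toy linear layer as a stand-in for a model's first forward pass; the layer, buffer shapes, and warm-up count are placeholders, and a CUDA device is required.

  import torch

  model = torch.nn.Linear(4096, 4096).cuda().eval()
  static_input = torch.randn(1, 4096, device="cuda")

  with torch.no_grad():
      # CUDA graph capture requires warm-up on a side stream first.
      side = torch.cuda.Stream()
      side.wait_stream(torch.cuda.current_stream())
      with torch.cuda.stream(side):
          for _ in range(3):
              model(static_input)
      torch.cuda.current_stream().wait_stream(side)

      # Capture once; later replays skip per-kernel launch overhead.
      graph = torch.cuda.CUDAGraph()
      with torch.cuda.graph(graph):
          static_output = model(static_input)

  def first_forward(request: torch.Tensor) -> torch.Tensor:
      static_input.copy_(request)   # write the new request into the captured buffer
      graph.replay()                # one launch replays every recorded kernel
      return static_output.clone()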

  Implementation Tips

  1. Measure TTFT separately from total latency in dashboards.
  2. Alert when TTFT p95 exceeds 500 ms; users perceive lag above half a second.
  3. For SSE streams, send response headers and an initial event immediately so the browser can start listening (see the sketch after this list).
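
A minimal server-side sketch of tip 3, assuming FastAPI with Server-Sent Events: the endpoint opens the stream and emits an SSE comment right away, so the client is already listening while the model runs its first forward pass. fetch_tokens is a hypothetical stand-in for the real model stream.

  import asyncio

  from fastapi import FastAPI
  from fastapi.responses import StreamingResponse

  app = FastAPI()

  async def fetch_tokens(prompt: str):
      # Hypothetical placeholder for a streaming model client.
      for token in ("Hello", ",", " world", "!"):
          await asyncio.sleep(0.1)
          yield token

  @app.get("/chat")
  async def chat(prompt: str):
      async def event_stream():
          # Emit an SSE comment immediately so the connection is confirmed
          # before the first token is ready.
          yield ": connected\n\n"
          async for token in fetch_tokens(prompt):
              yield f"data: {token}\n\n"
          yield "data: [DONE]\n\n"

      return StreamingResponse(event_stream(), media_type="text/event-stream")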

  References

  1. Stripe Dev Blog, "Designing Low-Latency Chat Interfaces," 2025.