In text generation APIs, streaming delivers tokens incrementally over an open HTTP or WebSocket connection, whereas non-streaming waits until the full completion is ready before sending a single response.
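The contrast can be sketched with a toy generator standing in for the model; all names here are illustrative, not a real vendor API:

```python
from typing import Iterator

def generate_tokens() -> Iterator[str]:
    # Stand-in for a model; a real client would read these off the wire.
    for tok in ["Stream", "ing ", "deliv", "ers ", "tokens ", "early."]:
        yield tok

def consume_streaming() -> list:
    # Streaming: each token is usable the moment it arrives.
    chunks = []
    for tok in generate_tokens():
        chunks.append(tok)  # render/forward the partial text here
    return chunks

def consume_blocking() -> str:
    # Non-streaming: nothing is usable until the full completion exists.
    return "".join(generate_tokens())
```

Both paths produce identical text; streaming only changes when it becomes visible.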
Latency Comparison
Benefits of Streaming
- Faster perceived responsiveness (chat UX).
- Enables cancellation mid-generation, saving cost.
- Allows progressive rendering on the front end.
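The cancellation benefit can be sketched as a client that stops reading and closes the stream once it has enough; the names are hypothetical, but closing a real HTTP stream has the same effect of halting paid generation:

```python
from typing import Iterator, List

def token_stream() -> Iterator[str]:
    # Hypothetical unbounded stream; stands in for a network response body.
    i = 0
    while True:
        yield f"tok{i} "
        i += 1

def read_until(stream: Iterator[str], max_tokens: int) -> List[str]:
    out = []
    for tok in stream:
        out.append(tok)
        if len(out) >= max_tokens:
            break
    stream.close()  # analogous to dropping the connection mid-generation
    return out
```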
Downsides of Streaming
- More complex client code to handle partial messages.
- Requires a long-lived connection (chunked HTTP/SSE, HTTP/2, or WebSocket); serverless edge functions may not support protocol upgrades or long-held responses.
- Harder to calculate exact token usage upfront for billing.
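To illustrate the first downside, here is a minimal sketch of the buffering a client needs when SSE-style events arrive split across arbitrary network chunk boundaries; the wire format shown is an assumption, not a specific vendor's:

```python
import json

def parse_sse_chunks(chunks):
    """Reassemble SSE 'data:' events from arbitrary network chunk boundaries.

    A minimal sketch: real clients must also handle comments, retry fields,
    and multi-line data events.
    """
    buffer = ""
    events = []
    for chunk in chunks:
        buffer += chunk
        while "\n\n" in buffer:  # SSE events are delimited by a blank line
            raw, buffer = buffer.split("\n\n", 1)
            for line in raw.splitlines():
                if line.startswith("data: "):
                    events.append(json.loads(line[len("data: "):]))
    return events
```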
Design Trade-offs
- Streaming benefits long outputs most; for short replies (under ~100 tokens), the latency gap is minimal.
- Non-streaming simplifies retry logic; streaming reconnections may duplicate text.
- Output-format constraints (JSON) risk partial invalid structures when streamed.
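The JSON point above can be sketched directly: every prefix of a streamed JSON object is itself invalid JSON, so a client must buffer and re-try parsing until the structure closes:

```python
import json

def try_parse(buffer: str):
    """Attempt to parse the accumulated stream as JSON; None while incomplete."""
    try:
        return json.loads(buffer)
    except json.JSONDecodeError:
        return None  # structure still partial; keep buffering

# Simulated streamed JSON arriving in fragments:
fragments = ['{"answer": ', '"42", ', '"done": true}']
buffer = ""
result = None
for frag in fragments:
    buffer += frag
    result = try_parse(buffer)  # only the final fragment yields a valid object
```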
Current Trends (2025)
- gRPC bidirectional streaming adopted for mobile SDKs with built-in flow-control.
- Server-push compression (Brotli) cuts bandwidth 30% for large multi-modal streams.
- Estimation headers (X-Projected-Tokens) give early cost hints before the stream ends.
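A client-side cost hint based on such a header might look like the sketch below; the helper and pricing model are hypothetical, and only the X-Projected-Tokens name comes from the trend above:

```python
from typing import Optional

def early_cost_hint(headers: dict, price_per_1k: float) -> Optional[float]:
    """Estimate cost from a projected-token header before the stream finishes.

    Hypothetical sketch: assumes the header carries an integer token count.
    """
    raw = headers.get("X-Projected-Tokens")
    if raw is None:
        return None  # server sent no projection; cost unknown until done
    return int(raw) / 1000 * price_per_1k
```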
Implementation Tips
- Flush tokens every ~40 ms to match typing speed but avoid network overhead.
- Send an explicit done flag in the final frame.
- Buffer the last two tokens on the client side to ensure sentence boundaries before vocalization.
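The flush-and-done tips can be sketched as a server-side framer. This is a simplification: the ~40 ms cadence is modeled as a fixed batch size rather than a wall-clock timer, and the frame shape is an assumption:

```python
FLUSH_INTERVAL_MS = 40  # target cadence; roughly matches reading/typing speed

def frame_tokens(tokens, tokens_per_frame: int = 3):
    """Group tokens into frames and mark the last frame with a done flag.

    Sketch only: a real server would flush on a timer (~FLUSH_INTERVAL_MS)
    or on batch size, whichever comes first.
    """
    frames = []
    batch = []
    for tok in tokens:
        batch.append(tok)
        if len(batch) >= tokens_per_frame:
            frames.append({"tokens": batch, "done": False})
            batch = []
    frames.append({"tokens": batch, "done": True})  # final frame carries done
    return frames
```

The explicit done flag lets clients distinguish a clean finish from a dropped connection, which matters for the retry/duplication trade-off noted earlier.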