A streaming API delivers model outputs over an open connection (HTTP SSE, gRPC, WebSocket) as soon as they are available, enabling low-latency incremental consumption.
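A minimal client sketch of incremental consumption, assuming a hypothetical SSE endpoint (`url`) that emits one `data:` line per token and closes with a `[DONE]` sentinel (a common convention, not a standard):

```python
import httpx

def consume_stream(url: str) -> str:
    """Read an SSE token stream incrementally instead of waiting for the full body."""
    parts: list[str] = []
    # timeout=None keeps the connection open for the lifetime of the stream
    with httpx.stream("POST", url, json={"prompt": "hello"}, timeout=None) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line.startswith("data: "):
                continue                          # skip SSE comments and blank lines
            payload = line[len("data: "):]
            if payload == "[DONE]":               # hypothetical end-of-stream sentinel
                break
            parts.append(payload)
            print(payload, end="", flush=True)    # render each token as it arrives
    return "".join(parts)
```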
Protocol Comparisons
- HTTP SSE: unidirectional server-to-client over plain HTTP; trivial for browsers to consume via EventSource, but limited to text framing.
- WebSocket: full-duplex and message-oriented; suits interactive sessions, but needs its own keep-alive and reconnect handling.
- gRPC streaming: binary Protobuf frames over HTTP/2 with server-, client-, and bidirectional modes; strongest typing, weakest direct browser support.
Use Cases
- Chat interfaces needing real-time token display.
- Speech synthesis pipelines chaining ASR → LLM → TTS (see the chaining sketch after this list).
- Long-running data generation consumed by downstream workers.
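To make the chaining and downstream-worker cases concrete, here is a minimal asyncio sketch in which a stubbed LLM stage streams tokens into a stubbed TTS stage that flushes at sentence boundaries; `llm_tokens` and `speak` are illustrative stand-ins, not a real pipeline:

```python
import asyncio
from typing import AsyncIterator

async def llm_tokens(prompt: str) -> AsyncIterator[str]:
    """Stand-in for an LLM token stream; a real pipeline would read from the model API."""
    for tok in ["Hel", "lo ", "wor", "ld."]:
        await asyncio.sleep(0.05)   # simulate per-token model latency
        yield tok

async def speak(tokens: AsyncIterator[str]) -> None:
    """Stand-in TTS stage: flush at sentence boundaries instead of waiting for full text."""
    buf = ""
    async for tok in tokens:
        buf += tok
        if buf.endswith((".", "!", "?")):
            print(f"TTS> {buf!r}")  # a real stage would call a synthesis engine here
            buf = ""

asyncio.run(speak(llm_tokens("hi")))
```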
Design Trade-offs
- Streaming reduces perceived latency but complicates retry logic: a reconnecting client must resume mid-stream rather than simply reissue the request (see the resumption sketch after this list).
- Persistent connections may exhaust load-balancer slots.
- For small outputs, per-message framing overhead exceeds that of a single batch reply.
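A sketch of the retry complication, assuming the server tags events with `id:` lines and honors the `Last-Event-ID` request header (both defined by the SSE spec, though not every streaming API implements resumption):

```python
import time
import httpx

def consume_with_resume(url: str, max_retries: int = 5) -> None:
    """Resume an SSE stream after disconnects using the Last-Event-ID header."""
    last_id: str | None = None
    for attempt in range(max_retries):
        headers = {"Last-Event-ID": last_id} if last_id else {}
        try:
            with httpx.stream("GET", url, headers=headers, timeout=None) as resp:
                for line in resp.iter_lines():
                    if line.startswith("id: "):
                        last_id = line[len("id: "):]   # remember the resume point
                    elif line.startswith("data: "):
                        print(line[len("data: "):])    # hand off to real processing
            return                                     # stream ended cleanly
        except httpx.TransportError:
            time.sleep(2 ** attempt)                   # back off before reconnecting
```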
Current Trends (2025)
- HTTP/3 QUIC streams lowering head-of-line blocking.[^1]
- CDN edge workers terminating SSE and multiplexing to origin.
- Token-level billing calculated on the fly during the stream (a metering sketch follows this list).
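A toy metering sketch of on-the-fly billing: usage accumulates per chunk, so a mid-stream disconnect still yields an accurate partial charge. The class, rate, and one-token-per-chunk assumption are all illustrative:

```python
class StreamMeter:
    """Accumulate billable usage as chunks stream out."""

    def __init__(self, price_per_token: float) -> None:
        self.tokens = 0
        self.price_per_token = price_per_token

    def record(self, n_tokens: int) -> None:
        self.tokens += n_tokens     # real systems would count tokenizer output

    @property
    def cost(self) -> float:
        return self.tokens * self.price_per_token

meter = StreamMeter(price_per_token=0.00002)   # illustrative rate, not a real price
for chunk in ["Hel", "lo ", "world"]:          # stand-in for a token stream
    meter.record(1)
print(f"tokens={meter.tokens} cost=${meter.cost:.6f}")
```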
Implementation Tips
- Flush a heartbeat comment (a bare `: keep-alive` line in SSE) every 15 s so idle proxies do not close the connection.
- Send a final JSON frame with a `"done": true` flag so clients can distinguish completion from a dropped connection.
- Cap maximum stream duration to defend against slow-loris attacks (the sketch below demonstrates all three tips).
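A server-side sketch combining all three tips, using FastAPI's `StreamingResponse` with an async generator; the heartbeat interval, duration cap, and `produce` stub are assumptions, not a prescribed configuration:

```python
import asyncio
import json
import time
from typing import AsyncIterator

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

HEARTBEAT_S = 15    # emit an SSE comment if no token arrives within this window
MAX_STREAM_S = 300  # hard cap so a stalled stream cannot hold the slot forever

async def produce(q: asyncio.Queue) -> None:
    """Stand-in producer; a real handler would forward model output."""
    for tok in ["Hello", " ", "world"]:
        await asyncio.sleep(0.1)
        await q.put(tok)
    await q.put(None)               # None signals completion

async def sse_events(q: asyncio.Queue) -> AsyncIterator[str]:
    deadline = time.monotonic() + MAX_STREAM_S
    while time.monotonic() < deadline:
        try:
            token = await asyncio.wait_for(q.get(), timeout=HEARTBEAT_S)
        except asyncio.TimeoutError:
            yield ": keep-alive\n\n"                    # SSE comment; clients ignore it
            continue
        if token is None:
            yield f"data: {json.dumps({'done': True})}\n\n"
            return
        yield f"data: {json.dumps({'token': token})}\n\n"
    # deadline hit: close with a final frame so clients can tell truncation from completion
    yield f"data: {json.dumps({'done': True, 'truncated': True})}\n\n"

@app.get("/stream")
async def stream() -> StreamingResponse:
    q: asyncio.Queue = asyncio.Queue()
    asyncio.create_task(produce(q))
    return StreamingResponse(sse_events(q), media_type="text/event-stream")
```

Waiting on the queue with a timeout lets a single loop interleave real tokens with heartbeats, with no separate timer task needed.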
References
[^1]: Cloudflare Blog, "QUIC for Real-Time AI Streaming," 2025.