Throttling policies deliberately slow or block client requests when resource usage exceeds predefined thresholds. They protect shared AI infrastructure from overload and ensure fair distribution of capacity across tenants.
Throttle Triggers
- Concurrent connection count per API key.
- Aggregate request rate (requests per minute).
- GPU-seconds or tokens generated per rolling window (see the sketch after this list).
- Spike detection versus historical baseline.
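As a concrete illustration of the rolling-window trigger, here is a minimal sketch of a per-key limiter that counts generated tokens over a sliding window; the class name, window length, and budget are illustrative assumptions, not an existing API.

```python
import time
from collections import deque

class RollingTokenLimiter:
    """Hypothetical per-key limiter: tokens generated over a sliding window."""

    def __init__(self, window_seconds: float = 60.0, token_budget: int = 10_000):
        self.window_seconds = window_seconds
        self.token_budget = token_budget
        self._events: dict[str, deque] = {}  # api_key -> deque of (timestamp, tokens)

    def _prune(self, key: str, now: float) -> None:
        # Drop usage records that have aged out of the window.
        q = self._events.setdefault(key, deque())
        while q and now - q[0][0] > self.window_seconds:
            q.popleft()

    def allow(self, key: str, tokens: int) -> bool:
        """Return True if the request fits the key's remaining token budget."""
        now = time.monotonic()
        self._prune(key, now)
        used = sum(t for _, t in self._events[key])
        if used + tokens > self.token_budget:
            return False  # caller decides: reject, queue, or slow down
        self._events[key].append((now, tokens))
        return True
```

A production limiter would also need locking and shared state across workers; this sketch only shows the windowing logic.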
Policy Types
- Hard reject: refuse the request immediately, typically with HTTP 429.
- Queue: hold the request until capacity frees up, within a bounded wait.
- Slowdown: serve the request at a reduced rate instead of refusing it.
Design Trade-offs
- Hard rejects preserve latency for allowed calls but create spiky failure patterns for clients.
- Queues smooth load yet raise tail latency and memory usage.
- Slowdown policies keep success rates high but may frustrate end-users if applied silently (the sketch below contrasts all three).
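To make the trade-offs concrete, the sketch below dispatches on a policy type when a limiter denies a request; the enum, the exception, and the fixed delays are assumptions for illustration only.

```python
import asyncio
import enum

class ThrottlePolicy(enum.Enum):
    HARD_REJECT = "hard_reject"  # fail fast; preserves latency for admitted calls
    QUEUE = "queue"              # smooths load; raises tail latency and memory use
    SLOWDOWN = "slowdown"        # keeps success rates high; adds silent delay

class ThrottledError(Exception):
    """Raised on hard reject; an edge layer would map this to HTTP 429."""

async def apply_policy(policy: ThrottlePolicy, handler,
                       queue_wait_s: float = 5.0, slowdown_s: float = 0.5):
    """Run an async handler under one of the three throttle policies."""
    if policy is ThrottlePolicy.HARD_REJECT:
        raise ThrottledError("quota exceeded")  # spiky failures for the client
    if policy is ThrottlePolicy.QUEUE:
        await asyncio.sleep(queue_wait_s)       # stand-in for waiting on capacity
        return await handler()
    await asyncio.sleep(slowdown_s)             # degrade speed instead of refusing
    return await handler()
```

A real queue would bound its depth and wake waiters when capacity frees up rather than sleeping a fixed interval; the constants here just make the latency trade-offs visible.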
Current Trends (2025)
- Token-based throttles instead of raw request counts, so large prompts are metered fairly.
- Adaptive limits driven by short-term demand forecasts are reported to reduce unnecessary rejections by 18 percent.
- Client SDKs expose `onThrottle` hooks for graceful back-off and UX messaging (one possible shape is sketched below).[^1]
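As one way such a hook could look, here is a hypothetical Python client mirroring the `onThrottle` idea: it fires a user-supplied callback, then backs off per the server's Retry-After header. Nothing here is any vendor's actual SDK.

```python
import time
import urllib.error
import urllib.request

class ThrottleAwareClient:
    """Hypothetical SDK client: retries on HTTP 429, firing a hook first."""

    def __init__(self, on_throttle=None, max_retries: int = 3):
        self.on_throttle = on_throttle  # callable(retry_after_seconds) -> None
        self.max_retries = max_retries

    def get(self, url: str) -> bytes:
        for attempt in range(self.max_retries + 1):
            try:
                with urllib.request.urlopen(url) as resp:
                    return resp.read()
            except urllib.error.HTTPError as err:
                if err.code != 429 or attempt == self.max_retries:
                    raise
                # Assumes Retry-After carries seconds (it may also be a date).
                retry_after = float(err.headers.get("Retry-After", "1"))
                if self.on_throttle:
                    self.on_throttle(retry_after)  # let the app show UX messaging
                time.sleep(retry_after)            # graceful back-off
        raise AssertionError("unreachable: loop returns or re-raises")

# Usage: surface the wait to the user instead of failing silently.
client = ThrottleAwareClient(
    on_throttle=lambda s: print(f"Rate limited; retrying in {s:.0f}s"))
```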
Implementation Tips
- Return `HTTP 429` with a `Retry-After` header specifying seconds until new quota (see the sketch after this list).
- Log throttling events with a reason code for capacity planning.
- Exempt internal health probes from throttle counters.
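These tips combine into a short standard-library sketch: it exempts a health path from the counter, logs a reason code, and answers over-limit calls with 429 plus Retry-After. Paths, limits, and the log format are illustrative assumptions.

```python
import http.server
import logging
import time

logging.basicConfig(level=logging.INFO)
WINDOW_S, LIMIT = 60.0, 100   # illustrative: 100 requests per minute, service-wide
_hits: list[float] = []       # timestamps of counted requests

class ThrottlingHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        now = time.monotonic()
        # Exempt internal health probes from throttle counters.
        if self.path == "/healthz":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
            return
        _hits[:] = [t for t in _hits if now - t < WINDOW_S]  # drop expired hits
        if len(_hits) >= LIMIT:
            retry_after = max(1, int(WINDOW_S - (now - _hits[0])) + 1)
            # Log throttling events with a reason code for capacity planning.
            logging.info("throttled path=%s reason=rate_limit retry_after=%ss",
                         self.path, retry_after)
            self.send_response(429)                            # Too Many Requests
            self.send_header("Retry-After", str(retry_after))  # seconds until new quota
            self.end_headers()
            return
        _hits.append(now)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

if __name__ == "__main__":
    http.server.HTTPServer(("", 8080), ThrottlingHandler).serve_forever()
```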
References
[^1]: Cloudflare Blog, "Adaptive Rate Limiting for Large Language Model APIs," 2025.