Rate limiting controls the number of requests a client or tenant can issue to an AI API within a given time window. Well-chosen limits protect shared infrastructure from abuse, enforce fairness across tenants, and contain operating costs.
Common Algorithms
- Token bucket: permits bursts up to a fixed capacity while refilling at a steady rate; a minimal sketch follows this list.
- Leaky bucket: drains requests at a constant rate, smoothing traffic.
- Fixed window: a simple counter per interval; cheap, but allows double bursts at window boundaries.
- Sliding window: weights the previous window's count to smooth boundary bursts, at slightly higher cost.
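The token bucket is the workhorse of the four. A minimal in-memory sketch, assuming a single process; the class name and parameters are illustrative, not taken from any particular library:

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # sustained requests per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# 5 requests/second sustained, bursts of up to 20.
bucket = TokenBucket(rate=5, capacity=20)
print(bucket.allow())  # True until the burst is spent
```

The capacity parameter is exactly the burst-versus-steady-client trade-off discussed under Design Trade-offs below.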
Recommended Limit Granularity
- Per API key or OAuth client ID.
- Distinct model tiers (GPT-4, GPT-3.5) with tighter caps on premium models.
- Separate limits for read (inference) versus write (fine-tuning) endpoints; a combined policy sketch follows this list.
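These dimensions compose naturally into a policy table. A sketch under assumed tier names and numbers (all values are illustrative, not recommendations):

```python
# Hypothetical per-minute request caps, keyed by (model_tier, endpoint_kind).
LIMITS = {
    ("premium", "inference"): 60,    # tighter caps on premium models
    ("standard", "inference"): 600,
    ("premium", "fine_tune"): 5,     # write-side endpoints capped far lower
    ("standard", "fine_tune"): 20,
}

def limit_for(key_tier: str, model_tier: str, endpoint_kind: str) -> int:
    base = LIMITS[(model_tier, endpoint_kind)]
    # Paid keys might earn a multiplier; everything here is an assumption.
    return base * 5 if key_tier == "paid" else base

print(limit_for("paid", "premium", "inference"))  # 300
```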
Design Trade-offs
- Generous burst capacity improves UX for interactive apps but may starve steady clients.
- Low limits reduce abuse but push developers to parallelize across keys.
- Quotas keyed on client-supplied headers are simple, but those headers can be spoofed unless they are set by a trusted proxy at the TLS termination layer.
Current Trends (2025)
- Adaptive rate limits that raise ceilings for low-error clients and throttle noisy ones (sketched after this list).
- Usage-based billing replacing hard caps: pay-as-you-go beyond the free tier.
- Real-time quota introspection endpoints so apps can plan retries instead of polling.
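A sketch of the adaptive idea: scale each client's ceiling by its recent error rate. The thresholds and multipliers here are invented for illustration:

```python
def adaptive_limit(base_limit: int, error_rate: float) -> int:
    """Return a per-client ceiling scaled by recent error rate (illustrative)."""
    if error_rate < 0.01:            # well-behaved: raise the ceiling
        return int(base_limit * 1.5)
    if error_rate > 0.20:            # noisy: throttle hard
        return max(1, base_limit // 4)
    return base_limit                # otherwise keep the default

print(adaptive_limit(100, 0.005))  # 150: a clean client earns headroom
```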
Implementation Tips
- Return 429 Too Many Requests with a Retry-After header giving the wait time in seconds (a client back-off sketch follows this list).
- Include the remaining token count in response headers to aid client back-off logic.
- Store counters in Redis (or Memcached), using a Lua script in Redis so the increment and expiry run atomically; a Redis sketch also follows this list.
- Log limit breaches; spikes often indicate bot attacks.
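On the client side, honoring Retry-After plus a remaining-count header looks roughly like this; the endpoint URL is hypothetical, and header names beyond the standard Retry-After follow common convention rather than any particular API:

```python
import time
import requests

def call_with_backoff(url: str, payload: dict, max_attempts: int = 5):
    for attempt in range(max_attempts):
        resp = requests.post(url, json=payload)
        if resp.status_code != 429:
            return resp
        # Prefer the server's hint; fall back to exponential back-off.
        wait = float(resp.headers.get("Retry-After", 2 ** attempt))
        remaining = resp.headers.get("X-RateLimit-Remaining", "?")
        print(f"throttled ({remaining} left); sleeping {wait:.0f}s")
        time.sleep(wait)
    raise RuntimeError("still rate limited after retries")
```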
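For the server-side counter, a minimal fixed-window check in Redis bundles the increment and expiry into one Lua script so concurrent requests cannot race; the key naming scheme is an assumption:

```python
import redis  # redis-py

r = redis.Redis()

# INCR the window counter and set its expiry in a single atomic step.
LUA = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
"""
bump = r.register_script(LUA)

def over_limit(api_key: str, limit: int = 100, window_s: int = 60) -> bool:
    # One counter per key per window; the "rl:" prefix is illustrative.
    count = bump(keys=[f"rl:{api_key}"], args=[window_s])
    return count > limit
```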