Error rate metrics quantify how often an AI system fails to deliver an acceptable response. They are vital for SLOs, capacity planning, and regression alerts.
Common Metrics
Aggregation Windows
Design Trade-offs
- Short windows detect spikes quickly but are noisy.
- Aggregating multiple error classes can mask high-severity categories like safety blocks.
Current Trends (2025)
- Separating "model errors" (bad generations) from "infrastructure errors" clarifies ownership.
- Client libraries expose retry-after headers to hide transient 429 spikes.
Implementation Tips
- Tag errors with
error_type
(timeout, invalid_request, safety) for root-cause analysis. - Sample user context for high-volume endpoints to reduce log cost.
- Alert on error budget burn rate, not raw counts.