Inference cost estimation forecasts the dollar spend associated with serving model requests under given load and hardware assumptions.
Cost Components
- GPU compute: GPU-seconds per request (tokens ÷ throughput), converted to GPU-hours and multiplied by the hourly GPU price.
- Overhead: networking, storage (e.g., audit logs), and orchestration, typically modeled as a percentage uplift on compute.
- For API-based serving, per-token usage fees replace the GPU-hour terms.
Example Calculation (self-hosted LLM)
Assume 200 prompt tokens and 400 completion tokens per request, a per-GPU throughput of 120 tokens/s, a load of 2 requests per second, and an assumed price of $4 per GPU-hour. Each request then takes (200 + 400) / 120 ≈ 5 GPU-seconds, so at 2 requests/s the service consumes about 10 GPU-seconds per wall-clock second (≈ 0.0028 GPU-hours each second, i.e. roughly ten GPUs kept busy). Per request, 5 GPU-seconds ≈ 0.0014 GPU-hours × $4/GPU-hour ≈ $0.0056; adding ~20% overhead for networking, storage, and orchestration gives ≈ $0.0067 per request.
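The same arithmetic as a minimal Python sketch, using the assumptions above; the function names and the flat 20% overhead factor are illustrative, not a standard API:

```python
def per_request_cost(prompt_tokens: int,
                     completion_tokens: int,
                     tokens_per_sec: float,
                     gpu_hourly_price: float,
                     overhead_pct: float = 0.20) -> float:
    """Estimate the dollar cost of serving one request on a self-hosted GPU."""
    gpu_seconds = (prompt_tokens + completion_tokens) / tokens_per_sec
    gpu_hours = gpu_seconds / 3600.0
    compute_cost = gpu_hours * gpu_hourly_price
    return compute_cost * (1.0 + overhead_pct)


def gpus_needed(requests_per_sec: float, gpu_seconds_per_request: float) -> float:
    """Steady-state fleet size: GPU-seconds of work arriving per wall-clock second."""
    return requests_per_sec * gpu_seconds_per_request


if __name__ == "__main__":
    cost = per_request_cost(200, 400, tokens_per_sec=120, gpu_hourly_price=4.0)
    fleet = gpus_needed(requests_per_sec=2, gpu_seconds_per_request=(200 + 400) / 120)
    print(f"~${cost:.4f}/request, ~{fleet:.0f} GPUs busy at steady state")
```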
Design Trade-offs
- Larger batch sizes increase GPU utilization (and cut cost per token) but add queuing delay and per-request latency.
- Quantization lowers cost per request but may degrade output quality.
- Using API providers shifts CapEx to OpEx but adds per-token usage fees; a rough break-even sketch follows this list.
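To make the CapEx-to-OpEx comparison concrete, here is a break-even sketch under assumed numbers; the 60% utilization default and the $15-per-million-token API rate are placeholders, not quotes from any provider:

```python
def self_hosted_cost_per_1m_tokens(tokens_per_sec: float,
                                   gpu_hourly_price: float,
                                   utilization: float = 0.6) -> float:
    """$ per 1M tokens on a self-hosted GPU; `utilization` discounts idle capacity."""
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return gpu_hourly_price / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    hosted = self_hosted_cost_per_1m_tokens(tokens_per_sec=120, gpu_hourly_price=4.0)
    api_rate = 15.0  # placeholder $/1M-token API price, not a real quote
    print(f"self-hosted ≈ ${hosted:.2f}/1M tokens vs API ${api_rate:.2f}/1M tokens")
    print("self-hosting wins" if hosted < api_rate else "API wins at this utilization")
```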
Current Trends (2025)
- FP8 inference and MoE routing cut FLOPs per token by roughly 35%, which cascades into provider price cuts.
- Cost dashboards integrate with OpenTelemetry spans to attribute dollars to user IDs (a tagging sketch follows this list).
- FinOps teams negotiate reserved H100 blocks for steady workloads at a ~40% discount.
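A minimal sketch of per-request cost tagging with the OpenTelemetry Python SDK; the span and attribute names (llm.user_id, llm.cost_usd) and the toy rate card are illustrative assumptions, not an established semantic convention:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to stdout so the cost attributes are visible without a backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("inference-cost")


def estimated_cost_usd(prompt_tokens: int, completion_tokens: int) -> float:
    """Toy per-token rate card; replace with your own pricing."""
    return prompt_tokens * 3e-6 + completion_tokens * 15e-6


def handle_request(user_id: str, prompt_tokens: int, completion_tokens: int) -> None:
    # Tag each request span with token counts and estimated dollars so a
    # dashboard can aggregate spend per user_id.
    with tracer.start_as_current_span("llm.request") as span:
        span.set_attribute("llm.user_id", user_id)
        span.set_attribute("llm.prompt_tokens", prompt_tokens)
        span.set_attribute("llm.completion_tokens", completion_tokens)
        span.set_attribute("llm.cost_usd",
                           estimated_cost_usd(prompt_tokens, completion_tokens))


handle_request("user-123", prompt_tokens=200, completion_tokens=400)
```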
Implementation Tips
- Capacity-plan GPUs against p95 load, not the average (see the sizing sketch after this list).
- Include hidden costs like audit logging storage.
- Validate vendor invoices against client-side token counts.
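A small sketch of p95-based sizing, assuming a per-second request-rate trace and the 5 GPU-seconds-per-request figure from the example above; the 20% headroom factor and the synthetic Poisson trace are illustrative:

```python
import numpy as np


def gpus_for_p95(requests_per_sec_trace: np.ndarray,
                 gpu_seconds_per_request: float,
                 headroom: float = 1.2) -> int:
    """Size the GPU fleet for the 95th-percentile request rate, not the mean."""
    p95_rps = float(np.percentile(requests_per_sec_trace, 95))
    gpu_seconds_needed = p95_rps * gpu_seconds_per_request  # work arriving per second
    return int(np.ceil(gpu_seconds_needed * headroom))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    load = rng.poisson(lam=2.0, size=86_400)  # synthetic day of per-second request counts
    print(f"{gpus_for_p95(load, gpu_seconds_per_request=5.0)} GPUs for p95 load")
```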