Inference Cost Estimation

Benched.ai Editorial Team

Inference cost estimation forecasts the dollar spend associated with serving model requests under given load and hardware assumptions.

  Cost Components

Component        | Formula                 | Notes
Accelerator time | GPU-seconds × $/GPU-s   | On-demand or spot
Memory footprint | GB-hours × $/GB-h       | KV cache, weights
Egress bandwidth | GB × $/GB               | CDN or inter-AZ
Licensing fee    | Tokens × $/1k tokens    | Proprietary APIs
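These components can be rolled into a single per-request estimate. The sketch below is a minimal illustration of that roll-up; the field names and the flat 1.2 overhead multiplier are assumptions for this example, not a standard API.

```python
from dataclasses import dataclass

@dataclass
class CostInputs:
    # All field names and rates here are illustrative assumptions.
    gpu_seconds: float              # accelerator time consumed by the request
    gpu_price_per_hour: float       # $/GPU-hour (on-demand or spot)
    memory_gb_hours: float          # weights + KV cache residency
    memory_price_per_gb_hour: float
    egress_gb: float                # bytes leaving the serving region
    egress_price_per_gb: float
    licensed_tokens: float          # tokens billed by a proprietary API, if any
    license_price_per_1k: float

def cost_per_request(c: CostInputs, overhead_factor: float = 1.2) -> float:
    """Sum the four cost components and apply a flat overhead multiplier."""
    accelerator = c.gpu_seconds / 3600.0 * c.gpu_price_per_hour
    memory = c.memory_gb_hours * c.memory_price_per_gb_hour
    egress = c.egress_gb * c.egress_price_per_gb
    licensing = c.licensed_tokens / 1000.0 * c.license_price_per_1k
    return (accelerator + memory + egress + licensing) * overhead_factor
```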

  Example Calculation (self-hosted LLM)

Variable                      | Value
Prompt tokens per request     | 200
Completion tokens per request | 400
Request rate (TPS)            | 2 requests/s
GPU throughput                | 120 tokens/s
GPU price                     | $2/hr

Per request: (200 + 400) tokens ÷ 120 tokens/s ≈ 5 GPU-seconds ≈ 0.0014 GPU-hr, which at $2/hr is ≈ $0.0028. Sustaining 2 requests/s means 1,200 tokens/s of aggregate demand, i.e. about 10 GPU-seconds of compute per wall-clock second (~10 GPUs kept busy), or ≈ $0.0056 per second of operation.

Add 20 % overhead → ≈ $0.0033/request.
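The same arithmetic as a runnable check. The variable names mirror the table above; the flat 20 % overhead multiplier is the assumption stated in the text.

```python
prompt_tokens = 200
completion_tokens = 400
requests_per_second = 2          # "TPS" in the table above
gpu_tokens_per_second = 120
gpu_price_per_hour = 2.0
overhead = 1.20                  # flat 20 % operational overhead (assumption)

gpu_seconds_per_request = (prompt_tokens + completion_tokens) / gpu_tokens_per_second
cost_per_request = gpu_seconds_per_request / 3600 * gpu_price_per_hour * overhead
gpus_busy = requests_per_second * gpu_seconds_per_request  # concurrent GPU demand

print(f"{gpu_seconds_per_request:.1f} GPU-s/request, "
      f"~{gpus_busy:.0f} GPUs busy, ${cost_per_request:.4f}/request")
# -> 5.0 GPU-s/request, ~10 GPUs busy, $0.0033/request
```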

  Design Trade-offs

  • Batch size increases GPU utilization but raises queuing delay (a rough model follows this list).
  • Quantization lowers cost/req but may degrade quality.
  • Using API providers shifts CapEx to OpEx but adds per-token licensing.
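To make the batching trade-off concrete, here is a toy model. It assumes aggregate throughput grows sub-linearly with batch size (the 0.7 exponent is a made-up assumption, not a measured scaling law) and that a request waits, on average, half the time needed to fill its batch.

```python
def batch_tradeoff(batch_size, arrival_rate=2.0, base_tokens_per_s=120.0,
                   tokens_per_request=600, gpu_price_per_hour=2.0):
    """Illustrative only: larger batches cut cost/request but add queuing delay."""
    # Assumed sub-linear throughput scaling with batch size (exponent is made up).
    aggregate_tokens_per_s = base_tokens_per_s * batch_size ** 0.7
    gpu_seconds_per_request = tokens_per_request / aggregate_tokens_per_s
    cost_per_request = gpu_seconds_per_request / 3600 * gpu_price_per_hour
    # Average wait for the batch to fill at the given arrival rate.
    queuing_delay_s = 0.5 * batch_size / arrival_rate
    return cost_per_request, queuing_delay_s

for b in (1, 4, 16):
    cost, delay = batch_tradeoff(b)
    print(f"batch={b:2d}  cost/request=${cost:.4f}  added delay={delay:.2f}s")
```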

  Current Trends (2025)

  • FP8 inference and MoE routing cut FLOPs per token by 35 %, cascading into price cuts.
  • Cost dashboards integrate with OpenTelemetry spans to attribute dollars to user IDs (a sketch of the tagging side follows this list).
  • FinOps teams negotiate reserved H100 blocks for steady workloads at a 40 % discount.

  Implementation Tips

  1. Model p95 load, not average, when capacity-planning GPUs (see the sizing sketch after these tips).
  2. Include hidden costs like audit logging storage.
  3. Validate vendor invoices against client-side token counts.
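For the first tip, a small sizing sketch that compares mean load against p95. The traffic samples, tokens per request, and per-GPU throughput are placeholder assumptions.

```python
import math
import statistics

# Hypothetical one-minute samples of request rate (requests/s).
request_rate_samples = [1.2, 1.8, 2.4, 3.1, 2.2, 4.0, 2.7, 1.9, 3.6, 2.5]

tokens_per_request = 600
gpu_tokens_per_second = 120        # per-GPU throughput (assumption)

# quantiles(n=20) returns the 5th, 10th, ..., 95th percentiles; take the last.
p95_rate = statistics.quantiles(request_rate_samples, n=20)[-1]
mean_rate = statistics.mean(request_rate_samples)

def gpus_needed(rate_rps: float) -> int:
    """GPUs required to sustain the given request rate."""
    return math.ceil(rate_rps * tokens_per_request / gpu_tokens_per_second)

print(f"mean load -> {gpus_needed(mean_rate)} GPUs, p95 load -> {gpus_needed(p95_rate)} GPUs")
```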