Inference cost estimation forecasts the dollar spend associated with serving model requests under given load and hardware assumptions.
Cost Components
- GPU compute: GPU-seconds per request (tokens ÷ throughput), converted to GPU-hours and multiplied by the hourly GPU price.
- Overhead: networking, storage (e.g., audit logs), and orchestration, typically modeled as a percentage uplift on compute.
- For API-based serving, per-token usage fees replace the GPU-hour terms.
Example Calculation (self-hosted LLM)
Assume 200 prompt tokens and 400 completion tokens per request, a per-GPU throughput of 120 tokens/s, a load of 2 requests per second, and an assumed price of $4 per GPU-hour. Each request then takes (200 + 400) / 120 ≈ 5 GPU-seconds, so at 2 requests/s the service consumes about 10 GPU-seconds per wall-clock second (≈ 0.0028 GPU-hours each second, i.e. roughly ten GPUs kept busy). Per request, 5 GPU-seconds ≈ 0.0014 GPU-hours × $4/GPU-hour ≈ $0.0056; adding ~20% overhead for networking, storage, and orchestration gives ≈ $0.0067 per request.
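The same arithmetic as a minimal Python sketch, using the assumptions above; the function names and the flat 20% overhead factor are illustrative, not a standard API:

```python
def per_request_cost(prompt_tokens: int,
                     completion_tokens: int,
                     tokens_per_sec: float,
                     gpu_hourly_price: float,
                     overhead_pct: float = 0.20) -> float:
    """Estimate the dollar cost of serving one request on a self-hosted GPU."""
    gpu_seconds = (prompt_tokens + completion_tokens) / tokens_per_sec
    gpu_hours = gpu_seconds / 3600.0
    compute_cost = gpu_hours * gpu_hourly_price
    return compute_cost * (1.0 + overhead_pct)


def gpus_needed(requests_per_sec: float, gpu_seconds_per_request: float) -> float:
    """Steady-state fleet size: GPU-seconds of work arriving per wall-clock second."""
    return requests_per_sec * gpu_seconds_per_request


if __name__ == "__main__":
    cost = per_request_cost(200, 400, tokens_per_sec=120, gpu_hourly_price=4.0)
    fleet = gpus_needed(requests_per_sec=2, gpu_seconds_per_request=(200 + 400) / 120)
    print(f"~${cost:.4f}/request, ~{fleet:.0f} GPUs busy at steady state")
```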
Design Trade-offs
- Larger batch sizes increase GPU utilization (and cut cost per token) but add queuing delay and per-request latency.
- Quantization lowers cost per request but may degrade output quality.
- Using API providers shifts CapEx to OpEx but adds per-token usage fees; a rough break-even sketch follows this list.
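To make the CapEx-to-OpEx comparison concrete, here is a break-even sketch under assumed numbers; the 60% utilization default and the $15-per-million-token API rate are placeholders, not quotes from any provider:

```python
def self_hosted_cost_per_1m_tokens(tokens_per_sec: float,
                                   gpu_hourly_price: float,
                                   utilization: float = 0.6) -> float:
    """$ per 1M tokens on a self-hosted GPU; `utilization` discounts idle capacity."""
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return gpu_hourly_price / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    hosted = self_hosted_cost_per_1m_tokens(tokens_per_sec=120, gpu_hourly_price=4.0)
    api_rate = 15.0  # placeholder $/1M-token API price, not a real quote
    print(f"self-hosted ≈ ${hosted:.2f}/1M tokens vs API ${api_rate:.2f}/1M tokens")
    print("self-hosting wins" if hosted < api_rate else "API wins at this utilization")
```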
Current Trends (2025)
- FP8 inference and MoE routing cut FLOPs per token by roughly 35%, which cascades into provider price cuts.
- Cost dashboards integrate with OpenTelemetry spans to attribute dollars to user IDs (a tagging sketch follows this list).
- FinOps teams negotiate reserved H100 blocks for steady workloads at a ~40% discount.
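A minimal sketch of per-request cost tagging with the OpenTelemetry Python SDK; the span and attribute names (llm.user_id, llm.cost_usd) and the toy rate card are illustrative assumptions, not an established semantic convention:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to stdout so the cost attributes are visible without a backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("inference-cost")


def estimated_cost_usd(prompt_tokens: int, completion_tokens: int) -> float:
    """Toy per-token rate card; replace with your own pricing."""
    return prompt_tokens * 3e-6 + completion_tokens * 15e-6


def handle_request(user_id: str, prompt_tokens: int, completion_tokens: int) -> None:
    # Tag each request span with token counts and estimated dollars so a
    # dashboard can aggregate spend per user_id.
    with tracer.start_as_current_span("llm.request") as span:
        span.set_attribute("llm.user_id", user_id)
        span.set_attribute("llm.prompt_tokens", prompt_tokens)
        span.set_attribute("llm.completion_tokens", completion_tokens)
        span.set_attribute("llm.cost_usd",
                           estimated_cost_usd(prompt_tokens, completion_tokens))


handle_request("user-123", prompt_tokens=200, completion_tokens=400)
```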
Implementation Tips
- Capacity-plan GPUs against p95 load, not the average (see the sizing sketch after this list).
- Include hidden costs like audit logging storage.
- Validate vendor invoices against client-side token counts.
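A small sketch of p95-based sizing, assuming a per-second request-rate trace and the 5 GPU-seconds-per-request figure from the example above; the 20% headroom factor and the synthetic Poisson trace are illustrative:

```python
import numpy as np


def gpus_for_p95(requests_per_sec_trace: np.ndarray,
                 gpu_seconds_per_request: float,
                 headroom: float = 1.2) -> int:
    """Size the GPU fleet for the 95th-percentile request rate, not the mean."""
    p95_rps = float(np.percentile(requests_per_sec_trace, 95))
    gpu_seconds_needed = p95_rps * gpu_seconds_per_request  # work arriving per second
    return int(np.ceil(gpu_seconds_needed * headroom))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    load = rng.poisson(lam=2.0, size=86_400)  # synthetic day of per-second request counts
    print(f"{gpus_for_p95(load, gpu_seconds_per_request=5.0)} GPUs for p95 load")
```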