
AI Model Pricing

Benched.ai Editorial Team

AI model pricing is the scheme a vendor uses to charge customers for training or inference. Pricing influences architecture choices, cost allocation, and even prompt engineering practices.

  Common Pricing Metrics

| Metric | Unit | Typical Use Case |
| --- | --- | --- |
| Tokens processed | per 1k tokens | Chat / completion APIs |
| Images generated | per image | Text-to-image services |
| Compute time | GPU-second | Managed fine-tuning jobs |
| Requests | per API call | Low-variability inference |
| Seats | per user per month | Enterprise SaaS chatbot |
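Token-based billing, the most common metric above, reduces to a simple cost function over prompt and completion tokens. A minimal sketch (the prices passed in are examples, not any vendor's actual rates):

```python
def request_cost(prompt_tokens: int, completion_tokens: int,
                 prompt_price_per_1k: float, completion_price_per_1k: float) -> float:
    """Cost of one chat/completion request under per-1k-token billing."""
    return (prompt_tokens / 1000) * prompt_price_per_1k \
         + (completion_tokens / 1000) * completion_price_per_1k

# Example: 1,200 prompt tokens + 300 completion tokens at $0.03 / $0.06 per 1k
print(round(request_cost(1200, 300, 0.03, 0.06), 4))  # → 0.054
```

Note that prompt and completion tokens are usually priced differently, so tracking the two counts separately matters for accurate cost attribution.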

  Representative 2025 Prices (public list)

| Provider | Model Tier | Price per 1k Tokens | Price per 1M Images |
| --- | --- | --- | --- |
| OpenAI | GPT-4o | $0.03 prompt / $0.06 completion | — |
| Anthropic | Claude 3 Opus | $0.025 prompt / $0.05 completion | — |
| Midjourney | V6 | — | $8 |
| Stability AI | SDXL Turbo API | — | $1 |
| Google Vertex | Gemini-1.5-Pro | $0.016 prompt / $0.032 completion | — |

Prices exclude reserved-instance discounts and enterprise commit credits.
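Because prompt and completion tokens are priced differently, comparing the token-billed tiers above requires a blended per-1k-token rate under an assumed traffic mix. A sketch, assuming 75% of tokens are prompt tokens (the ratio is an assumption, not a vendor figure):

```python
def blended_price(prompt_price: float, completion_price: float,
                  prompt_share: float = 0.75) -> float:
    """Blended $ per 1k tokens, weighting prompt vs. completion prices by traffic mix."""
    return prompt_share * prompt_price + (1 - prompt_share) * completion_price

# List prices from the table above
for name, p, c in [("GPT-4o", 0.03, 0.06),
                   ("Claude 3 Opus", 0.025, 0.05),
                   ("Gemini-1.5-Pro", 0.016, 0.032)]:
    print(f"{name}: ${blended_price(p, c):.4f} per 1k tokens (blended)")
```

Retrieval-heavy workloads skew toward prompt tokens, while generation-heavy ones skew toward completion tokens, so the right `prompt_share` depends on your traffic.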

  Discount Mechanisms

| Mechanism | How It Works | Savings Range |
| --- | --- | --- |
| Reserved throughput | Customer prepays for steady TPS | 20–50% |
| Spot GPU inference | Jobs run on reclaimed capacity | 40–70%, but interruptible |
| Volume tiering | Price drops after N tokens | 10–30% |
| Data-sharing rebate | Lower price for allowing anonymized log sharing | 5–15% |
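Volume tiering typically works like marginal tax brackets: each band of usage is billed at its own rate. A sketch with an illustrative schedule (the thresholds and rates below are invented for the example, not taken from any vendor):

```python
def tiered_cost(tokens_k: float, tiers: list) -> float:
    """Total cost under marginal volume tiering.

    tokens_k -- usage in thousands of tokens
    tiers    -- list of (upper_threshold_k, price_per_1k), ascending
    """
    cost, prev = 0.0, 0.0
    for threshold, price in tiers:
        band = min(tokens_k, threshold) - prev
        if band <= 0:
            break
        cost += band * price
        prev = threshold
    return cost

# Illustrative schedule: first 100k tokens at $0.03/1k, next 900k at $0.025, rest at $0.02
schedule = [(100, 0.03), (1000, 0.025), (float("inf"), 0.02)]
print(tiered_cost(1500, schedule))  # 1.5M tokens → 3.0 + 22.5 + 10.0 = 35.5
```

Modeling the schedule this way makes it easy to forecast the effective (average) price at projected volumes before entering a negotiation.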

  Design Trade-offs

  • Token-based billing incentivizes prompt compression but can shift cost to retrieval pipelines.
  • Flat per-request pricing simplifies budgeting yet penalizes large outputs.
  • Seat licenses encourage internal adoption but decouple usage from cost, risking overage.

  Current Trends (2025)

  • FP8 inference cuts vendor serving cost by roughly 35%, driving across-the-board price drops.
  • Providers offer dynamic SKU swapping—requests automatically routed to the cheapest capable model within latency SLA.
  • Open models plus serverless GPU runtimes create race-to-zero margins for commodity tiers.
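The dynamic SKU swapping described above can be sketched as picking the cheapest model whose capability score and latency satisfy the request's constraints. The catalog entries, quality scores, and latency figures below are illustrative placeholders, not real vendor SKUs:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Sku:
    name: str
    price_per_1k: float   # blended $ per 1k tokens
    quality: float        # internal capability score, 0-1
    p95_latency_ms: int

def route(skus: list, min_quality: float, latency_sla_ms: int) -> Optional[Sku]:
    """Cheapest SKU meeting the quality floor and latency SLA, or None if none qualify."""
    eligible = [s for s in skus
                if s.quality >= min_quality and s.p95_latency_ms <= latency_sla_ms]
    return min(eligible, key=lambda s: s.price_per_1k, default=None)

catalog = [Sku("small", 0.002, 0.60, 300),
           Sku("medium", 0.010, 0.80, 600),
           Sku("large", 0.030, 0.95, 1200)]
print(route(catalog, min_quality=0.75, latency_sla_ms=800).name)  # → medium
```

Real routers also weigh factors such as context-window limits and regional availability, but the cost/quality/latency triad above captures the core trade-off.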

  Implementation Tips

  1. Benchmark alternative models at equal quality before negotiating price—quality-adjusted cost often varies 3×.
  2. Track tokens via client library, not server logs alone; retries can double billed usage.
  3. Use batching or multiplexing to amortize request overhead under token-based billing.
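Tip 2 can be implemented as a thin client-side wrapper that tallies billed tokens across all attempts, including retries. The `call_model` callable and the `usage` response shape below are hypothetical stand-ins for whatever SDK you use:

```python
import time

def call_with_billing(call_model, payload, max_retries: int = 3):
    """Invoke a model API with retries, tallying tokens billed across attempts."""
    billed = {"prompt_tokens": 0, "completion_tokens": 0}
    for attempt in range(max_retries + 1):
        try:
            resp = call_model(payload)  # hypothetical SDK call
            billed["prompt_tokens"] += resp["usage"]["prompt_tokens"]
            billed["completion_tokens"] += resp["usage"]["completion_tokens"]
            return resp, billed
        except Exception:
            # Server logs alone miss the client's view of retry volume, which is
            # exactly the double-billing failure mode the tip warns about.
            if attempt == max_retries:
                raise
            time.sleep(0.01 * 2 ** attempt)  # exponential backoff
```

Aggregating `billed` per team or feature into your metrics pipeline gives a client-side source of truth to reconcile against the vendor's invoice.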