Model caching stores model weights and, optionally, key–value (KV) attention states in fast storage so that repeated invocations avoid costly disk or network fetches.
Cache Layers
A serving stack typically tiers its cache from fastest to slowest: GPU HBM for resident weights and KV blocks, host DRAM for staged shards, local NVMe SSD for checkpoint files, and remote object storage as the origin of last resort.

Hit Ratio Impact (LLM inference)
A weight-cache hit turns a cold start (minutes of shard downloads) into a warm start measured in seconds, while a KV-cache hit skips prefill compute for the shared prefix; both translate directly into lower time-to-first-token.
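To make the tier walk concrete, here is a minimal Python sketch of a DRAM-over-SSD weight cache. It is an illustration under assumed names (TieredWeightCache and _load_from_ssd are hypothetical), not the API of any serving framework.

```python
import time
from collections import OrderedDict

class TieredWeightCache:
    """Minimal DRAM-over-SSD weight cache; the SSD loader is a stub."""

    def __init__(self, dram_capacity: int):
        self.dram = OrderedDict()          # model_id -> weights, in LRU order
        self.dram_capacity = dram_capacity

    def get(self, model_id: str):
        # DRAM hit: refresh recency and return without any I/O.
        if model_id in self.dram:
            self.dram.move_to_end(model_id)
            return self.dram[model_id]
        # Miss: fall back to the slower tier, then promote into DRAM.
        weights = self._load_from_ssd(model_id)
        self.dram[model_id] = weights
        if len(self.dram) > self.dram_capacity:
            self.dram.popitem(last=False)  # evict the least-recently-used model
        return weights

    def _load_from_ssd(self, model_id: str):
        time.sleep(0.01)                   # stand-in for an NVMe read
        return f"weights:{model_id}"
```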
Design Trade-offs
- A larger GPU-resident cache boosts tokens per second (TPS) but leaves less HBM for co-loading multiple models (see the back-of-envelope after this list).
- Host DRAM caching shortens cold starts but consumes comparatively expensive RAM.
- SSD caches risk serving stale shards, so entries need checksum validation on load.
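To make the first trade-off concrete, a back-of-envelope in Python, assuming a Llama-2-7B-like shape (32 layers, 32 KV heads, head dim 128, FP16) on an 80 GB H100; the exact numbers are illustrative assumptions, not measurements.

```python
# Assumed Llama-2-7B-like shape in FP16; adjust for your model.
N_LAYERS, N_KV_HEADS, HEAD_DIM, FP16 = 32, 32, 128, 2

weights_gb = 7e9 * FP16 / 1e9                               # ~14 GB of weights
kv_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * FP16  # K and V planes

hbm_gb = 80                                # assumed H100 SXM capacity
kv_budget_gb = hbm_gb - weights_gb         # HBM left over for KV blocks
print(f"KV per token: {kv_per_token / 2**20:.2f} MiB")   # -> 0.50 MiB
print(f"KV budget: {kv_budget_gb:.0f} GB "
      f"~ {kv_budget_gb * 1e9 / kv_per_token:,.0f} cached tokens")
```

Every additional co-loaded model subtracts its full weight footprint from that KV budget, which is the tension the first bullet describes.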
Current Trends (2025)
- vLLM's paged KV cache shares blocks across requests, raising hit ratios by roughly 20 percentage points (a toy sketch of the sharing idea follows this list).
- Checkpoint deduplication stores identical FP16 shards once and hard-links them across model versions.
- NVIDIA GPUDirect Storage (GDS) streams weights from NVMe into H100 memory without an intermediate CPU copy.
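The block-sharing idea is easiest to see in a toy sketch. This is not vLLM's implementation, just a content-addressed block table in which each KV block is keyed by a hash of the full token prefix it completes, so identical prefixes across requests map to the same block.

```python
import hashlib

BLOCK = 16  # tokens per KV block; vLLM likewise uses a fixed block size

class SharedBlockTable:
    """Toy content-addressed KV block table: equal prefixes share storage."""

    def __init__(self):
        self.blocks = {}   # prefix hash -> placeholder for KV block memory
        self.hits = self.misses = 0

    def map_request(self, token_ids):
        handles = []
        for end in range(BLOCK, len(token_ids) + 1, BLOCK):
            # Key each block by the entire prefix it completes, so sharing
            # happens only when every earlier token also matches.
            key = hashlib.sha1(str(token_ids[:end]).encode()).hexdigest()
            if key in self.blocks:
                self.hits += 1
            else:
                self.misses += 1
                self.blocks[key] = object()  # stand-in for allocated HBM
            handles.append(key)
        return handles

table = SharedBlockTable()
system_prompt = list(range(48))               # 3 full blocks of shared prefix
table.map_request(system_prompt + [1, 2, 3])  # first request: all misses
table.map_request(system_prompt + [7, 8, 9])  # second request: 3 block hits
print(table.hits, table.misses)               # -> 3 3
```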
Implementation Tips
- Evict with LRU weighted by each model's recent QPS, so hot models outlive merely recent ones (first sketch below).
- Validate the SHA-256 of weight shards on first load (second sketch below).
- Pre-warm caches asynchronously after deploys to avoid p99 latency spikes.
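A sketch of QPS-weighted LRU scoring, under the assumption that each cache entry tracks a decayed queries-per-second estimate; the scoring formula here is one plausible choice, not a standard.

```python
import time
from dataclasses import dataclass

@dataclass
class Entry:
    last_used: float    # wall-clock time of the last hit
    recent_qps: float   # exponentially decayed queries-per-second estimate

def eviction_score(e: Entry, now: float) -> float:
    # Plain LRU would use age alone; dividing by QPS keeps a slightly stale
    # but hot model ahead of a fresher, rarely-hit one.
    return (now - e.last_used) / (1.0 + e.recent_qps)

def pick_victim(cache: dict) -> str:
    now = time.time()
    return max(cache, key=lambda k: eviction_score(cache[k], now))

cache = {
    "llama-7b":   Entry(last_used=time.time() - 30, recent_qps=12.0),
    "mistral-7b": Entry(last_used=time.time() - 5,  recent_qps=0.1),
}
print(pick_victim(cache))  # evicts the cold model despite its recent touch
```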
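And a streaming SHA-256 check for first-load validation; hashing in chunks keeps memory flat even for multi-gigabyte shards. verify_shard is a hypothetical helper, not part of any library.

```python
import hashlib
from pathlib import Path

def verify_shard(path: Path, expected_sha256: str, chunk: int = 1 << 20) -> None:
    """Stream the shard through SHA-256 so it is never fully resident in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            digest.update(block)
    if digest.hexdigest() != expected_sha256:
        raise ValueError(f"checksum mismatch for {path}")
```

After a successful first-load check, recording a small verified marker next to the shard lets subsequent loads skip the hash entirely.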