Model caching stores model weights and, optionally, key–value (KV) attention states in fast storage so that repeated invocations avoid costly disk or network fetches.
Cache Layers
A serving stack typically tiers its cache from fastest to slowest: GPU HBM for resident weights and KV blocks, host DRAM for staged shards, local NVMe SSD for checkpoint files, and remote object storage as the origin of last resort.

Hit Ratio Impact (LLM inference)
A weight-cache hit turns a cold start (minutes of shard downloads) into a warm start measured in seconds, while a KV-cache hit skips prefill compute for the shared prefix; both translate directly into lower time-to-first-token.
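To make the tier walk concrete, here is a minimal Python sketch of a DRAM-over-SSD weight cache. It is an illustration under assumed names (TieredWeightCache and _load_from_ssd are hypothetical), not the API of any serving framework.

```python
import time
from collections import OrderedDict

class TieredWeightCache:
    """Minimal DRAM-over-SSD weight cache; the SSD loader is a stub."""

    def __init__(self, dram_capacity: int):
        self.dram = OrderedDict()          # model_id -> weights, in LRU order
        self.dram_capacity = dram_capacity

    def get(self, model_id: str):
        # DRAM hit: refresh recency and return without any I/O.
        if model_id in self.dram:
            self.dram.move_to_end(model_id)
            return self.dram[model_id]
        # Miss: fall back to the slower tier, then promote into DRAM.
        weights = self._load_from_ssd(model_id)
        self.dram[model_id] = weights
        if len(self.dram) > self.dram_capacity:
            self.dram.popitem(last=False)  # evict the least-recently-used model
        return weights

    def _load_from_ssd(self, model_id: str):
        time.sleep(0.01)                   # stand-in for an NVMe read
        return f"weights:{model_id}"
```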
Design Trade-offs
- A larger GPU-resident cache boosts tokens per second (TPS) but leaves less HBM for co-loading multiple models (see the back-of-envelope after this list).
- Host DRAM caching shortens cold starts but consumes comparatively expensive RAM.
- SSD caches risk serving stale shards, so entries need checksum validation on load.
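To make the first trade-off concrete, a back-of-envelope in Python, assuming a Llama-2-7B-like shape (32 layers, 32 KV heads, head dim 128, FP16) on an 80 GB H100; the exact numbers are illustrative assumptions, not measurements.

```python
# Assumed Llama-2-7B-like shape in FP16; adjust for your model.
N_LAYERS, N_KV_HEADS, HEAD_DIM, FP16 = 32, 32, 128, 2

weights_gb = 7e9 * FP16 / 1e9                               # ~14 GB of weights
kv_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * FP16  # K and V planes

hbm_gb = 80                                # assumed H100 SXM capacity
kv_budget_gb = hbm_gb - weights_gb         # HBM left over for KV blocks
print(f"KV per token: {kv_per_token / 2**20:.2f} MiB")   # -> 0.50 MiB
print(f"KV budget: {kv_budget_gb:.0f} GB "
      f"~ {kv_budget_gb * 1e9 / kv_per_token:,.0f} cached tokens")
```

Every additional co-loaded model subtracts its full weight footprint from that KV budget, which is the tension the first bullet describes.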
Current Trends (2025)
- vLLM's paged KV cache shares blocks across requests, raising hit ratios by roughly 20 percentage points (a toy sketch of the sharing idea follows this list).
- Checkpoint deduplication stores identical FP16 shards once and hard-links them across model versions.
- NVIDIA GPUDirect Storage (GDS) streams weights from NVMe into H100 memory without an intermediate CPU copy.
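The block-sharing idea is easiest to see in a toy sketch. This is not vLLM's implementation, just a content-addressed block table in which each KV block is keyed by a hash of the full token prefix it completes, so identical prefixes across requests map to the same block.

```python
import hashlib

BLOCK = 16  # tokens per KV block; vLLM likewise uses a fixed block size

class SharedBlockTable:
    """Toy content-addressed KV block table: equal prefixes share storage."""

    def __init__(self):
        self.blocks = {}   # prefix hash -> placeholder for KV block memory
        self.hits = self.misses = 0

    def map_request(self, token_ids):
        handles = []
        for end in range(BLOCK, len(token_ids) + 1, BLOCK):
            # Key each block by the entire prefix it completes, so sharing
            # happens only when every earlier token also matches.
            key = hashlib.sha1(str(token_ids[:end]).encode()).hexdigest()
            if key in self.blocks:
                self.hits += 1
            else:
                self.misses += 1
                self.blocks[key] = object()  # stand-in for allocated HBM
            handles.append(key)
        return handles

table = SharedBlockTable()
system_prompt = list(range(48))               # 3 full blocks of shared prefix
table.map_request(system_prompt + [1, 2, 3])  # first request: all misses
table.map_request(system_prompt + [7, 8, 9])  # second request: 3 block hits
print(table.hits, table.misses)               # -> 3 3
```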
Implementation Tips
- Evict with LRU weighted by each model's recent QPS, so hot models outlive merely recent ones (first sketch below).
- Validate the SHA-256 of weight shards on first load (second sketch below).
- Pre-warm caches asynchronously after deploys to avoid p99 latency spikes.
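A sketch of QPS-weighted LRU scoring, under the assumption that each cache entry tracks a decayed queries-per-second estimate; the scoring formula here is one plausible choice, not a standard.

```python
import time
from dataclasses import dataclass

@dataclass
class Entry:
    last_used: float    # wall-clock time of the last hit
    recent_qps: float   # exponentially decayed queries-per-second estimate

def eviction_score(e: Entry, now: float) -> float:
    # Plain LRU would use age alone; dividing by QPS keeps a slightly stale
    # but hot model ahead of a fresher, rarely-hit one.
    return (now - e.last_used) / (1.0 + e.recent_qps)

def pick_victim(cache: dict) -> str:
    now = time.time()
    return max(cache, key=lambda k: eviction_score(cache[k], now))

cache = {
    "llama-7b":   Entry(last_used=time.time() - 30, recent_qps=12.0),
    "mistral-7b": Entry(last_used=time.time() - 5,  recent_qps=0.1),
}
print(pick_victim(cache))  # evicts the cold model despite its recent touch
```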
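And a streaming SHA-256 check for first-load validation; hashing in chunks keeps memory flat even for multi-gigabyte shards. verify_shard is a hypothetical helper, not part of any library.

```python
import hashlib
from pathlib import Path

def verify_shard(path: Path, expected_sha256: str, chunk: int = 1 << 20) -> None:
    """Stream the shard through SHA-256 so it is never fully resident in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            digest.update(block)
    if digest.hexdigest() != expected_sha256:
        raise ValueError(f"checksum mismatch for {path}")
```

After a successful first-load check, recording a small verified marker next to the shard lets subsequent loads skip the hash entirely.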