Model Caching

Benched.ai Editorial Team

Model caching stores model weights and, optionally, key–value (KV) attention states in fast storage so that repeated invocations avoid costly disk or network fetches.
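The idea can be shown with a minimal sketch: an in-process cache keyed by model ID that pays the slow fetch cost only once. The class and function names here are illustrative, not a real serving API.

```python
# Hypothetical sketch: a weight cache keyed by model ID.
# fetch_fn stands in for the slow path (disk or network load).
class WeightCache:
    def __init__(self, fetch_fn):
        self._fetch = fetch_fn
        self._store = {}                      # model_id -> weights

    def get(self, model_id):
        if model_id not in self._store:       # miss: pay the fetch cost once
            self._store[model_id] = self._fetch(model_id)
        return self._store[model_id]          # hit: served from fast storage

calls = []
def slow_fetch(model_id):
    calls.append(model_id)                    # record each slow-path fetch
    return f"weights-for-{model_id}"

cache = WeightCache(slow_fetch)
cache.get("llama-7b")
cache.get("llama-7b")                         # second call is a cache hit
print(len(calls))                             # slow path ran only once -> 1
```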

  Cache Layers

| Layer | Medium | TTL | Contents |
|---|---|---|---|
| GPU VRAM | HBM / GDDR | Request | Weights + live KV |
| Host DRAM | DDR5 | Minutes | Idle weights |
| Local SSD | NVMe | Hours | Sharded checkpoints |
| Remote blob | S3 / GCS | Days | Original artifacts |
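The tiered lookup implied by this table can be sketched as follows: probe the fastest layer first, fall through to slower ones, and promote hits upward so the next access is fast. The tier names and dict-backed layers are illustrative, not a real serving stack.

```python
TIERS = ["vram", "dram", "ssd", "blob"]       # fastest to slowest

def lookup(caches, key):
    """Return (tier, value); promote the value into all faster tiers."""
    for i, tier in enumerate(TIERS):
        if key in caches[tier]:
            for faster in TIERS[:i]:          # promote toward GPU VRAM
                caches[faster][key] = caches[tier][key]
            return tier, caches[tier][key]
    raise KeyError(key)

caches = {t: {} for t in TIERS}
caches["blob"]["llama-7b"] = "checkpoint"
print(lookup(caches, "llama-7b")[0])          # first access hits remote blob
print(lookup(caches, "llama-7b")[0])          # now served from vram
```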

  Hit Ratio Impact (LLM inference)

| Hit % | Latency (ms) | Cost Δ |
|---|---|---|
| 100 | 30 | baseline |
| 80 | 120 | +10 % |
| 0 | 600 | +40 % |
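A first-order latency model can be built from the table's endpoints: 30 ms on a full hit, 600 ms on a full miss, blended linearly by hit ratio. Real systems can beat the linear blend (the table's 120 ms at 80 % is below the model's prediction of 144 ms), so treat this as a pessimistic estimate.

```python
# Assumed endpoint latencies taken from the table above.
HIT_MS, MISS_MS = 30.0, 600.0

def expected_latency_ms(hit_ratio):
    # Linear blend: every miss pays the full fetch cost with no overlap.
    return hit_ratio * HIT_MS + (1.0 - hit_ratio) * MISS_MS

for h in (1.0, 0.8, 0.0):
    print(f"{h:.0%} hit -> {expected_latency_ms(h):.0f} ms")
```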

  Design Trade-offs

  • A larger GPU cache boosts throughput (tokens per second) but limits multi-model co-loading.
  • Host DRAM caching improves cold starts but consumes expensive RAM.
  • SSD caches risk staleness, so entries need checksum validation.
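The staleness check in the last point can be sketched as a streaming SHA-256 comparison against a trusted manifest before a cached shard is served. File names and the manifest shape here are illustrative.

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):         # stream so large shards fit in RAM
            h.update(block)
    return h.hexdigest()

def is_fresh(path, manifest):
    # Serve the cached copy only if its digest matches the manifest entry.
    return sha256_of(path) == manifest.get(os.path.basename(path))

with tempfile.TemporaryDirectory() as d:
    shard = os.path.join(d, "model-00001.bin")
    with open(shard, "wb") as f:
        f.write(b"weights")
    manifest = {"model-00001.bin": sha256_of(shard)}
    print(is_fresh(shard, manifest))          # True: cached copy matches
    with open(shard, "ab") as f:
        f.write(b"corruption")                # simulate a stale/partial write
    print(is_fresh(shard, manifest))          # False: checksum mismatch
```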

  Current Trends (2025)

  • vLLM's paged KV cache shares blocks across requests, raising hit ratios by 20 percentage points.
  • Checkpoint deduplication stores each FP16 shard once and hard-links it across versions.
  • GPU direct storage (GDS) streams weights from NVMe to an H100 without a CPU copy.
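Content-addressed deduplication with hard links, as in the second point, can be sketched as follows: identical shards hash to the same blob, and each version directory hard-links to it instead of storing a copy. This is an illustrative sketch (POSIX hard links), not the mechanism of any specific tool.

```python
import hashlib
import os
import tempfile

def dedup_store(shard_path, blob_dir):
    """Move a shard into content-addressed storage and hard-link it back."""
    with open(shard_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    blob = os.path.join(blob_dir, digest)
    if not os.path.exists(blob):
        os.replace(shard_path, blob)          # first copy becomes the blob
    else:
        os.remove(shard_path)                 # duplicate: drop the extra copy
    os.link(blob, shard_path)                 # hard-link into the version dir
    return blob

with tempfile.TemporaryDirectory() as d:
    blobs = os.path.join(d, "blobs")
    os.makedirs(blobs)
    v1 = os.path.join(d, "v1-shard.bin")
    v2 = os.path.join(d, "v2-shard.bin")
    for p in (v1, v2):
        with open(p, "wb") as f:
            f.write(b"same fp16 shard")
    dedup_store(v1, blobs)
    dedup_store(v2, blobs)
    print(os.stat(v1).st_ino == os.stat(v2).st_ino)  # one inode, two names
```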

  Implementation Tips

  1. Evict with an LRU policy weighted by recent QPS.
  2. Validate the SHA-256 of weights on first load.
  3. Pre-warm caches asynchronously after deploys to avoid p99 latency spikes.
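Tip 1 can be sketched as scoring each cached model by its recent QPS discounted by idle time and evicting the lowest scorer; the scoring formula and class below are an assumption, not a fixed recipe.

```python
class QpsWeightedCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}                     # model_id -> (last_access, qps)

    def touch(self, model_id, now, qps):
        if model_id not in self.entries and len(self.entries) >= self.capacity:
            self._evict(now)
        self.entries[model_id] = (now, qps)

    def _evict(self, now):
        # Score = recent QPS discounted by idle time; the lowest score is
        # evicted, so this degrades to plain LRU when traffic rates are equal.
        def score(mid):
            last_access, qps = self.entries[mid]
            return qps / (1.0 + now - last_access)
        del self.entries[min(self.entries, key=score)]

cache = QpsWeightedCache(capacity=2)
cache.touch("busy-model", now=0.0, qps=500.0)
cache.touch("idle-model", now=5.0, qps=1.0)
cache.touch("new-model",  now=10.0, qps=50.0)  # evicts the lowest scorer
print(sorted(cache.entries))                   # idle-model was evicted
```

Here the busy model survives despite being least recently touched, because its high QPS outweighs its age.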