Cold Start

Benched.ai Editorial Team

A cold start occurs when an inference request arrives at a compute node that has not yet loaded the model weights or warmed up its runtime environment. The load-and-initialize path can add seconds of latency, degrade SLA compliance, and saturate network and memory bandwidth as weights are pulled from remote object storage and copied into device memory.

  Cold-Start Sequence

  1. Provision container or serverless function.
  2. Pull model artifact from disk / cloud store.
  3. Deserialize checkpoints into device memory.
  4. Compile or optimize kernels (e.g., CUDA graph capture).
  5. Serve first request and populate caches.
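
To make each phase's contribution visible, the sequence above can be timed step by step. Below is a minimal sketch, assuming placeholder callables (pull_artifact, load_checkpoint, compile_kernels, serve_first) that stand in for whatever the serving stack actually does:

```python
# Time each cold-start phase separately so the expensive step is obvious in logs.
# The four callables are hypothetical placeholders, not a real serving API.
import time
from contextlib import contextmanager

@contextmanager
def timed(phase, timings):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[phase] = time.perf_counter() - start

def cold_start(pull_artifact, load_checkpoint, compile_kernels, serve_first):
    """Run the cold-start sequence and return per-phase wall-clock times."""
    timings = {}
    with timed("pull", timings):
        artifact = pull_artifact()            # step 2: fetch weights from disk / cloud store
    with timed("load", timings):
        model = load_checkpoint(artifact)     # step 3: deserialize into device memory
    with timed("compile", timings):
        model = compile_kernels(model)        # step 4: kernel compilation / graph capture
    with timed("first_request", timings):
        serve_first(model)                    # step 5: first request populates caches
    return timings
```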

  Typical Cold-Start Times

Model              Storage Size   Load + Init   Warm Request
GPT-J-6B           24 GB          8–12 s        400 ms
Llama-2-70B        140 GB         35–50 s       900 ms
Whisper-large-v3   2.9 GB         2–3 s         300 ms
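
As a rough sanity check on these figures, artifact size divided by storage read bandwidth already sets a floor on the load phase; the 4 GB/s NVMe number below is an illustrative assumption, not a measurement:

```python
# Back-of-the-envelope floor on load time: bytes read / read bandwidth.
SIZES_GB = {"GPT-J-6B": 24, "Llama-2-70B": 140, "Whisper-large-v3": 2.9}
READ_GBPS = 4.0  # assumed local NVMe sequential read throughput

for model, size_gb in SIZES_GB.items():
    print(f"{model}: >= {size_gb / READ_GBPS:.0f} s just to read {size_gb} GB")
# Deserialization, host-to-device copies, and kernel compilation add on top,
# which is why Llama-2-70B lands in the 35-50 s range rather than ~35 s flat.
```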

  Mitigation Strategies

  • Snapshotting GPU memory pages to local NVMe and lazy-loading unused layers.
  • Pool of pre-warmed replicas scaled by demand forecast.
  • Weight streaming with tensor parallelism so first tokens decode before full load.
  • Compile caching: save TensorRT or XLA artifacts to skip JIT on restart.
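
For the compile-caching strategy, one hedged sketch with TensorRT is to persist the serialized engine to local NVMe and deserialize it on restart instead of rebuilding; the cache path and the build_serialized_engine callable are assumptions, not part of any fixed API:

```python
# Reuse a cached TensorRT engine on restart to skip the expensive build/JIT step.
import os
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
ENGINE_PATH = "/local/nvme/model.plan"   # assumed cache location on local NVMe

def load_or_build_engine(build_serialized_engine):
    """Fast path: deserialize a cached engine. Slow path: build once, then cache."""
    if os.path.exists(ENGINE_PATH):
        with open(ENGINE_PATH, "rb") as f:
            blob = f.read()               # skip kernel compilation entirely
    else:
        blob = build_serialized_engine()  # placeholder for the full TensorRT build
        with open(ENGINE_PATH, "wb") as f:
            f.write(blob)
    return trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(blob)
```

The same pattern applies to XLA or torch.compile artifacts; key the cache on model version and GPU architecture so a stale or incompatible entry is never reused.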

  Design Trade-offs

  • Keeping GPUs warm burns capacity during low traffic but eliminates latency spikes (a rough cost comparison follows this list).
  • Smaller quantized checkpoints load faster yet may drop accuracy.
  • Serverless cold ramps avoid idle costs but shift latency onto tail percentiles.
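
A rough, illustrative cost comparison for the first trade-off; every number here is an assumption chosen for the example, not a benchmark:

```python
# Keep-warm vs. scale-to-zero at low traffic, using the Llama-2-70B figures above.
GPU_HOUR_USD = 2.50        # assumed hourly price of one warm replica
COLD_START_S = 40.0        # roughly the 35-50 s range from the table
REQUESTS_PER_HOUR = 6      # sparse traffic, assuming each request finds a cold node

idle_cost_per_hour = GPU_HOUR_USD
user_facing_delay_s = COLD_START_S * REQUESTS_PER_HOUR

print(f"Always warm: ${idle_cost_per_hour:.2f}/hour of mostly idle GPU")
print(f"Scale to zero: {user_facing_delay_s:.0f} s/hour of cold-start latency on users")
```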

  Current Trends (2025)

  • Layer-at-a-time lazy loading in vLLM reduces 70B cold starts to 9 s.
  • Container snapshot APIs on Kubernetes 1.30 restore 16 GB GPU state in 1.4 s.
  • Model-aware auto-scalers predict bursty traffic with time-series LSTMs, pre-provisioning nodes 45 s ahead [1].
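
A sketch of that forecast-driven pre-provisioning pattern, with a simple moving average standing in for the LSTM forecaster and provision_node as a placeholder for the real orchestration call:

```python
# Provision nodes ahead of predicted demand so they are warm when traffic arrives.
import math
from collections import deque

class PreProvisioner:
    def __init__(self, lead_time_s=45.0, per_node_rps=2.0, window=12):
        self.lead_time_s = lead_time_s      # how far ahead nodes must be started
        self.per_node_rps = per_node_rps    # capacity of one warm node
        self.history = deque(maxlen=window) # recent request-rate samples

    def forecast_rps(self):
        """Naive forecaster: mean of recent samples (an LSTM would slot in here)."""
        return sum(self.history) / len(self.history) if self.history else 0.0

    def tick(self, observed_rps, ready_nodes, provision_node):
        """Call once per sampling interval; starts nodes to cover the forecast."""
        self.history.append(observed_rps)
        needed = math.ceil(self.forecast_rps() / self.per_node_rps)
        for _ in range(needed - ready_nodes):
            provision_node()  # must complete its cold start within lead_time_s
```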

  Implementation Tips

  1. Measure cold-start and warm-request latencies as separate histograms (see the sketch after this list).
  2. Store checkpoints on local SSD where possible; network block storage adds 2–3× latency.
  3. Pre-compute positional encodings and save with the model to skip first-token math.
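
For the first tip, a minimal sketch of keeping cold and warm latencies in separate histograms, so a handful of cold starts is not averaged away (bucket boundaries are illustrative):

```python
# Separate histograms for cold-start and warm-request latency.
import bisect

BUCKETS_MS = [50, 100, 250, 500, 1000, 2500, 5000, 10000, 30000, 60000]

class LatencyHistogram:
    def __init__(self):
        self.counts = [0] * (len(BUCKETS_MS) + 1)  # last bucket catches overflow

    def observe(self, latency_ms):
        self.counts[bisect.bisect_left(BUCKETS_MS, latency_ms)] += 1

cold, warm = LatencyHistogram(), LatencyHistogram()

def record(latency_ms, is_first_request_on_replica):
    """Route each sample to the cold or warm histogram, never a merged one."""
    (cold if is_first_request_on_replica else warm).observe(latency_ms)
```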

  References

  1. AWS re:Invent talk ARC-402, 2025.