A cold start occurs when an inference request arrives at a compute node that has not yet loaded the model weights or warmed up its runtime environment. The load-and-initialize path can add seconds of latency, degrade SLA compliance, and spike network and memory bandwidth as weights are pulled from remote object storage and copied into device memory.
Cold-Start Sequence
- Provision container or serverless function.
- Pull model artifact from disk / cloud store.
- Deserialize checkpoints into device memory.
- Compile or optimize kernels (e.g., CUDA graph capture).
- Serve first request and populate caches.
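The sketch below walks these stages end to end and times each one. The toy MLP, the checkpoint path, and the single-GPU assumption are stand-ins for illustration, not a particular serving stack; with a real multi-gigabyte checkpoint the load and copy stages dominate.

```python
# Minimal cold-start timing sketch (assumes a CUDA device is available).
import time
import torch
import torch.nn as nn

CHECKPOINT = "/tmp/demo_checkpoint.pt"   # hypothetical path used only for this sketch

def build_model():
    # Toy stand-in; a real LLM makes the disk and host-to-device stages dominate.
    return nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

torch.save(build_model().state_dict(), CHECKPOINT)   # stand-in for the published artifact
torch.cuda.init()                                    # keep CUDA context setup out of the stage timings

def timed(label, fn):
    t0 = time.perf_counter()
    out = fn()
    torch.cuda.synchronize()                         # count queued GPU work against this stage
    print(f"{label:<24}{time.perf_counter() - t0:7.3f} s")
    return out

# Pull the artifact and deserialize it on the host.
state = timed("load + deserialize", lambda: torch.load(CHECKPOINT, map_location="cpu"))

# Materialize the weights in device memory.
model = build_model()
model.load_state_dict(state)
timed("copy to GPU", lambda: model.cuda().eval())

# The first request triggers kernel selection and fills caches; the second shows the warm path.
x = torch.randn(8, 4096, device="cuda")
with torch.no_grad():
    timed("first forward (cold)", lambda: model(x))
    timed("second forward (warm)", lambda: model(x))
```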
Typical Cold-Start Times
Mitigation Strategies
- Snapshotting GPU memory pages to local NVMe and lazy-loading unused layers.
- Pool of pre-warmed replicas sized by a demand forecast (see the sketch after this list).
- Weight streaming with tensor parallelism so first tokens decode before full load.
- Compile caching: save TensorRT or XLA artifacts to skip JIT on restart.
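Of the strategies above, the pre-warmed replica pool is the easiest to sketch. Everything here is a minimal stand-in: `load_replica()`, the forecast function `desired_warm()`, and `serve()` are placeholders rather than any real framework API.

```python
import queue
import threading
import time

def load_replica():
    """Stand-in for the expensive path: pull weights, copy to GPU, warm up."""
    time.sleep(5)                              # pretend the cold start takes seconds
    return object()                            # opaque handle to a ready replica

def desired_warm() -> int:
    """Stand-in for the demand forecast: how many replicas to keep warm."""
    return 2

def serve(replica, payload):
    """Stand-in for the actual inference call."""
    return f"result for {payload}"

warm_pool = queue.Queue()

def warmer():
    # Pay cold starts off the request path, keeping the pool at the forecast size.
    while True:
        if warm_pool.qsize() < desired_warm():
            warm_pool.put(load_replica())
        else:
            time.sleep(1)

threading.Thread(target=warmer, daemon=True).start()

def handle_request(payload):
    try:
        replica = warm_pool.get(timeout=0.1)   # fast path: a pre-warmed replica is ready
    except queue.Empty:
        replica = load_replica()               # slow path: cold start in-band
    try:
        return serve(replica, payload)
    finally:
        warm_pool.put(replica)                 # hand the (now warm) replica back to the pool
```

Requests that find the pool empty still pay the cold start in-band, so the warmer's target has to track demand closely enough that the fast path stays the common case.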
Design Trade-offs
- Keeping GPUs warm burns capacity during low traffic but eliminates latency spikes.
- Smaller quantized checkpoints load faster yet may drop accuracy.
- Serverless scale-to-zero avoids idle cost but pushes cold-start latency into the tail percentiles (see the back-of-envelope comparison after this list).
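A back-of-envelope comparison of the first and last trade-offs; every figure below is an assumption chosen for illustration, not a measurement.

```python
# Illustrative numbers only: plug in your own cost, latency, and traffic figures.
GPU_HOURLY_COST = 2.50     # $/hour to keep one replica warm but idle (assumed)
COLD_START_S    = 30.0     # cold-start latency when no warm replica exists (assumed)
COLD_HITS_PER_H = 4        # requests per hour that would land on a cold replica (assumed)

always_warm_cost    = GPU_HOURLY_COST                    # $ per hour spent on idle capacity
scale_to_zero_delay = COLD_HITS_PER_H * COLD_START_S     # user-visible delay pushed to the tail, s/hour

print(f"always warm  : ${always_warm_cost:.2f}/h idle spend, no cold-start delay")
print(f"scale to zero: $0.00/h idle spend, {scale_to_zero_delay:.0f} s of cold-start delay per hour")
```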
Current Trends (2025)
- Layer-at-a-time lazy loading in vLLM reduces 70B cold starts to 9 s.
- Container snapshot APIs on Kubernetes 1.30 restore 16 GB GPU state in 1.4 s.
- Model-aware auto-scalers predict bursty traffic with time-series LSTMs and pre-provision nodes 45 s ahead of forecast spikes [1] (a simplified sketch follows this list).
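A deliberately simplified version of that control loop, assuming an exponentially weighted moving average in place of the LSTM forecaster and a made-up per-replica capacity.

```python
import math

LEAD_TIME_S   = 45      # provision this far ahead of the predicted burst
REQS_PER_NODE = 50      # assumed sustained requests/s one warm replica can handle

def ewma(rates, alpha=0.3):
    """Crude stand-in for the LSTM forecaster: smooth recent request rates."""
    level = rates[0]
    for r in rates[1:]:
        level = alpha * r + (1 - alpha) * level
    return level

def replicas_to_prewarm(recent_rates):
    """How many replicas must already be warm LEAD_TIME_S seconds from now."""
    predicted_rate = ewma(recent_rates)
    return math.ceil(predicted_rate / REQS_PER_NODE)

# Example: request rate ramping from 80 to 240 req/s over recent intervals.
print(replicas_to_prewarm([80, 120, 180, 240]), "replicas, provisioned", LEAD_TIME_S, "s early")
```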
Implementation Tips
- Measure cold and warm latency histograms separately (see the sketch after these tips).
- Store checkpoints on local SSD where possible; network block storage adds 2–3× latency.
- Pre-compute positional encodings and save with the model to skip first-token math.
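A minimal sketch for the first tip, keeping cold and warm latencies in separate histograms so cold starts cannot hide inside an averaged distribution; the bucket edges and the first-request heuristic for spotting a cold replica are assumptions.

```python
import bisect
import time
from collections import Counter

BUCKET_EDGES_S = [0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30]   # assumed bucket upper bounds, seconds
histograms = {"cold": Counter(), "warm": Counter()}
_served_before = False

def record(kind, seconds):
    idx = bisect.bisect_left(BUCKET_EDGES_S, seconds)
    label = f"<={BUCKET_EDGES_S[idx]}s" if idx < len(BUCKET_EDGES_S) else ">30s"
    histograms[kind][label] += 1

def timed_inference(run_request, payload):
    """Wrap the inference call and file its latency under 'cold' or 'warm'."""
    global _served_before
    kind = "warm" if _served_before else "cold"   # first request on this replica pays the cold start
    t0 = time.perf_counter()
    result = run_request(payload)                 # `run_request` is a placeholder inference callable
    record(kind, time.perf_counter() - t0)
    _served_before = True
    return result

# Example with a stand-in inference function:
timed_inference(lambda p: time.sleep(0.2) or p, "hello")
print(histograms)
```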
References
1. AWS re:Invent talk ARC-402, 2025.