A cold start occurs when an inference request arrives at a compute node that has not yet loaded the model weights or warmed up its runtime environment. The load-and-initialize path can add seconds of latency, degrade SLA compliance, and spike network and memory bandwidth as weights are pulled from remote object storage and copied into device memory.
Cold-Start Sequence
- Provision container or serverless function.
- Pull model artifact from disk / cloud store.
- Deserialize checkpoints into device memory.
- Compile or optimize kernels (e.g., CUDA graph capture).
- Serve first request and populate caches.
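The sketch below walks these stages end to end and times each one. The toy MLP, the checkpoint path, and the single-GPU assumption are stand-ins for illustration, not a particular serving stack; with a real multi-gigabyte checkpoint the load and copy stages dominate.

```python
# Minimal cold-start timing sketch (assumes a CUDA device is available).
import time
import torch
import torch.nn as nn

CHECKPOINT = "/tmp/demo_checkpoint.pt"   # hypothetical path used only for this sketch

def build_model():
    # Toy stand-in; a real LLM makes the disk and host-to-device stages dominate.
    return nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

torch.save(build_model().state_dict(), CHECKPOINT)   # stand-in for the published artifact
torch.cuda.init()                                    # keep CUDA context setup out of the stage timings

def timed(label, fn):
    t0 = time.perf_counter()
    out = fn()
    torch.cuda.synchronize()                         # count queued GPU work against this stage
    print(f"{label:<24}{time.perf_counter() - t0:7.3f} s")
    return out

# Pull the artifact and deserialize it on the host.
state = timed("load + deserialize", lambda: torch.load(CHECKPOINT, map_location="cpu"))

# Materialize the weights in device memory.
model = build_model()
model.load_state_dict(state)
timed("copy to GPU", lambda: model.cuda().eval())

# The first request triggers kernel selection and fills caches; the second shows the warm path.
x = torch.randn(8, 4096, device="cuda")
with torch.no_grad():
    timed("first forward (cold)", lambda: model(x))
    timed("second forward (warm)", lambda: model(x))
```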
Typical Cold-Start Times
Mitigation Strategies
- Snapshotting GPU memory pages to local NVMe and lazy-loading unused layers.
- Pool of pre-warmed replicas sized by a demand forecast (see the sketch after this list).
- Weight streaming with tensor parallelism so first tokens decode before full load.
- Compile caching: save TensorRT or XLA artifacts to skip JIT on restart.
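Of the strategies above, the pre-warmed replica pool is the easiest to sketch. Everything here is a minimal stand-in: `load_replica()`, the forecast function `desired_warm()`, and `serve()` are placeholders rather than any real framework API.

```python
import queue
import threading
import time

def load_replica():
    """Stand-in for the expensive path: pull weights, copy to GPU, warm up."""
    time.sleep(5)                              # pretend the cold start takes seconds
    return object()                            # opaque handle to a ready replica

def desired_warm() -> int:
    """Stand-in for the demand forecast: how many replicas to keep warm."""
    return 2

def serve(replica, payload):
    """Stand-in for the actual inference call."""
    return f"result for {payload}"

warm_pool = queue.Queue()

def warmer():
    # Pay cold starts off the request path, keeping the pool at the forecast size.
    while True:
        if warm_pool.qsize() < desired_warm():
            warm_pool.put(load_replica())
        else:
            time.sleep(1)

threading.Thread(target=warmer, daemon=True).start()

def handle_request(payload):
    try:
        replica = warm_pool.get(timeout=0.1)   # fast path: a pre-warmed replica is ready
    except queue.Empty:
        replica = load_replica()               # slow path: cold start in-band
    try:
        return serve(replica, payload)
    finally:
        warm_pool.put(replica)                 # hand the (now warm) replica back to the pool
```

Requests that find the pool empty still pay the cold start in-band, so the warmer's target has to track demand closely enough that the fast path stays the common case.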
Design Trade-offs
- Keeping GPUs warm burns capacity during low traffic but eliminates latency spikes.
- Smaller quantized checkpoints load faster yet may drop accuracy.
- Serverless scale-to-zero avoids idle cost but pushes cold-start latency into the tail percentiles (see the back-of-envelope comparison after this list).
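A back-of-envelope comparison of the first and last trade-offs; every figure below is an assumption chosen for illustration, not a measurement.

```python
# Illustrative numbers only: plug in your own cost, latency, and traffic figures.
GPU_HOURLY_COST = 2.50     # $/hour to keep one replica warm but idle (assumed)
COLD_START_S    = 30.0     # cold-start latency when no warm replica exists (assumed)
COLD_HITS_PER_H = 4        # requests per hour that would land on a cold replica (assumed)

always_warm_cost    = GPU_HOURLY_COST                    # $ per hour spent on idle capacity
scale_to_zero_delay = COLD_HITS_PER_H * COLD_START_S     # user-visible delay pushed to the tail, s/hour

print(f"always warm  : ${always_warm_cost:.2f}/h idle spend, no cold-start delay")
print(f"scale to zero: $0.00/h idle spend, {scale_to_zero_delay:.0f} s of cold-start delay per hour")
```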
Current Trends (2025)
- Layer-at-a-time lazy loading in vLLM reduces 70B cold starts to 9 s.
- Container snapshot APIs on Kubernetes 1.30 restore 16 GB GPU state in 1.4 s.
- Model-aware auto-scalers predict bursty traffic with time-series LSTMs and pre-provision nodes 45 s ahead of forecast spikes [1] (a simplified sketch follows this list).
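A deliberately simplified version of that control loop, assuming an exponentially weighted moving average in place of the LSTM forecaster and a made-up per-replica capacity.

```python
import math

LEAD_TIME_S   = 45      # provision this far ahead of the predicted burst
REQS_PER_NODE = 50      # assumed sustained requests/s one warm replica can handle

def ewma(rates, alpha=0.3):
    """Crude stand-in for the LSTM forecaster: smooth recent request rates."""
    level = rates[0]
    for r in rates[1:]:
        level = alpha * r + (1 - alpha) * level
    return level

def replicas_to_prewarm(recent_rates):
    """How many replicas must already be warm LEAD_TIME_S seconds from now."""
    predicted_rate = ewma(recent_rates)
    return math.ceil(predicted_rate / REQS_PER_NODE)

# Example: request rate ramping from 80 to 240 req/s over recent intervals.
print(replicas_to_prewarm([80, 120, 180, 240]), "replicas, provisioned", LEAD_TIME_S, "s early")
```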
Implementation Tips
- Measure cold and warm latency histograms separately (see the sketch after these tips).
- Store checkpoints on local SSD where possible; network block storage adds 2–3× latency.
- Pre-compute positional encodings and save with the model to skip first-token math.
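A minimal sketch for the first tip, keeping cold and warm latencies in separate histograms so cold starts cannot hide inside an averaged distribution; the bucket edges and the first-request heuristic for spotting a cold replica are assumptions.

```python
import bisect
import time
from collections import Counter

BUCKET_EDGES_S = [0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30]   # assumed bucket upper bounds, seconds
histograms = {"cold": Counter(), "warm": Counter()}
_served_before = False

def record(kind, seconds):
    idx = bisect.bisect_left(BUCKET_EDGES_S, seconds)
    label = f"<={BUCKET_EDGES_S[idx]}s" if idx < len(BUCKET_EDGES_S) else ">30s"
    histograms[kind][label] += 1

def timed_inference(run_request, payload):
    """Wrap the inference call and file its latency under 'cold' or 'warm'."""
    global _served_before
    kind = "warm" if _served_before else "cold"   # first request on this replica pays the cold start
    t0 = time.perf_counter()
    result = run_request(payload)                 # `run_request` is a placeholder inference callable
    record(kind, time.perf_counter() - t0)
    _served_before = True
    return result

# Example with a stand-in inference function:
timed_inference(lambda p: time.sleep(0.2) or p, "hello")
print(histograms)
```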
References
1. AWS re:Invent talk ARC-402, 2025.