A warm start occurs when an inference request lands on a server where the model weights are already loaded and kernels are compiled, allowing immediate execution with minimal latency.
## Cold vs Warm Latency
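The gap between the two paths is easiest to see by timing them directly. The sketch below is a minimal illustration, not a real serving stack: `load_model` and `run_inference` are hypothetical stand-ins, with the expensive weight-loading and kernel-compilation step simulated by a sleep. The first request pays that cost; later requests hit the cached model.

```python
import time

_MODEL_CACHE = {}  # model_id -> loaded model, kept resident between requests

def load_model(model_id: str):
    """Stand-in for the expensive path: weight loading + kernel compilation."""
    time.sleep(5.0)          # placeholder for multi-second cold-start work
    return {"id": model_id}  # placeholder for the loaded model object

def run_inference(model, prompt: str) -> str:
    """Stand-in for the cheap path once the model is resident."""
    return f"{model['id']} -> {prompt}"

def handle_request(model_id: str, prompt: str) -> str:
    start = time.perf_counter()
    model = _MODEL_CACHE.get(model_id)
    if model is None:                      # cold start: pay the load cost now
        model = load_model(model_id)
        _MODEL_CACHE[model_id] = model
    result = run_inference(model, prompt)  # warm path: model already resident
    print(f"latency: {time.perf_counter() - start:.3f}s")
    return result

if __name__ == "__main__":
    handle_request("llm-7b", "hello")  # cold: ~5s in this stand-in
    handle_request("llm-7b", "hello")  # warm: effectively instant
```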
## Maintaining Warm State
- Keep a minimum replica count even during off-peak hours.
- Periodically send synthetic pings to prevent idle shutdown (see the warmer sketch after this list).
- Use container snapshot restore to reload GPU memory quickly.
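A synthetic warmer can be as simple as a background thread that hits a lightweight endpoint on a fixed interval kept well below the platform's idle-shutdown window. A minimal sketch, assuming a hypothetical `/warm` endpoint and stdlib-only tooling:

```python
import threading
import urllib.request

WARM_URL = "http://localhost:8080/warm"   # hypothetical warm-up endpoint
PING_INTERVAL_S = 120                      # keep well below the idle-shutdown window

def send_warm_ping() -> None:
    """Issue a tiny synthetic request so the replica stays resident."""
    try:
        with urllib.request.urlopen(WARM_URL, timeout=5) as resp:
            resp.read()
    except OSError as exc:                 # network errors shouldn't kill the warmer
        print(f"warm ping failed: {exc}")

def warmer_loop(stop: threading.Event) -> None:
    while not stop.wait(PING_INTERVAL_S):  # wait() doubles as sleep and stop check
        send_warm_ping()

if __name__ == "__main__":
    stop = threading.Event()
    threading.Thread(target=warmer_loop, args=(stop,), daemon=True).start()
    # ... serve traffic; call stop.set() on shutdown ...
```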
## Design Trade-offs
- Holding warm capacity incurs extra cost during low demand.
- Aggressive idling saves money but increases p99 latency when traffic resumes; a rough way to compare the two is sketched below.
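Before tuning, the trade-off can be put in rough numbers. The sketch below is a back-of-the-envelope comparison with placeholder prices, cold-start times, and traffic figures; it assumes, pessimistically, that every request arriving while a replica is still loading waits for the full load to finish.

```python
# Rough comparison: cost of idle warm replicas vs. the latency penalty users
# eat after scale-to-zero. All numbers here are placeholder assumptions.

GPU_HOUR_COST = 2.50        # $/hour per warm replica (assumed)
COLD_START_SECONDS = 45.0   # measured weight load + kernel compile time (assumed)
OFF_PEAK_HOURS = 8          # hours/day with little or no traffic (assumed)

def idle_warm_cost(replicas: int) -> float:
    """Daily dollar cost of holding `replicas` warm through off-peak hours."""
    return replicas * GPU_HOUR_COST * OFF_PEAK_HOURS

def cold_start_penalty(requests_during_load: int) -> float:
    """Worst-case extra user-facing seconds if those requests all queue
    behind a single cold start (each waits up to the full load time)."""
    return requests_during_load * COLD_START_SECONDS

if __name__ == "__main__":
    print(f"holding 2 warm replicas: ${idle_warm_cost(2):.2f}/day")
    print(f"scale-to-zero, 20 requests during reload: "
          f"up to {cold_start_penalty(20):.0f}s of added latency")
```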
## Current Trends (2025)
- Predictive autoscalers maintain a rolling warm pool sized from a forecasted five-minute demand window (see the sizing sketch after this list).
- NVSwitch memory sharing warms multiple models concurrently on the same GPU.[^1]
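A predictive warm pool of this kind reduces to a small sizing rule: forecast demand over the next window and convert it to a replica count, with a floor so the pool never drains completely. The sketch below stubs the forecaster out as a moving average; the window length, per-replica throughput, and floor are illustrative assumptions, not values from any particular autoscaler.

```python
import math
from collections import deque

WINDOW_MINUTES = 5          # size the pool for the next five-minute window
REQS_PER_REPLICA_MIN = 30   # sustained throughput of one warm replica (assumed)
MIN_WARM_REPLICAS = 1       # never let the warm pool drain completely

class WarmPoolSizer:
    """Sizes the warm pool from a rolling forecast of per-minute demand."""

    def __init__(self, history_minutes: int = 15):
        self.history = deque(maxlen=history_minutes)  # recent per-minute request counts

    def observe(self, requests_last_minute: int) -> None:
        self.history.append(requests_last_minute)

    def forecast_next_window(self) -> float:
        # Stand-in forecaster: moving average projected over the window.
        if not self.history:
            return 0.0
        return (sum(self.history) / len(self.history)) * WINDOW_MINUTES

    def target_warm_replicas(self) -> int:
        per_minute = self.forecast_next_window() / WINDOW_MINUTES
        needed = math.ceil(per_minute / REQS_PER_REPLICA_MIN)
        return max(MIN_WARM_REPLICAS, needed)

if __name__ == "__main__":
    sizer = WarmPoolSizer()
    for load in [40, 55, 80, 120, 150]:   # ramping traffic, requests per minute
        sizer.observe(load)
    print("warm replicas to hold:", sizer.target_warm_replicas())
```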
## Implementation Tips
- Separate health checks from synthetic warmers to avoid skewing metrics.
- Track the warm-hit ratio (warm requests / total requests) as a KPI (see the tracker sketch after this list).
- Tag warm nodes so they are not selected for large offline batch jobs.
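The warm-hit ratio comes down to two counters, with synthetic warmer traffic labeled and excluded so it does not inflate the KPI. A minimal sketch with plain in-process counters (in production these would typically be exported to your metrics system instead):

```python
from dataclasses import dataclass

@dataclass
class WarmHitTracker:
    """Counts warm vs. total requests; synthetic warmer pings are excluded."""
    warm: int = 0
    total: int = 0

    def record(self, was_warm: bool, synthetic: bool = False) -> None:
        if synthetic:        # warmer pings don't count toward the KPI
            return
        self.total += 1
        if was_warm:
            self.warm += 1

    @property
    def warm_hit_ratio(self) -> float:
        return self.warm / self.total if self.total else 0.0

if __name__ == "__main__":
    tracker = WarmHitTracker()
    tracker.record(was_warm=False)                 # first request after scale-up: cold
    tracker.record(was_warm=True)
    tracker.record(was_warm=True, synthetic=True)  # warmer ping, ignored
    print(f"warm-hit ratio: {tracker.warm_hit_ratio:.2%}")  # 50.00%
```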
## References
[^1]: NVIDIA Whitepaper, *Multi-Model Serving with NVSwitch*, 2025.