Dynamic model selection chooses among multiple candidate models at inference time based on request attributes such as input length, required latency, or cost budget. Selecting the lightest model that meets quality targets saves compute without sacrificing user experience.
Selection Policies
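A minimal threshold-based policy, sketching the idea from the introduction: route by input length and latency budget, preferring the lightest model that fits. Model names and cutoffs here are hypothetical, not from the source.

```python
# Hypothetical sketch of a threshold-based selection policy.
# Model names and thresholds are invented for illustration.
from dataclasses import dataclass

@dataclass
class Request:
    input_tokens: int
    latency_budget_ms: int

def select_model(req: Request) -> str:
    # A tight latency budget forces the smallest model regardless of input size.
    if req.latency_budget_ms < 200:
        return "small-7b"
    # Longer inputs are routed to more capable models.
    if req.input_tokens > 2000:
        return "large-34b"
    if req.input_tokens > 500:
        return "medium-13b"
    return "small-7b"

print(select_model(Request(input_tokens=100, latency_budget_ms=1000)))   # small-7b
print(select_model(Request(input_tokens=3000, latency_budget_ms=1000)))  # large-34b
```

In practice the thresholds would come from offline evaluation of each model's quality on requests of varying length, not hand-picked constants.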
Design Trade-offs
- Maintaining many sets of model weights in memory increases the serving footprint.
- Confidence estimation errors can route hard queries to weak models, hurting quality.
- Switching models mid-conversation may change style; pin the choice per session if brand voice matters.
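Pinning the choice per session, as the last trade-off suggests, can be as simple as remembering the first routing decision. A minimal sketch (the in-memory store and tier names are hypothetical):

```python
# Hypothetical sketch: pin the model choice for the lifetime of a session
# so style stays consistent mid-conversation. The dict stands in for a
# real session store (e.g. Redis) in production.
_session_model: dict[str, str] = {}

def pinned_select(session_id: str, policy_choice: str) -> str:
    # The first request in a session records the policy's choice;
    # later requests reuse it instead of re-running selection.
    return _session_model.setdefault(session_id, policy_choice)

print(pinned_select("sess-1", "medium-13b"))  # medium-13b (recorded)
print(pinned_select("sess-1", "large-34b"))   # medium-13b (pinned, new choice ignored)
```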
Current Trends (2025)
- A mix of quantized 7B, 13B, and 34B models served behind a single endpoint with automatic scaling.
- Reinforcement-learning bandit policies outperform static heuristics, saving roughly 8% of tokens at the same accuracy.
- Client libraries expose `quality=fast` vs `quality=best` flags that map to selection tiers.
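The quality-flag pattern from the last bullet can be sketched as a static mapping from client-facing flags to model tiers. The tier contents and fallback behavior here are assumptions for illustration; the `fast`/`best` flag names come from the text.

```python
# Hypothetical mapping of client-facing quality flags to selection tiers.
# Tier membership is invented; only the flag names come from the text.
QUALITY_TIERS = {
    "fast": ["small-7b"],
    "best": ["large-34b", "medium-13b"],
}

def models_for(quality: str) -> list[str]:
    # Fall back to the fast tier for unknown flags rather than erroring,
    # so older clients keep working as tiers evolve.
    return QUALITY_TIERS.get(quality, QUALITY_TIERS["fast"])

print(models_for("best"))     # ['large-34b', 'medium-13b']
print(models_for("unknown"))  # ['small-7b']
```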
Implementation Tips
- Log oracle labels from the larger model to improve the quality predictor.
- Cache decisions per prompt hash to avoid recomputing the policy.
- Monitor cost savings against regret to ensure policy drift doesn't degrade the user experience.
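The prompt-hash caching tip above can be sketched as a small memoization layer in front of the policy. The policy function and cache store are hypothetical stand-ins; a production system would use a shared cache with expiry.

```python
# Hypothetical sketch: cache routing decisions keyed by a hash of the
# prompt so the policy isn't recomputed for repeated prompts.
import hashlib

_decision_cache: dict[str, str] = {}

def cached_route(prompt: str, policy) -> str:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _decision_cache:
        _decision_cache[key] = policy(prompt)
    return _decision_cache[key]

# Demo policy that tracks how often it actually runs.
calls = []
def demo_policy(prompt: str) -> str:
    calls.append(prompt)
    return "small-7b" if len(prompt) < 100 else "medium-13b"

cached_route("summarize this ticket", demo_policy)
cached_route("summarize this ticket", demo_policy)  # served from cache
print(len(calls))  # 1 — the policy ran only once
```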