Hyperparameter tuning searches for the configuration (learning rate, batch size, dropout, etc.) that maximizes model quality under resource constraints.
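Stated formally (the symbols here are illustrative, not from the original text), the tuner seeks

$$
\theta^{*} \;=\; \arg\max_{\theta \in \Theta} \; \mathrm{ValMetric}\!\left(f_{\theta}\right)
\quad \text{subject to} \quad \mathrm{Cost}(\theta) \le B,
$$

where $\Theta$ is the search space, $f_{\theta}$ the model trained with configuration $\theta$, and $B$ the compute or time budget.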
Search Strategies
Common choices include grid search, random search, Bayesian optimization (Gaussian-process or TPE based), Hyperband-style successive halving, and population-based training (PBT); the simplest of these is sketched below.
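A minimal sketch of random search, assuming a hand-written space and a `train_and_evaluate` stub (both illustrative, not from the original text):

```python
import random

# Illustrative search space; the ranges are placeholders, not prescriptions.
SPACE = {
    "learning_rate": lambda: 10 ** random.uniform(-5, -3),   # log-uniform 1e-5 .. 1e-3
    "batch_size":    lambda: random.choice([16, 32, 64]),
    "dropout":       lambda: random.uniform(0.0, 0.3),
}

def train_and_evaluate(config):
    """Stub standing in for a real fine-tuning run; returns a validation score."""
    return random.random()

def random_search(n_trials=20):
    """Sample configurations independently and keep the best one."""
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):
        config = {name: sample() for name, sample in SPACE.items()}
        score = train_and_evaluate(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score

if __name__ == "__main__":
    config, score = random_search()
    print(f"best score {score:.3f} with {config}")
```

Random search parallelizes trivially because trials are independent, which is exactly the property the Bayesian methods discussed below trade away for sample efficiency.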
Typical Ranges for Transformer Fine-tuning
Commonly used starting points (adjust for model and dataset size; a code sketch follows this list):
- Learning rate: 1e-5 – 5e-5 for full fine-tuning, roughly 1e-4 – 3e-4 for LoRA/adapter tuning, with warmup plus linear or cosine decay.
- Batch size: 8 – 64 (use gradient accumulation when memory-bound).
- Dropout: 0.0 – 0.3, with 0.1 a common default.
- Warmup: 0 – 10% of total steps; weight decay: 0.0 – 0.1.
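A sketch of how such ranges can be encoded as a search space, here using Optuna's TPE (Bayesian) sampler; the `run_finetuning` stub and the trial count are assumptions, not part of the original text:

```python
import random
import optuna

def run_finetuning(**config):
    """Stub standing in for a real fine-tuning run; returns a validation metric."""
    return random.random()

def objective(trial):
    # Search space mirroring the ranges above (illustrative).
    config = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "batch_size":    trial.suggest_categorical("batch_size", [8, 16, 32, 64]),
        "dropout":       trial.suggest_float("dropout", 0.0, 0.3),
        "warmup_ratio":  trial.suggest_float("warmup_ratio", 0.0, 0.1),
    }
    return run_finetuning(**config)   # metric to maximize

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=30)
print(study.best_params, study.best_value)
```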
Design Trade-offs
- Bayesian methods (GP- or TPE-based) converge in fewer trials, but their sequential proposals limit how far the search can be parallelized.
- Hyperband wastes fewer FLOPs by early-stopping weak trials, but it may kill late-blooming configurations; a pruning sketch follows this list.
- Population-based training (PBT) adapts hyperparameters on the fly, yet it is complex to orchestrate.
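A minimal sketch of Hyperband-style early stopping via Optuna's HyperbandPruner; the toy objective simulates a training curve, so the scoring rule and the 10-epoch budget are placeholders:

```python
import optuna

def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)

    score = 0.0
    for epoch in range(10):
        # Placeholder for one epoch of training followed by validation.
        score += lr * 100  # toy "improvement"; replace with a real metric

        # Report the intermediate value so the pruner can compare trials
        # at matching budgets and stop the unpromising ones early.
        trial.report(score, step=epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()
    return score

study = optuna.create_study(
    direction="maximize",
    pruner=optuna.pruners.HyperbandPruner(min_resource=1,
                                          max_resource=10,
                                          reduction_factor=3),
)
study.optimize(objective, n_trials=30)
print(f"ran {len(study.trials)} trials, best value {study.best_value:.3f}")
```

Pruned trials stop after a few epochs, which is where the FLOP savings come from; configurations that only improve late in training are the ones at risk of being cut.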
Current Trends (2025)
- FP8 training shrinks the memory footprint, so larger batch sizes fit on the same hardware, which in turn shifts the optimal learning-rate schedule.
- AutoML systems (Ray Tune v3, Vertex AI Vizier) integrate cost-aware objectives (e.g., 3× lower $/BLEU).
- LLM-driven tuning agents generate search spaces from commit diffs.
Implementation Tips
- Log every trial's hyperparameters and metrics to a reproducible artifact store (a tracking sketch follows this list).
- Use a learning-rate finder to narrow the grid before launching a large search.
- Allocate a FLOP budget rather than a fixed trial count so that search strategies can be compared fairly.
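One way to implement the first tip, sketched with MLflow as an assumed tracking backend; the experiment name, the random-search loop, and the `train_and_evaluate` stub are illustrative:

```python
import random
import mlflow

def train_and_evaluate(config):
    """Stub standing in for a real fine-tuning run; returns a validation score."""
    return random.random()

mlflow.set_experiment("transformer-finetune-sweep")  # logs to ./mlruns by default

for trial_id in range(20):
    config = {
        "learning_rate": 10 ** random.uniform(-5.0, -4.3),  # ~1e-5 .. 5e-5
        "batch_size": random.choice([16, 32, 64]),
        "dropout": round(random.uniform(0.0, 0.3), 2),
    }
    with mlflow.start_run(run_name=f"trial-{trial_id:03d}"):
        mlflow.log_params(config)              # hyperparameters for this trial
        score = train_and_evaluate(config)
        mlflow.log_metric("val_score", score)  # final validation metric
        # Checkpoints, plots, or config files can be attached with
        # mlflow.log_artifact(path) so each trial stays reproducible.
```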