Alignment training is the set of procedures that steer a model toward behaviors consistent with human values, instructions, and safety constraints. It is typically applied after base pre-training.
Alignment Pipeline Phases
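The usual post-pretraining recipe runs supervised fine-tuning (SFT) on curated demonstrations, preference data collection, reward-model training, policy optimization (PPO-style RLHF or a direct method such as DPO), and a safety-evaluation pass. The sketch below merely encodes that ordering; the Phase names and PIPELINE list are illustrative assumptions, not terms from this document.

```python
# Hypothetical sketch of the typical alignment pipeline ordering.
# The Phase names and PIPELINE list are illustrative, not from the source.
from enum import Enum, auto

class Phase(Enum):
    SFT = auto()                    # supervised fine-tuning on curated demonstrations
    PREFERENCE_COLLECTION = auto()  # human or synthetic preference labels
    REWARD_MODELING = auto()        # fit a reward model on preference pairs
    POLICY_OPTIMIZATION = auto()    # PPO-style RLHF or a direct method such as DPO
    SAFETY_EVALUATION = auto()      # red-teaming and benchmark checks before release

PIPELINE = [
    Phase.SFT,
    Phase.PREFERENCE_COLLECTION,
    Phase.REWARD_MODELING,
    Phase.POLICY_OPTIMIZATION,
    Phase.SAFETY_EVALUATION,
]
```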
Comparison of Optimization Algorithms
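As a rough illustration of what such a comparison covers, the sketch below contrasts a PPO-style RLHF reward (reward-model score minus a KL penalty toward the frozen base model) with the DPO loss, which trains on preference pairs directly and needs no separate reward model or sampling loop. The function names, tensor shapes, and the kl_coef and beta defaults are assumptions.

```python
# Illustrative contrast of the two most common preference-optimization objectives.
# All *_logprobs arguments are per-token log-probabilities from the policy or
# the frozen reference (base) model.
import torch.nn.functional as F

def ppo_style_reward(rm_score, policy_logprobs, ref_logprobs, kl_coef=0.1):
    """PPO-style RLHF: reward-model score minus a KL penalty toward the frozen base model."""
    approx_kl = (policy_logprobs - ref_logprobs).sum(dim=-1)  # log pi - log pi_ref, summed over tokens
    return rm_score - kl_coef * approx_kl                     # scalar reward per sequence

def dpo_loss(policy_chosen_lp, policy_rejected_lp, ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO: optimize preference pairs directly; no reward model or RL sampling loop.
    Inputs are log-probabilities already summed over the response tokens."""
    chosen_margin = policy_chosen_lp - ref_chosen_lp          # log-prob ratio, preferred response
    rejected_margin = policy_rejected_lp - ref_rejected_lp    # log-prob ratio, dispreferred response
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```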
Design Trade-offs
- Reinforcement-learning methods boost helpfulness but risk reward hacking if the reward model is narrow.
- Human annotation is costly; synthetic preference generation scales but may propagate bias.
- KL regularization toward the base model preserves pretrained knowledge at the expense of creativity (the standard regularized objective is sketched after this list).
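For reference, the KL-regularized objective behind that last trade-off is commonly written as below, in standard notation not taken from this document: r_phi is the learned reward model, pi_ref the frozen base model, and beta the regularization strength.

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[ r_\phi(x, y) \big]
\;-\;
\beta \, \mathrm{KL}\!\big( \pi_\theta(\cdot \mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
```

A larger beta keeps the policy close to the base model, preserving its knowledge, but limits how far aligned behavior can move, which is the creativity cost noted above.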
Current Trends (2025)
- Automated red-teaming datasets are fed back into alignment loops for continuous hardening.
- Parameter-efficient finetuning (LoRA) can cut alignment compute by roughly 90% relative to full-weight PPO (a minimal adapter sketch follows this list).
- Open-sourced reward models (Anthropic HH-RM v2) enable community audits.
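A minimal sketch of why LoRA-style adapters save compute, assuming a PyTorch setup: the full-rank weight is frozen and only two small low-rank matrices receive gradients. The class name and the rank and alpha defaults are illustrative, not from the source.

```python
# Minimal LoRA-style adapter: the frozen base weight is augmented with a
# trainable low-rank update B @ A, so only rank * (d_in + d_out) parameters train.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)       # freeze the full-rank weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen full-rank path plus the trainable low-rank path.
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)
```

Only lora_a and lora_b are updated during training, which is where the memory and compute savings relative to full-weight PPO updates come from.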
Implementation Tips
- Freeze embedding layers during alignment to prevent vocabulary drift (see the training-loop sketch after this list).
- Log KL divergence against the reference model every minibatch; sustained spikes often indicate the policy is over-exploiting (overfitting to) the reward model.
- Evaluate on safety benchmarks (HEL-Safety, Toxicity v2) each training epoch.
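A hedged sketch combining the first two tips, assuming a Hugging Face-style model that exposes get_input_embeddings() and a generic alignment step that returns policy and reference log-probabilities; `policy`, `reference`, `batches`, and `alignment_step` are placeholders, not APIs from this document.

```python
# Sketch: freeze embeddings and log per-minibatch KL against the frozen reference.
import torch

def freeze_embeddings(model):
    """Tip 1: keep the token embedding table fixed to prevent vocabulary drift."""
    for param in model.get_input_embeddings().parameters():  # Hugging Face-style accessor
        param.requires_grad_(False)

@torch.no_grad()
def minibatch_kl(policy_logprobs, ref_logprobs, mask):
    """Tip 2: cheap per-token KL estimate (log pi - log pi_ref) averaged over real tokens."""
    return ((policy_logprobs - ref_logprobs) * mask).sum() / mask.sum()

# Inside the training loop (schematic; alignment_step is a placeholder):
# for step, batch in enumerate(batches):
#     loss, policy_logprobs, ref_logprobs, mask = alignment_step(policy, reference, batch)
#     kl = minibatch_kl(policy_logprobs, ref_logprobs, mask)
#     print(f"step={step} kl={kl:.4f}")  # a sustained spike suggests reward over-exploitation
```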