Alignment Training

Benched.ai Editorial Team

Alignment training is the set of procedures that steer a model toward behaviors consistent with human values, instructions, and safety constraints. It is typically applied after base pre-training.

  Alignment Pipeline Phases

Phase | Goal | Typical Dataset | Common Methods
Supervised Fine-Tuning (SFT) | Teach instruction following | Human-written input/output pairs | Cross-entropy loss
Preference Collection | Gather comparisons between outputs | A/B annotator choices | Pairwise ranking
Reward Model Fitting | Learn a scalar reward from preferences | Preference pairs | Binary cross-entropy
Policy Optimization | Maximize reward while staying close to the SFT policy | On-policy rollouts scored by the reward model | PPO, DPO, RLAIF (frameworks: TRL-X, DeepSpeed RL)
Safety Tuning | Reduce toxic or disallowed content | Adversarial prompts | Constitutional AI, supervised rejection
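
  The reward-model phase above fits a scalar reward to preference pairs with a binary cross-entropy (Bradley-Terry) objective. Below is a minimal PyTorch sketch of that loss, not a reference implementation; the function name and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def preference_loss(chosen_rewards: torch.Tensor,
                    rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: binary cross-entropy on the reward margin.

    chosen_rewards / rejected_rewards: shape (batch,), scalar rewards
    produced by the reward model for the preferred and dispreferred
    completion of each prompt.
    """
    # P(chosen > rejected) = sigmoid(r_chosen - r_rejected);
    # maximizing its log-likelihood is -logsigmoid of the margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with dummy rewards for a batch of 4 preference pairs.
chosen = torch.tensor([1.2, 0.3, 0.8, 2.0])
rejected = torch.tensor([0.4, 0.5, -0.1, 1.1])
print(preference_loss(chosen, rejected).item())
```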

  Comparison of Optimization Algorithms

Algorithm | Sample Efficiency | Training Stability | Implementation Complexity
PPO | Medium | Good with a KL penalty | Moderate
DPO | High | Requires careful tuning of the temperature (beta) | Low
Best-of-N sampling | Very high (no gradient updates) | Stable | Very low
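
  DPO's low implementation complexity comes from expressing the preference objective directly over policy log-probabilities, with the temperature-like beta controlling how far the policy may drift from the reference model. A minimal sketch, assuming sequence-level log-probabilities have already been summed per completion:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss for one batch of preference pairs.

    Each argument is a (batch,) tensor of summed log-probabilities of the
    chosen / rejected completion under the trainable policy or the frozen
    reference model. beta plays the role of the temperature noted above.
    """
    # Implicit reward of each completion: beta * (log pi - log pi_ref).
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratios - rejected_logratios)
    # Binary cross-entropy on the implicit reward margin, as in the reward-model loss.
    return -F.logsigmoid(logits).mean()
```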

  Design Trade-offs

  • Reinforcement methods boost helpfulness but risk reward hacking if the reward model is too narrow.
  • Human annotation is costly; synthetic preference generation scales but may propagate bias.
  • KL-regularization toward the base model preserves knowledge at the expense of creativity (see the sketch after this list).
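
  One common way to apply that KL-regularization is to subtract a per-token KL penalty from the reward before the policy update, keeping the policy close to the frozen base or SFT model. A hedged sketch; the coefficient name kl_coef and the tensor shapes are assumptions, not a prescribed interface.

```python
import torch

def kl_shaped_rewards(reward_model_scores: torch.Tensor,
                      policy_logprobs: torch.Tensor,
                      ref_logprobs: torch.Tensor,
                      kl_coef: float = 0.05) -> torch.Tensor:
    """Shape rewards with a per-token KL penalty toward the reference model.

    reward_model_scores: (batch,) scalar score for each sampled completion.
    policy_logprobs / ref_logprobs: (batch, seq) log-probs of the sampled
    tokens under the trainable policy and the frozen reference model.
    """
    # Per-token KL estimate on the sampled tokens: log pi - log pi_ref.
    per_token_kl = policy_logprobs - ref_logprobs          # (batch, seq)
    kl_penalty = kl_coef * per_token_kl.sum(dim=-1)        # (batch,)
    # Larger kl_coef preserves base-model behavior; smaller allows more drift.
    return reward_model_scores - kl_penalty
```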

  Current Trends (2025)

  • Automated red-teaming datasets fed into alignment loops for continuous hardening.
  • Parameter-efficient fine-tuning with low-rank adapters (LoRA) cuts alignment compute by roughly 90 % relative to full-weight PPO (sketched after this list).
  • Open-sourced reward models (Anthropic HH-RM v2) enable community audits.
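
  Low-rank adaptation keeps the pretrained weight frozen and learns only a small additive update, which is why it trims alignment compute so sharply. A minimal sketch of a LoRA-wrapped linear layer; the class name, rank, and scaling are illustrative and not taken from any particular library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)            # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Low-rank path starts at zero because lora_b is zero-initialized.
        return self.base(x) + self.scaling * (x @ self.lora_a.T) @ self.lora_b.T

# Usage: wrap a projection layer and train only the adapter parameters.
proj = nn.Linear(1024, 1024)
adapted = LoRALinear(proj, r=8, alpha=16)
trainable = [p for p in adapted.parameters() if p.requires_grad]
```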

  Implementation Tips

  1. Freeze embedding layers during alignment to prevent vocabulary drift.
  2. Log KL divergence every minibatch; spikes often indicate reward-model overfitting (see the sketch after this list).
  3. Evaluate on safety benchmarks (HEL-Safety, Toxicity v2) after each training epoch.
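
  Tips 1 and 2 are straightforward to wire into a training loop. A hedged sketch of both, assuming a Hugging Face-style causal LM exposing get_input_embeddings() and a frozen reference copy of the model:

```python
import torch

def freeze_embeddings(model) -> None:
    # Tip 1: keep token embeddings fixed so the vocabulary does not drift.
    for param in model.get_input_embeddings().parameters():
        param.requires_grad_(False)

@torch.no_grad()
def minibatch_kl(policy_logits: torch.Tensor,
                 ref_logits: torch.Tensor) -> float:
    # Tip 2: mean per-token KL(policy || reference) over the minibatch.
    policy_logprobs = torch.log_softmax(policy_logits, dim=-1)
    ref_logprobs = torch.log_softmax(ref_logits, dim=-1)
    kl = (policy_logprobs.exp() * (policy_logprobs - ref_logprobs)).sum(-1)
    return kl.mean().item()   # log this every step; watch for sudden spikes
```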