Reinforcement Learning from Human Feedback

Benched.ai Editorial Team

Reinforcement Learning from Human Feedback (RLHF) is a training paradigm that aligns language models with human preferences by optimizing a policy against a learned reward model rather than a fixed objective such as next-token likelihood. The approach augments traditional pre-training with a loop of preference collection, reward modeling, and policy optimization, yielding models that are more helpful, harmless, and truthful.

  Definition and Scope

RLHF occupies the fine-tuning phase that follows large-scale unsupervised pre-training. It aims to correct behaviors (hallucination, toxicity, poor refusal handling) that are difficult to capture with static datasets alone. The standard loop consists of:

  1. Supervised Fine-Tuning (SFT) on curated instruction-response pairs.
  2. Reward Model (RM) Training using human preference rankings (pairwise objective sketched after this list).
  3. Policy Optimization with an RL algorithm such as PPO, or a direct preference method such as DPO, to maximize the RM score while staying close to the SFT policy.
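
A minimal sketch of the pairwise objective used in step 2, assuming a reward model that already maps a (prompt, response) pair to a scalar score; the function and tensor names are illustrative, not taken from a specific library:

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(score_chosen: torch.Tensor,
                     score_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style binary cross-entropy on the score margin:
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Dummy scores standing in for reward-model outputs on a batch of pairs.
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, -0.5])
loss = pairwise_rm_loss(chosen, rejected)  # shrinks as chosen outscores rejected
```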

  Canonical Pipeline

| Step | Data Source | Objective | Common Tools |
| --- | --- | --- | --- |
| Instruction dataset (SFT) | Anthropic Helpful-Harmless [1] | Cross-entropy | JAX/TPU, DeepSpeed ZeRO-3 |
| Preference pair collection | Crowd and LLM annotators | Pairwise ranking | Surge AI, Scale Rapid |
| Reward model fit | Preference pairs | Pairs → scalar, binary cross-entropy | PyTorch, PEFT LoRA |
| Policy optimization | SFT policy + RM | KL-regularized RL | trlX PPO, DPO [2] |
| Evaluation | Adversarial MT-Bench, red-team scores | Win rate, safety | OpenAI Evals, Anthropic Bias Bench |
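
The preference records gathered in the second row can live in a very simple schema; the field names below are assumptions for illustration, not a standard format:

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str     # instruction shown to annotators
    chosen: str     # response the annotator preferred
    rejected: str   # response the annotator ranked lower
    annotator: str  # crowd-worker ID or LLM-judge identifier

pair = PreferencePair(
    prompt="Explain KL regularization in one sentence.",
    chosen="It penalizes the policy for drifting from the SFT model.",
    rejected="KL is a kind of neural network.",
    annotator="crowd-042",
)
```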

  Reward Shaping Details

The shaped reward combines the RM score with a KL penalty that anchors the policy to the SFT model, typically $r(x, y) = r_{\mathrm{RM}}(x, y) - \beta \, \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{SFT}}(\cdot \mid x)\big)$. A larger KL coefficient $\beta$ keeps the aligned policy closer to the supervised baseline, while a smaller $\beta$ lets the optimizer pursue higher reward at the cost of potential divergence.
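
A per-sequence version of this shaped reward can be computed directly from log-probabilities; the sketch below assumes per-token log-probs from the current policy and the frozen SFT model are already available, and all names are illustrative:

```python
import torch

def shaped_reward(rm_score: torch.Tensor,     # scalar RM score for the sequence
                  logp_policy: torch.Tensor,  # [T] log pi_theta(y_t | x, y_<t)
                  logp_sft: torch.Tensor,     # [T] log pi_SFT(y_t | x, y_<t)
                  beta: float = 0.1) -> torch.Tensor:
    # r = r_RM - beta * sum_t (log pi_theta - log pi_SFT)
    kl_penalty = (logp_policy - logp_sft).sum()
    return rm_score - beta * kl_penalty
```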

  Performance Metrics

| Metric | Typical Value (GPT-3.5-level) | Notes |
| --- | --- | --- |
| MT-Bench win rate | 65-75% vs SFT-only [3] | Higher indicates better task compliance |
| Toxicity (Perspective API) | <2% of responses with toxicity score >0.5 | Lower is safer |
| Helpful-Harmless (HH) score | 0.75-0.85 | Anthropic eval |
| KL divergence | 0.5-1.0 nats | Too high ⇒ policy drifts and overfits the RM |
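
The KL figure in the last row can be tracked with a simple Monte Carlo estimate over sampled responses; this sketch assumes per-token log-probs from both models are logged during rollouts:

```python
import torch

def estimate_kl_nats(logp_policy: torch.Tensor, logp_sft: torch.Tensor) -> float:
    # Mean of (log pi_theta - log pi_SFT) over tokens sampled from pi_theta
    # approximates the per-token KL, reported in nats.
    return (logp_policy - logp_sft).mean().item()
```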

  Design Trade-offs

  • Increasing preference data size offers diminishing returns beyond ~100 k pairs.
  • PPO can be unstable to tune; DPO replaces the RL loop with a simpler classification-style objective that is easier to optimize (see the sketch after this list).
  • Over-optimizing RM without fresh human oversight risks reward hacking.
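
A hedged sketch of the DPO objective referenced above (Rafailov et al. [2]): a logistic loss on the gap between policy and reference-model log-probabilities, with no explicit reward model or RL loop; all tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))), batch mean.
    margins = beta * ((logp_chosen - ref_logp_chosen)
                      - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margins).mean()
```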

  Current Trends (2025)

  • Wider adoption of Direct Preference Optimization (DPO) with 30% compute savings vs PPO [4].
  • Active Preference Sampling: uncertainty-based selection cuts annotation budgets by 3×.
  • Multi-turn Alignment: hierarchical RMs model long conversations instead of single responses.
  • Constitutional RLHF: uses automatically generated critiques to bootstrap feedback, reducing human labor [5].

  Implementation Tips

  1. Mix preference data across domains to avoid mode collapse.
  2. Periodically refresh the KL baseline to the latest aligned checkpoint.
  3. Evaluate with unseen adversarial prompts every epoch to detect overfitting.
  4. Store RM logits, not just labels, to enable off-policy correction later.
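
To make tip 4 concrete, one option is to append the raw RM scores next to each label as pairs are scored; the JSONL layout and field names here are assumptions for illustration:

```python
import json

def log_scored_pair(path: str, pair_id: str, label: int,
                    rm_logit_chosen: float, rm_logit_rejected: float) -> None:
    # Append one JSON line per scored pair so raw logits stay available
    # for later off-policy correction (e.g., importance weighting).
    record = {
        "pair_id": pair_id,
        "label": label,                        # 1 if "chosen" was preferred
        "rm_logit_chosen": rm_logit_chosen,    # raw RM score, not a probability
        "rm_logit_rejected": rm_logit_rejected,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```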

  References

  1. Bai et al., Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, Anthropic, 2022.

  2. Rafailov et al., Direct Preference Optimization: Your Language Model is Secretly a Reward Model, 2023.

  3. Zheng et al., Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, 2023.

  4. Internal OpenAI alignment experiments, 2024.

  5. Bai et al., Constitutional AI: Harmlessness from AI Feedback, 2022.