Reinforcement Learning from Human Feedback (RLHF) is a training paradigm that aligns language models to human preferences by optimizing policies against a learned reward model instead of hard-coded objectives like next-token likelihood. The approach augments traditional pre-training with a loop of preference collection, reward modeling, and policy optimization, yielding models that are more helpful, harmless, and truthful.
Definition and Scope
RLHF occupies the fine-tuning phase that follows large-scale unsupervised pre-training. It aims to correct behaviors such as hallucination, toxicity, and poor refusal handling that are hard to address with static supervised datasets alone. The standard loop consists of:
- Supervised Fine-Tuning (SFT) on curated instruction-response pairs.
- Reward Model (RM) Training using human preference rankings.
- Policy Optimization with RL algorithms (PPO, DPO) to maximize the RM score while staying close to the SFT policy.
Canonical Pipeline
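A structural outline of the three-stage loop above, as Python-style pseudocode with placeholder bodies. The function names are illustrative rather than a specific library's API; production pipelines typically build on frameworks such as Hugging Face TRL.

```python
# Illustrative three-stage RLHF pipeline skeleton (function names are hypothetical).

def train_sft(base_model, instruction_pairs):
    """Stage 1: supervised fine-tuning on curated instruction-response pairs."""
    ...  # training loop omitted in this sketch

def train_reward_model(sft_model, preference_rankings):
    """Stage 2: fit a reward model on human preference rankings."""
    ...  # training loop omitted in this sketch

def optimize_policy(sft_model, reward_model, prompts, beta=0.1):
    """Stage 3: maximize the RM score (PPO, DPO, ...) while a KL penalty
    with coefficient beta keeps the policy near the SFT baseline."""
    ...  # optimization loop omitted in this sketch

def rlhf_pipeline(base_model, instruction_pairs, preference_rankings, prompts):
    sft_model = train_sft(base_model, instruction_pairs)
    reward_model = train_reward_model(sft_model, preference_rankings)
    return optimize_policy(sft_model, reward_model, prompts)
```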
Reward Shaping Details
The shaped reward combines the RM score with a KL penalty that anchors the policy to the SFT model; in the usual formulation the policy is trained to maximize $r_\phi(x, y) - \beta\,\mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{SFT}}(\cdot \mid x)\big)$. A larger KL coefficient $\beta$ keeps the aligned policy closer to the supervised baseline, while a smaller $\beta$ lets the optimizer pursue higher reward at the cost of potential divergence.
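A minimal PyTorch sketch of this shaping, assuming sequence-level RM scores and a per-token log-ratio estimate of the KL term; the tensor shapes and function name are assumptions for illustration, not taken from a specific implementation.

```python
import torch

def shaped_rewards(rm_scores: torch.Tensor,
                   policy_logprobs: torch.Tensor,
                   sft_logprobs: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    """Combine the reward-model score with a KL penalty toward the SFT model.

    rm_scores:       (batch,)      sequence-level reward-model scores
    policy_logprobs: (batch, seq)  log-probs of the sampled tokens under the current policy
    sft_logprobs:    (batch, seq)  log-probs of the same tokens under the frozen SFT model
    beta:            KL coefficient; larger values anchor the policy to the SFT baseline
    """
    # Per-token KL estimate: log pi_theta(y_t | x, y_<t) - log pi_SFT(y_t | x, y_<t)
    kl_per_token = policy_logprobs - sft_logprobs
    # Shaped sequence-level reward: RM score minus the summed KL penalty
    return rm_scores - beta * kl_per_token.sum(dim=-1)

# Toy example: two completions of eight tokens each, with placeholder log-probabilities
rm = torch.tensor([1.2, 0.4])
policy_lp = -torch.rand(2, 8)
sft_lp = -torch.rand(2, 8)
print(shaped_rewards(rm, policy_lp, sft_lp, beta=0.05))
```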
Performance Metrics
Design Trade-offs
- Increasing preference-data size offers diminishing returns beyond roughly 100k pairs.
- PPO can be unstable and hyperparameter-sensitive; DPO replaces the RL loop with a simpler, non-RL objective that is easier to tune (a minimal loss sketch follows this list).
- Over-optimizing against the RM without fresh human oversight risks reward hacking.
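For reference, a minimal sketch of the DPO objective from Rafailov et al. [2]; the variable names and toy log-probabilities below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss.

    Each argument is the summed log-probability of a full response
    (chosen or rejected) under the policy or the frozen reference model.
    """
    # Implicit reward of each response: beta * log(pi_theta / pi_ref)
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss pushes the chosen response's implicit reward above the rejected one's
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch of two preference pairs (placeholder log-probabilities)
loss = dpo_loss(torch.tensor([-12.0, -9.5]),
                torch.tensor([-14.0, -9.0]),
                torch.tensor([-12.5, -9.8]),
                torch.tensor([-13.5, -9.2]))
print(loss)
```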
Current Trends (2025)
- Wider adoption of Direct Preference Optimization (DPO), with 30% compute savings versus PPO [4].
- Active Preference Sampling: uncertainty-based selection cuts the annotation budget by 3× (a selection sketch follows this list).
- Multi-turn Alignment: hierarchical RMs model long conversations instead of single responses.
- Constitutional RLHF: uses automatically generated critiques to bootstrap feedback, reducing human labor [5].
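One way to realize uncertainty-based selection, sketched under the assumption that disagreement across a small reward-model ensemble serves as the uncertainty signal; nothing here is taken from a specific paper's implementation.

```python
import torch

def select_pairs_for_annotation(candidate_scores: torch.Tensor, budget: int) -> torch.Tensor:
    """Pick the candidate response pairs whose reward estimates disagree most.

    candidate_scores: (n_candidates, ensemble_size) score of each candidate pair
                      under each member of a small RM ensemble
    budget:           number of pairs to send to human annotators
    Returns the indices of the selected candidates.
    """
    # Higher variance across ensemble members = more uncertainty = more informative label
    uncertainty = candidate_scores.var(dim=-1)
    return torch.topk(uncertainty, k=budget).indices

# Toy example: 6 candidate pairs scored by a 4-model ensemble, annotate the top 2
scores = torch.randn(6, 4)
print(select_pairs_for_annotation(scores, budget=2))
```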
Implementation Tips
- Mix preference data across domains to avoid mode collapse.
- Periodically refresh the KL baseline to the latest aligned checkpoint.
- Evaluate with unseen adversarial prompts every epoch to detect overfitting.
- Store RM logits, not just labels, to enable off-policy correction later.
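One lightweight way to keep RM logits alongside the labels; the record schema and file name below are purely illustrative.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class PreferenceRecord:
    """A single logged comparison, keeping raw RM scores for later reuse."""
    prompt: str
    chosen: str
    rejected: str
    label: int                # 1 if `chosen` was preferred by the annotator
    rm_logit_chosen: float    # raw reward-model score for the chosen response
    rm_logit_rejected: float  # raw reward-model score for the rejected response

record = PreferenceRecord(
    prompt="Explain KL regularization in RLHF.",
    chosen="The KL term keeps the policy close to the SFT model...",
    rejected="KL is a kind of neural network layer...",
    label=1,
    rm_logit_chosen=1.37,
    rm_logit_rejected=-0.52,
)

# Append to a JSONL log so off-policy corrections can reweight examples later
with open("preference_log.jsonl", "a") as f:
    f.write(json.dumps(asdict(record)) + "\n")
```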
References
1. Bai et al., "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback," Anthropic, 2022.
2. Rafailov et al., "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model," 2023.
3. Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," 2023.
4. Internal OpenAI alignment experiments, 2024.
5. Bai et al., "Constitutional AI: Harmlessness from AI Feedback," 2022.