Reinforcement Learning from Human Feedback

Benched.ai Editorial Team

Reinforcement Learning from Human Feedback (RLHF) is a training paradigm that aligns language models with human preferences by optimizing a policy against a learned reward model rather than a fixed objective such as next-token likelihood. The approach augments traditional pre-training with a loop of preference collection, reward modeling, and policy optimization, yielding models that are more helpful, harmless, and truthful.

  Definition and Scope

RLHF occupies the fine-tuning phase that follows large-scale unsupervised pre-training. It aims to correct behaviors (hallucination, toxicity, poor refusal handling) that are difficult to capture with static datasets alone. The standard loop consists of:

  1. Supervised Fine-Tuning (SFT) on curated instruction-response pairs.
  2. Reward Model (RM) Training using human preference rankings (pairwise objective sketched after this list).
  3. Policy Optimization with an RL algorithm such as PPO, or a direct preference method such as DPO, to maximize the RM score while staying close to the SFT policy.
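
A minimal sketch of the pairwise objective used in step 2, assuming a reward model that already maps a (prompt, response) pair to a scalar score; the function and tensor names are illustrative, not taken from a specific library:

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(score_chosen: torch.Tensor,
                     score_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style binary cross-entropy on the score margin:
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Dummy scores standing in for reward-model outputs on a batch of pairs.
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, -0.5])
loss = pairwise_rm_loss(chosen, rejected)  # shrinks as chosen outscores rejected
```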

  Canonical Pipeline

| Step | Data Source | Objective | Common Tools |
| --- | --- | --- | --- |
| Instruction dataset (SFT) | Anthropic Helpful-Harmless [1] | Cross-entropy | JAX/TPU, DeepSpeed ZeRO-3 |
| Preference pair collection | Crowd and LLM annotators | Pairwise ranking | Surge AI, Scale Rapid |
| Reward model fit | Preference pairs | Pairs → scalar, binary cross-entropy | PyTorch, PEFT LoRA |
| Policy optimization | SFT policy + RM | KL-regularized RL | trlX PPO, DPO [2] |
| Evaluation | Adversarial MT-Bench, red-team scores | Win rate, safety | OpenAI Evals, Anthropic Bias Bench |
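
The preference records gathered in the second row can live in a very simple schema; the field names below are assumptions for illustration, not a standard format:

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str     # instruction shown to annotators
    chosen: str     # response the annotator preferred
    rejected: str   # response the annotator ranked lower
    annotator: str  # crowd-worker ID or LLM-judge identifier

pair = PreferencePair(
    prompt="Explain KL regularization in one sentence.",
    chosen="It penalizes the policy for drifting from the SFT model.",
    rejected="KL is a kind of neural network.",
    annotator="crowd-042",
)
```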

  Reward Shaping Details

The shaped reward combines the RM score with a KL penalty that anchors the policy to the SFT model, typically $r(x, y) = r_{\mathrm{RM}}(x, y) - \beta \, \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{SFT}}(\cdot \mid x)\big)$. A larger KL coefficient $\beta$ keeps the aligned policy closer to the supervised baseline, while a smaller $\beta$ lets the optimizer pursue higher reward at the cost of potential divergence.
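
A per-sequence version of this shaped reward can be computed directly from log-probabilities; the sketch below assumes per-token log-probs from the current policy and the frozen SFT model are already available, and all names are illustrative:

```python
import torch

def shaped_reward(rm_score: torch.Tensor,     # scalar RM score for the sequence
                  logp_policy: torch.Tensor,  # [T] log pi_theta(y_t | x, y_<t)
                  logp_sft: torch.Tensor,     # [T] log pi_SFT(y_t | x, y_<t)
                  beta: float = 0.1) -> torch.Tensor:
    # r = r_RM - beta * sum_t (log pi_theta - log pi_SFT)
    kl_penalty = (logp_policy - logp_sft).sum()
    return rm_score - beta * kl_penalty
```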

  Performance Metrics

| Metric | Typical Value (GPT-3.5-level) | Notes |
| --- | --- | --- |
| MT-Bench win rate | 65-75% vs SFT-only [3] | Higher indicates better task compliance |
| Toxicity (Perspective API) | <2% of responses with toxicity score >0.5 | Lower is safer |
| Helpful-Harmless (HH) score | 0.75-0.85 | Anthropic eval |
| KL divergence | 0.5-1.0 nats | Too high ⇒ policy drifts and overfits the RM |
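
The KL figure in the last row can be tracked with a simple Monte Carlo estimate over sampled responses; this sketch assumes per-token log-probs from both models are logged during rollouts:

```python
import torch

def estimate_kl_nats(logp_policy: torch.Tensor, logp_sft: torch.Tensor) -> float:
    # Mean of (log pi_theta - log pi_SFT) over tokens sampled from pi_theta
    # approximates the per-token KL, reported in nats.
    return (logp_policy - logp_sft).mean().item()
```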

  Design Trade-offs

  • Increasing preference data size offers diminishing returns beyond ~100 k pairs.
  • PPO can be unstable to tune; DPO replaces the RL loop with a simpler classification-style objective that is easier to optimize (see the sketch after this list).
  • Over-optimizing RM without fresh human oversight risks reward hacking.
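
A hedged sketch of the DPO objective referenced above (Rafailov et al. [2]): a logistic loss on the gap between policy and reference-model log-probabilities, with no explicit reward model or RL loop; all tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))), batch mean.
    margins = beta * ((logp_chosen - ref_logp_chosen)
                      - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margins).mean()
```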

  Current Trends (2025)

  • Wider adoption of Direct Preference Optimization (DPO) with 30% compute savings vs PPO [4].
  • Active Preference Sampling: uncertainty-based selection cuts annotation budgets by 3×.
  • Multi-turn Alignment: hierarchical RMs model long conversations instead of single responses.
  • Constitutional RLHF: uses automatically generated critiques to bootstrap feedback, reducing human labor [5].

  Implementation Tips

  1. Mix preference data across domains to avoid mode collapse.
  2. Periodically refresh the KL baseline to the latest aligned checkpoint.
  3. Evaluate with unseen adversarial prompts every epoch to detect overfitting.
  4. Store RM logits, not just labels, to enable off-policy correction later.
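
To make tip 4 concrete, one option is to append the raw RM scores next to each label as pairs are scored; the JSONL layout and field names here are assumptions for illustration:

```python
import json

def log_scored_pair(path: str, pair_id: str, label: int,
                    rm_logit_chosen: float, rm_logit_rejected: float) -> None:
    # Append one JSON line per scored pair so raw logits stay available
    # for later off-policy correction (e.g., importance weighting).
    record = {
        "pair_id": pair_id,
        "label": label,                        # 1 if "chosen" was preferred
        "rm_logit_chosen": rm_logit_chosen,    # raw RM score, not a probability
        "rm_logit_rejected": rm_logit_rejected,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```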

  References

  1. Bai et al., Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, Anthropic, 2022.

  2. Rafailov et al., Direct Preference Optimization: Your Language Model is Secretly a Reward Model, 2023.

  3. Zheng et al., Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, 2023.

  4. Internal OpenAI alignment experiments, 2024.

  5. Bai et al., Constitutional AI: Harmlessness from AI Feedback, 2022.