Training / Standard term
Reinforcement Learning from Human Feedback (RLHF)
A training method that improves model behavior by learning from human judgments about which responses are better. Human raters compare model outputs, those comparisons train a reward model, and the model learns to produce responses that earn higher rewards.
Reinforcement Learning from Human Feedback, or RLHF, works in three stages. First, a model generates multiple responses to the same prompt, and human raters compare them: "Response A is more helpful than Response B." Second, thousands of these comparisons train a reward model, a separate system that learns to predict which responses humans would prefer. Third, the original model is trained with reinforcement learning to maximize its score from the reward model, gradually shifting toward outputs that align with human preferences. If raters consistently preferred concise, well-structured answers over verbose ones, for example, the model would learn to favor that style. This process turned raw language models into the helpful assistants people interact with today.
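A minimal sketch of the second stage, assuming a toy setup: a reward model is trained on pairwise human comparisons with a Bradley-Terry style loss that pushes the preferred response's score above the rejected one's. The RewardModel class, the fixed response embeddings, and the batch here are illustrative placeholders, not any particular library's API; a production reward model is a full language model with a scalar scoring head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative stand-in for a reward model: in practice this is a full
# language model with a scalar scoring head, not a single linear layer.
class RewardModel(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        self.scorer = nn.Linear(embed_dim, 1)

    def forward(self, response_embedding):
        # Scalar "how much would a rater prefer this response" score.
        return self.scorer(response_embedding).squeeze(-1)

def preference_loss(reward_model, preferred, rejected):
    # Bradley-Terry comparison loss: maximize the probability that the
    # response raters preferred receives the higher reward.
    margin = reward_model(preferred) - reward_model(rejected)
    return -F.logsigmoid(margin).mean()

# Toy batch of embeddings standing in for rated response pairs.
preferred = torch.randn(32, 128)  # responses raters marked as better
rejected = torch.randn(32, 128)   # responses raters marked as worse

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

optimizer.zero_grad()
loss = preference_loss(model, preferred, rejected)
loss.backward()
optimizer.step()
```

The third stage then uses a reward model like this as the training signal for a reinforcement learning update (commonly PPO) applied to the language model itself.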
Builder example
RLHF directly shapes the personality of the models you use: how helpful they are, how they refuse requests, how cautious or confident they sound, and whether they tend to agree with you even when you are wrong (a problem called sycophancy). When a model feels too eager to please or too quick to refuse, those tendencies often trace back to how the raters were instructed to judge responses.
Raters often prefer a fluent answer over a cautious one, even when the cautious answer is more accurate, so product teams still need factual evals, calibration checks, and domain review.
Common confusion: RLHF optimizes for what humans rate as good, which can diverge from what is true or useful. Raters sometimes prefer confident-sounding answers over accurate ones, or agreeable responses over honest pushback. The model learns those preferences too.