
Training / Standard term

Direct Preference Optimization (DPO)

A simpler alternative to reinforcement learning from human feedback (RLHF) that trains a model directly on pairs of preferred and rejected answers, skipping the need to build a separate reward model.

Direct Preference Optimization (DPO) fine-tunes a model on pairs of responses to the same prompt, where one response is labeled preferred and the other rejected, and the model learns directly from those comparisons. Given a user question, the model sees that "Response A" was rated better than "Response B" and adjusts its weights so that A-like outputs become more likely and B-like outputs less likely. Traditional RLHF requires multiple stages: collecting preferences, training a reward model, then running reinforcement learning against that reward model. DPO still needs the preference data, but it replaces the reward-model and reinforcement-learning stages with a single training step, often with comparable quality and significantly less engineering overhead.
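To make the "learns directly from comparisons" step concrete, here is a minimal sketch of the per-pair loss from the standard DPO formulation. Two things here are not mentioned in the definition above and are assumptions of this sketch: a frozen reference model (usually the model before preference tuning) and a strength parameter `beta`. The function name `dpo_loss` and its arguments are illustrative, and the log-probabilities are assumed to be computed elsewhere as the summed token log-probabilities of each response.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the log-probability of a whole response given the
    prompt, under either the model being trained (policy) or the frozen
    reference model. beta=0.1 is an illustrative default, not a recommendation.
    """
    # How much more the policy likes each response than the reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp

    # Positive when the policy has moved toward the preferred response
    # and away from the rejected one, relative to the reference.
    z = beta * (chosen_shift - rejected_shift)

    # -log(sigmoid(z)) == log(1 + exp(-z)), written to avoid overflow.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and the reference assign identical log-probabilities, the loss is log 2 (about 0.693). It falls below that as the policy favors the preferred response more than the reference does, and rises above it when the policy drifts toward the rejected one. In practice this is averaged over a batch and minimized with a standard optimizer, which is why no separate reward model or reinforcement learning loop is needed.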

Builder example

DPO appears frequently in open-model fine-tuning pipelines and model release notes. If you are tuning an open-weights model for your domain, DPO is likely among the preference methods your tooling supports. The quality of your preference pairs, meaning which responses you label as good or bad and how consistently you label them, usually matters more to the outcome than the choice of algorithm.

You have examples of answers your users prefer and answers they reject.

Direct Preference Optimization (DPO) can tune the model toward that preference pattern. It does not verify whether a preferred answer is correct, so factual checks remain a separate step.
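As a rough illustration of this scenario, the sketch below turns raw user feedback into preference pairs. The `prompt` / `chosen` / `rejected` field names follow a convention common in open fine-tuning tooling, but treat them as an assumption and check what your own tooling expects. `PreferencePair`, `build_pairs`, and the feedback tuple layout are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # the response users preferred
    rejected: str  # the response users rejected


def build_pairs(feedback):
    """Convert feedback records into preference pairs.

    `feedback` is an iterable of (prompt, response_a, response_b, preferred)
    tuples, where `preferred` is "a", "b", or None for a tie or missing label.
    """
    pairs = []
    for prompt, resp_a, resp_b, preferred in feedback:
        # Ties and unlabeled comparisons carry no preference signal.
        if preferred not in ("a", "b"):
            continue
        # Identical responses cannot be ranked against each other.
        if resp_a.strip() == resp_b.strip():
            continue
        chosen, rejected = (resp_a, resp_b) if preferred == "a" else (resp_b, resp_a)
        pairs.append(PreferencePair(prompt, chosen, rejected))
    return pairs


feedback_log = [
    ("How do I reset my password?", "Go to Settings > Security.", "Contact support.", "a"),
    ("What is your refund policy?", "30 days.", "30 days.", "b"),       # identical, dropped
    ("Do you ship abroad?", "Yes, to most countries.", "No idea.", None),  # tie, dropped
]
pairs = build_pairs(feedback_log)
```

The filtering is where labeling decisions show up in code: dropping ties and duplicates is one reasonable policy, not the only one, and nothing here checks whether the chosen answer is factually right.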

Common confusion: The simpler training process does not eliminate value judgments. Someone still has to decide which response in each pair is the preferred one, and those choices shape model behavior just as much as the labeling choices in RLHF do.