Training / Standard term

Reinforcement Learning from AI Feedback (RLAIF)

A training method that uses a separate AI model to judge and rank outputs, replacing or supplementing the human raters used in traditional reinforcement learning from human feedback (RLHF). This makes preference training faster and cheaper to scale.

In practice, the AI judge is often a larger model guided by explicit principles such as "prefer the more helpful, harmless, and honest response." It can evaluate thousands of output pairs at a fraction of the cost and time of human annotation, generating the preference data needed to train the model. Anthropic's Constitutional AI is a prominent example: the model first critiques and revises its own outputs against a written constitution, and an AI model then compares pairs of responses against the same principles. Those AI-generated preferences become the training signal for the reinforcement learning stage.
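The labeling step can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: `toy_judge` is a hand-written heuristic standing in for a real judge model, and `PreferencePair` and `label_pairs` are hypothetical names, not part of any library. The sketch only shows how judge scores on two candidate responses turn into (chosen, rejected) preference records.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the judge preferred
    rejected: str  # response the judge ranked lower


def toy_judge(prompt: str, response: str) -> float:
    """Hypothetical stand-in for an AI judge; higher score means better.

    A real judge would be a model prompted with written principles.
    This heuristic only exists to make the example runnable.
    """
    prompt_words = set(prompt.lower().split())
    # Crude "helpfulness": count response words that also appear in the prompt.
    score = float(sum(1 for w in response.lower().split() if w in prompt_words))
    # Crude "harmlessness": penalize an obviously careless suggestion.
    if "just guess" in response.lower():
        score -= 5.0
    return score


def label_pairs(
    samples: List[Tuple[str, str, str]],
    judge: Callable[[str, str], float],
) -> List[PreferencePair]:
    """Turn (prompt, response_a, response_b) triples into preference data."""
    pairs = []
    for prompt, a, b in samples:
        score_a, score_b = judge(prompt, a), judge(prompt, b)
        if score_a == score_b:
            continue  # a tie carries no preference signal, so skip it
        chosen, rejected = (a, b) if score_a > score_b else (b, a)
        pairs.append(PreferencePair(prompt, chosen, rejected))
    return pairs


samples = [
    (
        "how do I reset my password",
        "Open settings and choose reset my password.",
        "Just guess until it works.",
    ),
]
pairs = label_pairs(samples, toy_judge)
print(pairs[0].chosen)
```

Swapping `toy_judge` for a call to an actual judge model leaves the rest of the loop unchanged, which is why the judge's configuration matters so much: it is the only source of the labels.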

Builder example

When a model's safety documentation says it was trained with AI feedback, the quality and biases of the judge model directly influence the result. If the AI judge consistently rates cautious responses higher, the trained model will lean toward excessive caution. This is one reason models from different labs can feel different: their AI judges may have been configured with different priorities.
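A toy Python sketch can make the effect of judge bias concrete. The names `biased_judge` and `cautious_win_rate`, the refusal markers, and the hand-assigned helpfulness numbers are all hypothetical, chosen only to show how a single "caution bonus" in the judge changes which responses end up labeled as preferred.

```python
from typing import List, Tuple

# Hypothetical phrases used to flag a response as "cautious".
REFUSAL_MARKERS = ("i can't", "i cannot", "consult a professional")

# Each item pairs a direct answer with a cautious one; the floats are
# made-up helpfulness scores, not measurements.
Scored = Tuple[str, float]


def biased_judge(response: str, helpfulness: float, caution_weight: float) -> float:
    """Score = helpfulness plus a bonus whenever the response sounds cautious."""
    is_cautious = any(marker in response.lower() for marker in REFUSAL_MARKERS)
    return helpfulness + (caution_weight if is_cautious else 0.0)


def cautious_win_rate(pairs: List[Tuple[Scored, Scored]], caution_weight: float) -> float:
    """Fraction of pairs where the judge prefers the cautious response."""
    wins = 0
    for (direct, direct_help), (cautious, cautious_help) in pairs:
        cautious_score = biased_judge(cautious, cautious_help, caution_weight)
        direct_score = biased_judge(direct, direct_help, caution_weight)
        if cautious_score > direct_score:
            wins += 1
    return wins / len(pairs)


pairs = [
    (("Mix two parts flour with one part water.", 0.9),
     ("I can't help with recipes.", 0.1)),
    (("Take the dose printed on the label.", 0.6),
     ("Please consult a professional before taking it.", 0.5)),
    (("Restart the router, then retry.", 0.8),
     ("I cannot advise on network issues.", 0.2)),
]

neutral_rate = cautious_win_rate(pairs, caution_weight=0.0)
biased_rate = cautious_win_rate(pairs, caution_weight=1.0)
print(neutral_rate, biased_rate)
```

With no caution bonus the judge never prefers the cautious answer in this toy data; with a large bonus it always does. A model trained on the second set of labels would be pushed toward refusing, even though the underlying responses are identical.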

Common confusion: AI feedback is not inherently more objective than human feedback. The judge model carries its own biases and blind spots, which get baked into the training signal just as human biases would.