Reinforcement Learning from AI Feedback (RLAIF)

A training method that uses a separate AI model to judge and rank outputs, replacing or supplementing the human raters used in traditional RLHF. This makes preference training faster and cheaper to scale.

In standard RLHF, human raters compare pairs of model outputs and choose which is better. Reinforcement Learning from AI Feedback, or RLAIF, replaces some or all of those human judgments with an AI judge, often a larger model guided by explicit principles like "prefer the more helpful, harmless, and honest response." The AI judge evaluates thousands of output pairs at a fraction of the cost and time, generating the preference data needed to train the model. Anthropic's Constitutional AI is a prominent example: the model critiques its own outputs against a written constitution, and those self-critiques become the training signal.

Builder example

When a model's safety documentation says it was trained with AI feedback, the quality and biases of the judge model directly influence the result. If the AI judge consistently rates cautious responses higher, the trained model will lean toward excessive caution. This helps explain why models from different labs feel different: their AI judges were configured with different priorities.

Common confusion: AI feedback is not inherently more objective than human feedback. The judge model carries its own biases and blind spots, which get baked into the training signal just as human biases would.