Training / Standard term
Constitutional AI
A training method where a model improves its own outputs by following a written set of principles, reducing the need for human raters to judge every response.
Anthropic developed this approach by giving a model a list of explicit behavioral rules called a "constitution." The model generates a response, reviews its own answer against those rules, rewrites it to comply, and uses the improved version as training data. A principle might say "Choose the response that is least likely to encourage illegal activity," and the model would learn to self-correct toward that standard. This replaces much of the expensive human rating step used in traditional RLHF.
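The generate–critique–revise loop above can be sketched in a few lines. This is an illustrative toy, not Anthropic's implementation: the three `model_*` functions stand in for real language-model calls and are faked here with simple keyword checks so the sketch runs on its own.

```python
# Hypothetical sketch of the Constitutional AI self-correction loop.
# All model_* functions are stand-ins for real language-model calls.

CONSTITUTION = [
    "Choose the response that is least likely to encourage illegal activity.",
    "Prefer concise answers over verbose ones.",
]

def model_generate(prompt: str) -> str:
    # Stand-in for a model's initial draft response.
    return "Sure! Here is a very detailed guide to picking locks..."

def model_critique(response: str, principle: str) -> bool:
    # Stand-in: a real system would ask the model itself whether the
    # response violates the principle. Faked with keyword checks here.
    if "illegal" in principle and "picking locks" in response:
        return True
    if "concise" in principle and len(response.split()) > 20:
        return True
    return False

def model_revise(response: str, principle: str) -> str:
    # Stand-in: a real system would ask the model to rewrite the
    # response so that it complies with the principle.
    return "I can't help with that, but I can suggest legal alternatives."

def self_correct(prompt: str) -> tuple[str, str]:
    """Generate a draft, check it against each principle, revise on
    violation. The (revised, draft) pair becomes training data."""
    draft = model_generate(prompt)
    revised = draft
    for principle in CONSTITUTION:
        if model_critique(revised, principle):
            revised = model_revise(revised, principle)
    return revised, draft

final, draft = self_correct("How do I pick a lock?")
```

The key design point the sketch illustrates: the human rater is replaced by a second pass of the model itself, judging against written principles, and the improved output (not the draft) is what feeds back into training.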
Builder example
The broader lesson applies to anyone building AI products: writing explicit behavioral principles makes your system's values inspectable and debuggable. When you can point to a document that says "prefer concise answers over verbose ones," you can trace why the model behaves a certain way and adjust the principle directly.
Common confusion: A constitution does not guarantee ethical behavior. The principles are only as good as the people who wrote them, principles can conflict with one another, and the training process can still leave unexpected gaps in behavior.