Safety / Research term

Inner alignment vs outer alignment

Two separate ways alignment can fail: you specified the wrong goal (outer alignment), or you specified the right goal and the model learned something different anyway (inner alignment).

Outer alignment asks whether the training objective captures what you actually care about. If you train a hiring model to maximize 'candidates who accept offers,' you might be measuring persuasiveness instead of candidate quality. Inner alignment asks whether the model truly learned the objective you specified, or learned a shortcut that happened to produce the same results during training. Even with a perfect training goal, the model might latch onto a correlated pattern that diverges in production. These are two distinct failure points, and fixing one does not fix the other.

Builder example

This maps directly to product work. Outer alignment: 'Are we measuring the right thing?' Inner alignment: 'Is the model actually doing what our measurements suggest?' A model can ace your evaluation suite by relying on patterns that will not hold in the real world. Both questions need separate answers, which means testing beyond your training distribution.
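A toy sketch of the failure mode above: a "model" that learned a shortcut feature which happens to correlate with the true label in training-like data, then breaks under distribution shift. Everything here is illustrative; the hiring scenario, the `buzzword_count` feature, and the 95% correlation rate are invented for the demo, not drawn from any real system.

```python
import random

random.seed(0)

def make_data(n, spurious_corr):
    """Generate (buzzword_count, good_hire) pairs.

    `spurious_corr` is the fraction of examples where the shortcut
    feature (buzzwords) happens to agree with the true label.
    """
    data = []
    for _ in range(n):
        quality = random.random()
        label = quality > 0.5          # the thing we actually care about
        if random.random() < spurious_corr:
            buzzwords = 1 if label else 0   # shortcut agrees with label
        else:
            buzzwords = random.randint(0, 1)  # shortcut is uninformative
        data.append((buzzwords, label))
    return data

def shortcut_model(buzzwords):
    # A model that latched onto the correlated pattern, not the goal:
    # predict "good hire" iff buzzwords are present.
    return buzzwords == 1

def accuracy(data):
    return sum(shortcut_model(b) == y for b, y in data) / len(data)

# Evaluation data drawn like training data: the correlation still holds.
train_like_eval = make_data(10_000, spurious_corr=0.95)
# Production data: the correlation is gone.
production = make_data(10_000, spurious_corr=0.0)

print(f"eval accuracy:       {accuracy(train_like_eval):.2f}")  # high
print(f"production accuracy: {accuracy(production):.2f}")       # near chance
```

The model "aces" the in-distribution evaluation while learning nothing about candidate quality, which is why the eval score alone cannot distinguish the right goal from a lucky shortcut.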

Common confusion: A perfect evaluation score does not prove the model learned the right goal. It proves the model found something that works on your evaluation data.