Safety / Research term
Goal misgeneralization
When an AI system learns a shortcut during training that happens to work, then keeps following that shortcut in new situations where it no longer matches the real goal.
During training, the correct answer often correlates with a simpler pattern, and the model latches onto that pattern because it is easier to learn. Classic example: a navigation agent trained to 'collect the coin' in mazes where the coin always sits on the right side. What the agent actually learned was 'go right,' so it keeps going right even when the coin moves to the left. It looks competent in the training environment and is confidently wrong everywhere the shortcut and the real goal diverge. The failure looks purposeful because the agent executes its learned goal with full conviction.
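A toy sketch can make the dynamic concrete. This is an illustration in a made-up one-dimensional corridor, not the actual coin experiment: on the training distribution (coin always on the right) a 'go right' policy is indistinguishable from a genuinely coin-seeking policy, and it collapses as soon as the coin can appear on the left.

```python
import random

def make_env(coin_side):
    """A 1-D corridor: the agent starts in the middle, the coin sits at one end."""
    return {"agent": 0, "coin": 5 if coin_side == "right" else -5}

def run_episode(env, policy, max_steps=10):
    """Return 1 if the policy reaches the coin within max_steps, else 0."""
    pos = env["agent"]
    for _ in range(max_steps):
        pos += policy(pos, env)  # policy returns -1 (left) or +1 (right)
        if pos == env["coin"]:
            return 1
    return 0

def coin_seeking_policy(pos, env):
    # The intended goal: move toward wherever the coin actually is.
    return 1 if env["coin"] > pos else -1

def go_right_policy(pos, env):
    # The learned shortcut: always go right, regardless of the coin.
    return 1

def success_rate(policy, coin_sides, n=1000):
    envs = [make_env(random.choice(coin_sides)) for _ in range(n)]
    return sum(run_episode(e, policy) for e in envs) / n

# Training distribution: coin always on the right -> both policies score 1.0.
print("train, go-right:   ", success_rate(go_right_policy, ["right"]))
# Shifted distribution: the shortcut drops to ~0.5, the intended goal does not.
print("shifted, go-right: ", success_rate(go_right_policy, ["left", "right"]))
print("shifted, intended: ", success_rate(coin_seeking_policy, ["left", "right"]))
```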
Builder example
This explains why an AI agent can pass every test in staging and then behave strangely in production. If your training data has consistent shortcuts (like 'the answer is usually in the first paragraph' or 'the user always confirms'), the model may rely on those patterns and break when real-world inputs vary. Test on deliberately shifted scenarios where the easy shortcut leads to the wrong answer.
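One rough sketch of what that test can look like in practice. Here `answer_question`, the `my_agent` module, and the example cases are hypothetical stand-ins for your own agent and data; the point is to measure the pass rate separately on cases where the shortcut works and on cases where it is deliberately broken.

```python
# Hypothetical agent under test: assume answer_question(doc, question) -> str.
from my_agent import answer_question  # hypothetical import, replace with your own

# In-distribution cases: the answer happens to sit in the first paragraph,
# so a 'read the first paragraph' shortcut passes all of them.
IN_DISTRIBUTION = [
    {"doc": "Refund window: 30 days.\n\nShipping takes 5 days.",
     "q": "What is the refund window?", "expected": "30 days"},
]

# Deliberately shifted cases: same question, but the shortcut now gives the wrong answer.
SHIFTED = [
    {"doc": "Shipping takes 5 days.\n\nRefund window: 30 days.",
     "q": "What is the refund window?", "expected": "30 days"},
]

def pass_rate(cases):
    hits = sum(case["expected"] in answer_question(case["doc"], case["q"]) for case in cases)
    return hits / len(cases)

if __name__ == "__main__":
    print("in-distribution:", pass_rate(IN_DISTRIBUTION))
    print("shifted:        ", pass_rate(SHIFTED))
    # A large gap between the two numbers is the signature of a learned shortcut.
```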
Common confusion: The system is not failing randomly or breaking down. It is confidently executing the wrong objective, which makes the failure harder to diagnose because the behavior looks intentional.