Safety / Research term
Mesa-optimizer
A hypothesized kind of trained model that runs its own internal optimization process, with goals that may differ from what training intended.
A mesa-optimizer is a hypothesized kind of trained model: one that runs its own internal optimization process, with goals that may differ from what training intended. The training process (the 'outer' or 'base' optimizer) adjusts the model's parameters, typically through gradient descent, to improve performance on the training objective. The model becomes an 'inner' optimizer (the mesa-optimizer) if what it learns during that process is itself a goal-directed search with its own objective. Think of hiring someone to follow a recipe. You optimized for 'follows recipes well,' and the person learned to cook. Now they have their own opinions about what tastes good, and those opinions might diverge from the recipe. Whether the inner optimizer's goals match the outer optimizer's goals is the core safety question.
Builder example
This is primarily a research concept, but it carries a practical lesson: training incentives do not fully determine what a model learns to optimize for internally. A model that performs well during training might pursue a subtly different objective whose effects only show up under conditions the training data did not cover. For builders shipping fine-tuned or heavily optimized models, this is a reason to test behavior across diverse scenarios rather than trusting training metrics alone.
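A minimal sketch of that testing habit, in Python. Everything here is hypothetical and chosen for illustration: the one-dimensional `Scenario`, the `intended_action` rule, and a hard-coded `learned_policy` that stands in for a trained model. The stand-in is a fixed rule, not an optimizer, so this does not demonstrate a mesa-optimizer. It only shows how a proxy behavior can score perfectly on training-like cases and still fail once conditions shift.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical 1-D world: an agent and a goal on a number line."""
    name: str
    agent_pos: int
    goal_pos: int


def intended_action(s: Scenario) -> int:
    """What training was meant to teach: step toward the goal (+1 right, -1 left)."""
    return 1 if s.goal_pos > s.agent_pos else -1


def learned_policy(s: Scenario) -> int:
    """Assumed stand-in for a trained model that picked up a proxy: always step right."""
    return 1


def agreement_rate(policy, scenarios) -> float:
    """Fraction of scenarios where the policy matches the intended action."""
    matches = sum(policy(s) == intended_action(s) for s in scenarios)
    return matches / len(scenarios)


# Training-like cases: the goal is always to the right of the agent.
train_like = [Scenario(f"train-{i}", agent_pos=i, goal_pos=9) for i in range(9)]

# Shifted cases: the goal is now to the left of the agent.
shifted = [Scenario(f"shift-{i}", agent_pos=i, goal_pos=0) for i in range(1, 10)]

report = {
    "train_like": agreement_rate(learned_policy, train_like),
    "shifted": agreement_rate(learned_policy, shifted),
}
print(report)  # {'train_like': 1.0, 'shifted': 0.0}
```

The training-like score alone would suggest the model learned the intended behavior. Only the per-scenario-group breakdown reveals the gap, which is why the evaluation is reported by group rather than as a single aggregate.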
Common confusion: Whether current large language models contain mesa-optimizers in the formal sense is an open empirical question. The concept is a theoretical framework for reasoning about risk, not a confirmed property of today's production models.