Mesa-optimizer

A theoretical scenario where a trained model develops its own internal optimization process with goals that may differ from what training intended.

When you train a model, you are tuning its parameters so it gives better answers, typically with gradient descent, which repeatedly adjusts the model to reduce its error. A mesa-optimizer is a trained model that ends up running an optimization process of its own, in pursuit of an internal goal that may differ from what training was meant to teach. Think of hiring someone to follow a recipe. You selected for 'follows recipes well,' but what the person actually learned was how to cook. Now they have their own opinions about what tastes good, and those opinions might diverge from the recipe. The training process is the 'outer' optimizer, and the model is the 'inner' optimizer (the mesa-optimizer). Whether the inner optimizer's goal matches the outer optimizer's goal is the core safety question.
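To make the 'outer' optimizer concrete, here is a minimal sketch of gradient descent fitting a one-parameter model. The data, the learning rate, and the function name train are assumptions chosen for illustration. Nothing in it creates an inner optimizer; it only shows the outer loop that adjusts a model to reduce error.

```python
def train(xs, ys, lr=0.01, steps=500):
    """Fit y = w * x by gradient descent on mean squared error."""
    w = 0.0  # the single "setting" being tuned
    for _ in range(steps):
        # Gradient of mean((w*x - y)^2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad  # step against the gradient to reduce error
    return w


# Illustrative data generated from y = 3 * x (an assumption for this sketch).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x for x in xs]

w = train(xs, ys)
print(f"learned w = {w:.4f}")
```

The outer optimizer only ever sees the error signal. It has no direct way to specify what the model computes internally, which is why the inner and outer goals can come apart.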

Builder example

Mesa-optimization is primarily a research concept, but it carries a practical lesson: training incentives do not fully determine what a model learns to optimize for internally. A model that performs well during training might be pursuing a subtly different objective, one whose behavior only diverges from the intended one under new conditions. For builders shipping fine-tuned or heavily optimized models, this is a reason to test behavior across diverse scenarios rather than trusting training metrics alone.
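A toy sketch of that testing advice follows. It assumes a hypothetical one-dimensional world in which the goal always sits at the right edge during training. The two policies are hand-written stand-ins, not learned models and not mesa-optimizers; intended_policy, proxy_policy, and success_rate are illustrative names. The point is only that two behaviors can be indistinguishable on training-like scenarios and differ sharply on shifted ones.

```python
import random


def intended_policy(pos, goal):
    """Step toward the goal (what training was meant to teach)."""
    return 1 if goal > pos else -1


def proxy_policy(pos, goal):
    """Always step right (a shortcut that happens to work in training)."""
    return 1


def success_rate(policy, scenarios, width=10, max_steps=20):
    """Fraction of (start, goal) scenarios in which the policy reaches the goal."""
    wins = 0
    for start, goal in scenarios:
        pos = start
        for _ in range(max_steps):
            if pos == goal:
                break
            # Take one step and stay inside the world [0, width - 1].
            pos = min(max(pos + policy(pos, goal), 0), width - 1)
        wins += int(pos == goal)
    return wins / len(scenarios)


rng = random.Random(0)
# Training-like scenarios: the goal is always at the right edge.
train_like = [(rng.randrange(0, 9), 9) for _ in range(100)]
# Shifted scenarios: the goal is at the left edge instead.
shifted = [(rng.randrange(1, 10), 0) for _ in range(100)]

policies = {"intended": intended_policy, "proxy": proxy_policy}
report = {
    name: {
        "train_like": success_rate(policy, train_like),
        "shifted": success_rate(policy, shifted),
    }
    for name, policy in policies.items()
}

for name, scores in report.items():
    print(name, scores)
```

Both policies reach the goal in every training-like scenario, so a training metric alone cannot tell them apart. Only the shifted scenarios expose that the proxy policy never reaches a goal on the left.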

Common confusion: Whether current large language models contain mesa-optimizers in the formal sense is an open empirical question. The concept is a theoretical framework for reasoning about risk, not a confirmed property of today's production models.