Control / Research term
Mechanistic interpretability
The research field that reverse-engineers what happens inside a neural network to explain why it produces a given answer, the way a biologist dissects an organism to understand how it works.
Most AI evaluation treats the model as a black box: send input, check the output, measure accuracy. Mechanistic interpretability goes deeper. Researchers trace the model's internal processing to find recognizable patterns (features), computational pathways (circuits), and internal procedures (algorithms) that explain a specific output. The goal is to reach a level of understanding where you can predict behavior from the model's internals, the same way understanding a circuit board lets you predict what a device will do before you turn it on.
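One concrete tool researchers use for the "features" step is a linear probe: capture a layer's activations with a hook, then train a small classifier to test whether some property of the input is linearly readable from those internals. The sketch below is a minimal, self-contained illustration of that idea; the tiny stand-in model, the synthetic feature label, and the choice of layer are all assumptions for demonstration, not a specific published method (real work probes a trained language model on labeled prompts).

```python
# Minimal linear-probe sketch: capture hidden activations via a forward hook,
# then fit a classifier to see whether a "feature" is readable from them.
# The toy model and synthetic labels are placeholders, not a real LLM setup.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for one layer of a network whose internals we want to inspect.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))

captured = []
def hook(module, inputs, output):
    captured.append(output.detach())

# Capture activations after the hidden layer's nonlinearity (index 1 = ReLU).
handle = model[1].register_forward_hook(hook)

# Synthetic "inputs" and a synthetic binary feature label for each one.
x = torch.randn(256, 64)
y = (x[:, 0] > 0).float()           # pretend feature: sign of one input dim
model(x)                            # forward pass fills `captured`
handle.remove()

acts = captured[0]                  # shape (256, 128): hidden activations

# Fit a linear probe on the activations to predict the feature label.
probe = nn.Linear(128, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(acts).squeeze(-1), y)
    loss.backward()
    opt.step()

acc = ((probe(acts).squeeze(-1) > 0).float() == y).float().mean()
print(f"probe accuracy on training activations: {acc:.2f}")
```

High probe accuracy is evidence (not proof) that the network represents the feature internally; circuit-level work then asks how that representation is computed and used downstream.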
Builder example
This research is the foundation for the next generation of AI safety and quality tools. Today, product teams rely on output-level testing (evals) to catch problems. Mechanistic interpretability could eventually provide internal monitoring, detecting that a model is about to hallucinate, refuse, or reason deceptively before those behaviors reach the output. Teams that follow this research direction will be better positioned to adopt those tools as they mature.
In practice, researchers first identify a feature associated with a behavior such as sycophancy or refusal; that internal signal can then become one input into monitoring, though it still needs external validation (see the sketch below).
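To make the monitoring idea concrete, here is a hedged sketch of how a probed feature direction might feed a runtime check: project each new activation onto the direction and flag high scores for further review. The direction, threshold, and serving-loop framing are hypothetical stand-ins under the assumptions above, not a real product or library API.

```python
# Hypothetical runtime monitor built on a probed feature direction:
# score = projection of the current activation onto the direction,
# flag when it exceeds a threshold chosen from validation data.
import torch

feature_direction = torch.randn(128)          # e.g. a probe's learned weights
feature_direction /= feature_direction.norm()
THRESHOLD = 2.0                               # assumed, tuned on held-out data

def monitor(activation: torch.Tensor) -> bool:
    """Return True if the activation projects strongly onto the feature."""
    score = activation @ feature_direction
    return bool(score > THRESHOLD)

# In a serving loop, you would call monitor() on the captured activation for
# each request and route flagged outputs to extra checks or human review.
example_activation = torch.randn(128)
if monitor(example_activation):
    print("flag: internal signal suggests the monitored behavior")
else:
    print("no flag")
```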
Common confusion: Mechanistic interpretability is completely different from asking a model to explain its own reasoning. When a model writes "I chose this answer because...," that explanation is generated text, which may be wrong or fabricated. Mechanistic interpretability examines the actual computations, independent of what the model claims about itself.