Control / Research term
Sparse autoencoder (SAE)
A separate analysis tool that untangles the overlapping signals inside a neural network into cleaner, individually meaningful patterns, like a prism splitting white light into its component colors.
Neural networks represent far more concepts than they have neurons, cramming multiple concepts into overlapping activation patterns across the same neurons (a phenomenon called superposition). A sparse autoencoder, or SAE, spreads those tangled signals across a much larger set of slots, with a key constraint: only a few slots light up for any given input. Each active slot tends to correspond to one recognizable concept, such as a specific landmark, a writing style, or a safety-relevant behavior. Anthropic used SAEs to extract millions of interpretable features from Claude, making SAEs one of the primary tools in modern interpretability research. "Sparse" refers to forcing most slots to stay inactive, which is what produces the clean one-concept-per-slot separation.
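To make the mechanics concrete, here is a minimal sketch of an SAE in PyTorch. It is illustrative only: the dimensions, the ReLU encoder, and the L1 sparsity penalty shown here are one common design, not the specific architecture from any particular published work.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: re-express d_model activations as a much
    larger dictionary of n_features slots, most of which should stay
    inactive for any given input."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activations -> feature slots
        self.decoder = nn.Linear(n_features, d_model)  # feature slots -> reconstruction

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # slots are nonnegative
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    # Reconstruction term: the SAE must still explain the original signal.
    recon_loss = ((reconstruction - activations) ** 2).mean()
    # Sparsity term: penalize active slots so only a few light up per input.
    sparsity_loss = features.abs().mean()
    return recon_loss + l1_coeff * sparsity_loss

# Usage: dictionary much larger than the model dimension (numbers are arbitrary).
sae = SparseAutoencoder(d_model=768, n_features=16384)
acts = torch.randn(32, 768)  # stand-in for captured model activations
recon, feats = sae(acts)
loss = sae_loss(recon, acts, feats)
```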
Builder example
Product teams will likely never train their own SAEs, but the features SAEs extract are the foundation for the next generation of AI safety tools. Imagine a dashboard showing which concepts are active during a model response, with alerts when deception-related or hallucination-related features spike. SAE-derived features could also power more targeted model editing: suppressing a specific unwanted behavior by identifying and intervening on its corresponding feature.
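As a sketch of what such a dashboard check might look like, assuming per-response SAE feature activations are already available: the feature indices, labels, and alert threshold below are entirely hypothetical, not real feature IDs from any published SAE.

```python
import torch

# Hypothetical watchlist mapping SAE feature indices to safety labels.
WATCHLIST = {
    4217: "deception-related",
    9031: "hallucination-related",
}
ALERT_THRESHOLD = 5.0  # activation level that triggers an alert (assumed)

def check_features(features: torch.Tensor) -> list[str]:
    """Scan the SAE feature activations for one model response and
    return alerts for any watchlisted features that spike."""
    alerts = []
    for idx, label in WATCHLIST.items():
        activation = features[..., idx].max().item()
        if activation > ALERT_THRESHOLD:
            alerts.append(f"ALERT: {label} feature {idx} spiked at {activation:.2f}")
    return alerts
```

Note that suppressing a feature is a further step beyond this kind of monitoring: it means zeroing that slot and writing the decoder's modified reconstruction back into the model's forward pass, rather than merely reading activations.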
Without an SAE, a single neuron responds to multiple unrelated concepts, making it hard to interpret. With an SAE, those activations are re-expressed in a larger learned feature dictionary where individual features are easier to inspect.
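One common way to inspect a learned feature is to look at its max-activating examples: collect the inputs where the feature fires most strongly and read them for a shared concept. A minimal sketch, assuming feature activations have already been computed for a batch of texts:

```python
import torch

def top_activating_examples(feature_acts: torch.Tensor, texts: list[str],
                            feature_idx: int, k: int = 5):
    """feature_acts: [n_examples, n_features] SAE activations, one row
    per input text. Returns the k texts that most strongly activate
    the chosen feature, with their activation scores."""
    scores = feature_acts[:, feature_idx]
    top = torch.topk(scores, k=min(k, len(texts)))
    return [(texts[i], scores[i].item()) for i in top.indices.tolist()]
```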
Common confusion: On its own, an SAE is a separate analysis tool that runs alongside the model; reading its features has no effect on the model's behavior or outputs. The model operates identically whether or not an SAE is analyzing it. Think of it like an MRI machine: it reveals internal structure for the researcher while the patient functions normally. (Intervening on a feature, as in the builder example above, is a deliberate extra step, not something an SAE does by itself.)