Glossary definition

Control / Research term

Monosemanticity

When one internal pattern inside a model maps cleanly to one recognizable concept, the way a single key on a piano plays exactly one note.

Monosemanticity is the interpretability ideal: each internal pattern corresponds to exactly one human-recognizable concept, making the model's internals readable. Individual neurons rarely achieve this because they respond to multiple unrelated concepts at once (polysemanticity). Researchers use tools like sparse autoencoders to extract cleaner patterns from the model's internal activity, where each pattern tends to correspond to a single concept like "the Golden Gate Bridge" or "deceptive reasoning." Anthropic's Scaling Monosemanticity work extracted millions of these cleaner patterns from Claude, demonstrating that the approach works at production scale.
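The sparse-autoencoder idea can be sketched in a few lines: encode a model activation into a wide, non-negative feature vector, decode it back, and train against reconstruction error plus a sparsity penalty so that only a few features fire at once. This is a minimal illustration with random weights, not any particular SAE implementation; all names and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_features = 16, 64  # feature dictionary is overcomplete vs. the activation

# Hypothetical SAE weights; a real SAE would be trained, these are random.
W_enc = rng.normal(0, 0.1, (d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(0, 0.1, (d_features, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation vector into sparse features, then reconstruct it."""
    f = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU keeps feature activations non-negative
    x_hat = f @ W_dec + b_dec               # linear decoder reconstructs the activation
    return f, x_hat

x = rng.normal(size=d_model)                # stand-in for a model's internal activation
f, x_hat = sae_forward(x)

# Training would minimize reconstruction error plus an L1 sparsity penalty,
# which pushes each input to activate only a handful of features.
loss = np.sum((x - x_hat) ** 2) + 0.01 * np.sum(np.abs(f))
```

The L1 term is what makes the learned features "cleaner" than raw neurons: each feature is pressured to fire rarely, so it tends to specialize on one concept.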

Builder example

Monosemantic features are the building blocks for future AI monitoring systems. If each feature reliably maps to one concept, product teams could build detectors that flag when "deception" or "hallucination" features activate during a response. The practical barrier today is validation: confirming that a labeled feature actually controls the behavior it appears to represent, across diverse inputs and contexts.
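A detector of the kind described above could be as simple as thresholding one labeled feature's activation over a response. This is a hypothetical sketch: the feature index, its label, and the threshold are all assumptions, and in practice each would need calibration against validated examples.

```python
import numpy as np

def feature_detector(feature_acts, feature_id, threshold):
    """Flag a response when a labeled feature fires above a threshold.

    feature_acts: (n_tokens, n_features) array of SAE feature activations
    for one response. feature_id and threshold are illustrative placeholders.
    """
    max_act = float(feature_acts[:, feature_id].max())
    return max_act > threshold, max_act

# Toy usage: pretend feature 3 is labeled "deception" and spikes on token 2.
acts = np.zeros((5, 8))
acts[2, 3] = 4.2
flagged, score = feature_detector(acts, feature_id=3, threshold=1.0)
# flagged is True, score is 4.2
```

The hard part is not this code but the validation step the paragraph above names: establishing that the feature's activations actually track the labeled behavior across diverse inputs.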

Common confusion: Finding a feature that researchers label "sycophancy" does not automatically mean you can suppress sycophancy by turning it off. The label describes what correlates with the pattern's activation. Whether manipulating it reliably changes behavior requires separate causal testing.
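The causal test the paragraph above calls for is typically an ablation: clamp the suspect feature to zero, decode a patched activation, run the model forward from it, and check whether the behavior actually changes. A minimal sketch of the patching step, with hypothetical decoder weights and function names:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_features = 16, 64
W_dec = rng.normal(0, 0.1, (d_features, d_model))  # hypothetical SAE decoder

def ablate_feature(features, feature_id):
    """Zero out one feature and decode a patched activation vector.

    Feeding this patched activation back through the model and comparing
    outputs is the causal test; correlation with a label alone is not.
    """
    patched = features.copy()
    patched[feature_id] = 0.0
    return patched @ W_dec

features = np.maximum(0.0, rng.normal(size=d_features))
features[7] = 1.0                       # ensure the target feature is active
x_patched = ablate_feature(features, feature_id=7)
x_original = features @ W_dec
```

If the downstream behavior is unchanged after ablation, the label was describing a correlate, not a cause.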