Control / Research term
Polysemanticity
When a single neuron inside a model responds to several unrelated concepts at once, like a light switch that simultaneously controls the kitchen light, the garage door, and the sprinklers.
In a neural network, one neuron might activate for "cats," "the color blue," and "legal contracts" all at the same time. This happens because the model stores far more concepts than it has neurons, so neurons pull double and triple duty (see superposition). You cannot understand what a model is doing by reading individual neuron activations. Each neuron's signal is a tangled mixture of unrelated concepts, which is why researchers developed sparse autoencoders to pull those signals apart into cleaner, individually meaningful patterns.
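To make the mechanism concrete, here is a minimal NumPy sketch (toy numbers, not a real model or any particular interpretability library): six made-up "concepts" are squeezed into a two-neuron hidden layer, so each neuron necessarily responds to several of them. The 0.3 activation threshold is an arbitrary choice for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_concepts, n_neurons = 6, 2
# Give each concept a random unit direction in the 2-D neuron space.
concept_dirs = rng.normal(size=(n_concepts, n_neurons))
concept_dirs /= np.linalg.norm(concept_dirs, axis=1, keepdims=True)

# One example per concept: row i expresses only concept i.
examples = np.eye(n_concepts)
hidden = examples @ concept_dirs   # (6, 2): neuron activations per concept

# Column j is neuron j's response to all six concepts. With more concepts
# than neurons, every neuron fires for several unrelated concepts.
for j in range(n_neurons):
    active = np.nonzero(np.abs(hidden[:, j]) > 0.3)[0]
    print(f"neuron {j} responds to concepts {active.tolist()}")
```

Running this prints several concept indices per neuron: the polysemanticity falls out of the geometry, not out of anything the toy "model" did wrong.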
Builder example
Polysemanticity is the core reason to be skeptical of any claim that starts with "we found the neuron responsible for X." If that neuron also fires for dozens of unrelated concepts, manipulating it will have unpredictable side effects. For product teams evaluating AI safety or explainability tools, the unit of analysis needs to be a feature (an extracted pattern), not a raw neuron.
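A small sketch of why the side effects show up, under the same toy assumptions as above (a hypothetical two-neuron layer, not a real model): ablating the neuron that fires most strongly for one concept shifts the readout for several other concepts at the same time.

```python
import numpy as np

rng = np.random.default_rng(1)
n_concepts, n_neurons = 6, 2
encoder = rng.normal(size=(n_concepts, n_neurons))  # concept -> neuron weights
decoder = np.linalg.pinv(encoder)                    # neuron -> concept readout

x = np.zeros(n_concepts)
x[0] = 1.0                    # the input expresses only concept 0
hidden = x @ encoder          # two neuron activations

# "Ablate the neuron responsible for concept 0": zero out whichever neuron
# responds to it most strongly.
target = int(np.argmax(np.abs(hidden)))
edited = hidden.copy()
edited[target] = 0.0

before = hidden @ decoder
after = edited @ decoder
# The readout shifts for concepts other than 0, because the ablated neuron
# was carrying pieces of several concepts at once.
print("readout change per concept:", np.round(after - before, 3))
```

The same intervention on a feature extracted by a sparse autoencoder would, in principle, move only the concept that feature represents, which is why features rather than raw neurons are the right unit of analysis.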
Common confusion: Polysemanticity is a problem for researchers trying to understand the model. It is not a problem for the model itself. The model performs well with tangled neurons; the tangle only becomes an obstacle when humans try to inspect or control what is happening inside.