Control / Research term
Feature (interpretability)
A recognizable concept encoded as a pattern of activity across many neurons inside a model, like "Golden Gate Bridge," "code bug," "sycophantic tone," or "the French language."
Models organize their knowledge into internal patterns that often correspond to concepts a human would recognize. Researchers call these patterns features. One feature might activate whenever the model processes text about a specific landmark; another might activate for deceptive reasoning. Features rarely live in a single neuron. They spread across many neurons in overlapping ways, so extracting them requires specialized tools like sparse autoencoders that tease apart tangled signals into individually meaningful patterns.
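A minimal sketch of the sparse autoencoder idea, assuming PyTorch; the dimensions, training data, and penalty strength are placeholders, and real interpretability SAEs are far larger and trained on activations captured from an actual model:

```python
# Toy sparse autoencoder: maps model activations into a wider, sparse
# "feature" space and back. All sizes and data here are stand-ins.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, act_dim: int, feature_dim: int):
        super().__init__()
        self.encoder = nn.Linear(act_dim, feature_dim)   # activations -> candidate features
        self.decoder = nn.Linear(feature_dim, act_dim)   # features -> reconstructed activations

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations)) # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

# Hypothetical sizes: 512-dim activations, 4096 candidate features.
sae = SparseAutoencoder(act_dim=512, feature_dim=4096)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # strength of the sparsity penalty (assumed value)

# Stand-in for activations captured from a real model's hidden layer.
batch = torch.randn(64, 512)

for _ in range(100):
    features, reconstruction = sae(batch)
    # Reconstruction loss keeps the features faithful to the original activations;
    # the L1 penalty pushes most feature activations to zero, so each input is
    # explained by a small number of individually meaningful directions.
    loss = ((reconstruction - batch) ** 2).mean() + l1_coeff * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```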
Builder example
Features are among the most promising units for future AI safety and debugging tools. Reliable feature monitoring during a response could power real-time detectors for hallucination, manipulation, or policy violations, operating at a deeper level than scanning the model's output. Product teams will likely interact with features through monitoring dashboards and safety APIs rather than manipulating them directly.
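To make the "monitoring dashboards and safety APIs" idea concrete, here is a hypothetical sketch. The feature names, thresholds, and monitoring service are invented for illustration and do not correspond to any real API:

```python
# Hypothetical feature-monitoring check (illustrative only; no such API exists today).
# The idea: a provider exposes per-response activation scores for named
# safety-relevant features, and product code flags responses that exceed a threshold.

WATCHED_FEATURES = {            # hypothetical feature names and alert thresholds
    "deceptive_reasoning": 0.8,
    "policy_violation": 0.7,
}

def review_response(response_text: str, feature_activations: dict[str, float]) -> str:
    """Decide what to do with a model response given feature activation scores.

    `feature_activations` is assumed to come from an interpretability-based
    monitoring service, with values normalized to [0, 1].
    """
    for name, threshold in WATCHED_FEATURES.items():
        if feature_activations.get(name, 0.0) >= threshold:
            return f"flagged: {name}"   # route to review instead of returning the text
    return response_text

# Example: a response whose "deceptive_reasoning" feature fired strongly.
print(review_response("Sure, I can do that...", {"deceptive_reasoning": 0.91}))
```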
Common confusion: In interpretability research, "feature" means an internal pattern the model uses. This is different from how product teams use the word (a capability or function). An interpretability feature is a direction in the model's internal representation space that corresponds to a recognizable concept.
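The "direction in representation space" framing can be made concrete with a dot product: a feature's activation on a given input is roughly the projection of the model's internal activation vector onto that feature's direction. The vectors below are made up for illustration:

```python
# Treating a feature as a direction: its activation is the projection of the
# model's internal activation vector onto that direction. Vectors are random stand-ins.
import numpy as np

activation = np.random.randn(512)                        # stand-in for a hidden-state vector
feature_direction = np.random.randn(512)
feature_direction /= np.linalg.norm(feature_direction)   # normalize to a unit-length direction

feature_activation = float(activation @ feature_direction)  # how strongly the feature fires
print(feature_activation)
```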