Probe

A small, simple detector attached to a model's internal signals that checks whether the model "knows" a specific piece of information, like a thermometer that reads the temperature without changing it.

Researchers train a lightweight classifier (often just a single layer) on the model's internal activity to answer targeted questions: "Does the model recognize this text is about medicine?" "Is it about to refuse this request?" "Does it detect that it is being tested?" The probe reads the model's internal state without modifying it. This makes probes useful for checking what information the model encodes at each processing layer, even when that information never surfaces in the output.

Builder example

Probes are the most accessible interpretability tool for future product integration. They could power real-time safety monitors: a probe checking "is the model about to hallucinate?" could trigger a warning before the response reaches the user. Anthropic has explored probes for detecting deceptive behavior in sleeper-agent scenarios, pointing toward production safety applications.

Common confusion: A probe detecting that information exists inside the model does not prove the model uses that information in its answer. The model may encode "this claim is false" internally and still output the false claim confidently. Probe results show what the model represents, not what it acts on.