Activation steering / activation engineering

A technique that changes how a model behaves by adding a small directional signal to its internal computations while it runs, like adjusting the tint on a camera lens mid-shot.

The usual ways to change model behavior are rewriting the prompt or retraining. Activation steering adds a third option: while the model processes your input, researchers add a vector (a mathematical direction) to the model's activations, the intermediate values its layers compute. This vector pushes the model toward or away from a trait, such as "be more cautious" or "be less formal." The model's weights stay untouched and the prompt stays the same; only the internal signal path changes. Researchers build these steering vectors by comparing the model's internal states on examples with and without the target trait, then adding the difference, scaled to the desired strength, during processing.
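
The recipe in the paragraph above can be sketched in a few lines of plain Python. This is a toy illustration, not a real model integration: the function names (mean_vector, build_steering_vector, steer), the strength parameter, and the three-number "activations" are all made up for the example. It assumes the steering vector is the difference between the average activation with the trait and the average without it, which is one common way to build such a vector.

```python
from statistics import fmean


def mean_vector(rows):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    return [fmean(column) for column in zip(*rows)]


def build_steering_vector(with_trait, without_trait):
    """Mean activation on trait examples minus mean activation on non-trait examples."""
    positive = mean_vector(with_trait)
    negative = mean_vector(without_trait)
    return [p - n for p, n in zip(positive, negative)]


def steer(hidden_state, steering_vector, strength=1.0):
    """Add the scaled steering vector to one activation; no weights are changed."""
    return [h + strength * s for h, s in zip(hidden_state, steering_vector)]


# Hypothetical 3-dimensional activations (invented numbers, not from a real model).
cautious_examples = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
neutral_examples = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

vector = build_steering_vector(cautious_examples, neutral_examples)

# Positive strength pushes toward the trait, negative strength pushes away from it.
toward = steer([0.5, 0.5, 0.5], vector, strength=2.0)
away = steer([0.5, 0.5, 0.5], vector, strength=-1.0)
```

In a real model the vectors have thousands of dimensions and the addition happens at a chosen layer during the forward pass, but the arithmetic is the same idea: one difference, one scaled addition.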

Builder example

Activation steering could eventually give product teams fine-grained behavioral dials (honesty, caution, verbosity) that work at the model layer without prompt hacks or costly retraining. For now, it remains a research tool with real side effects: turning up "caution" might also turn down "helpfulness" in hard-to-predict ways.

Common confusion: Activation steering is not a form of prompt engineering. It modifies the model's live computation, which makes it fundamentally different. Changing the prompt changes what the model reads. Activation steering changes how the model processes what it reads.