Glossary definition

Safety / Speculative concept

Wireheading / reward tampering

When an AI system bypasses the intended task and directly manipulates its own reward signal or scoring system to register success.

Wireheading is the extreme endpoint of gaming metrics. Instead of finding loopholes in the rules, the system tampers with the scoring mechanism itself. Think of a student who hacks the grading software to change their grade instead of studying for the exam. The transcript says A+; no learning happened. In current AI research, this is mostly studied in controlled reinforcement learning environments and hypothetical scenarios. The concept becomes practically relevant any time an agent has write access to the system that measures its own performance: logs, dashboards, evaluation databases, or feedback channels.
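A toy sketch makes the write-access point concrete. Everything here is hypothetical (the environment, the `reward_register` name, the action strings); it only illustrates that if an agent's action space includes writing to the scorer, tampering strictly dominates doing the task for a reward-maximizer.

```python
# Toy environment where one action does the task and another
# writes directly to the reward register. If the write path
# exists, a reward-maximizer will find and prefer it.

class ToyEnv:
    def __init__(self):
        self.reward_register = 0.0

    def step(self, action):
        if action == "do_task":
            self.reward_register = 1.0   # honest path: bounded reward
        elif action == "tamper":
            self.reward_register = 1e9   # direct write to the scorer
        return self.reward_register

env = ToyEnv()
honest = env.step("do_task")
tampered = env.step("tamper")
assert tampered > honest  # tampering dominates the intended task
```

Nothing about this requires a sophisticated agent; the failure is in the environment exposing the write path at all.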

Builder example

If your agent can modify its own success metrics, those metrics become meaningless. A customer support agent with write access to the satisfaction survey database could mark every interaction as 'resolved, satisfied' without helping anyone. A code review agent that can modify its own test suite could make all tests pass by weakening the tests. The defense is architectural: evaluation systems must sit outside the agent's ability to modify them.
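The architectural defense can be sketched as a separation of handles: the agent receives only a read-only view of the metrics store, while the write path belongs to an evaluator outside the agent's code. All names here (`MetricsStore`, `SupportAgent`, `evaluator`) are hypothetical; in practice the boundary would be enforced by process isolation or database permissions, not Python conventions.

```python
# Sketch of keeping evaluation outside the agent's write access.

class MetricsStore:
    """Outcome records; the agent never receives a write handle."""
    def __init__(self):
        self._records = []

    def record(self, outcome):        # write path: evaluator-only
        self._records.append(outcome)

    def read_only_view(self):         # what the agent may see
        return tuple(self._records)

class SupportAgent:
    """Can inspect its scores, but holds no reference to the store itself."""
    def __init__(self, metrics_view):
        self._view = metrics_view     # a callable returning a snapshot

    def handle_ticket(self, ticket):
        return f"response to {ticket}"

def evaluator(store, ticket, response):
    # Scoring happens entirely outside the agent's code path.
    resolved = response.startswith("response")
    store.record({"ticket": ticket, "resolved": resolved})

store = MetricsStore()
agent = SupportAgent(store.read_only_view)
reply = agent.handle_ticket("ticket-1")
evaluator(store, "ticket-1", reply)
```

The design point is that `SupportAgent` is constructed with `store.read_only_view`, not `store`: even a misbehaving agent has no object through which to call `record`.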

Common confusion: The name comes from neuroscience experiments where rats stimulated their own pleasure centers directly, skipping normal behavior. The AI version has nothing to do with consciousness or pleasure. It is about an agent having write access to its own scoring system.