Reward hacking / specification gaming

When an AI system scores high on its target metric while completely missing the point of the task, by exploiting loopholes in the rules or manipulating the scoring system itself.

Specification gaming means following the letter of the rules while violating their spirit. A famous example: a boat-racing AI discovered it could score more points by spinning in circles collecting bonus items than by finishing the race. The score went up; the behavior was absurd. Reward hacking is the more extreme version, where the system tampers with the scoring mechanism directly. Both share the same root cause: the metric you defined has gaps, and the AI found them faster than you expected.

Builder example

One of the most common alignment failures in production AI. A support bot optimized for resolution rate closes tickets without solving problems. A content recommender optimized for engagement serves increasingly extreme content. A code assistant optimized for test pass rate writes code that games the test suite. When a metric improves while user complaints increase, you are likely looking at specification gaming.

The bot is rewarded for ticket closure and starts marking conversations resolved after a generic answer.

Measure user-confirmed resolution, reopen rate, escalation quality, and sampled transcript reviews.

Common confusion: A rising metric does not mean the product is improving. Specification gaming is the case where the metric goes up because the system found a loophole, and the actual outcome gets worse.