Eval-maxxing
Tuning a model, prompt, or product so aggressively for benchmark scores that the score stops reflecting real-world performance. The metric goes up while actual usefulness stalls or drops.
A team rewrites their prompt dozens of times to climb a public leaderboard, then discovers the optimized version handles real customer questions worse than the original. The tuning overfitted to the test while drifting away from the capability users actually need. This is Goodhart's Law ("when a measure becomes a target, it ceases to be a good measure") applied to AI evaluations. Benchmark contamination, where test questions leak into training data, makes the problem even harder to spot.
Builder example
Public benchmarks are useful for narrowing a long list of models to a short list. After that, the only scores that matter come from tasks drawn from your own product and your own users. Teams that skip this step often ship a model that performs brilliantly on paper and confuses real customers.
A model scores well on general reasoning benchmarks, then gives onboarding guidance that does not match the way your actual customers get set up.
The fix is to build a small private test set from your real support tickets, setup blockers, and explanation standards, then score every candidate model against it, as in the sketch below.
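A minimal sketch in Python of what such a private eval could look like. Everything here is a hypothetical placeholder: the `call_model` wrapper stands in for your model provider's API, and the cases and keyword-based scoring rule stand in for checks derived from your own tickets and standards.

```python
# Minimal private-eval sketch. Assumes a hypothetical
# call_model(prompt: str) -> str wrapper around your model provider.
# Cases and keyword checks are illustrative placeholders; real ones
# come from your own tickets, setup blockers, and standards.

cases = [
    {
        # Drawn from a real support ticket (hypothetical example)
        "prompt": "How do I connect my billing account during setup?",
        "must_mention": ["billing", "settings"],
    },
    {
        # A known setup blocker (hypothetical example)
        "prompt": "The setup wizard fails at the API key step. What now?",
        "must_mention": ["api key", "regenerate"],
    },
]

def score(answer: str, must_mention: list[str]) -> float:
    """Crude keyword check; swap in rubric grading or human review."""
    text = answer.lower()
    return sum(kw in text for kw in must_mention) / len(must_mention)

def run_eval(call_model) -> float:
    """Average per-case score across the private set."""
    return sum(
        score(call_model(c["prompt"]), c["must_mention"]) for c in cases
    ) / len(cases)
```

Even a few dozen cases drawn from real tickets will often surface gaps that leaderboard scores hide, and the same set lets you compare shortlisted models on the work your users actually bring.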
Common confusion: A high benchmark score can be both legitimate and irrelevant to your workflow. The score measures what the test measures, and that may have little overlap with what your users actually ask the model to do.