
Failures / Research term

Benchmark contamination

When test questions leak into a model's training data, inflating its scores. The model scores well by recognizing familiar questions, much as a student aces a test they have already seen the answers to.

Public benchmarks become unreliable when their questions appear in training datasets. A model trained on leaked MMLU questions can pattern-match to memorized answers without genuine understanding. The score looks impressive on a leaderboard, but real-world performance on novel problems may be substantially worse. Because most popular benchmarks have been scraped into training data at some point, contamination is widespread.
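A common first check for contamination is verbatim n-gram overlap between benchmark items and a sample of the training corpus. The sketch below is a minimal, illustrative version of that idea: `flag_contaminated` and the default window of 13 words are assumptions for this example (the window size is in the range used by published contamination checks), and exact matching will miss paraphrased or reformatted leaks.

```python
from typing import Iterable, List, Set, Tuple

def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Lowercase word n-grams of a text; returns an empty set for short texts."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(questions: Iterable[str],
                      training_docs: Iterable[str],
                      n: int = 13) -> List[str]:
    """Flag benchmark questions sharing any verbatim n-gram with training text."""
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [q for q in questions if ngrams(q, n) & train_grams]

# Hypothetical inputs: in practice, stream the corpus rather than holding it in memory.
corpus_sample = ["... the training document that quotes a benchmark question verbatim ..."]
flagged = flag_contaminated(["Which planet is largest?"], corpus_sample)
```

Treat a non-empty `flagged` list as a signal to investigate, not proof: overlap checks produce both false positives (common phrases) and false negatives (paraphrased leaks).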

Builder example

Choosing a model based on contaminated leaderboard scores leads to disappointment in production. The model that tops a public benchmark may underperform on your actual tasks, while a lower-ranked model with genuine capability handles them well. You discover this mismatch after integration, when it costs real time and money to fix.
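A practical guard is a small private evaluation before committing: score each candidate model on held-out tasks drawn from your own workload, which cannot have leaked into training data. Below is a minimal sketch assuming exact-match scoring; `model_a` and `private_tasks` are hypothetical stand-ins for your real inference call and your real task set.

```python
from typing import Callable, Dict, List

Task = Dict[str, str]

def accuracy(model_fn: Callable[[str], str], tasks: List[Task]) -> float:
    """Exact-match accuracy on a private task set; swap in a metric that fits your task."""
    correct = sum(model_fn(t["prompt"]).strip() == t["answer"] for t in tasks)
    return correct / len(tasks)

# Hypothetical stand-in: replace with your actual model API call.
def model_a(prompt: str) -> str:
    return "42"

# Hypothetical held-out tasks: draw these from your production workload.
private_tasks: List[Task] = [
    {"prompt": "What is 6 * 7? Answer with the number only.", "answer": "42"},
]

print(f"model_a accuracy on private tasks: {accuracy(model_a, private_tasks):.2f}")
```

Even a few dozen such tasks, kept off the public internet, give a contamination-resistant signal that a public leaderboard cannot.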

Common confusion: A contaminated benchmark still has historical value for tracking trends over time. It just cannot reliably predict how a model will perform on new, unseen tasks.