Best-of-N

Generate several answers from the model, score each one, and keep the highest-scoring one. It is a simple way to improve reliability when a single attempt is too hit-or-miss.

Suppose you need a model to write a SQL query. One attempt might have a subtle bug, so you generate five versions and run each against a test database. The one that returns correct results wins. The scoring step can be anything: a unit test, a rubric, a second model, or a human reviewer. Generating more candidates only helps if your scoring method can tell good from bad.
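The SQL scenario above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the model is replaced by a hypothetical canned generator that hands out three fixed queries, and the table, rows, and expected result are made up for the example. Only the `sqlite3` calls are real standard-library API.

```python
import sqlite3
from typing import Callable, List, Tuple


def best_of_n(generate: Callable[[], str],
              score: Callable[[str], float],
              n: int = 5) -> Tuple[str, float]:
    """Draw n candidates, score each once, return the top scorer and its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    scored = [(score(candidate), candidate)
              for candidate in (generate() for _ in range(n))]
    # max() keeps the earliest candidate when scores tie.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Hypothetical test fixture: total order amount per customer.
EXPECTED = [("ann", 15), ("bob", 7)]


def make_test_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [("ann", 10), ("ann", 5), ("bob", 7)])
    return conn


def score_query(query: str) -> float:
    """Score 1.0 if the query returns the expected rows, else 0.0."""
    conn = make_test_db()
    try:
        rows = conn.execute(query).fetchall()
    except sqlite3.Error:
        return 0.0  # a query that fails to run counts as wrong
    finally:
        conn.close()
    return 1.0 if rows == EXPECTED else 0.0


# Stand-in for model output (assumption): one subtle bug, one correct, one typo.
CANNED_QUERIES = [
    "SELECT customer, COUNT(amount) FROM orders GROUP BY customer ORDER BY customer",
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer",
    "SELEC customer FROM orders",
]


def make_canned_generator(queries: List[str]) -> Callable[[], str]:
    remaining = iter(queries)
    return lambda: next(remaining)


winner, winner_score = best_of_n(make_canned_generator(CANNED_QUERIES),
                                 score_query, n=len(CANNED_QUERIES))
print(winner_score, winner)
```

In a real pipeline `generate` would call the model with sampling enabled, and `score` could be swapped for a rubric, a second model, or a human review queue without changing `best_of_n`.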

Builder example

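Best-of-N is worth the extra cost when single-shot accuracy falls below what your use case demands. If a code-generation task succeeds 70% of the time on one try, generating five candidates and testing each can push effective accuracy above 95%: if the attempts fail independently and the tests catch every failure, the chance that at least one of five passes is 1 - 0.3^5, or about 99.8%. Errors from the same model are often correlated, so real gains are usually smaller than that figure. You pay for N generations plus the cost of scoring, so it works best when verification is cheap relative to the value of a correct answer.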
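The arithmetic behind that estimate is easy to check. The sketch below assumes two things the paragraph does not guarantee: attempts succeed independently with the same probability, and the scorer never misjudges a candidate. The function names are illustrative.

```python
def success_probability(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts succeeds."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - p) ** n


def candidates_needed(p: float, target: float) -> int:
    """Smallest n whose success probability reaches the target."""
    if not 0.0 < p <= 1.0 or not 0.0 < target < 1.0:
        raise ValueError("need 0 < p <= 1 and 0 < target < 1")
    n = 1
    while success_probability(p, n) < target:
        n += 1
    return n


for n in (1, 2, 3, 5):
    print(n, round(success_probability(0.7, n), 4))
# 1 0.7
# 2 0.91
# 3 0.973
# 5 0.9976

print(candidates_needed(0.7, 0.95))  # 3
```

Under these idealized assumptions three candidates already clear 95%, so the fourth and fifth mostly add cost. Measuring the pass rate on your own task is the only way to know where the curve flattens in practice.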

Common confusion: Your scorer sets the quality ceiling, not your generator. A weak judge can confidently pick polished-sounding wrong answers, which can leave the whole pipeline worse off than a single unfiltered attempt.
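A toy comparison makes the point concrete. Everything here is contrived for illustration: the three candidates, their `polish` values, and the `is_correct` flag (ground truth that a real scorer would not have) are assumptions, and `verifier` stands in for a scorer that actually checks behavior, such as a test suite.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Candidate:
    text: str
    is_correct: bool  # ground truth, hidden from real scorers
    polish: float     # how fluent and confident the answer sounds, 0 to 1


def pick(candidates: List[Candidate],
         score: Callable[[Candidate], float]) -> Candidate:
    """Best-of-N selection: keep the highest-scoring candidate."""
    return max(candidates, key=score)


def weak_judge(c: Candidate) -> float:
    # Rewards surface quality only; never checks correctness.
    return c.polish


def verifier(c: Candidate) -> float:
    # Stand-in for a scorer that checks behavior, e.g. running tests.
    return 1.0 if c.is_correct else 0.0


# Hypothetical pool; the first entry is what a single attempt would return.
pool = [
    Candidate("terse but right", is_correct=True, polish=0.4),
    Candidate("eloquent but wrong", is_correct=False, polish=0.9),
    Candidate("rambling and wrong", is_correct=False, polish=0.2),
]

print("single attempt:", pool[0].text)
print("weak judge    :", pick(pool, weak_judge).text)
print("verifier      :", pick(pool, verifier).text)
```

Here the single attempt happened to be right, and adding candidates plus a style-driven judge swapped it for a wrong answer. The generator is identical in all three rows; only the scorer changes the outcome.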