Glossary definition

Reasoning / Research term

Process reward model vs Outcome reward model

Two strategies for scoring AI reasoning. An outcome reward model (ORM) grades only the final answer. A process reward model (PRM) grades each reasoning step along the way.

The ORM approach is simpler to build because it needs only final-answer labels. The PRM approach catches a dangerous failure mode: a model that reaches the right answer through broken logic is likely to fail on harder problems, where that logic no longer lands on the correct result by accident. Think of a math exam: an ORM checks whether the student wrote the correct number at the bottom of the page, while a PRM reads each line of work and flags where the logic breaks down.
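The exam analogy can be sketched in a few lines of Python. This is a toy illustration, not how production reward models work: real ORMs and PRMs are usually learned models, whereas here a rule-based arithmetic checker stands in for both. The names `Step`, `outcome_score`, `process_scores`, and `first_broken_step` are hypothetical and chosen only for this example.

```python
import operator
from dataclasses import dataclass

# Supported operations for the toy step checker (an assumption of this sketch).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


@dataclass
class Step:
    left: int
    op: str
    right: int
    claimed: int  # the result the solver wrote down for this step


def outcome_score(final_answer, expected):
    """ORM-style: look only at the final answer."""
    return 1.0 if final_answer == expected else 0.0


def process_scores(steps):
    """PRM-style: score every step by recomputing it."""
    return [
        1.0 if OPS[s.op](s.left, s.right) == s.claimed else 0.0
        for s in steps
    ]


def first_broken_step(steps):
    """Index of the first step that fails the check, or None if all pass."""
    for i, score in enumerate(process_scores(steps)):
        if score == 0.0:
            return i
    return None


# Task: compute 12 * 3 + 4 (correct answer: 40).
# The solver makes two arithmetic errors that happen to cancel out.
steps = [
    Step(12, "*", 3, 35),  # wrong: 12 * 3 is 36
    Step(35, "+", 4, 40),  # wrong: 35 + 4 is 39
]
final_answer = steps[-1].claimed

print(outcome_score(final_answer, expected=40))  # outcome view of the solution
print(process_scores(steps))                     # per-step view of the solution
print(first_broken_step(steps))                  # where the logic first breaks
```

The outcome score gives this solution full marks because the final number is right, while the per-step scores flag both lines of work. That gap is the failure mode described above.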

Builder example

This distinction maps directly to how you evaluate AI-generated work in production. If you check only whether the final output looks right (outcome scoring), you miss cases where the model got lucky through flawed reasoning. For high-stakes workflows like financial calculations or medical summaries, adding intermediate checks on the reasoning path surfaces those errors before they reach a user.
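As a rough sketch of what intermediate checks might look like for a financial calculation, the example below validates each stage of a model-produced invoice summary instead of only the total. The `summary` dictionary layout, the `check_invoice_summary` helper, and the sample figures are all assumptions made up for illustration; a real pipeline would define its own schema and rounding rules.

```python
from decimal import Decimal


def check_invoice_summary(summary, line_items, tax_rate):
    """Return the names of intermediate values that fail validation.

    Each check validates one stage against its inputs, so a failure
    points at the stage where the reasoning went wrong.
    """
    failures = []

    # Stage 1: the subtotal must equal the sum of the line items.
    if summary["subtotal"] != sum(line_items, Decimal("0")):
        failures.append("subtotal")

    # Stage 2: the tax must follow from the reported subtotal.
    expected_tax = (summary["subtotal"] * tax_rate).quantize(Decimal("0.01"))
    if summary["tax"] != expected_tax:
        failures.append("tax")

    # Stage 3: the total must equal the reported subtotal plus tax.
    if summary["total"] != summary["subtotal"] + summary["tax"]:
        failures.append("total")

    return failures


line_items = [Decimal("100.00"), Decimal("50.00")]
tax_rate = Decimal("0.10")

# Hypothetical model output: the total is right (165.00),
# but the subtotal and the tax are both wrong.
summary = {
    "subtotal": Decimal("140.00"),
    "tax": Decimal("25.00"),
    "total": Decimal("165.00"),
}

outcome_ok = summary["total"] == Decimal("165.00")  # outcome-only check
failures = check_invoice_summary(summary, line_items, tax_rate)

print(outcome_ok)  # the final number alone looks fine
print(failures)    # the intermediate checks show which stages were wrong
```

Checking each stage against its own inputs, and not against a single known-good total, keeps the validators cheap to write and tells you which stage to investigate when something fails.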

Common confusion: Process scoring sounds strictly better, but it requires step-level ground truth or reliable intermediate validators, which are expensive and hard to build. For many tasks, outcome checks remain the practical choice.