Process reward model vs Outcome reward model

Two strategies for scoring a model's reasoning. An outcome reward model (ORM) grades only the final answer. A process reward model (PRM) grades each reasoning step along the way.

Think of a math exam. An ORM checks whether the student wrote the correct number at the bottom of the page. A PRM reads each line of work and flags where the logic breaks down. The ORM approach is simpler to build because it needs only final-answer labels. The PRM approach catches a dangerous failure mode: a model that reaches the right answer through broken logic. That model is likely to fail on harder problems, where the same broken logic no longer happens to land on the correct result.
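The exam analogy can be sketched in a few lines of Python. This is a toy illustration, not how production reward models work: real ORMs and PRMs are usually learned models, whereas here a rule-based arithmetic check stands in for both. The names `Step`, `outcome_score`, and `process_scores` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
import operator

# Supported operations for the toy step checker (an assumption of this sketch).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


@dataclass
class Step:
    left: int
    op: str
    right: int
    claimed: int  # the value the solver wrote down for this step


def outcome_score(final_answer: int, expected: int) -> float:
    """ORM-style: look only at the final answer."""
    return 1.0 if final_answer == expected else 0.0


def process_scores(steps: list[Step]) -> list[float]:
    """PRM-style: one score per step, 1.0 if the step's arithmetic holds."""
    return [
        1.0 if OPS[s.op](s.left, s.right) == s.claimed else 0.0
        for s in steps
    ]


# Problem: (3 + 4) * 2, whose correct answer is 14.
# The solver makes two mistakes that happen to cancel out.
lucky_solution = [
    Step(3, "+", 4, claimed=8),   # wrong: 3 + 4 is 7
    Step(8, "*", 2, claimed=14),  # wrong: 8 * 2 is 16
]
final_answer = lucky_solution[-1].claimed

print(outcome_score(final_answer, expected=14))  # 1.0
print(process_scores(lucky_solution))            # [0.0, 0.0]
```

The outcome score gives full credit, while the per-step scores show that neither line of work is valid.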

Builder example

This distinction maps directly to how you evaluate AI-generated work in production. If you check only whether the final output looks right (outcome scoring), you miss cases where the model got lucky through flawed reasoning. For high-stakes workflows like financial calculations or medical summaries, adding intermediate checks on the reasoning path catches errors that final-answer-only scoring lets through.
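As a rough sketch of what intermediate checks might look like for a financial calculation, the Python below validates a model-produced invoice summary field by field instead of checking only the total. The report layout (`line_items`, `subtotal`, `tax_rate`, `tax`, `total`) and the function names are assumptions made for this example, not a standard schema.

```python
from decimal import Decimal


def outcome_check(report: dict, expected_total: str) -> bool:
    """Outcome-only check: does the final total match the expected value?"""
    return Decimal(report["total"]) == Decimal(expected_total)


def intermediate_checks(report: dict) -> list[str]:
    """Return the names of intermediate values that fail recomputation."""
    failures = []

    # Recompute the subtotal from the line items (price, quantity).
    subtotal = sum(Decimal(price) * qty for price, qty in report["line_items"])
    if subtotal != Decimal(report["subtotal"]):
        failures.append("subtotal")

    # Check the tax against the subtotal the report itself states.
    tax = Decimal(report["subtotal"]) * Decimal(report["tax_rate"])
    if tax != Decimal(report["tax"]):
        failures.append("tax")

    # Check that the stated subtotal and tax add up to the stated total.
    if Decimal(report["subtotal"]) + Decimal(report["tax"]) != Decimal(report["total"]):
        failures.append("total")

    return failures


# Hypothetical model output: the total is right, but the subtotal and tax
# are both wrong in ways that cancel out.
report = {
    "line_items": [("40.00", 2), ("20.00", 1)],  # true subtotal: 100.00
    "tax_rate": "0.10",                          # true tax: 10.00
    "subtotal": "90.00",
    "tax": "20.00",
    "total": "110.00",
}

print(outcome_check(report, expected_total="110.00"))  # True
print(intermediate_checks(report))                     # ['subtotal', 'tax']
```

Checks like these only work when each intermediate value can be recomputed or validated independently, which is the cost the next paragraph describes.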

Common confusion: Process scoring sounds strictly better, but it requires step-level ground truth or reliable intermediate validators, which are expensive and hard to build. For many tasks, outcome checks remain the practical choice.