Reinforcement Learning with Verifiable Rewards (RLVR)

A training method that rewards models based on objectively checkable outcomes, like whether a math answer is correct or whether code passes its test suite, removing the subjectivity of human or AI judges from the training signal.

In standard RLHF, a human rater (or a reward model trained to mimic human preferences) decides which response is better, introducing subjectivity into the training signal. Reinforcement Learning with Verifiable Rewards, or RLVR, replaces that judgment with an automated check: run the code and see whether the tests pass, compare the math answer against the known solution, or check whether a proof holds. Because the reward signal is unambiguous, the model gets clearer feedback about what "correct" means. DeepSeek-R1 and similar reasoning models used this approach extensively, which is a key reason they perform so well on math, coding, and formal logic tasks. The limitation: it works best in domains where correctness can be checked automatically.
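
As a minimal sketch of what such a check can look like, assume a simple setup where each model output is scored 1.0 for a verified-correct outcome and 0.0 otherwise. The function names, the exec-based test harness, and the reward scale below are illustrative assumptions, not any particular training stack's API.

```python
# Sketch of two verifiable reward functions. The names, the exec-based
# harness, and the 1.0/0.0 scale are illustrative assumptions; real
# training stacks run untrusted model code in a sandbox, not bare exec.

def math_reward(model_answer: str, reference_answer: str) -> float:
    """Reward 1.0 iff the model's final answer matches the known answer."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0


def code_reward(solution_src: str, tests_src: str) -> float:
    """Reward 1.0 iff the generated code passes its test suite."""
    namespace: dict = {}
    try:
        exec(solution_src, namespace)  # define the candidate function(s)
        exec(tests_src, namespace)     # test asserts raise on failure
    except Exception:
        return 0.0                     # any error or failed assert: no reward
    return 1.0


# The check is binary and needs no human or AI judge.
print(math_reward("42", "42"))  # 1.0
print(code_reward(
    "def add(a, b):\n    return a + b",
    "assert add(2, 3) == 5",
))  # 1.0
```

Either reward can then feed a standard RL objective: the policy is updated to make high-reward outputs more likely, with no preference model in the loop.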

Builder example

RLVR explains why modern reasoning models excel at coding and math: those domains have cheap, reliable verification. Whenever your task has a clear right answer you can check programmatically, models trained with RLVR will tend to perform well. Tasks involving judgment, strategy, or nuance, where "correct" is ambiguous, are harder to improve with this approach.
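
To make "check programmatically" concrete, here is a hedged example of the kind of verifier a builder might write. The `is_valid_extraction` helper is hypothetical, but "output parses as JSON and contains the required keys" is exactly the sort of unambiguous pass/fail criterion this describes.

```python
import json

# Hypothetical builder-side verifier: valid-JSON-with-required-keys is an
# objectively checkable outcome, so tasks framed this way play to the
# strengths of RLVR-trained models.

def is_valid_extraction(output: str, required_keys: set[str]) -> bool:
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()


print(is_valid_extraction('{"name": "Ada", "year": 1843}', {"name", "year"}))  # True
print(is_valid_extraction("Sure! The name is Ada.", {"name", "year"}))         # False
```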

Common confusion: RLVR does not make a model good at everything. It specifically strengthens performance in domains with verifiable answers and has less impact on open-ended tasks like writing advice or strategic recommendations.