Glossary definition

Reasoning / Standard term

Test-time compute

Extra computation the model performs while generating an answer, as opposed to during training. The model "thinks harder" at the moment you ask it something, trading speed and cost for better accuracy on difficult problems.

Training a model is like studying for an exam. Test-time compute is like taking extra time during the exam to double-check your answers. When a problem is hard, the model can try multiple approaches, verify intermediate steps, or explore different reasoning paths before settling on a final answer. This is the principle behind reasoning models that visibly "think longer" on complex questions. The more compute you allocate at inference time, the better the model tends to perform on tasks that benefit from deliberation.
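One concrete way to spend test-time compute is best-of-n sampling: generate several candidate answers and keep the one a verifier scores highest. The sketch below is a toy illustration, not a real model call; `sample_answer` stands in for one sampled attempt, and `verify` stands in for a scoring signal (here, exact distance from the true answer to "what is 17 × 23?").

```python
import random

def sample_answer(rng):
    # Stand-in for one sampled model attempt at "what is 17 * 23?".
    # A real system would sample a full reasoning trace here.
    return 391 + rng.randint(-5, 5)  # noisy guesses around the true product

def verify(answer):
    # Stand-in verifier: closer to the true answer scores higher (max is 0).
    return -abs(answer - 17 * 23)

def best_of_n(n, seed=0):
    # More test-time compute = larger n: sample n candidates, keep the best.
    rng = random.Random(seed)
    answers = [sample_answer(rng) for _ in range(n)]
    scores = [verify(a) for a in answers]
    best = max(range(n), key=lambda i: scores[i])
    return answers[best], scores[best]
```

With a fixed seed, a larger `n` extends the same candidate stream, so the best score can only improve as you spend more compute — which is exactly the dial the definition describes.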

Builder example

Test-time compute gives you an adjustable quality dial. For a customer support chatbot handling simple FAQ questions, extra thinking time wastes money with no accuracy gain. For an agent writing and debugging multi-file code changes, a higher reasoning budget can mean the difference between a working solution and an expensive failure. Match the compute budget to the task difficulty.
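In practice, matching compute to difficulty often means routing requests to different reasoning budgets. The sketch below is hypothetical: the tier names, token budgets, and the request shape are illustrative placeholders, not any particular provider's API.

```python
# Hypothetical per-task reasoning budgets (illustrative numbers, not a real API).
BUDGETS = {
    "faq": 0,                  # simple lookup: answer directly, no extra thinking
    "analysis": 2_000,         # moderate: some deliberation pays off
    "multi_file_code": 20_000, # hard: a large budget can avert an expensive failure
}

def route(task_type: str, prompt: str) -> dict:
    # Build a request with a reasoning budget matched to task difficulty.
    budget = BUDGETS.get(task_type, 2_000)  # unknown tasks get the mid tier
    return {"prompt": prompt, "max_reasoning_tokens": budget}
```

For example, `route("faq", "What are your hours?")` spends nothing extra, while `route("multi_file_code", ...)` requests the full budget — the adjustable quality dial in code form.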

Common confusion: More thinking time does not guarantee better answers. If the model is working from bad context, flawed instructions, or a poor scoring signal, extra compute just means it arrives at the wrong answer more slowly and expensively.