Training / Standard term
Pretraining
The first and most expensive training stage, where a model reads massive amounts of text, code, and other data to learn general language patterns, factual knowledge, and reasoning abilities.
During pretraining, a model processes trillions of words and code tokens from books, websites, code repositories, and other sources. Its task is simple: predict what comes next. Through trillions of these predictions, the model develops broad capabilities in language understanding, factual recall, logical reasoning, and code generation. Modern pretraining often incorporates images and other data types alongside text. This stage typically costs millions of dollars in computing resources and takes weeks or months to complete, so only well-funded labs can do it from scratch.
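The "predict what comes next" objective can be illustrated with a deliberately tiny sketch. Real pretraining uses a neural network over trillions of tokens; the count-based bigram model below (all names hypothetical) only shows the shape of the task: observe which token follows which, then use those statistics to predict continuations.

```python
from collections import Counter, defaultdict

# A toy "corpus" of nine tokens. Real pretraining corpora contain
# trillions of tokens of text and code.
corpus = "the cat sat on the mat the cat ran".split()

# "Training": count how often each token follows each preceding token.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(token):
    """Return the continuation seen most often during training."""
    return following[token].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" twice, "mat" once -> "cat"
```

A neural language model replaces the lookup table with learned parameters and conditions on long contexts rather than a single previous token, but the objective is the same: make the next-token prediction match the data.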
Builder example
Pretraining sets the ceiling for what a model can do. Later stages like instruction tuning and RLHF shape how the model behaves, but they cannot add knowledge or reasoning abilities that pretraining missed. If a model was pretrained on very little medical literature, fine-tuning it for medical tasks will yield limited results. Choosing a model whose pretraining data covers your domain is more effective than trying to patch knowledge gaps afterward.
Common confusion: Pretraining creates the model's knowledge and reasoning foundation, but the helpful assistant personality you interact with comes from later post-training stages. A raw pretrained model would just autocomplete your text rather than answer your question.