Training / Standard term
Distillation
Training a smaller, cheaper model to reproduce the behavior of a larger, more capable one, so you get similar quality at a fraction of the cost.
A large "teacher" model generates high-quality outputs on thousands of examples. A smaller "student" model then trains on those outputs, learning to approximate the teacher's reasoning and style. A 70-billion-parameter model might produce excellent customer support responses, and a 7-billion-parameter student could learn to handle 90% of those cases at one-tenth the running cost. The tradeoff: distilled models tend to lose performance on rare or unusually complex inputs where the teacher's extra capacity mattered most.
Builder example
Distillation is why small models sometimes punch far above their weight class. When you see a compact model handling tasks that seem too sophisticated for its size, distillation from a stronger teacher is usually the explanation. For production cost planning, you can often deploy a distilled model for routine traffic and reserve the full-size model for edge cases.
A frontier model labels thousands of support tickets; a smaller model then learns to reproduce those labels cheaply.
Audit the labels, review a sample of mistakes, and keep the stronger model for ambiguous cases.
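One way to act on that advice is confidence-based routing with a small audit sample: the student handles routine tickets, low-confidence ones escalate to the teacher, and a few routine answers are set aside for human review. A minimal sketch, where `student_answer`, `teacher_answer`, the 0.8 confidence threshold, and the 2% audit rate are all hypothetical:

```python
# Illustrative routing sketch; both model calls are placeholders, and the
# threshold and audit rate are made-up numbers to tune for your traffic.
import random

def student_answer(ticket: str) -> tuple[str, float]:
    return f"[student reply to: {ticket}]", 0.9   # placeholder for the distilled model

def teacher_answer(ticket: str) -> str:
    return f"[teacher reply to: {ticket}]"        # placeholder for the full-size model

def handle(ticket: str, audit_log: list) -> str:
    reply, confidence = student_answer(ticket)
    if confidence < 0.8:            # ambiguous case: escalate to the stronger model
        return teacher_answer(ticket)
    if random.random() < 0.02:      # sample routine traffic for human audit
        audit_log.append((ticket, reply))
    return reply
```

In practice the confidence signal might come from token probabilities, a separate classifier, or the student explicitly flagging uncertainty; the routing logic stays the same.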
Common confusion: Distilled models inherit the teacher's blind spots along with its strengths. If the teacher consistently gets a certain type of question wrong, the student will too.