Synthetic data

Training or evaluation data generated by an AI model rather than collected from humans or the real world. It lets you build large datasets quickly, provided quality control is in place.

Collecting thousands of training examples from humans is slow and expensive. Synthetic data addresses this by having a capable model generate the examples: it can write questions and answers, create coding problems with solutions, or produce reasoning chains. You might use a frontier model to generate 50,000 customer support conversations to fine-tune a smaller model. The risk: AI-generated data tends to cluster around common patterns and repeat the generating model's biases, missing rare edge cases that real-world data would capture. When models are repeatedly trained on unfiltered synthetic data, including their own outputs, the range of what they produce can narrow with each round, a problem researchers call model collapse.

Builder example

If you are fine-tuning a model, synthetic data can dramatically reduce the cost and time of building your training set. The critical step is quality control: generated examples need filtering through automated tests, human spot-checks, or external validators to catch errors before they get baked into the model. Mixing synthetic examples with real-world data helps preserve diversity and edge-case coverage.
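The filtering and mixing steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the `question`/`answer` schema, the function names `filter_synthetic` and `mix_datasets`, the `has_content` validator, and the `synthetic_fraction` parameter are all assumptions made for the example. Real validators would typically be stronger, such as running generated code against unit tests or sampling examples for human review.

```python
import random
from typing import Callable

# Hypothetical schema for one training example: {"question": ..., "answer": ...}
Example = dict[str, str]
Validator = Callable[[Example], bool]


def has_content(ex: Example) -> bool:
    """Toy validator: reject examples with an empty question or answer."""
    return bool(ex["question"].strip()) and bool(ex["answer"].strip())


def filter_synthetic(examples: list[Example], validators: list[Validator]) -> list[Example]:
    """Keep examples that pass every validator, dropping exact duplicates."""
    seen: set[tuple[str, str]] = set()
    kept: list[Example] = []
    for ex in examples:
        # Normalize so trivially different copies count as the same example.
        key = (ex["question"].strip().lower(), ex["answer"].strip().lower())
        if key in seen:
            continue
        if all(check(ex) for check in validators):
            seen.add(key)
            kept.append(ex)
    return kept


def mix_datasets(
    synthetic: list[Example],
    real: list[Example],
    synthetic_fraction: float,
    seed: int = 0,
) -> list[Example]:
    """Keep all real examples and add synthetic ones up to the target share.

    synthetic_fraction is the desired share of synthetic examples in the
    result; the actual share is lower if too few synthetic examples exist.
    """
    if not 0 <= synthetic_fraction < 1:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    wanted = round(len(real) * synthetic_fraction / (1 - synthetic_fraction))
    n_synth = min(wanted, len(synthetic))
    mixed = real + rng.sample(synthetic, n_synth)
    rng.shuffle(mixed)
    return mixed
```

Keeping every real example and capping the synthetic share is one simple design choice for preserving edge-case coverage; the right ratio depends on the task and is worth checking against a held-out set of real data.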

Common confusion: "Synthetic" means "generated," not "fake" or "low quality." Quality ranges from excellent to harmful depending on the generation and filtering pipeline. High-quality synthetic data with good verification can be as effective as human-collected data for many tasks.