Glossary definition

Failures / Research term

Model collapse

The gradual degradation that occurs when new AI models are trained on content generated by earlier AI models. Each generation loses a little more nuance, like a photocopy of a photocopy.

As AI-generated text floods the internet, future training datasets increasingly contain model output. When models train on this synthetic content, rare patterns, unusual perspectives, and subtle distinctions get washed out generation after generation. Each new model produces blander, more homogeneous output, which then becomes training data for the next model. Research published in Nature in 2024 demonstrated this effect across multiple generations of models.
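A toy simulation makes the dynamic concrete. The sketch below is illustrative only (it is not the Nature experiment): a "model" is just a token-frequency table, each generation is trained by re-estimating frequencies from a finite sample of the previous generation's output, and the rare tokens in the tail disappear one generation at a time.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 1_000   # toy "language" of 1,000 token types
sample_size = 5_000  # size of each generation's training corpus

# Generation 0: Zipf-like human data with a long tail of rare tokens.
probs = 1.0 / np.arange(1, vocab_size + 1)
probs /= probs.sum()

for generation in range(6):
    surviving = int((probs > 0).sum())
    print(f"gen {generation}: {surviving} tokens still have nonzero probability")

    # "Generate" a corpus by sampling from the current model's distribution...
    corpus = rng.choice(vocab_size, size=sample_size, p=probs)

    # ...then "train" the next model by re-estimating frequencies from that
    # synthetic corpus. Tokens that were never sampled drop to zero and,
    # once gone, can never come back.
    counts = np.bincount(corpus, minlength=vocab_size)
    probs = counts / counts.sum()
```

Each run shows the number of surviving token types shrinking generation after generation, which is the same washing-out of rare patterns described above, just in miniature.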

Builder example

If you fine-tune a model on a dataset composed mostly of AI-generated text, you risk baking in this degradation from the start. Your model's outputs will lack the richness and edge-case awareness that come from genuine human-authored data. As more web content becomes AI-generated, this risk grows for anyone building training or fine-tuning pipelines.

Scenario: A company generates thousands of synthetic tickets from model-written examples and trains on them.

Mitigation: Mix in real resolved cases, track provenance, sample rare cases deliberately, and audit for narrowing.
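One way those mitigations might fit into a data-assembly step is sketched below. The record fields ("source", "category"), the 70/30 human-to-synthetic mix, the rare-case boost factor, and the helper names are hypothetical choices for illustration, not part of this entry.

```python
import random
from collections import Counter

def build_training_set(human_cases, synthetic_cases, rare_categories,
                       human_fraction=0.7, rare_boost=3, seed=0):
    """Mix real and synthetic records, keeping provenance and boosting rare cases."""
    rng = random.Random(seed)

    # Choose enough human records that they make up `human_fraction` of the mix.
    n_human = int(len(synthetic_cases) * human_fraction / (1 - human_fraction))
    n_human = min(n_human, len(human_cases))

    # Track provenance: every record keeps its "source" tag ("human" or
    # "synthetic") so later audits can measure how much is model-written.
    mixed = rng.sample(human_cases, n_human) + list(synthetic_cases)

    # Deliberately oversample rare categories (by duplication here) so the
    # tail is not washed out of the training set.
    rare = [r for r in mixed if r["category"] in rare_categories]
    mixed += rare * (rare_boost - 1)

    rng.shuffle(mixed)
    return mixed

def audit_for_narrowing(dataset):
    """Report source mix and category coverage so narrowing is visible early."""
    sources = Counter(r["source"] for r in dataset)
    categories = Counter(r["category"] for r in dataset)
    print("source mix:", dict(sources))
    print("distinct categories:", len(categories))
    print("rarest categories:", categories.most_common()[:-6:-1])
```

The audit is deliberately simple; the point is to surface a shrinking category tail or a drifting human-to-synthetic ratio before it reaches training.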

Common confusion: Model collapse is a multi-generational training problem, in which models trained on AI output produce worse training data for the next generation. It is distinct from mode collapse, a failure within a single model (classically a GAN generator) whose outputs cluster around a few repetitive modes.