
Degenerative AI: The Risks of Training Systems on Their Own Data


The New York Times recently published a thought-provoking interactive article examining the dangers of training generative AI models on their own output. The crux of the article is the risk that these models, which create text and images, might eventually be trained on their own generated content rather than on diverse, human-curated data. This self-looping process can lead to a disturbing convergence toward incoherence, with outputs becoming increasingly homogeneous and biased over time.

The key insight here is that generative AI systems are based on probabilities. They're designed to predict and generate the most likely next word or image based on their training data. However, if their outputs start seeding future training datasets, the models will inevitably become more likely to generate only the “most probable” outputs, iteratively converging on narrower and narrower forms of expression. This is good neither for creativity nor for maintaining a realistic diversity of responses.
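
To see why this feedback loop narrows expression, consider a toy simulation (mine, not the Times's): a "model" is just a categorical distribution over vocabulary items, and each generation it is re-estimated from a finite sample of its own slightly mode-seeking output. The vocabulary size, sample size, and temperature below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 1000
sample_size = 5000
temperature = 0.9          # <1.0 nudges generation toward the "most probable"
generations = 20

# Start from a long-tailed, Zipf-like distribution over the vocabulary.
probs = 1.0 / np.arange(1, vocab_size + 1)
probs /= probs.sum()

def effective_vocab(p):
    """Perplexity-style measure of diversity: exp(entropy)."""
    nz = p[p > 0]
    return float(np.exp(-(nz * np.log(nz)).sum()))

for g in range(generations):
    # "Generate" a corpus: sample from the slightly sharpened distribution.
    sharpened = probs ** (1.0 / temperature)
    sharpened /= sharpened.sum()
    corpus = rng.choice(vocab_size, size=sample_size, p=sharpened)

    # "Retrain": re-estimate the model from its own output.
    counts = np.bincount(corpus, minlength=vocab_size)
    probs = counts / counts.sum()

    if g % 5 == 0 or g == generations - 1:
        print(f"generation {g:2d}: effective vocabulary ~ {effective_vocab(probs):7.1f}")
```

Even with only a mild preference for likely outputs, rare items fall out of the sample, never come back, and the effective vocabulary shrinks generation after generation. That narrowing is exactly what the article warns about.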

Unless we actively curate training data, these systems might lose the richness and variability that their operations require. Of course, the process of data curation—picking and choosing diverse and representative training examples—goes against the overall philosophy of scale, which relies on vast, loosely filtered datasets. But without effective curation, we are likely to see an amplification of existing biases, along with a loss of diversity in language and imagery. 

One solution lies in the introduction of a more nuanced intelligence model that integrates human or other curatorial oversight. In other words, we could design secondary intelligent systems capable of making informed decisions about which data should be used for training, ensuring the quality and diversity of inputs.
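
As a very rough sketch of what such a curatorial layer might do, here is an illustrative filter (not drawn from the article) that admits candidate training examples only if they pass a quality check and are not too similar to what has already been kept. The trigram representation, the thresholds, and the quality_score stub are stand-ins for whatever embedding model, learned classifier, or human judgment a real system would use.

```python
from collections import Counter
import math

def trigram_vector(text):
    """Represent a candidate example as a bag of character trigrams."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def quality_score(text):
    """Stand-in for a learned quality model; here, just favor non-repetitive text."""
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0

def curate(candidates, min_quality=0.5, max_similarity=0.8):
    """Greedily admit candidates that are good enough and unlike what we already kept."""
    kept, kept_vecs = [], []
    for text in sorted(candidates, key=quality_score, reverse=True):
        if quality_score(text) < min_quality:
            continue
        vec = trigram_vector(text)
        if all(cosine(vec, v) < max_similarity for v in kept_vecs):
            kept.append(text)
            kept_vecs.append(vec)
    return kept

pool = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog!",   # near-duplicate: dropped
    "Generative models learn the statistics of their training data.",
    "data data data data data",                        # low quality: dropped
]
for example in curate(pool):
    print(example)
```

The point is not the specific heuristics; it is that quality and diversity checks sit in front of the training set, rather than being applied to outputs after the fact.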

This need for a more anthropocentric approach to AI training highlights the flaws in current methodologies. Just as human oversight remains critical today, the same curatorial principles should eventually be built into intelligent systems themselves. It’s time to consider the notion that part of what intelligence means involves making decisions about what to learn and how to learn it.

If an AI system is learning to recognize hand-drawn digits, it might be fed a set of curated flashcards featuring well-chosen examples. But imagine a system whose job it is to select the best training examples for its own learning. This self-selective approach might truly transcend the limitations of current generative models. 
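
One concrete, existing version of this idea is active learning, in which the learner itself chooses the examples it most wants to study next. The sketch below, using scikit-learn's digits dataset and uncertainty sampling, is only an illustration of that general pattern; the model, starting set, and budget are arbitrary choices, not anything proposed in the article.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
labeled = list(rng.choice(len(X_pool), size=20, replace=False))   # tiny starting "flashcard" set
unlabeled = [i for i in range(len(X_pool)) if i not in set(labeled)]

model = LogisticRegression(max_iter=2000)

for round_num in range(10):
    model.fit(X_pool[labeled], y_pool[labeled])

    # Score the remaining pool by uncertainty (low top probability = unsure).
    probs = model.predict_proba(X_pool[unlabeled])
    uncertainty = 1.0 - probs.max(axis=1)

    # The learner picks its own flashcards: the 20 examples it finds hardest.
    chosen = np.argsort(uncertainty)[-20:]
    for idx in sorted(chosen, reverse=True):
        labeled.append(unlabeled.pop(idx))

    acc = model.score(X_test, y_test)
    print(f"round {round_num:2d}: {len(labeled):4d} labeled examples, test accuracy {acc:.3f}")
```

With a small budget, accuracy typically climbs faster this way than with the same number of randomly chosen examples, which is the payoff of letting the learner decide what it needs to see next.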

Currently, human curation of synthetic data yields better results. So why not integrate this level of discernment directly into machine models from the start? Right now, we often rely on human-in-the-loop reinforcement learning, with people marking generated text for appropriateness or relevance. But future systems could preemptively avoid generating substandard or objectionable content by never being exposed to it in the training phase.

The issue extends beyond text and into image generation, with the growing possibility of overwhelming both mediums with AI-generated content. We’re accelerating towards a world where synthetic data could dominate our informational landscape, making the need for thoughtful curation even more urgent. 

While current AI systems depend heavily on sheer volume to improve and scale, human intelligence demonstrates that quality, not quantity, can lead to more effective learning processes. As such, it’s imperative to rethink AI training methodologies so that curatorial intelligence becomes an intrinsic part of AI development. 

The real challenge—and opportunity—lies in creating intelligent systems that not only produce high-quality content but are also discerning about the data they learn from. By fostering this level of autonomy and wisdom, we can better navigate the risks and rewards of AI, ensuring that these systems enrich our world rather than diminishing its diversity and depth. 

Kristian Hammond
Bill and Cathy Osborn Professor of Computer Science
Director of the Center for Advancing Safety of Machine Intelligence (CASMI)
Director of the Master of Science in Artificial Intelligence (MSAI) Program
