This is an archive article published on June 14, 2023
What happens when AI learns from AI? Researchers warn of impending chaos with ‘model collapse’
Researchers have described a degenerative process called “model collapse” where the model becomes detached from reality and corrupted by its own output.
Written by Zohaib Ahmed
New Delhi | Updated: June 15, 2023 04:15 PM IST
3 min read
Widespread AI usage will likely eventually lead to AI-generated content being fed to large language models as training data. (Image: geralt/Pixabay)
As more and more content creation workflows rely on ChatGPT and other tools to boost efficiency, we can expect a significant portion of web content to eventually be AI-generated. Unfortunately, this could pose a serious risk for the future of large language models, which currently depend on human-produced data scraped from the web.
In a research paper, a group of researchers from the universities of Cambridge and Oxford, the University of Toronto, and Imperial College London caution about a scenario where LLMs end up using AI-generated data as part of their training data. Titled “The Curse of Recursion: Training on Generated Data Makes Models Forget,” the paper describes a degenerative process called “model collapse,” in which the model becomes detached from reality and corrupted by its own output.
(Image credits: “The Curse of Recursion: Training on Generated Data Makes Models Forget” research paper)
This scenario grows more likely with the widening use of artificial intelligence tools. As AI-generated content spreads across the web, it will increasingly be fed to large language models as training data, leading to inaccuracies and distortions in their output.
This problem was observed in large language models, Variational Autoencoders, and Gaussian Mixture Models, which over time begin to “forget the true underlying data distribution.” The result is an increasingly inaccurate depiction of reality, because the data these models are trained on becomes so polluted that it bears little resemblance to real-world data.
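The mechanism can be illustrated with a toy simulation (this is a hedged sketch, not the paper’s actual experiment; the function names, sample sizes, and the choice of a simple Gaussian “model” are illustrative assumptions). Each generation of the model is fitted only on samples generated by the previous generation, so estimation noise compounds and the learned distribution drifts away from the original human-produced one:

```python
import random
import statistics

def fit_gaussian(samples):
    # "Train" a model: estimate the mean and standard deviation of the data.
    return statistics.mean(samples), statistics.stdev(samples)

def generate(mean, std, n, rng):
    # "Generate" new content by sampling from the fitted model.
    return [rng.gauss(mean, std) for _ in range(n)]

rng = random.Random(0)

# Generation 0: "human-produced" data drawn from the true distribution N(0, 1).
data = generate(0.0, 1.0, 200, rng)

stds = []
for generation in range(20):
    mean, std = fit_gaussian(data)
    stds.append(std)
    # Each new model is trained solely on the previous model's output.
    data = generate(mean, std, 200, rng)

# The estimated parameters perform a random walk: with each generation,
# sampling error accumulates and the rare "tails" of the original
# distribution are progressively lost.
print(f"estimated std, generation 1:  {stds[0]:.3f}")
print(f"estimated std, generation 20: {stds[-1]:.3f}")
```

Because no fresh human data ever enters the loop, errors in each fitted generation become the ground truth for the next one, which is the recursive “curse” the paper’s title refers to.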
Given the serious risk of model collapse, the researchers highlight the importance of access to the original data distribution, which is generally human-produced. After all, AI language models are designed to interact with humans and therefore need to stay in touch with reality to properly simulate our world.
To address this problem, the researchers have proposed several smarter approaches to training large language models. One is the “first-mover advantage,” which stresses the importance of preserving access to the original, human-generated data source.
However, since differentiating AI-generated data from human-produced data is difficult, the paper argues that “community-wide coordination” is essential, so that the different parties involved in creating and deploying LLMs share the information needed to determine the provenance of data.
“Otherwise, it may become increasingly difficult to train newer versions of LLMs without access to data that was crawled from the Internet prior to the mass adoption of the technology, or direct access to data generated by humans at scale,” adds the paper.
But amid the rising use of generative AI and concerns around the technology taking over jobs, there is a silver lining for human creators. The research paper theorises that as the internet gets populated with AI-generated data, human-created content will grow increasingly valuable – even if it’s only as a source for uncontaminated data to train large language models.