Single Real Datapoint Prevents AI 'Model Collapse', Scientists Find

Scientists believe they may have discovered a way to overcome 'model collapse', a phenomenon that threatens the training of artificial intelligence as we know it. To improve, AI systems such as ChatGPT must be trained with increasing amounts of real data. However, much of that data is sourced from internet content, which itself is often generated by similar AI models.

The volume of authentic data is rapidly diminishing and is predicted to be exhausted as early as this year. Data produced by other AI systems could quickly lead to 'model collapse', a scenario where AI engages in 'data cannibalism'—training on its own outputs—resulting in rapidly diminishing usefulness and a higher propensity for dangerous falsehoods.

Researchers have now suggested that incorporating just a single datapoint from the outside world can prevent this problem. Their study utilized a set of statistical models known as 'Exponential Families'. The work demonstrated that training systems exclusively on self-generated data invariably leads to model collapse. However, introducing one datapoint from outside the model—such as previously acquired knowledge—prevents the effect, even when the amount of machine-generated data is infinitely large.
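The dynamic described above can be illustrated with a toy simulation. This is not the authors' exponential-family analysis, just a minimal sketch under simplified assumptions: a model repeatedly refits the mean of a unit-variance Gaussian (a one-parameter exponential family) to samples drawn from its own previous fit. Without outside data the estimate drifts away from the truth like a random walk; mixing a single real datapoint into every refit anchors it.

```python
import numpy as np

def iterate(generations=2000, n=10, anchor=None, seed=0):
    """Repeatedly refit the mean of a unit-variance Gaussian to n samples
    drawn from the previous generation's fitted model (self-training).
    If `anchor` is given, that one real datapoint is included in every fit."""
    rng = np.random.default_rng(seed)
    mu = 0.0  # start at the true mean of the real distribution
    for _ in range(generations):
        synthetic = rng.normal(mu, 1.0, size=n)
        if anchor is None:
            mu = synthetic.mean()                    # pure self-training
        else:
            mu = (synthetic.sum() + anchor) / (n + 1)  # one real point mixed in
    return mu

# A single datapoint sampled from the real distribution N(0, 1).
real_point = np.random.default_rng(42).normal(0.0, 1.0)

# Average over 200 independent runs to smooth out randomness.
free = [iterate(seed=s) for s in range(200)]
anchored = [iterate(anchor=real_point, seed=s) for s in range(200)]

rms_free = float(np.sqrt(np.mean(np.square(free))))        # drifts far from 0
rms_anchored = float(np.sqrt(np.mean(np.square(anchored))))  # stays near 0
print(rms_free, rms_anchored)
```

The unanchored estimate accumulates sampling noise generation after generation, so its distance from the true mean grows without bound; the single anchored datapoint pulls every refit back, keeping the error bounded no matter how many generations of synthetic data are used.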


'Previous work on model collapse primarily examined large, complicated LLMs, where it's unclear how these models work and whether results are repeatable—that opacity is why hallucinations go unexplained, with an AI generating a wrong answer and no way to trace why,' said Yasser Roudi, Professor of Disordered Systems at King's College London. 'By focusing on a simple model, we can establish, from an objective, statistical standpoint, why adding just one data point prevents models from generating gibberish.'

The researchers emphasized that such collapse is not limited to chatbots but could affect critical infrastructure, including self-driving cars. 'From this foundation, we can establish principles that will be vital in future AI construction,' Professor Roudi added. 'As larger models are deployed in areas touching our lives, from ChatGPT to self-driving cars, and synthetic data takes on a larger share of AI training, computer scientists will have the tools to prevent this potentially disastrous scenario.'

The findings are reported in a paper published in the journal Physical Review Letters.
