Synthetic Data Is a Dangerous Teacher

January 31, 2024

In the race to scale up, more and more AI models are being trained on low-quality data, which will further exacerbate inequalities of all kinds.

When DALL-E, a text-to-image visual-language model, launched in April 2022, it reportedly attracted more than a million users within its first three months. That was followed by ChatGPT, which in January 2023 reportedly reached 100 million active users just two months after launch. Both mark notable moments in the development of generative AI, which in turn has triggered an explosion of AI-generated content across the web. The bad news is that in 2024 this will also mean an explosion of fabricated, meaningless information, misinformation, and disinformation, exacerbating the negative social stereotypes encoded in these AI models.

The AI revolution has been driven not by any recent theoretical breakthrough (indeed, much of the foundational work underlying artificial neural networks has been around for decades) but by the availability of massive datasets. Ideally, an AI model captures a given phenomenon, whether human language, cognition, or the visual world, in a way that mirrors the real phenomenon as closely as possible.

For example, for a large language model (LLM) to produce human-like text, it must be fed vast amounts of data that somehow represent human language, interaction, and communication. The prevailing belief is that the larger the dataset, the better it captures human phenomena in all their natural beauty, ugliness, and even cruelty. We are in an era marked by an obsession with scaling up models, datasets, and GPUs: current LLMs have entered the era of trillion-parameter machine-learning models, which in turn demand billion-scale datasets. Where can data at that scale be found? On the web.
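
A rough calculation shows what that scale implies. The sketch below is a back-of-envelope estimate, not anything from this article: it assumes the "Chinchilla" heuristic from Hoffmann et al. (2022), which suggests roughly 20 training tokens per model parameter for a compute-optimal LLM.

```python
# Back-of-envelope dataset sizing under the Chinchilla heuristic
# (~20 training tokens per parameter). The ratio is a rule of thumb
# assumed here for illustration, not a hard requirement.

TOKENS_PER_PARAMETER = 20

for n_params in (70e9, 500e9, 1e12):  # 70B-, 500B-, and 1T-parameter models
    n_tokens = n_params * TOKENS_PER_PARAMETER
    print(f"{n_params / 1e9:6.0f}B parameters -> ~{n_tokens / 1e12:.1f}T training tokens")
```

By this estimate, a trillion-parameter model wants on the order of 20 trillion training tokens, a volume of text that only the open web comes close to supplying.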

These web-sourced data are assumed to capture the "ground truth" of human communication and interaction, a proxy against which language can be modeled. Although various researchers have shown that online datasets are often of poor quality, exacerbate negative stereotypes, and frequently contain problematic content such as racial slurs and hate speech directed at marginalized groups, none of this has stopped major AI companies from using such data in the race to scale up.

With generative AI, this problem is about to get much worse. Far from objectively representing the social world, these models encode and reinforce societal stereotypes. Indeed, recent studies show that generative models encode and reproduce racist and discriminatory attitudes toward historically marginalized identities, cultures, and languages.

It is currently difficult, if not impossible, to know exactly how much synthetic text, image, audio, and video data is being created, or at what rate, even with state-of-the-art detection tools. Stanford University researchers Hans Hanley and Zakir Durumeric estimate a 68 percent increase in the number of synthetic articles published on Reddit between January 1, 2022, and March 31, 2023, and a 131 percent increase in synthetic misinformation news articles over the same period. Boomy, the online music-generation company, claims to have produced 14.5 million songs (or 14 percent of recorded music) so far. In 2021, Nvidia predicted that by 2030 there will be more synthetic data than real data in AI models. One thing is certain: the web is being flooded with synthetically generated data.

What is worrying is that these vast quantities of generative-AI output will be used as training material for future generative models. As a result, in 2024 a significant portion of the training material for generative models will consist of synthetic data produced by generative models. We will soon be trapped in a recursive loop in which we train AI models on synthetic data produced by other AI models, much of it contaminated with stereotypes that will continue to entrench historical and social inequalities. Unfortunately, this will also be the data used to train generative models applied to high-stakes fields including medicine, therapy, education, and law. We have yet to face the disastrous consequences of this. In 2024, the explosion of AI-generated content that we find so impressive now will have turned into a massive toxic dump that comes back to bite us.
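
To make that recursive cycle concrete, here is a minimal toy sketch of the dynamic, my own illustration rather than anything from this article or a real training pipeline: each generation fits a simple Gaussian model to samples drawn from the previous generation's model, the way future systems would train on the synthetic output of today's.

```python
import numpy as np

# Toy model-collapse loop: generation t+1 is fit to synthetic samples
# drawn from generation t, with no fresh "real" data ever added.

rng = np.random.default_rng(42)

mu, sigma = 0.0, 1.0   # generation 0: the "real" data distribution
n = 25                 # synthetic training examples per generation

print(f"generation   0: std = {sigma:.3f}")
for gen in range(1, 101):
    synthetic = rng.normal(mu, sigma, n)           # sample from the current model
    mu, sigma = synthetic.mean(), synthetic.std()  # refit the next model on them
    if gen % 20 == 0:
        print(f"generation {gen:3d}: std = {sigma:.3f}")

# Rare, tail-of-the-distribution cases are undersampled at every step,
# so the fitted spread tends to shrink from generation to generation:
# the model's picture of the world narrows until diversity is gone.
```

The spread of the fitted distribution steadily decays, the same loss of diversity that researchers studying "model collapse" have documented in language models trained on their own output.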

Source: Wired magazine
