Training Data for the Price of a Sandwich: Common Crawl’s Impact on Generative AI

Stefan Baack

Abstract
ChatGPT and other generative AI products would not have been possible without Common Crawl, a massive repository of web crawl data. Yet most people have never heard of it. In this talk, we show what Common Crawl is and how it has influenced generative AI to date.
Lightning Box 2
Short talk
English
Conference

Common Crawl is the largest freely available archive of web crawl data and one of the most important sources of training data for the large language models powering generative AI products such as the free version of ChatGPT. It is used so frequently, and in many cases makes up such a large proportion of the overall training data, that it has arguably become a foundational building block of generative AI products. Despite this importance, Common Crawl itself is poorly understood, which has invited false and problematic assumptions: that it represents the “entire internet” and enables AI builders to train their models on the “sum of human knowledge.” Drawing on interviews with Common Crawl staffers and an analysis of online documentation, we discuss what Common Crawl’s popularity means for the transparency and fairness of generative AI products.

In the talk, we first highlight how Common Crawl collects its data and how AI builders use it to train generative AI models. We then discuss the consequences. On the one hand, Common Crawl’s popularity has in many ways made generative AI more open to scrutiny, and its openness has enabled generative AI research and development beyond the well-resourced leading AI companies. On the other hand, many AI builders have used Common Crawl as a training dataset in problematic ways. Often there is too little information about how its massive crawl data was filtered for harmful content before model training, and there is a common reliance on rudimentary automated filtering techniques that fail to remove much harmful content while also harming the representation of digitally marginalized communities. We offer recommendations for Common Crawl and AI builders to improve the transparency and fairness of generative AI.

Research and data analyst