Community Insights: Best Practices for Open Datasets for LLM Training

Kasia Odrozek

Summary
As large language models increasingly shape our digital ecosystem, the methods of data collection and curation have become a complex battleground of legal, ethical, and technical challenges. This talk discusses pioneering community efforts toward creating open and responsible AI training datasets.
Loft
Short talk
English
Conference

The landscape of AI training data is at a turning point. As large language models shape our digital world, data collection faces legal, ethical, and technical challenges. Creators' concerns have led to lawsuits and, in turn, to reduced transparency from AI companies, which affects both accountability and innovation. Open access data could help, but building large-scale, competitive models with it remains difficult because of unreliable metadata, high digitization costs, a “consent crisis,” and the need for expert skills.

In this session, we explore pioneering community efforts to create open, responsibly managed AI datasets. We’ll discuss insights from a June 2024 gathering of 30 dataset builders (including Hugging Face, Pleias, Cohere4AI, LLM360 and others) and a co-created paper on best practices for open LLM training datasets. Join us as we chart a path toward a transparent, ethical, and public AI ecosystem.

This programme session is supported by Stiftung Mercator.

Kasia Odrozek
Independent advisor