re:publica x srh CAMPUS
3-5 September 2025
SRH Berlin University
The landscape of AI training data is at a turning point. As large language models shape our digital world, data collection faces legal, ethical, and technical challenges. Creators' concerns have led to lawsuits and to AI companies becoming less transparent, which affects accountability and innovation. While open-access data could help, building large-scale, competitive models with it remains difficult due to issues like unreliable metadata, high digitization costs, a “consent crisis,” and the need for expert skills.
In this session, we explore pioneering community efforts to create open, responsibly managed AI datasets. We’ll discuss insights from a June 2024 gathering of 30 dataset builders (including Hugging Face, Pleias, Cohere4AI, LLM360 and others) and a co-created paper on best practices for open LLM training datasets. Join us as we chart a path toward a transparent, ethical, and public AI ecosystem.
This programme session is supported by Stiftung Mercator.