Free the (AI training) data!?! Negotiating data availability, accessibility and quality

Uli Köppen, Christina Elmer, Andreas Hauschke, Simon David Hirsbrunner

Artificial Intelligence (AI) technologies are hungry for data. As a result, data for AI training, validation and testing are becoming the critical resource of our information age. Our panel discusses how to make these data resources better available and AI development more accessible, inclusive and trustworthy.

Artificial Intelligence (AI) systems are trained with massive amounts of data. The availability and quality of these training datasets are critical factors in determining how reliable, trustworthy and ethical AI products and services are, and will be in the future. Training datasets are not only a valuable resource but have become a profitable product, marketed on a global scale. They are not only obtained as a by-catch of our online activities but, in the case of synthetic data, are also designed specifically for the purpose of training, testing and evaluating AI models.

But who gets access to training data, and who doesn't? What does this mean for power relations in the age of AI and surveillance capitalism? Making training data as open and accessible as possible is a promising strategy to render AI more trustworthy and more democratic. It would improve the source credibility and explainability of authoritative but opaque applications such as ChatGPT. Accessible training data can not only be effectively scrutinized by various actors, but also be supplemented and improved by harnessing the wisdom of the crowd. The advantages and disadvantages, potentials and risks of datasets could be evaluated and documented. Last but not least, open training data would also make AI more affordable and thus accessible to less wealthy actors such as journalists, NGOs, administrations and developers in the Global South. At the same time, we must protect privacy, digital self-determination and usage rights, and prevent the free flow of false, misinforming and otherwise harmful information. In other words, we should make training data as open as possible and as closed as necessary.

At the panel discussion, we address multiple access barriers to training data and different ways to overcome them. Topics will include the potential of open data for AI, privacy protection obligations and safeguards, requirements and standards for high-quality data, privileged access via data trusts, the potentials and risks of synthetic data, and specific considerations for general-purpose AI. The event will feature lightning talks by three renowned experts from academia, media and regulation, and allow for an exchange with the audience.

Uli Köppen
Head of AI + Automation Lab, Co-Lead BR Data
Christina Elmer
Professor of Digital Journalism & Data Journalism
Andreas Hauschke
Project Manager for Trustworthy AI