Data Commons Can Save Open AI
This year will be remembered as a breakthrough year for open source AI systems. The mood has shifted away from the fear of risks associated with open source AI that dominated public debate in the last two years. The release of DeepSeek’s open weights models proved once more that “there is no moat” and that open solutions can both compete with closed foundation models and support innovation in open development ecosystems.
In recent months, a steady stream of new models, versions and derivatives has become the norm. It's not an exaggeration to say that there's a boom in open source AI models.
Unfortunately, against this backdrop, a problem stands out in the foreground: the data landscape has stalled. There has been far less progress on public or open training datasets, even though everyone agrees that data is the key resource needed to build better AI systems.
Where’s the Open Data? Hugging the Crawl
One of the more important dataset releases in 2024 was HuggingFace's FineWeb, dubbed the "finest 15 trillion tokens that the web has to offer." It is a cleaned and optimized version of the Common Crawl dumps, which have been the stock source of training data for almost all LLMs. Another major release, AI2's Dolma dataset, also refines Common Crawl data and combines it with selected sources of open data.
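To give a sense of how accessible these refined crawls are, here is a minimal sketch of sampling FineWeb with the Hugging Face `datasets` library. The repository id, config name and record fields shown are assumptions to verify against the dataset card, not guaranteed specifics.

```python
# Minimal sketch: sampling FineWeb via the Hugging Face `datasets` library.
# The repository id ("HuggingFaceFW/fineweb"), config name ("sample-10BT") and
# field names are assumptions; check the dataset card before relying on them.
from datasets import load_dataset

# Streaming avoids downloading the multi-terabyte corpus up front.
fineweb = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)

for i, record in enumerate(fineweb):
    # Each record is expected to carry cleaned web text plus Common Crawl provenance.
    print(record.get("url"), (record.get("text") or "")[:80])
    if i >= 4:
        break
```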
Recent progress in open datasets shows promise for building fully open AI models without legal constraints. Pleias, a French startup, has created Common Corpus, an LLM training dataset based only on permissively licensed sources. Spawning has created PD12M, a public domain dataset with over 12 million image-text pairs.
While these advances benefit AI development and enable open source AI creation, they focus primarily on extracting maximum value from existing resources through aggregation and refinement.
The Unspoken Cost of Proprietary AI Data
Open source AI development remains at a constant disadvantage. Private AI labs that release closed models and don’t disclose data sources are tapping into various types of proprietary data or data for which they don’t have a legal basis for reuse. Stefano Maffulli from the Open Source Initiative (OSI) describes this as strip-mining the data generated by people and feeding it into a proprietary system that grants access at a price.
The stakes for data sharing are high and go beyond issues related to AI training. Stefaan Verhulst argues that we might be entering a "prolonged data winter." While corporate AI labs continue to rely on various proprietary sources of data, we see signals that data sharing is decreasing: web domains are restricting access for AI-related web crawls, and social networks are removing even the limited forms of data access that once existed. The data winter will be especially hard for open source AI developers, who lack the budgets to purchase proprietary data and whose commitments to data transparency and access further limit the sources they can work with.
From Exploitation to Collaboration
What kind of collective action can help prevent a data winter while strengthening approaches that combine data sharing with responsible governance, ensure data quality, and protect data rights?
Last summer, the Open Source Initiative and Open Future convened a group of experts to explore this challenge and propose a path forward. A recently released report from the convening, "Data Governance in Open Source AI," argues that collective action is needed to release more data and to improve data governance, balancing open sharing with responsible release.
Two paradigm shifts are needed. First, AI developers can no longer afford to build datasets alone, treating vast bodies of knowledge, culture and information as a raw resource they can turn into tokens. Datasets must be viewed as tools for solving AI development challenges and addressing other stakeholders’ needs. This entails collaboration, first of all, with stewards and owners of various open and public collections held by archives, research institutes, cultural organizations, and civic projects.
Second, we need to build on the foundations of open data but increasingly think of data as a commons. An open data approach can have great value for AI development and is especially suited for publicly held resources. Yet there are many types of data that could be useful but for which open sharing alone does not prevent data exploitation. We need a range of data sharing and governance models that balance openness with control. At the turn of 2024, a promising data trust pilot was launched not by an AI lab but by an art gallery: Serpentine Labs created a data trust to govern the Choral AI dataset, a collection of choir recordings.
The Next Revolution Won’t Be Scraped
The need to move beyond the data scraping paradigm, and the value of these two shifts, can be illustrated with the example of BlueSky. The platform shares its data publicly through an open API that's particularly well-suited for machine uses. In late 2024, a HuggingFace data archivist downloaded 1 million posts and packaged them into a publicly available training dataset. Several days later, the dataset was taken down under pressure from BlueSky users who objected to their data being used. In response, BlueSky began developing a framework for fine-grained expression of "user intents for data reuse," which is now being discussed with the user community.
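To make the "open API" point concrete, here is a rough sketch of pulling a few public posts over the AT Protocol. The endpoint, query parameters and response fields are assumptions based on the public lexicons, and any reuse at scale should honor the user-intent signals described above.

```python
# Rough sketch, not an official ingestion pipeline: fetch a handful of public
# BlueSky posts via an XRPC search endpoint. The endpoint, parameters and
# response fields are assumptions drawn from the public AT Protocol lexicons;
# large-scale reuse should respect users' stated intents for data reuse.
import requests

SEARCH_URL = "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts"  # assumed public AppView endpoint

resp = requests.get(SEARCH_URL, params={"q": "open source AI", "limit": 5}, timeout=10)
resp.raise_for_status()

for post in resp.json().get("posts", []):
    author = post.get("author", {}).get("handle", "unknown")
    text = post.get("record", {}).get("text", "")
    print(f"{author}: {text[:80]}")
```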
Hopefully, a refined approach to using BlueSky data for AI training will become the major dataset innovation that open source AI developers need and that demonstrates the value of participatory governance and the commons.
Building Collective Power to Keep AI Open, Transparent and Just
Many technical challenges remain around issues such as data quality and bias, data transparency, and environmental sustainability, but multiple teams are already working to solve them in open development ecosystems.
These teams require institutional support to make data commons efforts sustainable. At the AI Summit in Paris, the Current AI initiative was launched, with an initial budget of $400 million and a focus on data sharing rather than AI development. This creates an opportunity to establish a new data commons ecosystem that is as successful as the open source AI ecosystem.
To a large extent, the real innovation we'll see in open source AI going forward isn't in the models; it's in the datasets. We must do everything in our power to ensure that future datasets are built upon a data commons, with stewardship and control.