Analysis: EU’s AI platform requirements pose challenges to rights holders, says Danish Rights Alliance


The European Commission’s effort to develop a balanced AI Code of Practice (CoP) that protects the interests of rights holders while enabling a productive business model for AI platform providers has followed a process that began with a declaration of intent in 2023, followed by a draft Code, a comment period, and a revised final draft that went into effect on August 2, 2025.

Both the March draft of the CoP and the finalized version now in effect have been subject to a chorus of criticism from rights holders, who are concerned that AI platform providers are not being held to rigorous standards of provenance for the materials used to train their large language models, and that there are no clearly defined standards of transparency. AI platform providers, for their part, say the Code of Practice places vague, unrealistic and costly burdens upon them.


Meanwhile, enforcement does not begin until August 2, 2026, a full year after the regulation went into effect. History suggests that AI providers comply with regulation only when forced to do so.

In August, the Danish Rights Alliance produced an analysis of the transparency obligations AI platform providers must follow. It is summarized here; the full analysis is linked under Further reading below.

Insufficient transparency requirements

According to the AI Regulation, providers of general-purpose AI models must prepare and publish a sufficiently detailed summary of the content used to train the model. But the Commission’s template for that summary does not require enough information for rights holders to effectively exercise and enforce their copyrights.

1. Datasets

Only “large” publicly available datasets must be listed by name and link. In practice, AI providers are required to publish details about a dataset only if it constitutes more than 3 percent of all publicly available datasets used in training within a specific content category (e.g. text, audio or video). If the dataset constitutes less than 3 percent, its content need only be described in general terms, as the sketch below illustrates.
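
To make the threshold concrete, here is a minimal Python sketch of the rule as described above. The function name and the simplified size-based measurement are our own illustrative assumptions; the template’s actual methodology is more detailed.

```python
# Minimal sketch of the template's 3% "large dataset" rule as described
# above. The function name and the purely size-based measurement are
# illustrative assumptions, not the template's actual methodology.

def must_be_listed(dataset_gb: float, category_total_gb: float,
                   threshold: float = 0.03) -> bool:
    """True if a public dataset exceeds 3% of all publicly available
    training data in its content category, in which case the provider
    must publish its name and a link; otherwise a general description
    suffices."""
    return dataset_gb / category_total_gb > threshold
```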

Rights Alliance looked at the case of the Books3 dataset, which Meta, among others, used to train its Llama 1 AI model. Alongside Books3, Meta used a publicly available 3.3 TB text dataset from Common Crawl, plus data from a number of other public datasets, bringing the total to approximately 4.7 TB. Since Books3 consisted of at most 85 GB of text data, it made up roughly 1.8 percent of the total training data in the text category, well below the 3 percent threshold. This means that Meta would not have to disclose the name of, or a link to, Books3 if it placed Llama 1 on the European market today.
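
Plugging the approximate figures from that example into the rule shows why Books3 falls below the threshold. The sizes are those reported in the analysis; the calculation itself is illustrative.

```python
# Books3 vs. the 3% threshold, using the approximate sizes reported
# in the Rights Alliance analysis.
THRESHOLD = 0.03

books3_gb = 85        # Books3: at most ~85 GB of text
text_total_gb = 4700  # ~4.7 TB of text training data in total

share = books3_gb / text_total_gb
print(f"Books3 share of the text category: {share:.1%}")  # -> 1.8%
print("Must be named and linked:", share > THRESHOLD)     # -> False
```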

Since nearly every provider of general-purpose AI models trains on Common Crawl data, this dilution effect is expected to be a general obstacle to sufficient transparency for all popular AI models.

2. Collection of training materials from internet domains

According to the template, AI providers are required to list only the most “relevant” domains they have collected data from. This corresponds to the top 10 percent of domains, measured by the volume of data collected from each domain, sampled representatively across all content categories.
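
Here is a minimal sketch of that cut-off, assuming the rule works as the article describes it. The domain names and byte counts are invented for illustration.

```python
# Minimal sketch of the "most relevant domains" cut-off: keep only the
# top 10% of domains, ranked by the volume of data collected from each.
# Domain names and byte counts are invented for illustration.
import math

def most_relevant_domains(bytes_by_domain: dict[str, int]) -> list[str]:
    """Return the top 10% of domains by data volume collected."""
    ranked = sorted(bytes_by_domain, key=bytes_by_domain.get, reverse=True)
    cutoff = max(1, math.ceil(len(ranked) * 0.10))
    return ranked[:cutoff]

crawl = {"bignews.com": 900_000, "midblog.net": 120_000,
         "smallpress.dk": 8_000, "nichearchive.se": 5_000,
         "localpaper.fi": 3_000, "indiezine.no": 2_500,
         "poetrysite.is": 1_200, "fanwiki.lv": 900,
         "microforum.ee": 400, "tinyblog.lt": 150}

print(most_relevant_domains(crawl))  # -> ['bignews.com']
```

In this toy crawl of ten domains, only the single largest domain is disclosed; the smaller Danish, Swedish and Finnish domains never appear in the summary.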

This cap on transparency means that we will probably never learn about domains belonging to rights holders whose native language is less widespread than, say, English, Spanish, Mandarin or French.

3. Collection of training data from illegal file-sharing services

US lawsuits over AI and copyright have repeatedly revealed that AI providers such as Meta, Anthropic and OpenAI collected training data from illegal file-sharing services such as LibGen.

Because downloading content from LibGen and similar services involves no crawlers or bots, the Rights Alliance is concerned that providers will report such material under section 2.6, “other sources of data”, which requires only a “narrative” description of the data and its sources, if they choose to disclose content collected from illegal file-sharing services at all.

Further reading

EU transparency obligations for AI providers have entered into force: This is how rights holders are positioned. Danish Rights Alliance (Rettighedsalliancen), August 15, 2025.

Why it matters

Because enforcement against AI providers will not begin until August 2, 2026 at the earliest, rights holders will lack sufficient insight into whether their content has been used to train general-purpose AI models.

The disclosure thresholds, which exempt most training data from itemization, and the vague requirements for the content of the documentation mean that rights holders will have only limited insight into the use of their content. This will hamper enforcement work.

One hope is that the current objections may prompt the Commission to revise the template before its enforcement powers enter into force on August 2, 2026. 
