Before the large language models behind artificial intelligence platforms can produce anything useful, they must be 'trained,' meaning they must be 'fed' content that the models then use to draw inferences and produce results.
Much of what these platforms ingest comes from the Internet, and much of that, in turn, is copyrighted material. A great deal of this copyrighted content is thought to be unlicensed by the AI platform providers, but nobody knows for certain because the providers have become increasingly opaque about the sources of the data used in their models.
Research published in a September 2024 report by the Danish Rights Alliance shines a light on the extent of this opacity and concludes that, rather than owning up to their use of unlicensed content, almost all of the 13 AI platforms evaluated for the report have chosen obfuscation.
What is transparency and why is it important?
The Rights Alliance defines 'transparency' as the AI model provider's willingness to disclose what content was included in the training data (title and copyright information), the source of the content (URL or name of platform or service), and when the content was initially collected.
In addition to the obvious need to attribute ownership and ensure that the owner is compensated or recognized by the user, platform providers must also comply with the transparency obligations set forth in Article 53 of the European Union's recently passed AI Act.
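To make that definition concrete, here is a minimal, purely hypothetical sketch in Python of what a machine-readable disclosure record covering those three elements (work and copyright information, source, and collection date) might look like. The TrainingDataRecord structure and its field names are illustrative assumptions, not a format prescribed by the Rights Alliance or the AI Act.

    # Hypothetical sketch only: structure and field names are illustrative assumptions,
    # not a format prescribed by the Danish Rights Alliance or the EU AI Act.
    from dataclasses import dataclass, asdict
    from datetime import date
    import json

    @dataclass
    class TrainingDataRecord:
        title: str           # title of the work included in the training data
        rightsholder: str    # copyright information: who owns the work
        source: str          # where the content came from (URL, platform, or service name)
        collected_on: date   # when the content was initially collected

    record = TrainingDataRecord(
        title="Example Novel",
        rightsholder="Example Publishing House",
        source="https://example.com/catalog/example-novel",
        collected_on=date(2023, 6, 1),
    )

    # A disclosure published as structured data like this could be searched by rightsholders.
    print(json.dumps(asdict(record), default=str, indent=2))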
Case studies
The Rights Alliance report contains 13 case studies to illustrate the transparency practices of selected AI models. Seven of them are text-generating or multi-modal AI models.
The most transparent of these platforms is from the US-based open source organization EleutherAI. “EleutherAI provides direct access to a copy of its training data while also providing a list of dataset titles and a narrative explanation covering content type, source of content and references to third-party papers, thus enabling rightsholders to determine if their content has been used to train GPT-NeoX-20B. In turn this allows rightsholders to determine if they want to enforce their rights against the model provider,” said the Rights Alliance.
The other six text-generating and multi-modal model providers were far less forthcoming.
Meta Platforms: One of them was Meta's Llama model set (Llama 1, 2, and 3). While transparency was limited in Llama 1, the Rights Alliance was able to determine that Meta was using the Books3 dataset, which led the Alliance to identify works by Danish publishers and authors in Books3 and subsequently to initiate copyright enforcement activities that ultimately resulted in Books3's takedown. There was no transparency about the Llama 2 or Llama 3 datasets, so the Rights Alliance was unable to determine whether copyrighted content was used to train them.
Mistral AI: Mistral AI provides two general-purpose models, one of which was developed in collaboration with NVIDIA. Beyond saying that its data is sourced from the Web, Mistral AI bluntly states that "we do not communicate on our training datasets…"
Google: With respect to its Gemini and open source Gemma models, Google speaks only in generalities. A 2023 Google technical report said that its "Gemini models are trained on a dataset that is both multimodal and multilingual. Our pre-training dataset uses data from web documents, books, and code, and includes image, audio, and video data," and that its Gemma 2 "…models were trained on a dataset of text data that includes a wide variety of sources. The 27B model was trained with 13 trillion tokens, the 9B model was trained with 8 trillion tokens, and the 2B model was trained with 2 trillion tokens." That data, Google says, falls into three categories: web documents, code, and mathematics.
OpenAI: With respect to the training datasets used by OpenAI's GPT-4 platform, the Rights Alliance report could obtain no details "about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar." OpenAI says that it sources its data from "select publicly available data" and, through partnerships, from "non-publicly available data, such as pay-walled content, archives, and metadata. For example, we partnered with Shutterstock on building and delivering AI-generated images."
Microsoft: Microsoft's Phi models are trained extensively on synthetic training data generated by OpenAI's models, using "OpenAI's GPT-3.5 model to generate datasets with synthetic textbook-like content." Its Phi-2 model "contains synthetic datasets specifically created to teach the model common sense reasoning and general knowledge, including science, daily activities, and theory of mind, among others… (augmented by) carefully selected web data that is filtered based on educational value and content quality." According to the Rights Alliance, Microsoft does not list the datasets used or give narrative explanations of the content used to train Phi-3-mini and later models; here Microsoft uses descriptions such as "publicly available web data" from "various open internet sources."
Anthropic: For Anthropic's current Claude model, the Rights Alliance found that the "training data are described as originating from 'Publicly available information via the Internet', which leaves rightsholders with no way to determine if their content has been used to train Anthropic models."
Video generators
Two of the AI platforms evaluated were video generators. Runway AI's Gen-3 Alpha "was trained on both images and videos, but they don't go into further detail… It was revealed by 404 Media that Runway AI had scraped thousands of YouTube videos to train their models on, without consent from the rightsholders."
OpenAI's Sora was not yet available at the time of the report, but the Rights Alliance cited an April 2024 New York Times article that said OpenAI scraped content from YouTube to train its AI models, which violates YouTube's terms of service and was done without the permission of the videos' rights holders.
Music generators
Suno AI's Suno "only disclosed information regarding their training data after the record companies were able to generate output that was almost identical or so similar to recordings owned by the labels that it couldn't have been generated without Suno training its models on the record companies' content." Uncharted Labs' Udio music generator has faced similar scrutiny.
Image-generating platforms
Stability AI "did not provide rightsholders with any insight into what was used to train (Stable Diffusion's AI) models and therefore makes it impossible to determine if specific protected content was used in the training process."
Black Forest Labs released three text-to-image generators in August 2024 and provided “no transparency into training data making it impossible to determine whether specific copyright protected content has been used to train the FLUX.1 models.”
Incremental wins
Since 2023, the Danish Rights Alliance has won victories against the data providers Books3 and Common Crawl, both of which had been making unlicensed content available to AI platforms, by getting those providers to stop.
The New York Times is said to be the only other rightsholder to have achieved the removal of illegal copies of its own content from Common Crawl. According to research done by The Times, about 80% of the content used by OpenAI since 2007 originated from Common Crawl.
Methodology
The analysis is based on a systematic review of major AI models. For each, the Rights Alliance mapped what the models are trained on and where the content originates (to the extent possible), as well as whether the transparency offered by the AI developers gives rightsholders sufficient opportunity to identify whether their work has been used and thereby to enforce their rights.
Further reading
New report uncovers lack of transparency at the biggest AI services. Press release. September 13, 2024. Danish Rights Alliance (Rettighedsalliancen.dk).
Report on AI model providers' training data transparency and enforcement of copyrights. Research report (full report). September 5, 2024. Thomas Heldrup, Head of Content Protection & Enforcement, Danish Rights Alliance (Rettighedsalliancen.dk).
Why it matters
Knowing the sources of these AI platforms' content seems a straightforward path to copyright enforcement. The fact that most of these platforms have little intention of disclosing their sources makes it difficult, if not impossible, to determine what copyrighted content they might be using and, therefore, to pursue licensing arrangements.
While the term 'piracy' is not commonly used to describe content being used by AI platforms without permission (i.e., unlicensed), the lack of transparency from most of these AI platform providers may arouse suspicion.
Of course, licensing fees paid by the AI platform providers would be expenses that affect those companies' profitability, which could make them less attractive to early investors. Platforms also defend their lack of transparency by citing the proprietary nature of their algorithms, the secret sauce meant to demonstrate that "their rocket scientists are smarter than other rocket scientists," to paraphrase The Right Stuff.
As illustrated by the Rights Alliance's success in getting Books3 to acknowledge and stop distributing unlicensed content, some providers do in fact stop when asked.
The Google Books decision of 2015 provides additional guidance. Google Books originated in 2004 as The Library Project, which aimed to help researchers by placing the world's information at their fingertips, a lofty goal but one that attracted the attention of authors and publishers at the time.
A group of individual authors and the Authors Guild sued Google to make it stop, but in 2015 the US Supreme Court declined to hear the appeal of a lower-court decision holding that Google Books' use of the content was 'fair use' and therefore legal.
To populate Google Books, Google scanned and digitized an estimated 20 million books from university libraries that gave it permission to do so. It was also estimated that at least four million of those books were protected by copyright, but Google never received permission to reproduce them.