“Generative AI model providers across the spectrum of LLMs, image, video and music generation (platforms) have obtained infringing copies of copyright protected works that were sourced from ‘classic pirate sources’ such as illegal filesharing and streaming sites,” said the author of a report released in March by the Danish Rights Alliance.
The report also describes how generative AI developers obtain datasets containing pirated content, and it evaluates datasets from Common Crawl.
Common Crawl, one of the most popular sources of training data for LLMs, is a US-based nonprofit that crawls and scrapes text from the internet. It is included in the report because it never obtained permission from rightsholders to copy, store, and distribute the massive amounts of protected content it handles, including press publications, books, and song lyrics.
Third-party datasets
The report also shows how AI model providers have used publicly available datasets compiled by third parties that contain infringing copies of copyright-protected works from illegal sources.

These third-party datasets are distributed in many ways, ranging from AI-focused user-generated sharing platforms such as HuggingFace and Kaggle to torrent filesharing, or simply shared directly between individuals via online messaging and chat services such as Discord.
What are some of these datasets?
One of the most popular datasets used by AI providers containing infringing copies of copyright protected works is “The Pile,” an 800GB dataset of diverse text for language modeling.
Some of the datasets within The Pile are:
- OpenSubtitles, which contains plain-text subtitle files for movies and TV shows sourced from OpenSubtitles.org, a pirate site where users upload and share infringing copies of subtitles
- Books3, which contains 196,640 plain-text files of books. The books in Books3 originated from the illegal filesharing site Bibliotik.me
- Common Crawl (CC), a US-based nonprofit that crawls the internet, creating copies of all text found on the websites it visits. CC then stores the text on servers donated by Amazon Web Services and lets anyone download the datasets for free
- Redpajama, compiled by the US-based company Together.AI, which is both a subset used by Google’s A4 dataset and a set that contains Books3
- Slimpajama, which was collected by the US company Cerebras and contains a filtered version of the Redpajama dataset
- LibGen (short for Library Genesis), a classic pirate filesharing site dedicated to sharing illegal copies of books
- Z-lib (short for Z-Library), a classic pirate filesharing site dedicated to sharing illegal copies of books, which has also been targeted by the FBI
- Anna’s Archive, an aggregator of shadow libraries such as LibGen and Z-Library that together contain millions of illegal copies of e-books
Books3 and Libgen are referenced in several past articles from Piracy Monitor.
Commercial AI platform providers also use illegal datasets to train their own models, including:
- Apple provides access to a range of LLMs called OpenELM, which were pre-trained using The Pile and Redpajama
- Anthropic provides access to a range of LLMs called Claude, which are trained using The Pile and YouTube datasets
- DeepSeek provides a range of LLMs trained using The Pile, Redpajama, and “more than 1 million illegal copies of e-books sourced from Anna’s Archive”
- Meta Platforms has used LibGen, Z-Lib, Anna’s Archive, and Books3 to train its LLaMA AI models, a practice known and approved by Meta CEO Mark Zuckerberg
- Microsoft claims to have used the Slimpajama dataset (which contains Books3 and Common Crawl) to train its Phi version 2 model
- NVIDIA claims it has used The Pile (which contains Books3, OpenSubtitles, and Common Crawl) to train its NeMo Megatron-GPT 1.3B model
- OpenAI is said to have trained its GPT models using Wikipedia and “two internet-based books corpora (Books1 and Books2)… (and) datasets sourced from LibGen”
- Runway AI reportedly scraped thousands of YouTube videos to train its models, without consent from the rightsholders
- Suno AI admits that its AI model was trained on “tens of millions of recordings,” which likely originated from cyberlockers, via BitTorrent, or through stream-ripping technology
Acquisition is not always direct
The wide range of dataset distribution methods reveals how AI model providers don’t always engage in direct or commercial negotiations and agreements with the providers of AI training datasets.

In many cases they obtain datasets from user-generated platforms such as HuggingFace or by downloading torrent files, without any interaction with the company that compiled the dataset.
Further reading
Report on pirated content used in the training of generative AI. March 2025. By Thomas Heldrup, Head of Content Protection & Enforcement, Danish Rights Alliance.
Why it matters
“It has become well understood that the major AI companies have collected and used copies of copyright protected content to train generative AI, such as LLMs, without permission from the relevant rightsholders,” said the report. “But what is becoming more apparent by the day is the prevalence of content sourced from pirate sites in AI training data.
“The wide range of dataset distribution methods reveals how AI model providers don’t always engage in direct or commercial negotiations and agreements with the providers of AI training datasets. In many cases they obtain datasets from user-generated platforms such as HuggingFace or via downloading of torrent files without any interaction with the company compiling the dataset.”