Meta, Google and other AI platforms are facing a growing chorus of accusations from stakeholders who say they knowingly ingest copyrighted content to drive the advertising revenue that is central to their business models, without compensating or recognizing the content's owners and rights-holders.
On July 16, the Chair of a US Senate judiciary subcommittee described the intentional use of copyrighted content by Meta and others, accusing them of using more than “200 terabytes of copyrighted work – or in other words, billions of pages that would fill approximately 22 Libraries of Congress,” to train their generative AI large language models.
“We’re talking about piracy,” said U.S. Senator Josh Hawley (R-Missouri)**, as he described the platform providers’ role in using unlicensed copyrighted content to fuel artificial intelligence (AI) models. The hearing also featured witness testimony from bestselling author David Baldacci as well as AI experts and law professors in support of the Senator’s claims.
“Today’s hearing is about the largest intellectual property theft in American history. . . . AI companies are training their models on stolen material, period. . . . And we’re not talking about these companies simply scouring the internet for what’s publicly available,” said Sen. Hawley.
The fireworks continued: “Are we going to protect [Americans’ creative community], or are we going to allow a few mega-corporations to vacuum it all up, digest it, and make billions of dollars in profits—maybe trillions—and pay nobody for it. That’s not America,” he exclaimed.
Evidence is well documented
In February 2025, evidence submitted in the piracy case brought by several authors (the “Kadrey case”) included emails between Meta employees that discussed using Meta IP addresses “to load through torrents pirate content” and said that “torrenting from a corporate laptop doesn’t feel right.” Allegedly, Meta CEO Mark Zuckerberg was aware of the use of pirated materials.
Datasets containing unlicensed content that Meta and others are ingesting include Common Crawl, LibGen and Books3, among others. When LLaMA was introduced in 2023, Meta explicitly listed Books3 as one of its data sources.
A report by the Danish Rights Alliance detailed this situation earlier in 2025. Some of these datasets have been acknowledged by their aggregators as containing unlicensed source material, and the aggregators have removed the infringing material from them. After being approached by the Rights Alliance, the host of Books3 removed infringing instances of Danish content from the platform.
Opaque about their sources
In an earlier report, Stanford University’s Center for Research on Foundation Models, part of its Institute for Human-Centered Artificial Intelligence (Stanford HAI), published a Foundation Model Transparency Index (FMTI) that rates ten foundation model companies.
Stanford uses the term ‘Foundation Models’ to refer to models that can be trained “on a huge amount of data, and adapted to many applications.” The Index evaluated 100 different aspects of transparency and scored ten different publicly available models on a scale from 1 to 100. The scores ranged from 54 down to 12, indicating that all of the models leave room for improvement.
Further reading
Chairman Hawley exposes Big Tech’s complicity in piracy to train AI models & willfulness to bankrupt U.S. creative community. Press release. July 16, 2025. US Senator Josh Hawley (R-Missouri)
Too big to prosecute? Examining the AI industry’s mass ingestion of copyrighted works for AI training. Video of the full hearing. July 16, 2025. US Senate Subcommittee on the Judiciary
Report: “Classic pirate sources” are widely used to train AI datasets, says Danish Rights Alliance. Article. March 20, 2025. by Steven Hawley. Piracy Monitor
Employee statements claim Meta used pirated material to train LLaMA AI in Kadrey case. Article. February 17, 2025. by Steven Hawley. Piracy Monitor
Stanford compares AI Foundation models for transparency; none rate higher than 54%. Article. January 23, 2024. by Steven Hawley. Piracy Monitor
AI data-set supplier Common Crawl agrees to stop illegal copying of Danish content. Article. September 2, 2024. by Steven Hawley. Piracy Monitor
What happened to Google’s effort to scan millions of university library books? Article. August 10, 2017. by Jennifer Howard. EdSurge.
Why it matters
The roots of this matter go back decades, beginning when the fledgling Google announced in 2002 that it would create a library of all the world’s books online. Ten years later, the effort was still ensnared in legal battles, and it came to an end. At that time, it was well-intentioned.
Now, these efforts have evolved into powerful commercial ventures that purvey disinformation and are corrupted by the profit motive, apparently with little intention to acknowledge the originators or rights-holders or to dedicate a budget to compensating them.
Piracy Monitor hopes that this and similar awareness campaigns elsewhere will help turn the tide.
** Well known for his fist-pump during the US Capitol insurrection of January 6, 2021, Josh Hawley is no relation to Piracy Monitor’s Steve Hawley.










