LLaMA is a set of large language models (LLMs) – AI software platforms – created and maintained by Facebook parent Meta Platforms to produce, in the words of the complaint, "convincingly naturalistic text outputs in response to user prompts. These models are 'trained' by copying massive amounts of text and extracting expressive information from it. This body of text is called the training dataset."
In a class action complaint filed against Meta Platforms in July 2023, three plaintiffs – Richard Kadrey, Sarah Silverman and Christopher Golden – claimed that their copyrighted materials were copied and ingested as part of training LLaMA. The complaint also proposed a class consisting of other creators whose content was used to train LLaMA.
The case is now widely referred to as the "Kadrey case." Many of the plaintiffs' copyrighted books appear in the dataset that Meta has admitted to using to train LLaMA.
The complaint characterized LLaMA’s output as being “entirely and uniquely reliant on the material in its training dataset. Every time it assembles a text output, the model relies on the information it extracted from its training dataset.” Plaintiffs and Class members did not consent to the use of their copyrighted books as training material for LLaMA.
Internal emails
In February 2025, a series of internal emails from 2023 was filed as an appendix in the Kadrey case, which alleges copyright infringement through Meta's practice of ingesting illegally sourced content into LLaMA.
In the emails, employees discussed concerns raised by a Meta research engineer about using Meta IP addresses "to load through torrents pirate content," with one writing that "torrenting from a corporate laptop doesn't feel right."
One of the emails said that in-house counsel had advised them to "halt licensing efforts to obtain copyrighted works and instead use pirated works." Another internal email referred to "'Fair use datasets' such as Books3 and LibGen" (short for Library Genesis), both of which are identified as pirate repositories (or "shadow libraries," as they are characterized in this case).
When LLaMA was introduced in 2023, Meta listed Books3 as one of its data sources. (See: LLaMA: Open and Efficient Foundation Language Models, Section 2 Approach)
Further details
In a discovery letter brief (ref: Doc 443, linked below) filed later in February, the plaintiffs referenced deposition testimony by several Meta employees – including the same Meta engineer – that appeared to support the plaintiffs' claims:
Once Meta’s witnesses began testifying about this practice, it became clear that Meta’s copyright infringement was far more brazen than Plaintiffs previously knew or assumed. Notably, two Meta witnesses testified that Meta has used “torrenting” to acquire millions of copyrighted works from pirated (i.e. illegal) databases. In the process of torrenting this massive data, moreover, Meta also “seeded” it to others in the online piracy community. Ex. C at 348–351. The full extent of Meta’s torrenting formatively bears on Meta’s intentional copying and use of pirated books and awareness that this conduct was legally problematic given Meta’s efforts to prevent the public from being able to trace its torrenting activity back to Meta IP addresses and Facebook servers
Meta’s torrenting-related data is thus directly relevant to Plaintiffs’ copyright infringement claim because it reflects some of the copyrighted data that Meta downloaded from the shadow/pirated libraries at issue in this case, and it is also evidence of Meta distributing this copyrighted data without consent from the actual copyright holders, which is an independent infringing act.
The brief also argued that the data Meta had produced was incomplete: even though the plaintiffs had requested some of this data, some of it was withheld:
"[The Meta engineer] also testified that Meta stores its 'supervised fine-tuning data' for its Llama models on a specific hard drive cluster … Plaintiffs have observed what appeared to be gaps in Meta's mitigation data productions, and Plaintiffs now know from [the engineer] that this set of fine-tuning data exists within a discrete data location that has not been produced.
“… That data regulates Llama by (1) training the model to identify when copyrighted material has been emitted and (2) preparing alternative answers when copyrighted emissions occur. The supervised fine-tuning data consists of copyrighted works themselves: in short, the model is fine-tuned to say, “Don’t emit this.” Thus, not only does Meta’s supervised fine-tuning data itself consist of Plaintiffs’ and putative class members’ works, but whether Llama models frequently regurgitate copyrighted material unless fine-tuned also bears on Meta’s fair use argument that Llama models’ outputs are ‘transformative.'”
Meta said in response:
“In this motion, Plaintiffs distort deposition testimony to once again demand documents they did not ask for in discovery and that are not relevant to the lone remaining copyright infringement claim—namely, Meta’s alleged “BitTorrent client” software, application logs from the alleged use of torrenting, or peer lists created during the alleged use of torrenting. Plaintiffs did not ask for this information in their document requests. And Plaintiffs readily admit that Meta already has produced documents regarding discussions of alleged torrenting within Meta, including documents regarding any alleged decisions to use torrents to acquire data for training the Llama models. There is nothing to compel here.”
According to reporting by Ars Technica, Meta executives all the way up to Mark Zuckerberg "were aware of the use of pirated material to train AI models at the company."
Further reading
Discovery Letter Brief. Kadrey v. Meta Platforms Inc., Case 3:23-cv-03417-VC Document 443. Filed 02/14/25. US District Court for the Northern District of California, San Francisco Division
Appendix A. Case 3:23-cv-03417-VC Document 417-1 Filed 02/05/25. US District Court for the Northern District of California, San Francisco Division
Archive of court documents in the Kadrey v Meta Platforms case (3:23-cv-03417). Court Listener (This archive is updated with new documents as the case progresses. It begins with the initial complaint)
Court documents show not only did Meta torrent terabytes of pirated books to train AI models, employees wouldn’t stop emailing each other about it: ‘Torrenting from a corporate laptop doesn’t feel right.’ Article. February 7, 2025. by Ted Litchfield. Ars Technica
You just found out your book was used to train AI. Now what? Article. September 27, 2023. The Authors Guild. (Article about Books3)
What are “Pirate” Repositories? Article. Open Access landing page. Minnesota State University Mankato (Article about LibGen)
Why it matters
In recent years, there has been a gold rush by technology companies to train their AI platforms as quickly as possible, with as much content as possible, at the lowest possible cost. Unfortunately, for most of them, an integral part of that process has been ingesting unlicensed (i.e. stolen) content without compensating the rights holders – and then being deliberately opaque about the practice.
From the broad copyright perspective, the Kadrey complaint brings several counts: direct and vicarious copyright infringement under 17 U.S.C. § 106; removal of copyright-management information under 17 U.S.C. § 1202(b); unfair competition (unlawful business practices) under Cal. Bus. & Prof. Code § 17200; and unjust enrichment and negligence under California common law.
From an academic point of view, researchers seek dependable sources. As Minnesota State University cautions about such repositories: "These sites … appear to be harmless, … in reality you may not know under what conditions the materials are shared… The bottom line is that you may not always know where the materials are coming from that you find on these sites."