OpenAI trains its GPT model using pirated e-books, contends authors’ lawsuit


Two individual authors have filed a class action complaint against OpenAI and its related business entities, alleging copyright infringement, violation of the Digital Millennium Copyright Act, unjust enrichment, common law unfair competition, and negligence.  The plaintiffs contend that OpenAI extracted content from their works, without license, to build the dataset used to train ChatGPT.

One of the datasets used to train OpenAI’s model (and others, including Google’s and Amazon’s) is BookCorpus, which in turn copies and hosts unpublished works from Smashwords.com, a Web site that makes those works available at no cost.  Many of those works are under copyright but are used “without consent, credit, or compensation to the authors.”


In a July 2020 paper about its GPT-3 model, OpenAI disclosed that 15% of its dataset came from “two Internet-based books corpora,” called Books1 and Books2.  The current complaint against OpenAI contends that among the 294,000 titles in the Books2 dataset were many works available through torrent platforms, and it references another collection (called Books3) containing about 200,000 books; both include books copied from “shadow libraries.”

Opacity to hide piracy?

When OpenAI launched its GPT-4 platform in March 2023, it disclosed no details about its dataset, saying that “[g]iven both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about . . . dataset construction.”

The complaint goes on to detail OpenAI’s ChatGPT business model, which collects a monthly fee from users to access the platform’s Web site or API.

The complaint also documents inaccuracies and omissions in ChatGPT’s results:

Extract from ChatGPT complaint. Source: Case 3:23-cv-03223-AMO Document 1 Filed 06/28/23, US District Court, Northern District of California

OpenAI is not the only potentially culpable target

Independent studies are finding evidence that many generative AI models are being trained on pirated content.  The Washington Post conducted an investigative study of Google’s C4 data set, “to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI’s training data,” and published its results in April 2023.  C4 is a snapshot of 15 million Web sites used by both Google and Facebook to train their own large language generative AI models.

In addition to privately hosted content such as voter registration data, and material that resides behind paywalls on platforms like Kickstarter and Patreon, the Post also found b-ok.org, a market for pirated e-books that has since been seized by the US DOJ.  Artists have filed copyright infringement claims against text-to-image generators as well, the Post said.  Its analysis also exposed a range of biases touching on religion, gender, and race.

The Post noted that at least 27 other sites in C4 were identified in the US Trade Representative’s Notorious Markets report.

The Post found that OpenAI’s GPT-3 training data set is about 40 times the size of Google’s C4, and reports that these companies “do not document the contents of their training data — even internally — for fear of finding personal information about identifiable individuals, copyrighted material and other data grabbed without consent.”

Demands by the plaintiffs include:

  • An order for this action to proceed as a class action,
  • Statutory and other damages under 17 U.S.C. § 504 for violations of the copyrights of Plaintiffs and the Class by Defendants,
  • Statutory or actual damages plus additional profits gained by OpenAI, including tripling the recovery of costs,
  • Changes to ChatGPT to prevent further copyright violation,
  • Costs and attorney fees,
  • Interest on damages, and,
  • Notifications to members of the class affected, at defendant’s cost.

The “Class” defined in the complaint extends to defendants, “co-conspirators, parent companies, officers, directors … subsidiaries, affiliates, or agents …” and others.

If the plaintiffs win, the cost could be heavy, even though statutory damages are capped at $150,000 per work.
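For a rough sense of scale, one can multiply the per-work statutory cap by the title count cited in the complaint.  This is purely an illustrative upper bound, not a damages estimate: it assumes, hypothetically, that every one of the 294,000 Books2 titles were a registered, willfully infringed work eligible for the maximum award.

```python
# Illustrative upper-bound arithmetic only. Actual exposure depends on how many
# works qualify, their registration status, and whether infringement is willful.
WORKS_IN_BOOKS2 = 294_000          # title count cited in the complaint
STATUTORY_CAP_PER_WORK = 150_000   # 17 U.S.C. § 504(c)(2) cap for willful infringement

max_exposure = WORKS_IN_BOOKS2 * STATUTORY_CAP_PER_WORK
print(f"Hypothetical maximum exposure: ${max_exposure:,}")
# Hypothetical maximum exposure: $44,100,000,000
```

Even a small fraction of that hypothetical ceiling would be a material sum.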

Further reading

Paul Tremblay, an individual, and Mona Awad, an individual v OpenAI Inc (and subsidiaries, all Delaware LLCs).  Complaint. Document #1, Case No. 3:23-cv-03223-AMO.  June 28, 2023. US District Court for the Northern District of California, San Francisco Division.

We’ve filed lawsuits challenging ChatGPT and LLaMA, industrial strength plagiarists that violate the rights of book authors.  Article. June 28, 2023. Joseph Saveri and Matthew Butterick. LLM Litigation.

Inside the secret list of websites that make AI like ChatGPT sound smart.  Article. April 19, 2023. The Washington Post (Paywall)

Documenting the English Colossal Clean Crawled Corpus. Research paper. Published 2021. Semantic Scholar

Why it matters

This complaint opens another line of discussion about the intersection of artificial intelligence and piracy: the theft of content used to generate creative works without credit or compensation to the creators.  It joins other intersections of AI and piracy, including the use of AI to detect infringing consumption by flagging anomalies such as spikes in attempts to access a premium content item or a private end-user streaming account, and the use of AI by pirates to evade detection.

Many of the companies leading the generative AI category are large and highly profitable, with a financial incentive to minimize their costs.  To be blunt and perhaps cynical: some of these platforms claim to distinguish legal from infringing use and promote themselves as watchdogs against piracy, yet are lax in enforcing detection and countermeasures.

The nature of the content is secondary, since Web sites scraped for the datasets used by generative AI platforms reference or host all manner of content.
