Inquiring minds want to know: Did China’s DeepSeek scrape OpenAI?

In an ironic twist, AI platform provider OpenAI and its primary investor Microsoft are questioning whether China’s open-source DeepSeek R1 generative artificial intelligence platform – which burst onto the scene this week, hammering the value of semiconductor stocks like Nvidia – has ‘harvested’ data from OpenAI’s platform repositories.

Bloomberg News reported this week that Microsoft detected data exfiltration by individuals associated with DeepSeek via OpenAI’s API (application programming interface). Most software and online platforms expose APIs that let external callers request and exchange data programmatically, a capability intended to drive demand for those products and platforms.
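For illustration only, programmatic access of the kind Bloomberg describes typically takes the form of authenticated HTTP requests to a provider’s public endpoint. The sketch below (Python, using the requests library) shows a generic call to OpenAI’s chat-completions API; the model name and prompt are placeholder assumptions, and the example says nothing about how any particular party actually collected data.

    # Minimal, illustrative sketch of a call to OpenAI's public API.
    # Assumes the `requests` package and an API key stored in the
    # OPENAI_API_KEY environment variable; model and prompt are placeholders.
    import os
    import requests

    response = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": "application/json",
        },
        json={
            "model": "gpt-4o-mini",  # placeholder model name
            "messages": [{"role": "user", "content": "Summarize today's AI news."}],
        },
        timeout=30,
    )
    response.raise_for_status()
    # The model's reply comes back as structured JSON.
    print(response.json()["choices"][0]["message"]["content"])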

Notorious obfuscation

The irony of this situation lies in the fact that OpenAI itself has been suspected of the same practice. While the company denies it, little is known beyond OpenAI’s own public disclosures.

In a September 2024 study, the Danish Rights Alliance could obtain no details about the training datasets used by OpenAI’s GPT-4 platform, nor “about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.” OpenAI had said that it sources its data from “select publicly available data” and, through partnerships, from “non-publicly available data, such as pay-walled content, archives, and metadata.”

A separate study by Stanford University’s Center for Research on Foundation Models established a transparency rating scale from zero (not transparent) to 100 percent (most transparent) and assigned OpenAI’s GPT-4 a rating of 47 percent. The most transparent platform was Meta’s Llama 2, at 54 percent – not a glowing endorsement either.

Copyright complaint

Also noteworthy is the 2023 lawsuit by The New York Times against OpenAI and Microsoft, which accused the companies of exploiting the Times’s journalism through Microsoft’s Bing Chat (later rebranded Copilot) and OpenAI’s ChatGPT. The Complaint listed hundreds of examples and noted that the training set for OpenAI’s GPT-3, GPT-3.5 and GPT-4 models comprised 45 terabytes of data – “the equivalent of a Microsoft Word document that is over 3.7 billion pages long,” said the Complaint.

In response to this issue, a bill called the COPIED Act of 2024 (S.4674, the Content Origin Protection and Integrity from Edited and Deepfaked Media Act of 2024) was introduced in the United States Senate on July 11, 2024, read twice, and referred to the Senate Committee on Commerce, Science, and Transportation. It appears not to have progressed any further. Regulators outside the United States have had greater legislative success.

We’re from the government and we’re here to help?

In response to an inquiry by Reuters, OpenAI stated: “We engage in counter-measures to protect our IP, including a careful process for which frontier capabilities to include in released models, and believe as we go forward that it is critically important that we are working closely with the U.S. government to best protect the most capable models from efforts by adversaries and competitors to take U.S. technology.”

This incident is unfolding against a backdrop of US government upheaval as the new presidential administration takes root. In the same week, the new acting head of the US Department of Homeland Security dismissed the members of the Cyber Safety Review Board (CSRB), the advisory committee that had been investigating one of the biggest-ever hacks by Chinese operators, ostensibly to curb ‘government waste.’ Some have questioned whether the move was politically motivated.

Further reading

Microsoft probing if DeepSeek-linked group improperly obtained OpenAI data. Article by Dina Bass and Shirin Ghaffary. January 28, 2025. Bloomberg (paywall)

Microsoft probes if DeepSeek-linked group improperly obtained data, Bloomberg News reports. Article. January 29, 2025. Reuters

Report: Data transparency and enforcement of copyright by AI model providers found lacking. Article by Steven Hawley. September 13, 2024. Piracy Monitor

Stanford compares AI Foundation models for transparency; none rate higher than 54%. Article by Steven Hawley. January 23, 2024. Piracy Monitor

US Senate group intros bill to combat copyright violations in AI-generated deepfakes. Article by Steven Hawley. July 12, 2024. Piracy Monitor

The Times sues OpenAI and Microsoft over AI use of copyrighted work. Article by Michael M. Grynbaum and Ryan Mac. December 27, 2023. The New York Times (paywall)

The New York Times (Plaintiff) v. Microsoft, OpenAI, et al. (Defendants). Case 1:23-cv-11195, Document 1. Filed December 27, 2023. United States District Court, Southern District of New York.

Why it matters

Investigations by copyright stakeholders have found that nearly all generative AI platforms have been opaque about the sources of their data, fearing exposure for having ingested content without licenses to do so. One might interpret that as an act of piracy, particularly if evidence of unlicensed use emerges from behind the ongoing obfuscation.