AI data-set supplier Common Crawl agrees to stop illegal copying of Danish content


Following a formal request from the Danish Rights Alliance, the Web archive known as Common Crawl has agreed to stop copying content from websites belonging to Danish media houses. Common Crawl had been known to copy full-length articles from the media houses’ websites without obtaining permission from or providing compensation to the rights holders.

Also in response to the Rights Alliance request, Common Crawl will review its existing data set, with a view to removing content belonging to the relevant Danish media houses.


A similar victory was announced by the Rights Alliance in 2023, when the organization won the removal of the controversial Books3 training data set, which included up to 200,000 illegal copies of works by Danish and international authors.

Common Crawl seeds others as well

Common Crawl is not the only target of rights enforcement, as content from that Web archive forms the basis of many data sets on the web, which are used by multiple tech companies to train their artificial intelligence platforms.

Google’s C4 dataset, which is based on copies from Common Crawl, is one example. That dataset has been used by OpenAI, Meta and Google, among others, to train generative AI. In July and August 2024 alone, the C4 dataset was downloaded nearly 200,000 times from the Hugging Face platform.

As the Rights Alliance uncovers further use of illegal copies of Danish media content in training data, it will continue to support the rights of those media houses.

New York Times connection

The New York Times is the only other rights-holder to have achieved the removal of illegal copies of its own content from Common Crawl.

Separately, in a lawsuit filed in 2023 and still pending, The New York Times sued Microsoft and OpenAI for scraping the NYT’s own original content. Microsoft recently announced changes to its end-user Terms of Service, adding terms relating to artificial intelligence that in essence say that Microsoft is not responsible for what other parties process through its services. Who’s to say whether this was a defensive or a proactive move to help soften the blow, should Microsoft and OpenAI lose the case? Observers say that the case may end up being escalated to the US Supreme Court.

According to disclosures made by OpenAI and reported by The New York Times, the vast majority of content used by OpenAI’s GPT-3 platform was sourced from Common Crawl.

Source: The New York Times, using data disclosed by OpenAI

Further reading

The web archive Common Crawl stops illegal copying of Danish media houses’ content.  Press release. September 2, 2024. Danish Rights Alliance.

Denmark: Authors win takedown battle to stop AI engines from using their stolen works for training. Article. August 14, 2023. By Steven Hawley. Piracy Monitor

The New York Times Company, Plaintiff, v. Microsoft Corporation and OpenAI (multiple entities).  Document 1, Complaint. Filed December 27, 2023. United States District Court for the Southern District of New York

Four Takeaways on the Race to Amass Data for A.I. Article. April 6, 2024. By Cecilia Kang, Cade Metz and Stuart A. Thompson. The New York Times

Why it matters

Copyright has increasingly been challenged by ‘webcrawlers’ that copy content from media houses’ websites with the aim of making articles illegally available in datasets used to train generative artificial intelligence services. The media houses’ exclusive right to control their own content is crucial to preserving the foundations of Danish journalism.

“When content is made available for free and can be freely used by developers of artificial intelligence, their incentive to pay for the rights holders’ content disappears,” said Thomas Heldrup, Head of Content Protection and Enforcement for Rights Alliance. “By enforcing against the illegal copying of content used to train artificial intelligence, we can give control back to rights holders and strengthen their position in negotiations with AI developers. At the same time, it sends a clear signal to AI developers that the use of creative content requires permission from the respective rights holders,” he said.
