Danish Rights Alliance: ‘Full transparency around generative AI training data is critical’


In August, the Danish Rights Alliance, a media-industry and anti-piracy advocacy organization, announced that Books3, a dataset of about 200,000 pirated e-books, had been taken down in July. Acting in the interest of Danish authors and content created in Denmark, the Rights Alliance claims this was the first case in which a stolen dataset was taken down following a request presented to the dataset's host.

Books3 has also been cited as a source of stolen content in a high-profile lawsuit by two US authors against OpenAI, the developer of ChatGPT.


The main lesson of this victory is that it would not have been possible to assert rights without knowledge of the contents and origins of the dataset, said the Rights Alliance, which wants to ensure the legacy of this case. “Full transparency around AI training data is critical to the effective enforcement of creative content,” they said.

AI engine providers are reluctant

The Rights Alliance contends that “[t]here is a clear trend among tech giants such as OpenAI, Google and Meta that they are reluctant to publish what data their generative artificial intelligence is trained on and where this data originates.”

It’s true: in a class-action copyright infringement complaint filed against OpenAI in June 2023, two individual plaintiffs identified multiple datasets of questionable origin. When OpenAI launched its GPT-4 platform in March 2023, it disclosed no details about its dataset, saying in a report that “[g]iven both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about . . . dataset construction.”

Press release

The Books3 case emphasizes the need for transparency in the training of artificial intelligence. Article. September 4, 2023. Rights Alliance (Rettighedsalliancen). Auto-translated from Danish to English by Google Translate.

Why it matters

“We already see that the developers of artificial intelligence withhold information about the data their models are trained on,” said Maria Fredenslund, the director of the Rights Alliance.

“It was a special case with Books3, as the creators of the data set had made public its origin, and at the same time some artificial intelligence developers had indicated that they had used Books3. The case is therefore a real example of transparency being necessary for rights holders to enforce their content,” she said.

The Books3 case also highlights that the transparency requirement in the EU’s AI Regulation is not sufficient, since it does not oblige the developers of artificial intelligence to publish where the content of their training data originates.

“We call for a stricter requirement for transparency in the EU’s AI Regulation, so that rights holders have a real opportunity to check whether their content is used to train artificial intelligence,” said Fredenslund.

But how do you stop tampering?

Transparency is an important lesson to learn, because it lets rights holders discern the provenance of data whose origins may be suspect. But not everyone will play by these principles, which in turn makes it important that original content can be identified in some way that is robust against tampering or the alteration of its metadata. So, how do we keep content and its chain of ownership from being obfuscated or removed by pirates? One conceivable approach is sketched below.
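
As an illustration only, and not the Rights Alliance's or any vendor's actual method: one way to make ownership claims tamper-evident is to bind a work's bytes and its rights metadata to a signature, so that altering either one is detectable. The key, file contents, and metadata fields below are hypothetical; a real deployment would use asymmetric keys and a provenance standard such as C2PA.

```python
# Illustrative sketch only: a minimal "content credential" for an e-book,
# binding the work's bytes and its ownership metadata to a keyed signature
# so that later tampering with either is detectable.
import hashlib
import hmac
import json

SECRET_KEY = b"rights-holder-signing-key"  # hypothetical; real systems use key pairs

def make_manifest(content: bytes, metadata: dict) -> dict:
    """Hash the content, attach the metadata, and sign the whole record."""
    record = {
        "content_sha256": hashlib.sha256(content).hexdigest(),
        "metadata": metadata,
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["signature"] = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return record

def verify_manifest(content: bytes, record: dict) -> bool:
    """Return True only if neither the content nor the metadata was altered."""
    claimed = dict(record)
    signature = claimed.pop("signature")
    payload = json.dumps(claimed, sort_keys=True).encode()
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(signature, expected)
            and claimed["content_sha256"] == hashlib.sha256(content).hexdigest())

book = b"...full text of the e-book..."  # stand-in content
manifest = make_manifest(book, {"title": "Example Title", "rights_holder": "Example Forlag"})
assert verify_manifest(book, manifest)              # intact copy verifies
assert not verify_manifest(book + b"x", manifest)   # tampered content is caught
```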

In March, Adobe – which has a vested interest in protecting its Adobe Stock content library – announced and began testing copyright-assurance functionality for its Firefly AI technology, which Adobe has been offering through a stand-alone public beta website. Firefly has been trained on millions of professional-grade, licensed images in Adobe Stock, “along with openly licensed content and public domain content where the copyright has expired,” said Adobe.

Solutions to tampering and fraud are mature in the world of images and video, in the form of forensic watermarking; for streaming, through session-based watermarking; and for software apps, in the form of code and key obfuscation.
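
To make the watermarking idea concrete, here is a deliberately simple sketch, assuming raw pixel bytes and a per-viewer session identifier (both invented for this example): the identifier is hidden in the least significant bits of the content itself, so stripping the file's metadata does not remove it. Production forensic watermarks are far more sophisticated and survive re-encoding; this toy does not.

```python
# Toy illustration only: hide a session ID in the least significant bit (LSB)
# of each byte of a raw pixel buffer, so the mark travels with the content.

def embed(pixels: bytearray, mark: bytes) -> bytearray:
    """Write each bit of `mark`, most significant bit first, into pixel LSBs."""
    out = bytearray(pixels)
    for i, byte in enumerate(mark):
        for bit in range(8):
            pos = i * 8 + bit
            out[pos] = (out[pos] & 0xFE) | ((byte >> (7 - bit)) & 1)
    return out

def extract(pixels: bytes, length: int) -> bytes:
    """Read `length` bytes back out of the pixel LSBs."""
    mark = bytearray()
    for i in range(length):
        byte = 0
        for bit in range(8):
            byte = (byte << 1) | (pixels[i * 8 + bit] & 1)
        mark.append(byte)
    return bytes(mark)

frame = bytearray(range(256)) * 4          # stand-in for raw image data
marked = embed(frame, b"session-0042")     # hypothetical session identifier
assert extract(marked, 12) == b"session-0042"
```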

Otherwise, it’s difficult to say how “robustness” can be achieved, short of modifying AI-generated content itself, or the metadata of the datasets used in AI training. Blockchain could be one answer, but can it be mandated by policy?
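
For the blockchain idea, a minimal sketch of what such a record might look like, assuming a private append-only ledger (all record fields here are invented): each provenance entry commits to the hash of the previous one, so anyone who edits or deletes an earlier record breaks every later link.

```python
# Hypothetical sketch: an append-only, hash-chained ledger of dataset
# provenance records, in the spirit of a blockchain. Retroactively editing
# or deleting an entry invalidates the whole chain on verification.
import hashlib
import json

class ProvenanceLedger:
    def __init__(self):
        self.entries = []

    def append(self, record: dict) -> None:
        """Add a record that commits to the hash of the previous entry."""
        prev_hash = self.entries[-1]["hash"] if self.entries else "0" * 64
        body = json.dumps({"record": record, "prev": prev_hash}, sort_keys=True)
        self.entries.append({"record": record, "prev": prev_hash,
                             "hash": hashlib.sha256(body.encode()).hexdigest()})

    def verify(self) -> bool:
        """Recompute every link; any tampered entry breaks the chain."""
        prev_hash = "0" * 64
        for entry in self.entries:
            body = json.dumps({"record": entry["record"], "prev": entry["prev"]},
                              sort_keys=True)
            if (entry["prev"] != prev_hash or
                    entry["hash"] != hashlib.sha256(body.encode()).hexdigest()):
                return False
            prev_hash = entry["hash"]
        return True

ledger = ProvenanceLedger()
ledger.append({"dataset": "example-corpus", "source": "licensed-publisher-feed"})
ledger.append({"dataset": "example-corpus", "added": "public-domain-texts"})
assert ledger.verify()
ledger.entries[0]["record"]["source"] = "unknown"   # simulated tampering
assert not ledger.verify()
```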

In July, the US government announced a voluntary initiative, with the support of seven companies including Amazon, Google, Meta, Microsoft and OpenAI, toward working on this problem of asserting the identity and ownership of the datasets used to train AI. Interestingly, Adobe has not signed on (publicly).

Further reading

Denmark: Authors win takedown battle to stop AI engines from using their stolen works for training. Article. By Steven Hawley. August 14, 2023. Piracy Monitor.

(Some) platform providers agree to support US AI initiative, may ‘watermark’ AI-generated content. Article. By Steven Hawley. July 21, 2023. Piracy Monitor.

OpenAI trains its GPT model using pirated e-books, contends authors’ lawsuit. Article. By Steven Hawley. June 30, 2023. Piracy Monitor.
