In late January, the Dutch anti-piracy organization BREIN took offline a generative AI large language model called GEITje-7B, which was trained on tens of thousands of Dutch-language books from an illegal source. The source is Library Genesis, a service that has been found unlawful by the Dutch courts and is being blocked by Dutch access providers at BREIN’s request.
Library Genesis, also known as LibGen, has also gained notoriety from its use by Facebook parent company Meta Platforms in training Meta’s LLaMA large language model. That use is documented in a US class-action lawsuit that was initiated by several individuals in 2023 and remains unresolved at the time of this writing.
Second case in two weeks
In the second case, a model was trained on, among other things, many billions of tokens of Dutch-language literature, news and textbooks, according to its creator. In the model’s documentation, the creator didn’t elaborate on what those materials were specifically, but with so much data, it was highly unlikely to be exclusively copyright-free material. The AI was primarily offered as a chatbot and could be downloaded and run by anyone.
BREIN had reached out to the creator of the model and asked what those training data were, where the data came from, and whether the creator had a license to collect and process the data in that way. If these rights were lacking, then obviously the model would have to be taken offline. The alternative was a lawsuit.
Data sets for training AI have been known to be filled with materials from illegal sources. The names of certain so-called shadow libraries come up regularly in this context. On such unauthorized websites, protected works can be downloaded for free; these illegal sources have already been blocked by the Dutch access providers at BREIN’s request. If AI datasets and language models are based on such illegal copies, this is obviously undesirable for the authors and producers of the original works, and BREIN will take action.
In that second case, BREIN noted that the person behind the LLM undoubtedly understood the situation and decided to take his model offline without further discussion. The BREIN Foundation was satisfied with that result and continues to search for datasets and language models that violate copyright on a large scale.
Further reading
BREIN takes down a large language model for the second time in as many weeks. Press release. February 6, 2025. BREIN Foundation
BREIN takes LLM offline. Press release. January 27, 2025. BREIN Foundation
Why it matters
Obfuscation of sources has become a common practice among developers of artificial intelligence platforms employing large language models, according to separate studies by Stanford University and by the Danish Rights Alliance.
In addition to the obligation to attribute ownership under copyright law and to ensure that the owner is compensated or recognized by the user, platform providers must also comply with transparency obligations set forth in the European Union’s AI Act (Article 53).
BREIN noted that in the United States, dozens of lawsuits are already pending against providers of AI models. In Europe, the first cases are now also being brought before the courts. Gradually the realization is dawning that copyright must be respected, and the first licensing agreements are being signed, for example between OpenAI and the Financial Times and, more recently, the preliminary agreement between the major music companies and Anthropic.