In November, the Dutch anti-piracy agency BREIN reported that Common Crawl is removing over 2 million news articles from its database. The articles were published on well-known Dutch news websites and in digital newspapers, and no permission had been granted for their inclusion.
Common Crawl is an American non-profit organization that scrapes the internet and makes its database available free of charge to consumers and businesses, including generative AI services that use these datasets to train their AI models.
Research showed that virtually all major generative AI language models were (partly) trained on Common Crawl data. This includes Apple’s OpenELM, Microsoft’s Phi, OpenAI’s ChatGPT, NVIDIA’s NeMo Megatron, DeepSeek’s DeepSeek-V3, and Anthropic’s Claude.
Acting on behalf of several Dutch news publishers, BREIN requested that Common Crawl remove these unauthorized copies from its database so that AI services could no longer train their models on this content without authorization. Common Crawl complied with the request.
Legally licensed alternatives
An alternative to scraping is to license such content legally, with publishers making their data available for training in exchange for compensation. The Guardian entered into such an arrangement with OpenAI in February 2025, which applies only to articles published from that date forward. The New York Times announced a similar deal in May, reportedly worth at least $20 million per year to the Times.
BREIN cited GPT-NL as the first large-scale Dutch AI language model trained entirely on legally obtained data. Training of this model began in June 2025. It is an initiative of the Dutch organizations TNO, NFI, and SURF in collaboration with, among others, the industry organization NDP Nieuwsmedia, whose members provided a huge dataset.
Why it matters
Common Crawl’s web archive consists of petabytes of mostly copyrighted works, including many news articles, that Common Crawl has been collecting since 2008. Common Crawl updates its data archive monthly with newly published material.
The NVJ (Dutch Association of Journalists) is advocating for a collective compensation scheme to ensure that journalists (and other creators) receive fair compensation when their work is used for AI training. Until a ban or a compensation agreement between the publishers and the companies behind the language models is in place, the NVJ argues, it is crucial to take these databases offline.
Further reading
BREIN Foundation, “Common Crawl removes 2 million articles,” press release, November 4, 2025.