Danish media outlets have demanded that the nonprofit web archive Common Crawl remove copies of their articles from past data sets and stop crawling their websites immediately. The request was issued amid growing outrage over how artificial intelligence companies like OpenAI are using copyrighted materials.
Common Crawl plans to comply with the request, first issued Monday. CEO Rich Skrenta says the organization is “not equipped” to fight media companies and publishers in court.
The Danish Rights Alliance (DRA), an association representing copyright holders in Denmark, led the campaign. The request was made on behalf of four media outlets, including Berlingske Media and the newspaper Jyllands-Posten. The New York Times made a similar request to Common Crawl last year, before filing a lawsuit against OpenAI for using its work without permission. In its complaint, the New York Times highlighted how the Common Crawl data was the “most weighted data set” in GPT-3.
Thomas Heldrup, head of content protection and enforcement at the DRA, says this new effort was inspired by the Times. “Common Crawl is unique in that we are seeing a lot of big AI companies using their data,” says Heldrup. He sees its corpus as a threat to media companies trying to negotiate with AI titans.
Although Common Crawl has been essential to the development of many text-based generative AI tools, it was not designed with AI in mind. Founded in 2007, the San Francisco-based organization was best known before the rise of AI for its value as a research tool. “Common Crawl is caught in this conflict over copyright and generative AI,” says Stefan Baack, a data analyst at the Mozilla Foundation who recently published a report on Common Crawl’s role in AI training. “For many years it was a small niche project that almost no one knew about.”
Before 2023, Common Crawl did not receive a single request to redact data. Now, in addition to the requests from the New York Times and this group of Danish publishers, it is also fielding a growing number of requests that have not been made public.
In addition to this sharp increase in data redaction requests, Common Crawl’s web crawler, CCBot, is also increasingly being blocked from collecting new data from publishers. According to AI detection startup Originality AI, which tracks the use of web crawlers, more than 44 percent of the world’s major news and media sites block CCBot. Aside from BuzzFeed, which began blocking it in 2018, most of the prominent outlets it analyzed, including Reuters, the Washington Post, and CBC, have blocked the crawler only within the past year. “They are being blocked more and more,” says Baack.
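Publishers typically block a crawler like CCBot through their site’s robots.txt file, a directive Common Crawl says its crawler honors. A minimal sketch, using `CCBot`, the user-agent token Common Crawl documents for its crawler:

```txt
# robots.txt — block Common Crawl's crawler from the entire site
User-agent: CCBot
Disallow: /
```

Because compliance with robots.txt is voluntary on the crawler’s side, this approach only works against operators that, like Common Crawl, choose to respect it.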
Common Crawl’s quick compliance with this type of request is driven by the realities of keeping a small nonprofit afloat. Compliance, however, does not equate to ideological agreement. Skrenta sees this push to remove archival materials from data repositories like Common Crawl as nothing less than an affront to the internet as we know it. “It’s an existential threat,” he says. “They will kill the open web.”