
Major websites are saying no to Apple’s AI scraping


In an independent analysis this week, data journalist Ben Welsh found that just over a quarter of the news websites he surveyed — 294 out of 1,167 publications, mostly English-language and based in the US — are blocking Applebot-Extended. By comparison, Welsh found that 53 percent of the news websites in his sample block OpenAI’s bot, and Google-Extended, the AI-specific bot Google introduced last September, is blocked by nearly 43 percent of those sites. The comparatively low rate suggests Applebot-Extended can still fly under the radar, though, as Welsh tells WIRED, the figure has been “gradually creeping up” since he started looking.

Welsh has an ongoing project monitoring how news outlets handle these bots. “There has been a bit of a split among news organizations about whether or not they want to block these bots,” he says. “I don’t have the answer to why each news organization made their decision. Obviously, we can read about a lot of them making licensing deals, where they get paid in exchange for letting the bots in, so maybe that’s a factor.”

Last year, The New York Times reported that Apple was trying to strike AI deals with publishers. Since then, competitors like OpenAI and Perplexity have announced partnerships with a range of media outlets, social platforms and other popular websites. “A lot of the world’s biggest publishers are clearly taking a strategic approach,” says Jon Gillham, founder of Originality AI. “I think in some cases there is a commercial strategy involved, such as holding back data until a partnership agreement is signed.”

There is some evidence to support Gillham’s theory. For example, Condé Nast websites used to block OpenAI’s crawling bots; after Condé Nast announced a partnership with OpenAI last week, it unblocked them. (Condé Nast declined to comment for this story.) Meanwhile, BuzzFeed spokesperson Juliana Clifton told WIRED that the company, which also owns the Huffington Post and is currently blocking Applebot-Extended, puts every AI crawling bot it can identify on its block list unless the bot’s owner has entered into a partnership, usually a paid one, with BuzzFeed.

Because the robots.txt file must be manually edited and so many new AI agents are debuting, it can be difficult to keep a block list up to date. “People just don’t know what to block,” says Dark Visitors founder Gavin King. Dark Visitors offers a freemium service that automatically updates a client site’s robots.txt file, and King says publishers make up a large portion of his customers because of copyright concerns.
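For reference, the blocking mechanism itself is simple: a site lists the crawlers it refuses in its robots.txt file. A minimal sketch, using the user-agent tokens that Apple, OpenAI, and Google publicly document for their AI crawlers, looks like this:

  # robots.txt: deny AI training crawlers access to the whole site
  User-agent: Applebot-Extended
  Disallow: /

  User-agent: GPTBot
  Disallow: /

  User-agent: Google-Extended
  Disallow: /

The file is a convention rather than an enforcement mechanism; it only works against crawlers that choose to honor it, and each new bot needs its own entry, which is exactly the maintenance burden King describes.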

Robots.txt may seem like arcane webmaster territory, but given its outsized importance to digital publishers in the age of artificial intelligence, it is now the domain of media executives. WIRED has learned that two CEOs of major media companies directly decide which bots to block.

Some outlets have explicitly noted that they block AI data-mining tools because they do not currently have partnerships with their owners. “We are blocking Applebot-Extended across all Vox Media properties, as we have done with many other AI data-mining tools when we do not have a commercial agreement with the other party,” says Lauren Starke, Vox Media’s senior vice president of communications. “We believe in protecting the value of our published work.”
