
The race to block OpenAI scraping bots is slowing down


It’s too early to say how the series of deals between AI companies and publishers will play out. However, OpenAI has already scored a clear victory: its web crawlers are no longer being blocked by mainstream media outlets at the rate they once were.

The rise of generative AI sparked a data gold rush, and a subsequent rush to protect that data (on most news websites, at least), in which publishers tried to block AI crawlers and keep their work from becoming training data without consent. When Apple introduced a new artificial intelligence agent this summer, for example, a number of major media outlets quickly opted out of Apple’s web scraping using the Robots Exclusion Protocol, or robots.txt, the file that lets webmasters control how bots access their sites. There are so many new AI bots on the scene that keeping up can feel like playing whack-a-mole.
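For reference, that opt-out is typically just a short directive in the robots.txt file at a site’s root. A minimal sketch, using the “GPTBot” user-agent string OpenAI documents for its crawler (the example.com domain is only a placeholder):

    # Block OpenAI's training crawler from the entire site (placeholder domain).
    User-agent: GPTBot
    Disallow: /

A compliant crawler that fetches https://example.com/robots.txt and sees those lines is expected to skip the whole site; deleting the Disallow rule signals that crawling is welcome again.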

OpenAI’s GPTBot has the most name recognition and is also blocked more frequently than competitors like Google AI. The number of high-ranking media websites using robots.txt to “disallow” GPTBot grew dramatically from its August 2023 launch through that fall, and then climbed steadily (though more gradually) from November 2023 to April 2024, according to an analysis of 1,000 popular media outlets by Ontario-based AI detection startup Originality AI. At its peak, just over a third of the websites were blocking the bot; that figure has since fallen to closer to a quarter. Within a smaller group of the most prominent media outlets, the block rate is still above 50 percent, but it is down from highs earlier this year of nearly 90 percent.

But in May, right after Dotdash Meredith announced a licensing deal with OpenAI, that number dropped significantly. It fell again at the end of May, when Vox announced its own deal, and once more in August, when WIRED’s parent company, Condé Nast, struck an agreement. The trend toward increased blocking appears to be over, at least for now.

These declines make obvious sense. When companies enter into partnerships and give permission for their data to be used, they no longer have an incentive to block it, so it follows that they would update their robots.txt files to allow crawling; make enough deals, and the overall percentage of sites blocking crawlers will almost certainly fall. Some outlets unblocked OpenAI’s crawlers on the same day they announced a deal, as The Atlantic did. Others took a few days to a few weeks, like Vox, which announced its partnership at the end of May but unblocked GPTBot on its properties toward the end of June.

Robots.txt is not legally binding, but it has long served as the standard governing web crawler behavior, and for most of the internet’s existence, people running websites expected one another to abide by it. When a WIRED investigation earlier this summer found that AI startup Perplexity was likely choosing to ignore robots.txt commands, Amazon’s cloud division launched an inquiry into whether Perplexity had violated its rules. It’s not a good look to ignore robots.txt, which is likely why so many prominent AI companies, including OpenAI, explicitly state that they use it to determine what to crawl. Originality AI CEO Jon Gillham believes this adds extra urgency to OpenAI’s push to strike deals. “It is clear that OpenAI views being blocked as a threat to its future ambitions,” says Gillham.
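As a rough illustration of what “using robots.txt to determine what to crawl” means in practice, a well-behaved crawler checks the file before fetching a page. The sketch below uses Python’s standard urllib.robotparser module; the domain, path, and user-agent string are placeholders, not a description of any specific company’s crawler:

    from urllib.robotparser import RobotFileParser

    # Download and parse the site's robots.txt (placeholder domain).
    parser = RobotFileParser()
    parser.set_url("https://example.com/robots.txt")
    parser.read()

    # Ask whether the "GPTBot" user agent may fetch a given page.
    # A compliant crawler skips the URL when this returns False.
    url = "https://example.com/some-article"
    if parser.can_fetch("GPTBot", url):
        print("robots.txt permits crawling", url)
    else:
        print("robots.txt disallows crawling", url)

Nothing in the protocol enforces that check, which is why compliance remains a matter of convention and reputation rather than law.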
