Amazon is investigating Perplexity over scraping abuse allegations

Written by Elijah, June 27, 2024

Amazon’s cloud division has launched an investigation into Perplexity AI, examining whether the AI search startup is violating Amazon Web Services rules by scraping websites that tried to block it, WIRED has learned.

An AWS spokesperson, who spoke to WIRED on condition of not being named, confirmed the company’s investigation into Perplexity. WIRED had previously discovered that the startup, which is backed by Jeff Bezos’ family fund and Nvidia and was recently valued at $3 billion, appears to rely on content scraped from websites that had forbidden access via the Robots Exclusion Protocol, a common web standard. While the Robots Exclusion Protocol is not legally binding, terms of service generally are.

The Robots Exclusion Protocol is a decades-old web standard that involves placing a plain text file (such as wired.com/robots.txt) on a domain to indicate which pages robots and automated crawlers should not access. While companies using scrapers may choose to ignore this protocol, most have traditionally respected it. The Amazon spokesperson told WIRED that AWS customers must comply with the robots.txt standard while crawling websites.
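For readers unfamiliar with the mechanics, here is a minimal sketch of how a well-behaved crawler might consult a robots.txt file before fetching a page, using Python's standard-library `urllib.robotparser`. The file contents, the `ExampleBot` and `OtherBot` names, and the example.com URLs are hypothetical and chosen only for illustration; they are not taken from any site or crawler named in this story.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (an assumption for illustration only).
# ExampleBot is barred from the whole site; every other crawler is barred
# only from paths under /private/.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.com/articles/1"),
        ("OtherBot", "https://example.com/articles/1"),
        ("OtherBot", "https://example.com/private/x"),
    ]:
        print(agent, url, is_allowed(EXAMPLE_ROBOTS_TXT, agent, url))
```

Nothing in the protocol enforces the answer: the check is voluntary, and a crawler that never performs it, or that ignores the result, can still request the page.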

“AWS’s Terms of Service prohibit customers from using our services for any illegal activity, and our customers are responsible for complying with our Terms and all applicable laws,” the spokesperson said in a statement.

Scrutiny of Perplexity’s practices follows a June 11 report from Forbes that accused the startup of stealing at least one of its articles. WIRED’s own investigation confirmed the practice and found further evidence of scraping abuse and plagiarism by systems linked to Perplexity’s AI-powered search chatbot. Engineers at Condé Nast, WIRED’s parent company, block the Perplexity crawler on all of their websites using a robots.txt file. But WIRED discovered that the company had access to a server using an unpublished IP address (44.221.181.252) that visited Condé Nast properties at least hundreds of times over the past three months, apparently to scrape Condé Nast websites.

The machine associated with Perplexity appears to be involved in widespread crawling of news websites that prohibit bots from accessing their content. Spokespeople for The Guardian, Forbes and The New York Times also say they detected the IP address on their servers multiple times.

WIRED traced the IP address to a virtual machine known as an Elastic Compute Cloud (EC2) instance hosted on AWS. AWS began its investigation after we asked whether using its infrastructure to scrape websites that forbade it violated the company’s terms of service.

Last week, Perplexity CEO Aravind Srinivas responded to WIRED’s investigation by first saying that the questions we posed to the company “reflect a deep and fundamental misunderstanding of how Perplexity and the Internet work.” Srinivas then told Fast Company that the secret IP address WIRED observed crawling Condé Nast websites and a test site we created was operated by a third-party company that performs web crawling and indexing services. He declined to name the company, citing a confidentiality agreement. Asked if he would tell the third party to stop crawling WIRED, Srinivas replied, “It’s complicated.”
