OpenAI now allows you to block its web crawler from scraping your site to help train GPT models.
In a blog post, OpenAI said that website operators can specifically disallow its GPTBot crawler in their site’s robots.txt file or block its IP address. “Web pages crawled with the GPTBot user agent can potentially be used to improve future models and are filtered to remove sources that require paid access, are known to collect personally identifiable information (PII), or have text that violates our policies,” OpenAI said in the blog post. For sources that do not fit the excluded criteria, “allowing GPTBot to access your site can help make the AI models more accurate and improve their overall capabilities and security.”
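The opt-out uses the standard Robots Exclusion Protocol. A minimal sketch of what such a robots.txt entry could look like, assuming the `GPTBot` user-agent token named in OpenAI’s post (the example paths are illustrative, not from the announcement):

```
# Block OpenAI's GPTBot from the entire site
User-agent: GPTBot
Disallow: /

# Alternatively, allow some sections while blocking others
# User-agent: GPTBot
# Allow: /blog/
# Disallow: /members/
```

The file lives at the root of the domain (e.g. `example.com/robots.txt`); compliance is voluntary on the crawler’s part, which is why OpenAI also publishes the bot’s IP ranges for operators who prefer to block at the network level.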
Locking down GPTBot may be a first step by OpenAI toward letting Internet users opt out of having their data used to train its large language models. It follows earlier attempts to create a flag that would exclude content from training, such as the “NoAI” tag DeviantArt devised last year. Blocking the crawler does not retroactively remove content previously scraped from a site from ChatGPT’s training data.
The Internet provided much of the training data for large language models such as OpenAI’s GPT and Google’s Bard. However, OpenAI will not confirm whether it obtained data from social media posts or copyrighted works, or which parts of the Internet it scraped for information. And obtaining data for AI training has become increasingly contentious. Sites including Reddit and Twitter have pushed to crack down on AI companies’ free use of their users’ posts, while authors and other creatives have sued over alleged unauthorized use of their works. Lawmakers also grappled with consent and data privacy issues in several Senate hearings on AI regulation last month.
As reported by Axios, companies such as Adobe have floated the idea of marking data as off-limits for training through an anti-impersonation law. AI companies, including OpenAI, have signed an agreement with the White House to develop a watermarking system to let people know when something was generated by AI, but they have not promised to stop using Internet data for training.