What is GPTBot | OpenAI's New Web Crawler


By gobrain

Jul 11th, 2024

Data is king in the ever-changing world of Artificial Intelligence (AI). AIs are trained on massive amounts of text and visual data from various sources including books, websites, articles, and more.

To gather data from websites, AI companies often use web crawlers, which are computer programs designed to browse the web automatically.

OpenAI has now introduced its own new crawler, GPTBot. Let's take a look at it.

What is GPTBot

GPTBot is a web crawler developed by OpenAI, specifically designed to surf the internet for information. It gathers this information to feed and improve OpenAI's various AI features, like ChatGPT and Sora.

It's recognized by the user agent string:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
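
As a rough illustration of how a server could spot this string in incoming requests, here is a minimal Python sketch. The helper name `gptbot_version` and the regular expression are assumptions made for this example, not something OpenAI publishes. The version number is captured rather than hard-coded, since it may differ from 1.0.

```python
import re
from typing import Optional

# Matches the "GPTBot/<version>" product token anywhere in a User-Agent header.
# The pattern is an assumption based on the string shown above.
GPTBOT_TOKEN = re.compile(r"\bGPTBot/(\d+(?:\.\d+)*)")


def gptbot_version(user_agent: str) -> Optional[str]:
    """Return the GPTBot version found in a User-Agent header, or None."""
    match = GPTBOT_TOKEN.search(user_agent)
    return match.group(1) if match else None


ua = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
    "GPTBot/1.0; +https://openai.com/gptbot)"
)
print(gptbot_version(ua))                                 # 1.0
print(gptbot_version("Mozilla/5.0 (X11; Linux x86_64)"))  # None
```

Keep in mind that any client can send any User-Agent header, so a match on its own does not prove that a request really came from OpenAI.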

OpenAI's GPTBot is designed to crawl websites that are publicly available and don't require payment to access. It also filters out sites that gather personal information or contain text violating OpenAI's policies, so that they are not used to train its models.

OpenAI has come under criticism for its potential to intrude on the privacy of website owners and users. To address these concerns, OpenAI excludes websites that collect personal information and lets site owners prevent GPTBot from using their sites.

How To Restrict GPTBot From Scraping Your Websites

Some website owners may not want OpenAI to obtain data from their websites. Website owners can therefore control whether GPTBot accesses their sites through a robots.txt file.

The following robots.txt file tells GPTBot that it is not allowed to crawl any part of the website:

User-agent: GPTBot
Disallow: /

You can also allow or block GPTBot for specific directories or pages, as shown below:

User-agent: GPTBot
Allow: /path/to/allow
Disallow: /path/to/disallow

This tells GPTBot which paths it is allowed to crawl and which it is not.
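
If you want to check rules like these before publishing them, one option is Python's standard `urllib.robotparser` module. The sketch below feeds it the two robots.txt examples from this section. The helper `parse_robots`, the `example.com` host, and the `OtherBot` name are hypothetical. Python's parser only stands in for GPTBot here and may differ from how a real crawler evaluates the file.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(text: str) -> RobotFileParser:
    """Build a parser from robots.txt content held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# The two robots.txt examples from this section.
block_all = parse_robots("User-agent: GPTBot\nDisallow: /\n")
per_path = parse_robots(
    "User-agent: GPTBot\n"
    "Allow: /path/to/allow\n"
    "Disallow: /path/to/disallow\n"
)

site = "https://example.com"  # hypothetical host, used only for illustration

print(block_all.can_fetch("GPTBot", site + "/"))                           # False
print(block_all.can_fetch("OtherBot", site + "/"))                         # True
print(per_path.can_fetch("GPTBot", site + "/path/to/allow/page.html"))     # True
print(per_path.can_fetch("GPTBot", site + "/path/to/disallow/page.html"))  # False
```

Note that `can_fetch` expects the short agent token (`GPTBot`), not the full user agent string.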

GPTBot vs. ChatGPT-User

ChatGPT-User is another user agent created by OpenAI, but for a different purpose than GPTBot. It's important to understand the difference between OpenAI's two user agents:

  • ChatGPT-User: This agent is used when ChatGPT browses the web on a user's behalf.
  • GPTBot: This is the web crawler that gathers information for AI development.

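Keep in mind that, although they serve separate purposes, both are controlled through robots.txt. Rules are matched by user agent, so a group addressed only to GPTBot covers only GPTBot. To restrict ChatGPT-User as well, name it too or use a User-agent: * group.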
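
To make one set of rules explicitly cover both agents, a robots.txt group can list each of them on its own User-agent line. The sketch below checks such a group with Python's standard `urllib.robotparser`. The `/private/` path and the `example.com` URL are hypothetical, and the parser only approximates how a real crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# One group naming both OpenAI agents; the /private/ path is hypothetical.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ChatGPT-User
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/private/report.html"  # hypothetical URL
results = {
    agent: parser.can_fetch(agent, url)
    for agent in ("GPTBot", "ChatGPT-User")
}
print(results)  # {'GPTBot': False, 'ChatGPT-User': False}
```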

Conclusion

In summary, GPTBot is OpenAI's new crawler for gathering data from websites to train its models. OpenAI aims to address privacy concerns by filtering out sources that collect personal information or violate its policies.

For more information on OpenAI's new crawler, you can visit here and here.

Thank you for reading.