What is GPTBot | OpenAI's Web Crawler

By gobrain

Dec 31st, 2023

Here is OpenAI's new web crawler: GPTBot

AI bots are trained on massive amounts of text data from various sources, including books, websites, articles, and more. AI companies often gather this data from websites using web crawlers, which are computer programs designed to browse the internet automatically.

Now, OpenAI has introduced its new crawler, GPTBot.

What is GPTBot

GPTBot is a web crawler that OpenAI uses to gather text and code from the internet for training its AI models. It identifies itself with the following user agent string:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +
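As a rough illustration, a server-side script could look for the GPTBot token in an incoming User-Agent header. The following is a minimal sketch, not an official OpenAI tool: the function name `gptbot_version` is hypothetical, and the sample string is the user agent exactly as quoted above (its trailing part is cut off in this article). Keep in mind that any client can send any User-Agent header, so a match is only a hint.

```python
import re
from typing import Optional

# Matches the "GPTBot/<version>" token anywhere in a User-Agent header.
GPTBOT_TOKEN = re.compile(r"GPTBot/([\d.]+)", re.IGNORECASE)


def gptbot_version(user_agent: str) -> Optional[str]:
    """Return the GPTBot version found in the header, or None if absent.

    Hypothetical helper for illustration; User-Agent headers can be spoofed,
    so treat a match as a hint rather than proof of the crawler's identity.
    """
    match = GPTBOT_TOKEN.search(user_agent)
    return match.group(1) if match else None


# The user agent as quoted in this article (its trailing part is cut off here).
sample = "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +"

print(gptbot_version(sample))                              # 1.0
print(gptbot_version("Mozilla/5.0 (X11; Linux x86_64)"))   # None
```

A check like this could be used to count GPTBot visits in server logs before deciding whether to block the crawler.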

GPTBot is designed to crawl websites that are publicly available and don't require payment to access. It also avoids training the model on websites that gather personal information or contain text that violates OpenAI's policies.

OpenAI has come under criticism over the crawler's potential to intrude on the privacy of website owners and users. To address these concerns, OpenAI has taken measures to exclude websites that collect personal information and to let website owners prevent GPTBot from using their sites.

How To Block GPTBot From Scraping Websites

Some website owners may not want OpenAI to obtain data from their websites. You can block GPTBot from scraping your website by adding the following lines to your robots.txt file:

User-agent: GPTBot
Disallow: /

This tells GPTBot that it is not allowed to crawl any part of your website. You can also allow or block GPTBot for specific paths:

User-agent: GPTBot
Allow: /path/to/allow
Disallow: /path/to/disallow

This tells GPTBot which paths it may crawl and extract data from, and which it may not.
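Before deploying rules like these, it can help to check how a robots.txt parser reads them. The sketch below uses Python's standard `urllib.robotparser` module on the two rule sets shown in this section. The helper name `is_allowed` and the second crawler name `OtherBot` are hypothetical, and the result reflects how Python's parser interprets the rules, which may differ in edge cases from how GPTBot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets from this section, as they would appear in robots.txt.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

PER_PATH = """\
User-agent: GPTBot
Allow: /path/to/allow
Disallow: /path/to/disallow
"""


def is_allowed(robots_txt: str, path: str, agent: str = "GPTBot") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text.

    Hypothetical helper for illustration; it parses the text in memory
    instead of downloading a live robots.txt file.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# With the first rule set, GPTBot is blocked everywhere; other crawlers are not.
print(is_allowed(BLOCK_ALL, "/any/page"))                    # False
print(is_allowed(BLOCK_ALL, "/any/page", agent="OtherBot"))  # True

# With the second rule set, the result depends on the path.
print(is_allowed(PER_PATH, "/path/to/allow/page"))           # True
print(is_allowed(PER_PATH, "/path/to/disallow/page"))        # False
```

Python's parser applies the first matching rule in file order, while the Robots Exclusion Protocol standard (RFC 9309) prefers the most specific match. The two paths in this example do not overlap, so both approaches give the same answer here.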


In summary, GPTBot is OpenAI's new crawler for gathering data from websites to train its models. OpenAI aims to address privacy concerns and to filter out inaccurate data from the web.

