Commit

chore: add OAI-SearchBot
cdransf committed Jul 26, 2024
1 parent c17cae6 commit 2972926
Showing 2 changed files with 2 additions and 0 deletions.
1 change: 1 addition & 0 deletions robots.txt
@@ -17,6 +17,7 @@ User-agent: GoogleOther-Video
User-agent: GPTBot
User-agent: ImagesiftBot
User-agent: img2dataset
+User-agent: OAI-SearchBot
User-agent: omgili
User-agent: omgilibot
User-agent: PerplexityBot
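
One way to sanity-check a change like this is to parse the resulting rules with Python's standard `urllib.robotparser` and confirm that the new agent is matched. The following is only a sketch: the hunk shows the `User-agent` lines but not the rule that follows them, so the trailing `Disallow: /` is an assumption, and `ROBOTS_TXT`, `is_allowed`, and the `example.com` URL are hypothetical names chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Excerpt modeled on the hunk above. The final "Disallow: /" line is an
# ASSUMPTION: the diff does not show which rule follows the agent list.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ImagesiftBot
User-agent: img2dataset
User-agent: OAI-SearchBot
User-agent: omgili
User-agent: omgilibot
User-agent: PerplexityBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("OAI-SearchBot", "GPTBot", "SomeOtherBot"):
        print(agent, "allowed" if is_allowed(ROBOTS_TXT, agent) else "blocked")
```

Under the assumed `Disallow: /` rule, agents listed in the group are reported as blocked, while an agent that appears nowhere in the file falls through to the parser's default of allowing the fetch.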
1 change: 1 addition & 0 deletions table-of-bot-metrics.md
@@ -17,6 +17,7 @@
|GoogleOther-Video | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Scrapes data. | No information | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
|GPTBot | [OpenAI](https://openai.com) | Yes | Scrapes data to train OpenAI's products. | No information | Data is used to train current and future models, removed paywalled data, PII and data that violates the company's policies. |
| img2dataset | [img2dataset](https://github.com/rom1504/img2dataset) | At the discretion of img2dataset users. | Scrapes images for use in LLMs. | At the discretion of img2dataset users. | Downloads large sets of images into datasets for LLM training or other purposes. |
+|OAI-SearchBot | [OpenAI](https://openai.com) | [Yes](https://platform.openai.com/docs/bots) | Search result generation. | No information | Crawls sites to surface as results in SearchGPT. |
|omgili | [Webz.io](https://webz.io/) | [Yes](https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/) | Data is sold. | No information | Crawls sites for APIs used by Hootsuite, Sprinklr, NetBase, and other companies. Data also sold for research purposes or LLM training. |
|omgilibot | [Webz.io](https://webz.io/) | [Yes](https://web.archive.org/web/20170704003301/http://omgili.com/Crawler.html) | Data is sold. | No information | Legacy user agent initially used for Omgili search engine. Unknown if still used, `omgili` agent still used by Webz.io. |
|PerplexityBot | [Perplexity](https://www.perplexity.ai/) | [No](https://www.macstories.net/stories/wired-confirms-perplexity-is-bypassing-efforts-by-websites-to-block-its-web-crawler/) | Used to answer queries at the request of users. | Takes action based on user prompts. | Operated by Perplexity to obtain results in response to user queries. |