Dealing with Aggressive Bots and Crawlers

Some bots and crawlers are particularly aggressive and can put undue load on your servers, leading to slow response times, degraded performance, or outright outages.
This document describes the approaches available for blocking these crawlers.

Using robots.txt

robots.txt is a web standard that tells crawlers, indexers, and bots how to behave. The file lists which paths on a site may and may not be crawled, and can ask crawlers to throttle themselves by specifying a delay between requests (the Crawl-delay directive, which not every crawler honors).
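
For example, a minimal robots.txt that applies to every crawler, asks for a five-second pause between requests, and keeps crawlers out of a single directory looks like this (the /private/ path here is just a placeholder):

# Apply to every crawler
User-agent: *
# Ask for at least 5 seconds between requests (not all crawlers honor this)
Crawl-delay: 5
# Keep crawlers out of /private/; everything else remains crawlable
Disallow: /private/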

Softrip maintains a constantly updated standard robots.txt with our recommended settings; reach out to your Softrip contact for the latest version.

Sample Softrip standard robots.txt:

User-agent: *
Crawl-delay: 5
Disallow: /apps/
Disallow: /appsint/
Disallow: /aspnet_client/
Disallow: /bin/
Disallow: /bin-BKP/
Disallow: /certificates/
Disallow: /cms/
Disallow: /crm/
Disallow: /Home/
Disallow: /includes/
Disallow: /pdf/
Disallow: /policy/
Disallow: /product/
Disallow: /res/
Disallow: /reservations/
Disallow: /rss/
Disallow: /shared/
Disallow: /softripnext/
Disallow: /STNAttach/
Disallow: /STNView/
Disallow: /stw/
Disallow: /stsw/
Disallow: /temp/
Disallow: /test/
Disallow: /testing/
Disallow: /view_invoice/
Disallow: /view_voucher/
Disallow: /webctrl_client/
Disallow: /groups/*
Allow: /groups/$
Disallow: /Cms/
Disallow: /cms/
Allow: /cms/xmlsitemap

# Block Amazon crawler
User-agent: Amazonbot
Disallow: /

# Block dotbot
User-agent: dotbot
Disallow: /

# Block Yandex
User-agent: Yandex
Disallow: /

# Block all Semrush crawlers/bots
User-agent: SemrushBot
Disallow: /

User-agent: SplitSignalBot
Disallow: /

User-agent: SiteAuditBot
Disallow: /

User-agent: SemrushBot-BA
Disallow: /

User-agent: SemrushBot-SI
Disallow: /

User-agent: SemrushBot-SWA
Disallow: /

User-agent: SemrushBot-CT
Disallow: /

User-agent: SemrushBot-BM
Disallow: /

# Block PetalBot
User-agent: PetalBot
Disallow: /

# Block Claude (LLM scraper)
User-agent: ClaudeBot
Crawl-delay: 100
Disallow: /

# Block Common Crawl (LLM scraper)
User-agent: CCBot
Crawl-delay: 100
Disallow: /

# Block GPTBot (OpenAI scraper)
User-agent: GPTBot
Crawl-delay: 100
Disallow: /

# Block OAI-SearchBot (OpenAI search bot)
User-agent: OAI-SearchBot
Crawl-delay: 100
Disallow: /

# Block Facebook/Meta
User-agent: facebookexternalhit
Crawl-delay: 100
Disallow: /

# Block Facebook/Meta
User-agent: meta-externalagent
Crawl-delay: 100
Disallow: /
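
The file begins with a catch-all section (User-agent: *) that sets a five-second crawl delay and keeps all crawlers out of Softrip's internal paths, then adds per-bot sections that disallow the entire site for specific crawlers such as Amazonbot, the Semrush bots, the LLM scrapers (ClaudeBot, CCBot, GPTBot, OAI-SearchBot), and the Facebook/Meta crawlers.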

Using IIS Request Filtering

Some bots and crawlers ignore robots.txt entirely (for example, Facebook/Meta's facebookexternalhit). Even crawlers that do respect robots.txt may cache it for a long time, so a fix to that file may not take effect for hours.

In those cases, IIS can be configured to reject requests from specific user agents.

In your site’s root web.config, add the following section, with an entry under <denyStrings> for each user agent you want to block:

<configuration>
  [...]
  <system.webServer>
    [...]
    <security>
      <requestFiltering>
        <filteringRules>
          <filteringRule name="Block Bots and Crawlers" scanUrl="false" scanQueryString="false">
            <scanHeaders>
              <add requestHeader="User-Agent" />
            </scanHeaders>
            <denyStrings>
              <add string="facebookexternalhit" /> <!-- Block Facebook/Meta crawler -->
              <add string="meta-externalagent" />  <!-- Block Facebook/Meta crawler -->
              <add string="GPTBot" />              <!-- Block OpenAI GPT crawler -->
              <add string="OAI-SearchBot" />       <!-- Block OpenAI search bot -->
            </denyStrings>
          </filteringRule>
        </filteringRules>
      </requestFiltering>
    </security>
  </system.webServer>
</configuration>
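
With this rule in place, IIS rejects any request whose User-Agent header contains one of the deny strings (matching is a substring match, so "GPTBot" matches any user agent string that includes it). Blocked requests typically receive an HTTP 404 response from request filtering (substatus 404.19, "Denied by filtering rule") instead of being passed to the site.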