Some bots and crawlers are particularly aggressive and can put undue load on your servers, causing slow response times or even outages.
This document describes available approaches to block these crawlers.
Using robots.txt
robots.txt is a web standard that tells crawlers, indexers, and bots how to behave. This file includes instructions describing which web paths are and aren’t crawlable and can set limits on the number of requests a crawler may send.
Softrip has a constantly updated standard robots.txt with our recommended settings; reach out to your Softrip contact for the latest version.
Sample Softrip standard robots.txt:
User-agent: *
Crawl-delay: 5
Disallow: /apps/
Disallow: /appsint/
Disallow: /aspnet_client/
Disallow: /bin/
Disallow: /bin-BKP/
Disallow: /certificates/
Disallow: /cms/
Disallow: /crm/
Disallow: /Home/
Disallow: /includes/
Disallow: /pdf/
Disallow: /policy/
Disallow: /product/
Disallow: /res/
Disallow: /reservations/
Disallow: /rss/
Disallow: /shared/
Disallow: /softripnext/
Disallow: /STNAttach/
Disallow: /STNView/
Disallow: /stw/
Disallow: /stsw/
Disallow: /temp/
Disallow: /test/
Disallow: /testing/
Disallow: /view_invoice/
Disallow: /view_voucher/
Disallow: /webctrl_client/
Disallow: /groups/*
Allow: /groups/$
Disallow: /Cms/
Disallow: /cms/
Allow: /cms/xmlsitemap

#Block Amazon crawler
User-agent: Amazonbot
Disallow: /

#Block dotbot
User-agent: dotbot
Disallow: /

#Block Yandex
User-agent: Yandex
Disallow: /

#Block all Semrush crawlers/bots
User-agent: SemrushBot
Disallow: /
User-agent: SplitSignalBot
Disallow: /
User-agent: SiteAuditBot
Disallow: /
User-agent: SemrushBot-BA
Disallow: /
User-agent: SemrushBot-SI
Disallow: /
User-agent: SemrushBot-SWA
Disallow: /
User-agent: SemrushBot-CT
Disallow: /
User-agent: SemrushBot-BM
Disallow: /

#Block PetalBot
User-agent: PetalBot
Disallow: /

# Block Claude (LLM Scraper)
User-agent: ClaudeBot
Disallow: /

# Block Common Crawl (LLM Scraper)
User-agent: CCBot
Disallow: /

# Block GPT bot (OpenAI Scraper)
User-agent: GPTBot
Disallow: /

# Block Facebook
User-agent: facebookexternalhit
Crawl-delay: 10
Disallow: /
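Before deploying a robots.txt, it can help to check locally how a compliant crawler would interpret it. The following is a minimal sketch using Python's standard-library urllib.robotparser on a shortened excerpt of the sample above. ExampleBot and the /tours/ path are hypothetical names used only for illustration. This parser matches paths by simple prefix and, as far as we are aware, does not interpret the * and $ wildcards used in the /groups/ rules, so the excerpt leaves those rules out.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the sample robots.txt (wildcard rules omitted).
ROBOTS_EXCERPT = """\
User-agent: *
Crawl-delay: 5
Disallow: /apps/
Disallow: /reservations/

User-agent: GPTBot
Disallow: /
"""


def load_rules(text):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_EXCERPT)

# GPTBot has its own group that disallows the whole site.
print(rules.can_fetch("GPTBot", "/tours/"))          # False
# ExampleBot (hypothetical) falls back to the "*" group.
print(rules.can_fetch("ExampleBot", "/tours/"))      # True
print(rules.can_fetch("ExampleBot", "/apps/login"))  # False
print(rules.crawl_delay("ExampleBot"))               # 5
```

A check like this only shows how a well-behaved crawler should read the file; it says nothing about crawlers that ignore robots.txt.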
Using IIS Request Filtering
Some bots and crawlers ignore robots.txt (for example, “Facebook external hit”). In addition, crawlers that do respect robots.txt may cache it for a long time, so a fix to that file may not take effect for hours.
In those cases, IIS can be configured to reject requests for specific user agents.
In your site’s root web.config, add the following section, with one entry under denyStrings for each user agent you want to block:
<configuration>
  [...]
  <system.webServer>
    [...]
    <security>
      <requestFiltering>
        <filteringRules>
          <filteringRule name="Block Bots and Crawlers" scanUrl="false" scanQueryString="false">
            <scanHeaders>
              <add requestHeader="User-Agent" />
            </scanHeaders>
            <denyStrings>
              <add string="facebookexternalhit" /> <!-- Block Facebook crawler DDoS -->
              <add string="GPTBot" /> <!-- Block OpenAI GPT crawler -->
            </denyStrings>
          </filteringRule>
        </filteringRules>
      </requestFiltering>
    </security>
  </system.webServer>
</configuration>
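A malformed web.config can take the whole site down, so it may be worth validating the edited file before deploying it. The sketch below, in Python, parses a copy of the configuration and lists the strings denied for the User-Agent header. The helper names denied_user_agents and is_blocked are hypothetical. The is_blocked probe assumes that IIS answers a request rejected by a filtering rule with HTTP 404, which is the usual behavior but should be confirmed against your own server.

```python
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

# Copy of the section above, without the [...] placeholders.
SAMPLE_CONFIG = """\
<configuration>
  <system.webServer>
    <security>
      <requestFiltering>
        <filteringRules>
          <filteringRule name="Block Bots and Crawlers" scanUrl="false" scanQueryString="false">
            <scanHeaders>
              <add requestHeader="User-Agent" />
            </scanHeaders>
            <denyStrings>
              <add string="facebookexternalhit" />
              <add string="GPTBot" />
            </denyStrings>
          </filteringRule>
        </filteringRules>
      </requestFiltering>
    </security>
  </system.webServer>
</configuration>
"""


def denied_user_agents(config_xml):
    """Return the denyStrings of every rule that scans the User-Agent header."""
    root = ET.fromstring(config_xml)  # raises ParseError if the XML is malformed
    denied = []
    for rule in root.iter("filteringRule"):
        headers = [h.get("requestHeader", "") for h in rule.findall("scanHeaders/add")]
        if any(h.lower() == "user-agent" for h in headers):
            denied.extend(d.get("string") for d in rule.findall("denyStrings/add"))
    return denied


def is_blocked(url, user_agent):
    """Probe a live site; assumes a filtered request is answered with HTTP 404."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10):
            return False
    except urllib.error.HTTPError as err:
        return err.code == 404


print(denied_user_agents(SAMPLE_CONFIG))  # ['facebookexternalhit', 'GPTBot']
```

After deployment, a call such as is_blocked("https://your-site.example/", "GPTBot") against your own hostname (the URL here is a placeholder) gives a quick indication of whether the rule is active.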