...
Using robots.txt
robots.txt is a web standard that tells crawlers, indexers, and bots how to behave. The file lists directives describing which paths on the site may and may not be crawled, and it can ask crawlers to limit how frequently they send requests.
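The directives are plain text served from the site root at /robots.txt. As a minimal illustration of the directive types used in the full file below (the paths, bot name, and delay value here are placeholders, not recommendations):

```
User-agent: *            # rules for every crawler that reads robots.txt
Crawl-delay: 5           # ask for at least 5 seconds between fetches (not all bots honor this)
Disallow: /private/      # do not crawl anything under /private/
Allow: /private/help/    # except this subtree

User-agent: ExampleBot   # rules for one specific crawler, matched by name
Disallow: /              # deny it the entire site
```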
...
```
User-agent: *
Crawl-delay: 5
Disallow: /apps/
Disallow: /appsint/
Disallow: /aspnet_client/
Disallow: /bin/
Disallow: /bin-BKP/
Disallow: /certificates/
Disallow: /cms/
Disallow: /crm/
Disallow: /Home/
Disallow: /includes/
Disallow: /pdf/
Disallow: /policy/
Disallow: /product/
Disallow: /res/
Disallow: /reservations/
Disallow: /rss/
Disallow: /shared/
Disallow: /softripnext/
Disallow: /STNAttach/
Disallow: /STNView/
Disallow: /stw/
Disallow: /stsw/
Disallow: /temp/
Disallow: /test/
Disallow: /testing/
Disallow: /view_invoice/
Disallow: /view_voucher/
Disallow: /webctrl_client/
Disallow: /groups/*
Allow: /groups/$
Disallow: /Cms/
Disallow: /cms/
Allow: /cms/xmlsitemap

# Block Amazon crawler
User-agent: Amazonbot
Disallow: /

# Block dotbot
User-agent: dotbot
Disallow: /

# Block Yandex
User-agent: Yandex
Disallow: /

# Block all Semrush crawlers/bots
User-agent: SemrushBot
Disallow: /

User-agent: SplitSignalBot
Disallow: /

User-agent: SiteAuditBot
Disallow: /

User-agent: SemrushBot-BA
Disallow: /

User-agent: SemrushBot-SI
Disallow: /

User-agent: SemrushBot-SWA
Disallow: /

User-agent: SemrushBot-CT
Disallow: /

User-agent: SemrushBot-BM
Disallow: /

# Block PetalBot
User-agent: PetalBot
Disallow: /

# Block Claude (LLM Scraper)
User-agent: ClaudeBot
Crawl-delay: 100
Disallow: /

# Block Common Crawl (LLM Scraper)
User-agent: CCBot
Crawl-delay: 100
Disallow: /

# Block GPT bot (OpenAI Scraper)
User-agent: GPTBot
Crawl-delay: 100
Disallow: /

# Block OAI-SearchBot (OpenAI Search Bot)
User-agent: OAI-SearchBot
Crawl-delay: 100
Disallow: /

# Block Facebook/Meta
User-agent: facebookexternalhit
Crawl-delay: 100
Disallow: /

# Block Facebook/Meta
User-agent: meta-externalagent
Crawl-delay: 100
Disallow: /
```
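A quick way to confirm the rules parse the way you expect is Python's built-in urllib.robotparser. A minimal sketch, assuming a placeholder site URL (note that Python applies Allow/Disallow rules in file order, so precedence can differ slightly from crawlers that use longest-match rules):

```python
# Sanity-check the published robots.txt with Python's standard library.
# This only evaluates the rules a compliant crawler would see; it cannot
# force a non-compliant bot to obey them.
from urllib.robotparser import RobotFileParser

ROBOTS_URL = "https://www.example.com/robots.txt"  # placeholder URL

parser = RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # fetch and parse the live file

# User-agent / path pairs to spot-check; adjust to the rules you care about.
checks = [
    ("GPTBot", "/"),                 # fully disallowed above
    ("ClaudeBot", "/product/x"),     # fully disallowed above
    ("Googlebot", "/"),              # falls under the wildcard group; the root is allowed
    ("Googlebot", "/reservations/"), # disallowed by the wildcard group
]

for agent, path in checks:
    allowed = parser.can_fetch(agent, path)
    delay = parser.crawl_delay(agent)
    print(f"{agent:12s} {path:18s} allowed={allowed} crawl-delay={delay}")
```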
Using IIS Request Filtering
...
```xml
<configuration>
  [...]
  <system.webServer>
    [...]
    <security>
      <requestFiltering>
        <filteringRules>
          <filteringRule name="Block Bots and Crawlers" scanUrl="false" scanQueryString="false">
            <scanHeaders>
              <add requestHeader="User-Agent" />
            </scanHeaders>
            <denyStrings>
              <add string="facebookexternalhit" /> <!-- Block Facebook crawler (DDoS mitigation) -->
              <add string="meta-externalagent" />  <!-- Block Meta/Facebook crawler -->
              <add string="GPTBot" />              <!-- Block OpenAI GPT crawler -->
              <add string="OAI-SearchBot" />       <!-- Block OpenAI search crawler -->
            </denyStrings>
          </filteringRule>
        </filteringRules>
      </requestFiltering>
    </security>
  </system.webServer>
</configuration>
```
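Once the rule is in place, it can be spot-checked from outside by sending requests with a blocked User-Agent string; by default, Request Filtering rejects matching requests with an HTTP 404 (the detailed substatus appears only in the IIS logs). A minimal sketch, assuming a placeholder hostname:

```python
# Probe the site with different User-Agent headers and report the status code.
# The hostname is a placeholder; blocked agents should come back as HTTP 404.
import urllib.error
import urllib.request

SITE = "https://www.example.com/"  # placeholder

def probe(user_agent: str) -> int:
    """Request the site root with the given User-Agent and return the HTTP status."""
    req = urllib.request.Request(SITE, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

for agent in ("Mozilla/5.0 (ordinary browser)", "GPTBot/1.0", "facebookexternalhit/1.1"):
    print(f"{agent:35s} -> HTTP {probe(agent)}")
```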