A bot, also known as a web robot, web spider, or web crawler, is a software application designed to perform simple, repetitive tasks automatically, and to do so more efficiently and consistently than a person could.
The most common use of bots is in web spidering or web crawling.
SEMrushBot is the search bot software that SEMrush sends out to discover and collect new and updated web data.
Data collected by SEMrushBot is used for:
- the public backlink search engine index maintained as a dedicated tool called Backlink Analytics (webgraph of links)
- the Site Audit tool, which analyzes on-page SEO, technical and usability issues
- the Backlink Audit tool, which helps discover and clean up potentially dangerous backlinks in your profile
- the Link Building tool, which helps you find prospects, reach out to them and monitor your newly acquired backlinks
- the SEO Writing Assistant tool, to check whether a URL is accessible
- the Brand Monitoring tool, to index and search for articles
- reports in the Content Analyzer and Post Tracking tools
- reports in the On Page SEO Checker and SEO Content Template tools
- reports in the Topic Research tool
SEMrushBot’s crawl process starts with a list of webpage URLs. When SEMrushBot visits these URLs, it saves the hyperlinks found on each page and adds them to the list for further crawling. This list, also known as the "crawl frontier", is revisited repeatedly according to a set of SEMrush policies in order to map a site for updates: content changes, new pages, and dead links.
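The crawl frontier described above can be illustrated with a minimal sketch. This is not SEMrushBot's actual implementation: the function `crawl`, the `get_links` callback, and the small in-memory `SITE` link map are all hypothetical, and the revisit policies mentioned in the text are left out.

```python
from collections import deque

def crawl(seed_urls, get_links, max_pages=100):
    """Breadth-first crawl over a frontier of URLs.

    get_links is a hypothetical callback that returns the hyperlinks
    found on a page; a real crawler would fetch and parse the page here.
    """
    frontier = deque(seed_urls)   # URLs waiting to be visited
    seen = set(seed_urls)         # URLs already queued, to avoid duplicates
    visited = []                  # visit order, for inspection
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Illustrative link map standing in for a real website (made-up URLs).
SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/"],
}

order = crawl(["https://example.com/"], lambda url: SITE.get(url, []))
print(order)
```

Each page is queued once, even when several pages link to it, so the crawl ends once every reachable URL has been visited.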
Bots crawl your web pages to parse your site content, so that the relevant information on your site is easily indexed and more readily available to users searching for the content you provide.
Although most bots are harmless and even quite beneficial, you may still want to prevent them from crawling your site (please note, however, that not everyone on the web is using a bot to help index your site). The easiest and quickest way to do this is to use the robots.txt file. This text file contains instructions on how a bot should process your site data.
Important: The robots.txt file must be placed in the top directory of the website host to which it applies. Otherwise, it will have no effect on the SEMrushBot behavior.
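As a rough illustration of what "the top directory of the website host" means, the sketch below derives the robots.txt location from any page URL using Python's standard `urllib.parse`. The helper name `robots_txt_url` and the example.com URLs are assumptions made for this example.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://example.com/shop/item?id=1"))
print(robots_txt_url("https://blog.example.com/posts/42"))
print(robots_txt_url("http://example.com/shop/"))
```

The three calls yield three different locations, because each scheme and host combination has its own robots.txt file.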
To stop SEMrushBot from crawling your site, add the following rules to your robots.txt file:
To block SEMrushBot from crawling your site for a webgraph of links:
User-agent: SemrushBot
Disallow: /
SEMrushBot for Backlink Analytics also supports the following non-standard extensions to robots.txt:
- Crawl-delay directives. Our crawler can take intervals of up to 10 seconds between requests to a site. Higher values will be cut down to this 10-second limit. If no crawl-delay is specified, SEMrushBot will adjust the frequency of requests to your site according to the current server load.
- The use of wildcards (*).
- If you have subdomains, you need to place a robots.txt file on each subdomain. SEMrushBot does not consult robots.txt files located elsewhere on your domain, so without one it will assume that it is allowed to crawl everything on that subdomain.
- The robots.txt file must always return an HTTP 200 status code. If a 4xx status code is returned, SEMrushBot will assume that no robots.txt exists and there are no crawl restrictions. Returning a 5xx status code for your robots.txt file will prevent SEMrushBot from crawling your entire site. Our crawler can handle robots.txt files with a 3xx status code.
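The crawl-delay cap and the status-code handling described in the list above can be summarized in a short sketch. The function names and the returned labels are hypothetical, and "follow-redirect" is only one reading of "can handle robots.txt files with a 3xx status code"; none of this is SEMrushBot's own code.

```python
from typing import Optional

MAX_CRAWL_DELAY_SECONDS = 10  # upper limit stated in the text

def effective_crawl_delay(requested: Optional[float]) -> Optional[float]:
    """Return the delay to apply, in seconds.

    None means no Crawl-delay was given; per the text, the request rate
    is then adjusted to the current server load instead.
    """
    if requested is None:
        return None
    return min(requested, MAX_CRAWL_DELAY_SECONDS)

def robots_policy(status: int) -> str:
    """Map the HTTP status of robots.txt to the behavior described above."""
    if status == 200:
        return "apply-rules"        # parse the file and obey it
    if 300 <= status < 400:
        return "follow-redirect"    # assumed meaning of "can handle 3xx"
    if 400 <= status < 500:
        return "crawl-everything"   # treated as if no robots.txt exists
    if 500 <= status < 600:
        return "crawl-nothing"      # the entire site is skipped
    return "unspecified"            # not covered by the text

print(effective_crawl_delay(30))
print(robots_policy(404), robots_policy(503))
```

A Crawl-delay of 30 is cut down to the 10-second limit, while a 404 and a 503 on robots.txt lead to opposite outcomes.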
Please note that it may take up to one hour or 100 requests for SEMrushBot to discover changes made to your robots.txt.
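While waiting for changes to be picked up, a rule set can be checked locally. The sketch below uses Python's standard `urllib.robotparser`, so it shows how that parser reads the rules, which is not guaranteed to match SEMrushBot's own parser in every detail. The `Disallow: /` line, the example.com URL, and the `OtherBot` user agent are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: block the SemrushBot user agent from the whole site.
RULES = """\
User-agent: SemrushBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

blocked = not parser.can_fetch("SemrushBot", "https://example.com/page")
other_allowed = parser.can_fetch("OtherBot", "https://example.com/page")

print("SemrushBot blocked:", blocked)
print("OtherBot allowed:", other_allowed)
```

Because the rules name only one user agent, other bots remain unaffected.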
To block SEMrushBot from crawling your site for different SEO and technical issues:
To block SEMrushBot from crawling your site for the Backlink Audit tool:
To block SEMrushBot from crawling your site for the On Page SEO Checker tool and similar tools:
To block SEMrushBot from checking URLs on your site for the SEO Writing Assistant (SWA) tool:
To block SEMrushBot from crawling your site for Content Analyzer and Post Tracking tools:
To block SEMrushBot from crawling your site for Brand Monitoring:
To prevent "file not found" error messages in your web server log, create an empty "robots.txt" file.
Do not try to block SEMrushBot via IP as we do not use any consecutive IP blocks.
Why does SEMrush try to crawl a page that doesn’t exist or has strange URL parameters?
Generally, if SEMrushBot detects non-existent pages on your website, it stops crawling them. However, our crawler may continue to look for a page that no longer exists if other sites across the web link to it.
Why does SEMrushBot attempt to log in, try different passwords and submit survey forms?
This is because your login or survey form is submitted using the GET method. The input data becomes part of the requested URL and can be accessed by anyone who knows the URL, including our crawler. Although we try to skip such forms, we recommend that you switch the form method to POST.
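To show why GET forms end up in crawlable URLs, the sketch below builds the URL a browser would request when a GET form is submitted. The helper `get_form_url`, the example.com address, and the field names and values are made up for this example.

```python
from urllib.parse import urlencode

def get_form_url(action: str, fields: dict) -> str:
    """Build the URL requested when a form is submitted with method GET."""
    return f"{action}?{urlencode(fields)}"

url = get_form_url("https://example.com/login",
                   {"user": "alice", "password": "secret"})
print(url)  # https://example.com/login?user=alice&password=secret
```

With POST, the same fields travel in the request body, so following a link to the form's URL does not submit any data.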
SEMrushBot doesn’t obey our robots.txt commands after we have undergone a site migration.
If you migrate a site (for example, from HTTP to HTTPS), remember to migrate the robots.txt file as well, so that it is served from the new location.
For more information about bots, please refer to http://www.robotstxt.org/.
If you have any questions about SEMrushBot, please contact us at firstname.lastname@example.org and we will respond as soon as possible.
SEMrushBot needs some time to discover changes in your robots.txt file. However, if you think that it keeps ignoring your "robots.txt" rules for quite a long time, please provide us with your website URL, the log entries showing SEMrushBot crawling the pages that it was not supposed to, and we will work quickly to resolve the issue.
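A small script can help gather the log entries mentioned above. This is only a sketch: the function `semrushbot_entries` is hypothetical, and the sample lines, IP addresses, and user-agent strings are invented for illustration, so adjust the matching to your server's actual log format.

```python
def semrushbot_entries(log_lines, token="SemrushBot"):
    """Return the log lines that mention the given user-agent token."""
    needle = token.lower()
    return [line for line in log_lines if needle in line.lower()]

# Invented sample lines in a common access-log layout.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2020:10:00:00 +0000] "GET /private/ HTTP/1.1" 200 512 "-" "SemrushBot (illustrative UA string)"',
    '203.0.113.7 - - [01/Jan/2020:10:00:05 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "ExampleBrowser/1.0"',
]

matches = semrushbot_entries(SAMPLE_LOG)
for line in matches:
    print(line)
```

In practice the lines would be read from the server's access log file instead of a hard-coded list.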