Troubleshooting Site Audit Manual

Is your Site Audit not running properly?

There are a number of reasons why pages could be blocked from the Site Audit crawler based on your website’s configuration and structure, including:  

  • Robots.txt blocking crawler
  • Crawl scope excluding certain areas of the site
  • Website is not directly online due to shared hosting
  • Pages are behind a gateway or in a user-only area of the site
  • Crawler blocked by noindex tag
  • Domain could not be resolved by DNS (the domain entered in setup is offline)
  • Website content is built with JavaScript (our system only checks fixed website content and can provide only a partial audit of dynamic elements)

Troubleshooting Steps

Follow these troubleshooting steps to see if you can make any adjustments on your own before reaching out to our support team for help.

  1. Check your Robots.txt file for disallow commands
  2. Remove noindex code on site
  3. Whitelist the SEMrushBot
  4. Check your account limits
  5. Make sure there are proper redirects to relevant versions of your site
  6. Change crawl source to sitemap
  7. Change the User Agent (from SEMrushBot to GoogleBot)
  8. Contact SEMrush Support for further assistance

Robots.txt

A Robots.txt file gives instructions to bots about how to crawl (or not crawl) the pages of a website.

You can inspect your Robots.txt to see if there are any disallow commands that would prevent crawlers like ours from accessing your website. To check the Robots.txt file of a website, enter the root domain of your site, followed by /robots.txt. For example, the robots.txt file on target.com is found at http://www.target.com/robots.txt  
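For example, a rule like the following (a hypothetical illustration, not your actual file) would block our Site Audit bot from the entire site:

User-agent: SemrushBot-SA
Disallow: /

A "Disallow: /" directive blocks the named bot from every page; the same directive under "User-agent: *" would block all crawlers, including ours.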

To allow the Semrush-SA Bot to crawl your site, add the following into your robots.txt file:

User-agent: SemrushBot-SA
Disallow:   

(leave a blank space after “Disallow:”)

Remove NOINDEX Code on Site

If you see the following code on the main page of a website, it tells us that we’re not allowed to index/follow links on it and our access is blocked.

<meta name="robots" content="noindex, nofollow">

Similarly, a page whose robots meta tag contains any of the values "noindex", "nofollow", or "none" will cause a crawl error.

To allow our bot to crawl such a page, remove the “noindex” tag from your page’s code.
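For illustration, once the "noindex" and "nofollow" values are removed, the tag can simply be deleted, or changed so that it explicitly allows indexing and following (these are the default behaviors anyway):

<meta name="robots" content="index, follow">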

Whitelist SEMrushBot

To whitelist the bot, contact your webmaster or hosting provider and ask them to whitelist SemrushBot-SA.

The bot's IP addresses are 46.229.173.67 and 46.229.173.66.

The bot connects over the standard ports: 80 for HTTP and 443 for HTTPS.
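If you or your webmaster manage the server configuration directly, it is also worth checking for rules that block crawlers by user agent. As a hedged illustration only (assuming an Apache server that uses .htaccess and mod_setenvif; this is not a rule taken from your site), a block like the following would turn SemrushBot away and would need to be removed or adjusted as part of whitelisting:

# Apache .htaccess: this rule flags SemrushBot's user agent and denies it access
BrowserMatchNoCase "SemrushBot" bad_bot
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>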

If you use any plugins (WordPress, for example) or CDNs (content delivery networks) to manage your site, you will have to whitelist the bot's IP addresses within those as well.

For whitelisting on WordPress, contact WordPress support.

Common CDNs and firewalls that block our crawler include:

  • Cloudflare - see Cloudflare's whitelisting documentation.
  • Incapsula - see Incapsula's whitelisting documentation (add SEMrush as a "Good bot").
  • ModSecurity - see ModSecurity's whitelisting documentation.
  • Sucuri - see Sucuri's whitelisting documentation.

Please note: If you have shared hosting, it is possible that your hosting provider may not allow you to whitelist any bots or edit the Robots.txt file.

Check Account Limits

To see how much of your current crawl budget has been used, go to Profile - Subscription Info and look for “Pages to crawl” under “My plan.”

Depending on your subscription level, you are limited to a set number of pages that you can crawl in a month (your monthly crawl budget). If you go over the number of pages allowed within your subscription, you'll have to purchase additional limits or wait until the next month, when your limits refresh.

Proper Redirects (for DNS Issues)

If the domain could not be resolved by DNS, it likely means that the domain you entered during configuration is offline. Commonly, users have this issue when entering a root domain (example.com) without realizing that the root domain version of their site doesn’t exist and the WWW version of their site would need to be entered instead (www.example.com).  

To prevent this issue, the website owner could add a redirect from the non-working "example.com" to the "www.example.com" version that actually exists on the server. The issue can also occur the other way around: if the WWW version is the one that doesn't exist but the root domain does, redirect the WWW version to the root domain instead.
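As a sketch only (assuming an Apache server with mod_rewrite enabled; the exact method depends on your hosting setup), a permanent redirect from the root domain to the WWW version could look like this in .htaccess:

# Redirect example.com to www.example.com with a permanent (301) redirect
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]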

Change Crawl Source (JavaScript)

SEMrush cannot parse JavaScript content at this time, so if your homepage has links to the rest of your site hidden in JavaScript elements, we will not be able to read them and crawl those pages.

However, if you implement an AJAX crawling scheme, Site Audit will find the links in your JavaScript and follow them to the content on your site that they point to. All you have to do is re-run your campaign and change the crawl source from Website to Sitemap. You can read more about this in our news release.

To make sure our crawl doesn't miss the most important pages on your website, you can change the crawl source from Website to Sitemap; that way, we won't miss any pages that are listed in your sitemap.
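For reference, a sitemap is an XML file that lists the URLs you want crawled, typically saved as sitemap.xml in the site root. The file below is a minimal placeholder example (the URLs are hypothetical):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
  </url>
  <url>
    <loc>https://www.example.com/products/</loc>
  </url>
</urlset>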

Although we cannot crawl JavaScript content, we can crawl the HTML of a page that has some JS elements and we can review the parameters of your JS and CSS files with our Performance checks.

Change User Agent

Your website may be blocking SEMrushBot in its robots.txt file while still allowing Google's crawler. In that case, you can change the User Agent from SEMrushBot to GoogleBot, and your website is likely to let it through. To make this change, open the settings gear in your Project and select User Agent.
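For illustration (a hypothetical robots.txt, not your actual file), a configuration like the one below blocks SemrushBot from the whole site while leaving Googlebot unrestricted - exactly the situation where switching the User Agent helps:

User-agent: SemrushBot
Disallow: /

User-agent: Googlebot
Disallow: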


Contact SEMrush Support

If you are still having issues running your Site Audit, send an email to [email protected] or call us at the number in the website footer to explain your problem.

Further reading: Check out our 2017 study of the most common technical SEO mistakes.