
Troubleshooting Content Analyzer Manual

Is your Content Audit not running properly?

If you are reading this article, you might be facing one of the following problems during the configuration of the Content Audit:

  • "we couldn’t audit your domain. No sitemap files can be found at the specified URLs.”  
  • "your sitemap.xml file is invalid.”
  • or a similar note

Follow these troubleshooting steps to fix the most likely problems you could run into:

  1. Add sitemap manually
  2. Check your Robots.txt file
  3. Remove NOINDEX tag on your website
  4. Whitelist the SEMrush-Bot
  5. Make sure that the sitemap is correctly formatted
  6. Consider the Content Audit limitation
  7. I don't have a sitemap file yet, what should I do?
  8. How can I change the scope/sitemap of a Content Audit?
  9. How can I update the Content Audit results?
  10. What are the limitations of Post Tracking?
  11. Contact SEMrush Support for further assistance

Add sitemap manually

By default, the Content Audit tries to find your sitemap at any of these eight locations:

  • https://www.domain/sitemap_index.xml
  • http://www.domain/sitemap_index.xml
  • http://domain/sitemap_index.xml
  • https://domain/sitemap_index.xml
  • https://www.domain/sitemap.xml
  • http://www.domain/sitemap.xml
  • http://domain/sitemap.xml
  • https://domain/sitemap.xml

If we couldn’t find the sitemap automatically, you can use the “Add sitemap link” button to add the sitemap URL:

In some cases you may simply not be aware that your site already has a sitemap; we recommend checking with your web designer or SEO specialist.
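
If you want to check these locations yourself before using the “Add sitemap link” button, the quick sketch below (Python, standard library only) probes the eight candidate URLs. The domain example.com is a placeholder for your own domain.

import urllib.request
import urllib.error

# Placeholder: replace with your own domain.
DOMAIN = "example.com"

# The eight default locations, in the same order as the list above.
VARIANTS = [("https", "www."), ("http", "www."), ("http", ""), ("https", "")]
CANDIDATES = [
    f"{scheme}://{prefix}{DOMAIN}/{name}"
    for name in ("sitemap_index.xml", "sitemap.xml")
    for scheme, prefix in VARIANTS
]

for url in CANDIDATES:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            print(resp.status, url)
    except urllib.error.URLError as err:
        print("FAIL", url, err)

Any URL that returns 200 is a location where the Content Audit should be able to pick up your sitemap automatically.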

Check your Robots.txt file

We also take your Robots.txt file into account. Depending on its rules, this file can either help an audit start or prevent our bot from reaching your website.

A Robots.txt file gives instructions to bots about how to crawl (or not crawl) the pages of a website. To check the Robots.txt file of a website, enter the root domain of your site followed by /robots.txt. For example, the Robots.txt file on example.com is found at http://www.example.com/robots.txt.

You can inspect your Robots.txt to see if there are any disallow commands that would prevent crawlers like ours from accessing your website.

To allow the Semrush-Bot (Content-analyzer; 1.0; https://www.semrush.com/bot/) to crawl your site, add the following into your robots.txt file:

User-agent: SEMrush-Bot
Disallow:   

To help our bot find the sitemap automatically, you can add the following line anywhere in your robots.txt file to specify the path to your sitemap:

Sitemap: http://domain/sitemap_location.xml
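
To double-check both things at once (whether the bot's user agent is allowed and whether a Sitemap line is present), you can use Python's built-in robots.txt parser (Python 3.8 or newer). This is only a quick self-check sketch; the domain and page URL are placeholders, and the user agent string follows the example above.

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

# Is the bot allowed to fetch a sample page?
print("Allowed:", rp.can_fetch("SEMrush-Bot", "http://www.example.com/some-page/"))

# Which sitemaps are declared in robots.txt (None if there is no Sitemap line)?
print("Sitemaps:", rp.site_maps())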

Remove NOINDEX tag on your website

If you see the following code on the main page of a website, it tells us that we are not allowed to index the page or follow its links, and our access is blocked.

<meta name="robots" content="noindex, nofollow">

Additionally, a page whose robots meta tag contains at least one of "noindex", "nofollow", or "none" will lead to a crawling error.

To allow our bot to crawl such a page, remove the “noindex” tag from your page’s code.
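
If you are not sure whether a page carries one of these directives, a small script can fetch the page and report them. This is a rough sketch with a placeholder URL, using only the Python standard library; it only inspects the robots meta tag, not HTTP headers.

import urllib.request
from html.parser import HTMLParser

BLOCKING = {"noindex", "nofollow", "none"}

class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = (attrs.get("content") or "").lower()
            self.directives |= {part.strip() for part in content.split(",")}

url = "http://www.example.com/"  # placeholder page to check
html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
parser = RobotsMetaParser()
parser.feed(html)

found = parser.directives & BLOCKING
print("Blocking directives:", found if found else "none found")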

Whitelist the SEMrush-Bot

Another reason the audit won't start may be that our bot is blocked. To whitelist the bot, contact your webmaster or hosting provider and ask them to whitelist the Semrush-Bot.

The bot's IP addresses are: 

  • 213.174.153.121
  • 18.197.42.174
  • 35.177.199.105
  • 13.53.129.183

The bot connects over the standard HTTP port 80.
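
If you are reviewing server logs or writing an allow rule and want to confirm that a visitor really is our bot, a trivial check against the four addresses above is enough. This is only an illustrative sketch; the sample IPs are placeholders, and an actual whitelist is usually configured in your firewall, CDN, or plugin settings rather than in application code.

# The four Content Analyzer bot addresses listed above.
CONTENT_ANALYZER_BOT_IPS = {
    "213.174.153.121",
    "18.197.42.174",
    "35.177.199.105",
    "13.53.129.183",
}

def is_content_analyzer_bot(ip: str) -> bool:
    return ip in CONTENT_ANALYZER_BOT_IPS

print(is_content_analyzer_bot("18.197.42.174"))  # True
print(is_content_analyzer_bot("203.0.113.7"))    # False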

If you use any plugins (WordPress, for example) or CDNs (content delivery networks) to manage your site, you will have to whitelist the bot's IP addresses within those as well.

For whitelisting on WordPress, contact WordPress support.

Common CDNs that block our crawler include:

  • Cloudflare: see its whitelisting instructions.
  • Incapsula: see its whitelisting instructions (add SEMrush as a “Good bot”).
  • ModSecurity: see its whitelisting instructions.
  • Sucuri: see its whitelisting instructions.

In short, make sure that the sitemap file is accessible to our bot, i.e. that our requests are not blocked by user agent or by IP.

Please note: If you have shared hosting, it is possible that your hosting provider may not allow you to whitelist any bots or edit the Robots.txt file.

Make sure that the sitemap is correctly formatted 

  • The sitemap should be correctly formatted in accordance with the sitemap protocol.
  • The sitemap should contain only the URLs of the domain you would like to analyze.
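
A quick way to check both points is to parse the sitemap and compare every URL against the project domain. The sketch below assumes a plain <urlset> sitemap saved locally as sitemap.xml; the file name and domain are placeholders.

import xml.etree.ElementTree as ET
from urllib.parse import urlparse

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
PROJECT_DOMAIN = "example.com"  # placeholder

# ET.parse raises ParseError if the file is not well-formed XML.
tree = ET.parse("sitemap.xml")
urls = [loc.text.strip() for loc in tree.getroot().findall("sm:url/sm:loc", NS)]

allowed_hosts = {PROJECT_DOMAIN, "www." + PROJECT_DOMAIN}
foreign = [u for u in urls if urlparse(u).hostname not in allowed_hosts]

print(len(urls), "URLs in total,", len(foreign), "outside", PROJECT_DOMAIN)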

Consider the Content Audit limitation

There is a technical limitation: no more than 20,000 pages can be analyzed per audit, and a sitemap index can contain no more than 100 embedded sitemaps.

If your sitemap consists of other sitemaps that in turn also contain links to further sitemaps rather than lists of URLs, we will not be able to proceed with the audit.

Subdomains of a domain are not shown, so if you need to audit a subdomain, you will have to set up a separate project for it.
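
If you want to check your sitemap index against these limits before starting an audit, the sketch below counts the embedded sitemaps and the total number of URLs they contain. The index URL is a placeholder, and the script assumes standard sitemap-protocol XML.

import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fetch_xml(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return ET.fromstring(resp.read())

index = fetch_xml("https://www.example.com/sitemap_index.xml")  # placeholder
children = [loc.text.strip() for loc in index.findall("sm:sitemap/sm:loc", NS)]
print("Embedded sitemaps:", len(children), "(limit: 100)")

total_urls = 0
for child_url in children:
    child = fetch_xml(child_url)
    # If a child contains <sitemap> entries instead of <url> entries,
    # it is itself an index, which the audit cannot process.
    total_urls += len(child.findall("sm:url/sm:loc", NS))

print("Total URLs:", total_urls, "(limit: 20,000)")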

I don't have a sitemap file yet, what should I do?

If the sitemap is still in progress or inaccessible, you can submit a list of URLs for analysis. The file for upload should be a .txt, .xml, or .csv less than 100 MB in size.

Make sure that all URLs in the file match the project domain and that the file contains nothing besides that list of URLs.
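
If you assembled the URL list from several sources, a short script can strip out anything that does not belong to the project domain before you upload it. The file names and domain below are placeholders.

import os
from urllib.parse import urlparse

PROJECT_DOMAIN = "example.com"  # placeholder
allowed_hosts = {PROJECT_DOMAIN, "www." + PROJECT_DOMAIN}

with open("urls_raw.txt", encoding="utf-8") as f:
    urls = [line.strip() for line in f if line.strip()]

keep = [u for u in urls if urlparse(u).hostname in allowed_hosts]

with open("urls_clean.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(keep) + "\n")

size_mb = os.path.getsize("urls_clean.txt") / (1024 * 1024)
print(f"Kept {len(keep)} of {len(urls)} URLs; {size_mb:.1f} MB (limit: 100 MB)")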

Additional Troubleshooting Tips:

How can I change the scope/sitemap of a Content Audit?

By default, the subfolders to pull URLs from are picked up from your sitemaps. To add more pages or parts of the domain to Content Analyzer, you can:

  • Restart the campaign and select the corresponding subfolder;
  • Upload a file to include all the necessary URLs (up to 20k);
  • If the total number of pages you wish to analyze is over 20k, create an additional project to cover the extra pages.
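
For the last option, you will need to split your URL list into batches of at most 20,000 URLs, one file per project. A minimal sketch (the file names are placeholders):

CHUNK_SIZE = 20_000

with open("urls_clean.txt", encoding="utf-8") as f:
    urls = [line.strip() for line in f if line.strip()]

for i in range(0, len(urls), CHUNK_SIZE):
    part = urls[i:i + CHUNK_SIZE]
    name = f"urls_part_{i // CHUNK_SIZE + 1}.txt"
    with open(name, "w", encoding="utf-8") as out:
        out.write("\n".join(part) + "\n")
    print(name, len(part), "URLs")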

How can I update the Content Audit results?

You can update the metrics and results of your audit by clicking the refresh (last update) button. "Content update on" refers to the publication date of the content, while "last update" refers to the date the metrics were last refreshed.

What are the limitations of Post Tracking?

A homepage cannot be monitored in Post Tracking. This is the one global limitation; however, the tool is intended for monitoring individual articles and posts, so we hope it will not cause you any inconvenience.
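
If you are preparing a list of URLs to track, it is worth filtering out homepage URLs up front, since the tool does not monitor them. A tiny sketch with placeholder URLs:

from urllib.parse import urlparse

candidates = [
    "https://www.example.com/",                 # homepage: cannot be tracked
    "https://www.example.com/blog/first-post",  # regular post: can be tracked
]

def is_homepage(url: str) -> bool:
    return urlparse(url).path in ("", "/")

trackable = [u for u in candidates if not is_homepage(u)]
print(trackable)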

Contact SEMrush Support

If you are still having issues running your Content Audit, send an email to [email protected] and explain your problem.
