To set up a Site Audit, you first need to create a Project for the domain. Once you have your new project, select the “Set up” button in the Site Audit block of your Project interface.
If you are having problems getting your Site Audit to run, please refer to Troubleshooting Site Audit for help.
Domain and Limit of Pages
You’ll be taken to the first part of the setup wizard, Domain and Limit of Pages. From here, you can either choose “Start Site Audit,” which will immediately run an audit of your site with our default settings, or customize the settings of your audit to your liking. Don't worry: you can always change your settings and re-run your audit to crawl a more specific area of your site after your initial setup.
To crawl a specific domain, subdomain, or subfolder, you can enter it into the “Crawl scope” field. If you enter a domain in this field, you’ll be given the option to crawl all subdomains of your domain with a checkbox.
Limit of Checked Pages
Next, select how many pages you want to crawl per audit. You can enter a custom amount using the “Custom” option. You will want to choose this number wisely, depending on the level of your subscription and how often you plan on re-auditing your website.
- Pro users can crawl up to 100,000 pages per month and 20,000 pages per audit
- Guru users can crawl up to 300,000 pages per month and 20,000 pages per audit
- Business users can crawl up to 1 million pages per month and 100,000 pages per audit
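As a rough planning aid, the limits above can be turned into a quick estimate of how many audits of a given size fit into one month. The sketch below is illustrative only: the PLAN_LIMITS table simply copies the figures from the list above, and the function name max_audits_per_month is hypothetical, not part of any Semrush API.

```python
# Page limits copied from the subscription list above:
# plan -> (pages per month, pages per audit).
PLAN_LIMITS = {
    "Pro": (100_000, 20_000),
    "Guru": (300_000, 20_000),
    "Business": (1_000_000, 100_000),
}


def max_audits_per_month(plan: str, pages_per_audit: int) -> int:
    """Return how many audits of the given size fit into the monthly page limit."""
    monthly_limit, per_audit_limit = PLAN_LIMITS[plan]
    if pages_per_audit <= 0:
        raise ValueError("pages_per_audit must be positive")
    if pages_per_audit > per_audit_limit:
        raise ValueError(f"{plan} allows at most {per_audit_limit} pages per audit")
    return monthly_limit // pages_per_audit


if __name__ == "__main__":
    # A Pro user crawling the full 20,000 pages each time.
    print(max_audits_per_month("Pro", 20_000))  # 5
```

A smaller page limit per audit leaves room for more frequent re-audits within the same monthly allowance.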
Setting the crawl source determines how the Semrush Site Audit bot crawls your website and finds pages to audit. In addition to setting the crawl source, you can set masks and parameters to include/exclude from the audit in steps 3 and 4 of the setup wizard.
There are 4 options to set as your Audit’s crawl source: Website, Sitemap on site, Sitemap by URL, and a file of URLs.
1. Crawling from Website means we will crawl your site like Googlebot does, using a breadth-first search algorithm and following the links found in your pages' code, starting from the homepage.
If you only want to audit the most important pages of a site, choose to crawl from Sitemap instead of Website. The audit will then cover the pages listed in your sitemap rather than just the ones most accessible from the homepage.
2. Crawling from Sitemaps on site means we will only crawl the URLs found in the sitemap that is referenced in your robots.txt file.
3. Crawling from Sitemap by URL is the same as crawling from “Sitemaps on site” but this option lets you specifically enter your sitemap URL.
Since search engines use sitemaps to understand which pages they should crawl, you should always try to keep your sitemap as up to date as possible and use it as a crawl source with our tool to get an accurate audit.
4. Crawling from a file of URLs lets you audit a super-specific set of pages on a website. Make sure that your file is properly formatted as a .csv or .txt with one URL per line and upload it directly to Semrush from your computer.
This is a useful method if you want to check on specific pages and conserve your crawl budget. If you made any changes to only a small set of pages on your site that you want to check on, you can use this method to run a specific audit and not waste any crawl budget.
After uploading your file, the wizard will tell you how many URLs were detected so that you can double check that it worked properly before running the audit.
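To illustrate the breadth-first approach mentioned for the Website crawl source (option 1 above), here is a minimal sketch. It is not Semrush's crawler: the LINKS graph is a made-up, in-memory stand-in for a real site, and the function name breadth_first_crawl is hypothetical. It only shows why pages closest to the homepage are reached first when a page limit applies.

```python
from collections import deque

# Hypothetical in-memory link graph: each URL maps to the links found on that page.
LINKS = {
    "/": ["/shoes/", "/about/"],
    "/shoes/": ["/shoes/mens/", "/shoes/womens/", "/"],
    "/about/": ["/contact/"],
    "/shoes/mens/": ["/shoes/mens/hiking-boots/"],
    "/shoes/womens/": [],
    "/contact/": [],
    "/shoes/mens/hiking-boots/": [],
}


def breadth_first_crawl(links, start="/", page_limit=5):
    """Visit pages level by level from the start page, stopping at page_limit."""
    visited = []
    seen = {start}
    queue = deque([start])
    while queue and len(visited) < page_limit:
        url = queue.popleft()
        visited.append(url)
        for link in links.get(url, []):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited


if __name__ == "__main__":
    print(breadth_first_crawl(LINKS))
```

With a limit of five pages, the deepest page in this toy graph (/shoes/mens/hiking-boots/) is never reached, which is the situation where a sitemap or a file of URLs becomes the better crawl source.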
The "Crawl AMP pages first" checkbox ensures that your audit will crawl your AMP pages to check for the most important issues related to AMP implementation. At this time, the AMP checks are only available for Business level subscriptions.
After configuring these settings, you can now run your Site Audit. However, if you'd like to add masks or remove parameters and set your schedule, use the advanced setup and configuration instructions below.
Advanced Setup and Configuration
Note: The following four steps of the configuration are advanced and optional.
This is where you can choose the user agent that you want to crawl your site. First, set your audit’s user agent by choosing between the mobile or desktop version of either SemrushBot or Googlebot.
As you change the user agent, you’ll see the code in the dialog box below change as well. This is the user agent string, and you can use it in a curl command if you want to test the user agent on your own.
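If you prefer a script to a curl command, the same test can be sketched with Python's standard library. The USER_AGENT value below is a placeholder, not the real bot string: replace it with the exact text shown in the wizard's dialog box. The helper name build_request is hypothetical.

```python
import urllib.request

# Placeholder only: paste the exact user agent string shown in the setup wizard.
USER_AGENT = "ExampleAuditBot/1.0 (hypothetical user agent string)"


def build_request(url: str, user_agent: str = USER_AGENT) -> urllib.request.Request:
    """Create a GET request that identifies itself with the given user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    request = build_request("https://www.example.com/")
    # urllib stores header names capitalized as "User-agent".
    print(request.get_header("User-agent"))
    # To actually send it (needs network access):
    # urllib.request.urlopen(request, timeout=10)
```

If the server answers the real request with an error status such as 403, that suggests the user agent is being blocked.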
Next, you have 3 options for setting a crawl delay: Minimum delay, Respect robots.txt, and 1 URL per 2 seconds.
If you leave the “Minimum delay” option selected, the bot will crawl your website at its normal rate. By default, SemrushBot will wait around one second before starting to crawl another page.
If you have a robots.txt file on your site and specified a crawl delay, then you can select the “respect robots.txt crawl-delay” option to have our Site Audit crawler follow that instructed delay.
Within a robots.txt file, a crawl delay is written as a Crawl-delay directive placed under a User-agent line, for example “Crawl-delay: 20”.
If our crawler slows down your website and you do not have a crawl delay directive in your robots.txt file, you can tell Semrush to crawl 1 URL per 2 seconds. This may make your audit take longer to complete, but it will cause fewer potential speed issues for actual users on your website during the audit.
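As an illustration, the sketch below embeds a small robots.txt with a crawl delay and reads the value back using Python's standard urllib.robotparser module. The 20-second value and the helper name crawl_delay_for are arbitrary choices for this example, and the code says nothing about how Semrush itself parses the file.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; the 20-second value is an arbitrary illustration.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 20
"""


def crawl_delay_for(robots_txt: str, user_agent: str):
    """Return the Crawl-delay (in seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


if __name__ == "__main__":
    print(crawl_delay_for(ROBOTS_TXT, "SemrushBot"))
```

Running this against your own robots.txt content is a quick way to confirm that a delay is actually declared before choosing the “respect robots.txt crawl-delay” option.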
This option will allow you to specifically crawl or block select subfolders of a website. You will want to include everything within the URL after the TLD. For example, if you wanted to crawl the subfolder http://www.example.com/shoes/mens/ you would want to enter: “/shoes/mens/” into the allow box on the left.
To avoid crawling specific subfolders, you would have to enter that subfolder’s path in the disallow box. For example, to crawl the men’s shoes category but avoid the hiking boots sub-category under men’s shoes (https://example.com/shoes/mens/hiking-boots/), you would enter /shoes/mens/hiking-boots/ in the disallow box.
If you forget to enter the / at the end of the URL in the disallow box (ex: /shoes), then Semrush will skip all pages in the /shoes/ subfolder as well as all URLs that begin with /shoes (such as www.example.com/shoes-men).
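The trailing-slash behavior is easier to see in code. The sketch below assumes that masks act as plain path-prefix checks and that a disallow mask wins over an allow mask; both are assumptions made for illustration, inferred from the examples above rather than taken from Semrush's implementation. The function name is_crawlable is hypothetical.

```python
from urllib.parse import urlsplit


def is_crawlable(url, allow=(), disallow=()):
    """Apply allow/disallow masks as plain path-prefix checks (assumed behavior)."""
    path = urlsplit(url).path
    # Assumption: a matching disallow mask always takes precedence.
    if any(path.startswith(mask) for mask in disallow):
        return False
    # With no allow masks, everything not disallowed is crawlable.
    return not allow or any(path.startswith(mask) for mask in allow)


if __name__ == "__main__":
    # "/shoes" without the trailing slash also matches "/shoes-men".
    print(is_crawlable("http://www.example.com/shoes-men", disallow=["/shoes"]))
    print(is_crawlable("http://www.example.com/shoes-men", disallow=["/shoes/"]))
```

The first call reports the page as blocked and the second as crawlable, which mirrors the /shoes versus /shoes/ difference described above.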
Remove URL Parameters
URL parameters (also known as query strings) are elements of a URL that do not fit into the hierarchical path structure. Instead, they are appended to the end of a URL and pass additional instructions to the web server.
URL parameters always begin with a ? followed by the parameter name (page, utm_medium, etc.), an =, and a value. Additional parameters are separated by &.
So “?page=3” is a simple URL parameter that could indicate the 3rd page of scrolling on a single URL.
The 4th step of the Site Audit configuration allows you to specify any URL parameters that your website uses in order to remove them from the URLs while crawling. This helps Semrush avoid crawling the same page twice in your audit. If a bot sees two URLs, one with a parameter and one without, it may crawl both and waste your crawl budget as a result.
For example, if you add “page” to this box, Semrush will strip the “page” parameter (?page=1, ?page=2, and so on) from every URL it finds. URLs such as “/shoes/?page=1” and “/shoes/?page=2” are then treated as the single URL “/shoes/” and crawled only once.
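The effect of removing a parameter can be sketched with Python's standard urllib.parse module. This is only an illustration of the idea, not Semrush's implementation; the function name remove_parameters is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def remove_parameters(url, removed=("page",)):
    """Return url with the named query parameters stripped out."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in removed
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    urls = [
        "https://example.com/shoes/?page=1",
        "https://example.com/shoes/?page=2",
    ]
    # Both paginated URLs collapse into one after the parameter is removed.
    print({remove_parameters(u) for u in urls})
```

Parameters that are not listed, such as a color filter, are left untouched, so only the duplicates you name are merged.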
Common uses of URL parameters include pages, languages and subcategories. These types of parameters are useful for websites with large catalogues of products or information. Another common URL parameter type is UTMs, which are used for tracking clicks and traffic from marketing campaigns.
You can find the exact list of your website’s parameters in Google Search Console. In the left side menu, locate "Crawl - URL Parameters." There is also a link under the “How it Works” paragraph in the window that will take you to your website’s list of URL parameters in Google Search Console.
If you already have a project set up and would like to change your settings, you can do so using the Settings gear.
Select the “Masks” or “Removed Parameters” option there and follow the same directions listed above.
Bypass Website Restrictions
To audit a website in pre-production or hidden by basic access authentication, step 5 offers two options:
- Bypassing the disallow in robots.txt and robots meta tag
- Crawling with your credentials to bypass password-protected areas
If you want to bypass disallow commands in the robots.txt or meta tag (usually this would be found in your website’s <head> tag), you will have to upload the .txt file provided by Semrush to the main folder of your website.
You can upload this file the same way you would upload a file for GSC verification, for example, directly into your website’s main folder. This process verifies your ownership of the website and allows us to crawl the site.
Once the file is uploaded, you can start the Site Audit and gather results.
To crawl with your credentials, simply enter the username and password that you use to access the part of your website that is hidden. Our bot will then use your login info to access the hidden areas and provide you with the results of the audit.
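For context, basic access authentication sends the username and password as a Base64-encoded Authorization header on each request. The sketch below shows only that general mechanism; how the Semrush bot handles your credentials internally is not described here, and the function name basic_auth_header is hypothetical.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Authorization header for basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


if __name__ == "__main__":
    # Example credentials only; never hard-code real ones.
    print(basic_auth_header("user", "pass"))
```

Because Base64 is an encoding and not encryption, protected areas should be served over HTTPS.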
Lastly, select how often you would like us to automatically audit your website. Your options are:
- Weekly (choose any day of the week)
You can always re-run the audit at your convenience within the Project.
After completing all of your desired settings, select “Start Site Audit.”
If you see an “auditing domain has failed” dialog, check that our Site Audit crawler is not blocked by your server. To ensure a proper crawl, please follow our Site Audit Troubleshooting steps to whitelist our bot.
Alternatively, you can download the log file that’s generated when the failed crawl occurs and give it to your webmaster so they can analyze the situation and find out why we are blocked from crawling. If you need a way to analyze your website's log file yourself, you can upload it to our Log File Analyzer tool.
Connecting Google Analytics and Site Audit
After completing the setup wizard, you will be able to connect your Google Analytics account to include issues related to your top-viewed pages.