Configuring Site Audit Manual

To set up a Site Audit, you first need to create a Project for the domain. Once you have your new project, select the “Set up” button in the Site Audit block of your Project interface.

If you are having problems getting your Site Audit to run, please reference Troubleshooting Site Audit for help.

Crawl Scope
Crawl Source
Advanced Setup
Crawler Settings
Allow/Disallow URLs
Remove URL Parameters
Bypass Website Restrictions
Schedule
Connecting Google Analytics

Step 1: Domain and Limit of Pages

You’ll be taken to the first part of the setup wizard, Domain and Limit of Pages. From here, you can either select “Start Site Audit,” which immediately runs an audit of your site with our default settings, or customize the audit settings first. You can always change your settings later and re-run the audit to crawl a more specific area of your site.

Crawl Scope

To crawl a specific domain, subdomain, or subfolder, you can enter it into the “Crawl scope” field. If you enter a domain in this field, you’ll be given the option to crawl all subdomains of your domain with a checkbox.

Limit of Checked Pages

Next, select how many pages you want to crawl per audit. You can enter a custom number using the “Custom” option. Choose this number carefully, based on your subscription level and how often you plan to re-audit your website.

  • Pro users can crawl up to 100,000 pages per month and 20,000 pages per audit
  • Guru users can crawl up to 300,000 pages per month and 20,000 pages per audit
  • Business users can crawl up to 1 million pages per month and 100,000 pages per audit
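
To see how the per-audit limit interacts with the monthly limit, the arithmetic can be sketched as below. The plan figures are taken from the list above; the function name and the error handling are illustrative only.

```python
# Plan limits from the list above: (pages per month, pages per audit).
PLAN_LIMITS = {
    "Pro": (100_000, 20_000),
    "Guru": (300_000, 20_000),
    "Business": (1_000_000, 100_000),
}

def full_audits_per_month(plan, pages_per_audit):
    """Return how many audits of the given size fit into the monthly limit."""
    monthly_limit, audit_limit = PLAN_LIMITS[plan]
    if pages_per_audit < 1:
        raise ValueError("pages_per_audit must be at least 1")
    if pages_per_audit > audit_limit:
        raise ValueError(f"{plan} allows at most {audit_limit} pages per audit")
    return monthly_limit // pages_per_audit

print(full_audits_per_month("Pro", 20_000))   # 5
print(full_audits_per_month("Guru", 10_000))  # 30
```

A Pro user who audits the maximum 20,000 pages each time would therefore use up the monthly allowance after five audits.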

Crawl Source

Setting the crawl source determines how the SEMrush Site Audit bot crawls your website and finds pages to audit. In addition to setting the crawl source, you can set masks and parameters to include/exclude from the audit in steps 3 and 4 of the setup wizard.

There are 4 options to set as your Audit’s crawl source: Website, Sitemap on site, Sitemap by URL, and a file of URLs.

1. Crawling from Website means we will crawl your site like the GoogleBot, using a breadth-first search algorithm and following the links we find in your pages’ code, starting from the homepage.

If you only want to audit the most important pages of a site, crawling from Sitemap instead of Website lets the audit reach those pages, rather than just the ones most accessible from the homepage.
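
The breadth-first behavior described for the Website source can be sketched with a small in-memory link graph. This is only an illustration, not the Site Audit crawler itself: the SITE dictionary and the function name are hypothetical, and no network requests are made.

```python
from collections import deque

# Hypothetical link graph: each page maps to the links found in its code.
SITE = {
    "/": ["/shoes/", "/about/"],
    "/shoes/": ["/shoes/mens/", "/"],
    "/about/": [],
    "/shoes/mens/": ["/shoes/mens/hiking-boots/"],
}

def breadth_first_crawl(links, start, page_limit):
    """Crawl level by level from `start`, stopping after `page_limit` pages."""
    crawled = []
    seen = {start}
    queue = deque([start])
    while queue and len(crawled) < page_limit:
        page = queue.popleft()
        crawled.append(page)
        for target in links.get(page, []):
            if target not in seen:  # queue each page only once
                seen.add(target)
                queue.append(target)
    return crawled

print(breadth_first_crawl(SITE, "/", 4))
# ['/', '/shoes/', '/about/', '/shoes/mens/']
```

With a limit of 4 pages, the deeper hiking-boots page is never reached, which is why a sitemap can be a better source when important pages sit far from the homepage.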

2. Crawling from Sitemaps on site means we will only crawl the URLs found in the sitemaps listed in your robots.txt file.

3. Crawling from Sitemap by URL is the same as crawling from “Sitemaps on site” but this option lets you specifically enter your sitemap URL.

Since search engines use sitemaps to understand which pages they should crawl, you should always try to keep your sitemap as up to date as possible and use it as a crawl source with our tool to get an accurate audit.

4. Crawling from a file of URLs lets you audit a super-specific set of pages on a website. Make sure that your file is properly formatted as a .csv or .txt with one URL per line and upload it directly to SEMrush from your computer.

This is a useful method if you want to check on specific pages and conserve your crawl budget. If you made any changes to only a small set of pages on your site that you want to check on, you can use this method to run a specific audit and not waste any crawl budget.

After uploading your file, the wizard will tell you how many URLs were detected so that you can double check that it worked properly before running the audit.
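
Before uploading, a URL file can be checked locally. The sketch below assumes that each line should hold exactly one absolute http or https URL; that rule and the function name are assumptions made for illustration, not documented upload requirements.

```python
from urllib.parse import urlsplit

def check_url_lines(lines):
    """Split lines into (valid_urls, rejected_lines), skipping blank lines."""
    valid, rejected = [], []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        parts = urlsplit(text)
        # Assumption: one absolute http(s) URL per line, with no spaces.
        if parts.scheme in ("http", "https") and parts.netloc and " " not in text:
            valid.append(text)
        else:
            rejected.append(text)
    return valid, rejected

sample = [
    "https://example.com/shoes/mens/\n",
    "https://example.com/shoes/mens/hiking-boots/\n",
    "\n",
    "example.com/about\n",
    "https://example.com/a https://example.com/b\n",
]
valid, rejected = check_url_lines(sample)
print(len(valid), "URLs detected;", len(rejected), "lines rejected")
```

For a real file, the same function can be applied to the lines read with open("urls.txt", encoding="utf-8"), where urls.txt is a placeholder file name. Comparing the local count with the number reported by the wizard is a quick way to confirm the upload worked.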

Crawling JavaScript

If you use JavaScript on your site, you can implement the AJAX crawling scheme, and Site Audit will find the links in your JavaScript and follow them to the content they point to. All you have to do is re-run your campaign and change the crawl source from Website to Sitemap. You can read more about this in our news release.

AJAX crawling allows us to find the pages that contain JavaScript elements, crawl the HTML on those pages, and measure the size of JS and CSS elements with our Performance checks.

Auditing AMPs

The "Crawl AMP pages first" checkbox ensures that your audit will crawl your AMP pages to check for the most important issues related to AMP implementation. At this time, the AMP checks are only available for Business level subscriptions.

After configuring these settings, you can now run your Site Audit. However, if you'd like to add masks or remove parameters and set your schedule, use the advanced setup and configuration instructions below. 

Advanced Setup and Configuration

Note: The following five steps of the configuration (Steps 2 through 6) are advanced and optional.

Step 2: Crawler Settings

This is where you can choose the user agent that you want to crawl your site. First, set your audit’s user agent by choosing between the mobile or desktop version of either the SEMrushBot or the GoogleBot.

As you change the user agent, you’ll see the code in the dialog box below change as well. This is the user agent string, which you can use in a curl command if you want to test the user agent on your own.
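
The same kind of test can be scripted. The sketch below builds a request that identifies itself with a chosen user agent, using only the Python standard library. The USER_AGENT value is a hypothetical placeholder, not a real bot string; replace it with the text shown in the dialog box.

```python
from urllib.request import Request

# Placeholder only: paste the string shown in the setup wizard's dialog box.
USER_AGENT = "ExampleAuditBot/1.0 (hypothetical placeholder)"

def build_request(url, user_agent):
    """Build a GET request that sends the given User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})

request = build_request("https://www.example.com/", USER_AGENT)
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))

# To actually send the request and see how the server responds:
# from urllib.request import urlopen
# with urlopen(request, timeout=10) as response:
#     print(response.status)
```

If the server returns an error for this user agent but not for a regular browser, the bot is probably being blocked or served different content.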

Crawl-Delay Options

Next, you have 3 options for setting a crawl delay: Minimum delay, Respect robots.txt, and 1 URL per 2 seconds.

If you leave the “Minimum delay” option selected, the bot will crawl your website at its normal rate. By default, SEMrushBot waits around one second before starting to crawl another page.

If you have a robots.txt file on your site and specified a crawl delay, then you can select the “respect robots.txt crawl-delay” option to have our Site Audit crawler follow that instructed delay.

Below is what a crawl delay looks like within a robots.txt file:

Crawl-delay:20
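
To check which delay a robots.txt file actually declares, the file can be parsed with Python's standard urllib.robotparser module. This is a minimal sketch: the robots.txt content and the bot name are made up for the example, and the directive is placed under a User-agent line because the parser reads directives in groups.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (hypothetical); the delay is in seconds.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 20
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleAuditBot" is a placeholder name; the "*" group applies to it.
delay = parser.crawl_delay("ExampleAuditBot")
print(delay)  # 20
```

A result of None would mean that no crawl delay applies to that user agent, in which case the “respect robots.txt crawl-delay” option has nothing to follow.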

If our crawler slows down your website and you do not have a crawl delay directive in your robots.txt file, you can tell SEMrush to crawl 1 URL per 2 seconds. This may make your audit take longer to complete, but it will cause fewer potential speed issues for actual users on your website during the audit.

Step 3: Allow/Disallow URLs

This option will allow you to specifically crawl or block select subfolders of a website. You will want to include everything within the URL after the TLD. For example, if you wanted to crawl the subfolder http://www.example.com/shoes/mens/ you would want to enter: “/shoes/mens/” into the allow box on the left.

To avoid crawling specific subfolders, you would have to enter that subfolder’s path in the disallow box. For example, to crawl the men’s shoes category but avoid the hiking boots sub-category under men’s shoes (https://example.com/shoes/mens/hiking-boots/), you would enter /shoes/mens/hiking-boots/ in the disallow box.

If you forget to enter the / at the end of the URL in the disallow box (ex: /shoes), then SEMrush will skip all pages in the /shoes/ subfolder as well as all URLs that begin with /shoes (such as www.example.com/shoes-men).
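
The trailing-slash behavior follows from matching masks as path prefixes. The sketch below illustrates it under stated assumptions: plain prefix matching, disallow taking precedence over allow, and an empty allow list meaning that everything is allowed. These rules are chosen for illustration and are not a specification of how the tool evaluates masks.

```python
def should_crawl(path, allow, disallow):
    """Decide whether `path` is crawled, using simple prefix masks."""
    # Assumption: a matching disallow mask always wins.
    if any(path.startswith(mask) for mask in disallow):
        return False
    # Assumption: with no allow masks, every remaining path is crawled.
    if not allow:
        return True
    return any(path.startswith(mask) for mask in allow)

allow = ["/shoes/mens/"]
disallow = ["/shoes/mens/hiking-boots/"]
print(should_crawl("/shoes/mens/sneakers/", allow, disallow))            # True
print(should_crawl("/shoes/mens/hiking-boots/trail/", allow, disallow))  # False

# Without the trailing slash, the mask also matches /shoes-men.
print(should_crawl("/shoes-men", [], ["/shoes"]))   # False
print(should_crawl("/shoes-men", [], ["/shoes/"]))  # True
```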

Step 4: Remove URL Parameters

URL parameters (also known as query strings) are elements of a URL that do not fit into the hierarchical path structure. Instead, they are added on to the end of a URL and give logic instructions to the web browser.

URL parameters always consist of a ? followed by the parameter name (page, utm_medium, etc.), an =, and a value. Multiple parameters are joined with &.

So “?page=3” is a simple URL parameter that could indicate the 3rd page of scrolling on a single URL.

The 4th step of the Site Audit configuration allows you to specify any URL parameters that your website uses in order to remove them from the URLs while crawling. This helps SEMrush avoid crawling the same page twice in your audit. If a bot sees two URLs, one with a parameter and one without, it may crawl both pages and waste your crawl budget as a result.

For example, if you add “page” to this box, the “page” parameter is removed from every URL that includes it, such as URLs ending in ?page=1, ?page=2, etc. The audit then treats “/shoes” and “/shoes/?page=1” as one URL instead of crawling the same page twice.
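
The effect of removing a parameter can be reproduced with Python's standard urllib.parse module. This is a sketch of the general idea only; the function name is illustrative, and the tool's own normalization may differ in its details.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def strip_parameters(url, removed):
    """Return `url` without the query parameters named in `removed`."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in removed
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))

removed = {"page", "utm_medium"}
print(strip_parameters("https://example.com/shoes/?page=2", removed))
# https://example.com/shoes/
print(strip_parameters("https://example.com/shoes/?page=2&color=red", removed))
# https://example.com/shoes/?color=red
```

Every paginated or tagged variant collapses to the same URL, so it only needs to be crawled once.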

Common uses of URL parameters include pages, languages and subcategories. These types of parameters are useful for websites with large catalogues of products or information. Another common URL parameter type is UTMs, which are used for tracking clicks and traffic from marketing campaigns.

You can find the exact list of your website’s parameters in Google Search Console. In the left side menu, locate "Crawl - URL Parameters." There is also a link under the “How it Works” paragraph in the window that will take you to your website’s list of URL parameters in Google Search Console.

If you already have a project set up and would like to change your settings, you can do so using the Settings gear. Follow the same directions listed above by selecting the “Masks” and “Removed Parameters” options.

Step 5: Bypass Website Restrictions

To audit private areas of your website that are password protected by basic access authentication, enter your credentials in this step to allow the Site Audit bot to reach those pages and audit them for you.

This is recommended for sites that are under development or are private and fully guarded by password. To make sure that the crawler goes to the area that you want it to, be sure to take advantage of allowing/disallowing URLs and URL parameters.
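
For context, basic access authentication sends the credentials in an Authorization header as the Base64 encoding of "username:password". The sketch below shows how that header is formed; the credentials are the well-known example pair from the HTTP Basic authentication specification, not real ones.

```python
import base64

def basic_auth_header(username, password):
    """Build the Authorization header used by basic access authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}

print(basic_auth_header("Aladdin", "open sesame"))
# {'Authorization': 'Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=='}
```

Because Base64 is an encoding rather than encryption, these credentials are only protected in transit when the site is served over HTTPS.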

Step 6: Schedule

Lastly, select how often you would like us to automatically audit your website. Your options are:

  • Weekly (choose any day of the week)
  • Daily
  • Once

You can always re-run the audit at your convenience within the Project.

After completing all of your desired settings, select “Start Site Audit.”

In the case of an “auditing domain has failed” dialog, you will want to check that our Site Audit crawler is not blocked by your server. The crawler has an IP address of 46.229.173.67. Next, you can download the log file that’s generated when the failed crawl occurs, and provide the log file to your webmaster so they can analyze the situation, and try to find a reason why we are blocked from crawling.
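
A webmaster can quickly check whether the crawler's IP address falls inside any blocked range. The sketch below uses Python's standard ipaddress module. The crawler IP comes from the paragraph above, while the blocked ranges are made-up examples standing in for a real firewall or server configuration.

```python
import ipaddress

# Crawler IP address as stated above.
CRAWLER_IP = ipaddress.ip_address("46.229.173.67")

def blocking_rules(blocked_ranges, ip=CRAWLER_IP):
    """Return the blocked CIDR ranges that contain `ip`."""
    return [
        cidr for cidr in blocked_ranges
        if ip in ipaddress.ip_network(cidr)
    ]

# Hypothetical blocklist entries, for illustration only.
print(blocking_rules(["203.0.113.0/24", "46.229.160.0/20"]))
# ['46.229.160.0/20']
```

An empty result means that none of the listed ranges blocks the crawler, so the cause of the failed audit lies elsewhere.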

Connecting Google Analytics and Site Audit

After completing the setup wizard, you will be able to connect your Google Analytics account to include issues related to your top-viewed pages. 

If any issue persists with running your Site Audit, try Troubleshooting Site Audit or contact our support team and we will be happy to help you.