Elena Terenteva

18 Reasons Your Website is Crawler-Unfriendly: Guide to Crawlability Issues


You’ve been working hard on your website and can’t wait to see it at the top of the search results, but your content is struggling to get past the 10th page. If you know what SEO is, have optimized your content, and are sure that your website deserves to rank higher, the problem might lie in your website’s crawlability.

What is crawlability? Search engines use search bots to collect certain parameters of website pages. The process of collecting this data is called crawling. Based on this data, search engines include pages in their search index, which means that those pages can be found by users. Website crawlability is its accessibility to search bots. You have to be sure that search bots will be able to find your website’s pages, obtain access to them and then “read” them.

We break these issues down into two categories: those you can solve on your own and those you need to involve a developer or a system administrator in. Of course, all of us have different backgrounds and skills, so treat this categorization as tentative.

What we mean by “solve on your own”: you can manage the code of your website’s pages and its root files. You also need basic coding knowledge (to change or replace a piece of code in the right place and in the right manner).

What we mean by “delegate to a specialist”: server administration and/or web development skills are required.

Issues with Meta Tags or robots.txt

These issues are pretty easy to detect and solve by simply checking your meta tags and robots.txt file, which is why you should look at them first. The whole website or certain pages can remain unseen by Google for a simple reason: its site crawlers are not allowed to enter them.

There are several bot commands that prevent page crawling. Note that it’s not a mistake to have these parameters in robots.txt; used properly and accurately, they help save crawl budget and give bots the exact direction they need to follow in order to crawl the pages you want crawled.

Blocking the page from indexing through robots meta tag

If you do this, the search bot may still fetch the page, but once it reads the directive it will leave the page out of the search index.

You can detect this issue by checking whether your page’s code contains this directive:

<meta name="robots" content="noindex" />
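
If you have many pages to check, looking for the directive by hand gets tedious. Below is a minimal sketch of how the check could be automated with Python’s standard HTML parser. The class and function names (RobotsMetaParser, robots_directives) and the sample markup are our own illustration, not part of any SEO tool.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # content may hold several comma-separated directives
            for part in attr.get("content", "").split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page source used only for illustration
sample = '<html><head><meta name="robots" content="noindex" /></head></html>'
print("noindex" in robots_directives(sample))  # True
```

The same function also reports a page-level nofollow, since it returns every directive listed in the tag.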

NoFollow links

In this case, the site crawler will index your page’s content but will not follow the links. There are two types of nofollow directives:

  • for the whole page. Check if you have

    <meta name="robots" content="nofollow">

    in the page’s code. That means the crawler will not follow any link on the page.

  • for a single link. This is what the code looks like in this case:

    <a href="pagename.html" rel="nofollow">link text</a>
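
To see which individual links on a page carry the attribute, something along these lines could work. It is only a sketch built on Python’s standard HTML parser; NofollowLinkParser and find_nofollow_links are names we made up for the example.

```python
from html.parser import HTMLParser


class NofollowLinkParser(HTMLParser):
    """Collects href values of <a> tags whose rel attribute contains nofollow."""

    def __init__(self):
        super().__init__()
        self.nofollow_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = {name: (value or "") for name, value in attrs}
        # rel is a space-separated list of tokens, e.g. "nofollow noopener"
        rel_tokens = attr.get("rel", "").lower().split()
        if "nofollow" in rel_tokens and "href" in attr:
            self.nofollow_links.append(attr["href"])


def find_nofollow_links(html):
    """Return the hrefs of all nofollow links in an HTML string."""
    parser = NofollowLinkParser()
    parser.feed(html)
    return parser.nofollow_links


# Hypothetical markup used only for illustration
html = '<a href="pagename.html" rel="nofollow">one</a> <a href="other.html">two</a>'
print(find_nofollow_links(html))  # ['pagename.html']
```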

Blocking the pages from indexing through robots.txt

Robots.txt is the first file of your website the crawlers look at. The most painful thing you can find there is:

User-agent: *
Disallow: /

It means that crawlers are not allowed to fetch any page of the website, so none of its content can be indexed.

It might happen that only certain pages or sections are blocked, for instance:

User-agent: *
Disallow: /products/

In this case crawlers are kept out of every page in the /products/ subfolder and, therefore, none of your product descriptions will be visible in Google.
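
You can test a set of rules against your important URLs before a bot does. Here is a small sketch using urllib.robotparser from the Python standard library; the blocked_urls helper, the example.com URLs and the choice of "Googlebot" as the user agent are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above
ROBOTS_TXT = """\
User-agent: *
Disallow: /products/
"""


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given robots.txt rules forbid for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical URLs used only for illustration
urls = [
    "https://www.example.com/",
    "https://www.example.com/products/red-shoes.html",
]
print(blocked_urls(ROBOTS_TXT, urls))
# ['https://www.example.com/products/red-shoes.html']
```

Keep in mind that real crawlers may interpret edge cases (wildcards, for instance) differently from this parser, so treat the result as a first check rather than a verdict.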

Internal broken links

Broken links are always a bad experience for your users, and for the crawlers too. Every page the search bot indexes (or tries to index) spends crawl budget. With this in mind, if you have many broken links, the bot will waste much of its time on them and may never reach your relevant, quality pages.

The Crawl errors report in Google Search Console or the Internal broken links check in SEMrush Site Audit will help you identify this type of problem.

URL errors

A URL error is usually caused by a typo in a URL you insert into your page (text link, image link, form link). Be sure to check that all the links are typed in correctly.

Outdated URLs

If you have recently undergone a website migration, a bulk delete or a URL structure change, you need to double-check this issue. Make sure you don’t link to old or deleted URLs from any of your website’s pages.

Pages with denied access

If you see that many pages in your website return, for example, a 403 status code, it’s possible that these pages are accessible only to registered users. Mark these links as nofollow so that they don’t waste crawl budget.
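
Once you have a list of internal URLs with the status codes they return (from a crawl export, for example), sorting them into groups makes it clear what to fix first. The sketch below assumes such a URL-to-status mapping already exists; the category names and the sample data are invented for illustration.

```python
def classify_link(status_code):
    """Map an HTTP status code to a rough crawlability category."""
    if 200 <= status_code < 300:
        return "ok"
    if 300 <= status_code < 400:
        return "redirect"
    if status_code in (401, 403):
        return "access denied"
    if status_code in (404, 410):
        return "broken"
    if 500 <= status_code < 600:
        return "server error"
    return "other"


def group_by_issue(crawl_results):
    """crawl_results maps URL -> status code; returns category -> sorted URLs."""
    report = {}
    for url, status in crawl_results.items():
        category = classify_link(status)
        if category != "ok":
            report.setdefault(category, []).append(url)
    return {category: sorted(urls) for category, urls in report.items()}


# Hypothetical crawl output used only for illustration
crawl_results = {"/": 200, "/old-page": 404, "/account": 403, "/api": 502}
for category, urls in group_by_issue(crawl_results).items():
    print(category, urls)
```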

Server Related Problem (5xx)

Server errors

A large number of 5xx errors (for example 502 errors) may be a signal of server problems. To solve them, provide the list of pages with errors to the person responsible for the website’s development and maintenance. This person will take care of the bugs or website configuration issues causing the server errors.

Limited server capacity

If your server is overloaded, it may stop responding to users’ and bots’ requests. When this happens, your visitors receive the “Connection timed out” message. This problem can only be solved together with the website maintenance specialist, who will estimate whether and by how much the server capacity should be increased.

Web server misconfiguration

This is a tricky issue. The site can be perfectly visible to you as a human, but it keeps returning an error message to site crawlers, so all the pages become unavailable for crawling. It can happen because of a specific server configuration: some web application firewalls (for example, Apache mod_security) block Googlebot and other search bots by default. In a nutshell, this problem, with all its related aspects, must be solved by a specialist.

Issues with Sitemap XML

The sitemap, together with robots.txt, makes the first impression on crawlers. A correct sitemap advises them to index your site the way you want it to be indexed. Let’s see what can go wrong when the search bot starts looking at your sitemap(s).

Format errors

There are several types of format errors, for example, invalid URL or missing tags (see the complete list, along with a solution for each error, here).

You also may have found out (at the very first step) that the sitemap file is blocked by robots.txt. This means that the bots could not get access to the sitemap’s content.

Wrong pages in sitemap

Let’s move on to the content. Even if you are not a web programmer, you can assess the relevance of the URLs in the sitemap. Take a close look at the URLs in your sitemap and make sure that each one of them is relevant, up to date and correct (no typos or misprints). If the crawl budget is limited and bots can’t go through the entire website, the sitemap can help them index the most valuable pages first.

Don’t mislead the bots with contradictory instructions: make sure that the URLs in your sitemap are not blocked from indexing by meta directives or robots.txt.
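
The cross-check between the sitemap and robots.txt is easy to script. The following sketch reads the loc entries from a sitemap with Python’s standard XML parser and tests each one against the robots.txt rules. The function names, the sample sitemap and the "Googlebot" user agent are assumptions for the example; it does not cover sitemap index files or meta directives.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Return every <loc> URL listed in a sitemap XML string."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def sitemap_urls_blocked_by_robots(sitemap_xml, robots_txt, user_agent="Googlebot"):
    """Return sitemap URLs that robots.txt forbids for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in sitemap_urls(sitemap_xml) if not parser.can_fetch(user_agent, u)]


# Hypothetical sitemap and rules used only for illustration
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/products/red-shoes.html</loc></url>
</urlset>"""
ROBOTS = "User-agent: *\nDisallow: /products/\n"

print(sitemap_urls_blocked_by_robots(SITEMAP, ROBOTS))
# ['https://www.example.com/products/red-shoes.html']
```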

Mistakes with Website Architecture

The issues of this category are the most difficult to solve. This is why we recommend that you go through the previous steps before getting down to the following issues.

These problems related to site architecture can disorient or block the crawlers in your website.

Issues with internal linking

In a correctly optimized website structure, all the pages form an unbroken chain, so that the site crawlers can easily reach every page.

In an unoptimized website, certain pages get out of crawlers’ sight. There can be different reasons for it, which you can easily detect and categorize using the Site Audit tool by SEMrush:

  • The page you want to get ranked is not linked from any other page on the website. This way it has no chance of being found and indexed by search bots.
  • Too many transitions between the main page and the page you want to be ranked. Common practice is to keep it to 4 link transitions or fewer; otherwise, there’s a chance that the bot won’t arrive at it.
  • More than 3000 active links on one page (too much work for the crawler).
  • The links are hidden in unindexable site elements: forms that require submission, frames, plugins (Java and Flash first of all).
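
The first two checks in the list above can be sketched as a breadth-first walk over your internal link graph. The example assumes you already have a mapping from each page to the pages it links to (from a crawl export, say); the function names, the sample graph and the default limit of 4 are illustrative, and "orphan" here means any known page that cannot be reached from the main page.

```python
from collections import deque


def click_depths(links, home):
    """Breadth-first search: number of link transitions from home to each page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def audit_internal_links(links, home, max_depth=4):
    """Return (pages unreachable from home, pages deeper than max_depth)."""
    depths = click_depths(links, home)
    all_pages = set(links) | {t for targets in links.values() for t in targets}
    orphans = sorted(all_pages - set(depths))
    too_deep = sorted(page for page, depth in depths.items() if depth > max_depth)
    return orphans, too_deep


# Hypothetical link graph used only for illustration
links = {
    "/": ["/blog", "/products"],
    "/blog": ["/blog/post-1"],
    "/products": [],
    "/landing": ["/"],  # links out, but nothing links to it
}
orphans, too_deep = audit_internal_links(links, "/", max_depth=1)
print(orphans)   # ['/landing']
print(too_deep)  # ['/blog/post-1']
```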

In most cases, the internal linking problem isn’t something you can solve at the drop of a hat. A deep review of the website structure in collaboration with developers is needed.

Wrong redirects

Redirects are necessary to forward users to a more relevant page (or, better, the one that the website owner considers relevant). Here’s what you can overlook when working with redirects:

  • Temporary redirect instead of permanent: Using 302 or 307 redirects is a signal to crawlers to come back to the page, again and again, spending the crawl budget. So, if you understand that the original page doesn’t need to be indexed anymore, use the 301 (permanent) redirect for it.

  • Redirect loop: It may happen that two pages get redirected to each other. The bot then gets caught in a loop and wastes the crawl budget. Double-check for mutual redirects and remove any you find.
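
Both redirect problems can be spotted from a simple table of redirects. The sketch below assumes you have a mapping from each redirecting URL to its status code and target (for example, exported from your server configuration); the function names and the sample data are made up for illustration.

```python
def follow_redirects(redirects, start, max_hops=10):
    """redirects maps URL -> (status_code, target). Returns (chain, is_loop)."""
    chain = [start]
    seen = {start}
    url = start
    while url in redirects and len(chain) <= max_hops:
        _, target = redirects[url]
        chain.append(target)
        if target in seen:
            # We came back to a URL already visited: this is a loop
            return chain, True
        seen.add(target)
        url = target
    return chain, False


def temporary_redirects(redirects):
    """Return URLs that redirect with a temporary status (302 or 307)."""
    return sorted(url for url, (status, _) in redirects.items() if status in (302, 307))


# Hypothetical redirect table used only for illustration
redirects = {
    "/a": (301, "/b"),
    "/b": (302, "/a"),
    "/old": (302, "/new"),
}
print(follow_redirects(redirects, "/a"))    # (['/a', '/b', '/a'], True)
print(follow_redirects(redirects, "/old"))  # (['/old', '/new'], False)
print(temporary_redirects(redirects))       # ['/b', '/old']
```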

Slow load speed

The faster your pages load, the quicker the crawler goes through them. Every split second is important. A website’s position in the SERP is also correlated with its load speed.

Use Google PageSpeed Insights to check whether your website is fast enough. If the load speed could deter users, there can be several factors affecting it.

Server-side factors: your website may be slow for a simple reason – the current channel bandwidth is not sufficient anymore. You can check the bandwidth in your pricing plan description.

Front-end factors: one of the most frequent issues is unoptimized code. If it contains voluminous scripts and plug-ins, your site is at risk. Also don’t forget to verify on a regular basis that your images, videos, and other similar content are optimized and don’t slow down the page’s load speed.

Page duplicates caused by poor website architecture

Duplicate content is the most frequent SEO issue, found in 50% of sites according to the recent SEMrush study "11 Most Common On-site SEO Issues." This is one of the main reasons you run out of crawl budget. Google dedicates a limited time to each website, so it’s wasteful to spend it on indexing the same content. Another problem is that the site crawlers don’t know which copy to trust more and may give priority to the wrong pages, as long as you don’t use canonicals to clear things up.

To fix this issue you need to identify duplicate pages and prevent their crawling in one of the following ways:

  • Delete duplicate pages

  • Set necessary parameters in robots.txt

  • Set necessary parameters in meta tags

  • Set a 301 redirect

  • Use rel=canonical
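
Before choosing one of the fixes above, you first need to know which pages duplicate each other. A rough way to find exact copies is to fingerprint the text of each page and group URLs with the same fingerprint. The sketch below assumes you already have the extracted text of each page; the function names and sample pages are illustrative, and near-duplicates with small wording differences will not be caught by this approach.

```python
import hashlib
from collections import defaultdict


def content_fingerprint(text):
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """pages maps URL -> page text; returns groups of URLs with identical text."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[content_fingerprint(text)].append(url)
    return sorted(sorted(urls) for urls in groups.values() if len(urls) > 1)


# Hypothetical pages used only for illustration
pages = {
    "/shoes": "Red shoes  for sale",
    "/shoes?ref=home": "red shoes for sale",
    "/about": "About us",
}
print(find_duplicates(pages))  # [['/shoes', '/shoes?ref=home']]
```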

Wrong JavaScript and CSS usage

Back in 2015, Google officially stated: “As long as you're not blocking Googlebot from crawling your JavaScript or CSS files, we are generally able to render and understand your web pages like modern browsers.” This doesn’t necessarily hold for other search engines (Yahoo, Bing, etc.), though. Moreover, “generally” means that in some cases correct indexing is not guaranteed.

Outdated Technologies in Web-Design

Flash content

Using Flash is a slippery slope both for user experience (Flash files are not supported on some mobile devices) and SEO. Text content or a link inside a Flash element is unlikely to be indexed by crawlers.

So we suggest that you simply don’t use it on your website.

HTML frames

If your site contains frames, there’s good and bad news that comes along with it. It’s good because this probably means your site is mature enough. It’s bad because HTML frames are extremely outdated, poorly indexed and you need to replace them with a more up-to-date solution as fast as possible.

Delegate Daily Grind, Focus on Action

It’s not necessarily wrong keywords or content-related issues that keep you floating under Google’s radar. A perfectly optimized page is no guarantee that you will rank at the top (or rank at all) if the content can’t be delivered to the engine because of crawlability problems.

To figure out what is blocking or disorienting Google’s crawlers on your website, you need to review your domain from soup to nuts. It’s a strenuous effort to do it manually. This is why you should trust routine tasks to appropriate tools. Most popular site audit solutions help you identify, categorize and prioritize the issues, so you can proceed to action immediately after getting the report. Moreover, many tools enable storing data of previous audits, which lets you get a big picture of your website’s technical performance over time.

Are there other issues you consider critical for a website’s crawlability? Do you use any tools that help optimize and solve these issues in a timely manner? Feel free to share your suggestions in the comments!

Elena Terenteva, Product Marketing Manager at SEMrush. Elena has eight years public relations and journalism experience, working as a broadcasting journalist, PR/Content manager for IT and finance companies.
Bookworm, poker player, good swimmer.