18 Reasons Your Website is Crawler-Unfriendly: Guide to Crawlability Issues

Elena Terenteva

You’ve been working hard on your website and can’t wait to see it at the top of the search results, but your content is struggling to get past the tenth page. If you are sure that your website deserves to rank higher, the problem might lie in its crawlability.

What is crawlability? Search engines use search bots to collect certain parameters of a website’s pages. The process of collecting this data is called crawling. Based on this data, search engines include pages in their search index, which means those pages can be found by users. A website’s crawlability is its accessibility to search bots: you have to be sure that search bots will be able to find your website’s pages, obtain access to them and then “read” them.

We also break these issues down into two categories: those you can solve on your own and those that require a developer or a system administrator. Of course, we all have different backgrounds and skills, so treat this categorization as a rough guide.

What we mean by “solve on your own”: you can manage your website’s page code and root files. You also need basic coding knowledge (enough to change or replace a piece of code in the right place and in the right manner).

What we mean by “delegate to a specialist”: server administration and/or web development skills are required.

Crawler blocked by meta tags or robots.txt

This type of issue is pretty easy to detect and solve: simply check your meta tags and robots.txt file, which is why you should look at them first. The whole website or certain pages can remain unseen by Google for a simple reason: its search bots are not allowed to enter them.

There are several bot directives that will prevent page crawling. Note that it’s not a mistake to have these parameters in robots.txt; used properly and accurately, they help save crawl budget and give bots the exact directions they need to follow in order to crawl the pages you want crawled.
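
For instance, a minimal robots.txt sketch that blocks only low-value pages while leaving the rest of the site open to crawling (the /search/ and /tmp/ paths here are purely hypothetical examples, not directives you necessarily need):

User-agent: *
Disallow: /search/
Disallow: /tmp/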

1. Blocking the page from indexing through the robots meta tag

If you do this, the search bot will not even start looking at your page’s content and will move directly to the next page.

You can detect this issue by checking whether your page’s code contains this directive:

<meta name="robots" content="noindex" />

2. Nofollow links

In this case, the search bot will index your page’s content but will not follow its links. There are two types of nofollow directives:

  • For the whole page. Check if you have

    <meta name="robots" content="nofollow">

    in the page’s code; that would mean the crawler can’t follow any link on the page.

  • For a single link. This is what the piece of code looks like in this case:

    <a href="pagename.html" rel="nofollow">Anchor text</a>

3. Blocking the pages from indexing through robots.txt

Robots.txt is the first file of your website that crawlers look at. The most painful thing you can find there is:

User-agent: *
Disallow: /

It means that all the website’s pages are blocked from indexing.

It might happen that only certain pages or sections are blocked, for instance:

User-agent: *
Disallow: /products/

In this case, any page in the /products/ subfolder will be blocked from indexing, and therefore none of your product descriptions will be visible in Google.

Broken link issues

Broken links are always a bad experience for your users, and for the crawlers as well. Every page the search bot indexes (or tries to index) spends crawl budget. With this in mind, if you have many broken links, the bot will waste its time indexing them and won’t get to relevant, quality pages.

The Crawl errors report in Google Search Console or the Internal broken links check in SEMrush Site Audit will help you identify this type of problem.

4. URL errors

A URL error is usually caused by a typo in a URL you insert into your page (text link, image link, form link). Be sure to check that all your links are typed correctly.

5. Outdated URLs

If your website has recently undergone a migration, a bulk deletion or a URL structure change, you need to double-check this issue. Make sure you don’t link to old or deleted URLs from any of your website’s pages.

6. Pages with denied access

If you see that many pages on your website return, for example, a 403 status code, it’s possible that these pages are accessible only to registered users. Mark links to them as nofollow so that they don’t waste crawl budget.
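
For example, a link to a members-only area (the /account/ path here is a hypothetical example) could be marked like this:

<a href="/account/orders.html" rel="nofollow">Your orders</a>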

Broken link issues caused by server-related problems

7. Server errors

A large number of 5xx errors (for example, 502 errors) may be a signal of server problems. To solve them, provide the list of pages returning errors to the person responsible for the website’s development and maintenance. This person will take care of the bugs or website configuration issues causing the server errors.

8. Limited server capacity

If your server is overloaded, it may stop responding to users’ and bots’ requests. When this happens, your visitors see a “Connection timed out” message. This problem can only be solved together with a website maintenance specialist, who will estimate whether and by how much the server capacity should be increased.

9. Web server misconfiguration

This is a tricky issue. The site can be perfectly visible to you as a human, but it keeps giving an error message to a bot, so all the pages become unavailable for crawling. It can happen because of a specific server configuration: some web application firewalls (for example, Apache mod_security) block Googlebot and other search bots by default. In a nutshell, this problem, with all the related aspects, must be solved by a specialist.

Sitemap errors

The sitemap, together with robots.txt, makes a first impression on crawlers. A correct sitemap advises them to index your site the way you want it to be indexed. Let’s see what can go wrong when the search bot starts looking at your sitemap(s).

10. Format errors

There are several types of format errors, for example, an invalid URL or missing tags (see the complete list, along with a solution for each error, here).

You may also have found out (at the very first step) that the sitemap file itself is blocked by robots.txt. This means that the bots cannot get access to the sitemap’s content.
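
For reference, a well-formed sitemap is a plain XML file like the minimal sketch below (the domain and URL are hypothetical placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/products/blue-widget/</loc>
    <lastmod>2017-06-01</lastmod>
  </url>
</urlset>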

11. Wrong pages in sitemap

Let’s move on to the content. Even if you are not a web programmer, you can assess the relevance of the URLs in the sitemap. Take a close look at the URLs in your sitemap and make sure that each one of them is relevant, up to date and correct (no typos or misprints). If the crawl budget is limited and bots can’t go through the entire website, the sitemap’s indications can help them index the most valuable pages first.

Don’t mislead the bots with contradictory instructions: make sure that the URLs in your sitemap are not blocked from indexing by meta directives or robots.txt.
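
A typical contradiction (the domain and the /blog/ path are hypothetical) is a sitemap that lists https://www.example.com/blog/crawlability-guide/ while robots.txt contains:

User-agent: *
Disallow: /blog/

In that case, the bots are invited to a page they are not allowed to crawl.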

Site architecture issues

The issues in this category are the most difficult to solve. This is why we recommend that you go through the previous steps before tackling the following issues.

These site architecture problems can disorient the crawlers or block them from parts of your website.

12. Bad internal linking

In a correctly optimized website structure, all the pages form an unbroken chain of links, so that the crawler can easily reach every page.

In an unoptimized website, certain pages fall out of the crawlers’ sight. There can be different reasons for this, which you can easily detect and categorize using the SEMrush Site Audit tool:

  • The page you want to get ranked is not linked to from any other page on the website. This way it has no chance of being found and indexed by search bots.
  • Too many clicks between the main page and the page you want ranked. Common practice is four links or fewer; otherwise, there’s a chance that the bot won’t reach it.
  • More than 3,000 active links on one page (too much work for the crawler).
  • The links are hidden in unindexable site elements: forms that require submission, frames, plugins (Java and Flash first of all).

In most cases the internal linking problem isn’t something you can solve at the drop of a hat. A deep review of the website structure in collaboration with developers is needed.

13. Wrong redirects

Redirects are necessary to forward users to a more relevant page (or rather, the one the website owner considers more relevant). Here’s what you can overlook when working with redirects:

  • Temporary redirect instead of permanent: Using 302 or 307 redirects signals crawlers to come back to the page again and again, spending crawl budget. So, if you know that the original page no longer needs to be indexed, use a 301 (permanent) redirect for it (see the sketch after this list).

  • Redirect loop: It may happen that two pages redirect to each other, so the bot gets caught in a loop and wastes all the crawl budget. Double-check for and remove any mutual redirects.
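
A minimal sketch of a permanent redirect, assuming an Apache server with .htaccess enabled (nginx and other servers use a different syntax, and the paths here are hypothetical):

Redirect 301 /old-page.html https://www.example.com/new-page.html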

14. Slow load speed

The faster your pages load, the quicker the crawler goes through them. Every split second is important, and a website’s position in the SERP is correlated with its load speed.

Use Google PageSpeed Insights to verify whether your website is fast enough. If the load speed is likely to deter users, several factors may be affecting it.

Server-side factors: your website may be slow for a simple reason; the current channel bandwidth is no longer sufficient. You can check the available bandwidth in your pricing plan description.

Front-end factors: one of the most frequent issues is unoptimized code. If it contains bulky scripts and plug-ins, your site is at risk. Also, don’t forget to verify on a regular basis that your images, videos and other similar content are optimized and don’t slow down the page’s load speed.

15. Page duplicates caused by poor website architecture

Duplicate content is the most frequent SEO issue, found on 50% of sites according to the recent SEMrush study “11 Most Common On-site SEO Issues.” It is one of the main reasons you run out of crawl budget. Google dedicates a limited time to each website, so it’s wasteful to spend it on indexing the same content repeatedly. Another problem is that the crawlers don’t know which copy to trust and may give priority to the wrong pages, as long as you don’t use canonicals to clear things up.

To fix this issue, you need to identify duplicate pages and prevent their crawling in one of the following ways:

  • Delete duplicate pages

  • Set necessary parameters in robots.txt

  • Set necessary parameters in meta tags

  • Set a 301 redirect

  • Use rel=canonical (see the example after this list)
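
For example (the URL is a hypothetical placeholder), a canonical tag placed in the <head> of each duplicate tells crawlers which version to treat as the primary one:

<link rel="canonical" href="https://www.example.com/products/blue-widget/" />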

16. JS and CSS usage

Back in 2015, Google officially stated: “As long as you're not blocking Googlebot from crawling your JavaScript or CSS files, we are generally able to render and understand your web pages like modern browsers.” This does not apply to other search engines (Yahoo, Bing, etc.), though. Moreover, “generally” means that in some cases correct indexation is not guaranteed.
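
So double-check that your robots.txt doesn’t accidentally contain rules like these (the /js/ and /css/ paths are hypothetical; check the paths your templates actually use):

User-agent: *
Disallow: /js/
Disallow: /css/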

Outdated technologies

17. Flash content

Using Flash is a slippery slope both for user experience (Flash files are not supported on some mobile devices) and SEO. Text content or a link inside a Flash element is unlikely to be indexed by crawlers.

So we suggest simply not using it on your website.

18. HTML frames

If your site contains frames, there is both good and bad news that comes with it. It’s good because it probably means your site is mature enough. It’s bad because HTML frames are extremely outdated and poorly indexed, and you need to replace them with a more up-to-date solution as fast as possible.

Delegate the Daily Grind, Focus on Action

It’s not necessarily wrong keywords or content-related issues that keep you flying under Google’s radar. A perfectly optimized page is no guarantee that it will rank at the top (or rank at all) if the content can’t be delivered to the engine because of crawlability problems.

To figure out what is blocking or disorienting Google’s crawlers on your website, you need to review your domain from soup to nuts. It’s a strenuous effort to do manually, which is why you should entrust routine tasks to appropriate tools. Most popular site audit solutions help you identify, categorize and prioritize the issues, so you can take action immediately after getting the report. Moreover, many tools store data from previous audits, which lets you see the big picture of your website’s technical performance over time.

Are there other issues you consider critical for a website’s crawlability? Do you use any tools that help you detect and solve these issues in a timely manner? Feel free to share your suggestions in the comments!

Elena Terenteva, Product Marketing Manager at SEMrush.

Comments

Emma Labrador
Thanks for the article, Elena! That’s why you need an SEO log analyzer like OnCrawl to fix these issues :)
Jaden Madison
Hey Elena,
Thanks for sharing.
The issues you’ve described are really crucial and can greatly harm a website.
If you have one rather small website, you can check it manually. But for a big website or a large number of websites, it’s hardly possible :-) So I use Netpeak Spider; it checks for almost all the issues covered here, and to my mind it’s the only tool that detects issues in sitemaps.
SeoKungFu
On a semi-serious note, once upon a time, very early in my SEO adventure, I made the regrettable mistake of adding a "Disallow: /" line to the robots.txt file, and guess what, it worked like a charm: the relatively big news site it was on almost instantaneously disappeared from the search indexes. As a matter of fact, it recovered fast, and I managed to pass it off as a "technical glitch."
Elena Terenteva (replying to SeoKungFu)
Haha :) Yep, it’s always Google’s fault
Ms dhoni
A 100 $$ Post

Cheers
Elena Terenteva (replying to SeoKungFu)
:)