
18 Reasons Your Website is Crawler-Unfriendly: Guide to Crawlability Issues

Elena Terenteva

You’ve been working hard on your website and can’t wait to see it at the top of the search results, but your content is struggling to get past the tenth page. If you are sure that your website deserves to rank higher, the problem might lie in your website’s crawlability.

What is crawlability? Search engines use search bots to collect certain parameters of website pages. The process of collecting this data is called crawling. Based on this data, search engines include pages in their search index, which means that those pages can be found by users. A website’s crawlability is its accessibility to search bots. You have to be sure that search bots will be able to find your website’s pages, obtain access to them and then “read” them.

We break these issues down into two categories: those you can solve on your own and those for which you need to involve a developer or a system administrator. Of course, all of us have different backgrounds and skills, so treat this categorization as a rough guide.

What we mean by “solve on your own”: you can manage your website’s page code and root files. You also need basic coding knowledge (being able to change or replace a piece of code in the right place and in the right manner).

What we mean by “delegate to a specialist”: server administration and/or web development skills are required.

Crawler blocked by meta tags or robots.txt

These issues are pretty easy to detect and solve by simply checking your meta tags and robots.txt file, which is why you should look at them first. The whole website or certain pages can remain unseen by Google for a simple reason: its crawlers are not allowed to enter them.

There are several bot directives that will prevent page crawling. Note that it’s not a mistake to have these directives in meta tags or robots.txt; used properly and accurately, they help save crawl budget and give bots the exact directions they need to follow in order to crawl the pages you want crawled.

1. Blocking the page from indexing through the robots meta tag

If you do this, the search bot will fetch the page, read the directive and leave the page out of the search index.

You can detect this issue by checking whether your page’s code contains this directive:

<meta name="robots" content="noindex" />
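
If you would rather not read the source by eye, the check can be automated. Below is a minimal Python sketch, assuming you already have the page’s HTML as a string. The names `RobotsMetaParser` and `robots_directives` are made up for this example; only the standard library is used.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when an attribute has no value.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list such as "noindex, nofollow".
        for part in attr_map.get("content", "").split(","):
            directive = part.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots meta directives present in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

If `"noindex"` is in the returned set, the page is asking search engines to keep it out of the index.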

2. Nofollow links

In this case the crawler will index your page’s content but will not follow the links. There are two types of nofollow directives:

  • for the whole page. Check if you have

    <meta name="robots" content="nofollow">

    in the page’s code - that would mean the crawler can’t follow any link on the page.

  • for a single link. This is what the code looks like in this case:

    <a href="pagename.html" rel="nofollow">
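
To list the individual links on a page that carry this attribute, something like the following Python sketch could work. It assumes the HTML is already available as a string, and the names `NofollowLinkParser` and `find_nofollow_links` are invented for illustration.

```python
from html.parser import HTMLParser


class NofollowLinkParser(HTMLParser):
    """Collects the href of every <a> tag whose rel attribute contains nofollow."""

    def __init__(self):
        super().__init__()
        self.nofollow_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        # rel can hold several space-separated tokens, e.g. "noopener nofollow".
        rel_tokens = attr_map.get("rel", "").lower().split()
        if "nofollow" in rel_tokens and "href" in attr_map:
            self.nofollow_links.append(attr_map["href"])


def find_nofollow_links(html):
    """Return the hrefs of all nofollow links, in document order."""
    parser = NofollowLinkParser()
    parser.feed(html)
    return parser.nofollow_links
```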

3. Blocking the pages from crawling through robots.txt

Robots.txt is the first file on your website that crawlers look at. The most painful thing you can find there is:

User-agent: *
Disallow: /

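It means that all the website’s pages are blocked from crawling. Strictly speaking, Disallow prevents crawling rather than indexing: a blocked URL can still appear in the index if other pages link to it, but its content will not be read.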

It might happen that only certain pages or sections are blocked, for instance:

User-agent: *
Disallow: /products/

In this case any page in the Products subfolder will be blocked from crawling and, therefore, none of your product descriptions will be read by Google.
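
To see which of your URLs a given set of robots.txt rules would block, you can lean on Python’s standard `urllib.robotparser` module. This is a minimal sketch: the function name `blocked_urls` and the example.com URLs are assumptions made for the example, and the standard-library parser may not interpret every rule exactly the way Googlebot does.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="*"):
    """Return the URLs that the given robots.txt content disallows for user_agent."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


rules = "User-agent: *\nDisallow: /products/\n"
pages = [
    "https://www.example.com/products/shoes.html",
    "https://www.example.com/about.html",
]
print(blocked_urls(rules, pages))
```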

Broken link issues

Broken links are always a bad experience for your users, and for the crawlers too. Every page the search bot indexes (or tries to index) uses up crawl budget. With this in mind, if you have many broken links, the bot will waste its time on them and won’t reach your relevant, high-quality pages.

The Crawl errors report in Google Search Console or the Internal broken links check in SEMrush Site Audit will help you identify these problems.

4. URL errors

A URL error is usually caused by a typo in a URL you insert into your page (text link, image link, form link). Be sure to check that all the links are typed correctly.

5. Outdated URLs

If you have recently undergone a website migration, a bulk delete or a URL structure change, you need to double-check this issue. Make sure you don’t link to old or deleted URLs from any of your website’s pages.

6. Pages with denied access

If you see that many pages on your website return, for example, a 403 status code, it’s possible that these pages are accessible only to registered users. Mark the links to these pages as nofollow so that they don’t waste crawl budget.

Broken link issues caused by server-related problems

7. Server errors

A large number of 5xx errors (for example 502 errors) may be a signal of server problems. To solve them, provide the list of pages with errors to the person responsible for the website’s development and maintenance. This person will take care of the bugs or website configuration issues causing the server errors.
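
Once you have a list of URLs with their status codes (for example, exported from an audit report), it helps to sort them by who has to fix them. The Python sketch below is one possible way to do that. The labels and the function names `classify_status` and `group_by_problem` are assumptions made for this example, not categories used by any particular tool.

```python
from collections import defaultdict


def classify_status(status_code):
    """Map an HTTP status code to a rough crawlability problem label."""
    if status_code in (401, 403):
        return "access denied"   # often pages for registered users only
    if 400 <= status_code < 500:
        return "broken link"     # typos, outdated or deleted URLs
    if 500 <= status_code < 600:
        return "server error"    # hand these over to your developer or admin
    return "no error"


def group_by_problem(statuses):
    """statuses: dict mapping URL -> status code. Returns dict label -> list of URLs."""
    groups = defaultdict(list)
    for url, code in statuses.items():
        label = classify_status(code)
        if label != "no error":
            groups[label].append(url)
    return dict(groups)
```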

8. Limited server capacity

If your server is overloaded, it may stop responding to users’ and bots’ requests. When this happens, your visitors receive the “Connection timed out” message. This problem can only be solved together with the website maintenance specialist, who will estimate whether and by how much the server capacity should be increased.

9. Web server misconfiguration

This is a tricky issue. The site can be perfectly visible to you as a human, but it keeps returning an error to crawlers, so all the pages become unavailable for crawling. It can happen because of a specific server configuration: some web application firewalls (for example, Apache mod_security) block Googlebot and other search bots by default. In a nutshell, this problem, with all its related aspects, must be solved by a specialist.

Sitemap errors

The sitemap, together with robots.txt, forms the crawlers’ first impression of your site. A correct sitemap advises them to index your site the way you want it to be indexed. Let’s see what can go wrong when the search bot starts looking at your sitemap(s).

10. Format errors

There are several types of format errors, for example invalid URL or missing tags (see the complete list, along with a solution for each error, here).

You may also have found out (at the very first step) that the sitemap file is blocked by robots.txt. This means that the bots cannot access the sitemap’s content.

11. Wrong pages in sitemap

Let’s move on to the content. Even if you are not a web programmer, you can assess the relevance of the URLs in the sitemap. Take a close look at the URLs in your sitemap and make sure that each one of them is relevant, up to date and correct (no typos or misprints). If the crawl budget is limited and bots can’t go through the entire website, the sitemap can help them index the most valuable pages first.

Don’t mislead the bots with contradictory instructions: make sure that the URLs in your sitemap are not blocked from indexing by meta directives or robots.txt.
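
One way to catch such contradictions is to compare the sitemap with the robots.txt rules. Here is a minimal Python sketch, assuming a standard XML sitemap that uses the sitemaps.org namespace and assuming both files are already loaded as strings. The function names are invented for the example, and meta directives on the pages themselves are not checked here.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Extract every <loc> value from the <url> entries of an XML sitemap string."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def sitemap_urls_blocked_by_robots(sitemap_xml, robots_txt, user_agent="*"):
    """Return sitemap URLs that the robots.txt content disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        url for url in sitemap_urls(sitemap_xml)
        if not parser.can_fetch(user_agent, url)
    ]
```

Any URL this returns is listed in the sitemap and blocked in robots.txt at the same time, so one of the two instructions needs to change.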

Site architecture issues

The issues in this category are the most difficult to solve. This is why we recommend that you go through the previous steps before getting down to the following issues.

These site architecture problems can disorient or block the crawlers on your website.

12. Bad internal linking

In a correctly optimized website structure all the pages form an unbroken chain, so that crawlers can easily reach every page.

In an unoptimized website certain pages fall out of the crawlers’ sight. There can be different reasons for this, which you can easily detect and categorize using the Site Audit tool by SEMrush:

  • The page you want to get ranked is not linked to from any other page on the website. This way it has no chance of being found and indexed by search bots.
  • Too many transitions between the main page and the page you want ranked. Common practice is four links or fewer; otherwise there’s a chance that the bot won’t reach it.
  • More than 3000 active links on one page (too much work for the crawler).
  • The links are hidden in unindexable site elements: forms that require submission, frames, plugins (Java and Flash first of all).

In most cases the internal linking problem isn’t something you can solve at the drop of a hat. A deep review of the website structure in collaboration with developers is needed.
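
The first two problems in the list above can be illustrated with a small Python sketch. It assumes you already have the internal link graph as a dictionary (each page mapped to the pages it links to), for instance from a crawl export. The function names and the `max_depth=4` default are assumptions for the example. It runs a breadth-first search from the home page, so each page’s depth is the smallest number of links needed to reach it.

```python
from collections import deque


def click_depths(links, home):
    """Breadth-first search: return dict page -> fewest links needed to reach it from home."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def linking_problems(links, home, max_depth=4):
    """Return (unreachable pages, pages deeper than max_depth), both sorted."""
    depths = click_depths(links, home)
    all_pages = set(links) | {t for targets in links.values() for t in targets}
    unreachable = sorted(all_pages - set(depths))
    too_deep = sorted(page for page, depth in depths.items() if depth > max_depth)
    return unreachable, too_deep
```

Pages in the first list cannot be reached by following links from the home page at all; pages in the second list can be reached, but only after more than `max_depth` links.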

13. Wrong redirects

Redirects are necessary to forward users to a more relevant page (or, better, the one that the website owner considers relevant). Here’s what you can overlook when working with redirects:

  • Temporary redirect instead of permanent: Using 302 or 307 redirects is a signal to crawlers to come back to the page again and again, spending crawl budget. So, if the original page doesn’t need to be indexed anymore, use a 301 (permanent) redirect for it.

  • Redirect loop: It may happen that two pages redirect to each other. The bot then gets caught in a loop and wastes crawl budget. Double-check for mutual redirects and remove any you find.
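
Redirect loops are easy to spot once the redirect rules are written down as a simple mapping. The Python sketch below assumes a dictionary of source URL to target URL (for example, copied from your server configuration or a crawl report). The function name `follow_redirects` and the `max_hops` limit of 10 are assumptions made for this example.

```python
def follow_redirects(redirects, start, max_hops=10):
    """Follow redirects from start. Returns (path, status) where status is
    "ok", "loop" or "too long"."""
    path = [start]
    seen = {start}
    current = start
    while current in redirects:
        current = redirects[current]
        path.append(current)
        if current in seen:
            # We came back to a URL we already visited: this is a loop.
            return path, "loop"
        seen.add(current)
        if len(path) - 1 > max_hops:
            return path, "too long"
    return path, "ok"


rules = {"/a": "/b", "/b": "/a", "/old": "/new"}
print(follow_redirects(rules, "/a"))
print(follow_redirects(rules, "/old"))
```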

14. Slow load speed

The faster your pages load, the quicker the crawler goes through them. Every split second is important. A website’s position in the SERP is also correlated with its load speed.

Use Google PageSpeed Insights to verify whether your website is fast enough. If the load speed is slow enough to deter users, several factors may be affecting it.

Server-side factors: your website may be slow for a simple reason – the current channel bandwidth is not sufficient anymore. You can check the bandwidth in your pricing plan description.

Front-end factors: one of the most frequent issues is unoptimized code. If it contains bulky scripts and plug-ins, your site is at risk. Also, don’t forget to verify on a regular basis that your images, videos and other similar content are optimized and don’t slow down the page’s load speed.

15. Page duplicates caused by poor website architecture

Duplicate content is the most frequent SEO issue, found in 50% of sites according to the recent SEMrush study "11 Most Common On-site SEO Issues." This is one of the main reasons you run out of crawl budget. Google dedicates a limited amount of time to each website, so it’s wasteful to spend it on indexing the same content twice. Another problem is that crawlers don’t know which copy to trust more and may give priority to the wrong pages unless you use canonicals to clear things up.

To fix this issue you need to identify duplicate pages and prevent them from being crawled in one of the following ways:

  • Delete duplicate pages

  • Set necessary parameters in robots.txt

  • Set necessary parameters in meta tags

  • Set a 301 redirect

  • Use rel=canonical
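
Before choosing one of these fixes you need to know which pages duplicate each other. A simple way to find exact duplicates is to compare a hash of each page’s text. The Python sketch below assumes you have already extracted the body text of each page into a dictionary; the function names are made up for the example. Note that it only catches pages whose text is identical after normalizing whitespace and letter case, not near-duplicates.

```python
import hashlib
import re
from collections import defaultdict


def content_fingerprint(text):
    """Hash the text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """pages: dict mapping URL -> page text. Returns groups of URLs with identical text."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[content_fingerprint(text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]
```

Each group it returns is a set of URLs serving the same text, which makes them candidates for a 301 redirect or a rel=canonical pointing to the version you want indexed.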

16. JS and CSS usage

Back in 2015, Google officially stated: “As long as you're not blocking Googlebot from crawling your JavaScript or CSS files, we are generally able to render and understand your web pages like modern browsers.” This statement does not cover other search engines (Yahoo, Bing, etc.), though. Moreover, “generally” means that in some cases correct indexing is not guaranteed.

Outdated technologies

17. Flash content

Using Flash is a slippery slope both for user experience (Flash files are not supported on some mobile devices) and for SEO. Text content or a link inside a Flash element is unlikely to be indexed by crawlers.

So we suggest that you simply don’t use it on your website.

18. HTML frames

If your site contains frames, there’s good news and bad news. The good news is that this probably means your site is mature enough. The bad news is that HTML frames are extremely outdated and poorly indexed, and you need to replace them with a more up-to-date solution as fast as possible.

Delegate Daily Grind, Focus on Action

It’s not necessarily wrong keywords or content-related issues that keep you flying under Google’s radar. A perfectly optimized page is no guarantee that it will rank at the top (or rank at all) if the content can’t be delivered to the search engine because of crawlability problems.

To figure out what is blocking or disorienting Google’s crawlers on your website, you need to review your domain from soup to nuts. Doing it manually is a strenuous effort. This is why you should entrust routine tasks to appropriate tools. Most popular site audit solutions help you identify, categorize and prioritize the issues, so you can proceed to action immediately after getting the report. Moreover, many tools store the data from previous audits, which gives you the big picture of your website’s technical performance over time.

Are there other issues you consider critical for a website’s crawlability? Do you use any tools that help optimize and solve these issues in a timely manner? Feel free to share your suggestions in the comments!

Elena Terenteva, Product Marketing Manager at SEMrush. Elena has eight years of public relations and journalism experience, having worked as a broadcast journalist and as a PR/content manager for IT and finance companies.
Bookworm, poker player, good swimmer.

Muhammad Abid Khan

Great affort you've made

Amazing article and I'm going to apply some of these techniques into my website: https://packagingbee.com as it has some issues in crawling and if somebody here who can help me more about a little audit then it will be good.

Thanks in advance

I have added one site for crawling ,semrush crawl only homepage of my site. Other pages are not found by semrush, why please help

My site's hompage has 453 word's post but semrush tells only 11 words count.
Sitemap of my site is readable by google but semrush says it is invalid.
And semrush crawl only homepage of my site. Other posts and pages not found by semrush.. Pls help.
I am using wordpress.

From your SEMrush Audit report, I found around 500+ external links as a broken for 403 code. But when I checked them, I found all links are working well. So why SEMrush marked them as a broken link?

Help me please.
Customer Success Team

SEMrush employee.

Sam Mollaei
Hi Sam! Thanks for your great question.
If links are up-and-running at the browser but SEMrush reports them as broken, it means that our bot is blocked from crawling on those external resources.
Can you please hit up us at mail@semrush.com and provide the link to the site audit campaign? It will help a lot to provide some specific examples:)

good article ..not a single point left.. i was seeing the error for my porject which is uk based. [link removed by moderator] this site pages are indexed but issues are regarding the landing pages ..so this post was of extreme help to guess what exactly went wrong.. y my landing pages are not crawled...

meta name="robots" content="noodp" i used this on my website.
but google crawler does not able to read footer content on my website.Can anyone help me on this?

"noodp" is no longer applicable. Use meta name="robots" content="index, follow" instead, if you want the page to be followed & indexed by crawler.
Brian Cooper

A small but important nit: In #3 you write that "Disallow: /" "means that all the website’s pages are blocked from indexing." - In fact, a robots.txt Disallow prevents crawling, not indexing. I.e., they are blocked from being crawled but other factors can still lead to them being indexed. Seems minor but I have seen this confuse and vex a lot of situations! If a page gets indexed, say from an old inbound link, and it's blocked in robots.txt, it may remain indexed forever. You would have to add a meta noindex in the page itself AND also remove the "Disallow" from robots.txt so that it DOES get crawled and Google can see the noindex directive on the page.
Juan Bautista Calderón Lista

Hi Elena, reading your post i remain thinking about internal nofollow links, You said that if there are links to prívate users pages its better to tag them as nofollow, so al the tipical cart,register,account and login links in a ecommerce should be nofollow in your opinion¿¿ im confused right now
faisal saleem

great information , I would love to share Elena, thanks

nice post-Elena!
thank you for sharing the informative post
Elena Terenteva

SEMrush employee.

neel bhad
You are very welcome!

Great article Elena....
Elena Terenteva

SEMrush employee.

Jasna Rasik
Thank you, Jasna!

Great article. Definitely a pretty comprehensive list.

I would also add the following items:

A very common thing that I have run into is faceted navigation / mix and match spider traps that can absolutely kill the crawl budget -- This is easily discovered by looking for duplicate titles , meta descriptions and canonicals and/ or looking for the really long URLs with parameters and then loading the longest one in the browser and see if there are faceted filters present. Then just play around a bit with mixing and matching the filters. If you see a bunch of unique URLs being created and what seems to be a nearly unlimited amount of URLs are created, you have found one. Getting around this can be a short and long term project. You can band-aid it with a really good robots.txt, but the bigger fix may take much more effort as it would probably require tweaking the CMS.

You should also cover blocking your CSS, Images and JavaScript directories in robots.txt which is very common.

But yes, lots of bad links and redirected links and redirect chains or loops are all great things to look for and are fairly easy to detect even if you only do a small sampling (1,000 pages crawl).

Also, using non standard names or locations for robots.txt and sitemap,xml and not linking to your sitemap.xml from robots.txt is something pretty easy to detect.

One last one, I promise. Are they live with both HTTP and HTTPS? You just doubled the amount of URLs that can be crawled.

Hope this helps.
Elena Terenteva

SEMrush employee.

Kevin Beares
Wow! Thank you so much Kevin! Very valuable note.
I'd give you a prize for the best comment if we had one :)

Informative article! I mentioned it in my blog about SEO in the yacht industry: https://www.tidesntrends.com/blog/2017/marketing-yacht-industry-seo
Emma Labrador

Thanks for the article Elena! That's why you need a SEO log analyzer like OnCrawl to fix these issues :)
Jaden Madison

Hey Elena,
Thanks for sharing.
The issues you've stated are really crucial and greatly harm a website.
If you have one rather small website than you may check it manually. But for a big website / a big number of website it's hardly possible :-) So I use Netpeak Spider, it checks for almost all issues covered here and to my mind it's the only tool that detects issues in the sitemaps.

On a relaxedly serious note, once upon a time, very initially into my SEO venture adventure, I made the regrettable action of adding a "Disallow: /" line in the robots.txt file - and guess what, it worked like a charm: the relatively big news site it was on suddenly and almost instantaneously disappeared from search indexes. As a matter of fact, it recovered fast, and I managed to make up an excuse of the sorts of "technical glitch" phrasing.
Elena Terenteva

SEMrush employee.

Haha :) Yep, it's always Google's fault
Siva kumar

A 100 $$ Post
