You’ve been working really hard on your website and can’t wait to see it at the top of the search results, but your content is struggling to get past the 10th page. If you are sure that your website deserves to rank higher, the problem might lie in your website’s crawlability.
What is crawlability? Search engines use search bots to collect certain parameters of website pages. The process of collecting this data is called crawling. Based on this data, search engines include pages in their search index, which means that those pages can be found by users. A website’s crawlability is its accessibility to search bots. You have to be sure that search bots are able to find your website’s pages, obtain access to them and then “read” them.
We also break these issues down into two categories: those you can solve on your own and those that require a developer or a system administrator. Of course, we all have different backgrounds and skills, so treat this categorization as a rough guide.
What we mean by “solve on your own”: you can manage your website’s page code and root files. You also need basic coding knowledge (changing or replacing a piece of code in the right place and in the right manner).
What we mean by “delegate to a specialist”: server administration and/or web development skills are required.
These issues are pretty easy to detect and solve by simply checking your meta tags and robots.txt file, which is why you should look at them first. The whole website or certain pages can remain unseen by Google for a simple reason: its crawlers are not allowed to enter them.
There are several bot commands that will prevent page crawling. Note that it’s not a mistake to have these parameters in robots.txt; used properly and accurately, they help save crawl budget and give bots the exact direction they need to follow in order to crawl the pages you want crawled.
1. Blocking the page from indexing through robots meta tag
If you do this, the search bot will read the directive, leave the page out of the index and move on to the next page.
You can detect this issue by checking whether your page’s code contains this directive:
<meta name="robots" content="noindex" />
2. Nofollow links
In this case the site crawler will index your page’s content but will not follow the links. There are two types of nofollow directives:
- for the whole page. Check if you have
<meta name="robots" content="nofollow">
in the page’s code - that would mean the crawler can’t follow any link on the page.
- for a single link. This is what the code looks like in this case (the URL and link text are placeholders):
<a href="page.html" rel="nofollow">link text</a>
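If you have many pages to check, looking for these directives by hand gets tedious. Here is a minimal sketch of how the check could be automated with Python’s standard html.parser module. The class and function names are our own illustration, not part of any SEO tool, and the sketch only looks at the robots meta tag and rel="nofollow" on links.

```python
from html.parser import HTMLParser


class RobotsDirectiveFinder(HTMLParser):
    """Collects page-level robots directives and links marked rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.page_directives = set()  # e.g. {"noindex", "nofollow"}
        self.nofollow_links = []      # href values of links with rel="nofollow"

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        attrs = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            for directive in attrs.get("content", "").split(","):
                directive = directive.strip().lower()
                if directive:
                    self.page_directives.add(directive)
        elif tag == "a" and "nofollow" in attrs.get("rel", "").lower().split():
            self.nofollow_links.append(attrs.get("href", ""))


def find_robots_directives(html):
    """Returns (page-level directives, hrefs of nofollow links) for an HTML string."""
    finder = RobotsDirectiveFinder()
    finder.feed(html)
    return finder.page_directives, finder.nofollow_links
```

Feed it the HTML source of a page: if "noindex" or "nofollow" shows up in the first result, the whole page is affected; the second result lists the individual links that crawlers are told not to follow.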
3. Blocking the pages from indexing through robots.txt
Robots.txt is the first file of your website that crawlers look at. The most painful thing you can find there is:
User-agent: *
Disallow: /
It means that all the website’s pages are blocked from indexing.
It might happen that only certain pages or sections are blocked, for instance:
User-agent: *
Disallow: /products/
In this case any page in the Products subfolder will be blocked from indexing and, therefore, none of your product descriptions will be visible in Google.
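To see which of your URLs a given robots.txt actually blocks, you can test them against the rules. The sketch below uses Python’s standard urllib.robotparser module. The function name blocked_urls and the example.com URLs are our own assumptions for illustration, and the standard-library parser may not interpret every rule exactly the way Google’s crawler does.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Returns the URLs that the given robots.txt content disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /products/\n"
    pages = [
        "https://example.com/products/shoes",
        "https://example.com/blog/",
    ]
    for url in blocked_urls(rules, pages):
        print("Blocked:", url)
```

Run it against the list of pages you want ranked: anything it reports as blocked deserves a second look in your robots.txt.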
Broken links are a bad experience not only for your users but also for the crawlers. Every page the search bot indexes (or tries to index) spends crawl budget. With this in mind, if you have many broken links, the bot will waste its time on them and won’t reach your relevant, quality pages.
The Crawl errors report in Google Search Console or the Internal broken links check in SEMrush Site Audit will help you identify this type of problem.
4. URL errors
A URL error is usually caused by a typo in a URL you insert into your page (text link, image link, form link). Be sure to check that all the links are typed in correctly.
5. Outdated URLs
If you have recently undergone a website migration, a bulk delete or a URL structure change, you need to double-check this issue. Make sure you don’t link to old or deleted URLs from any of your website’s pages.
6. Pages with denied access
If you see that many pages on your website return, for example, a 403 status code, it’s possible that these pages are accessible only to registered users. Mark the links to them as nofollow so that they don’t waste crawl budget.
7. Server errors
A large number of 5xx errors (for example 502 errors) may be a signal of server problems. To solve them, provide the list of pages with errors to the person responsible for the website’s development and maintenance. This person will take care of the bugs or website configuration issues causing the server errors.
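Once you have a crawl report with the status code of each URL, it helps to sort the problems by who has to fix them. Here is a rough sketch that assumes you already have a list of (URL, status code) pairs exported from your audit tool. The function name and the group labels are our own, chosen to mirror the issues described above.

```python
from collections import defaultdict


def group_by_problem(crawl_results):
    """Groups (url, status_code) pairs by the kind of problem the status indicates."""
    groups = defaultdict(list)
    for url, status in crawl_results:
        if status in (401, 403):
            # Pages with denied access: consider nofollow on links to them.
            groups["access denied"].append(url)
        elif 400 <= status < 500:
            # URL errors and outdated URLs, typically 404 or 410.
            groups["broken link"].append(url)
        elif 500 <= status < 600:
            # Server errors: pass this list to your developer.
            groups["server error"].append(url)
    return dict(groups)
```

URLs that respond normally or redirect are simply left out, so the result contains only the pages that need attention.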
8. Limited server capacity
If your server is overloaded, it may stop responding to users’ and bots’ requests. When it happens, your visitors receive the “Connection timed out” message. This problem can only be solved together with the website maintenance specialist who will estimate if and how much the server capacity should be increased.
9. Web server misconfiguration
This is a tricky issue. The site can be perfectly visible to you as a human, but it keeps giving an error message to site crawlers, so all the pages become unavailable for crawling. This can happen because of a specific server configuration: some web application firewalls (for example, Apache mod_security) block Googlebot and other search bots by default. In a nutshell, this problem, with all its related aspects, must be solved by a specialist.
The sitemap, together with robots.txt, makes the first impression on crawlers. A correct sitemap advises them to index your site the way you want it to be indexed. Let’s see what can go wrong when the search bot starts looking at your sitemap(s).
10. Format errors
There are several types of format errors, for example an invalid URL or missing tags (see the complete list, along with a solution for each error, here).
You may also have found out (at the very first step) that the sitemap file is blocked by robots.txt. This means that the bots could not get access to the sitemap’s content.
11. Wrong pages in sitemap
Let’s move on to the content. Even if you are not a web programmer, you can assess the relevance of the URLs in the sitemap. Take a close look at the URLs in your sitemap and make sure that each one of them is relevant, up to date and correct (no typos or misprints). If the crawl budget is limited and bots can’t go through the entire website, the sitemap can help them index the most valuable pages first.
Don’t mislead the bots with contradictory instructions: make sure that the URLs in your sitemap are not blocked from indexing by meta directives or robots.txt.
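One way to catch such contradictions is to read the URLs out of the sitemap and test each of them against your robots.txt rules. Below is a minimal sketch using Python’s standard library. It assumes a plain XML sitemap in the standard sitemaps.org format (not a sitemap index), the function names are our own, and it checks only robots.txt, not meta directives.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Returns the <loc> values listed in an XML sitemap string."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def sitemap_conflicts(sitemap_xml, robots_txt, user_agent="Googlebot"):
    """Returns sitemap URLs that robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        url
        for url in sitemap_urls(sitemap_xml)
        if not parser.can_fetch(user_agent, url)
    ]
```

If ET.fromstring raises a ParseError, that is a hint that the sitemap itself has a format error. Any URL returned by sitemap_conflicts is one you are asking bots to index and forbidding them to crawl at the same time.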
The issues in this category are the most difficult to solve. This is why we recommend that you go through the previous steps before getting down to the following issues.
These problems, which relate to site architecture, can disorient or block the crawlers on your website.
12. Bad internal linking
In a correctly optimized website structure all the pages form an unbroken chain, so that the site crawlers can easily reach every page.
In an unoptimized website certain pages fall out of the crawlers’ sight. There can be different reasons for this, which you can easily detect and categorize using the Site Audit tool by SEMrush:
- The page you want to get ranked is not linked by any other page on the website. This way it has no chance to be found and indexed by search bots.
- Too many transitions between the main page and the page you want ranked. Common practice is four link transitions or fewer; otherwise there’s a chance that the bot won’t reach it.
- More than 3000 active links on one page (too much work for the crawler).
- The links are hidden in unindexable site elements: forms that require submission, frames, plugins (Java and Flash first of all).
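The first three checks in this list can be sketched as a breadth-first search over your internal link graph. The example below assumes you already have a map of each page to the pages it links to (for instance, exported from a crawl). The function names and the dictionary format are our own assumptions, and the thresholds simply reuse the figures mentioned above.

```python
from collections import deque


def click_depths(links, home):
    """Breadth-first search from the home page; returns {page: clicks from home}."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def audit_internal_links(links, home, max_depth=4, max_links=3000):
    """links: {page: [pages it links to]}. Returns pages with linking problems."""
    depths = click_depths(links, home)
    all_pages = set(links) | {t for targets in links.values() for t in targets}
    return {
        # Pages a crawler cannot reach by following links from the home page.
        "unreachable": sorted(all_pages - set(depths)),
        # Pages more than max_depth clicks away from the home page.
        "too_deep": sorted(p for p, d in depths.items() if d > max_depth),
        # Pages carrying more than max_links outgoing links.
        "too_many_links": sorted(p for p, t in links.items() if len(t) > max_links),
    }
```

Note that "unreachable" is slightly broader than "not linked by any other page": it also catches pages that are linked only from other unreachable pages.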
In most cases the internal linking problem isn’t something you can solve at the drop of a hat. A deep review of the website structure in collaboration with developers is needed.
13. Wrong redirects
Redirects are necessary to forward users to a more relevant page (or, better, the one that the website owner considers relevant). Here’s what you can overlook when working with redirects:
Temporary redirect instead of permanent: Using 302 or 307 redirects is a signal to crawlers to come back to the page again and again, spending the crawl budget. So, if you understand that the original page doesn’t need to be indexed anymore, use the 301 (permanent) redirect for it.
Redirect loop: It may happen that two pages redirect to each other, so the bot gets caught in a loop and wastes all the crawl budget. Double-check your redirects and remove any mutual ones.
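If you keep your redirects in a list or a spreadsheet, a loop is easy to spot programmatically. Here is a small sketch that assumes your redirects are available as a source-to-target map. The function name, the map format and the max_hops limit are our own choices for illustration.

```python
def follow_redirects(redirects, start, max_hops=10):
    """Follows a {source: target} redirect map from start.

    Returns (chain, is_loop): the URLs visited in order, and whether the
    chain comes back to a URL it has already passed through.
    """
    chain = [start]
    seen = {start}
    current = start
    while current in redirects and len(chain) <= max_hops:
        current = redirects[current]
        chain.append(current)
        if current in seen:
            return chain, True
        seen.add(current)
    return chain, False
```

For example, with the map {"/a": "/b", "/b": "/a"}, following "/a" visits "/a", "/b" and then "/a" again, and the function flags the chain as a loop. Long chains that are not loops are worth shortening too, since every hop costs the bot an extra request.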
14. Slow load speed
The faster your pages load, the quicker the crawler goes through them. Every split second is important. A website’s position in the SERP is also correlated with its load speed.
Use Google PageSpeed Insights to verify whether your website is fast enough. If the load speed is slow enough to deter users, several factors may be affecting it.
Server side factors: your website may be slow for a simple reason – the current channel bandwidth is not sufficient anymore. You can check the bandwidth in your pricing plan description.
Front-end factors: one of the most frequent issues is unoptimized code. If it contains voluminous scripts and plug-ins, your site is at risk. Also don’t forget to verify on a regular basis that your images, videos and other similar content are optimized and don’t slow down the page’s load speed.
15. Page duplicates caused by poor website architecture
Duplicate content is the most frequent SEO issue, found in 50% of sites according to the recent SEMrush study "11 Most Common On-site SEO Issues." This is one of the main reasons you run out of crawl budget. Google dedicates a limited time to each website, so it’s improper to waste it by indexing the same content. Another problem is that the site crawlers don’t know which copy to trust more and may give priority to wrong pages, as long as you don’t use canonicals to clear things up.
To fix this issue you need to identify duplicate pages and prevent their crawling in one of the following ways:
- Delete duplicate pages
- Set necessary parameters in robots.txt
- Set necessary parameters in meta tags
- Set a 301 redirect
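Before you can fix duplicates, you have to find them. A simple way to identify exact duplicates is to compare a hash of each page’s text, as in the sketch below. It assumes you already have the text content of each URL, the function name is our own, and it only catches pages whose text is identical after normalizing whitespace and case; near-duplicates need a dedicated tool.

```python
import hashlib
import re
from collections import defaultdict


def find_duplicates(pages):
    """pages: {url: text content}. Returns groups of URLs with identical text."""
    groups = defaultdict(list)
    for url, text in pages.items():
        # Collapse whitespace and ignore case so trivial differences don't matter.
        normalized = re.sub(r"\s+", " ", text).strip().lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]
```

Each group it returns is a set of URLs serving the same content, so you can decide which copy to keep and which of the fixes above to apply to the rest.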
16. JS and CSS usage
17. Flash content
Using Flash is a slippery slope both for user experience (Flash files are not supported on some mobile devices) and for SEO. Text content or links inside a Flash element are unlikely to be indexed by crawlers.
So we suggest that you simply don’t use it on your website.
18. HTML frames
If your site contains frames, there’s good news and bad news. The good news is that your site is probably mature enough. The bad news is that HTML frames are extremely outdated and poorly indexed, and you need to replace them with a more up-to-date solution as fast as possible.
Delegate Daily Grind, Focus on Action
It’s not necessarily wrong keywords or content-related issues that keep you floating under Google’s radar. A perfectly optimized page is no guarantee that you will get it ranked at the top (or ranked at all) if the content can’t be delivered to the engine because of crawlability problems.
To figure out what is blocking or disorienting Google’s crawlers on your website, you need to review your domain from soup to nuts. Doing it manually is a strenuous effort. This is why you should entrust routine tasks to appropriate tools. Most popular site audit solutions help you identify, categorize and prioritize the issues, so you can proceed to action immediately after getting the report. Moreover, many tools store the data of previous audits, which gives you the big picture of your website’s technical performance over time.
Are there other issues you consider critical for a website’s crawlability? Do you use any tools that help optimize and solve these issues in a timely manner? Feel free to share your suggestions in the comments!