Google’s ‘GoogleBot’ crawler visits sites large and small across the globe every day, and these sites’ pages are read and scrutinized before being added to Google’s index. Some pages don’t get included for specific reasons, such as a ‘noindex’ tag in the page’s meta or a canonical link pointing to a different page. Sometimes the page just isn’t good enough and Google leaves it out, or the page or domain has been blacklisted.
Some pages, however, don’t get indexed because they never get the chance to be found by the crawler. GoogleBot spends a limited time on a site, and there are instances where a site’s ‘crawl budget’ is used up before all of its pages are found.
There are a few reasons this might happen, and in this post I’ll try to cover them and offer ways you can reduce the chances of your crawl budget being used up before your priority pages get found and, hopefully, indexed!
Running out of crawl budget before all the pages have been found and crawled can be broken down into three main areas:
- Slow Site / Slow Page Speed – The crawler runs out of time because it is waiting for things to load too often.
- Poor Internal Linking – The crawler runs out of time because it cannot find an internal link to the page(s) and leaves the site.
- Too Many Pages – The crawler runs out of time because there are just too many pages on the site to get through.
So, how do you help GoogleBot along its way through your site’s pages? Below are a few tips that could help ensure as many pages as possible are crawled.
Increase Site Speed
It sounds simple enough, right? Increase the speed of the site’s loading time and the crawler will have more opportunity to get further through the site and index its pages. Depending on your skills and resources, this could be a bigger task than it seems. Many factors influence site speed, such as your hosting package, your server setup and the makeup of the pages themselves. This is particularly helpful for larger sites, so here are a few tips to maximize the speed of your site:
Pick a reliable host and a decent hosting package. This means doing your research to ensure you get plenty of bandwidth for users and bots, as well as a reliable level of uptime, so that user experience isn’t harmed and bots’ crawls aren’t interrupted.
There are a few ways you can speed up load times through your server setup. I wrote this post, which highlights the main areas for you, but essentially it could be a matter of a few lines of code standing between a slow and a fast loading site.
The items on your page, the architecture of your site and the code behind it all contribute to the speed of a page. Everything from extraneous code, slow APIs and externally loaded widgets to unoptimised images can slow things down. You can find some good on-page load time changes here.
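As a concrete illustration of the “few lines of code” idea, here is a minimal sketch of server-side tweaks, assuming an nginx setup (the file extensions and cache lifetime are example choices, not recommendations for every site):

```nginx
# Hypothetical nginx snippet: compress text responses and cache static
# assets so users and crawlers spend less time waiting on transfers.
gzip on;
gzip_types text/css application/javascript application/json image/svg+xml;

# Cache common static asset types for 30 days (adjust to your release cycle).
location ~* \.(css|js|png|jpg|jpeg|gif|svg|woff2)$ {
    expires 30d;
    add_header Cache-Control "public";
}
```

Apache and other servers have equivalent directives; the point is that a handful of config lines can meaningfully cut page weight and transfer time.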
Improve Internal Linking
Simply put, if a page isn’t linked to from other pages on your site, a crawler cannot find it (ignoring for the moment potential inbound links from external sites). The crawler follows the links it’s allowed to follow on each page it visits, through as much of the site as time allows. If there are no internal links pointing to a page, there’s a good chance it won’t get crawled. It’s a good idea to put your main pages into the site’s navigation, a sensibly sized on-page sitemap and, where appropriate, the body copy of your pages’ content.
Optimise Your XML Sitemap
In much the same way as with internal linking, neglecting to include a page in your XML sitemap further increases the chance that it won’t get crawled. There are some XML sitemap best practice tips here, but the main ones to follow are these:
Keep It Up To Date
Sites change all the time. Pages are removed and added fairly often in some cases, so unless your sitemap is regularly updated, some of your live pages won’t be found in there. It’s also, almost equally, important to ensure that old pages are removed and their new versions included. Time is wasted following a link that redirects or goes to a 404 page (I’ll cover that shortly), so remove old pages too. In many cases this is done automatically by the CMS or a plugin, however it’s important to check your sitemaps, as sometimes you need to manually direct pages to be included or not.
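The checking step can be scripted. Here is a minimal sketch in Python: a function that flags sitemap entries that no longer return a clean 200 (so redirects and 404s can be pruned). The function name and the injectable `fetch_status` parameter are my own illustration, not a standard tool:

```python
def find_stale_urls(urls, fetch_status):
    """Return (url, status) pairs for sitemap entries that are not a clean 200.

    fetch_status is any callable mapping a URL to an HTTP status code,
    passed in so the check can be tested without a live network.
    """
    stale = []
    for url in urls:
        status = fetch_status(url)
        if status != 200:  # redirects (3xx) and 404s both waste crawl budget
            stale.append((url, status))
    return stale
```

In practice `fetch_status` could wrap something like `requests.head(url, allow_redirects=False).status_code`; anything not returning 200 is a candidate for removal from the sitemap.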
Break Up Sitemaps
This mainly helps larger sites, where the main pages can be separated from the less important ones, for example splitting top level and category pages from individual product pages. A broken up sitemap makes it more likely that the priority pages will get crawled before the lower priority ones.
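The standard way to break a sitemap up is a sitemap index file pointing at the child sitemaps. A minimal sketch (the domain and file names are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Priority pages (top level and category pages) in their own file -->
  <sitemap>
    <loc>https://www.mydomain.com/sitemap-categories.xml</loc>
  </sitemap>
  <!-- Lower priority individual product pages kept separate -->
  <sitemap>
    <loc>https://www.mydomain.com/sitemap-products.xml</loc>
  </sitemap>
</sitemapindex>
```

You would submit the index file in Search Console and each child sitemap then follows the normal sitemap format.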
Setting Crawl Priorities
One other way to tell the crawler which pages to look at over others is to set priorities within your sitemap. This is expressed as a number between 0.0 and 1.0 (with 1.0 being the highest priority). It’s fairly easy to add manually, however in most cases the CMS or a plugin sets this automatically for you, with the ability to change it within their ecosystems.
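In the sitemap itself, the priority sits in a `<priority>` tag on each URL entry. A short sketch with placeholder URLs:

```xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.mydomain.com/</loc>
    <priority>1.0</priority>  <!-- homepage: highest priority -->
  </url>
  <url>
    <loc>https://www.mydomain.com/my-folder/my-page</loc>
    <priority>0.3</priority>  <!-- deep page: lower priority -->
  </url>
</urlset>
```

The values are relative hints within your own site rather than a guarantee of crawl order.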
Blocking Irrelevant Content
It stands to reason that if there’s only a limited time to crawl the site, you don’t want it wasted on irrelevant pages. Especially on larger sites! These are pages a user wouldn’t want to see in the search results, such as checkout or cart pages, or perhaps theme content folders. There are a couple of ways to do this:
Disallow Through Robots.txt
This is a small change to your robots.txt file that tells the crawler not to look at a specific page or within a directory. It’s fairly simple to add, however you should take care not to inadvertently block pages you want crawled. Note that Disallow rules take a path relative to the domain root, not a full URL. To block a specific page add Disallow: /my-folder/my-page (including any file extension); to block a directory add Disallow: /my-folder/.
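Put together, a robots.txt blocking the kinds of pages mentioned above might look like this (the paths are placeholders for your own site’s structure):

```
User-agent: *
Disallow: /my-folder/my-page.html
Disallow: /cart/
Disallow: /checkout/
```

The `User-agent: *` line applies the rules to all crawlers; you could name GoogleBot specifically instead if you only want to direct Google.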
For more info there’s a great, simple article on the robots.txt file here.
Using Meta Directives
Another way to block irrelevant content is to place a little bit of code in the <head> section of a page that tells crawlers not to index it. The code goes like this: <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">. The ‘NOFOLLOW’ part tells the crawler not to follow any of the links on that page either. You can drop it (leaving just ‘NOINDEX’) where the page’s links lead to content you do want crawled, or keep it for pages like checkout or cart pages, whose links only lead further into the cart process.
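In context, the tag sits alongside the rest of the page’s head content, for example:

```html
<head>
  <title>Checkout</title>
  <!-- Keep this page out of the index and don't follow its links -->
  <meta name="robots" content="noindex, nofollow">
  <!-- Alternative: "noindex, follow" if the page's links should still be crawled -->
</head>
```

Note that for the tag to be seen, the page must not also be blocked in robots.txt: a crawler that never fetches the page can never read its meta directives.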
Minimising Internal Linking Issues
So you’ve added an appropriate number of internal links within the site’s navigation, sitemap and body copy to aid the user and the crawler. Over time, as already mentioned, pages come and go, and you end up with links to non-existent content or redirected pages. This can slow the crawl of your site down and cause pages to be missed too!
Update Or Remove Broken Internal Links
When a page is taken down or replaced by a new one, it’s important to redirect the old URL for the user (and in some cases to pass on link equity). It’s also important to ensure that old internal links on the site are updated to point to the new page, or removed entirely. This prevents the crawler wasting time passing through a redirect or hitting a 404 page before continuing on its way.
This process should also be followed for inbound links, by redirecting them before they hit the site’s 404 page. In most cases it’s faster for an inbound bot to pass through a redirect than to land on the 404 page, crawl it and then move on to the next live page.
Minimising Internal Redirects and Chains
In almost the same way as above, it’s worth mentioning that during the life of a site you might redirect an old page to a new one, which is later removed and redirected itself. If the crawler then visits the site through an inbound link to the original page, it passes through two redirects. The same is true if you haven’t updated your internal linking, and it can slow the crawl down.
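The fix is to collapse the chain so every retired URL points straight at the final destination. A minimal sketch, assuming an Apache server with .htaccess rules (the paths are placeholders):

```apache
# Chain: /old-page -> /interim-page -> /new-page costs the crawler two hops.
# Point both retired URLs directly at the final destination instead:
Redirect 301 /old-page /new-page
Redirect 301 /interim-page /new-page
```

Other servers have equivalents (nginx uses `return 301` inside a `location` block); the principle of one hop per retired URL is the same.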
Update Content Regularly
Regularly updated content, through a blog or news section on the site, is a great way to get the crawler visiting more often. One good way to use these extra crawl visits to your advantage is to include internal links within the content (as well as the navigation). Bear in mind not to make it look unnatural, and keep the number of internal links in the content appropriate. This also provides additional internal relevancy to the linked pages. – Win, win!
Increase Inbound Links
One more way to increase the chances that all your pages get crawled and indexed is to increase the inbound links to your site’s pages. This gives the crawler something to follow from one site to another, offering a fresh crawl opportunity for your site. It isn’t as easy as throwing a lot of links at the site’s pages and sitting back, however. Link building should be done with care in order to keep your site’s link profile safe.
I hope this article helps you improve the number of pages crawled on your site, and if you have any other tips or suggestions, feel free to add them in the comments below.