
Weekly Wisdom with Bartosz Góralewicz: Crawl Budget


Bartosz Góralewicz

Modified Transcript

Hello. Welcome to another episode of Weekly Wisdom. Today I would like to talk about crawl budget, as it is going to be understood in 2020 and beyond. Crawl budget is something that has changed over the last year. Basically, there are quite a lot of new factors affecting it and quite a lot of changes with how Google is crawling, rendering, and indexing our content. I will go through all of the things affecting the crawl budget that I could come up with. I am going to try to go through as many things as possible, definitely touching on the most important ones within the next 10 to 15 minutes.

Organized Website Structure

Let's start with the most important thing, which is actually a classic, something that hasn't changed over the last few years: an organized website structure. Quite a lot of different factors fall under that statement. Let's go through them one by one:

Original content, no duplicates — this means no duplicate content, no near-duplicates, no soft 404s. These are the key offenders behind most crawl budget and index bloat issues.

Index bloat contains everything that is not valuable, not searchable. If you have any pages that people wouldn't search for or that don't have any kind of traffic flow, or they don't correspond to any queries or user intent, then I wouldn't have them indexed in Google.

Everything that directly affects crawling your website: internal redirects, internal 404s, server problems, 500 code problems such as timeouts, and so on.
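To make the soft-404 part of that checklist concrete, here is a minimal detector sketch in Python; the error-page phrases are illustrative assumptions, not an exhaustive list:

```python
# Heuristic soft-404 detector: a page that returns HTTP 200 but whose body
# reads like an error page wastes crawl budget on worthless content.
NOT_FOUND_PHRASES = (
    "page not found",
    "no longer available",
    "0 results",
    "nothing matched your search",
)

def looks_like_soft_404(status_code: int, html: str) -> bool:
    """Flag pages that answer 200 OK but present error-page content."""
    if status_code != 200:
        return False  # real 404s/410s are caught by normal status checks
    body = html.lower()
    return any(phrase in body for phrase in NOT_FOUND_PHRASES)
```

You would feed it the status code and body from your own crawler; pages it flags answer 200 OK while serving error content.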

Information Architecture

The most important part, but not as technical as the previous section, would be information architecture. It covers everything that goes with how your website is structured and how logically it is built. Information architecture affects how both users and Google can look into your website structure and understand what to rank and how to index your content properly.

This has everything to do with indexing strategy. For example, you could have an eCommerce store and quite a lot of different pages and faceted navigation with different filters. You would not want to index pages with a filter from $102 to $104 for a certain product. The whole indexing strategy for an eCommerce store has to be in place to make sure that Google's crawling and indexing are as efficient as possible.
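As a hedged illustration of such an indexing strategy, the first step is often keeping crawlers away from low-value filter URLs via robots.txt; the parameter names here are hypothetical and would need to match your store's actual URL patterns:

```text
# Hypothetical robots.txt rules: block crawling of faceted URLs
# (price filters, sort orders) while category pages stay crawlable.
User-agent: *
Disallow: /*?*price_min=
Disallow: /*?*price_max=
Disallow: /*?*sort=
```

For filter pages that should stay crawlable but not indexed, a meta robots noindex tag is the usual alternative.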

Internal Cannibalization 

If you have a lot of similar products or a lot of content pieces that are somehow similar to each other, you need to differentiate them, so Google clearly knows which of those pages is the most important for a given query. As the oldest rule in Google says, if you have two pages competing for the same query within your structure, neither of these two is going to win. Most likely, your competitor is going to steal quite a lot of that traffic.

Internal Linking

Now that we have all of these products and a clear structure, we need proper internal linking. This strategy helps Google get to all of your products, especially new ones, so they can be found fairly quickly and indexed within hours or minutes, not days.
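One way to sanity-check internal linking is to measure click depth from the homepage; this is a sketch, assuming you already have a page-to-links map from a crawl:

```python
from collections import deque

def click_depth(links: dict, start: str = "/") -> dict:
    """Breadth-first search over an internal-link graph: how many clicks
    from the homepage each page is. Pages that are many clicks deep, or
    missing from the result entirely, tend to be crawled late or never."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth
```

New products that end up more than a few clicks deep are typically the ones that take days rather than hours to be found.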

Orphan Pages 

Orphan pages are also quite a popular problem within eCommerce stores. It sometimes happens that you will have a product that is not linked, or maybe Google cannot find it. For example, if your pagination relies on JavaScript that is not clickable, Google won't be able to find products beyond the first page. There are hundreds of different reasons why pages may be orphaned, and you should make sure that this is not your problem.
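A simple way to surface orphan candidates is to compare the URLs you know should exist (from the sitemap, the product database, or analytics) against the URLs an internal-link crawl actually reached; a minimal sketch:

```python
def find_orphans(known_urls, crawled_urls):
    """URLs that should exist but were never reached by following
    internal links are orphan candidates worth investigating."""
    return sorted(set(known_urls) - set(crawled_urls))
```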

Clean Sitemaps

Make sure that your sitemap only contains URLs that are indexable and valuable. There should not be any junk in your sitemap, because even if a small percentage of the URLs within your sitemap is of low quality, Google may start ignoring those sitemaps altogether.
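A hedged sketch of that kind of sitemap hygiene check, assuming your crawler can supply the status code and indexing signals for each URL:

```python
def sitemap_worthy(url, status, noindex=False, canonical=None):
    """A URL belongs in the sitemap only if it answers 200, is not
    noindexed, and canonicalizes to itself (or has no canonical)."""
    return status == 200 and not noindex and canonical in (None, url)
```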

Canonicals, Noindex, and Robots

The last point in the website structure part is the proper usage of canonicals, noindex, robots, and so on. If you use these in the wrong way, Google may assume it was a mistake and, at some point, start ignoring them.

We see quite a lot of clients coming to us with index bloat, with indexing problems, with crawler budget problems, and we see that Google is basically ignoring their canonical tags altogether. I am guessing the algorithm at some point is assuming, "Okay, this is not right. Maybe this is a technical mistake, so we are going to ignore canonicals altogether within this domain."
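A small checker for the kind of mixed signals described above; the dictionary keys are assumptions standing in for whatever your crawler reports:

```python
def directive_conflicts(page):
    """Flag indexing-directive combinations that send Google mixed signals."""
    warnings = []
    if page.get("noindex") and page.get("canonical") not in (None, page["url"]):
        warnings.append("noindex combined with a canonical pointing elsewhere")
    if page.get("noindex") and page.get("robots_disallowed"):
        warnings.append("noindex on a robots.txt-blocked URL (the tag is never crawled)")
    return warnings
```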

Server Performance

This is something that, as Google announced, affects crawl budget. If Google sees that crawling your website hurts its performance (for example, the more they crawl you, the longer the response times they see from your server), they will definitely either stop crawling or slow it down massively. Because of that, they may not get to all the pieces of your content. To make sure that this doesn't happen, measure all of the metrics that affect crawling:

  • Time to First Byte (TTFB)
  • Backend Performance
  • Server Performance
  • Uptime
  • Timeouts 

Problems with all of the above may affect your content's availability to Google.
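TTFB, the first metric on that list, can be measured roughly like this; a sketch, not a monitoring setup, which would repeat the measurement over time and track percentiles:

```python
import time
from urllib.request import urlopen

def time_to_first_byte(url: str) -> float:
    """Seconds from sending the request until the first byte of the
    response body arrives."""
    start = time.perf_counter()
    with urlopen(url, timeout=10) as response:
        response.read(1)  # wait for the first byte only
    return time.perf_counter() - start
```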

Checking Server Performance

You can use tools like Load Impact or do it yourself.

Put a lot of load onto your server: crawl it with 100 threads, or send a massive amount of traffic to it, and see how that affects your TTFB and your server's performance. If a lot of traffic slows down your server, this will most likely be visible to search engines as well. There are two more things to look into when optimizing your server performance to make sure that it doesn't hurt your crawler budget:

  • Your website should be on a unique IP, so there are no other websites on the same IP as your domain.
  • All of your resources, like CSS files and JavaScript files, should be accessible; they are quite often hosted on a different domain because of CDNs and so on.

Rendering

Now that we have all those technical things in place, your domain quite often needs to be rendered to see all of your content, and this is where we get into another technical part.

We can be almost a hundred percent sure that if you have a medium-sized domain, you are using JavaScript in some capacity. Make sure that no content within your domain relies on JavaScript. If switching off JavaScript makes some of the content within your domain disappear, it means that your domain needs to go through rendering with Google to be fully indexed.

This is not something you want to rely on. Google is very good at rendering compared to other search engines, but it is still far from what we would like to see as SEOs. You should make sure that all of your content is visible without JavaScript so your domain doesn't have to go through that extra step of rendering. Rendering is very expensive, and it is definitely going to affect your indexing.
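One way to run that check yourself: grab the raw HTML (via curl or the browser's "view source", neither of which executes JavaScript) and verify that your key content phrases are already in it; a minimal sketch:

```python
def missing_without_js(raw_html: str, key_phrases: list) -> list:
    """Given the raw, unrendered HTML source, return the key content
    phrases that are absent -- those pieces depend on JavaScript rendering."""
    body = raw_html.lower()
    return [phrase for phrase in key_phrases if phrase.lower() not in body]
```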

To check if you have that problem, you can use this tool. Enter your website's URL and see if there is any difference between the rendered version and not rendered version, and don't forget to do that for all the different views within your page. So if you have an eCommerce store, check the homepage, product page, category page, subcategory page, and all the different possible layouts and views you are using within your domain.

JavaScript Rendering Cost

One last thing to do on the topic of JavaScript and rendering, to make sure that your website doesn't struggle with rendering problems, is to check your JavaScript rendering cost. Some websites are quite heavy for users, so they may also be heavy to crawl and render.

Even though there are mixed signals from Google in this department, it is better to be safe than sorry. You can use the TL;DR tool (it stands for "Too long; didn't render") and verify that your website is in the green zone, which means that the cost of rendering your website is not very high. If the tool puts your website above the green level, that means that your JavaScript rendering is quite expensive.

What does it mean for you?

First, Google may struggle with crawling and indexing your content; this is still a little bit controversial, but the risk is there. Second, and this is a sure problem, users with slower mobile phones will struggle to open and render your website. That is it for the rendering and JavaScript department.

Content

If your website has all the things I already mentioned optimized, make sure that the content you are producing and pushing out is unique and has value. There is also a problem you will see with a lot of clickbaity websites that live off social media, for example.

Some of their articles may be interesting, but they are not something you search for. So make sure that people actually search for something in the ballpark of your article's title. If you have a website full of pages that are not searched for, you may struggle with Google indexing your content because, at some point, Google will figure out that your content is not really valuable for search.

Outdated Content

I have a very good example of stale content that actually becomes a problem, a cost for both you and the search engine.

A few years back, we worked with a sports website, a website that was predicting the results of some of the games (like football matches). They had hundreds of those per day. So you can imagine the number of pages that were completely useless after two, three, four years, because no one wants to read about the prediction of the basketball game from three years back.

Such pages have to be heavily monitored, and if you have too many of them, make sure to deindex them or maybe move them to some kind of archive outside of the domain and make sure that they are not slowing down your content being indexed.


Thank you so much. There are quite a lot of points in this video, but this is a good list to get you started with crawler budget as it is understood in 2020 and beyond. If you have any questions, feel free to reach out to me. Thank you so much, and see you in the next episode.

Bartosz Góralewicz

Bartosz Goralewicz has been a staple in the SEO industry over the last decade as both the co-founder of Elephate (2018’s “Best Small SEO Agency” in Europe) and a thought leader. His ground-breaking research on JavaScript SEO has been the subject of numerous viral articles and has brought him on stages all over the world to share his knowledge. In 2019, he decided to go even more technical and founded Onely – the one and only technical SEO house. Onely’s specialized team works with Fortune 100 companies and other major international brands while continuing to push the envelope in technical SEO. He is also a husband and father of two.

Comments

Nauman Mohamed

This is quite fresh content. Generally, I find repeated content structures. Thank you for this one.

Stunning Writing!! Keep up the good work.
Anatolii Ulitovskyi

Big projects often create orphan or duplicate pages that spend the whole crawl budget. SEOs need to run an on-page audit each month in order to find them.
