Hello. Welcome to another episode of Weekly Wisdom. Today I would like to talk about crawl budget, as it is going to be understood in 2020 and beyond. Crawl budget is something that has changed over the last year. Basically, there are quite a lot of new factors affecting it and quite a lot of changes with how Google is crawling, rendering, and indexing our content. I will go through all of the things affecting the crawl budget that I could come up with. I am going to try to go through as many things as possible, definitely touching on the most important ones within the next 10 to 15 minutes.
Organized Website Structure
Let's start with the most important thing, which is actually a classic, something that hasn't changed for the last few years: an organized website structure. That statement covers quite a lot of different factors. Let's go through them one by one:
Original content, no duplicates — this means no duplicate content, no near-duplicates, no soft 404s. These are the key offenders behind most crawl budget and index bloat issues.
Index bloat covers everything that is not valuable or searchable. If you have pages that people wouldn't search for, that don't have any traffic flow, or that don't correspond to any query or user intent, I wouldn't have them indexed in Google.
Crawl errors — everything that directly affects the crawling of your website: internal redirects, internal 404s, server problems, 500-code problems such as timeouts, and so on.
The most important part, but not as technical as the previous section, would be information architecture. It covers everything that goes with how your website is structured and how logically it is built. Information architecture affects how both users and Google can look into your website structure and understand what to rank and how to index your content properly.
This has everything to do with indexing strategy. For example, you could have an eCommerce store and quite a lot of different pages and faceted navigation with different filters. You would not want to index pages with a filter from $102 to $104 for a certain product. The whole indexing strategy for an eCommerce store has to be in place to make sure that Google's crawling and indexing are as efficient as possible.
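As a sketch of what such an indexing strategy can look like in practice, here is a minimal Python rule for deciding whether a faceted-navigation URL deserves indexing. The parameter names (`price_min`, `price_max`, and so on) are assumptions for illustration, not something from the video:

```python
from urllib.parse import parse_qs, urlparse

# Facet parameters assumed for this example store; URLs carrying them
# add no search value and should not be indexed.
BLOCKED_FACETS = {"price_min", "price_max", "sort", "page_size"}

def should_index(url: str) -> bool:
    """Decide whether a faceted-navigation URL deserves to be indexed."""
    params = parse_qs(urlparse(url).query)
    # Any blocked facet in the query string disqualifies the URL.
    return not (BLOCKED_FACETS & params.keys())

print(should_index("https://shop.example.com/shoes/"))  # True
print(should_index("https://shop.example.com/shoes/?price_min=102&price_max=104"))  # False
```

The decision itself would then be enforced with noindex tags or robots rules on the disqualified URLs.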
If you have a lot of similar products or a lot of content pieces that are somehow similar to each other, you need to differentiate them, so Google clearly knows which of those pages is the most important for a given query. As one of the oldest rules in SEO says, if you have two pages competing for the same query within your structure, neither of the two is going to win. Most likely, your competitor is going to steal quite a lot of that traffic.
Now that we have all of these products and a clear structure, we need proper internal linking. It helps Google reach all of your products, especially new ones, so they can be found and indexed within hours or minutes, not days.
Make sure that your sitemap only contains URLs that are indexable and valuable. There should not be any junk in your sitemap, because even if a small percentage of the URLs within your sitemap is of low quality, Google may start ignoring those sitemaps altogether.
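One way to keep junk out is to audit the sitemap against crawl results. Here is a minimal sketch, assuming you already have status codes and noindex flags from a crawler; the sitemap and crawl data are invented sample values:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/old-promo</loc></url>
  <url><loc>https://example.com/products/widget</loc></url>
</urlset>"""

# Crawl results gathered elsewhere: url -> (status code, noindex flag).
CRAWL_DATA = {
    "https://example.com/": (200, False),
    "https://example.com/old-promo": (404, False),
    "https://example.com/products/widget": (200, False),
}

def sitemap_junk(sitemap_xml: str, crawl_data: dict) -> list:
    """Return sitemap URLs that are not indexable (non-200 or noindex)."""
    root = ET.fromstring(sitemap_xml)
    junk = []
    for loc in root.findall(".//sm:loc", NS):
        url = loc.text.strip()
        status, noindex = crawl_data.get(url, (None, None))
        if status != 200 or noindex:
            junk.append(url)
    return junk

print(sitemap_junk(SITEMAP, CRAWL_DATA))  # ['https://example.com/old-promo']
```

Anything this audit flags should be fixed or dropped from the sitemap before Google starts distrusting the file.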
Canonicals, Noindex, and Robots
The last point in the website structure part is the proper usage of canonicals, noindex, robots, and so on. If you use these signals in the wrong way, Google may assume it was a mistake and, at some point, start ignoring them.
We see quite a lot of clients coming to us with index bloat, indexing problems, and crawl budget problems, and we often find that Google is ignoring their canonical tags altogether. I am guessing that at some point the algorithm decides, "Okay, this is not right. Maybe this is a technical mistake, so we are going to ignore canonicals altogether within this domain."
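A sketch of what a sanity check for such mixed signals might look like; the page fields here are hypothetical and would come from your own crawler:

```python
def conflicting_directives(page: dict) -> list:
    """Flag directive combinations that teach Google to distrust your signals."""
    problems = []
    # noindex plus a canonical pointing elsewhere sends mixed signals.
    if page.get("noindex") and page.get("canonical") not in (None, page["url"]):
        problems.append("noindex combined with a canonical to another URL")
    # If robots.txt blocks the URL, Google never fetches the page and
    # therefore never sees the noindex tag at all.
    if page.get("blocked_by_robots") and page.get("noindex"):
        problems.append("noindex on a robots-blocked URL is never seen")
    return problems

page = {
    "url": "https://example.com/a",
    "noindex": True,
    "canonical": "https://example.com/b",
    "blocked_by_robots": False,
}
print(conflicting_directives(page))
```

Running a check like this across a full crawl makes it easy to spot the patterns that push Google toward ignoring your directives.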
Website Performance
Server performance is something that, as Google has announced, affects crawl budget. If Google sees that crawling your website hurts its performance — for example, the more they crawl you, the longer the response times they see from your server — they will either stop crawling or slow it down massively. Because of that, they may not get to all the pieces of your content. To make sure that this doesn't happen, measure all of the metrics that affect crawling:
- Time to First Byte (TTFB)
- Backend Performance
- Server Performance
Problems with all of the above may affect your content's availability to Google.
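To make TTFB concrete, here is a self-contained sketch that measures it against a throwaway local server, where a 0.2-second sleep stands in for slow backend work; against a real site you would point the same function at your own host:

```python
import http.server
import socket
import threading
import time

class SlowHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(0.2)              # simulate slow backend work
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):    # keep the demo output clean
        pass

# Throwaway server on a random free port; the daemon thread dies with the process.
server = http.server.HTTPServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

def ttfb(host: str, port: int, path: str = "/") -> float:
    """Seconds from opening the connection to the first response byte."""
    start = time.monotonic()
    with socket.create_connection((host, port)) as sock:
        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode())
        sock.recv(1)                 # block until the first byte arrives
    return time.monotonic() - start

print(f"TTFB: {ttfb('127.0.0.1', port) * 1000:.0f} ms")
```

Here the measured TTFB sits just above the simulated 0.2-second backend delay; the same gap between connection and first byte is what Googlebot experiences on your real server.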
Checking Server Performance
You can use tools like Load Impact or do it yourself.
Put a lot of load onto your server: crawl it with 100 threads, or send a massive traffic load at it, and see how that affects your TTFB and your server's performance. If a lot of traffic slows down your server, this will most likely be visible to search engines as well. There is one more thing to look into when optimizing your server performance to make sure that it doesn't hurt your crawl budget:
- Your website should be on a unique IP, so there are no other websites on the same IP as your domain.
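If you'd rather script the check than use a hosted tool, here is a rough load-test sketch. It again targets a throwaway local server; in practice you would point it at your own staging host, and carefully:

```python
import threading
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass

# Local stand-in for the site under test, on a random free port.
server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/"

def timed_get(url: str) -> float:
    start = time.monotonic()
    urllib.request.urlopen(url).read()
    return time.monotonic() - start

def load_test(url: str, threads: int = 100, total: int = 500):
    """Hammer `url` from many threads; return (average, worst) latency."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        times = list(pool.map(timed_get, [url] * total))
    return sum(times) / len(times), max(times)

avg, worst = load_test(url, threads=20, total=100)
print(f"avg {avg * 1000:.1f} ms, worst {worst * 1000:.1f} ms")
```

If the worst-case latency balloons as you raise the thread count, Googlebot will see the same slowdown and throttle its crawling accordingly.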
Rendering
Now that we have all those technical things in place, remember that Google quite often needs to render your pages to see all of your content, and this is where we get into another technical part.
To check if you have that problem, you can use this tool. Enter your website's URL and see if there is any difference between the rendered and non-rendered versions, and don't forget to do that for all the different views within your site. So if you have an eCommerce store, check the homepage, product pages, category pages, subcategory pages, and all the different layouts and views you are using within your domain.
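If you have already saved the raw HTML and a rendered snapshot (for example, from a headless browser), a rough comparison of their visible text can quantify the gap. The two HTML strings below are invented examples:

```python
import difflib
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def visible_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

# Raw HTML: an empty app shell that relies on JavaScript to fill in content.
raw = "<html><body><div id='app'></div><script>render()</script></body></html>"
# Rendered snapshot: the same page after JavaScript has run.
rendered = "<html><body><div id='app'><h1>Blue Shoes</h1><p>In stock.</p></div></body></html>"

ratio = difflib.SequenceMatcher(None, visible_text(raw), visible_text(rendered)).ratio()
print(f"similarity: {ratio:.2f}")  # a low ratio means the content only exists after rendering
```

A low similarity score on any page template is a signal that Google has to spend extra rendering resources before it can see that content at all.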
What does it mean for you?
If your website has all of the things I already mentioned optimized, make sure that the content you are producing and pushing out is unique and has value. There is also a problem you will see with a lot of clickbaity websites that live off social media.
Some of their articles may be interesting, but they are not something people search for. So make sure that people actually search for something in the ballpark of your article's title. If your website is full of pages that are not searched for, you may struggle with Google indexing your content, because at some point Google will figure out that your content is not really valuable for search.
I have a very good example of stale content that actually becomes a problem — that becomes a cost for you and the search engine.
A few years back, we worked with a sports website, a website that was predicting the results of some of the games (like football matches). They had hundreds of those per day. So you can imagine the number of pages that were completely useless after two, three, four years, because no one wants to read about the prediction of the basketball game from three years back.
Such pages have to be heavily monitored, and if you have too many of them, make sure to deindex them or maybe move them to some kind of archive outside of the domain and make sure that they are not slowing down your content being indexed.
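A sketch of how that monitoring could be scripted, with invented sample data; the age and traffic thresholds are arbitrary and would depend on your site:

```python
from datetime import date, timedelta

# Sample crawl/analytics export: (url, last date with traffic, monthly visits).
PAGES = [
    ("/predictions/2017-03-12-city-vs-united", date(2017, 4, 1), 0),
    ("/predictions/2024-05-30-final", date(2024, 6, 10), 850),
]

def deindex_candidates(pages, max_age_days=730, min_visits=10, today=None):
    """Pages with no recent traffic are candidates for noindex or archiving."""
    today = today or date.today()
    cutoff = today - timedelta(days=max_age_days)
    return [
        url
        for url, last_seen, visits in pages
        if last_seen < cutoff and visits < min_visits
    ]

print(deindex_candidates(PAGES, today=date(2024, 7, 1)))
```

Run against a real analytics export, a report like this gives you a ready-made list of stale prediction pages to noindex or move to an archive.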
Thank you so much. There are quite a lot of points in this video, but this is a good list to get you started with crawler budget as it is understood in 2020 and beyond. If you have any questions, feel free to reach out to me. Thank you so much, and see you in the next episode.