Colin Craig

A Beginner's Guide to Robots.txt: Everything You Need to Know

You have more control over the search engines than you think.

It is true; you can manipulate who crawls and indexes your site – even down to individual pages. To control this, you will need to utilize a robots.txt file. Robots.txt is a simple text file that resides within the root directory of your website. It informs the robots that are dispatched by search engines which pages to crawl and which to overlook.

While not exactly the be-all and end-all, you have probably figured out that it is quite a powerful tool that allows you to present your website to Google in the way you want it to be seen. Search engines are harsh judges of character, so it is essential to make a great impression. Robots.txt, when used correctly, can improve crawl frequency, which can positively impact your SEO efforts.

So, how do you create one? How do you use it? What things should you avoid? Check out this post to find the answers to all these questions.

What Is a Robots.txt File?

Back when the internet was just a baby-faced kid with the potential to do great things, developers devised programs to crawl and index fresh pages on the web. They called these programs 'robots' or 'spiders'.

Occasionally these little fellas would wander off onto websites that weren’t intended to be crawled and indexed, such as sites undergoing maintenance. The creator of the world’s first search engine, Aliweb, recommended a solution – a road map of sorts, which each robot must follow.  

This roadmap was finalized in June 1994 by a collection of internet-savvy techies as the "Robots Exclusion Protocol".

A robots.txt file is the practical implementation of this protocol. The protocol delineates the guidelines that every legitimate robot must follow, including Google's bots. Some illegitimate robots, such as malware, spyware and the like, by definition operate outside these rules.

You can take a peek behind the curtain of any website by taking its URL and adding /robots.txt at the end.

For example, here's POD Digital's version:
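In essence, it amounts to little more than this (a sketch rather than the file reproduced verbatim):

User-agent: *
Disallow: /wp-admin/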

As you can see, it is not necessary to have an all-singing, all-dancing file, as we are a relatively small website.

Where to Locate the Robots.txt File

Your robots.txt file will be stored in the root directory of your site. To locate it, open your cPanel File Manager (or connect via FTP), and you will find the file in your public_html directory.

There is nothing much to these files, so they won't be hefty; probably only a few hundred bytes, if that.

Once you open the file in your text editor, you will be greeted with something that looks a little like this:
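For most sites, that means a user-agent line followed by a handful of rules, along these lines (the blocked folders here are just placeholders):

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/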

If you aren’t able to find a file in your site’s inner workings, then you will have to create your own.

How to Put Together a Robots.txt File

Robots.txt is a super basic text file, so it is actually straightforward to create. All you will need is a simple text editor like Notepad. Open a blank document and save it as 'robots.txt'.

Now log in to your cPanel and locate the public_html folder to access the site's root directory. Once that is open, drag your file into it.

Finally, you must ensure that you have set the correct permissions for the file. As the owner, you need to be able to read and write the file, while everyone else should only be able to read it.

The file should display a “0644” permission code.

If not, you will need to change this, so click on the file and select 'File Permissions'.

Voila! You have a Robots.txt file.

Robots.txt Syntax 

A robots.txt file is made up of multiple sections of 'directives', each beginning with a specified user-agent. The user-agent is the name of the specific crawl bot that the code is speaking to.

There are two options available:

a) You can use a wildcard to address all search engines at once.

b) You can address specific search engines individually.

When a bot is deployed to crawl a website, it will be drawn to the block that is addressed to it.

Here is an example:
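In this sketch, the folder names are purely illustrative; the first block speaks to all bots, while the next two speak to Googlebot and Bingbot specifically:

User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /not-for-google/

User-agent: Bingbot
Disallow: /not-for-bing/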

User-Agent Directive 

The first line in each block is the 'user-agent', which pinpoints a specific bot. The user-agent will match that bot's name. So, if you want to tell Googlebot what to do, start with:

User-agent: Googlebot

Search engines always try to pinpoint the block of directives that relates most closely to them. So, for example, if you have two blocks, one for Googlebot-Video and one for Bingbot, a bot that comes along with the user-agent 'Bingbot' will follow the Bingbot instructions, whereas the 'Googlebot-Video' bot will pass over that block and go in search of the directives that match it most specifically.
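As a rough sketch (again with illustrative folders), given the following file, the Googlebot-Video crawler would obey only the second block, because it is the closest match to its own name:

User-agent: Googlebot
Disallow: /not-for-google/

User-agent: Googlebot-Video
Disallow: /not-for-google-video/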

Most search engines have a few different bots; common user-agents include Googlebot, Googlebot-Image, Googlebot-News, Googlebot-Video, Bingbot, Slurp (Yahoo), Baiduspider, and YandexBot.

Host Directive

The host directive is supported only by Yandex at the moment, although there is some speculation that Google may have supported it at one point. This directive lets you tell the search engine which hostname you prefer, namely whether to show the www. before your URL, using this line:

Host: poddigital.co.uk

Since Yandex is the only confirmed supporter of the directive, it's not advisable to rely on it. Instead, 301 redirect the hostnames you don't want to the ones you do.

Disallow Directive

The second line in a block of directives is Disallow. You can use this to specify which sections of the site bots shouldn't access. An empty Disallow means it is a free-for-all, and the bots can please themselves as to where they do and don't visit. We will cover this directive in more specific circumstances a little later on.
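For now, here are the two extremes, shown as two separate sketches (the folder name is just a placeholder):

User-agent: *
Disallow: /private-folder/

User-agent: *
Disallow:

The first keeps bots out of /private-folder/, while the second, with its empty Disallow, lets them go wherever they like.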

Sitemap Directive (XML Sitemaps)

Using the sitemap directive tells search engines where to find your XML sitemap.
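It is a single line, and the URL below is simply a placeholder for wherever your sitemap actually lives:

Sitemap: https://www.example.com/sitemap.xml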

However, probably the most useful thing to do is to submit each sitemap to the search engines' own webmaster tools, because you can learn a lot of valuable information about your website from each of them.

However, if you are short on time, the sitemap directive is a viable alternative.

Crawl-Delay Directive

Yahoo, Bing, and Yandex can be a little trigger-happy when it comes to crawling, but they do respond to the crawl-delay directive, which keeps them at bay for a while.

Applying this line to your block:

Crawl-delay: 10

means that you are asking the search engine to wait ten seconds before crawling the site, or ten seconds before re-accessing the site after a crawl; it is basically the same idea, but the exact interpretation differs slightly depending on the search engine.
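In context, the directive sits inside a user-agent block, for example (an illustrative sketch):

User-agent: Bingbot
Crawl-delay: 10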

Why Use Robots.txt

Now that you know the basics and how to use a few directives, you can put together your file. However, this next step will come down to the kind of content on your site.

Robots.txt is not an essential element of a successful website; in fact, your site can still function correctly and rank well without one.

However, there are several key benefits you must be aware of before you dismiss it:

  • Point Bots Away From Private Folders: Preventing bots from checking out your private folders will make them much harder to find and index.
  • Keep Resources Under Control: Each time a bot crawls your site, it sucks up bandwidth and other server resources. For sites with tons of content and lots of pages (e-commerce sites, for example, can have thousands), these resources can be drained really quickly. You can use robots.txt to make it difficult for bots to access individual scripts and images, retaining those valuable resources for real visitors.

You will naturally want search engines to find their way to the most important pages on your website. By politely cordoning off specific pages, you can control which pages are put in front of searchers (be sure to never completely block search engines from seeing certain pages, though).

For example, if we look back at the POD Digital robots file, we see that this URL:

poddigital.co.uk/wp-admin has been disallowed.

Since that page is made just for us to log in to the control panel, it makes no sense to allow bots to waste their time and energy crawling it.

Noindex

So, we have been talking about the disallow directive as if it is the answer to all of our problems. However, it doesn't always prevent the page from being indexed.

You could potentially disallow a page, and it may still end up somewhere in the SERPs; this is where the noindex command comes in. It works in tandem with the disallow command to ensure that bots don't crawl or index specified pages.

Here is an example of how you’d do this:
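A sketch of the syntax that was used (the path is a placeholder, and as noted below, Google no longer supports this):

User-agent: *
Disallow: /example-page/
Noindex: /example-page/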

Once you have specified these instructions, the chosen page won't end up in the SERPs... or so we thought.

There is some debate as to whether this is an effective tool: some search engines support it, but others don't. According to research completed by Stone Temple, Google has been a little inconsistent in its stance on this.

In a study of 13 websites, they found that some pages were removed without being crawled, while others remained in the SERPs despite multiple crawls. Also, if the target page is already ranking, it can take multiple crawls over several weeks before the page actually does drop out.

The study concludes that the directive has roughly an 80% success rate, but with Google no longer supporting the feature, you must decide whether this approach is worth your time and effort, considering that it may not be supported anywhere in the near future.

Things to Avoid

We have talked a little about the things you can do with your robots.txt file and the different ways you can use it. In this section, we will delve a little deeper into each point and explain how each may turn into an SEO disaster if not used properly.

Overusing Crawl-Delay

We have already explained what the crawl-delay directive does, but you should avoid using it too often, as it limits the number of pages the bots can crawl. That is fine for some websites, but if you have a huge site, you could be shooting yourself in the foot and holding back good rankings and solid traffic.

Using Robots.txt to Prevent Content Indexing

We have covered this a little already: as we said, disallowing a page is the best way to try to prevent bots from crawling it directly.

But it won't work in the following circumstances:

  • If the page has been linked from an external source, the bots will still flow through and index the page.
  • Illegitimate bots will still crawl and index the content.

Using Robots.txt to Shield Private Content

Some private content, such as PDFs or thank-you pages, is indexable even if you point the bots away from it. One of the best methods to use alongside the disallow directive is to place all of your private content behind a login.

Of course, it adds a further step for your visitors, but your content will remain secure.
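As a sketch, you might still disallow the folder that holds those files (the path is just a placeholder), but it is the login that actually keeps them private:

User-agent: *
Disallow: /members-only/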

Using Robots.txt to Hide Malicious Duplicate Content

Duplicate content is sometimes a necessary evil; think printer-friendly pages, for example.

However, Google and the other search engines are smart enough to know when you are trying to hide something. In fact, doing this may actually draw more attention to it, because Google recognizes the difference between a printer-friendly page and someone trying to pull the wool over its eyes.

Either way, there is still a chance it may be found anyway.

Here are three ways to deal with this kind of content:

  • Rewrite the Content – Creating exciting and useful content will encourage the search engines to view your website as a trusted source. This suggestion is especially relevant if the content is a copy and paste job.
  • 301 Redirect – 301 redirects inform search engines that a page has transferred to another location. Add a 301 to a page with duplicate content and divert visitors to the original content on the site.
  • Rel="canonical" – This is a tag that informs Google of the original location of duplicated content; this is especially important for e-commerce websites, where the CMS often generates duplicate versions of the same URL (see the example below).
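As a quick sketch, the canonical tag sits in the <head> of the duplicate page and points at the URL you want search engines to treat as the original (the address here is a placeholder):

<link rel="canonical" href="https://www.example.com/original-page/" />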

The Moment of Truth: Testing Out Your Robots.txt File

Now is the time to test your file to ensure everything is working in the way you want it to.

Google Search Console (formerly Google Webmaster Tools) has a robots.txt test section.

So first, sign in and find your website within the crawl section of the menu.

There you will find the robots.txt Tester tool:

Remove anything currently in the box, replace it with your new robots.txt file, and click 'Test'.

If 'Test' changes to 'Allowed', then you have got yourself a fully functioning robots.txt file.

Creating your robots.txt file correctly means you are improving both your SEO and the user experience of your visitors.

By allowing bots to spend their days crawling the right things, you let them organize and show your content in the way you want it to be seen in the SERPs.

Though I use it on every site, I have always considered the robots.txt protocol to be a bad short-term fix, like a lot of other early internet decisions. If you want a directory to be hidden and not indexed, you need to put it in robots.txt, which means bad actors can find what you don't want indexed; there is a shopping list for them in robots.txt! It might have been better in hindsight to swap the protocol around, or add to it, so you could declare that nothing should be crawled other than the directories you list for crawling and indexing.
Nice article, thanks for sharing, but I have a small doubt: I have an eCommerce site, so where exactly can I use the canonical tag?
Geeky Seo
Hi there,

In relation to what we talked about in the article, you'd use a canonical at the page, category or product level to inform Google about pages which are similar or the same in terms of their content; think printer-friendly pages, for example.

Here's some more info: https://www.semrush.com/blog/learning-technical-seo/

I'm glad you enjoyed the article.
Nice article and I am very thankful to read this.
aakib
Hi Aakib,

I'm glad you enjoyed the article!

Hope to post some more in the near future. Stay tuned.
This article is useful and easily understandable by beginners.
BALACHANDAR I
Yes, of course
BALACHANDAR I
Hi there,

I'm glad you found it helpful!
Very nice explanation.
Robots.txt is the most important file.