
A Beginner's Guide to Robots.txt: Everything You Need to Know


Colin Craig

You have more control over the search engines than you think.

It is true: you can control which bots crawl and index your site – even down to individual pages. To do this, you will need to use a robots.txt file. Robots.txt is a simple text file that resides in the root directory of your website. It informs the robots dispatched by search engines which pages to crawl and which to overlook.

While not exactly the be-all and end-all, you have probably figured out that it is quite a powerful tool that lets you present your website to Google the way you want it to be seen. Search engines are harsh judges of character, so it is essential to make a great impression. Robots.txt, when used correctly, can improve crawl efficiency, which can benefit your SEO efforts.

So, how do you create one? How do you use it? What things should you avoid? Check out this post to find the answers to all these questions.

What Is a Robots.txt File?

Back when the internet was just a baby-faced kid with the potential to do great things, developers devised a way to crawl and index fresh pages on the web. They called these ‘robots’ or ‘spiders’.

Occasionally these little fellas would wander off onto websites that weren’t intended to be crawled and indexed, such as sites undergoing maintenance. The creator of the world’s first search engine, Aliweb, recommended a solution – a road map of sorts, which each robot must follow.  

This road map was finalized in June of 1994 by a collection of internet-savvy techies as the “Robots Exclusion Protocol”.

A robots.txt file is the execution of this protocol. The protocol delineates the guidelines that every legitimate robot must follow, including Google's bots. Some illegitimate robots, such as malware, spyware, and the like, by definition operate outside these rules.

You can take a peek behind the curtain of any website by typing in any URL and adding /robots.txt at the end.

For example, here’s POD Digital’s version:

[Image: POD Digital's robots.txt file]

As you can see, it is not necessary to have an all-singing, all-dancing file, as we are a relatively small website.
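
To give a sense of scale, here is a sketch of what a small site's file often looks like – an illustration rather than POD Digital's actual file, with placeholder paths and a placeholder domain:

User-agent: *
# Keep every bot out of the admin area; everything else is fair game
Disallow: /admin/
# Tell crawlers where the XML sitemap lives
Sitemap: https://www.example.com/sitemap.xml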

Where to Locate the Robots.txt File

Your robots.txt file will be stored in the root directory of your site. To locate it, open your cPanel File Manager (or connect via FTP), and you will find the file in your public_html directory.

[Image: the cPanel File Manager]

There is not much to these files, so they won't be hefty – probably only a few hundred bytes, if that.

Once you open the file in your text editor, you will be greeted with something that looks a little like this:

[Image: a basic robots.txt file opened in a text editor]

If you aren’t able to find a file in your site’s inner workings, then you will have to create your own.

How to Put Together a Robots.txt File

Robots.txt is a super basic text file, so it is actually straightforward to create. All you will need is a simple text editor like Notepad. Open a blank document and save it as ‘robots.txt’.

Now log in to your cPanel and locate the public_html folder to access the site’s root directory. Once that is open, drag your file into it.

Finally, you must ensure that you have set the correct permissions for the file. Basically, as the owner, you need to be able to read, write, and edit the file, while everyone else should only be able to read it.

The file should display a “0644” permission code.

[Image: the file permissions dialog in cPanel]

If not, you will need to change this, so click on the file and select “File Permissions”.

Voila! You have a Robots.txt file.

Robots.txt Syntax 

A robots.txt file is made up of multiple sections of ‘directives’, each beginning with a specified user-agent. The user agent is the name of the specific crawl bot that the code is speaking to.

There are two options available:

a)   You can use a wildcard to address all search engines at once.

b)  You can address specific search engines individually.

When a bot is deployed to crawl a website, it will be drawn to the block that is calling to it.

Here is an example:

[Image: example robots.txt syntax]
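
As a rough sketch of both options (the folder names here are made up for illustration), option (a) uses the * wildcard to speak to every bot at once, while option (b) opens a block that names one specific bot:

# (a) One block for all search engines
User-agent: *
Disallow: /private/

# (b) A block aimed at a single bot
User-agent: Googlebot
Disallow: /not-for-google/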

User-Agent Directive 

The first few lines in each block are the ‘user-agent’, which pinpoints a specific bot. The user-agent will match a specific bot’s name, so, for example:

[Image: user-agent directive examples]

So if you want to tell Googlebot what to do, for example, start with:

User-agent: Googlebot

Search engines always try to pinpoint the specific directives that relate most closely to them. So, for example, if you have two blocks of directives, one for Googlebot-Video and one for Bingbot, a bot that comes along with the user-agent ‘Bingbot’ will follow the Bingbot instructions, whereas the ‘Googlebot-Video’ bot will pass over that block and go in search of a more specific directive.
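
To illustrate (with hypothetical folder names), a file containing the two blocks below would have Bingbot obey the first block, while Googlebot-Video skips it and follows the second, closer match:

User-agent: Bingbot
Disallow: /no-bing/

User-agent: Googlebot-Video
Disallow: /no-video/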

Most search engines operate a few different bots, each dedicated to a particular type of content – Google, for example, has separate crawlers for images, video, and news.

Host Directive

The host directive is supported only by Yandex at the moment, although there is some speculation that Google has supported it at some point. This directive allows you to decide whether to show the www. before your domain, using this block:

Host: poddigital.co.uk

Since Yandex is the only confirmed supporter of the directive, it's not advisable to rely on it. Instead, 301 redirect the hostnames you don't want to the ones you do.

Disallow Directive

The second line in a block of directives is Disallow. You can use this to specify which sections of the site shouldn’t be accessed by bots. An empty Disallow means it is a free-for-all, and the bots can please themselves as to where they do and don’t visit.

We will cover this directive in a more specific circumstance a little later on.
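
For example (again with made-up folder names), the first block below leaves the whole site open to every bot, while the second keeps Googlebot out of one folder:

# Empty Disallow: everything may be crawled
User-agent: *
Disallow:

# Googlebot is kept out of a specific folder
User-agent: Googlebot
Disallow: /private-folder/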

Sitemap Directive (XML Sitemaps)

Using the sitemap directive tells search engines where to find your XML sitemap.

However, the most useful thing to do is probably to submit each sitemap to each search engine’s own webmaster tools, because you can learn a lot of valuable information there about your website.

However, if you are short on time, the sitemap directive is a viable alternative.
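
If you do use it, the directive is simply the word Sitemap followed by the full URL of the sitemap file – something like this, assuming a sitemap sitting at the root of an example domain:

Sitemap: https://www.example.com/sitemap.xml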

Crawl-Delay Directive

Yahoo, Bing, and Yandex can be a little trigger happy when it comes to crawling, but they do respond to the crawl-delay directive, which keeps them at bay for a while.

Applying this line to your block:

Crawl-delay: 10

means that you can make the search engines wait ten seconds before crawling the site, or ten seconds before they re-access the site after crawling – it is basically the same thing, interpreted slightly differently depending on the search engine.
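
Because crawl-delay sits inside a user-agent block, a minimal sketch that asks Bingbot to slow down would look like this (the ten-second value is simply the example used above):

User-agent: Bingbot
Crawl-delay: 10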

Why Use Robots.txt

Now that you know about the basics and how to use a few directives, you can put together your file. However, this next step will come down to the kind of content on your site.

Robots.txt is not an essential element of a successful website; in fact, your site can still function correctly and rank well without one.

However, there are several key benefits you must be aware of before you dismiss it:

  • Point Bots Away From Private Folders: Preventing bots from checking out your private folders will make them much harder to find and index.
  • Keep Resources Under Control: Each time a bot crawls your site, it sucks up bandwidth and other server resources. On sites with tons of content and lots of pages – e-commerce sites, for example, can have thousands of pages – these resources can be drained really quickly. You can use robots.txt to make it difficult for bots to access individual scripts and images; this retains valuable resources for real visitors.

You will naturally want search engines to find their way to the most important pages on your website. By politely cordoning off specific pages, you can control which pages are put in front of searchers (just be sure never to block search engines from seeing the pages you do want them to reach).

[Image: disallow directive example]

For example, if we look back at the POD Digital robots file, we see that this URL:

poddigital.co.uk/wp-admin has been disallowed.

Since that page is made just for us to log in to the control panel, it makes no sense to allow bots to waste their time and energy crawling it.
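
The relevant lines look something like this – a sketch based on the standard WordPress admin path rather than a copy of our exact file:

User-agent: *
Disallow: /wp-admin/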

Noindex

So, we have been talking about the disallow directive as if it is the answer to all of our problems. However, it doesn't always prevent the page from being indexed.

You could potentially disallow a page, and it may still end up somewhere in the SERPs; this is where the noindex command comes in. It works in tandem with the disallow command to ensure that bots don’t crawl or index specified pages.

Here is an example of how you’d do this:

[Image: noindex directive example]
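
As a sketch of the pattern described here (the page path is hypothetical, and, as discussed below, Google has since stopped supporting the directive), the two commands were typically paired like this:

User-agent: *
Disallow: /thank-you/
Noindex: /thank-you/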

Once you have specified these instructions, the chosen page won’t end up in the SERPs...or, so we thought.

There is some debate as to whether this is an effective tool: some search engines support it, while others don’t. As per research completed by Stone Temple, Google has been a little inconsistent in its stance on this.

In a study of 13 websites, they found that some pages were removed without being crawled, while others remained in the SERPs despite multiple crawls. Also, if the target page is already ranking, it can take multiple crawls over several weeks before the page actually drops out.

The study concluded that the directive has around an 80% success rate, but with Google no longer supporting such a feature, you must decide whether or not this approach is worth your time and effort, considering that it may not be supported anywhere in the near future.

Things to Avoid

We have talked a little about the things you can do with your robots.txt file and the different ways you can operate it. In this section, we will delve a little deeper into each point and explain how each may turn into an SEO disaster if not utilized properly.

Overusing Crawl-Delay

We have already explained what the crawl-delay directive does, but you should avoid using it too often, as it limits the number of pages the bots crawl. This is perfect for some websites, but if you have got a huge website, you could be shooting yourself in the foot and preventing good rankings and solid traffic.

Using Robots.txt to Prevent Content Indexing

We have covered this a little already: as we said, disallowing a page is the best way to try to prevent the bots from crawling it directly.

But it won't work in the following circumstances:

  • If the page has been linked from an external source, the bots will still flow through and index the page.
  • Illegitimate bots will still crawl and index the content.

Using Robots.txt to Shield Private Content

[Image: the SEMrush login page]

Some private content, such as PDFs or thank-you pages, is indexable even if you point the bots away from it. One of the best methods to use alongside the disallow directive is to place all of your private content behind a login.

Of course, this does add a further step for your visitors, but your content will remain secure.

Using Robots.txt to Hide Malicious Duplicate Content

Duplicate content is sometimes a necessary evil – think printer-friendly pages, for example.

However, Google and the other search engines are smart enough to know when you are trying to hide something. In fact, doing this may actually draw more attention to it, because Google recognizes the difference between a printer-friendly page and someone trying to pull the wool over its eyes:

[Image: duplicate content example]

There is still a chance it may be found anyway.

Here are three ways to deal with this kind of content:

  • Rewrite the Content – Creating exciting and useful content will encourage the search engines to view your website as a trusted source. This suggestion is especially relevant if the content is a copy and paste job.
  • 301 Redirect – 301 redirects inform search engines that a page has transferred to another location. Add a 301 to a page with duplicate content and divert visitors to the original content on the site.
  • Rel=“canonical” – This is a tag that informs Google of the original location of duplicated content; this is especially important for e-commerce websites, where the CMS often generates duplicate versions of the same URL.

The Moment of Truth: Testing Out Your Robots.txt File

Now is the time to test your file to ensure everything is working in the way you want it to.

Google’s Webmaster Tools has a robots.txt test section.

[Image: the Google Webmaster Tools sign-in page]

So first, sign in and find your website within the crawl section of the menu.

There you will find the robots.txt Tester tool:

[Image: the robots.txt Tester tool]

Remove anything currently in the box, replace it with your new robots.txt file, and click ‘Test’:

[Image: the robots.txt Tester ‘Test’ button]

If ‘Test’ changes to ‘Allowed’, then you have got yourself a fully functioning robots.txt file.

Creating your robots.txt file correctly means you are improving both your SEO and the user experience of your visitors.

If you allow bots to spend their days crawling the right things, they will be able to organize and show your content in the way you want it to be seen in the SERPs.

