What Is A Robots.txt File
And more importantly: does your site need one?
Throughout the history of the internet, and of computing itself, privacy and automation have been significant topics to explore. To every rule, specialists know, there is an exception, and storing the list of exceptions is crucial for good programming. This leads us to the idea of the Robots.txt file and its main purpose.
You’ve probably heard a lot about the Robots.txt file if you’ve ever been around people who are good at SEO, or if you’re trying to become one of them. This is because Robots.txt is one of the main ways to tell the website crawlers of Google (and other search engines) which parts of your site they may visit.
What does it do?
In essence, for a search engine to know what information your site contains, it needs to have “crawled” it, that is, to have checked the content your site has. It does that using bots, also known as “spiders” or “crawlers”, which check your content and index your site.
What happens if you don’t want parts of your site indexed?
To some of you, this might sound counterintuitive. “Why would I not want it indexed? Isn’t indexing good for SEO?” Actually, it can sometimes be bad: if you keep two versions of the same text (for example a printer-friendly page and a web-friendly page), that still counts as duplicate content and might get you penalized.
Even if you have other reasons for hiding content from being indexed, you can make use of the Robots.txt file. But keep in mind that it will not stop malicious programs from crawling your site anyway. It doesn’t protect your site in any way; it is best described as a “Do not disturb” note: it doesn’t lock the door for you, but it keeps all “civilized” people from entering. A Robots.txt file is the same: some bots abide by it, while some don’t. The bots of major search engines like Google or Bing do.
How can I write one?
Let’s start with the more practical things. A Robots.txt file isn’t written in HTML; as its extension suggests, it’s a regular plain text file. For most of its history it also wasn’t a formal standard approved by any organization, just a convention that has existed since 1994 (the IETF eventually published it as RFC 9309 in 2022).
A Robots.txt file uses two basic directives, written in this order: User-agent first, then Disallow.
They should be fairly self-explanatory, but let’s spell out what they do. The User-agent line names the bot (or any other software, really) that the rules beneath it apply to. The Disallow line that follows states which pages or resources that particular user agent should not request. You can block anything from individual images and pages to entire folders.
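To make the layout concrete, here is a minimal Python sketch that reads Robots.txt text and collects each User-agent line together with the Disallow lines beneath it. The function name parse_groups, the bot name ExampleBot and the path /private/ are all made up for illustration; real crawlers handle more directives and edge cases than this.

```python
def parse_groups(text):
    """Split robots.txt text into (user_agents, disallowed_paths) groups."""
    groups = []
    agents, paths = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue  # skip blank or malformed lines
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if paths:  # a User-agent line after rules starts a new group
                groups.append((agents, paths))
                agents, paths = [], []
            agents.append(value)
        elif field == "disallow" and agents:
            # rules only count once a User-agent line has been seen
            paths.append(value)
    if agents:
        groups.append((agents, paths))
    return groups


# Hypothetical file: the bot name and the path are placeholders.
sample = """\
User-agent: ExampleBot
Disallow: /private/
"""
groups = parse_groups(sample)
print(groups)  # [(['ExampleBot'], ['/private/'])]
```

Notice that the order matters: a Disallow line only means something once a User-agent line has said who it is addressed to.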
So let’s look at some examples, shall we?
Our first example is a file with just two lines: User-agent: * followed by Disallow: /. It is rarely used in practice, but it demonstrates how Robots.txt is interpreted. The user agent is *, a wildcard meaning “everything”, so the resources listed under Disallow are off-limits to every type of bot. The Disallow value is “/”, which is only one character long but is the path to the root directory of the website, so everything under the root is off-limits. This extreme example therefore blocks everything from all bots.
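If you’d like to see this for yourself, the sketch below writes out the block-everything file as described above and checks it with Python’s standard urllib.robotparser module. The bot name ExampleBot and the example.com address are placeholders, not anything from a real site.

```python
from urllib.robotparser import RobotFileParser

# The file as described in the text: every bot, every path.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(BLOCK_ALL.splitlines())

# ExampleBot and example.com are placeholders for illustration.
home_allowed = parser.can_fetch("ExampleBot", "https://example.com/")
page_allowed = parser.can_fetch("ExampleBot", "https://example.com/about.html")
print(home_allowed, page_allowed)  # False False
```

Whatever bot name or page you try, the answer stays the same, which is exactly what the wildcard and the root path are meant to achieve.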
In the second example, the user agent is Googlebot, the name Google’s crawler goes by, so the rules apply only to Google’s bot. The file has two Disallow lines, /temp/ and /stuff/page1.html, which shows that you can disallow an entire folder (like /temp/) as well as individual files (like /stuff/page1.html).
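Here is a sketch of what that Google-only file might look like, again checked with urllib.robotparser. It assumes the user agent is written as Googlebot; the helper name allowed, the bot name OtherBot, the extra paths and the example.com address are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed layout: one group for Googlebot with a folder and a single file.
GOOGLE_ONLY = """\
User-agent: Googlebot
Disallow: /temp/
Disallow: /stuff/page1.html
"""

google_parser = RobotFileParser()
google_parser.parse(GOOGLE_ONLY.splitlines())


def allowed(agent, path):
    """Hypothetical helper: may `agent` fetch `path` on a placeholder site?"""
    return google_parser.can_fetch(agent, "https://example.com" + path)


print(allowed("Googlebot", "/temp/notes.txt"))    # False: inside the /temp/ folder
print(allowed("Googlebot", "/stuff/page1.html"))  # False: the single blocked file
print(allowed("Googlebot", "/stuff/page2.html"))  # True: a neighbouring file is fine
print(allowed("OtherBot", "/temp/notes.txt"))     # True: the rules name only Googlebot
```

The last line is the one to remember: a bot that isn’t named in the file, and has no wildcard group to fall back on, is not restricted at all.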
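Let’s merge the two cases to explore a particular characteristic of Robots.txt files. Suppose a single file contains both groups: the wildcard group that disallows /, followed by the Googlebot group that disallows /temp/ and /stuff/page1.html.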
In this case, most people would think that since every page is already off-limits to every user agent, the Googlebot group is redundant. That’s not so: Robots.txt groups don’t inherit from one another, and a bot follows only the most specific group that matches it. For all other robots the whole site is disallowed, while for Google’s bots only /temp/ and /stuff/page1.html are.
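The sketch below puts both groups into one file and asks urllib.robotparser what each bot may fetch. As before, it assumes the Google group is addressed to Googlebot, and the helper name may_fetch, the bot name OtherBot and the example.com address are placeholders for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed merged file: a wildcard group plus a Googlebot-specific group.
MERGED = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /temp/
Disallow: /stuff/page1.html
"""

merged_parser = RobotFileParser()
merged_parser.parse(MERGED.splitlines())


def may_fetch(agent, path):
    """Hypothetical helper: may `agent` fetch `path` on a placeholder site?"""
    return merged_parser.can_fetch(agent, "https://example.com" + path)


print(may_fetch("Googlebot", "/index.html"))       # True: only its own group applies
print(may_fetch("Googlebot", "/temp/notes.txt"))   # False: blocked by its own group
print(may_fetch("OtherBot", "/index.html"))        # False: falls back to the * group
```

So the Googlebot group actually loosens the rules for Google rather than adding to the wildcard ones, which is the opposite of what inheritance would give you.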
Hopefully, this tutorial has helped you better understand Robots.txt files.