Demystifying the robots.txt file

The robots.txt file is probably one of the least understood aspects of the search engine optimization world.

Essentially, a robots.txt file tells the various search engine spiders (a.k.a. robots or bots) to crawl or not to crawl specific sections of a website.

The robots.txt file can list files or pages that should not be crawled at all, or it can give individual spiders and bots their own instructions about which parts of your site to crawl.

Many search engine spiders routinely look for the robots.txt file as soon as they arrive on a site. Many Search Engine Optimization (SEO) experts agree that including this file makes good sense because it acts as an invitation to crawl and index your website’s content.

There are, however, some important instances when you may want to limit or even exclude bots from crawling a site.

Some examples of this are:

  • when rogue spiders are crawling your site chiefly to index it for their own use
  • when the site holds sensitive information (e.g., unfinished projects you do not want indexed, such as site redesigns or exclusive beta tests)
  • when site owners decide that there is no need to index portions of their site, such as image files, download files, or cgi-bin directories.

Search engines scan through files that surfers will never see, and this is reason enough to put a robots.txt file on your site. If your site stats include a section on ‘files not found’, you may see many entries where search engine spiders looked for and failed to find a robots.txt file on your site.

Creating the robots.txt file
Creating a basic robots.txt file is a relatively simple process.

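Open Notepad or your favorite text editor and follow along with the instructions below. When you are done, save the file as plain text under the name robots.txt in the top-level directory of your site, which is where robots look for it.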

Every robots.txt file is made up of records, and each record has two fields: a “User-agent” line and a “Disallow” line.

The “User-agent” line specifies the robot or spider that you are instructing, and the “Disallow” line tells that robot which parts of the site it may or may not crawl.

Here are two examples…

Example #1 allows robots to index everything while Example #2 prohibits robots from indexing anything:
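As a minimal sketch, the two records described here might look like the following. The lines starting with “#” are comments added for illustration; robots ignore them.

```text
# Example #1: every robot may crawl the whole site
User-agent: *
Disallow:
```

```text
# Example #2: no robot may crawl any part of the site
User-agent: *
Disallow: /
```

Each listing is a complete robots.txt file on its own. Use one or the other, not both in the same file.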

In the User-agent field, the asterisk (*) is the symbol for ‘all’, so in both examples the record applies to all ‘User-agents’, or robots.

The “Disallow” field informs the robot (User-agent) which parts of the site it may not crawl. Anything that is not listed may be crawled.

If you want to allow all of your website to be crawled, leave the Disallow field blank (see Example #1).

If you want to disallow all crawling, enter a single forward slash (/), which matches every path on the site (see Example #2).

You can use this disallow-everything record while a website is still under construction. Just don’t forget to remove it once the site is live.

The majority of websites welcome robots to index them freely. However, there are some instances where crawling is unnecessary or unwanted, and those parts of the site should be “off limits” to robots.

Say you have some confidential documents or downloadable files for users. By indicating that these files are not to be crawled by robots, you keep them from being indexed and from turning up in search results where anyone can find them. You can exclude files from all robots or from individual search engines.

For instance, say you have a file called TopSecret.htm in a directory called ‘CONFIDENTIAL’ that you do not want to be spidered by robots (because you don’t want everyone to have access to these files).
You would simply add the following lines to your robots.txt file:
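The article does not say where the CONFIDENTIAL directory lives, so the sketch below assumes it sits directly under the top level of the site. Under that assumption, the record might look like this:

```text
# Assumes /CONFIDENTIAL/ is directly under the site root; adjust the path if it is not
User-agent: *
Disallow: /CONFIDENTIAL/TopSecret.htm
```

Paths in robots.txt are case-sensitive, so the directory and file name should be typed exactly as they appear in the page’s URL.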

You can disallow whole directories by:
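A sketch for the same CONFIDENTIAL directory, again assuming it sits directly under the top level of the site:

```text
# Assumes /CONFIDENTIAL/ is directly under the site root
User-agent: *
Disallow: /CONFIDENTIAL/
```

The trailing slash limits the rule to that directory and everything inside it. Without the slash, the rule would match every path that merely begins with /CONFIDENTIAL.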

If, for some reason, you choose to prohibit only certain robots/spiders from crawling your site, the User-agent line should contain the name of the specific spider rather than the asterisk.

For example, if a rogue robot keeps indexing your forum, you would include the following in your robots.txt file:
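The sketch below uses two made-up placeholders: “BadBot” stands in for the name the rogue robot reports (you can usually find it in your server logs), and “/forum/” stands in for wherever your forum actually lives.

```text
# "BadBot" and "/forum/" are hypothetical placeholders; substitute your own values
User-agent: BadBot
Disallow: /forum/
```

This record only affects the named robot. To shut that robot out of the whole site rather than just the forum, the Disallow line would be a single forward slash, as in Example #2.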

Alternatives to the robots.txt file
Having said all of the above, you do not actually have to create a robots.txt file if you don’t want to. It is, however, the older and more widely respected method of controlling robots and web crawlers.

The alternative is to use the “noindex,nofollow” value in the robots meta tag of your HTML pages. Though not a foolproof way to keep out robots that routinely burn up bandwidth, such a tag does mesh with the general objectives of many websites.

Here’s an example of a “noindex,nofollow” HTML meta tag, which goes in the head section of each page you want to keep out of the index:

<meta name="robots" content="noindex,nofollow">

There are literally hundreds (if not thousands) of bots and spiders crawling the Web. Many of them will take note of your robots.txt file, but not all of them will, so if the information is very sensitive, it is better not to store it on a server at all.

If you would like to use a cool tool to generate your robots.txt file you can check out my Resources page on my website.
