The robots.txt file is a text file containing instructions for search engine robots, telling them which pages they should and should not crawl. These instructions work by ‘allowing’ or ‘disallowing’ the behaviour of certain bots, or of all of them.
Why is the robots.txt file important?
It is the first place a search engine robot visits, and it will hopefully* follow the instructions contained within the file. So if you want to exclude certain pages from being crawled, this is the place to do it.
*N.B. Some user agents (robots) may choose to ignore your robots.txt file. This is especially common with malicious crawlers such as malware robots or email address scrapers.
How does the robots.txt file work?
A search engine robot’s first action on visiting a website is to look for a robots.txt file. If one is found, the robot reads the file and follows its rules before crawling anything else.
The robots.txt file must always sit in the website’s root directory. For example, for the website ‘www.websitesyork.co.uk’, you would find the robots.txt file at ‘www.websitesyork.co.uk/robots.txt’. If the file is placed anywhere else, search engine crawlers will assume you do not have one set up.
The robots.txt file syntax
A robots.txt file is made up of one or more blocks of ‘directives’ (rules), each with a specified ‘user-agent’ (search engine bot) and an ‘allow’ or ‘disallow’ instruction.
The first line of every block of directives is the ‘user-agent’ which identifies the crawler it addresses.
The second line in any block of directives is the ‘Disallow’ line. You can have multiple disallow directives that specify which parts of your site the crawler does not have access to.
An empty ‘Disallow’ line means that you are not disallowing anything – enabling a crawler to access all sections of your site. The ‘Allow’ directive allows search engines to crawl a subdirectory or specific page, even in an otherwise disallowed directory.
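For example, a block such as the following (the folder and page names here are purely illustrative) blocks a folder but still lets crawlers reach one page inside it:
User-agent: *
Disallow: /example-folder/
Allow: /example-folder/allowed-page.html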
Examples
1. Blocking all web crawlers from all content
User-agent: *
Disallow: /
2. Allowing all web crawlers access to all content
User-agent: *
Disallow:
3. Blocking a specific web crawler from a specific folder
User-agent: Googlebot
Disallow: /example-subfolder/
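4. Blocking a specific web crawler from multiple folders while leaving all other crawlers unrestricted
As an illustrative sketch (the folder names below are hypothetical), a file can combine several blocks of directives, and a block can contain more than one Disallow line:
User-agent: Googlebot
Disallow: /example-subfolder/
Disallow: /another-subfolder/

User-agent: *
Disallow: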