Robots.txt
May 20, 2023
The robots.txt file is a standard used by websites to communicate with web robots (also known as crawlers or spiders) about which pages of the site should be crawled and indexed by search engines. It is a plain text file placed in the root directory of a website and contains a set of rules that web robots follow when crawling the site. The purpose of the robots.txt file is to help website owners control how their website is crawled and indexed by search engines.
Syntax of robots.txt
The syntax of robots.txt is simple. The file consists of a set of rules that web robots follow when crawling a website. Each rule has two parts: the user-agent, which identifies the web robots the rule applies to, and one or more directives, which tell those robots what to do. A sample robots.txt file might look like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
In this example, the user-agent is set to “*”, which means the rules apply to all web robots. The Disallow directives tell web robots not to crawl the /cgi-bin/, /tmp/, and /private/ directories.
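To see how a compliant crawler interprets these rules, here is a minimal sketch using Python's standard urllib.robotparser module. The example.com URLs and the MyCrawler user-agent are placeholders; the parser simply matches each URL path against the rules above.

from urllib.robotparser import RobotFileParser

# The sample rules from above, fed to the parser as a list of lines.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Paths under the disallowed directories are rejected; everything else is allowed.
print(parser.can_fetch("MyCrawler", "https://example.com/tmp/report.html"))  # False
print(parser.can_fetch("MyCrawler", "https://example.com/index.html"))       # True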
Purpose of robots.txt
The purpose of robots.txt is to give website owners control over how their website is crawled and indexed by search engines. By using robots.txt, website owners can tell search engine robots which pages of their website should be crawled and which should be left out. This can be useful for a number of reasons. For example:
- Security: By disallowing access to certain directories, website owners can discourage web robots from crawling information they do not want surfaced publicly. Note, however, that robots.txt is only a request honored by well-behaved robots; it is not an access control mechanism.
- Bandwidth conservation: Web robots can generate a lot of traffic to a website, especially when they crawl large numbers of pages. By excluding certain directories from crawling, website owners can reduce the bandwidth their site spends serving robots.
- SEO optimization: By excluding certain pages from crawling, website owners can influence what content appears in search engine results pages (SERPs). This can be useful for keeping duplicate content or low-quality pages out of SERPs.
Usage of robots.txt
The robots.txt file gives website owners a lot of control over how their website is crawled and indexed by search engines. Here are some of the main ways that website owners use robots.txt:
Disallowing access to specific directories
One of the most common uses of robots.txt is to disallow web robots from crawling specific directories on a website. This can be useful for keeping robots away from sensitive information or from directories that are not intended for public access.
For example, if a website has a directory called /admin/ that contains sensitive information, the website owner can use the following robots.txt rules to prevent search engine robots from crawling this directory:
User-agent: *
Disallow: /admin/
In this example, the Disallow directive tells search engine robots not to crawl the /admin/ directory.
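On the crawler side, a compliant robot is expected to download a site's robots.txt and check each URL against it before fetching. The sketch below shows the usual pattern with Python's urllib.robotparser; www.example.com and MyCrawler are placeholder names, not real endpoints.

from urllib.robotparser import RobotFileParser

# Download and parse the site's live robots.txt (hypothetical domain).
parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()

url = "https://www.example.com/admin/settings.html"
if parser.can_fetch("MyCrawler", url):
    print("allowed to crawl", url)
else:
    print("blocked by robots.txt:", url)

With the /admin/ rule above in place, the check would report the URL as blocked.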
Disallowing access to specific pages
In addition to disallowing access to specific directories, website owners can also use robots.txt to disallow access to specific pages on their website. This can be useful for preventing low-quality pages or duplicate content from being indexed by search engines.
For example, if a website has a page called /duplicate-page/ that contains duplicate content, the website owner can use the following robots.txt rules to prevent search engine robots from crawling this page:
User-agent: *
Disallow: /duplicate-page/
In this example, the Disallow directive tells search engine robots not to crawl the /duplicate-page/ page.
Allowing access only to specific directories
In some cases, website owners may want to allow search engine robots to crawl only certain directories on their website. This can be useful for websites that have a lot of low-quality content or duplicate content that they do not want to be indexed by search engines.
For example, if a website has a directory called /high-quality-content/ that contains high-quality content that the website owner wants to be indexed, the website owner can use the following robots.txt rules to allow search engine robots to crawl this directory:
User-agent: *
Disallow: /
Allow: /high-quality-content/
In this example, the Disallow: / directive blocks the entire website, and the Allow: /high-quality-content/ directive creates an exception so that this directory can still be crawled. Crawlers that support the Allow directive (all major search engines do) apply the most specific matching rule, so the longer Allow rule overrides the site-wide Disallow for URLs under /high-quality-content/.
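To make that precedence concrete: under the Robots Exclusion Protocol as standardized in RFC 9309 (and as implemented by Googlebot), the rule with the longest matching path wins, and Allow wins ties. The Python sketch below is not a full robots.txt parser, just an illustration of that longest-match logic for the two rules above.

# Illustrative longest-match precedence check (not a complete robots.txt parser).
RULES = [
    ("disallow", "/"),
    ("allow", "/high-quality-content/"),
]

def is_allowed(path, rules=RULES):
    best_len = -1
    allowed = True  # with no matching rule, crawling is permitted
    for kind, rule_path in rules:
        if path.startswith(rule_path) and len(rule_path) >= best_len:
            # A longer (more specific) rule overrides a shorter one;
            # on a tie in length, the "allow" rule takes precedence.
            if len(rule_path) > best_len or kind == "allow":
                best_len = len(rule_path)
                allowed = (kind == "allow")
    return allowed

print(is_allowed("/high-quality-content/article.html"))  # True
print(is_allowed("/some-other-page.html"))               # False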
Allowing access only to specific pages
In addition to allowing access only to specific directories, website owners can also use robots.txt to allow access only to specific pages on their website. This can be useful for preventing low-quality pages from being indexed by search engines.
For example, if a website has a page called /high-quality-page/ that contains high-quality content that the website owner wants to be indexed, the website owner can use the following robots.txt rules to allow search engine robots to crawl this page:
User-agent: *
Disallow: /
Allow: /high-quality-page/
As in the previous example, the Disallow: / directive blocks the entire website, and the Allow: /high-quality-page/ directive creates an exception so that this single page can still be crawled.