The Robots Exclusion Protocol (REP) has been one of the most fundamental components of the web for 25 years.

It allows website owners, by means of rules placed in a robots.txt file, to exclude automated clients such as web crawlers from accessing their sites, either partially or completely.
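As an illustration, a minimal robots.txt might look like the following; the bot name and paths are hypothetical and used purely as an example.

User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /

Here every crawler is asked to stay out of /private/, while the hypothetical ExampleBot is excluded from the entire site.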

The Robots Exclusion Protocol is almost as old as the web itself: it was 1994 when Martijn Koster (himself a webmaster) created the initial standard, after crawlers had almost overwhelmed his site. With input from other webmasters, Koster proposed the REP, which was adopted by search engines to help website owners manage their server resources more easily.

However, as Google explained on its webmaster blog, the REP was never turned into an official Internet standard, proposed and ratified by the Internet Engineering Task Force (IETF).

This means that developers have interpreted the protocol somewhat differently over the years. Moreover, since its introduction, the REP has not been updated to cover some of today's web use cases. These areas of ambiguity in the de facto standard are a problem for website owners, because they make it difficult to write rules correctly.

For this reason, Google announced an initiative carried out together with the protocol's original author, webmasters and other search engines: the company documented how the REP is used on the modern web and submitted that documentation to the IETF.

The draft REP proposal, Google explains, reflects over 20 years of experience with robots.txt rules, used by Googlebot and other major crawlers, as well as by roughly half a billion websites that rely on them.

Granular controls give publishers the power to decide what is crawled on their site and thus potentially shown to interested users. Google explains that the proposal does not change the rules created in 1994; rather, it essentially defines the previously undefined scenarios for robots.txt parsing and matching and extends the protocol to the needs of the modern web.
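One example of the kind of matching behavior the draft pins down is the case where both an allow and a disallow rule apply to the same URL; under the longest-match convention described in the draft, the most specific (longest) rule wins. The paths below are hypothetical:

User-agent: *
Disallow: /archive/
Allow: /archive/public/

A request for /archive/public/index.html matches both rules, but the Allow rule is longer and therefore takes precedence, so the URL may be crawled.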

The draft is at the RFC (Request for Comments) stage: Google uploaded it to the IETF to get feedback from developers, so that, with the collaboration of the community, this standard governing how crawlers access websites can be finalized.

In addition, Google has released as open source the C++ library that its production systems use for parsing and matching rules in robots.txt files. This library, the company explained, has existed for 20 years and contains pieces of code written in the 1990s; it has evolved steadily since then. The open source package also includes a testing tool that helps developers try out rules.
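For orientation, a usage sketch of the library might look like the one below. The header name, namespace and matcher method follow the repository's published examples as far as we can tell, and the robots.txt content and URL are made up for illustration; the repository's README remains the authoritative reference.

#include <iostream>
#include <string>

#include "robots.h"  // parser/matcher header shipped with the open source package

int main() {
  // Hypothetical robots.txt content and URL, used purely for illustration.
  const std::string robots_txt =
      "User-agent: *\n"
      "Disallow: /private/\n";
  const std::string url = "https://example.com/private/data.html";

  // RobotsMatcher decides whether a given user agent may fetch a URL
  // under the supplied robots.txt rules.
  googlebot::RobotsMatcher matcher;
  const bool allowed =
      matcher.OneAgentAllowedByRobots(robots_txt, "ExampleBot", url);

  std::cout << (allowed ? "allowed" : "disallowed") << std::endl;
  return 0;
}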

The GitHub repository of the robots.txt parser can be accessed via this link.
