What Are Web Robots?

Web robots go by many names: web crawlers, web spiders, even web wanderers. Whatever you call them, these crawlers are simply programs that traverse the web automatically, always searching and always gathering information. Search engines like Google, Yahoo, and Ask use them to index web content, while spammers use them to harvest email addresses or place backlinks automatically. Web robots have many uses, and they're very powerful.
How Does Robots.txt Actually Work?

Search engines have two jobs: crawling the web to find content, and indexing that content so that people searching for it can find it easily. To crawl sites, web robots follow links from one page to another, hopping across billions of links and pages to index content for users. When a web robot first arrives at a site, it locates and reads the robots.txt file before anything else. The robots.txt file tells the robot how it should crawl the site's content, and that guidance applies to future visits as well. If the robots.txt file doesn't contain any directives for that robot, the robot will go on crawling the rest of the site. The same is true of a site that doesn't have a robots.txt file at all.
Excluding Web Robots
TL;DR Warning! If you don't want all the technical details, skip on down to the next image 🙂

For sites that want to exclude web robots, there is a standard way to do it. To exclude robots from a server, you create a file on the server that specifies an access policy for web robots. For this to work, the file must be accessible via HTTP at the local URL "/robots.txt". This approach was adopted because it can be implemented easily on any existing WWW server, and a robot can find the access policy with a single document retrieval instead of digging through tons of files. The one drawback of the single-file approach is that only a server administrator can maintain the list, not the individual document maintainers on the server.
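That "single document retrieval" idea is easy to see in code: given any page URL, the policy file's location is fully determined by the site's root. Here is a minimal sketch in Python's standard library (the `robots_url` helper is our own illustration, not part of any standard API):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the one URL where a robot must look for the access policy."""
    parts = urlsplit(page_url)
    # The policy always lives at the root path "/robots.txt" of the
    # same scheme and host, no matter which page is being crawled.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://www.example.com/blog/post?id=7"))
# → https://www.example.com/robots.txt
```

Whatever page a crawler is about to fetch, one extra request to this root URL tells it the whole site's policy.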
The Format of Robots.txt

If you're stuck wondering what the format actually is, the format and semantics of robots.txt work as follows. The file consists of one or more records separated by one or more blank lines (lines are terminated by CR, CR/LF, or LF). Each record contains lines of the form "<field>:<optionalspace><value><optionalspace>". The field name is not case sensitive.

Comments can be included in the file using UNIX Bourne shell conventions: the '#' character indicates that the preceding space (if there is one) and the remainder of the line up to the line termination are discarded. Lines containing only a comment are discarded completely, and therefore do not indicate a record boundary.

Each record starts with one or more User-agent lines, followed by one or more Disallow lines, as we've detailed for you below. Headers that aren't recognized are ignored.

User-agent: The value of this field is the name of the robot the record describes the access policy for. If more than one User-agent field is present, the record describes an identical access policy for more than one robot. At least one User-agent field has to be present in each record. The robot should be liberal in interpreting this field when matching its own name. If the value is '*', the record describes the default access policy for any robot that hasn't matched any of the other records. You aren't allowed to have multiple '*' records in the "/robots.txt" file.

Disallow: The value of this field specifies a partial URL that is not to be visited. This can be a full path or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.
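Putting those pieces together, here is what a small "/robots.txt" file with two records might look like (the robot name and the paths are hypothetical, purely for illustration):

```
# Keep one badly behaved robot out of the entire site.
User-agent: HypotheticalBot
Disallow: /

# Default policy for every other robot: stay out of two directories.
User-agent: *
Disallow: /tmp/
Disallow: /private/
```

The blank line between the two records is significant: it is what separates one record from the next, while the comment lines are simply discarded.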
An empty value indicates that all URLs can be retrieved. At least one Disallow field needs to be present in every record. An empty "/robots.txt" file has no explicit associated semantics; it will be treated as if it were not present, i.e. all robots will consider themselves welcome and thus crawl your site for content and index it accordingly.
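You don't have to write a parser to check these Disallow semantics yourself: Python's standard-library `urllib.robotparser` module implements them. A small sketch, using made-up rules for illustration:

```python
from urllib import robotparser

# Hypothetical policy illustrating the prefix-matching rule.
rules = [
    "User-agent: *",
    "Disallow: /help/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# "Disallow: /help/" blocks anything whose path starts with /help/ ...
print(rp.can_fetch("*", "/help/index.html"))  # False
# ... but not /help.html, which doesn't start with "/help/".
print(rp.can_fetch("*", "/help.html"))        # True
```

In a real crawler you would call `rp.set_url(...)` and `rp.read()` to fetch the live file over HTTP instead of passing the lines in directly.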
So Why Do You Need A Robots.txt On Your Site?

It's simple: robots.txt files control where web robots go on your site, and thus what they index. This can, of course, be dangerous if you aren't an SEO professional, aren't sure what you're doing, and accidentally keep Googlebot from crawling your entire site. Oops…now no one can find you in the search results! However, if you are careful and have done your research, robots.txt can do wonderful things for your website. It can keep duplicate content from showing up in SERPs; keep entire sections of a website private when they aren't ready to go live or are meant for only a certain audience; specify the location of your sitemap; keep search engines from indexing certain files on your website, like images or PDFs; and even specify a crawl delay to keep your servers from being overloaded when crawlers request many pages at once. Nice, right? When it comes down to it, robots.txt can be a blessing or a curse depending on how familiar you are with the way things work. As you can see, web robots are very helpful when it comes to getting relevant results in search engines. And it all starts with a few of those web spiders crawling through your website and jumping from link to link!
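For instance, the sitemap, file-blocking, and crawl-delay uses above might look like this in practice. All paths and the sitemap URL are placeholders, and note that Sitemap, Crawl-delay, and wildcard patterns are widely supported extensions rather than part of the original standard (Crawl-delay in particular is ignored by Googlebot):

```
User-agent: *
# Hypothetical not-yet-live section of the site.
Disallow: /drafts/
# Wildcard extension (supported by major crawlers): block PDF files.
Disallow: /*.pdf$
# Extension: ask supporting crawlers to wait 10 seconds between requests.
Crawl-delay: 10

Sitemap: https://www.example.com/sitemap.xml
```

Because support for these extensions varies by crawler, it's worth checking each search engine's own documentation before relying on them.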
Read More Useful SEO Tips at the Webology Blog.