Every day you have entries in your server logs, telling you about someone requesting a file called robots.txt. You might not even have a file called robots.txt, but people keep trying to get it. What is robots.txt, and what should be in it?
This week, we'll talk about spiders and robots: what they do, why they're a good thing, why they can be a bad thing, and how you can prepare for them.
A robot, also called a "bot," a spider, a web crawler, and a variety of other names, is a program that automatically downloads web pages for a variety of purposes. Because it downloads one page, and then recursively downloads every page linked to from that page, one can imagine it crawling around the web, harvesting content. These images are the source of some of the names these programs go by, as well as the names of particular spiders, like WebCrawler, Harvester, MomSpider, and so on.
What they do with these web pages once they have downloaded them varies from application to application. Most of these spiders are doing some variety of search, while some exist so that people can read online content when they are not actually connected to the internet. Others pre-cache content, so that end-users can more rapidly access that content. Still others are just trying to generate statistical information about the web.
Spiders are a good thing. Without them, things like Yahoo, Google, Altavista, and so on, would not be possible. By indexing the web with a spider, these sites permit us to search a collection of documents that it is not feasible to search by ourselves. I remember a few years ago when Yahoo claimed to index "more than 1 million" web pages. That number is now probably in the billions.
Spiders can get us personalized information. There are services that will deliver to your mail box every morning a personalized newspaper, containing only news items that you have expressed interest in, from the online newspapers that you have specified. This is done with robots that download those web sites and comb through them for the information that you have requested. Doing this with actual people, or with actual paper newspapers, would be extremely difficult and expensive.
Of course, occasionally, spiders can cause real problems with your web site. Poorly written, or poorly managed spiders can download pages from your site far faster than anyone could click on links, and can bring your web server to a grinding halt as it tries to keep up with a robot requesting thousands of pages per second.
Usually, this will only happen if someone has been careless in configuring a robot, but occasionally it's just an incompetent programmer who has tried to be clever and write their own robot, but doesn't know what they are doing.
Most of the time, however, spiders are great, and you want them crawling around your site, so that your site ends up on the search engines.
But there are parts of your web site where you don't want them going. There might be content that you don't really want to be in search engines. Or, perhaps there is a part of your web site that is dynamically generated, and so it would be a little silly to index it, because it will be different the next time.
Even worse, since some dynamically generated pages have links to other dynamically generated pages, the robot could become stuck in a never-ending maze of new pages, and request documents from your server forever.
In order to prevent these things from happening, the standard for robot exclusion was developed. It consists of a document called robots.txt, placed at the top level of your web site, which tells robots what they are not allowed to look at. Even more specifically, you can tell particular robots to stay out of particular parts of your site.
The file looks like this:
User-agent: SpiderName
Disallow: /cgi-bin/dynamicstuff/
The named spider is not permitted access to the specified resources.
You can also specify an asterisk (*) instead of the spider name, to indicate that no spider is permitted access to the listed locations:

User-agent: *
Disallow: /cgi-bin/
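Putting it all together, a complete robots.txt might look something like this (the spider name and the paths are just examples). Records are separated by blank lines, and each record can contain several Disallow lines:

User-agent: SpiderName
Disallow: /cgi-bin/dynamicstuff/

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/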
What if that does not work? (Malicious or stupid spiders)
The problem with robots.txt is that it's optional. It works on the honor system. That is, when someone writes a robot program, they are supposed to tell it to get this file before it ever looks at your site, and then not fetch any pages that are disallowed. However, if the person writing a robot chooses not to implement this behavior, or just doesn't know that they are supposed to, it's perfectly possible for the robot to get documents that you have tagged as disallowed. After all, Apache (or whatever other web server you may be using) does not know the difference between a spider and a regular web user.
If you discover that a robot is going into parts of your web site which you have disallowed in robots.txt, you have a variety of options.
First, you should attempt to contact the person operating the robot. Look in the access log, and find what address the robot is coming from. Find out who is responsible for that machine. Send them a polite note requesting that they leave your site alone, and encourage them to adhere to the standard for robot exclusion.
If they do not pay any attention to your request, use some of the techniques covered in last week's article (the Deny directive) to make sure that they can't get in. You can deny them access based on their user agent, or based on what address they are coming from, as in the example below.
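For example, here's a sketch of what that might look like in httpd.conf, assuming Apache 1.3 with mod_access and mod_setenvif, and assuming the offending robot calls itself "BadBot" and comes from 10.0.0.5 (both made up for illustration):

<Directory "/usr/local/apache/htdocs">
    # Tag any request whose User-Agent contains "BadBot" (hypothetical name)
    SetEnvIf User-Agent "BadBot" bad_robot
    Order Allow,Deny
    Allow from all
    # Deny by address, and deny anything tagged above
    Deny from 10.0.0.5
    Deny from env=bad_robot
</Directory>

Adjust the directory path, address, and pattern to match your own server and the robot you're trying to keep out.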
If you want to write your own robot, you should be aware that there are already hundreds of robots available to do everything you might want to do. Consider using one of those. You can find a list, and other information about robots, at http://info.webcrawler.com/mak/projects/robots/robots.html
If you absolutely have to write your own, there are various Perl modules in the LWP package which implement much of the base functionality that you need, and will greatly reduce the effort required to write one from scratch.
And please, please, implement the standard for robot exclusion. It's very very easy to write a robot, but it's a little more involved to write a good one that does not have any ill effects on the sites it visits.
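If you do end up writing one, here's a minimal sketch in Perl using LWP::RobotUA (part of the LWP distribution), which fetches robots.txt for you and refuses to request anything it disallows. The robot name, contact address, and URL below are just placeholders:

#!/usr/bin/perl
use strict;
use LWP::RobotUA;
use HTTP::Request;

# Identify your robot, and give a contact address so admins can reach you.
my $ua = LWP::RobotUA->new('ExampleSpider/0.1', 'webmaster@example.com');
$ua->delay(1);    # wait at least one minute between requests to a site

# robots.txt is consulted automatically before this request is made.
my $request  = HTTP::Request->new(GET => 'http://www.example.com/');
my $response = $ua->request($request);
if ($response->is_success) {
    print $response->content;
} else {
    print "Not fetched: ", $response->status_line, "\n";
}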
Related Stories:
Apache Guide: mod_access: Restricting Access by Host (Nov 13, 2000)
Apache Guide: Configuring Your Apache Compile with ./Configure (Nov 06, 2000)
Apache Guide: Logging, Part 4 -- Log-File Analysis (Sep 18, 2000)
Apache Guide: Logging, Part 3 -- Custom Logs (Sep 05, 2000)
Apache Guide: Logging, Part II -- Error Logs (Aug 28, 2000)
Apache Guide: Logging with Apache--Understanding Your access_log (Aug 21, 2000)
Apache Guide: Apache Authentication, Part 4 (Aug 14, 2000)
Apache Guide: Apache Authentication, Part 3 (Aug 07, 2000)
Apache Guide: Apache Authentication, Part II (Jul 31, 2000)
Apache Guide: Apache Authentication, Part 1 (Jul 24, 2000)
Apache Guide: Setting Up Virtual Hosts (Jul 17, 2000)
Apache Guide: Configuring Your Apache Server Installation (Jul 10, 2000)
Apache Guide: The Newbie's Guide to Installing Apache (Jul 03, 2000)
Apache Guide: Advanced SSI Techniques (Jun 26, 2000)
Apache Guide: Introduction to Server Side Includes, Part 2 (Jun 19, 2000)
Apache Guide: Introduction to Server Side Includes (Jun 12, 2000)
Apache Guide: Dynamic Content with CGI (Jun 05, 2000)