Apache Guide: Spiders and Robots
Nov 21, 2000, 13:23 UTC

By Rich Bowen

Every day you have entries in your server logs, telling you about someone requesting a file called robots.txt. You might not even have a file called robots.txt, but people keep trying to get it.

What is robots.txt, and what should be in it?

This week, we'll talk about spiders and robots: what they do, why they are a good thing, why they can be a bad thing, and how you can prepare for them.

What are Robots and Spiders?

A robot, also called a "bot," a spider, a web crawler, and a variety of other names, is a program that automatically downloads web pages for a variety of purposes. Because it downloads one page, and then recursively downloads every page that is linked to from that page, one can imagine it crawling around the web, harvesting content. From these images come some of the names given to these programs, as well as the names of particular spiders, like WebCrawler, Harvester, MomSpider, and so on.
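
To make that crawl loop concrete, here is a minimal sketch in Perl using the LWP and HTML-Parser modules. The starting URL is a placeholder, and a real robot would also need the politeness controls discussed later in this article:

     use strict;
     use LWP::Simple qw(get);
     use HTML::LinkExtor;

     # Placeholder starting point; a real robot would also limit its scope,
     # honor robots.txt, and pause between requests.
     my @queue = ('http://www.example.com/');
     my %seen;

     while (my $url = shift @queue) {
         next if $seen{$url}++;             # never fetch the same page twice
         my $page = get($url);
         next unless defined $page;

         # Collect every link on the page and queue it to be fetched in turn.
         my $parser = HTML::LinkExtor->new(undef, $url);  # resolves relative links
         $parser->parse($page);
         for my $link ($parser->links) {
             my ($tag, %attr) = @$link;
             push @queue, "$attr{href}" if $tag eq 'a' && defined $attr{href};
         }
     }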

What they do with these web pages once they have downloaded them varies from application to application. Most of these spiders are building a search index of some sort, while some exist so that people can read online content when they are not actually connected to the internet. Others pre-cache content, so that end users can access it more rapidly. Still others are just trying to generate statistical information about the web.

Spiders are Good

Spiders are a good thing. Without them, things like Yahoo, Google, Altavista, and so on, would not be possible. By indexing the web with a spider, these sites permit us to search a collection of documents that it is not feasible to search by ourselves. I remember a few years ago when Yahoo claimed to index "more than 1 million" web pages. That number is now probably in the billions.

Spiders can get us personalized information. There are services that will deliver to your mailbox every morning a personalized newspaper, containing only news items that you have expressed interest in, drawn from the online newspapers that you have specified. This is done with robots that download those web sites and comb through them for the information that you have requested. This would be extremely difficult and expensive if it were done by actual people, or with actual paper newspapers.

Spiders are Bad

Of course, occasionally, spiders can cause real problems with your web site. Poorly written or poorly managed spiders can download pages from your site far faster than anyone could click on links, and can bring your web server to a grinding halt as it tries to keep up with a robot requesting thousands of pages per second.

Usually, this will only happen if someone has been careless in configuring a robot, but occasionally it's just an incompetent programmer who has tried to be clever and write their own robot without knowing what they are doing.

How do I Protect My Server?

Most of the time, however, spiders are great, and you want them crawling around your site, so that your site ends up on the search engines.

But there are parts of your web site where you don't want them going. There might be content that you don't really want to be in search engines. Or, perhaps there is a part of your web site that is dynamically generated, and so it would be a little silly to index it, because it will be different the next time.

Even worse, since some dynamically generated pages have links to other dynamically generated pages, the robot could become stuck in a never-ending maze of new pages, and request documents from your server forever.

In order to prevent these things from happening, the standard for robot exclusion was developed. This consists of a file called robots.txt, placed at the top level of your site, which tells robots what they are not allowed to look at. More specifically, you can tell particular robots to stay out of particular parts of your site.

The file looks like this:

     User-agent: SpiderName
     Disallow: /cgi-bin/dynamicstuff/

The named spider is not permitted access to the specified resources.

You can also specify an asterisk (*) instead of the spider name, to indicate that the rule applies to every spider; in this example, no spider is permitted access to /cgi-bin/:

     User-agent: *
     Disallow: /cgi-bin/
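
A single robots.txt file can hold several such records, separated by blank lines, with comments introduced by a # character. As a sketch (the spider name and paths are only placeholders):

     # Keep one badly behaved spider out of the site entirely
     User-agent: SpiderName
     Disallow: /

     # Everyone else: stay out of the dynamic and private areas
     User-agent: *
     Disallow: /cgi-bin/
     Disallow: /private/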

What if That Does Not Work? (Malicious or Stupid Spiders)

The problem with robots.txt is that it's optional. It works on the honor system. That is, when someone writes a robot program, they are supposed to make it fetch this file before it ever looks at the rest of your site, and then not fetch any pages that are disallowed. However, if the person writing a robot chooses not to implement this behavior, or simply doesn't know that they are supposed to, the robot can perfectly well get documents that you have tagged as disallowed. After all, Apache (or whatever other web server you may be using) does not know the difference between a spider and a regular web user.
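
To see what the honor system looks like from the robot's side, here is a rough sketch of the check a well-behaved robot makes, using the WWW::RobotRules module from the Perl LWP distribution (the URLs are placeholders):

     use strict;
     use LWP::Simple qw(get);
     use WWW::RobotRules;

     my $rules = WWW::RobotRules->new('MyBot/1.0');

     # A polite robot fetches robots.txt before anything else on the site...
     my $robots_url = 'http://www.example.com/robots.txt';
     my $robots_txt = get($robots_url);
     $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

     # ...and checks every URL against those rules before requesting it.
     my $url = 'http://www.example.com/cgi-bin/dynamicstuff/report';
     if ($rules->allowed($url)) {
         # safe to fetch
     } else {
         # robots.txt says stay out, so skip it
     }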

If you discover that a robot is going into parts of your web site which you have disallowed in robots.txt, you have a variety of options.

First, you should attempt to contact the person who is operating the robot. Look in the access log, and find what address the robot is coming from. Find out who is responsible for that machine. Send them a polite note requesting that they leave your site alone, and encourage them to adhere to the standard for robot exclusion.
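
What you are looking for is the requesting address and, if you log the User-Agent header, the robot's name. A combined-format log entry from a robot might look something like this (the address and robot name are invented for illustration):

     192.0.2.17 - - [21/Nov/2000:13:23:05 +0000] "GET /private/report.html HTTP/1.0" 200 5120 "-" "ExampleBot/1.2 (+http://bot.example.com/info.html)"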

If they do not pay any attention to your request, use some of the techniques covered in last week's article on mod_access to make sure that they can't get in. You can deny them access based on their user agent, or based on the address they are coming from.
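
As a sketch of what that might look like in an Apache 1.3 configuration, using mod_setenvif and mod_access (the robot name, address, and directory here are invented for the example):

     # Tag requests from the misbehaving robot by its User-Agent header
     SetEnvIfNoCase User-Agent "ExampleBot" bad_robot

     <Directory "/usr/local/apache/htdocs">
         Order Allow,Deny
         Allow from all
         Deny from env=bad_robot
         Deny from 192.0.2.17
     </Directory>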

Writing a Robot

If you want to write your own robot, you should be aware that there are already hundreds of robots available to do everything you might want to do. Consider using one of those. You can find a list, and other information about robots, at http://info.webcrawler.com/mak/projects/robots/robots.html

If you absolutely have to write your own, there are various Perl modules in the LWP package which implement much of the base functionality that you need, and will greatly reduce the effort compared to writing one from scratch.
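
A good starting point is LWP::RobotUA, the LWP user agent that fetches and honors robots.txt for you and spaces out its requests. A minimal sketch (the bot name, contact address, and URL are made up) might look like this:

     use strict;
     use LWP::RobotUA;
     use HTTP::Request;

     # LWP::RobotUA fetches robots.txt itself and refuses to request
     # anything that file disallows.
     my $ua = LWP::RobotUA->new('MyBot/0.1', 'bot-admin@example.com');
     $ua->delay(1);    # wait at least one minute between requests to a host

     my $request  = HTTP::Request->new(GET => 'http://www.example.com/');
     my $response = $ua->request($request);

     print $response->content if $response->is_success;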

And please, please, implement the standard for robot exclusion. It's very easy to write a robot, but a little more involved to write a good one that does not have any ill effects on the sites it visits.

Related Stories:
Apache Guide: mod_access: Restricting Access by Host (Nov 13, 2000)
Apache Guide: Configuring Your Apache Compile with ./Configure (Nov 06, 2000)
Apache Guide: Logging, Part 4 -- Log-File Analysis (Sep 18, 2000)
Apache Guide: Logging, Part 3 -- Custom Logs (Sep 05, 2000)
Apache Guide: Logging, Part II -- Error Logs (Aug 28, 2000)
Apache Guide: Logging with Apache--Understanding Your access_log (Aug 21, 2000)
Apache Guide: Apache Authentication, Part 4 (Aug 14, 2000)
Apache Guide: Apache Authentication, Part 3 (Aug 07, 2000)
Apache Guide: Apache Authentication, Part II (Jul 31, 2000)
Apache Guide: Apache Authentication, Part 1 (Jul 24, 2000)
Apache Guide: Setting Up Virtual Hosts (Jul 17, 2000)
Apache Guide: Configuring Your Apache Server Installation (Jul 10, 2000)
Apache Guide: The Newbie's Guide to Installing Apache (Jul 03, 2000)
Apache Guide: Advanced SSI Techniques (Jun 26, 2000)
Apache Guide: Introduction to Server Side Includes, Part 2 (Jun 19, 2000)
Apache Guide: Introduction to Server Side Includes (Jun 12, 2000)
Apache Guide: Dynamic Content with CGI (Jun 05, 2000)


Talkback: bandwidth throttling (Nov 21, 2000, 15:08:03)

It used to drive me mad that people were downloading my entire web site, which at Australian pay-per-MByte rates was costing me a lot of money. So I wrote a bandwidth throttler to limit the download speed of individual client IP hosts to 30 files/second and about 10% of my link bandwidth. All well-behaved spiders fit these rules, especially since I allow some extra margin for variability.

The Apache module I wrote is bwshare: http://www.topology.org/src/bwshare/README.html. It gives a warning message when the user overdoes it.