Re: [cobalt-users] blocking a crawler
- Subject: Re: [cobalt-users] blocking a crawler
- From: Robert Brownback <bobski@xxxxxxxxxxxxxxx>
- Date: Sun May 6 18:12:00 2001
- List-id: Mailing list for users to share thoughts on Cobalt products. <cobalt-users.list.cobalt.com>
On 07 May 2001 06:52:43 +0000, johnny t wrote:
> Shouldn't you be able to create a robot.txt to block access?
>
> Just a thought.
>
> EL
In an ideal world you would think so, but it is entirely up to the
well-behaved crawler (and, by extension, its author) to respect the
robots.txt directives. Unlike the "rules" governing posting to this
list, the robots.txt guidelines are strictly voluntary.
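For reference, robots.txt is nothing more than a polite request. A
minimal example (the path here is made up for illustration):

  User-agent: *
  Disallow: /private/

A compliant crawler reads that and stays out of /private/; a rogue one
ignores it and, worse, now knows exactly where to look.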
The only way to absolutely deny access to a crawler/spider is to block
its IP address(es).
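On an Apache box that usually means something like the following in
httpd.conf or an .htaccess file (assuming AllowOverride permits Limit
directives; the address below is just a placeholder for the offending
crawler's IP):

  Order allow,deny
  Allow from all
  Deny from 192.0.2.15

Apache enforces that for every request; no cooperation from the
crawler is required.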
Something the clued-in don't need reminding of, but worth spelling out
for anyone who might mistake robots.txt for having the same protective
power as .htaccess or an httpd.conf directive:
Any and all browsers can retrieve a robots.txt file.
For grins try:
http://www.apple.com/robots.txt or
http://www.linux.com/robots.txt or
http://www.redhat.com/robots.txt or
http://www.seagate.com/robots.txt or even
http://www.cobalt.com/robots.txt
See anything juicy in there? Any subdirectory that *you* can access can
also be accessed by a crawler that chooses not to respect robots.txt
directives.
Don't make the mistake of thinking that a robots.txt file will keep a
misbehaving crawler, or anyone else for that matter, out of an area
you don't want seen.
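If a directory genuinely needs to stay private, put it behind real
access control instead. A rough sketch using Apache basic auth (the
password file path is an example only):

  AuthType Basic
  AuthName "Private Area"
  AuthUserFile /home/sites/site1/.htpasswd
  Require valid-user

Create the password file with 'htpasswd -c /home/sites/site1/.htpasswd
username'. Unlike robots.txt, this is enforced by the server for every
client, crawler or otherwise.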