Re: [cobalt-users] blocking a crawler



"Robert Brownback" <bobski@xxxxxxxxxxxxxxx> wrote:
> The only way to absolutely deny access to a crawler/spider is to block
> its ip address(es).

FYI, this can be accomplished within an .htaccess file or Apache's
httpd.conf file.  See the directives section at www.apache.org and read
through the directives and examples.  It's trivial to do.
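For example, here's a minimal .htaccess sketch using the classic
Order/Deny/Allow directives (the IP addresses below are placeholders --
substitute the crawler's actual addresses):

```apache
# Block a specific crawler by IP address.
# Requires AllowOverride Limit (or All) in httpd.conf for .htaccess use.
Order allow,deny
Allow from all
Deny from 192.0.2.10
# A partial address blocks the whole network:
Deny from 198.51.100.
```

Note the blocked client still gets a response (a 403 Forbidden), but it
can't retrieve your content.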

> Any and all browsers can retrieve a robots.txt file.
> For grins try:
> http://www.apple.com/robots.txt or
> http://www.linux.com/robots.txt or
> http://www.redhat.com/robots.txt or
> http://www.seagate.com/robots.txt or even
> http://www.cobalt.com/robots.txt
>
> See anything juicy in there?

I didn't look, but I have seen robots.txt files that state that viewing (or
attempting to view) any of the listed directories will result in a $x
per-occurrence charge billed to your ISP.  I get a kick out of that.

> Any subdirectory that *you* can access can
> also be accessed by a crawler that chooses not to respect robots.txt
> directives.

And don't underestimate the number of crawlers that don't respect it.
Anyone can create a spider to index sites.  I spider a number of external
sites for a niche site I run, and if I wanted to, it would be easy to
ignore robots.txt files.
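To illustrate the point: robots.txt is purely advisory.  A polite spider
checks it before fetching; a rude one simply never performs the check, and
nothing on the server side enforces it.  A quick sketch using Python's
standard robotparser module (the robots.txt content and bot name are
made up for the example):

```python
from urllib import robotparser

# Hypothetical robots.txt served by a site:
robots_txt = """User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite crawler asks permission before each fetch:
print(rp.can_fetch("MyBot", "/private/data.html"))  # False -> skips the URL
print(rp.can_fetch("MyBot", "/public/index.html"))  # True  -> fetches it

# A rude crawler just never calls can_fetch() and requests the URL anyway.
# Only an actual server-side block (e.g. by IP) can stop it.
```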

--
Steve Werby
President, Befriend Internet Services LLC
http://www.befriend.com/