Re: Distributed Web Crawler - 80legs

Posted: Sun Apr 18, 2010 5:18 am
by Rhet-or-Ric
I wonder if, when time allows, one of the Team members could give us the straight poop on just what code should be used in .htaccess if one of us wished to use .htaccess to block a specific User Agent?

Or should I be starting another topic for this?

Thank you.

Re: Distributed Web Crawler - 80legs

Posted: Sun Apr 18, 2010 7:02 am
by Pony99CA
shiondev wrote:Blocking us by IP address is not a workable solution. We effectively have an infinite number of IP addresses, since they constantly rotate, while old ones go away and new ones come online.
Infinite, eh? There are at most 2**32 IPv4 addresses (somewhat over 4 billion). And if that were anywhere close to infinite, people wouldn't be talking about running out of IP addresses and the need for IPv6.

On top of that, I doubt you'll get even a large fraction of those IP addresses in your network. I'm not saying that blocking IP addresses is the best solution (and I presented two others), but I don't think it's as hopeless as claimed.

However, the other solutions I suggested (browser agent blocking and phpBB permissions) are probably better and easier.

Steve

Re: Distributed Web Crawler - 80legs

Posted: Sun Apr 18, 2010 7:18 am
by Phil
Rhet-or-Ric wrote:

I wonder if, when time allows, one of the Team members could give us the straight poop on just what code should be used in .htaccess if one of us wished to use .htaccess to block a specific User Agent?

Or should I be starting another topic for this?

Thank you.

Given that it's not really phpBB-related and applies to Apache as a whole, you'd probably be better off searching or asking on a more general forum related to Apache.

Re: Distributed Web Crawler - 80legs

Posted: Sun Apr 18, 2010 12:14 pm
by 3Di
Rhet-or-Ric wrote:

I wonder if, when time allows, one of the Team members could give us the straight poop on just what code should be used in .htaccess if one of us wished to use .htaccess to block a specific User Agent?

Or should I be starting another topic for this?

Thank you.

http://blamcast.net/articles/block-bots ... p-htaccess
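For anyone who can't reach that article, here is a minimal sketch of the kind of .htaccess rules it describes, assuming Apache with mod_rewrite enabled. The pattern "80legs" matches the identifier this thread is about; substitute any other user-agent substring you want to block.

```apache
# Minimal sketch: deny requests whose User-Agent contains "80legs".
# Assumes mod_rewrite is enabled; [NC] makes the match case-insensitive,
# [F] returns 403 Forbidden, [L] stops further rule processing.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} 80legs [NC]
RewriteRule .* - [F,L]
```

Note that user-agent strings can be spoofed, so this only stops bots that identify themselves honestly.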

Re: Distributed Web Crawler - 80legs

Posted: Thu Nov 18, 2010 11:30 pm
by haggisv
I had well over 100 of these crawlers swamp my forum... I suspect they were also the ones that brought my site down recently.

I added them to the bot list (using the URL as the string to identify them), then banned them, which seemed to work. Is that the best way to do it?

I've also contacted 80legs and told them about the issue, as their site suggests doing if you have problems with their crawlers. Will be interested to see if I get a response...

Re: Distributed Web Crawler - 80legs

Posted: Sun Nov 21, 2010 10:44 pm
by haggisv
I received a response from 80legs, and they told me they would take my URL off the list to crawl! Of course I had banned them already, :twisted: but it shows that they are reputable IMO.

Re: Distributed Web Crawler - 80legs

Posted: Wed Mar 02, 2011 12:49 am
by rampp
Reputable? The way they conduct their crawling is more like a botnet / DDoS.

The snippet below works fine for nginx. Seeing that in the last 45 minutes I've received 6,500 different 80legs hits on a single site, disrupting normal traffic, it became apparent this is not a legit operation.

Code: Select all

if ($http_user_agent ~ "80legs") {
    rewrite ^.+ http://www.80legs.com;
}
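A variation on the same idea, assuming the same nginx setup, is to refuse the request outright instead of redirecting it back to their site; `~*` makes the match case-insensitive:

```nginx
# Return 403 Forbidden to any client whose User-Agent contains "80legs".
if ($http_user_agent ~* "80legs") {
    return 403;
}
```

Returning 403 is cheaper for your server than issuing a redirect, though bouncing the bot to www.80legs.com admittedly makes the point more vividly.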

Re: Distributed Web Crawler - 80legs

Posted: Sun Mar 06, 2011 7:30 am
by haggisv
Yes, I agree. I mean reputable in that they at least identify their bot by its proper name, and respond to requests if you ask them to stop crawling your site. I'm sure there are many others out there that won't do either. :o