Distributed Web Crawler - 80legs

Rhet-or-Ric
Registered User
Posts: 306
Joined: Sun Apr 06, 2008 1:38 pm

Distributed Web Crawler - 80legs

Post by Rhet-or-Ric »

.
We had a guest the other day on our site. In fact, the first time we had 30 something guests. Then, about 4 hours later they returned to the tune of about 50. These guests were sent to us by 80legs.

I have not seen any topic here on phpBB specifically on the 80legs crawler. I found them listed within some script in a few posts, for example here: http://www.phpbb.com/community/viewtopi ... #p11548905

I checked and did find a topic specifically on Baidu, so I hope that starting a discussion on this 80legs company is not against any rules. This is a subject that I feel needs addressing.

The 80legs.com/spider is a distributed web crawler and some refer to them as a gun for hire. They seem to have been formed last year. There's a post on their blog about attending the 2009 Defrag Conference.

After reading their blog I came away with a feeling of unease. Here is the page on their blog that outlines their subscription plans. http://80legs.wordpress.com/2009/12/21/ ... -crawling/

On their landing page they boast of having 50,000 computers at their beck and call. Our experience with them on our site was rather benign when compared to some that have come calling.

Even though these distributed web crawling companies in the past haven't done so well and many have gone out of business, I still feel a strange sense of unease about this company.

I would assume I've been a member here long enough that the Mod Squad here knows I am not posting this to advertise this company.

I sincerely feel uneasy about this, but I just wonder what some of my fellow webmasters think about this web-crawling business as it's being conducted by these people.

Thoughts anyone?
.
Desdenova
Registered User
Posts: 646
Joined: Sat Feb 23, 2008 7:25 pm

Re: Distributed Web Crawler - 80legs

Post by Desdenova »

Rhet-or-Ric wrote:We had a guest the other day on our site. In fact, the first time we had 30 something guests. Then, about 4 hours later they returned to the tune of about 50. These guests were sent to us by 80legs.
Is it safe to assume that they were instances of that crawler in action?

If they were, it sounds like yet another bot that doesn't behave; I'll be considering banning this bot from my server if it doesn't behave.
Rhet-or-Ric
Registered User
Posts: 306
Joined: Sun Apr 06, 2008 1:38 pm

Re: Distributed Web Crawler - 80legs

Post by Rhet-or-Ric »

.
Exact same User Agent for every guest sent by 80legs:
Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/spider.html;) Gecko/2008032620 Viewing topics in ...
All used unique IPs, even during the second visit about four hours after the first.

We just kept an eye on them and they didn't get into numbers that we felt were a problem. They scooted off on their own. But we are setting up plans in case they come in numbers that exceed 100 at one time. Anything less than that isn't too much of a concern.
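For anyone who wants to keep the same kind of watch, here is a minimal sketch of counting these guests by user agent. The `Hit` record, the function names and the sample data source are hypothetical; only the user agent substring and the 100-guest figure come from this post. A real board would feed it from its own session table or access log.

```python
from collections import namedtuple

# Hypothetical log record: one entry per guest session or request.
Hit = namedtuple("Hit", ["ip", "user_agent"])

# Substring taken from the user agent quoted above, lowercased for matching.
SPIDER_MARKER = "80legs.com/spider"

# The "more than 100 at one time" figure mentioned in this post.
ALERT_THRESHOLD = 100


def is_80legs(user_agent):
    """Return True if the user agent string carries the 80legs spider URL."""
    return SPIDER_MARKER in user_agent.lower()


def count_80legs_guests(hits):
    """Count distinct IPs among the hits whose user agent matches 80legs."""
    return len({hit.ip for hit in hits if is_80legs(hit.user_agent)})


def should_alert(hits, threshold=ALERT_THRESHOLD):
    """Return True once the number of distinct 80legs IPs exceeds the threshold."""
    return count_80legs_guests(hits) > threshold
```

Counting distinct IPs rather than raw requests matches what we saw: every guest arrived from its own address, so the IP count is the guest count.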

.
Desdenova
Registered User
Posts: 646
Joined: Sat Feb 23, 2008 7:25 pm

Re: Distributed Web Crawler - 80legs

Post by Desdenova »

Hmm, I'm probably going to block it then. Already had a bad experience with Yandex requesting a few hundred pages in the mere span of two and a half seconds (and essentially DDoS'ing the server), and afterwards not respecting robots.txt, so better safe than sorry.
SamG
Former Team Member
Posts: 3221
Joined: Fri Aug 31, 2001 6:35 pm
Location: Beautiful Northwest Lower Michigan
Name: Sam Graf

Re: Distributed Web Crawler - 80legs

Post by SamG »

Their FAQ attempts to be reassuring: "We're very interesting in proper crawling behavior ..." and "By allowing 80legs to crawl your site, you encourage developers that would otherwise use unregulated crawling tools to use a highly controlled and manable [sic] crawling service. In other words, if your site is crawled by 80legs, you can be sure that it's crawled at a rate that your servers can handle."

The FAQ also details what to do if the bot becomes objectionable. So it seems like they aren't intentionally trying to swamp a server.
We should talk less, and say more.
Lumpy Burgertushie
Registered User
Posts: 68554
Joined: Mon May 02, 2005 3:11 am

Re: Distributed Web Crawler - 80legs

Post by Lumpy Burgertushie »

maybe I am missing something here. what is the purpose of this service? why would anyone pay their crazy prices to have their service crawl anyone's site?


robert
I'm baaaaaccckkkk. still doing work on donation basis. PM your needs.

Premium phpBB 3.3 Styles by PlanetStyles.net

I am pleased to announce that I have completed the first item on my bucket list. I have the bucket.
Rhet-or-Ric
Registered User
Posts: 306
Joined: Sun Apr 06, 2008 1:38 pm

Re: Distributed Web Crawler - 80legs

Post by Rhet-or-Ric »

.
Lumpy Burgertushie wrote:maybe I am missing something here. what is the purpose of this service? why would anyone pay their crazy prices to have their service crawl anyone's site?


robert
That, Robert, is the 64-something question.

But what has me even more concerned is what sort of screening they put a client through before they do what the client asks.

And I think your question and my question fit together. If somebody is up to no good, they would obviously go to a company like 80legs, and with some imaginative paperwork (company history/purpose) to fool any auditors, that "somebody" could enlist the help of 80legs to do their dirty work.

What might that dirty work be? Sort of makes one wonder about answers to that question.

Like I have already written, there's something about this that makes me uneasy.

Unfortunately, this came across my desk at a bad time. I haven't been able to pull up old files to reacquaint myself with stuff I learned and unlearned a while back, and I have not had the time to dig deeper into this 80legs creature. But I am interested in any feedback that may come of my posting here, as well as any experiences anyone may have had with this 80legs critter. I have already read some negative comments that my Team found elsewhere on the Net, saying there were some problems with the spider not following rules, but that was last year and the 80legs company may have rewritten some of their code since then.

Maybe one of the 80legs fellas or gals will pop by and enlighten us as to what their auditing procedures are and if their little spidey spides now follow proper industry standard protocol.

.
Desdenova
Registered User
Posts: 646
Joined: Sat Feb 23, 2008 7:25 pm

Re: Distributed Web Crawler - 80legs

Post by Desdenova »

Apparently their bot is supposed to obey robots.txt, but that's the only method they're telling everyone of denying it access, so says they in a few tweets.

One thing I find borked by design is the crawl-delay instruction in robots.txt; who is willing to bet that the bot network doesn't obey it as a whole, but rather individually?
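To put rough numbers on that worry, here is a back-of-the-envelope sketch. It assumes every node honors the delay on its own and nothing coordinates them; the function name and the node counts are made up for illustration, not taken from 80legs.

```python
def aggregate_rate(nodes, crawl_delay_seconds):
    """Requests per second a site sees if each node waits
    crawl_delay_seconds between its own requests and the
    nodes do not coordinate with each other."""
    if crawl_delay_seconds <= 0:
        raise ValueError("crawl_delay_seconds must be positive")
    return nodes / crawl_delay_seconds


# One well-behaved bot with a 10 second delay: 0.1 requests per second.
single = aggregate_rate(1, 10)

# Fifty nodes each obeying the same 10 second delay: 5 requests per second.
swarm = aggregate_rate(50, 10)
```

So a delay that throttles a single crawler nicely does next to nothing if it is applied per node, which is the whole concern.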
shiondev
Registered User
Posts: 4
Joined: Tue Apr 06, 2010 4:02 pm

Re: Distributed Web Crawler - 80legs

Post by shiondev »

Hey all,

I work at 80legs, so I'll do my best to answer your questions.

Q: Do we follow robots.txt?
A: Yes. I believe the reason some webmasters say we don't is that once a robots.txt block is put in, it takes some time for the crawl requests to go away. This happens because of our distributed architecture. The nodes in our grid don't communicate constantly with the central system.

Q: Do we follow crawl-delay?
A: No. We only follow the standard specification for robots.txt. Some directives, like crawl-delay, no-follow, etc. are not standard.

Q: What is the purpose of the service? Why would anyone pay our crazy prices?
A: Companies that use 80legs want a massive crawling system to crawl millions of pages very quickly and an easy-to-use, all-the-work-is-done-for-you service. We give that to them. Our prices are actually much lower than what these companies are used to paying (usually about 50% lower). While 80legs can be used (freely) for small-scale crawling, its real power is in larger, web-scale crawls.

Q: Do people use 80legs for shady purposes?
A: No. Many of our customers are well-established companies. You may even recognize some of them by name. Some of our customers are individuals or bootstrapped startups. Out of personal curiosity, we tend to do research on each customer. All have websites, company information, team bios, etc. If any of them are setting up businesses just to hide their purpose, I must say, they are doing an excellent job!

I hope this information helps! We're not evil, shady people. We are programming nerds that live in Houston and enjoy humidity, bbq and Mario Kart. If you want 80legs to go away, all you really need to do is disallow us in robots.txt. We promise the crawl requests will go away soon after.
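For webmasters who want to check a block rule before relying on it, here is a minimal sketch using Python's standard `urllib.robotparser`. The user agent token `008` is an assumption taken from the user agent string quoted earlier in this topic, and the example URL and the `is_allowed` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule: blocks the agent token "008" from the whole site.
ROBOTS_TXT = """\
User-agent: 008
Disallow: /
"""


def is_allowed(robots_txt, agent, url):
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The 80legs token is refused, other crawlers are unaffected.
blocked = is_allowed(ROBOTS_TXT, "008", "http://example.com/viewtopic.php")
others = is_allowed(ROBOTS_TXT, "Googlebot", "http://example.com/viewtopic.php")
```

This only shows what a compliant parser would conclude from the file; it says nothing about how quickly any particular crawler picks up the change.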

Thanks,
Shion
Lumpy Burgertushie
Registered User
Posts: 68554
Joined: Mon May 02, 2005 3:11 am

Re: Distributed Web Crawler - 80legs

Post by Lumpy Burgertushie »

ok, what benefit do these companies get from you crawling websites?

what data are you providing them with?


robert
hey, I was born and raised in houston for over 30 years. I wouldn't go back to that humidity if my life depended on it.

I still can't believe that I worked my whole life in construction in that mess.

:o
shiondev
Registered User
Posts: 4
Joined: Tue Apr 06, 2010 4:02 pm

Re: Distributed Web Crawler - 80legs

Post by shiondev »

Q: What data are you providing them with?
A: Our users can customize an 80legs crawl to do just about anything. They may be looking at outgoing links, specific data on the site, ad code, etc. We let them customize what content is produced from the crawl.

- Shion
Lumpy Burgertushie
Registered User
Posts: 68554
Joined: Mon May 02, 2005 3:11 am

Re: Distributed Web Crawler - 80legs

Post by Lumpy Burgertushie »

seems pretty intrusive to me.

just one more thing that helps advertisers learn even more about our lives I guess.

robert
Rhet-or-Ric
Registered User
Posts: 306
Joined: Sun Apr 06, 2008 1:38 pm

Re: Distributed Web Crawler - 80legs

Post by Rhet-or-Ric »

.
As the OP, I thank the company for providing a spokesperson to answer the questions raised here. I hope we do not become too much of an irritant to the executives at 80legs.

For example, I would be remiss if I didn't point out that your answer to this question is false. Mind you, it may be false because of a lack of concern to the level that some in society have. Or should I state it may be false, but not meant to be maliciously false.
Q: Do people use 80legs for shady purposes?

A: No. Many of our customers are well-established companies. You may even recognize some of them by name. Some of our customers are individuals or bootstrapped startups. Out of personal curiosity, we tend to do research on each customer. All have websites, company information, team bios, etc. If any of them are setting up businesses just to hide their purpose, I must say, they are doing an excellent job!
The answer was not something like, "We hope not." It was, "No."

But then you qualified the answer and it is very clear that your company, in fact, cannot stand in a courtroom before a judge and answer that question in that manner.

Why? Because you admit that the only research you do on a client is out of "personal curiosity". The fact that some of your clients are well-established companies makes no difference whatsoever in answering the question whether one out of fifty might be up to no good.

You admit that your company does not have an established procedure for confirming in the words of Google, "That no evil is being done."

Please don't get me wrong, I am not implying that Google is all noble-minded and does no wrong, but at least they seem to have some sense of a responsibility to the general public not to do what might be considered a bad thing.

I compare that style of thinking with what I am told is an attitude at 80legs of simple curiosity about whether your product is being used for bad.

On the other hand, there is the very important question whether you actually should be held responsible for the way your product is used. It is a perfectly legitimate question. After all, are gun manufacturers responsible for the lives lost due to the use of their product?

But by your own explanation, there is no way you can answer the question whether anyone is using your product for "shady purposes". You could only answer that you do not know the answer. That you hope not.

And the further comment in your answer that somebody is doing an excellent job of hiding shady activities, is kind of strange. Sorry to be so blunt or crude, but the courts have had many, many cases over the years where somebody was indeed clever enough to hide all sorts of illegal activities. Enron is a case where the government of California appears to have been hoodwinked. And, of course, that is just one case.

But the point I am making is that if you do decide there is some responsibility upon your company to weed out the bad clients, then the attitudes in your executive offices are going to need some drastic revision.

But that gets us back to the question of whether your company should shoulder the responsibility to have to do that. You know, weed out the baddies. The baddies are out there, you can be sure of that. And I have to tell you that your product looks like a ripe picking for some baddie trying to mine information for bad purposes. But it is only what it "looks like" and that may turn out to be incorrect, so I will apologize to you, shiondev, and your company executives as soon as I find out my words here are unduly harsh.

Oh yes, I forgot, you allow a client to customize the mining operation, yes? How much oversight is there at your end?

Again, I really appreciate that the company has provided a spokesperson and I hope my rather rude response here doesn't drive y'all away.

.
SamG
Former Team Member
Posts: 3221
Joined: Fri Aug 31, 2001 6:35 pm
Location: Beautiful Northwest Lower Michigan
Name: Sam Graf

Re: Distributed Web Crawler - 80legs

Post by SamG »

Even if their company is not responsible, and even if 80legs themselves are entirely on the up-and-up, the fact that I can hire their services to collect whatever information I please tests the good faith of content providers, it seems to me. We have no reason to suppose 80legs are shady just because they provide an unsupervised crawling service. If their bot is a good netizen, then it's all good. But we do have good reason to fear that someone, someday will use that service (just as similar services have been used in the past) in ways that will cost us more than we gain.

Which is just another way of saying that the benefits they encourage webmasters to consider aren't necessarily greater than the risks. I'm guessing they know that, so they won't take it personally if phpBB admins opt to block their bot.
Pony99CA
Registered User
Posts: 4783
Joined: Thu Sep 30, 2004 3:13 pm
Location: Hollister, CA
Name: Steve

Re: Distributed Web Crawler - 80legs

Post by Pony99CA »

shiondev wrote:Q: Do we follow robots.txt?
A: Yes. I believe the reason some webmasters say we don't is that once a robots.txt block is put in, it takes some time for the crawl requests to go away. This happens because of our distributed architecture. The nodes in our grid don't communicate constantly with the central system.
What? I thought robots.txt was supposed to be checked every time a bot visited a page, not cached and distributed. If your "nodes" aren't checking it every time, I wouldn't say that you really follow it.
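To make the caching point concrete, here is a minimal sketch of a node that remembers a robots.txt answer for a fixed time instead of refetching it on every page. This is not 80legs' actual design; the `CachedRobots` class, the one hour lifetime and the injected clock are all assumptions for illustration.

```python
import time


class CachedRobots:
    """Caches one site's 'are we blocked?' answer for ttl_seconds."""

    def __init__(self, fetch_blocked, ttl_seconds, clock=time.monotonic):
        # fetch_blocked: callable that refetches robots.txt and returns
        # True if the site currently blocks this crawler.
        self._fetch_blocked = fetch_blocked
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached = None
        self._fetched_at = None

    def blocked(self):
        """Return the cached answer, refetching only once the TTL has expired."""
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._cached = self._fetch_blocked()
            self._fetched_at = now
        return self._cached


# Simulated site and clock so the delay is visible without waiting.
site = {"blocked": False, "now": 0.0}
node = CachedRobots(lambda: site["blocked"], ttl_seconds=3600,
                    clock=lambda: site["now"])

before_block = node.blocked()   # first fetch: not blocked yet
site["blocked"] = True          # webmaster adds the Disallow rule
site["now"] = 1800.0
still_stale = node.blocked()    # half an hour later: cached answer, still crawling
site["now"] = 3600.0
after_expiry = node.blocked()   # TTL expired: refetched, now sees the block
```

With a setup like this a node keeps crawling on its stale copy until the cache expires, which would look from the outside exactly like a bot ignoring a new block.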

Steve
Silicon Valley Pocket PC (http://www.svpocketpc.com)
Creator of manage_bots and spoof_user (ask me)
Need hosting for a small forum with full cPanel & MySQL access? Contact me or PM me.