Is my forum being scraped?

skybound
Registered User
Posts: 145
Joined: Wed Nov 12, 2003 7:11 am
Location: Port Elizabeth - South Africa

Is my forum being scraped?

Post by skybound » Tue Sep 17, 2019 8:00 pm

Over the last month my average users-online stat (the one on the front page) has pretty much doubled. It used to average around 200 users online for most of the day; now it shoots up to 500+ at times. In my website stats the number of unique users has also doubled, but the number of pages served has increased only slightly, by about 10%.
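A quick back-of-the-envelope check on those numbers, as a rough sketch: if unique visitors roughly doubled while pages served rose by only about 10%, then pages per visitor fell to a little over half of what it was. That would fit a lot of new visitors each fetching very few pages. The two growth factors below are approximations read off the figures above, not measured values.

```python
# Approximate growth factors taken from the post (assumptions, not precise data).
visitor_growth = 2.0   # unique visitors roughly doubled
page_growth = 1.10     # pages served went up by about 10%

# Pages per visitor now, relative to a month ago.
relative_pages_per_visitor = page_growth / visitor_growth

print(f"Pages per visitor is about {relative_pages_per_visitor:.0%} of the old level")
```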

The majority of the visitors are guests, and in the users-online panel I can see these guests looking at most forums on our site, even the less popular sections.

There is no visible impact on server performance, so it is not a DoS-type attack.

Anyone else experienced something similar? Could it be scraping?

Many of the guest entries look similar to this (but not consistently):
Guest IP: 183.22.29.201
Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36
The IP addresses are also from all over the world, with very few repeated IPs.
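One thing that stands out in the sample user agent is its formatting: there is no space before the opening parentheses, and it reads "KHTML,link Gecko" where mainstream browsers send "KHTML, like Gecko". A minimal sketch of flagging such strings when going through logs might look like this. The two patterns and the `looks_malformed` name are assumptions based only on the one sample above, so treat it as a starting point rather than a reliable bot detector.

```python
import re

# Hypothetical heuristics derived from the single sample in this post.
_SUSPICIOUS_PATTERNS = [
    re.compile(r"KHTML,\s*link Gecko"),  # "link" where browsers send "like"
    re.compile(r"Mozilla/5\.0\("),       # missing space before "("
]

def looks_malformed(user_agent: str) -> bool:
    """Return True if the user agent shows any of the sample's quirks."""
    return any(p.search(user_agent) for p in _SUSPICIOUS_PATTERNS)

sample = ("Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) "
          "AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 "
          "Chrome/42.0.2311.138 Mobile Safari/537.36")
# An ordinary desktop Chrome user agent, for comparison.
normal = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
          "(KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36")

print(looks_malformed(sample))  # True
print(looks_malformed(normal))  # False
```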

david63
Registered User
Posts: 16536
Joined: Thu Dec 19, 2002 8:08 am
Location: Lancashire, UK
Name: David Wood

Re: Is my forum being scraped?

Post by david63 » Tue Sep 17, 2019 8:10 pm

This is not an uncommon occurrence. It is most likely bots crawling your site; whether they are "good" bots or "bad" bots is another matter.
David
Remember: You only know what you know and - you don't know what you don't know!

John connor
Registered User
Posts: 2238
Joined: Fri Nov 14, 2014 5:14 pm
Location: U S Of A
Name: Aaron

Re: Is my forum being scraped?

Post by John connor » Wed Sep 18, 2019 8:12 am

Check out CIDRAM over at GitHub. I use it myself, and I've even had features added to it, like an AbuseIPDB module. The dev is a friend of mine.

thecoalman
Community Team Member
Posts: 3303
Joined: Wed Dec 22, 2004 3:52 am
Location: Pennsylvania, U.S.A.

Re: Is my forum being scraped?

Post by thecoalman » Wed Sep 18, 2019 9:08 am

Had the same thing happen on my forum, and it coincided with the same activity here on phpbb.com, but that was a few weeks ago.
Most users ever online was 27085 on Wed Aug 21, 2019 12:40 am
That's actually 26,000 bots, give or take a few hundred. My site is proxied through Cloudflare, and since the primary users of my site are in the Northeast US, I've simply started to throw a speed bump at IPs from China, India and a few others. Cloudflare can target by country, so you don't need to go mucking around with IPs. Any country in that list gets a JS browser challenge; the bots typically don't execute JS, so they get blocked. At its height it was blocking about 50 requests per minute.
“Results! Why, man, I have gotten a lot of results! I have found several thousand things that won’t work.”

Attributed - Thomas Edison


Re: Is my forum being scraped?

Post by skybound » Wed Sep 18, 2019 5:45 pm

Thanks for the responses, guys. That gives me some understanding, and at least I know it is not too out of the ordinary.
If they are bots indexing the site, is that not a good thing that might result in some increase in real traffic?


Re: Is my forum being scraped?

Post by thecoalman » Wed Sep 18, 2019 10:06 pm

skybound wrote:
Wed Sep 18, 2019 5:45 pm
If they are bots indexing, is that not a good thing and perhaps result in some increase in real traffic?
There are good bots and bad bots. A bot like Google's is going to identify itself and will obey robots.txt. These are not identifying themselves or obeying robots.txt, so they can't be of any benefit to you. As a side note, a rogue bot could spoof Google's user agent; it's the IP that will really tell you if it's Google.
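To illustrate the point about the IP, here is a minimal sketch of the reverse-then-forward DNS check that Google documents for verifying its crawler: resolve the IP to a hostname, check that the hostname is under googlebot.com or google.com, then resolve that hostname back and confirm it returns the same IP. The function name, the injectable lookup parameters and the sample values are assumptions added for illustration, and the default lookups handle IPv4 only.

```python
import socket
from typing import Callable, List

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_googlebot(
    ip: str,
    reverse_lookup: Callable[[str], str] = lambda ip: socket.gethostbyaddr(ip)[0],
    forward_lookup: Callable[[str], List[str]] = lambda host: socket.gethostbyname_ex(host)[2],
) -> bool:
    """Reverse-resolve the IP, check the domain, then confirm with a forward lookup."""
    try:
        host = reverse_lookup(ip)
        if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
            return False
        return ip in forward_lookup(host)
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses:
        # a failed lookup is treated as "not verified".
        return False

# Offline demonstration with stand-in resolvers (illustrative values only).
fake_reverse = {"66.249.66.1": "crawl-66-249-66-1.googlebot.com",
                "183.22.29.201": "host.example.net"}
fake_forward = {"crawl-66-249-66-1.googlebot.com": ["66.249.66.1"]}

def check(ip: str) -> bool:
    return is_verified_googlebot(ip, fake_reverse.__getitem__,
                                 lambda h: fake_forward.get(h, []))

print(check("66.249.66.1"))    # True
print(check("183.22.29.201"))  # False
```

Calling `is_verified_googlebot(ip)` without the extra arguments performs real DNS lookups, so it is best cached per IP rather than run on every request.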

Re: Is my forum being scraped?

Post by John connor » Thu Sep 19, 2019 10:32 pm

thecoalman wrote:
Wed Sep 18, 2019 9:08 am
Had the same thing happen on my forum, and it coincided with the same activity here on phpbb.com, but that was a few weeks ago.
Most users ever online was 27085 on Wed Aug 21, 2019 12:40 am
That's actually 26,000 bots, give or take a few hundred. My site is proxied through Cloudflare, and since the primary users of my site are in the Northeast US, I've simply started to throw a speed bump at IPs from China, India and a few others. Cloudflare can target by country, so you don't need to go mucking around with IPs. Any country in that list gets a JS browser challenge; the bots typically don't execute JS, so they get blocked. At its height it was blocking about 50 requests per minute.
I use Cloudflare myself, and many years ago I put the JS challenge to the test. I JS-challenged my own IP, disabled JS in the browser, and all I got was a block page followed by a pass-through to my site. I opened a ticket with CF about this, and the rep informed me that's the way it works. Really? Maybe they have changed it since; I'd have to confirm. Also, I've read that bots can parse JS. Maybe not all, but the ability does exist. You're better off with a regular (CAPTCHA) challenge instead of a JS challenge. And I just straight up block China, India, and many cloud/hosting providers' ASNs. You can challenge countries, or you have something like ten free firewall rules to outright block whole countries.


Re: Is my forum being scraped?

Post by John connor » Thu Sep 19, 2019 10:35 pm

thecoalman wrote:
Wed Sep 18, 2019 10:06 pm
skybound wrote:
Wed Sep 18, 2019 5:45 pm
If they are bots indexing, is that not a good thing and perhaps result in some increase in real traffic?
There are good bots and bad bots. A bot like Google's is going to identify itself and will obey robots.txt. These are not identifying themselves or obeying robots.txt, so they can't be of any benefit to you. As a side note, a rogue bot could spoof Google's user agent; it's the IP that will really tell you if it's Google.
CIDRAM has search engine verification. On top of that, in my .htaccess file I blocked access for all robots except the four major search engines. If you forge a Googlebot UA, you are served a 403 thanks to CIDRAM's search engine verification.


Re: Is my forum being scraped?

Post by skybound » Fri Sep 20, 2019 10:18 am

When you mention bad bots, do you just mean that they ignore the robots file, or that they are up to no good in indexing your site?
I.e., is there a downside to being indexed if it is not affecting your site's response times and performance?


Re: Is my forum being scraped?

Post by thecoalman » Fri Sep 20, 2019 11:22 am

John connor wrote:
Thu Sep 19, 2019 10:32 pm
I use Cloudflare myself, and many years ago I put the JS challenge to the test. I JS-challenged my own IP, disabled JS in the browser, and all I got was a block page followed by a pass-through to my site.
It's one of the things they use to block DDoS attacks as well. You need to use it with caution; I would never use it for traffic that I'm expecting, other than during a DDoS event. I may have a handful of legitimate users from China per year, and I even had a legitimate poster from China once. :lol: I can no longer justify allowing the spammers, scrapers and other nuisances that make up 99.9% of the traffic from these countries. Any legitimate user with JS enabled will get through.

John connor wrote:
Also, I've read that bots can parse JS.
They can, but they typically don't.

Re: Is my forum being scraped?

Post by thecoalman » Fri Sep 20, 2019 11:28 am

skybound wrote:
Fri Sep 20, 2019 10:18 am
When you mention bad bots, do you just mean that they ignore the robots file, or that they are up to no good in indexing your site?
Any bot that ignores robots.txt would be considered a bad bot and is likely up to no good.
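For contrast, this is roughly what a well-behaved crawler does before requesting a page: it parses robots.txt and checks each URL against it. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `ExampleBot` name and the forum.example URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a phpBB board; the paths are illustrative.
ROBOTS_TXT = """\
User-agent: *
Disallow: /memberlist.php
Disallow: /search.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_fetch(url: str, user_agent: str = "ExampleBot") -> bool:
    """What a polite crawler asks itself before requesting a URL."""
    return parser.can_fetch(user_agent, url)

print(may_fetch("https://forum.example/viewtopic.php?t=1"))  # True
print(may_fetch("https://forum.example/search.php"))         # False
```

A bot that requests the disallowed paths anyway is ignoring the file, and nothing in robots.txt itself can stop it.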

Re: Is my forum being scraped?

Post by John connor » Fri Sep 20, 2019 10:14 pm

thecoalman wrote:
Fri Sep 20, 2019 11:22 am
John connor wrote:
Thu Sep 19, 2019 10:32 pm
I use Cloudflare myself, and many years ago I put the JS challenge to the test. I JS-challenged my own IP, disabled JS in the browser, and all I got was a block page followed by a pass-through to my site.
It's one of the things they use to block DDoS attacks as well. You need to use it with caution; I would never use it for traffic that I'm expecting, other than during a DDoS event. I may have a handful of legitimate users from China per year, and I even had a legitimate poster from China once. :lol: I can no longer justify allowing the spammers, scrapers and other nuisances that make up 99.9% of the traffic from these countries. Any legitimate user with JS enabled will get through.

John connor wrote:
Also, I've read that bots can parse JS.
They can, but they typically don't.
I put the JS challenge to the test again, and now, after all these years, it does in fact do what it's supposed to do. If JS is off, you get a block page. Before, you got a block page and then a continuance to the site.

:lol: Funny about that China poster. I actually just block the whole damn country, as well as many others.


Re: Is my forum being scraped?

Post by skybound » Sat Sep 21, 2019 8:20 am

I tried blocking China, but ended up with too many emails from legit forumites on layovers. I guess that's what you get on a forum for pilots: they end up all over the globe.


Re: Is my forum being scraped?

Post by thecoalman » Sun Sep 22, 2019 8:16 am

skybound wrote:
Sat Sep 21, 2019 8:20 am
I tried blocking China, but ended up with too many mails from legit forumites on layovers.
With Cloudflare you have the choice of blocking them outright or issuing a JS challenge or a CAPTCHA. With the JS challenge they get a brief page that says "checking your browser", and after a few seconds it directs them to the page they asked for. Any IP that passes the test is whitelisted for X hours, depending on how you have it set in the Cloudflare config. After that time has expired, they will get the browser test page again.
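The "whitelisted for X hours" behaviour can be pictured as a small expiry table keyed by IP. The sketch below is only an illustration of that idea, not how Cloudflare implements it; the `ChallengePassCache` class, its method names and the one-hour value are all assumptions.

```python
import time
from typing import Callable, Dict

class ChallengePassCache:
    """Remembers which IPs passed a challenge, for a limited time."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def record_pass(self, ip: str) -> None:
        """Call when an IP has just passed the browser check."""
        self._expires_at[ip] = self._clock() + self._ttl

    def needs_challenge(self, ip: str) -> bool:
        """True if the IP never passed, or its pass has expired."""
        expiry = self._expires_at.get(ip)
        if expiry is None or self._clock() >= expiry:
            self._expires_at.pop(ip, None)  # drop the stale entry, if any
            return True
        return False

# Demonstration with a fake clock so the expiry can be shown instantly.
now = [0.0]
cache = ChallengePassCache(ttl_seconds=3600, clock=lambda: now[0])
print(cache.needs_challenge("203.0.113.7"))  # True (never passed)
cache.record_pass("203.0.113.7")
print(cache.needs_challenge("203.0.113.7"))  # False (within the hour)
now[0] = 3600.0
print(cache.needs_challenge("203.0.113.7"))  # True (pass expired)
```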

It's almost seamless as long as they have JS enabled.

In case you are unfamiliar, Cloudflare is a proxy that sits between you and the end user, so any traffic blocked by them never hits your server. If you really wanted to get nuts with this, you could use their API to whitelist IPs after a successful login, so legitimate users would only see that browser check once.

As a side note, Cloudflare adds a header with the visitor's country code to the HTTP request it forwards to your server, so you could also act on it server-side.
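As a rough sketch of using that header server-side: to my understanding the header is named `CF-IPCountry` and carries a two-letter country code, which in a WSGI-style environment shows up as `HTTP_CF_IPCOUNTRY`. The country set, the function name and the "challenge"/"allow" labels below are assumptions for illustration, and the header should only be trusted when the request really came through Cloudflare.

```python
from typing import Mapping

# Assumed set for illustration; the thread mentions China and India.
CHALLENGED_COUNTRIES = {"CN", "IN"}

def action_for_request(environ: Mapping[str, str]) -> str:
    """Return "challenge" or "allow" based on the country header.

    `environ` is a WSGI-style mapping, where the CF-IPCountry request
    header appears under the key HTTP_CF_IPCOUNTRY.
    """
    country = environ.get("HTTP_CF_IPCOUNTRY", "").upper()
    return "challenge" if country in CHALLENGED_COUNTRIES else "allow"

print(action_for_request({"HTTP_CF_IPCOUNTRY": "CN"}))  # challenge
print(action_for_request({"HTTP_CF_IPCOUNTRY": "US"}))  # allow
print(action_for_request({}))                           # allow (header missing)
```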