How to get archive.org bot to crawl viewtopic

CarolC1
Registered User
Posts: 565
Joined: Sat Dec 02, 2006 4:26 pm

How to get archive.org bot to crawl viewtopic

Post by CarolC1 » Tue Apr 02, 2019 2:09 am

When I look up our forum on archive.org, it has index and viewforum, but if you try to click a topic, the posts are not archived. There has to be a way to do it, because they have archived the posts from phpbb.com. What is the trick to getting a complete crawl? Thanks

https://web.archive.org/web/20190208050 ... &t=2157448

Lumpy Burgertushie
Registered User
Posts: 66324
Joined: Mon May 02, 2005 3:11 am
Contact:

Re: How to get archive.org bot to crawl viewtopic

Post by Lumpy Burgertushie » Tue Apr 02, 2019 2:55 am

As far as I know, there is no control over what they crawl or when they do it.

Over the years I have found many phpBB boards in the Wayback Machine with their posts archived.

robert
I'm baaaaaccckkkk. still doing work on donation basis. PM your needs.

Premium phpBB 3.2 Styles by PlanetStyles.net

If a tree falls in the forest and nobody is there, does it make a sound?

AmigoJack
Registered User
Posts: 5588
Joined: Tue Jun 15, 2010 11:33 am
Location: グリーン ヒル ゾーン
Contact:

Re: How to get archive.org bot to crawl viewtopic

Post by AmigoJack » Tue Apr 02, 2019 7:33 am

CarolC1 wrote:
Tue Apr 02, 2019 2:09 am
What is the trick
The trick is to actually go to that website, look for something like "Help" or "Support", and then find articles like Save Pages in the Wayback Machine and Archive whole web sites. Or contact them. That should give you more answers than we're able to give.
The worst thing about censorship is ███████████
Affin wrote:
Tue Nov 20, 2018 9:51 am
The problem is probably not my English but you do not want to understand correctly.
...
We will not come anybody anyway, nevertheless, it's best to shit this.


Re: How to get archive.org bot to crawl viewtopic

Post by CarolC1 » Tue Apr 02, 2019 2:57 pm

Hi AmigoJack,

I have not contacted them, but I have read their help pages several times over the years, including a week or two ago. They really don't give the answer.

The save pages link is mostly about saving small numbers of individually selected pages. I've certainly used that a number of times (and I've made regular donations to Wayback, too).
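For a handful of topics, the single-page capture described above can be scripted. This is only a minimal sketch, assuming the Wayback Machine's public `https://web.archive.org/save/` prefix accepts a plain GET for the page to capture. The board address and the forum/topic IDs are hypothetical, and the functions only build the request URLs instead of sending them, since bulk submissions may be throttled.

```python
from urllib.parse import urlencode

# Assumed "Save Page Now" prefix; the page to capture is appended to it.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def topic_url(board_base, forum_id, topic_id):
    """Build a phpBB-style topic URL (f= forum ID, t= topic ID)."""
    query = urlencode({"f": forum_id, "t": topic_id})
    return f"{board_base.rstrip('/')}/viewtopic.php?{query}"


def save_request_urls(board_base, topics):
    """Return one capture-request URL per (forum_id, topic_id) pair."""
    return [SAVE_ENDPOINT + topic_url(board_base, f, t) for f, t in topics]


if __name__ == "__main__":
    # Hypothetical board and topic IDs, for illustration only.
    for url in save_request_urls("https://forum.example.com", [(3, 101), (3, 102)]):
        print(url)
```

Opening each printed URL in a browser (or fetching it slowly, one at a time) requests a capture of that topic page. It does not amount to a full crawl of the board.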

The Archive-It link is more for institutions, and you have to pay for it, but nowhere does it say how much. Maybe boards like phpbb.com and phpbbhacks.com have paid to have their sites archived as a special project, but I kind of doubt it? I could be wrong? I would probably be willing to pay to have it completely crawled once, but I say that without knowing how much we're talking about.

I thought maybe there was something about having a sitemap a certain way, or possibly it's just that some sites have a lot of traffic so they get fully archived. Maybe someone will know.


Re: How to get archive.org bot to crawl viewtopic

Post by CarolC1 » Wed Apr 03, 2019 3:46 am

This seems to be a common problem.

Found some urls to check on archive.org (by looking in old support request templates).

Some say "This board has no forums"; most of the rest are only archived to viewforum level. It's hard to even find one that has posts. I'm trying forum after forum and seeing the same thing: no posts.

There is one with over 4 million posts and another with nearly 6 million, and neither one has the actual posts archived. Theirs stop at viewforum, too.

I found URLs in the profiles of a couple of very sharp regulars here whose boards have the same problem: the archive bot got as far as viewforum and quit. You click a post and it says it isn't archived.

I checked one forum of a support team member and it is fully archived down to the post level, although the templating appears messed up in viewtopic, but at least the posts are there, the bot recorded them.

I would say based on maybe an hour and a half of searching that if you have your posts archived, you are the exception. Almost no one does. But phpbb.com does. phpbbhacks does. Hmm.


Re: How to get archive.org bot to crawl viewtopic

Post by AmigoJack » Wed Apr 03, 2019 7:17 am

CarolC1 wrote:
Wed Apr 03, 2019 3:46 am
This board has no forums
That's not Wayback's fault - the board owner then didn't assign read permissions to bots.
CarolC1 wrote:
Wed Apr 03, 2019 3:46 am
only archived to viewforum level
It's hard to even find one that has posts
no posts
you click a post and it says it isn't archived
Do you mean that precisely? How about clicking the topic URI instead of a post URI? They differ, so one may be archived and the other not.

The board I maintain does not give out topics to bots, only forums. And up to that level it is archived by Wayback, so I can at least confirm that part works for me too.
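Since the Wayback Machine looks captures up by URL, it helps to know which kind of link is being tested. A rough sketch, assuming the usual phpBB 3.x link shapes (a topic link carries a `t=` parameter, a single-post link carries a `p=` parameter); the board address is hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def classify_viewtopic(url):
    """Return 'post', 'topic' or 'other' for a phpBB-style URL.

    Assumption: post links carry p=, topic links carry t= without p=.
    """
    parts = urlparse(url)
    if not parts.path.endswith("viewtopic.php"):
        return "other"
    params = parse_qs(parts.query)
    if "p" in params:
        return "post"
    if "t" in params:
        return "topic"
    return "other"


if __name__ == "__main__":
    # Hypothetical links, for illustration only.
    for link in (
        "https://forum.example.com/viewtopic.php?f=3&t=101",
        "https://forum.example.com/viewtopic.php?p=555#p555",
        "https://forum.example.com/viewforum.php?f=3",
    ):
        print(classify_viewtopic(link), link)
```

If the post link is reported as not archived, the topic link for the same thread is worth trying separately, and vice versa.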


Re: How to get archive.org bot to crawl viewtopic

Post by CarolC1 » Wed Apr 03, 2019 5:14 pm

AmigoJack wrote:
Wed Apr 03, 2019 7:17 am
Do you mean that precisely? How about clicking the topic URI instead of a post URI? They differ, so one may be archived and the other not.
Thanks for the idea. :) For a second I thought that was going to be it. Great idea! Unfortunately, neither url worked, but that was a very good idea.

I tried it 2 ways.

1) Tried taking a post url (several years old) off the live forum and pasting it into the Wayback search box directly. It said not archived.

2) I also tried bringing up the forum in the Wayback machine, then going to a forum and clicking a topic link, but it said not archived, so you never even get a chance to see any post links to click.

The only way you can get to a post link by this second method is by clicking Last post on the index (but it is still saying not archived).
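The manual check in method 1 can also be done in bulk. A minimal sketch, assuming the Wayback Machine's availability endpoint at `https://archive.org/wayback/available` and its JSON layout (`archived_snapshots` containing an optional `closest` entry); the page URL in the example is hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed availability endpoint; it takes the page address as ?url=...
AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_query(page_url):
    """Build the lookup URL for one page."""
    return f"{AVAILABILITY_API}?{urlencode({'url': page_url})}"


def closest_snapshot(response_json):
    """Return the closest snapshot URL from a decoded response, or None."""
    closest = response_json.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check(page_url):
    """Network call: ask the endpoint whether any capture of page_url exists."""
    with urlopen(availability_query(page_url), timeout=30) as resp:
        return closest_snapshot(json.load(resp))


if __name__ == "__main__":
    # Hypothetical topic link; replace with a real one from the live forum.
    print(check("https://forum.example.com/viewtopic.php?f=3&t=101"))
```

Running `check` over a list of topic links gives a rough idea of how much of a board has been captured, without clicking through the archived index by hand.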

mrgoldy
Jr. Extension Validator
Posts: 1047
Joined: Tue Oct 06, 2009 7:34 pm
Location: The Netherlands
Name: Gijs

Re: How to get archive.org bot to crawl viewtopic

Post by mrgoldy » Wed Apr 03, 2019 6:19 pm

I might be stating the obvious here, but are you sure you have the ‘correct’ permissions set for bots on your board? Can they properly see the topic lists and read the posts in them?
phpBB Studio / ''Proud member of the Studio"


Re: How to get archive.org bot to crawl viewtopic

Post by CarolC1 » Wed Apr 03, 2019 6:42 pm

It's another good question. I think they're OK?
Here is an example from one forum. Please let me know if you see anything. :)

[Attachment: permissions.PNG]


Re: How to get archive.org bot to crawl viewtopic

Post by Lumpy Burgertushie » Wed Apr 03, 2019 10:07 pm

Use the permission mask with a bot as the test user to see if there is a Never permission set somewhere else.

robert


Re: How to get archive.org bot to crawl viewtopic

Post by CarolC1 » Wed Apr 03, 2019 10:30 pm

Hi Robert,

Not sure this is what you meant? If not, please give me the dummies version. I don't see where to check forum permissions for a single bot, only the bots group. Thanks!

[Attachment: mask.PNG]

thecoalman
Community Team Member
Posts: 3220
Joined: Wed Dec 22, 2004 3:52 am
Location: Pennsylvania, U.S.A.
Contact:

Re: How to get archive.org bot to crawl viewtopic

Post by thecoalman » Thu Apr 04, 2019 1:12 am

The permissions appear to be correct. I'd recommend removing the permission to print topics; it's just an extra link to duplicate content. I'm not sure if there is an easier way to do this, but if you go into the ACP, open the Users tab and search for any bot, you'll get a link to "Test users permissions".

Alternatively, and this is the method I prefer, use a user agent switcher in the browser. Any browser should have an extension for this. You can switch to Google's user agent, or any other bot's, and browse the forum as that bot. Log out and clear your browser cache beforehand.
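The same test can be run without a browser extension. A minimal sketch, assuming the board recognizes bots by the User-Agent header alone (a board that also matches bots by IP address would not be fooled by this); the forum URL is hypothetical, and the user agent shown is Googlebot's published string.

```python
from urllib.request import Request, urlopen

# Googlebot's published desktop user-agent string.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def bot_request(url, user_agent=GOOGLEBOT_UA):
    """Prepare a request that presents itself with a bot's user agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_as_bot(url, user_agent=GOOGLEBOT_UA):
    """Network call: return (HTTP status, page text) as served to the bot."""
    with urlopen(bot_request(url, user_agent), timeout=30) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Hypothetical topic link; replace with one from your own board.
    status, html = fetch_as_bot("https://forum.example.com/viewtopic.php?f=3&t=101")
    print(status, len(html))
```

A 200 status alone does not prove the posts are visible; searching the returned HTML for a phrase from a known post is a more telling check.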

I don't believe there is anything you can do to get archive.org to index pages. I know of a very old forum that had a significant amount of content indexed. That forum was moved to a new domain about 2 years ago with proper redirects, and they really haven't indexed much since.


Re: How to get archive.org bot to crawl viewtopic

Post by CarolC1 » Thu Apr 04, 2019 3:43 am

Thanks for checking the permissions. I fixed the print permission for all forums and double-checked them all. I also checked the bot's permissions through the ACP as you suggested, and the bot can see everything it needs to. I looked at the Firefox extensions for user agent switching, so at least I know there is such a thing if I need it.

I just signed up for a 45-min webinar tomorrow about the Archive-It subscription service.

Thanks for the tips.


Re: How to get archive.org bot to crawl viewtopic

Post by CarolC1 » Thu Apr 04, 2019 11:47 pm

FYI, in the webinar they said a 1-yr subscription to Archive-It starts at $3000 USD for 128 gigabytes and is individually priced based on the project. They kept encouraging everyone to run a trial in which they crawl your site and then talk to you about price.

When I submitted a question about the fact that they've crawled our site repeatedly for years (332 times at the current address) and probably don't have 5% of it, they expressed a lack of concern because they have the whole web to do.

Today I googled "powered by vBulletin" for comparison and looked up some of their forums on Archive.org, and it's pretty much the same situation. Most of them are barely archived. I found one large forum that was mostly archived, again it's an exception.


Re: How to get archive.org bot to crawl viewtopic

Post by AmigoJack » Fri Apr 05, 2019 7:10 am

Thanks for reporting back. The price is... maybe even justified if the archived pages stay accessible forever afterwards. And I also think they're right about "the whole web to do": there are countless discussion boards full of rather worthless chit-chat. I even consider large parts of the phpbb.com forums not worth archiving (repeated trivial questions, outdated advice...). Maybe Wayback decides this based on something as trivial as "not more than one parameter in the URI".

Well, I'm glad Wayback exists (I've found numerous older definitions which don't seem to exist anywhere else anymore), and I never consider myself in a position to ask that they listen to me. But then again, I grew up downloading entire websites for offline browsing, so I have a sense of how much traffic, storage and effort that takes and how well you can work with the outcome.
