How to get archive.org bot to crawl viewtopic

Posted: Tue Apr 02, 2019 2:09 am
by CarolC1
When I look up our forum on archive.org, it has index and viewforum, but if you try to click a topic, the posts are not archived. There has to be a way to do it, because they have archived the posts from phpbb.com. What is the trick to getting a complete crawl? Thanks

https://web.archive.org/web/20190208050 ... &t=2157448

Re: How to get archive.org bot to crawl viewtopic

Posted: Tue Apr 02, 2019 2:55 am
by Lumpy Burgertushie
As far as I know, there is no control over what they crawl or when they do it.

Over the years I have found many phpBB boards whose posts are in the Wayback Machine.

robert

Re: How to get archive.org bot to crawl viewtopic

Posted: Tue Apr 02, 2019 7:33 am
by AmigoJack
CarolC1 wrote:
Tue Apr 02, 2019 2:09 am
What is the trick
To actually go to that website, look for something like "Help" or "Support", and then find articles like "Save Pages in the Wayback Machine" and "Archive whole web sites". Or contact them. That should give you more answers than we're able to give.

Re: How to get archive.org bot to crawl viewtopic

Posted: Tue Apr 02, 2019 2:57 pm
by CarolC1
Hi AmigoJack,

I have not contacted them, but I have read their help pages several times over the years, including a week or two ago. They really don't give the answer.

The save pages link is mostly about saving small numbers of individually selected pages. I've certainly used that a number of times (and I've made regular donations to Wayback, too).

The Archive-It link is more for institutions and you have to pay for it, but nowhere does it say how much. Maybe boards like phpbb.com and phpbbhacks.com have paid to have their sites archived as a special project, but I kind of doubt it? I could be wrong? I would probably be willing to pay to have our board completely crawled once, but I say that without knowing how much we're talking about.

I thought maybe there was something about having a sitemap a certain way, or possibly it's just that some sites have a lot of traffic so they get fully archived. Maybe someone will know.

Re: How to get archive.org bot to crawl viewtopic

Posted: Wed Apr 03, 2019 3:46 am
by CarolC1
This seems to be a common problem.

Found some urls to check on archive.org (by looking in old support request templates).

Some say "This board has no forums"; most of the rest are only archived to viewforum level. It's hard to even find one that has posts. I'm trying forum after forum and seeing the same thing: no posts.

There is one with over 4 million posts and another with nearly 6 million, and neither one has the actual posts archived. Theirs stop at viewforum, too.

I found URLs in the profiles of a couple of very sharp regulars here whose boards have this same problem. The archive bot got as far as viewforum on their boards and quit; you click a post and it says it isn't archived.

I checked one forum of a support team member and it is fully archived down to the post level. The templating appears messed up in viewtopic, but at least the posts are there; the bot recorded them.

I would say based on maybe an hour and a half of searching that if you have your posts archived, you are the exception. Almost no one does. But phpbb.com does. phpbbhacks does. Hmm.

Re: How to get archive.org bot to crawl viewtopic

Posted: Wed Apr 03, 2019 7:17 am
by AmigoJack
CarolC1 wrote:
Wed Apr 03, 2019 3:46 am
This board has no forums
That's not Wayback's fault. In that case the board owner didn't assign read permissions to bots.
CarolC1 wrote:
Wed Apr 03, 2019 3:46 am
only archived to viewforum level
Its hard to even find one that has posts
no posts
you click a post and it says it isn't archived
Do you mean that precisely? How about clicking the topic URI instead of a post URI? They differ, so one may be archived and the other not.

The board I maintain does not give out topics to bots, only forums. And up to that level it is archived by Wayback. So I can at least confirm that part also works for me.
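
To illustrate the topic/post distinction above, here is a minimal Python sketch that tells the two URL shapes apart. It assumes the usual phpBB 3.x pattern, where a topic link carries a "t" parameter and a post link carries a "p" parameter; the function name and the forum.example.com URLs are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs


def classify_viewtopic_url(url: str) -> str:
    """Return 'post', 'topic', or 'other' for a phpBB-style URL.

    Assumption: post links contain a 'p' query parameter and topic
    links contain 't' (and usually 'f') but no 'p'.
    """
    parsed = urlparse(url)
    if not parsed.path.endswith("viewtopic.php"):
        return "other"
    params = parse_qs(parsed.query)
    if "p" in params:
        return "post"
    if "t" in params:
        return "topic"
    return "other"


# Hypothetical example URLs, not taken from any real board.
TOPIC_URL = "https://forum.example.com/viewtopic.php?f=6&t=2157"
POST_URL = "https://forum.example.com/viewtopic.php?p=15001#p15001"

if __name__ == "__main__":
    for url in (TOPIC_URL, POST_URL):
        print(classify_viewtopic_url(url), url)
```

Because the two forms are different URLs, an archive can hold a capture of one and not the other, so both are worth checking.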

Re: How to get archive.org bot to crawl viewtopic

Posted: Wed Apr 03, 2019 5:14 pm
by CarolC1
AmigoJack wrote:
Wed Apr 03, 2019 7:17 am
Do you mean that precisely? How about clicking the topic URI instead of a post URI? They differ, so one may be archived and the other not.
Thanks for the idea. :) For a second I thought that was going to be it. Great idea! Unfortunately, neither url worked, but that was a very good idea.

I tried it 2 ways.

1) Tried taking a post url (several years old) off the live forum and pasting it into the Wayback search box directly. It said not archived.

2) I also tried bringing up the forum in the Wayback machine, then going to a forum and clicking a topic link, but it said not archived, so you never even get a chance to see any post links to click.

The only way you can get to a post link by this second method is by clicking Last post on the index (but it is still saying not archived).
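
The first check (pasting a URL into the Wayback search box) can also be scripted. Below is a rough Python sketch against the Wayback Machine availability API at archive.org/wayback/available. The helper names and the example forum URL are hypothetical, and the response is assumed to follow the documented "archived_snapshots" / "closest" JSON shape.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url: str) -> str:
    """Build the availability API request URL for one page."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": url})


def closest_snapshot(response: dict):
    """Return the closest snapshot URL from a decoded API response,
    or None when the response lists no available capture."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def is_archived(url: str, timeout: float = 30.0):
    """Query the API over the network; returns a snapshot URL or None."""
    with urlopen(availability_query(url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))


if __name__ == "__main__":
    # Hypothetical topic URL; substitute one from your own board.
    print(is_archived("https://forum.example.com/viewtopic.php?f=6&t=2157"))
```

Running is_archived() on both the topic URL and the post URL of the same thread would show whether either form was captured. The API reports only the closest snapshot, not the full capture history.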

Re: How to get archive.org bot to crawl viewtopic

Posted: Wed Apr 03, 2019 6:19 pm
by mrgoldy
I might be stating the obvious here, but are you sure you have the ‘correct’ permissions set for bots on your board? Can they properly see the topic lists and read their respective posts, etc.?

Re: How to get archive.org bot to crawl viewtopic

Posted: Wed Apr 03, 2019 6:42 pm
by CarolC1
It's another good question. I think they're OK?
Here is an example from one forum. Please let me know if you see anything. :)

permissions.PNG

Re: How to get archive.org bot to crawl viewtopic

Posted: Wed Apr 03, 2019 10:07 pm
by Lumpy Burgertushie
Use the permission mask, testing with a bot, to see whether a "Never" permission is set somewhere else.

robert

Re: How to get archive.org bot to crawl viewtopic

Posted: Wed Apr 03, 2019 10:30 pm
by CarolC1
Hi Robert,

Not sure this is what you meant? If not, please give me the dummies version. I don't see where to check forum permissions for a single bot, only the bots group. Thanks!

mask.PNG

Re: How to get archive.org bot to crawl viewtopic

Posted: Thu Apr 04, 2019 1:12 am
by thecoalman
The permissions would appear to be correct. I'd recommend removing the permission to print topics; it's just an extra link to duplicate content. Not sure if there is an easier way to do this, but if you go into the ACP and the Users tab and search for any bot, you'll get a link to "Test user's permissions".

Alternatively, and this is the method I prefer, use a user agent switcher in the browser. Any browser should have an extension for this available. You can switch to Google's user agent or that of any other bot and browse the forum as a bot. Log out and clear your browser cache beforehand.
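
For a quick check without a browser extension, the same idea can be sketched in Python: send a request with a bot's User-Agent header and see what the board returns. The Googlebot string below is the one Google publishes for its crawler; the function names and the forum URL are hypothetical. This assumes the board identifies bots by user agent alone (phpBB bot entries can also be matched by IP address, in which case this test would not trigger the bot permissions).

```python
from urllib.request import Request, urlopen

# User agent string published by Google for Googlebot.
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; "
    "+http://www.google.com/bot.html)"
)


def bot_request(url: str, user_agent: str = GOOGLEBOT_UA) -> Request:
    """Build a request that presents itself with a bot's user agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_as_bot(url: str, timeout: float = 30.0):
    """Fetch a page as the bot; returns (HTTP status, page text)."""
    with urlopen(bot_request(url), timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Hypothetical topic URL; substitute one from your own board.
    status, html = fetch_as_bot(
        "https://forum.example.com/viewtopic.php?f=6&t=2157"
    )
    # A login prompt or "not authorised" message in the HTML would
    # suggest the bot group cannot read this topic.
    print(status, len(html))
```

No cookies are sent here, so the request arrives as a fresh, logged-out visitor, which matches the advice to log out and clear the cache first.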

I don't believe there is anything you can do to get archive.org to index pages. I know of a very old forum that had a significant amount of content indexed. That forum got moved to a new domain about 2 years ago with proper redirects, and they really haven't indexed much since.

Re: How to get archive.org bot to crawl viewtopic

Posted: Thu Apr 04, 2019 3:43 am
by CarolC1
Thanks for checking the permissions. I fixed the print permission for all forums and double-checked them all. I tried what you said about checking the bot's permissions through the ACP, and the bot can see everything it needs to, thanks. I looked at Firefox extensions for user agent switching, so I will at least know there is such a thing if I need it.

I just signed up for a 45-min webinar tomorrow about the Archive-It subscription service.

Thanks for the tips.

Re: How to get archive.org bot to crawl viewtopic

Posted: Thu Apr 04, 2019 11:47 pm
by CarolC1
FYI, in the webinar they said a 1-yr subscription to Archive-It starts at $3000 USD for 128 gigabytes and is individually priced based on the project. They kept encouraging everyone to run a trial in which they crawl your site and then talk to you about price.

When I submitted a question about the fact that they've crawled our site repeatedly for years (332 times at the current address) and probably don't have 5% of it, they expressed a lack of concern because they have the whole web to do.

Today I googled "powered by vBulletin" for comparison and looked up some of their forums on Archive.org, and it's pretty much the same situation. Most of them are barely archived. I found one large forum that was mostly archived, but again it's an exception.

Re: How to get archive.org bot to crawl viewtopic

Posted: Fri Apr 05, 2019 7:10 am
by AmigoJack
Thanks for reporting back. The price is... maybe even justified if the archived pages stay accessible forever afterwards. And I also think they're right with "the whole web to do": there are countless discussion boards that hold rather worthless chit-chat. I even consider large parts of the phpbb.com forums not worth archiving (repeated trivial questions, outdated advice...). Maybe Wayback decides this based on something as trivial as "not more than one parameter in the URI".

Well, I'm glad Wayback exists (I've found numerous older definitions which don't seem to exist anywhere else anymore), and I never consider myself in a position to demand that they hear me out. But then again, I grew up downloading entire websites for offline browsing, so I can sense how much traffic, storage and effort that needs and how well you can work with the outcome.