I finally managed to run these scripts on my server.
I really should have done this a year ago when I first found out about them. Unfortunately there was a lot of other forum maintenance I needed to do to free up the necessary disk space, not to mention everything else in life that makes it hard to find time for these projects. Also, as it turned out, I needed to improve my MySQL knowledge and skills to get it all to run through completely. I have been an administrator of this board for almost 20 years but was only ever responsible for board/ACP-side admin, not server-side. I had to take on the whole job solo 2 years ago when the other admin retired, and it has been a huge learning curve running a dedicated server with hobbyist skills only.
Because of my delay I missed the opportunity to easily scrape all the Photobucket images before they plastered their watermarks all over them and then, 4 weeks ago, started blurring all remotely hosted images.
I honestly thought I had stuffed up big time and these images would be lost forever. As explained in an earlier post, many of these images are invaluable, some representing the hard work and knowledge of friends who are now deceased.
I am extremely pleased to report I found a way to bypass Photobucket's watermarking and blurring. In the end, the solution was very simple. I will share the solution with v12mike.
The extraction script took 4 days to complete and the downloading script 2 days...and that's for Photobucket only. The script identified 320,000 embedded image URLs on my board, with 94,000 on Photobucket alone.
Going by other comments in the thread, this confirms my suspicion that my board is much more image-heavy than other boards and was particularly affected by the Photobucket image hijack.
I have now scraped 78,000 images (6.5GB) previously hosted on Photobucket, not a single one watermarked or blurred. There are a few hundred more that returned a 200 or 301 code, which I should be able to retrieve if I manually tweak the database entries the script created. More on that in a subsequent post. The rest are images the users themselves likely deleted, and I cannot be aggrieved if the users no longer wanted those online.
I have also created a modified version of the scripts to scrape images embedded in users' signatures and successfully retrieved 300 of these. I should be able to easily modify this for user avatars as well, which I will likely do today or tomorrow. I will also provide these scripts to v12mike.
I will also run the script for the images at hosts other than Photobucket. I'll need to purchase some more HDD space again for this. Around 150,000 of these images were on the old Imageshack and are no longer online. However, many are archived on archive.org and could be retrieved automatically by inserting some code into the download script which uses the Wayback Machine JSON API to query the URL for the oldest version of the image cached on archive.org and then uses the URL returned by the API to download that image instead. The download script already checks file size against images that have already been downloaded, so it could easily verify whether the archived version is more authentic than the version previously downloaded.
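For anyone who wants to try the archive.org idea, here is a rough sketch of the lookup step in Python. It is not part of v12mike's scripts and I have not wired it into the download script. It assumes the Wayback Machine availability endpoint at archive.org/wayback/available, which returns the snapshot closest to a given timestamp, so passing a very early timestamp (19960101 here) is only an approximation of "oldest version". The function names are made up for illustration, and the `id_` modifier is used on the assumption that it makes archive.org return the original file rather than a page wrapped in its toolbar.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the Wayback Machine availability JSON API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query_url(image_url, timestamp="19960101"):
    """Build the API query. An early timestamp asks for the snapshot
    closest to that date, which approximates the oldest capture."""
    params = urllib.parse.urlencode({"url": image_url, "timestamp": timestamp})
    return f"{AVAILABILITY_ENDPOINT}?{params}"


def snapshot_url(response):
    """Return the snapshot URL from a decoded API response, or None
    if nothing usable was archived."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    if str(closest.get("status")) != "200":
        return None
    return closest.get("url")


def raw_snapshot_url(url):
    """Insert the id_ modifier after the snapshot timestamp so the
    original bytes are requested (assumption, see note above)."""
    prefix, sep, rest = url.partition("/web/")
    if not sep:
        return url
    stamp, _, original = rest.partition("/")
    if not stamp.isdigit():
        return url
    return f"{prefix}/web/{stamp}id_/{original}"


def find_archived_image(image_url, timeout=30):
    """Query archive.org and return a download URL, or None."""
    with urllib.request.urlopen(build_query_url(image_url), timeout=timeout) as resp:
        data = json.load(resp)
    url = snapshot_url(data)
    return raw_snapshot_url(url) if url else None
```

The download script would call something like find_archived_image() for each dead URL and, if it gets a URL back, fetch that instead of the original. With 150,000 images to check it would be sensible to put a delay between requests.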
In the meantime, fuck you Photobucket. I'm glad I beat you.
Finally, I and many others owe a massive debt to v12mike for sharing these scripts. Having saved almost 80,000 images means a great deal to me and the users of my board. So, thank you.