Images Hijacked by Photobucket?

Holger
Registered User
Posts: 1756
Joined: Tue Mar 12, 2002 3:54 pm
Location: Hannover

Re: Images Hijacked by Photobucket?

Post by Holger » Mon Sep 23, 2019 8:54 am

Woah, that is just sad! :x

Reading this, I am SO happy that I always encouraged my users to use the forum upload function and tried to make it as easy as possible for them to upload images to my own servers. Also, the decision to let admins and mods move external images to my own servers with one click was on point. Phew!

KYPREO
Registered User
Posts: 93
Joined: Fri Feb 02, 2018 9:56 am

Re: Images Hijacked by Photobucket?

Post by KYPREO » Sun Oct 06, 2019 9:22 pm

I finally managed to run these scripts on my server.

I really should have done this a year ago when I first found out about them. Unfortunately there was a lot of other forum maintenance I needed to do to free up the necessary disk space, not to mention everything else in life that makes it hard to find time for these projects. Also, as it turned out, I needed to improve my MySQL knowledge and skills to get it all to run through completely. I have been an administrator of this board for almost 20 years but was only ever responsible for board/ACP-side admin, not server-side. I had to take on the whole job solo 2 years ago when the other admin retired, and it has been a huge learning curve running a dedicated server with only hobbyist skills.

Because of my delay, I missed the opportunity to easily scrape all the Photobucket images before they plastered their watermarks all over them and, 4 weeks ago, started blurring all remotely hosted images.

I honestly thought I had stuffed up big time and these images would be lost forever. As explained in an earlier post, many of these images are invaluable, some representing the hard work and knowledge of friends who are now deceased.

I am extremely pleased to report I found a way to bypass Photobucket's watermarking and blurring. In the end, the solution was very simple. I will share the solution with v12mike.

The extraction script took 4 days to complete and the downloading script 2 days...and that's for Photobucket only. The script identified 320,000 embedded image URLs on my board, with 94,000 on Photobucket alone. :o Going by other comments in the thread, this confirms my suspicion that my board is much more image-heavy than other boards and was particularly affected by the Photobucket image hijack.

I have now scraped 78,000 images (6.5GB) previously hosted on Photobucket, not a single one watermarked or blurred. There are a few hundred more that returned a 200 or 301 code, which I should be able to retrieve if I manually tweak the database entries the script created. More on that in a subsequent post. The rest are images the users themselves likely deleted, and I cannot be aggrieved if the users no longer wanted those online.

I have also created a modified version of the scripts to scrape images embedded in users' signatures and successfully retrieved 300 of these. I should be able to easily modify this for user avatars as well, which I will likely do today or tomorrow. I will also provide these scripts to v12mike.

I will also run the script for the images at hosts other than Photobucket. I'll need to purchase some more HDD space again for this. Around 150,000 of these images were on the old Imageshack and are no longer online. However, many are archived on archive.org and could be retrieved automatically by inserting some code into the download script that uses the Wayback Machine JSON API to query the URL for the oldest version of the image cached on archive.org, then uses the URL returned by the API to download that image instead. The download script already checks file size against images that have already been downloaded, so it could easily verify whether the archived version is more authentic than the version previously downloaded.
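
As a rough illustration of the archive.org lookup described above (a sketch only, in Python rather than the PHP of v12mike's scripts, and not part of those scripts): the function names are made up for this example, and the response layout is assumed to follow the Wayback Machine availability API, which returns the snapshot closest to the requested timestamp. Asking for a very early timestamp should therefore normally give the oldest capture.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint of the Wayback Machine availability API.
WAYBACK_API = "https://archive.org/wayback/available"


def build_query(image_url, timestamp="19960101"):
    """Build the API request URL. An early timestamp asks for the snapshot
    closest to that date, which is normally the oldest one."""
    return WAYBACK_API + "?" + urlencode({"url": image_url, "timestamp": timestamp})


def snapshot_url(api_response):
    """Return the archived URL from a decoded API response, or None.

    Assumes the response layout
    {"archived_snapshots": {"closest": {"available": ..., "url": ..., "status": ...}}}.
    """
    closest = api_response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available") and closest.get("status") == "200":
        return closest.get("url")
    return None


def find_archived_copy(image_url):
    """Query archive.org (needs network access) and return a snapshot URL or None."""
    with urlopen(build_query(image_url), timeout=30) as response:
        return snapshot_url(json.load(response))
```

A download script could call something like `find_archived_copy()` only for URLs that returned an error from the original host, and fall through to the normal "not found" handling when it returns `None`. The snapshot URL may still need adjusting before it serves the raw image file, so treat this as a starting point.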

In the meantime, fuck you Photobucket. I'm glad I beat you.

Finally, I and many others owe a massive debt to v12mike for sharing these scripts. Having saved almost 80,000 images means a great deal to me and the users of my board. So, thank you 8-)

KYPREO
Registered User
Posts: 93
Joined: Fri Feb 02, 2018 9:56 am

Re: Images Hijacked by Photobucket?

Post by KYPREO » Sun Oct 06, 2019 9:44 pm

Further to my post above, here are some tips for those running these scripts as I encountered a few problems:
  • Run the scripts using PHP's CLI if you have a large database and have shell access. This way, I was able to run the scripts without any timeouts on a 7GB database.
  • However, if the scripts encounter an error, they don't remember the last post/image successfully read in that batch, so you need to try to fix the error, and the script will then run through that entire batch again before it reaches the problematic post. The script will only continue on to the next batch once it has run through an entire batch successfully. You could also reduce the number of posts/images in a batch if you are struggling to identify the exact issue.
  • The extraction script threw 2 errors at me. I couldn't find a way to identify the specific post_id causing the issues, so I had to find them by checking the PHP/MySQL error and querying the database for the entry that was creating the error condition. In my case the 2 issues were:
    1. The script interpreted the string "firetopmountain" within a JavaScript URL as a file extension and tried to drop that into the "ext" column in the relevant database table. That column is configured as VARCHAR(10), so I needed to modify the column to take longer text.
    2. There were a few URLs that exceeded the 500 character limit for the "orig_link" column created by the script. I modified the VARCHAR for this column also. As it turned out, the script incorrectly parsed around 100 URLs causing them to be excessively long.
  • As noted above, the script did not successfully parse the URL for a number of embedded images, and the URL data it extracted still contained BBCode. This occurred in about 100 cases. Some of these were due to the original post having incorrect/incomplete BBCode. In other cases the BBCode was fine and the image is actually online, so I'll need to manually clean up these database entries. Some images still downloaded despite the BBCode in the URL (due to redirections), but I suspect that if the database is not corrected, the image will be saved using an MD5 hash of the incorrect URL in the database, not the actual URL appearing in the forum text, and so the saved image will not display properly when using the externally hosted images extension.
  • For the above reasons, I suggest going to the tables created by the script and querying the ext and URL columns for anomalous entries. For example, the URL column should not contain any entries with "[img]" or "img src". The ext column should not contain entries longer than 4 characters. Check for URLs longer than 200-300 characters.
  • When running the download script, it may encounter 200 response codes for files but then not download them due to an incorrect MIME type or file size. It will mark these entries in the database with a "200" code but no further information to tell you that they were not downloaded. In my case, the bad MIME type was due to the image URL redirecting to an HTML page containing the image. You could use this to manually retrieve the images after the script has run (and then save them under the correct MD5-format file name by using an online MD5 generator). To make this easy, I suggest saving the output of the download script into a log file so you can go back and find the problematic files.
  • Photobucket files in thumbnail size have the prefix "th_". My users were posting these in the days before device-sensitive themes. The detailed image is still there if you remove the "th_" from the URL. You could download these versions if you want the full size image backed up on your board without watermarking etc. I am yet to do this but will do so.
  • The download script requires you to have the PHP fileinfo extension enabled. This is not enabled by default, so I had to enable the relevant DLL in the php.ini file for the script to work.
I hope this feedback helps others.
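
The anomaly checks suggested in the list above can be sketched roughly as follows. This is a Python illustration under assumptions, not something taken from the scripts: the rows are assumed to have been exported as (id, URL, ext) tuples, and the function name and thresholds are made up for the example.

```python
def find_anomalies(rows, max_ext_len=4, max_url_len=300):
    """Return (row_id, reason) pairs for rows that look mis-parsed.

    Each row is a (row_id, url, ext) tuple, e.g. exported from the table
    the extraction script creates (column layout assumed for this sketch).
    """
    problems = []
    for row_id, url, ext in rows:
        lowered = url.lower()
        # Leftover BBCode or HTML means the URL was not extracted cleanly.
        if "[img]" in lowered or "[/img]" in lowered or "img src" in lowered:
            problems.append((row_id, "markup left in URL"))
        # Real image extensions are short (jpg, jpeg, png, gif ...).
        if len(ext) > max_ext_len:
            problems.append((row_id, "extension too long"))
        # Very long URLs were usually parsed incorrectly.
        if len(url) > max_url_len:
            problems.append((row_id, "URL suspiciously long"))
    return problems


# Hypothetical sample rows for illustration only.
sample_rows = [
    (1, "http://example.com/a.jpg", "jpg"),
    (2, "http://example.com/b.jpg[/img] some text", "jpg"),
    (3, "http://example.com/script.firetopmountain", "firetopmountain"),
]
for row_id, reason in find_anomalies(sample_rows):
    print(row_id, reason)
```

The same three conditions could equally be written as WHERE clauses in a MySQL query; the point is only to list the suspect rows so they can be fixed by hand before running the download script.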

Holger
Registered User
Posts: 1756
Joined: Tue Mar 12, 2002 3:54 pm
Location: Hannover

Re: Images Hijacked by Photobucket?

Post by Holger » Mon Oct 07, 2019 7:05 am

Thank you KYPREO for your valuable posts! :)

KYPREO
Registered User
Posts: 93
Joined: Fri Feb 02, 2018 9:56 am

Re: Images Hijacked by Photobucket?

Post by KYPREO » Mon Oct 07, 2019 9:33 am

Holger wrote:
Mon Oct 07, 2019 7:05 am
Thank you KYPREO for your valuable posts! :)
You're welcome. If it can help others, then great!

KYPREO
Registered User
Posts: 93
Joined: Fri Feb 02, 2018 9:56 am

Re: Images Hijacked by Photobucket?

Post by KYPREO » Mon Oct 07, 2019 8:46 pm

I have modified the scripts for avatars and successfully recovered around 650 Photobucket avatars. I will also make these modified scripts available to the original author for validation.

This now ought to cover every image in a normal forum except private message text.

KYPREO
Registered User
Posts: 93
Joined: Fri Feb 02, 2018 9:56 am

Re: Images Hijacked by Photobucket?

Post by KYPREO » Thu Oct 10, 2019 11:00 pm

This is not specific to Photobucket, but it is relevant to using these scripts and recovering dead images, and might therefore be useful to others.

If your users have been utilising external image hosting services for a number of years, do not assume that a 404 or incorrect MIME-type error means the images are no longer online. I have found that in many instances the images are still being hosted but the original hotlink is dead and it is a matter of figuring out the new URL structure.

One good example is Flickr.

The following image URLs on my board are dead:

Code:

http://photos23.flickr.com/26103776_9a591e5b01.jpg
http://photos23.flickr.com/26103779_02e1c92280.jpg
http://photos22.flickr.com/26103780_63183014cb.jpg
http://photos22.flickr.com/26103777_1de9c50117.jpg
http://photos22.flickr.com/26103778_917e0db652.jpg
If you take the number in the filename that appears before the underscore, this is the photo ID. I then plug the photo ID into the following URL structure:

Code:

http://flickr.com/photo.gne?id={photo_ID}
This returns a web page showing the image, from which the direct URL can then be extracted, e.g.

https://live.staticflickr.com/23/261037 ... 14d6_k.jpg etc

Unfortunately the direct URL has to be extracted individually for each image, as there is a hash as part of the filename. The image could then be manually downloaded and saved using the MD5 hash of the original URL as the filename.
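
The first half of that process (old URL to photo ID to lookup page) and the MD5 naming can be sketched in a few lines of Python. This is an illustration only: the function names are invented, and the exact filename format the scripts expect (for instance whether an extension is appended to the hash) is an assumption that should be checked against the files the download script has already saved.

```python
import hashlib
import re
from urllib.parse import urlparse


def flickr_photo_id(old_url):
    """Return the digits before the first underscore of the filename, or None."""
    filename = urlparse(old_url).path.rsplit("/", 1)[-1]
    match = re.match(r"(\d+)_", filename)
    return match.group(1) if match else None


def flickr_lookup_url(old_url):
    """Build the photo.gne lookup URL for a dead Flickr hotlink, or None."""
    photo_id = flickr_photo_id(old_url)
    return f"http://flickr.com/photo.gne?id={photo_id}" if photo_id else None


def md5_name(original_url):
    """Hex MD5 of the original URL string (UTF-8 encoded).

    Assumption: this matches how the scripts name saved files; compare
    against an existing download before relying on it.
    """
    return hashlib.md5(original_url.encode("utf-8")).hexdigest()


old = "http://photos23.flickr.com/26103776_9a591e5b01.jpg"
print(flickr_lookup_url(old))  # http://flickr.com/photo.gne?id=26103776
print(md5_name(old))
```

Generating the hash locally like this avoids pasting each URL into an online MD5 generator. Extracting the new direct URL from the lookup page is still a separate step.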

Other services have a more systemised way of restructuring a URL so you can figure out the new links and download them in batches.

One excellent feature of the script that v12mike implemented is that the host and HTTP response code for each URL are recorded in the database. This means you can figure out which hosts are most popular with your users and then create a report of the URLs on each host that return a 404 or other error code. This would allow you to focus your efforts on hosts which have recoverable images and which appear most frequently on the forum. In my case, there are thousands of Flickr images which are showing as dead but are all recoverable, so the effort may be worthwhile.
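
A report like that only needs the host and response code for each URL. The sketch below is a Python illustration that assumes those two columns have been exported as (host, code) pairs; the function name and the sample data are hypothetical.

```python
from collections import Counter


def dead_links_by_host(rows, ok_codes=(200,)):
    """Count non-OK responses per host, most affected host first.

    rows is an iterable of (host, http_status_code) pairs.
    """
    counts = Counter(host for host, code in rows if code not in ok_codes)
    return counts.most_common()


# Hypothetical sample data for illustration only.
sample = [
    ("photos23.flickr.com", 404),
    ("photos22.flickr.com", 404),
    ("photos23.flickr.com", 404),
    ("img.example.net", 200),
    ("img.example.net", 410),
]
for host, dead in dead_links_by_host(sample):
    print(f"{host}: {dead} dead links")
```

Grouping by the registered domain rather than the full host name (so that photos22 and photos23 are counted together) would be a sensible refinement for hosts like Flickr that used many subdomains.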

AmigoJack
Registered User
Posts: 5613
Joined: Tue Jun 15, 2010 11:33 am
Location: グリーン ヒル ゾーン

Re: Images Hijacked by Photobucket?

Post by AmigoJack » Fri Oct 11, 2019 6:23 am

KYPREO wrote:
Thu Oct 10, 2019 11:00 pm
do not assume that a 404 or incorrect MIME-type error means the images are no longer online
Likewise, do not assume that HTTP 200 means you get what you want: a couple of picture hosts no longer serve the original, but instead respond with a picture that has printed text on it (e.g. "removed" or "404"). One naive approach to check for those is to look at their dimensions and file size. A more sophisticated approach is to find all of their duplicates, as their file content is always the same.

KYPREO
Registered User
Posts: 93
Joined: Fri Feb 02, 2018 9:56 am

Re: Images Hijacked by Photobucket?

Post by KYPREO » Fri Oct 11, 2019 6:30 am

AmigoJack wrote:
Fri Oct 11, 2019 6:23 am
KYPREO wrote:
Thu Oct 10, 2019 11:00 pm
do not assume that a 404 or incorrect MIME-type error means the images are no longer online
Likewise, do not assume that HTTP 200 means you get what you want: a couple of picture hosts no longer serve the original, but instead respond with a picture that has printed text on it (e.g. "removed" or "404"). One naive approach to check for those is to look at their dimensions and file size. A more sophisticated approach is to find all of their duplicates, as their file content is always the same.
Yes, I agree. That was exactly my plan - to run a search for duplicates in the download folder to identify placeholder images.

The script has a minimum file size by default, but I found that placeholder images (like those now used by Tinypic) are actually larger than many legitimate images, so file size is a rather crude and imprecise way of excluding false positives.
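
A duplicate search of the download folder could be sketched like this. It is a Python illustration under assumptions: the function name and the threshold of three copies are arbitrary choices for the example, and SHA-256 is used here simply as a content fingerprint.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(folder, min_copies=3):
    """Group files in a folder by content hash.

    Returns a list of filename lists, one per group of byte-identical
    files that occurs at least min_copies times.
    """
    groups = defaultdict(list)
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path.name)
    return [names for names in groups.values() if len(names) >= min_copies]
```

Each group returned is a candidate placeholder image. Users sometimes post the same genuine picture several times, so one file from each group should still be checked by eye before the group is deleted or marked as dead.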

v12mike
Registered User
Posts: 374
Joined: Thu Jul 09, 2015 5:03 pm

Re: Images Hijacked by Photobucket?

Post by v12mike » Fri Oct 11, 2019 9:43 am

Good investigative work there, KYPREO. I know of some other forum admins who never got around to downloading copies of their externally hosted images and thought that it was now too late, but perhaps they will do it now.

Mick
Support Team Member
Posts: 21572
Joined: Fri Aug 29, 2008 9:49 am
Location: Caerdydd

Re: Images Hijacked by Photobucket?

Post by Mick » Tue Oct 15, 2019 1:18 pm

I cancelled my Photobucket account last week. I went there to see if there was anything worth saving and got so frustrated with how slow it was that I just deleted the lot. I hadn’t visited for years anyway, but it is now a real pain in the arse.
"The more connected we get the more alone we become" - Kyle Broflovski

KYPREO
Registered User
Posts: 93
Joined: Fri Feb 02, 2018 9:56 am

Re: Images Hijacked by Photobucket?

Post by KYPREO » Tue Oct 15, 2019 9:50 pm

Mick wrote:
Tue Oct 15, 2019 1:18 pm
I cancelled my Photobucket account last week. I went there to see if there was anything worth saving and got so frustrated with how slow it was that I just deleted the lot. I hadn’t visited for years anyway, but it is now a real pain in the arse.
I suppose this is too late for you, but there is a way for users to download all their photos from their own account without having to log in and use their shitty platform. The platform is clearly designed to be slow and painful and the download feature is likely deliberately broken. Various online reports show that it used to work and became broken when they started asking for payment to hotlink images. It's pretty obvious what they are doing. They truly are ransoming people's photos.

Anyway, for others who may wish to know, the answer is here: Download all your photobucket images in bulk via CLI

This uses the same method I used to modify v12mike's scripts to obtain the unwatermarked/unblurred original versions of images.

Mick
Support Team Member
Posts: 21572
Joined: Fri Aug 29, 2008 9:49 am
Location: Caerdydd

Re: Images Hijacked by Photobucket?

Post by Mick » Wed Oct 16, 2019 9:51 am

Thanks, but I’d got to the end of my requirement for Photobucket. I started getting emails from them all of a sudden after 15 years; I’d forgotten I had an account at all. Hopefully your instructions will help a lot of folks who do want to recover their images. Thanks again.
