The converters as written are probably overkill for backing up a single thread. The basic idea for one topic is pretty simple:
1) download the first topic page, and determine the number of pages in the thread based on that (this requires a regular expression which can find the last page number from the HTML)
2) download the remaining topic pages, parsing all of the posts out of each page into a simpler form with a regular expression (or some other method, but regular expressions tend to be the simplest)
3) convert the resulting simplified data into the desired output format (for the converters, the input is a series of records delimited by <^> and </^> tags, with fields separated by <|>, and the output is SQL)
Basically, you need to write two regular expressions: one which can find out the number of pages (or, since you are converting a single thread, you could hardcode the number of pages), and a second expression to pull out the post data and convert it into a simpler form.
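For the first expression, a minimal sketch might look like the following. It assumes the pagination bar contains text of the form "Page 1 of 12"; that markup is a guess and should be checked against the actual page source. The name count_pages is made up for illustration and is not part of the converters.

```python
import re

# Assumed pagination text, e.g. "Page 1 of 12"; adjust to the real HTML.
re_numpages = re.compile(r"Page\s+\d+\s+of\s+(\d+)")

def count_pages(html):
    """Return the thread's page count, or 1 if no pagination text is found."""
    match = re_numpages.search(html)
    if match is None:
        # Assumption: a thread with no pagination text has a single page.
        return 1
    return int(match.group(1))
```

You would call this once on the first downloaded page and use the result in place of a hardcoded page count.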
You can then worry about smaller issues like BBCode conversion. Since vBulletin is entirely different from any other platform I've converted so far, no existing converter will have the right regular expressions, so you'll have to write them.
Here's a vastly simplified skeleton for this converter:
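import re
from common import *

COOKIEDATA = 'cookie data goes here'
NUMPAGES = 12000 # hardcoded

# fill this in: placeholder pattern, must be adapted to the real vBulletin HTML
re_posts = re.compile("<p>Author: (.+?)<br/>Date: (.+?)<br/>Post: (.+?)</p>", re.DOTALL)

posts = []
for page in xrange(1, NUMPAGES+1):
    statusline = "Page %i ... " % page
    progressline = statusline + "Downloading - "
    data = download_page("http://domain.com/showthread.php?...&page=%i" % page, progressline, COOKIEDATA)
    # collect one (author, date, post) tuple per match on this page
    posts.extend(re_posts.findall(data))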
At the end of this, posts is a list of all the posts in the thread, ready to be further processed.
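As one example of further processing, here is a rough sketch that turns the posts into the record format mentioned in step 3. It assumes each record is wrapped as <^>...</^> with its fields joined by <|>, one record per line; that layout is inferred from the description above and may not match the converters' exact format. serialize_posts is a hypothetical helper name, and the sample data is invented.

```python
def serialize_posts(posts):
    """Turn (author, date, body) tuples into one <^>...</^> record each."""
    records = []
    for fields in posts:
        # Fields are joined with <|>; each record is wrapped in <^> ... </^>.
        records.append("<^>" + "<|>".join(fields) + "</^>")
    # One record per line is an assumption, not a documented requirement.
    return "\n".join(records)

# Invented sample data in the same shape as the posts list.
sample = [("alice", "2009-01-01", "First post"),
          ("bob", "2009-01-02", "A reply")]
print(serialize_posts(sample))
```

Note that a post body which happens to contain <|> or </^> would break this naive format, so real data may need escaping first.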
Hope that helps!