[2.0.x] Tweaks for large forums

The 2.0.x discussion forum has been locked; this will remain read-only. The 3.0.x discussion forum has been renamed phpBB Discussion.
Locked
lanzer
Registered User
Posts: 152
Joined: Wed Oct 10, 2001 10:00 am
Contact:

[2.0.x] Tweaks for large forums

Post by lanzer »

Throughout the months of running phpBB I've made a few mods here and there to help speed things up on the site due to the high volume of traffic. Just wanted to take a short break and share with everyone some of the modifications that I made. They're by no means an enhancement to phpBB, but those who are experiencing slowness on their large forums might find them interesting:

- Index page

By default phpBB collects all topics that are new to see which forums have been updated since your last visit. When the post count approached 50,000 posts a day, this process took more than 10 seconds to load. Eventually the process was dropped and a cheat was put in place.

This was taken out:

Code: Select all

	if ( $userdata['session_logged_in'] )
	{
		$sql = "SELECT t.forum_id, t.topic_id, p.post_time 
			FROM " . TOPICS_TABLE . " t, " . POSTS_TABLE . " p 
			WHERE p.post_id = t.topic_last_post_id 
				AND p.post_time > " . $userdata['user_lastvisit'] . " 
				AND t.topic_moved_id = 0"; 
		if ( !($result = $db->sql_query($sql)) )
		{
			message_die(GENERAL_ERROR, 'Could not query new topic information', '', __LINE__, __FILE__, $sql);
		}

		$new_topic_data = array();
		while( $topic_data = $db->sql_fetchrow($result) )
		{
			$new_topic_data[$topic_data['forum_id']][$topic_data['topic_id']] = $topic_data['post_time'];
		}
	}
The cheat is the one line right after $forum_data[] = $row;

Code: Select all

    $sql = "SELECT f.*, p.post_time, p.post_username, u.username, u.user_id
        FROM (( " . FORUMS_TABLE . " f
        LEFT JOIN " . POSTS_TABLE . " p ON p.post_id = f.forum_last_post_id )
        LEFT JOIN " . USERS_TABLE . " u ON u.user_id = p.poster_id ) 
        ORDER BY f.cat_id, f.forum_order";



	if ( !($result = $db->sql_query($sql)) )
	{
		message_die(GENERAL_ERROR, 'Could not query forums information', '', __LINE__, __FILE__, $sql);
	}

	$forum_data = array();
	while( $row = $db->sql_fetchrow($result) )
	{
		$forum_data[] = $row;
        if ($row['post_time'] > $userdata['user_lastvisit']) $new_topic_data[$row['forum_id']][$row['topic_id']] = $row['post_time'];
	}
It just ends up checking whether the latest post was made after the user's last visit time. Quick and dirty change.

- Users Online

The index page also scans the users and sessions tables for active users. When the number of online users gets close to a thousand, this check becomes costly. This isn't even a hack: I ended up having an external script on a cron job run the exact same query once a minute. The result is stored in a hash table, and phpBB then queries the hash table for the list of online users.

This code was taken out:

Code: Select all

	$user_forum_sql = ( !empty($forum_id) ) ? " WHERE session_page = " . intval($forum_id) : '';
	$sql = "SELECT u.username, u.user_id, u.user_allow_viewonline, u.user_level, s.session_logged_in, s.session_ip FROM ".USERS_TABLE." u, ".SESSIONS_TABLE." s WHERE u.user_id = s.session_user_id AND s.session_time >= ".( $board_config['time_now'] - 300 ) . " $user_forum_sql ORDER BY u.username ASC, s.session_ip ASC";
And replaced with a query that grabs everything from a hash table:

Code: Select all

	$user_forum_sql = ( !empty($forum_id) ) ? " WHERE session_page = " . intval($forum_id) : '';
    $sql = "SELECT * FROM " . $session_table_name . $user_forum_sql;
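That cron-plus-cache pattern is easy to sketch. This is not lanzer's actual script (he didn't post it); it's a minimal illustration in Python, with SQLite standing in for MySQL and made-up table contents:

```python
import sqlite3

# SQLite stands in for MySQL here; table and column names mirror phpBB's,
# but the data is invented for illustration.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users        (user_id INTEGER, username TEXT);
CREATE TABLE sessions     (session_user_id INTEGER, session_time INTEGER);
CREATE TABLE online_cache (user_id INTEGER, username TEXT);
""")
db.execute("INSERT INTO users VALUES (1, 'lanzer'), (2, 'drifter')")
db.execute("INSERT INTO sessions VALUES (1, 1000), (2, 100)")

def refresh_online_cache(now):
    # The expensive users/sessions join runs here, once a minute from
    # cron, instead of on every single index-page view.
    db.execute("DELETE FROM online_cache")
    db.execute("""
        INSERT INTO online_cache
        SELECT u.user_id, u.username
        FROM users u, sessions s
        WHERE u.user_id = s.session_user_id AND s.session_time >= ?
    """, (now - 300,))

refresh_online_cache(1100)
# The page itself only ever does this cheap read:
online = db.execute("SELECT username FROM online_cache").fetchall()
print(online)  # [('lanzer',)]
```

The page load never pays for the join; at worst the online list is a minute stale, which nobody notices.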
- Viewtopic.php

Perhaps this is one of the more interesting hacks, and more useful to everyone. When fetching a particular page within a topic, there is a large query that fetches the user info, post info, and post contents. I was under the belief that we should always join queries to avoid latency, so I never touched that query. As it turns out, when a thread gets bigger and bigger, a large joined query with the WHERE and ORDER BY clauses takes a ton of memory to process. If we're looking for the last page of a topic with 1,000 replies, MySQL grabs the post text and user data from all the posts, then spits out the last 15 (or however many posts each page shows).

So, instead of one query containing:

Code: Select all

$sql = "SELECT u.username, u.user_id, u.user_posts, u.user_from, u.user_website, u.user_email, u.user_icq, u.user_aim, u.user_yim, u.user_regdate, u.user_msnm, u.user_viewemail, u.user_rank, u.user_sig, u.user_sig_bbcode_uid, u.user_avatar, u.user_session_time, u.user_allow_viewonline, u.user_allowsmile, p.*,  pt.post_text, pt.post_subject, pt.bbcode_uid
	FROM " . POSTS_TABLE . " p, " . USERS_TABLE . " u, " . POSTS_TEXT_TABLE . " pt
	WHERE p.topic_id = $topic_id $limit_posts_time
		AND pt.post_id = p.post_id
		AND u.user_id = p.poster_id
	ORDER BY p.post_time $post_time_order
	LIMIT $start, ".$board_config['posts_per_page'];
I now start off with

Code: Select all

$p_array = array();
$sql = "SELECT p.post_id FROM " . POSTS_TABLE . " p WHERE p.topic_id = $topic_id $limit_posts_time LIMIT $start, " . $board_config['posts_per_page'];
if ( !($result = $db->sql_query($sql)) )
{
	message_die(GENERAL_ERROR, "Could not obtain post index information.", '', __LINE__, __FILE__, $sql);
}
while (list($p_id) = $db->sql_fetchrow($result)) {
    $p_array[] = $p_id;
}
$post_index = implode(",",$p_array);
A list of post IDs is obtained; now I slip it into the large query:

Code: Select all

$sql = "SELECT u.username, u.user_id, u.user_posts, u.user_from, u.user_website, u.user_email, u.user_icq, u.user_aim, u.user_yim, u.user_regdate, u.user_msnm, u.user_viewemail, u.user_rank, u.user_sig, u.user_sig_bbcode_uid, u.user_avatar, u.user_session_time, u.user_allow_viewonline, u.user_allowsmile, p.*,  pt.post_text, pt.post_subject, pt.bbcode_uid
	FROM " . POSTS_TABLE . " p, " . USERS_TABLE . " u, " . POSTS_TEXT_TABLE . " pt
	WHERE p.post_id in ($post_index)
		AND pt.post_id = p.post_id
		AND u.user_id = p.poster_id
	ORDER BY p.post_time $post_time_order";
And all is well. The difference is not noticeable on small systems, and the latency of making two queries might actually be slower in some cases. But on a forum with 3.3M posts, the difference is 5 minutes vs. 1 minute on a topic with 70,000 replies. There is a performance increase even on topics with as few as a few hundred replies. Overall it saves memory, which made me happy.

Hope this proves to be an amusing read for anyone who likes to take things apart. :wink: I'll be adding more stuff to this thread as I find more time. Have fun!
Last edited by Marshalrusty on Sat Aug 02, 2008 12:02 am, edited 3 times in total.
Reason: Moved to phpBB Discussion from the now archived 2.0 Discussion
Deriel
Registered User
Posts: 164
Joined: Tue Aug 27, 2002 5:16 pm
Location: Curitiba/PR - Brasil
Contact:

Post by Deriel »

Really interesting.

One doubt: what do you consider a "large forum"? More than X posts? Or by users? Or by posts/day?
drathbun
Former Team Member
Posts: 12204
Joined: Thu Jun 06, 2002 3:51 pm
Location: TOPICS_TABLE
Contact:

Post by drathbun »

Deriel wrote: what you consider a "large forum"?
lanzer wrote: When user post count approached 50,000 posts a day

I consider that a "large" forum in my book... 50K posts a day? Was that for real, or a typo?

Dave
I blog about phpBB: phpBBDoctor blog
Still using phpbb2? So am I! Click below for details
jk1
Registered User
Posts: 103
Joined: Sun Jul 21, 2002 10:55 pm
Location: USA

Post by jk1 »

I think he was for real, here is his forum: http://ian.go-gaia.com/forum/index.php
lanzer
Registered User
Posts: 152
Joined: Wed Oct 10, 2001 10:00 am
Contact:

Post by lanzer »

With lots of role playing, our board has large threads that go up to 72,000 replies. I can't trim them since the posts follow a particular storyline. Every time someone reads the last page of a thread with 50,000 replies, MySQL needs to read through 50,000 keys to find the latest posts.

A small piece of the puzzle had been MySQL's inability to search a key entry backwards when looking for post IDs within a topic. It turns out I needed to create a compound index on topic_id plus post_id to make that possible.

So first, I created a compound key with the command:

Code: Select all

create index topic_n_id on bb_posts (topic_id, post_id);
(my prefix is "bb_" instead of "phpbb_")

Then I slipped in a few lines of code to request a reverse search if the user requests a page number that's more than halfway through the thread:

Code: Select all

$total_pages = ceil($total_replies/$board_config['posts_per_page']);
$on_page = floor($start / $board_config['posts_per_page']) + 1;
if ($start > 100 && ($total_replies / 2) < $start) {
    $reverse = TRUE;
    $last_page_posts = $total_replies - ($board_config['posts_per_page'] * ($total_pages - 1));
}
* This code was inserted right after the "Generate a 'Show posts in previous x days' select box." section.

Followed by:

Code: Select all

if (isset($reverse)) {
    $limit_string = ($total_pages == $on_page) ? $last_page_posts : ($last_page_posts + ($total_pages - $on_page - 1) * $board_config['posts_per_page'] ).','. $board_config['posts_per_page'];
    $sql = "SELECT p.post_id FROM " . POSTS_TABLE . " p USE INDEX(topic_n_id) WHERE p.topic_id = $topic_id $limit_posts_time ORDER BY p.post_id DESC LIMIT $limit_string" ;
} else {
    $sql = "SELECT p.post_id FROM " . POSTS_TABLE . " p WHERE p.topic_id = $topic_id $limit_posts_time LIMIT $start, " . $board_config['posts_per_page'];
}
* Note that this query is the modified tweak for viewtopic.php mentioned in the first post.
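To make the LIMIT arithmetic in that reverse branch concrete, here is the same forward/reverse decision rewritten as a small Python function (my re-creation of the PHP logic above, not lanzer's code), with the 70,000-reply topic worked through:

```python
from math import ceil, floor

def limit_clause(start, total_replies, per_page):
    """Mirror of the forward/reverse LIMIT choice from the post.

    Returns (reverse, limit_string): reverse says whether to add
    ORDER BY post_id DESC, and limit_string is the LIMIT argument.
    """
    total_pages = ceil(total_replies / per_page)
    on_page = floor(start / per_page) + 1
    if start > 100 and (total_replies / 2) < start:
        # Walk in from the end of the index instead of the front.
        last_page_posts = total_replies - per_page * (total_pages - 1)
        if on_page == total_pages:
            # Last page: just take the newest last_page_posts rows.
            return True, str(last_page_posts)
        offset = last_page_posts + (total_pages - on_page - 1) * per_page
        return True, f"{offset},{per_page}"
    return False, f"{start},{per_page}"

# A topic with 70,000 replies at 15 per page (4,667 pages, 10 on the last):
print(limit_clause(69990, 70000, 15))  # last page      -> (True, '10')
print(limit_clause(69975, 70000, 15))  # next-to-last   -> (True, '10,15')
print(limit_clause(0, 70000, 15))      # first page     -> (False, '0,15')
```

So a reader on the last page scans 10 index entries backwards instead of skipping 69,990 forwards, which is where the speedup comes from.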

Being able to search in reverse order made me a happy man. :D

Before:

Code: Select all

Query:	SELECT post_id FROM bb_posts WHERE topic_id = 57286  LIMIT 25890, 15

Time before:  0.014814972877502
Time after:   0.032263994216919
Elapsed time: 0.017449021339417
After:

Code: Select all

Query:	SELECT post_id FROM bb_posts FORCE INDEX(topic_n_id) WHERE topic_id = 57286  ORDER BY post_id DESC LIMIT 17,15

Time before:  0.017421960830688
Time after:   0.018036961555481
Elapsed time: 0.00061500072479248
With 1,100 people online and 3.5M posts, reading long threads used to take as long as 5 minutes. Now I'm pretty sure the forum can hold a few more million posts without major complaints. (Everything takes half a second now, since no temp space is used.) The two tweaks for viewtopic.php are really easy to do and are highly recommended for anyone who has a fair-sized database.

Oh yeah, the username field in the users table is not indexed by default. After having a lot of users, I eventually had to index it to avoid long waits for logins, posting private messages, etc. I just wish I had made the username field smaller.

My search feature is still disabled. :oops: My next side project is to give a stab at speeding up the search word insertion process. Doubt that I can do much about that...
Last edited by lanzer on Fri Oct 10, 2003 4:59 am, edited 2 times in total.
lanzer
Registered User
Posts: 152
Joined: Wed Oct 10, 2001 10:00 am
Contact:

Answers to questions above

Post by lanzer »

Hello Deriel, by large forum I would say a post count of 250,000 or above. (Wild, uneducated guess.) Basically up to the point where it takes more than a second or two to read a thread. A "large forum" can be smaller if the MySQL server doesn't have much memory for queries. Conversely, a beefed-up machine can see a forum with 1M posts as small, when all queries probably take fractions of a second to complete.

A lot of people blame dynamic forums like phpBB for being slow when a forum gets larger, when the slowness usually comes from the SQL server taking too long to handle queries. The tweaks mentioned might help minimize the problems, and I hope to find more in the future.
pengrus
Registered User
Posts: 566
Joined: Mon Dec 02, 2002 6:13 am
Contact:

Post by pengrus »

Lanzer,

I visited your board, and must say you've done a great job!!!
Oh yeah, the username field in the users table is not indexed by default. After having a lot of users I eventually had to make it an index to avoid long waits for login and posting private messages, etc. Just wished that I made the username field smaller.


Would you mind sharing the code modification on the username index thing?

Thank you so much for sharing the great modifications!
lanzer
Registered User
Posts: 152
Joined: Wed Oct 10, 2001 10:00 am
Contact:

Post by lanzer »

Hi Pengrus,

By indexing I meant making the username field within the MySQL database reside in memory; it's not something done in PHP. What I entered was a MySQL command that goes:

Code: Select all

create index username on phpbb_users (username);
Normally it's not recommended, because usernames take up as much as 25 bytes per key while a typical index takes up about 3 bytes. But when a server is busy, or the bottleneck is the hard drive, sacrificing memory for speed is what this addition provides. You save memory anyway if requests are handled faster; that's another way of looking at it.
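On the "wish the field were smaller" point: MySQL can also key on just a prefix of a string column, which shrinks each key without touching the column itself. (This is a standard MySQL feature, not something from the post, and 10 characters is an arbitrary choice.)

```sql
-- Hypothetical: key on only the first 10 characters of each username
create index username on phpbb_users (username(10));
```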
Darth Wong
Registered User
Posts: 2401
Joined: Wed Jul 03, 2002 5:20 am
Location: Toronto, Canada
Name: Michael Wong
Contact:

Post by Darth Wong »

I wonder if any of these performance issues are addressed in 2.2 ...

Anyway, kudos on what you've done with your board. Certainly more impressive than my kludgy solution of pruning back my search_wordmatch table :)
Freedom of speech is not absolute, nor was it ever meant to be. If it were, then fraud would not be illegal.
dhn
Former Team Member
Posts: 4999
Joined: Wed Jul 04, 2001 8:10 am
Location: Internet
Name: Dominik Dröscher
Contact:

Post by dhn »

Darth Wong wrote: I wonder if any of these performance issues are addressed in 2.2 ..


I'd say that the major performance boost in 2.2 will be on the script side, not the database one. The new templating system will speed up your forums tremendously, for one thing.

That does not mean, however, that database queries aren't being tweaked. The topic marking in particular got reworked (apparently another forum software implemented the same system shortly after Paul finished it for 2.2; take that how you like it). Editing of posts will also be less problematic: an MD5 hash is now created for each post and checked when you edit it, so if you just want to disable BBCode or your signature, the search indexes won't need to be updated. Just two examples of probably a lot more.

Visit area51 :: to see the effects. 8)
chAos
Former Team Member
Posts: 4032
Joined: Wed Jan 16, 2002 7:05 am
Location: Seattle, WA, US

Post by chAos »

Also, if you are using MySQL, I believe it uses the built-in indexes, which removes the need for one of the search tables (the big one, I believe).
doggroups
Registered User
Posts: 48
Joined: Tue Apr 15, 2003 4:32 pm
Contact:

Post by doggroups »

Great post lanzer! I am keeping all this in mind in case our forums ever become that large. This is why I have always loved phpBB because everyone here is so great about sharing and helping each other out.
lanzer
Registered User
Posts: 152
Joined: Wed Oct 10, 2001 10:00 am
Contact:

Post by lanzer »

I've taken a look at the queries in 2.2 and they're much like 2.0.6's, where the user and post text info are joined when searching for posts to display.

Unfortunately I do not have a smaller forum to compare performance on before and after the tweak. Generating a list of post IDs before the joined query made a 5x speed difference on a thread with 5,000 replies, inside a table with 3M+ entries. I know that's a ridiculous number of replies for a topic to a lot of people. Perhaps slip in a toggle that fetches the IDs first only if the $start value is above a few hundred? Either way, what worries me is that the current query has a lot of potential for creating a temp table on the SQL server. If any of the developers have time to try this tweak out, it'd be interesting to say the least. :wink:
lanzer
Registered User
Posts: 152
Joined: Wed Oct 10, 2001 10:00 am
Contact:

Post by lanzer »

Just added the search feature back to the forum. To avoid having a huge wordsearch table I've told the searchword adding function to be extremely stingy. Here's what I did:

- Skip posts with only one word in the subject line
- Only process the first 20 replies within a thread
- Skip posts with less than 20 words
- All quoted text is removed
- All numeric text is removed
- Increased stopword table to 350 words
- Only the first 80 words in a post are added to the table
- Words of 3 letters or fewer are omitted, as are words of 15+ letters
- Omit all words with 5+ repeating letters (heeeee, ehhhhh, etc. are all omitted)

This is a pretty crippled search, but seeing that the board now has 3.8M posts, if I logged every word the search_wordmatch table would be gigantic.
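A rough Python re-creation of some of those filters (per-post word rules only; the subject-line, reply-count, and minimum-word-count rules are skipped, and the stopword set is a stub, not the real 350-word table):

```python
import re

# Stub stopword set; the real table described above had ~350 entries.
STOPWORDS = {"that", "with", "this", "have"}

def indexable_words(post_text, max_words=80):
    # Drop quoted text (phpBB [quote]...[/quote] blocks).
    text = re.sub(r"\[quote.*?\[/quote\]", " ", post_text, flags=re.S | re.I)
    words = []
    # [a-z]+ on lowercased text also drops all numeric text.
    for word in re.findall(r"[a-z]+", text.lower()):
        if len(word) <= 3 or len(word) >= 15:
            continue              # 3-letter-or-fewer and 15+ letter words
        if re.search(r"(.)\1{4,}", word):
            continue              # heeeee, ehhhhh, and friends
        if word in STOPWORDS:
            continue
        words.append(word)
        if len(words) == max_words:
            break                 # only the first 80 words get indexed
    return words

post = "[quote]old text[/quote] Heeeeello there, the forum has 3800000 posts"
print(indexable_words(post))  # ['there', 'forum', 'posts']
```

Each dropped word is one less row per matching post in search_wordmatch, which is where the table shrinkage comes from.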

As others had recommended before, I indexed the word_id within the search_wordmatch table, and I also did something to the search_wordlist:

- a "post_id" field is added
- word_common is changed from tiny integer to medium integer

When words are being added to the wordlist table, the post_id is added along with them. As for all the existing words, their post_id is updated and their word_common value is raised by one. Reasons being:

1 - I can delete old wordlist entries by specifying an old post_id. I was going to use a timestamp, but post_id takes up less space.

2 - It avoids running the find-common-words function, since I have no idea how long it'll take to run when the table is huge.

I have no idea to what degree I'm crippling the search feature, but considering that at one point I was going to add only the subject line to the tables, I guess this is a step in a better direction. :o

* Searchword adding is now a class in 2.2.0, so I snatched that and used it instead of modifying what's in 2.0.x. If anyone's interested in the class file, I can e-mail it to you. Using it is a simple alteration to the functions_post.php file.
kurtl
Registered User
Posts: 8
Joined: Thu May 30, 2002 2:07 pm
Contact:

Post by kurtl »

Lanzer -- would you mind posting the code you used to modify your wordsearch table? I'd be really interested in taking a look at it.

Also, what sort of hardware do you have to run that sort of site? Our forum isn't nearly as big, but we're still hitting some performance ceilings, so I'm curious to know how much metal you have behind your site.

Thanks for the performance tips, btw. We'll be looking at integrating these into our site.

--kurt