[3.0.x] Tweaks for large forums

Do not post support requests, bug reports or feature requests. Discuss phpBB here. Non-phpBB related discussion goes in General Discussion!
Scam Warning
User avatar
/a3
Registered User
Posts: 411
Joined: Sun Sep 19, 2010 9:08 am
Location: /dev/random

Re: [3.0.x] Tweaks for large forums

Post by /a3 »

You shouldn't delete any columns in any of the tables in phpBB. Most changes to the database will require changes to the code as well. ;)
User avatar
1234homie
Registered User
Posts: 439
Joined: Fri Sep 26, 2008 3:17 pm

Re: [3.0.x] Tweaks for large forums

Post by 1234homie »

DoubleTea wrote:
1234homie wrote:PS. what about topic_posted column in topics_posted table, I think its needless ;p
Really ? Can I delete this column or it will cause some problems ?
omg Don't delete it! xD It's require changes in some files... I only propose to do something with this column becouse in my opinion its not useful ;p

Some tips from me :P

FUNCTIONS.PHP
Hi, I think why we use time() here:

Code: Select all

		'CURRENT_TIME'					=> sprintf($user->lang['CURRENT_TIME'], $user->format_date(time(), false, true)),
instead of:

Code: Select all

		'CURRENT_TIME'					=> sprintf($user->lang['CURRENT_TIME'], $user->format_date($user->time_now, false, true)),
$user->time_now is seen to work faster then call time() there..

//////////////////////////

AROUND LINE 500 in /includes/mcp/mcp_warn.php:

Code: Select all

	// We add this to the mod log too for moderators to see that a specific user got warned.
	$sql = 'SELECT forum_id, topic_id
		FROM ' . POSTS_TABLE . '
		WHERE post_id = ' . $post_id;
	$result = $db->sql_query($sql);
	$row = $db->sql_fetchrow($result);
	$db->sql_freeresult($result);

	add_log('mod', $row['forum_id'], $row['topic_id'], 'LOG_USER_WARNING', $user_row['username']);
We call 1 query to posts table, but what if moderator warn user in user profile (not in post)? then we no need this query..

I've change this lines to:

Code: Select all

	// We add this to the mod log too for moderators to see that a specific user got warned.
	$row = array('forum_id' => 0, 'topic_id' => 0);
	if ($post_id != 0)
	{
		$sql = 'SELECT forum_id, topic_id
			FROM ' . POSTS_TABLE . "
			WHERE post_id = $post_id";
		$result = $db->sql_query($sql);
		$row = $db->sql_fetchrow($result);
		$db->sql_freeresult($result);
	}
	add_log('mod', $row['forum_id'], $row['topic_id'], 'LOG_USER_WARNING', $user_row['username']);
I think both changes should be applied to phpbb3 but.. im too lazy to use bugtracker ;p
Last edited by 1234homie on Mon Jan 03, 2011 8:03 pm, edited 4 times in total.
User avatar
1234homie
Registered User
Posts: 439
Joined: Fri Sep 26, 2008 3:17 pm

Re: [3.0.x] Tweaks for large forums

Post by 1234homie »

PS. Anyone trying to optimize this query?
In: functions_display.php (user activity in profiles)

Code: Select all

SELECT forum_id, COUNT(post_id) AS num_posts
FROM phpbb_posts
WHERE poster_id = 2
AND post_postcount = 1
AND forum_id NOT IN (forums)
GROUP BY forum_id
ORDER BY num_posts DESC
LIMIT 1
query time: 13 sec :shock: :shock: :shock: :shock: :shock:
User avatar
A_Jelly_Doughnut
Former Team Member
Posts: 34457
Joined: Sat Jan 18, 2003 1:26 am
Location: Where the Rivers Run
Contact:

Re: [3.0.x] Tweaks for large forums

Post by A_Jelly_Doughnut »

There's no real way to optimize that because of everything the query has to do. By default, the activity query is not run for users having > 5000 posts, you may wish to tweak that number in display_user_activity() in includes/functions_admin.php
A Donut's Blog
"Bach's Prelude (Cello Suite No. 1) is driving Indiana country roads in Autumn" - Ann Kish
User avatar
1234homie
Registered User
Posts: 439
Joined: Fri Sep 26, 2008 3:17 pm

Re: [3.0.x] Tweaks for large forums

Post by 1234homie »

A_Jelly_Doughnut wrote:There's no real way to optimize that because of everything the query has to do. By default, the activity query is not run for users having > 5000 posts, you may wish to tweak that number in display_user_activity() in includes/functions_admin.php
I need it for all users :D I work on it and I minimize load query as much as possible. Also this function is only used in memberlist.php so I move it to this file.

if we cant optimize this query so I think I've found best way and it is... AJAX :D

Why?
For example in one minute 1000 users views their profiles and we run 4 or more queries but... not all of them (maybe max. 1/7 of them) look for activity forums/topics so... We can do ajax request on_click and put in template something like "View stats" ;p

This may helps all who are not a noob in code writing xD

regards ;]
User avatar
A_Jelly_Doughnut
Former Team Member
Posts: 34457
Joined: Sat Jan 18, 2003 1:26 am
Location: Where the Rivers Run
Contact:

Re: [3.0.x] Tweaks for large forums

Post by A_Jelly_Doughnut »

1234homie wrote: I need it for all users :D I work on it and I minimize load query as much as possible. Also this function is only used in memberlist.php so I move it to this file.
It is also called in the UCP. I think its includes/ucp/ucp_front.php.
A Donut's Blog
"Bach's Prelude (Cello Suite No. 1) is driving Indiana country roads in Autumn" - Ann Kish
Oleg
Former Team Member
Posts: 1221
Joined: Sat Jan 30, 2010 4:42 pm
Location: NYC
Contact:

Re: [3.0.x] Tweaks for large forums

Post by Oleg »

Your best bet would be to cache the count in user table. I suspect it would take about the same time to run that query as it would take to aggregate by both users and forums, assuming a full table scan is involved, and if so you could calculate the counts hourly for all users with a single query with the entire update operation taking under a minute.

The calculation query:

select user_id, forum_id, count(*) from posts group by user_id, forum_id;

iterate through the resulting data set keeping a map like this:

$map[user_id][forum_id] = count

Then determine the forum_id for each user_id with the max count. Store it in a different map for future use.

Now you need to update the counts but there is no need to update all user rows - just update the changed ones.

select user_id, biggest_forum_id, biggest_post_count from users;

iterate through this data set comparing forum id/biggest post count to the one obtained above, if there is any difference update the user row otherwise don't update.

In memberlist/ucp query the user table instead of calculating.
Participate in phpBB development: Get involved | Issue tracker | Report a bug | Development board | [url=irc://chat.freenode.net/phpbb-dev]Development IRC chat[/url]
My stuff: mindlinkgame.com
arod-1
Registered User
Posts: 1327
Joined: Mon Sep 20, 2004 1:33 pm

Re: [3.0.x] Tweaks for large forums

Post by arod-1 »

actually, a much more crude way (with some approximations) might work even better:
add two columns to user table: one with the string which this process produces, and another one with the post-count for this user when the string was created.
use some heuristics for when to recalc (say, if post-count less than X always recalc. above it, require that the diff in post count between "now" and the stored number exceeds something - percent, absolute, or some combo).
this can easily cut down the number of times this query is called very significantly - prolly by an order of magnitude, with some small price in the precision of the data.

this may even allow you to continue using this data for users with > 5K posts - just recalc it every <large enough number here> posts or so...

peace.
jeffj
Registered User
Posts: 11
Joined: Fri Nov 12, 2010 3:21 am

Re: [3.0.x] Tweaks for large forums

Post by jeffj »

DoubleTea wrote:
1234homie wrote:PS. what about topic_posted column in topics_posted table, I think its needless ;p
Really ? Can I delete this column or it will cause some problems ?
DON'T delete that column. It gains virtually nothing and will cause code to break without modification.

I say this with all due respect to the author you quoted: there are some great posters in this thread, and there are people with a questionable understanding of the code base and hosting large forums. The poster you quote seems (in my personal opinion) to fall into the latter group. I'd be very wary of any suggestion he makes that isn't supported by other posters.
jeffj
Registered User
Posts: 11
Joined: Fri Nov 12, 2010 3:21 am

Re: [3.0.x] Tweaks for large forums

Post by jeffj »

arod-1 wrote:actually, a much more crude way (with some approximations) might work even better:
add two columns to user table: one with the string which this process produces, and another one with the post-count for this user when the string was created.
use some heuristics for when to recalc (say, if post-count less than X always recalc. above it, require that the diff in post count between "now" and the stored number exceeds something - percent, absolute, or some combo).
this can easily cut down the number of times this query is called very significantly - prolly by an order of magnitude, with some small price in the precision of the data.

this may even allow you to continue using this data for users with > 5K posts - just recalc it every <large enough number here> posts or so...

peace.
The problem is your proposed solution still causes a table scan, and a table scan for a 'large forum' (the point of this thread) is a non-starter. The 'right' solution is probably either a new table or new columns in the user table, with UPDATEs on post submission. Anything that's scheduled to run once every $N minutes probably doesn't scale, because when it runs you're going to be in trouble.
jeffj
Registered User
Posts: 11
Joined: Fri Nov 12, 2010 3:21 am

Re: [3.0.x] Tweaks for large forums

Post by jeffj »

1234homie wrote:
DoubleTea wrote:
1234homie wrote:PS. what about topic_posted column in topics_posted table, I think its needless ;p
Really ? Can I delete this column or it will cause some problems ?
omg Don't delete it! xD It's require changes in some files... I only propose to do something with this column becouse in my opinion its not useful ;p

Some tips from me :P

FUNCTIONS.PHP
Hi, I think why we use time() here:

Code: Select all

		'CURRENT_TIME'					=> sprintf($user->lang['CURRENT_TIME'], $user->format_date(time(), false, true)),
instead of:

Code: Select all

		'CURRENT_TIME'					=> sprintf($user->lang['CURRENT_TIME'], $user->format_date($user->time_now, false, true)),
$user->time_now is seen to work faster then call time() there..

Code: Select all

MBP:~ jeff$ cat test_time.php 
<?php

$start = microtime();

$revs = 1000000;
for($i = 0 ; $i < $revs ; $i++) {
        time();
}

$end = microtime();

var_dump($start);
var_dump($end);
var_dump($end - $start);

$avg = ($end - $start) / $revs;
var_dump($avg);

MBP:~ jeff$ php test_time.php 
string(21) "0.18050200 1294252928"
string(21) "0.48449100 1294252928"
float(0.303989)
float(3.03989E-7)
MBP:~ jeff$ 
This change saves on average ~0.000000303989 seconds on a modern processor, virtually zero gain and subjective readability improvements.
arod-1
Registered User
Posts: 1327
Joined: Mon Sep 20, 2004 1:33 pm

Re: [3.0.x] Tweaks for large forums

Post by arod-1 »

jeffj wrote:The problem is your proposed solution still causes a table scan, and a table scan for a 'large forum' (the point of this thread) is a non-starter. The 'right' solution is probably either a new table or new columns in the user table, with UPDATEs on post submission. Anything that's scheduled to run once every $N minutes probably doesn't scale, because when it runs you're going to be in trouble.
no it does not.
adding two more columns to the users table does not cause *any* additional query.
it *does* inflate the "$user" object, but only by a single string and a single int. if the query discussed here is heavy as was reported, saving it for this additional weight of the user object might be worth it.

unless of course i misread your post and you mean "a table scan" once, when the change is made.
if this is what you meant then i guess you are right, but isn't it the same issue with any installation of MOD or version upgrade?
if this is so, then just wait until the next time you have to shut the forum (say, next time you upgrade) and do it then...
and BTW: this is probably the case for every single solution suggested in this thread - for instance, adding a new index also requires "full table scan"...

peace.
jeffj
Registered User
Posts: 11
Joined: Fri Nov 12, 2010 3:21 am

Re: [3.0.x] Tweaks for large forums

Post by jeffj »

Edit: removed.
User avatar
1234homie
Registered User
Posts: 439
Joined: Fri Sep 26, 2008 3:17 pm

Re: [3.0.x] Tweaks for large forums

Post by 1234homie »

jeffj wrote:
DoubleTea wrote:
1234homie wrote:PS. what about topic_posted column in topics_posted table, I think its needless ;p
Really ? Can I delete this column or it will cause some problems ?
DON'T delete that column. It gains virtually nothing and will cause code to break without modification.

I say this with all due respect to the author you quoted: there are some great posters in this thread, and there are people with a questionable understanding of the code base and hosting large forums. The poster you quote seems (in my personal opinion) to fall into the latter group. I'd be very wary of any suggestion he makes that isn't supported by other posters.
thanks... I only suggested to remove the column and do some changes in the files for those who want to reduce the size of phpbb_topics_posted table. Now you understand me?
nn- wrote:Your best bet would be to cache the count in user table. I suspect it would take about the same time to run that query as it would take to aggregate by both users and forums, assuming a full table scan is involved, and if so you could calculate the counts hourly for all users with a single query with the entire update operation taking under a minute.

The calculation query:

select user_id, forum_id, count(*) from posts group by user_id, forum_id;

iterate through the resulting data set keeping a map like this:

$map[user_id][forum_id] = count

Then determine the forum_id for each user_id with the max count. Store it in a different map for future use.

Now you need to update the counts but there is no need to update all user rows - just update the changed ones.

select user_id, biggest_forum_id, biggest_post_count from users;

iterate through this data set comparing forum id/biggest post count to the one obtained above, if there is any difference update the user row otherwise don't update.

In memberlist/ucp query the user table instead of calculating.
This sounds like a best way for this but... Its not pretty easy but better than my :D
arod-1 wrote:actually, a much more crude way (with some approximations) might work even better:
add two columns to user table: one with the string which this process produces, and another one with the post-count for this user when the string was created.
use some heuristics for when to recalc (say, if post-count less than X always recalc. above it, require that the diff in post count between "now" and the stored number exceeds something - percent, absolute, or some combo).
this can easily cut down the number of times this query is called very significantly - prolly by an order of magnitude, with some small price in the precision of the data.

this may even allow you to continue using this data for users with > 5K posts - just recalc it every <large enough number here> posts or so...

peace.
This sounds to be good too. So I should cache post counts in users table and run query only if its less than X user posts?

forum/topic name is needless in users table becouse these queries are fastest and forum names by default are cached ;p
The problem is your proposed solution still causes a table scan, and a table scan for a 'large forum' (the point of this thread) is a non-starter. The 'right' solution is probably either a new table or new columns in the user table, with UPDATEs on post submission. Anything that's scheduled to run once every $N minutes probably doesn't scale, because when it runs you're going to be in trouble.
More queries on post submission is not good way for me - 100 posts per minute actually ;/


EDIT:

Finally, I created two columns in users table: activity_forum_posts, activity_topic_posts. Every hour I run 1 "big" query and cache array in file. In memberlist, if numbers from array are other than this in database, I update activity_forum_posts and activity_topic_posts. To keep it "up date" when user submit post, I've check forum/topic where he write it. If user write it in "activity" forum/topic then I simply add +1 for those columns. Its good or I should change something on the operation of it?


ps
I spend about 40 minutes to write this post xD I don't like this language.. :P
Oleg
Former Team Member
Posts: 1221
Joined: Sat Jan 30, 2010 4:42 pm
Location: NYC
Contact:

Re: [3.0.x] Tweaks for large forums

Post by Oleg »

jeffj wrote: Anything that's scheduled to run once every $N minutes probably doesn't scale, because when it runs you're going to be in trouble.
You should not be so categorical. For the sake of argument let's assume a few things:

1. Your dataset fits into ram. I expect most boards fit into ram entirely, for the purposes of this discussion we just need to fit the posts table into ram.

2. You have a multi-core processor or some other form of smp.

3. You have a database with mvcc.

Now, in these conditions doing a table scan requires no i/o, has no effect on concurrent reads or writes and loads at most one core. If you do it in response to user requests users will notice, but if you do it in an offline process chances are users won't be able to detect it behind network variance.

An alternative would be to add an index, for example, on user_id + forum_id on the posts table. But this index would have to be updated whenever someone posts, which is online and that update counts against response time. Maybe the update time would be negligible, in which case it would be a much simpler solution. If update time is not negligible then you might prefer a full table scan done offline.

Addendum: this last paragraph is/may be true for a query restricting by both user and forum. If you look at the original query again it must go through all of user's posts (no restriction by forum) to calculate the counts. For users with 5000+ posts it necessarily has to examine 5000+ rows in the posts table which, index or no index, takes time.

Addendum 2: actually, if the index maintains a count of child/leaf nodes in non-leaf nodes, and you add the postcount column to the index (otherwise you need to pull it out of the row anyway), and your database engine is actually smart enough to utilize all this information, conceivably even the original query may be answered by doing index lookups only. A lot of ifs in this sentence.
Participate in phpBB development: Get involved | Issue tracker | Report a bug | Development board | [url=irc://chat.freenode.net/phpbb-dev]Development IRC chat[/url]
My stuff: mindlinkgame.com
Post Reply

Return to “phpBB Discussion”