[2.0.x] Tweaks for large forums

The 2.0.x discussion forum has been locked; this will remain read-only. The 3.0.x discussion forum has been renamed phpBB Discussion.
Dog Cow
Registered User
Posts: 2507
Joined: Fri Jan 28, 2005 12:14 am
Contact:

Re: [2.0.x] Tweaks for large forums

Post by Dog Cow »

qspypl wrote:performance of include, include_once, require, require_once:
http://blog.arh.cc/index.php?/archives/ ... n-PHP.html

I found a benchmark like this, but I'm not sure how valuable it is. My question is: should we reduce the number of files in phpBB by removing includes? For example:
open file
index.php

find
include($phpbb_root_path . 'extension.inc');

replace it by
$phpEx = "php";
$starttime = 0;

delete file extension.inc
We can hardcode db/mysql4.php into includes/db.php instead of including it too, if of course it gives a noticeable speed gain. I read somewhere that when you're using an opcode cache, the including of files only costs anything on a recompile.
I haven't had those two files in over 2 years or so. Thanks for the performance link as well. Though require() is fastest, it does cause a white page when the file is missing, so it's not always the best to use (especially for a cache file!). Also, there's the file() function (and I think another similar one) where the contents to be read don't have to be a PHP file, so you're not parsing the file as PHP; you're just reading the contents, and can then use explode() or some other function to turn the contents into an array or variable.
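A minimal sketch of the idea, assuming a plain-text cache of rank titles; the file name and the "id|title" line format are invented for illustration, not taken from phpBB:

```php
<?php
// Sketch: read a cache file as plain text instead of include()'ing it as PHP.
// The file name and "id|title" line format are assumptions for illustration.
$cache_file = sys_get_temp_dir() . '/ranks_cache.txt';

// Pretend some earlier page request wrote the cache.
file_put_contents($cache_file, "1|Newbie\n2|Regular\n3|Admin\n");

// file() returns the contents as an array of lines;
// nothing is compiled or parsed as PHP.
$ranks = array();
foreach (file($cache_file) as $line)
{
	list($id, $title) = explode('|', trim($line));
	$ranks[(int) $id] = $title;
}
// $ranks is now array(1 => 'Newbie', 2 => 'Regular', 3 => 'Admin')
```

file_get_contents() would be the other "similar one": it returns the whole file as a single string, which you can then explode() yourself. Either way, a missing file is an ordinary failure you can test for, rather than a fatal white page.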


There are more include files, too, which can be removed, such as the template .cfg files. And regarding the whole template system, I've completely removed all logical management of templates, such as database tables/entries or those .cfg files. As far as my template engine is concerned, "Here is where my .tpl files are located!" and that's that. Some speed-up, plus critical errors (rare, yes, but still nice to have templated) can now use the full template instead of a user-unfriendly white page. The requirement: you must use only one template! Another nice thing is that since I removed all the user-facing graphics and other files from the tpl dir, I've chmod'ed the directory to disallow access, so only my template engine can get to the files. Later, they will be moved below the web dir.

Update on Multitable Post Distribution
Ok, I released an almost complete version of the doc a page or two back, and I started writing the working demo, which is a forum where you can post new topics/replies, and obviously it uses the methods discussed in the article. There's no authentication or template or any of that nonsense; just a bare-bones demonstration of the concept. It's about 70% complete, and is built and tested on MySQL 5.0.67 and PHP 5.2.4.

This is coming soon; hopefully next week!
Last edited by Dog Cow on Fri Aug 22, 2008 7:05 pm, edited 1 time in total.
da_badtz_one
Registered User
Posts: 376
Joined: Thu Jan 29, 2004 8:25 pm

Re: [2.0.x] Tweaks for large forums

Post by da_badtz_one »

There are some great comments getting thrown around here, thanks to Dog Cow especially for writing up that guide.

I'd like to share a tweak I made a while ago which I forgot to share.

The following query is the one you'll find where phpBB counts every user in your database for statistical information. To avoid rewriting the whole user-counting process to work the way phpBB3 does, I used this nifty tweak I found.

So this query:

Code: Select all

	$sql = 'SELECT COUNT(user_id) AS total
		FROM ' . USERS_TABLE . '
		WHERE user_id > ' . ANONYMOUS;
becomes:

Code: Select all

	$sql = 'SELECT COUNT(user_id) -1 AS total
		FROM ' . USERS_TABLE;
The difference is noticeable, as these results show:

Code: Select all

mysql> SELECT COUNT(user_id) -1 AS total FROM driaw_users ;
+--------+
| total  |
+--------+
| 452979 |
+--------+
1 row in set (0.38 sec)

mysql> SELECT COUNT(user_id) AS total FROM driaw_users WHERE `user_id` > -1 ;
+--------+
| total  |
+--------+
| 452979 |
+--------+
1 row in set (2.26 sec)
I still do not know the reason why it is faster, but if it returns the same result, then hey, why not? :)

One of the things we learn from using COUNT is that it is slow against large tables. The best way to keep statistical information for anything we want to count is to keep a running number in the config table instead. However, this requires accuracy: the stored count must be kept consistent with every change made. It will initially be hard to convert everything over to this method, but at least there is a way to tell whether you are still applying these count changes accurately: compare this method to the old method and see if there are any differences.

For example, if I deleted a user, the user count will decrease by 1.
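A sketch of what keeping that running count in sync could look like. The table constants mimic phpBB's style, and the 'total_users' config_name is an assumption for illustration, not actual phpBB2 code:

```php
<?php
// Sketch: maintain a running user count in the config table instead of
// COUNT()ing the users table on every stats page.
// The 'total_users' config_name is an assumption, not actual phpBB2 code.
define('USERS_TABLE', 'phpbb_users');
define('CONFIG_TABLE', 'phpbb_config');

// When a user is deleted, decrement the cached statistic in the same request:
$sql = "UPDATE " . CONFIG_TABLE . "
	SET config_value = config_value - 1
	WHERE config_name = 'total_users'";

// Periodic sanity check: the cached value should match the old slow method.
$check_sql = 'SELECT COUNT(user_id) - 1 AS total FROM ' . USERS_TABLE;
```

Running the check query occasionally (say, from a cron job) tells you whether some code path is still changing user rows without updating the cached count.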
A_Jelly_Doughnut
Former Team Member
Posts: 34459
Joined: Sat Jan 18, 2003 1:26 am
Location: Where the Rivers Run
Contact:

Re: [2.0.x] Tweaks for large forums

Post by A_Jelly_Doughnut »

da_badtz_one: count() with no WHERE clause is cached in MyISAM tables. That's why taking the full table count and subtracting one is considerably faster than scanning the whole table, applying a WHERE clause, and throwing out just one row :)
A Donut's Blog
"Bach's Prelude (Cello Suite No. 1) is driving Indiana country roads in Autumn" - Ann Kish
qspypl
Registered User
Posts: 149
Joined: Wed Dec 20, 2006 7:15 pm

Re: [2.0.x] Tweaks for large forums

Post by qspypl »

Don't want to be rude, but I already developed that on page 72 of this topic, da_badtz_one.
Bottom of my big post :P.

I put so many queries there you could miss it.

Code: Select all

http://www.phpbb.com/community/viewtopic.php?p=4697685#p4697685
da_badtz_one
Registered User
Posts: 376
Joined: Thu Jan 29, 2004 8:25 pm

Re: [2.0.x] Tweaks for large forums

Post by da_badtz_one »

Whoops, sorry there qspypl.

@A_Jelly_Doughnut: Thanks. Though I am surprised that I get a faster result even with InnoDB, as I remember reading somewhere that InnoDB tables do not cache the total number of entries.
Dog Cow
Registered User
Posts: 2507
Joined: Fri Jan 28, 2005 12:14 am
Contact:

Re: phpBB tweaks for large forums

Post by Dog Cow »

arod-1 wrote: [soap-box speech]
"and now to something completely different" department:
as to splitting the posts table: imho, this is a huge mistake.
someone told me once: "when you open a can of worms, the only way to put them back is to bring a larger can". in this case, you'll need a huge can, and the gain is minuscule.
just don't do it. you will end up with a crippled product, much more complex code, and no real gain.
just forget about it.
[/soapbox]
Hi, first of all, splitting the posts table wouldn't be something that any forum would do out of the box; it would really be something that someone would either write a forum of his own for, or take a pre-existing one and modify.

Second, the code to do it isn't terribly complex, as you'll see when you read the doc and/or view the demo forum.

Third of all, imagine a forum with 1.3 billion posts (I know of one). Do you think all 1.3 billion posts are stored in the same table? I don't.

Now, you have a point if you mean to say it's not practical. And you're right: for a forum with less than a few million posts, there's no point. But otherwise, I think it's a worthy endeavor. Plus, I thought it would be a fun project to do! :D

Add on: I just found this article- Why MySQL could be slow with large tables ?
arod-1
Registered User
Posts: 1327
Joined: Mon Sep 20, 2004 1:33 pm

Re: [2.0.x] Tweaks for large forums

Post by arod-1 »

the thing is this: by storing your 1.3 billion posts in 20 tables, you did not do much to solve the problem.
if you want all those 1.3 billion posts to be immediately accessible, every ancient topic postable, every post editable etc., then storing them in several tables does not solve much of the problem, does it?
say you split it by post-id: every post with post id in a known range will reside in a known table.
great, for all queries that need to read a post with a known post id#, you use a smaller table.
however, for all the queries that need to find a post by anything other than post id#, e.g. author or topic id, you actually made the problem worse.
now guess what: most of the time, we search for posts by topic id!

in the specific (somewhat perverted) case of phpbb2, where the posts table is artificially split into PREFIX_posts and PREFIX_posts_text, there might be a case to split the posts_text table[1], but i can't see how splitting the XX_posts helps you: instead of having to scan a single table with 1.3 billion posts, you will have to scan, say, 13 tables with 100 million posts each. net gain: 0. code complexity: significant increase.

now, if you're talking about "archiving", that's a different story altogether, and that makes a lot of sense.
in general, this means keeping in the "live" board only a fraction of the posts, say from the last XXX days, and periodically moving chunks to a second, "archive" board. of course, the archive board is read-only, and many other features are disabled.
the idea here is that the archive board will be a dog performance-wise, but at the same time it will see much less use: most people are interested in current discussions, and dives into history are relatively rare.
so you have a high-activity, small (relatively speaking) database, and a second, low-activity huge database, so the performance of these two boards may not be in such disparity.
with (very) clever linkage between the 2 boards you may even make it feel like a single board, although i doubt you can keep a live thread with a dead tail without an unreasonable level of complexity.
basically, you will have to decide that a thread that had no activity for XXX days becomes frozen, and moves in its entirety to the archive.
this kind of operation is much more involved than "splitting the posts table".
(e.g., it would make sense to keep the "live" and "archive" not only in 2 separate tables, but in two separate databases.)
the batch operation that moves posts from the live side to the archive side is also far from trivial. (iirc, bugs in the "prune" operation existed for several years after 2.0.0, and i am not 100% sure it's completely clean even today....)
the more i think about it the more it sounds like a fun project to do, but the scope of such a project is significantly larger than "splitting the posts table".


[1] as to the posts_text table: this table is used for linear access only, with a known post_id.
it is conceivable that splitting this table might help some, but i'd like to see numbers before i take it seriously.

of course, all the above assumes that neither phpbb's brain-damaged "native search" nor mysql's slightly-less-brain-damaged fulltext search is active.
on any reasonably large board (let alone your 1.3 billion posts board) i don't believe either of those is even possible.
standard disclaimer:
backup your db and files before you do anything.
absolutely no guarantee.
if you do what i advise and it blows in your face, all you'll hear from me is: "ah... sorry, i guess"
da_badtz_one
Registered User
Posts: 376
Joined: Thu Jan 29, 2004 8:25 pm

Re: [2.0.x] Tweaks for large forums

Post by da_badtz_one »

I respectfully disagree with arod-1. There is much to gain even from just splitting the posts table into many smaller tables. I agree that splitting the posts table into smaller tables on one server would have little effect. But spreading these tables over a number of servers will certainly help distribute the database.

I think this is more beneficial than having MySQL slaves, because we will eventually have indexes too big to fit into RAM. This method will allow our indexes of post ids, forum ids, etc. to fit into the RAM of different servers and maintain a decent level of performance.

The only downside is that we must assume all posts are equally accessed, which is not the case. So perhaps distributing all posts from the current week across all servers, all posts between a week and a month old across all servers, and so on, will solve it.
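A naive sketch of the idea of spreading topics across database servers: route by topic id modulo the server count. The host names and the modulo scheme are assumptions for illustration, not anything from phpBB:

```php
<?php
// Sketch: distribute post tables across several database servers by
// hashing the topic id to a server. Host names are invented for this example.
$db_servers = array('db1.example.com', 'db2.example.com', 'db3.example.com');

function pick_server($topic_id, $servers)
{
	// Simple modulo routing: every topic maps to exactly one known server.
	return $servers[$topic_id % count($servers)];
}

// All of topic 452980's posts would live on one known host:
$host = pick_server(452980, $db_servers);
```

Note that plain modulo routing assumes uniform access, which is exactly the weakness described above: recent topics are far hotter than old ones, which is why a time-window distribution (spreading each week's posts across all servers) would balance the load better.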
Dog Cow
Registered User
Posts: 2507
Joined: Fri Jan 28, 2005 12:14 am
Contact:

Re: [2.0.x] Tweaks for large forums

Post by Dog Cow »

arod-1 wrote:the thing is this: by storing your 1.3 billion posts in 20 tables, you did not do much to solve the problem.
if you want all those 1.3 billion posts to be immediately accessible, every ancient topic postable, every post editable etc., then storing them in several tables does not solve much of the problem, does it?
say you split it by post-id: every post with post id in a known range will reside in a known table.
great, for all queries that need to read a post with a known post id#, you use a smaller table.
however, for all the queries that need to find a post by anything other than post id#, e.g. author or topic id, you actually made the problem worse.
now guess what: most of the time, we search for posts by topic id!

in the specific (somewhat perverted) case of phpbb2, where the posts table is artificially split into PREFIX_posts and PREFIX_posts_text, there might be a case to split the posts_text table[1], but i can't see how splitting the XX_posts helps you: instead of having to scan a single table with 1.3 billion posts, you will have to scan, say, 13 tables with 100 million posts each. net gain: 0. code complexity: significant increase.
First of all, I spent quite a lot of time thinking about how it would work, plus I read some more documents last evening that truly recommend partitioning large tables to make the data set smaller.

Have you read my document yet? I get the feeling that you haven't. Here it is, the early version: http://dserver.macgui.com/multi-table.rtf

Here's a direct quote from it:
Option 3: Group By Topic ID

This is the third option, and as you have read earlier, apparently the best option from the list. Unlike the first two options, grouping by topic ID has the fewest drawbacks. The method is simple: posts are grouped into tables based on their topic ID. Table 1 may contain topic IDs 1 through 5 million, and table 2 could store posts from topic IDs above 5 million. Using this method, we can further specify that having unique post IDs across all the tables is no longer necessary. That is, with the other two methods, the post ID column, which is generally an auto-increment value, would have to be updated across all other tables when a new post is inserted, to prevent duplicate values. However, by instead grouping the post tables by topic ID, and then assigning each post its in-topic number (i.e. the first post of a topic is 1, the second 2, etc.), we eliminate the ability to reference posts solely by their ID, and therefore can build a potentially infinitely-scalable system by merely adding new tables as the number of topics grows. With this system, a post would be referenced by its topic ID and position in the topic, so the very first post of topic 10 would be referenced like so: 10_1; the 60th post of topic 780 would look like this: 780_60; and so on. In essence, the post ID is no longer a primary key to locate the row, but merely a number used to keep posts appearing in the correct order in their respective topics.
The post_id is no longer needed!

You can't split by post_id. Well, you can, but that's moronic. Imagine a topic whose posts straddle two tables. Then what are you going to do? Query the second table and UNION the results? No, of course not! You keep all posts for a topic in the same table.

Which table, then? Well, you update the topic's row to record which table its posts are in. The topic row will know, and your forum will then query the correct table.
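The routing described above can be sketched in a few lines; the column name posts_table_id and the table naming scheme are assumptions for illustration, not taken from the document:

```php
<?php
// Sketch of the routing idea: the topics row records which posts table
// holds the topic's posts. Column and table names are invented here.
function get_posts_table($topic_row)
{
	// e.g. posts_table_id of 3 => table phpbb_posts_3
	return 'phpbb_posts_' . (int) $topic_row['posts_table_id'];
}

// Suppose the topics table says topic 780's posts live in table 3:
$topic_row = array('topic_id' => 780, 'posts_table_id' => 3);

$sql = 'SELECT * FROM ' . get_posts_table($topic_row) . '
	WHERE topic_id = ' . (int) $topic_row['topic_id'] . '
	ORDER BY post_position ASC';
// Every post for topic 780 lives in phpbb_posts_3, so no UNION is ever needed.
```

The one extra cost is that viewtopic-style pages must read the topic row first to learn the table name before they can fetch posts.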
arod-1
Registered User
Posts: 1327
Joined: Mon Sep 20, 2004 1:33 pm

Re: [2.0.x] Tweaks for large forums

Post by arod-1 »

Dog Cow wrote:Have you read my document yet? I get the feeling that you haven't. Here it is, the early version: http://dserver.macgui.com/multi-table.rtf
no i haven't.
after reading your post, though, i did open the document, but was immediately confronted by:
This entire document is the exclusive intellectual property of, and is copyright 2008 by, D. Finnigan. All rights reserved. No portion of this document may be reproduced in any form without the express permission of D. Finnigan.
i don't want to sound like a prude, but this attitude is so antithetical to an OS project that i stopped there and then. (btw, even your short excerpt seems to violate the copyright notice, unless, of course, you are D. Finnigan or you have her permission...)

i agree with you that my initial thought that one would split the tables by post id does not make too much sense, and the right way to do it is by topic id. of course, this means that the tables are not exactly nicely and evenly split, but the statistics of large numbers take care of that...

however, this makes some operations significantly more complex than they already are.
specifically, joining topics (not a built-in operation in 2.0, but there are numerous mods that do it, and it *is* a builtin in 3.0), and the simple fact that a post id alone is enough to find the post, and hence its topic. the latter fact is heavily used by all search systems, as well as by many links (e.g., my cute and forgotten MOD for "better quoting", which created a link to the original post as part of the generated quote header, a standard feature on all leading forum systems except phpbb); all of these use the fact that "viewtopic.php?p=POST_ID" is enough.

the above does not mean it is not doable (in contrast with my initial reaction), but i still believe that splitting the posts table requires very significant amounts of code changes (which, btw, means that using it will prevent you from using many if not most existing MODs). in other words, i suspect that such a change will, in effect, constitute a fork of the project, in much the same way that the venerable "Forum Hierarchy" MOD did.

at this point, maybe you want to take something else into account: phpbb was never a speed demon. for all its virtues, high performance was never high on its list. in fact, there are other BBSs around that run circles around phpbb without ever breaking a sweat [EDIT] i only count open-source BBS systems. proprietary systems do not interest me [/EDIT]. if you are considering such a significant (twisi) operation on phpbb just to squeeze some drops of performance out of the slug (forgive my language), maybe you want to look first at alternatives...

it's a long time since i read this thread in its entirety, but as far as i remember, all the changes ("tweaks") discussed are local in nature, and can be applied to one or two files with little effect elsewhere.
what you suggest is sweeping code and db change which affects large portions of the core.

(don't get me wrong- phpbb is still my BBS of choice, it's just that i'm aware that performance is not one of its strong suits)

peace.
Dog Cow
Registered User
Posts: 2507
Joined: Fri Jan 28, 2005 12:14 am
Contact:

Re: [2.0.x] Tweaks for large forums

Post by Dog Cow »

First of all, my name is David Finnigan, and I referred to it as "my" document. I wrote it, and I hereby give only myself permission to redistribute it.


You left out the rest of that paragraph you quoted. Here it is in its entirety:
Copyright Notice wrote: This entire document is the exclusive intellectual property of, and is copyright 2008 by, D. Finnigan. All rights reserved. No portion of this document may be reproduced in any form without the express permission of D. Finnigan. Take note, however, that the ideas presented within this document are not protected and may be taken and implemented by anyone.
That is a fairly standard thing called an IP, or copyright notice. It means that I own the text, since it came from my brain and I recorded it. It then goes on to state that anyone may use the ideas presented in the document. This is pretty universal. You can't go around and steal my work and words, but you are allowed to go out and implement my ideas.

Second of all, I intentionally left it forum-neutral. It does not mention any distinct forum software, and never will. That would be pointless, since the ideas and practices can be applied to any forum software out there, even custom-made forums, such as the one which will accompany the document when it is completed.

It has been written this way, at a technical level, such that anyone with sufficient knowledge and know-how may benefit from it, by implementing it in his/her forum if he/she so chooses.
arod-1 wrote: specifically, joining topics (not a built-in operation in 2.0, but there are numerous mods that do it, and it *is* a builtin in 3.0)
Yes, I agree. Some (minor) functionality will be lost. But one must weigh the setbacks against the gains. I myself have never needed to split or merge a topic in the 3 years I've been running my forum. This may be the case for others, or not. However, one must decide whether losing some features is a worthy trade-off for having a forum which could potentially handle larger numbers of editable, quotable, and readable posts and topics.

Overall, this topic has seen a lot of discussion in the past few days, which is good. :)
drathbun
Former Team Member
Posts: 12204
Joined: Thu Jun 06, 2002 3:51 pm
Location: TOPICS_TABLE
Contact:

Re: [2.0.x] Tweaks for large forums

Post by drathbun »

Splitting large database tables is best handled at the database layer, where the complexity can be masked / hidden from the application. The application should not have to check where the data is, it should just request it, and it should be presented. Super-large databases are typically partitioned.

http://en.wikipedia.org/wiki/Partition_(database)

I'm not saying that Dog Cow's concepts don't have value, but I'm not sure I would take that approach without first checking to see what my database engine offers.

MySQL 5.1 offers partitioning:

http://dev.mysql.com/doc/refman/5.1/en/ ... rview.html

The first challenge with a partition scheme is that you only get one. :) That means you have to carefully examine your data retrieval needs and determine the optimal way to split your data across partitions. As a very simple example, suppose that you were storing records for the 52 cards in a deck of playing cards. You could partition by card designation (A, K, Q, J, 10, 9, 8...) or by suit (clubs, diamonds, hearts, spades) or even by color (red, black). Each has advantages and disadvantages. If you want to quickly return information about all of your spades, then the suit partition works very well. However, if you partition by suit and then want to know everything about the Aces, you're stuck hitting every partition to return the data.
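The playing-card example might look like this as MySQL 5.1 LIST partitioning DDL (the table and column names are invented for the illustration, and this is a sketch of the feature rather than anything from phpBB):

```php
<?php
// Sketch of the playing-card example as MySQL 5.1 LIST partitioning.
// Table and column names are invented for this illustration.
$sql = "CREATE TABLE cards (
		card_rank CHAR(2) NOT NULL,
		suit_id   TINYINT NOT NULL
	)
	PARTITION BY LIST (suit_id) (
		PARTITION p_clubs    VALUES IN (1),
		PARTITION p_diamonds VALUES IN (2),
		PARTITION p_hearts   VALUES IN (3),
		PARTITION p_spades   VALUES IN (4)
	)";
// A query restricted to suit_id = 4 touches only p_spades, while a query
// on card_rank alone must scan all four partitions, as described above.
```

Note that MySQL 5.1 LIST partitioning requires an integer partitioning expression, which is why the suits are encoded as 1 through 4 here rather than as strings.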

Partitions are independent of index schemes.

There are several advantages of partitions. First, each partition can be a separate file, giving you greater concurrent access and the ability to spread your database over multiple devices. Second, if you access your data via your partition path, you immediately eliminate n% of the rows based on the distribution of your data across partitions; you never even have to hit the index for those rows. Third, and this addresses the concepts put forth earlier in the topic, partitioning is 100% independent of your application. You don't do a single thing differently in your code after you add partitions to your database.
I blog about phpBB: phpBBDoctor blog
Still using phpbb2? So am I! Click below for details
Dog Cow
Registered User
Posts: 2507
Joined: Fri Jan 28, 2005 12:14 am
Contact:

Re: [2.0.x] Tweaks for large forums

Post by Dog Cow »

Thanks for posting that link; I had done some Google searches a day or two ago and read that article then. I had some doubts about it, though. One is that it is for MySQL 5.1, which is not the official, stable release yet. The other was the limitations on partitioned tables:
- no fulltext indexes
- no repair/rebuild queries
- and no INSERT DELAYED.
drathbun
Former Team Member
Posts: 12204
Joined: Thu Jun 06, 2002 3:51 pm
Location: TOPICS_TABLE
Contact:

Re: [2.0.x] Tweaks for large forums

Post by drathbun »

MySQL is hardly the standard for enterprise-sized databases. :) Oracle / DB2 / Teradata all offer very robust partitioning schemes for very large databases; I was just linking to MySQL's docs as a way for folks who are not aware of the concept to pick up some details. You are correct in that partitioning is not in the current 5.0 community server release.

I was trying to emphasize that if you get to the point of having such a large database, there are other options besides messing with the application code. In my opinion, having the database engine do the work is far better. As a theoretical exercise (something I do enjoy, so I understand the interest :) ) the concepts are interesting. But I can't see that I would ever personally consider the approach that you've outlined.
Dog Cow
Registered User
Posts: 2507
Joined: Fri Jan 28, 2005 12:14 am
Contact:

Re: [2.0.x] Tweaks for large forums

Post by Dog Cow »

drathbun wrote:interesting. But I can't see that I would ever personally consider the approach that you've outlined.
Neither can I, since I don't suppose we'll ever have need for partitioning or any such thing.

Fortunately, September 1st is a lovely Labor Day, so I will be spending it laboring on finishing up the multi-table thing.

A second thought came to my mind just recently as well: partitioning across multiple database servers, which I don't think MySQL can do, but maybe some of the other enterprise DBMSs can. Do you know?