[HowTo] Decrease searchtables' size [not for beginners]

The 2.0.x discussion forum has been locked; this will remain read-only. The 3.0.x discussion forum has been renamed phpBB Discussion.
R. U. Serious
Registered User
Posts: 830
Joined: Mon Feb 11, 2002 2:07 pm

[HowTo] Decrease searchtables' size [not for beginners]

Post by R. U. Serious » Sat Aug 10, 2002 7:15 pm

Overview
  1. Introduction
  2. Fix functions_search.php (in 2.0.1 & 2.0.2)
  3. Make a search_stopwords for heavily indexed words.
  4. Delete search_stopwords from your searchtables
  1. Introduction
    The search function of phpBB uses two tables: search_wordlist which has a unique entry for each word that has ever been typed in your forum, and search_wordmatch which keeps track of which posts match which words from the search_wordlist.
    This is great for speed when searching, but has its drawback in the size of the searchtables. If you have more than enough db-space then don't mind reading this HowTo till the end. It will be of no/little relevance to you.. If however you are interested in reducing you db-size this is just for you! However you will need some knowledge in executing queries in your db (e.g. through phpmyadmin).
    With steps I describe in this HowTo I reduced my total db-size between 15-20%. (More specifically about 1 MB for a 3800 posts-forum (6.5MB to 5.4 MB)). Your results may vary greatly. This is not one of those "Loose weight fast"-scams; No, this really(!) works. I know, it sounds really cheep, but you wouldn't guess how many ppl. fall for these lines... *g*

    phpBB already brings functionality to avoid unnecessary growing of the searchtables. For exapmle it will mark 'heavily used' words with a common_word tag in the search_wordlist and stop adding further matches of this word in the search_wordmatch. For me this was only the case for a single word, so it is more like an "emergency exit", seldomly used, but good to have ;)
    Another thing phpBB does (and which I will concentrate on) is: it can use (and already uses, if you have english as default language) a blacklist of words, and all words on this list are not indexed at all, never. This blacklist is stored in a simple .txt file with one word per line (language/lang_esperanto/search_stopwords.txt). A list for the english language is already delivered with the regular download from your preferred phpBB-Dealer (just kidding :mrgreen: Only d/l phpBB from this site: www.phpbb.com ! ;) ).

    This HowTo will tell you how to make a little fix so that this list gets actually used. I submitted it as a bug report and guess the fix will be in 2.0.2 (edit: well it's not)
    Further it shows you how to find out the most used (and thereby) indexed words. And finally it provides you with a little script that will remove all search_entries from your list of words in search_stopwords.txt.
  2. Fix functions_search.php (in 2.0.1 & 2.0.2)
    This fix will probably be in 2.0.2 (edit: it is not). It was not in CVS 10. Aug. '02. If you use an older Version, you probably will have to make this fix.
    Open includes/functions_search.php and go to line 108 and 109, you will find

    Code: Select all

    	$search_raw_words['text'] = split_words(clean_words('post', $post_text, $stopword_array, $synonym_array));
    	$search_raw_words['title'] = split_words(clean_words('post', $post_title, $stopword_array, $synonym_array));
    
    Please replace with

    Code: Select all

    	$search_raw_words['text'] = split_words(clean_words('post', $post_text, $stopwords_array, $synonym_array));
    	$search_raw_words['title'] = split_words(clean_words('post', $post_title, $stopwords_array, $synonym_array));
    
    The change is very subtle, a single 's' per line in the variable $stopwords_array. edit: Geez, looks like I had mixed it up, it's corrected now. Thanks Ashe!
    This change is necessary so phpBB does not index words on the blacklist. Because of the typo the blacklist was always assumed to be empty.
  3. Make a search_stopwords for heavily indexed words.
    If you are using the english language as default, you may possibly skip this point, although I suggest you don't, because words used at different sites vary. At the two sites I am involved with there were a lot of different Top-Used-Words.

    To make the list very effective you don't want it to be unnecessarily long and at the same time include words that are part of the commonly used language. So they are used very often, yet do not indicate any kind of special information, for example " the, they, can, you, although, almost" etc. In German my Top-Used-Words were "der, die, das, ich" (meaning the, the, the, I ). To find out your list of Top-Used-Words you can run this query on your DB (e.g. in phpmyadmin):

    Code: Select all

    SELECT ls.word_id, ls.word_text, COUNT(wm.word_id)  as entries FROM `phpbb_search_wordlist` as ls LEFT JOIN `phpbb_search_wordmatch` as wm ON ls.word_id=wm.word_id GROUP BY wm.word_id ORDER BY entries DESC LIMIT 0,50
    
    This will return a list of your 50 Top-Used-Words in descending order of their occurence, with three columns:
    • the word_id - this is unique for every word and indicates the relevant entries in the two tables.
    • word_text - this shows you the actual word
    • entries - the number of entries in search_wordmatch for this word.
    Now look at each entry in that list and decide wether you want to keep it in your index or if you are going to delete it and make searching for it impossible. I used all but a handful of the words.

    Save these words in normal text file with one word per line and name that file search_stopwords.txt. Now put this list in the directory: language/lang_esperanto/search_stopwords.txt where esperanto is the language that is set as default-language in your general board-configuration in your AdminCP.

    Now these words will no longer be indexed for the searchfunction.
  4. Delete search_stopwords from your searchtables
    Now you want to remove the thousands of entries of the blacklisted searwords from your searchtables. And after that you should optmize the two tables.I have included an example script, that will do this. It is not pretty, but it works (for me). I have mySQL 3.23 and php 4.1, you might need to make modifications if you have a different setup.
    After you have run the script, you should delete it from the server (although it should not be dangerous to run it twice).

    That's it. Hopefully this works for you. Enjoy :)

Code: Select all

<?php
//*****  reduce_my_searchtables_with_stopwords.php ****//

define('IN_PHPBB', true);
$phpbb_root_path = './';
include($phpbb_root_path . 'extension.inc');
include($phpbb_root_path . 'common.'.$phpEx);
include($phpbb_root_path . 'includes/functions_search.'.$phpEx);

// Start session management
$userdata = session_pagestart($user_ip, PAGE_SEARCH);
init_userprefs($userdata);
// End session management

$stopwords_array = file($phpbb_root_path . 'language/lang_' . $board_config['default_lang'] . "/search_stopwords.txt"); 

$liste='';
foreach($stopwords_array as $curr_word)
{
		$liste .= ( ( $liste != '' ) ? ', ' : '' ) ."'".trim($curr_word)."'";
}

	$sql = "SELECT word_id 
		FROM " . SEARCH_WORD_TABLE . " 
		WHERE word_text IN ($liste)";
	if ( !($result = $db->sql_query($sql)) )
	{
		message_die(GENERAL_ERROR, 'Could not obtain common word list', '', __LINE__, __FILE__, $sql);
	}

	$common_word_id = '';
	while ( $row = $db->sql_fetchrow($result) )
	{
		$common_word_id .= ( ( $common_word_id != '' ) ? ', ' : '' ) . $row['word_id'];
	}

if ($common_word_id=='') message_die(GENERAL_ERROR,'None of the words in the list are in your search_tables.<br>Note: This could also mean the list is empty ;)');
	//echo '>'.trim($curr_word)."<<br>";
	//echo $liste .'<br>'. $common_word_id;
	//exit;

	$sql = "DELETE FROM " . SEARCH_WORD_TABLE . "  
		WHERE word_id IN ($common_word_id)";
	if ( !$db->sql_query($sql) )
	{
		message_die(GENERAL_ERROR, 'Could not delete word match entry', '', __LINE__, __FILE__, $sql);
	}
	$sql = "OPTIMIZE TABLE " . SEARCH_WORD_TABLE;
	if ( !$db->sql_query($sql) )
	{
		message_die(GENERAL_ERROR, 'Could not optimize', '', __LINE__, __FILE__, $sql);
	}

	$sql = "DELETE FROM " . SEARCH_MATCH_TABLE . "  
		WHERE word_id IN ($common_word_id)";
	if ( !$db->sql_query($sql) )
	{
		message_die(GENERAL_ERROR, 'Could not delete word match entry', '', __LINE__, __FILE__, $sql);
	}
	$sql = "OPTIMIZE TABLE " . SEARCH_MATCH_TABLE;
	if ( !$db->sql_query($sql) )
	{
		message_die(GENERAL_ERROR, 'Could not pütimize', '', __LINE__, __FILE__, $sql);
	}
	
	message_die(GENERAL_MESSAGE,'<b>Done!</b><br><br>The following list-entries have been removed from your searchtables:'.$liste);
//echo $liste .'<br>'. $common_word_id;

?>
Disclaimer: You yourself are fully responsible for what you do and wether you follow what some weirdo you never met tells you to. So don't be mad or hold someone else reponsible if somehow, something screws up you your forum while you are trying any of the things described here. ;)
Last edited by R. U. Serious on Fri Aug 30, 2002 1:02 pm, edited 1 time in total.

R. U. Serious
Registered User
Posts: 830
Joined: Mon Feb 11, 2002 2:07 pm

Post by R. U. Serious » Mon Aug 12, 2002 11:07 am

If anyone has tried this, some feedback would be nice... :)

netclectic
Former Team Member
Posts: 4439
Joined: Wed Mar 13, 2002 3:08 pm
Location: Omnipresent
Contact:

Post by netclectic » Mon Aug 12, 2002 11:13 am

Not yet, though i intend to.

Thanks for highlighting the functions_search problem though. :D
Defend the game:
Image

davidh44
Registered User
Posts: 386
Joined: Sat Mar 09, 2002 5:56 am

Post by davidh44 » Mon Aug 12, 2002 11:38 am

Just did this. Reduced my database size from 30MB (had already optimized all tables) to 25MB. :D

Thanks R.U. Serious. :)

crazy_gringo
Registered User
Posts: 8
Joined: Wed Feb 20, 2002 5:54 pm
Location: Belgium

Post by crazy_gringo » Mon Aug 12, 2002 4:40 pm

This is brilliant! My database size has just been reduced from 15.36 MB to 12.89 MB! :D

User avatar
santiniuk
Registered User
Posts: 21
Joined: Sat Apr 20, 2002 10:46 am
Contact:

Post by santiniuk » Mon Aug 12, 2002 7:46 pm

Looks like a great idea, hope to try out later tonight.
Thanks for taking the time to share your detailed report with us.

It leads me on to another question which I will see if can be answered here before starting a new thread.

I have a server with very limited SQL space which would not require any search facility. (Long story but trust me the search is not required :wink: )

Can anyone advise what would be required to stop data being written to the search tables totally ?

I would then remove the search icon from the forum.

Many thanks

User avatar
santiniuk
Registered User
Posts: 21
Joined: Sat Apr 20, 2002 10:46 am
Contact:

Post by santiniuk » Mon Aug 12, 2002 9:38 pm

Just wanting to follow up...

This process worked a treat for me.

Although not a huge saving my database went from 3.2Mb to 2.7Mb

Thanks again.

(I was a chicken and backed up my DB first :wink: )

R. U. Serious
Registered User
Posts: 830
Joined: Mon Feb 11, 2002 2:07 pm

Post by R. U. Serious » Mon Aug 12, 2002 9:56 pm

@santiniuk, actually you saved about 1/6 of space which seems to be what everybody is getting. Of course you can repeat this process, and add more top-words to your blacklist, although the savings in space will be less and less for each further word you add.

If you completely want to disable searching (can't imagine you won't regrett this) a good place might be either posting.php or functions_search.php, you can simply add a "return;" statement in the relevant functions.

lars_msh
Registered User
Posts: 36
Joined: Thu May 23, 2002 8:17 pm

Post by lars_msh » Mon Aug 12, 2002 10:11 pm

Thank you!

This is very useful, very much appreciated. I made it show the top 100 words and selected 65 of them. DB went from around 25 to 21Mb. It's near essential to look through the words - e.g. on my music forums "music", "album" and "single" probably ought to stay searchable!

I ran into the most minor troubles, just in case any newbies get thrown:

1. Don't forget to change the name of the tables in the SQL query if you don't have phpbb_ as your table prefix. For me, phpbb_search_wordlist is just search_wordlist and phpbb_search_wordmatch is search_wordmatch. Just depends how it's set up.

2. I decided to save that PHP script in the contrib directory for want of a better place, so it threw up some errors. I changed this line (near the top):

$phpbb_root_path = './';

To this:

$phpbb_root_path = '../';

Otherwise, just put it in your main forums directory.

Worked nicely. :-)

User avatar
santiniuk
Registered User
Posts: 21
Joined: Sat Apr 20, 2002 10:46 am
Contact:

Post by santiniuk » Mon Aug 12, 2002 10:56 pm

R.U

Excuse my ignorance but can you please expand on this.
If you completely want to disable searching (can't imagine you won't regrett this) a good place might be either posting.php or functions_search.php, you can simply add a "return;" statement in the relevant functions.


Thank you

R. U. Serious
Registered User
Posts: 830
Joined: Mon Feb 11, 2002 2:07 pm

Post by R. U. Serious » Mon Aug 12, 2002 11:02 pm

Sure no problem: Go to line 282 in functions_post.php (that line no. refers to 2.0.1) an d you will find
add_search_words($post_id, stripslashes($post_message), stripslashes($post_subject));


Try commenting it out (put a // in front of that line). And I think it will stop indexing any words, which will make your search utterly useless (as you already know).

;)

btw: The fix in the above how_to did not make it into 2.0.2, I guess they had to release it at one point ;) So you should still make that change even if you have 2.0.2

R. U. Serious
Registered User
Posts: 830
Joined: Mon Feb 11, 2002 2:07 pm

Post by R. U. Serious » Wed Aug 14, 2002 4:50 pm

After quite a few poeple reported this to work fine, maybe it would be a good idea to move this to the tutorials forum... just an idea...;)

User avatar
santiniuk
Registered User
Posts: 21
Joined: Sat Apr 20, 2002 10:46 am
Contact:

Post by santiniuk » Thu Aug 29, 2002 7:41 am

Just saved almost 10Mb on a 40Mb database using your method R.U

This is a really great tip. In my opinion this topic should be stored safe.

Thanks again

Anton Rongen
Registered User
Posts: 22
Joined: Wed Oct 16, 2002 1:50 pm
Location: The Netherlands
Contact:

Post by Anton Rongen » Wed Oct 16, 2002 1:51 pm

When I clear the table, when will it fill?

I mean, when I make a post on the forum will the word be added then to the database or is their continous a script running and putting the words in the table?

(sorry for bad english)

netclectic
Former Team Member
Posts: 4439
Joined: Wed Mar 13, 2002 3:08 pm
Location: Omnipresent
Contact:

Post by netclectic » Wed Oct 16, 2002 1:54 pm

The words are added when posts are made. So clearing the table will cause you to lose all words added already! 8O
Defend the game:
Image

Locked

Return to “2.0.x Discussion”