[ALPHA] Search Stopwords Manager 0.2.0 (Last update Feb 25)

A place for MOD Authors to post and receive feedback on MODs still in development. No MODs within this forum should be used within a live environment! No new topics are allowed in this forum.
Forum rules
READ: phpBB.com Board-Wide Rules and Regulations

IMPORTANT: MOD Development Forum rules

On February 1, 2009 this forum will be set to read only as part of retiring of phpBB2.
User avatar
drathbun
Former Team Member
Posts: 12204
Joined: Thu Jun 06, 2002 3:51 pm
Location: TOPICS_TABLE
Contact:

[ALPHA] Search Stopwords Manager 0.2.0 (Last update Feb 25)

Post by drathbun »

*** 0.2.0 ***
Feb 25, 2006 updates
Changed to ALPHA status, assigned version 0.2.0
Posted screen shots and detailed development notes in this post later in this topic.
No code is available for downloading for this version
*** END 0.2.0 Notes ***

*** Original post text below ***
One of the more common complaints from any size board is the size of the phpbb_search_wordmatch table. Short summary: this table contains the association between words in phpbb_search_wordlist and the posts that those words appear in. This table is crucial to support the standard phpBB search algorithm; if you empty this table, you cannot search the posts on your board.

When words reach a certain threshold, they becomes less useful (and more stressful) to use as search terms. One answer to this is to update your stopwords file. There are two issues ... one is that the file is a text file (rather than a database table) so the only way to insert / update / delete is to open the file directly. The second is that adding a word to the stopwords list does not remove current records from the phpbb_search_wordmatch table.

I am tossing this idea out for development. It could be a nice tool for board admins. I would envision this being an entry for the ACP with several options.
  • Edit stopwords file
    • Provide a way to edit the stopwords table
    • Track changes so that new stopwords will be removed from the phpbb_search_wordmatch and phpbb_search_wordlist tables
  • Stopword Suggestions
    • Scan current phpbb_search_wordmatch table, counting instances of a word
    • List most common words as candidates for new stopwords
    • Provide easy interface for a "one-click" stopword add
As this is just an idea at this point, I would entertain any further suggestions. Disclaimer: some of the operations required by this MOD would cause heavy database access. I would suggest that the MOD only be used during "off-peak" hours as a result. For example, the following query could be used to generate the suggested stopwords list:

Code: Select all

mysql> select w.word_id, w.word_text, count(*) as word_instances
    -> from phpbb_search_wordlist w
    -> , phpbb_search_wordmatch m
    -> where w.word_id = m.word_id
    -> group by w.word_id, w.word_text
    -> order by word_instances desc
    -> limit 25;
+---------+-----------+----------------+
| word_id | word_text | word_instances |
+---------+-----------+----------------+
|      92 | code      |            417 |
|      36 | work      |            266 |
|      44 | forum     |            259 |
|      68 | i         |            253 |
|      32 | php       |            247 |
|    1787 | think     |            233 |
|      23 | site      |            218 |
|       3 | phpbb     |            216 |
|     372 | user      |            213 |
|     127 | make      |            211 |
|      25 | done      |            199 |
|    1342 | change    |            188 |
|    1439 | ill       |            184 |
|    1343 | changes   |            184 |
|    3837 | phase     |            181 |
|    1531 | post      |            178 |
|      20 | board     |            177 |
|    1855 | server    |            177 |
|      96 | database  |            177 |
|    1576 | set       |            172 |
|      22 | test      |            171 |
|      63 | back      |            166 |
|    1405 | file      |            166 |
|      31 | ive       |            159 |
|    2394 | let       |            158 |
+---------+-----------+----------------+
25 rows in set (0.50 sec)
As you can see, the query ran in 0.50 seconds on a board with only 57K rows in the phpbb_search_wordmatch table. The same query run on a board with nearly 5 million records in the wordmatch table takes between 20-30 seconds.

Of course this MOD could be written to include the ability to edit the search_synonyms.txt file as well, but that has less impact on the search performance.

Thoughts? Suggestions? Good idea? Bad idea?
Last edited by drathbun on Sat Feb 25, 2006 6:44 am, edited 1 time in total.
I blog about phpBB: phpBBDoctor blog
Still using phpbb2? So am I! Click below for details
Image
penelope_b
Registered User
Posts: 70
Joined: Thu Dec 01, 2005 3:37 am

Post by penelope_b »

Suh-weet idea. Love it.

(Says the chick who just got done doing a bunch of stopwords editing and SQL querying by hand. :P)
TStarGermany
Registered User
Posts: 16
Joined: Fri Nov 11, 2005 7:58 pm
Contact:

Post by TStarGermany »

this is going to be a useful mod...

btw/offtopic: does anyone offer a larger search_stopwords.txt than the one delivered with phpbb for downloading ? the search didn't give any useful results
DemonBob
Registered User
Posts: 226
Joined: Thu Jan 15, 2004 7:04 pm
Location: Louisiana, United States
Contact:

Post by DemonBob »

This would be a nice idea, it would be even better if someone could port the stopwords to be read from the database, in it's own table, then modifiy the search.php to look for the stop words in the db.
TStarGermany
Registered User
Posts: 16
Joined: Fri Nov 11, 2005 7:58 pm
Contact:

Post by TStarGermany »

i just created a stopwords list out of the 1000 most common english words (i manually took out the ones who might be useful,especially nouns etc) about 630 remain,that's ~3 times as much as the OEM list ^^ ... i could post it if someone is interested in it...

(don't look at me, i can't attach it 8)
User avatar
drathbun
Former Team Member
Posts: 12204
Joined: Thu Jun 06, 2002 3:51 pm
Location: TOPICS_TABLE
Contact:

Post by drathbun »

TStarGermany wrote: this is going to be a useful mod...

It's also not going to be developed for a while, due to the underwhelming response to my initial post. At least by me, anyone else is free to take the idea and run with it.
does anyone offer a larger search_stopwords.txt than the one delivered with phpbb for downloading ?

In my opinion, the stopwords list should be customized to your board. What works as a stopword for me might not work for you. That's part of what I envisioned this MOD to do... was to suggest potential new stopwords based on frequency of word use in your database.
DemonBob wrote: This would be a nice idea, it would be even better if someone could port the stopwords to be read from the database, in it's own table, then modifiy the search.php to look for the stop words in the db.

Aack! Everyone else is trying to cache things by writing text files, and you want to move something back into the database! :-D In all seriousness, for something that does not change regularly a text file is the best solution.
I blog about phpBB: phpBBDoctor blog
Still using phpbb2? So am I! Click below for details
Image
TStarGermany
Registered User
Posts: 16
Joined: Fri Nov 11, 2005 7:58 pm
Contact:

Post by TStarGermany »

hm what a pity..
i certainly believe that the searchwordlist, respectively its creation is something that deserves attention,more than the stopwords text file...

looking through my search_wordlist table i realise there are several problems
a) wrong spelling by the author...awful!!!
b) words that are illigitimately connected to each other by the writer, for example: 100miles, 50ps, 11999$,ecofriendliness and so one..
c) creative words/made up stuff...e.g. damz, omfg!!11eleven,muaa,muahaha etc.pp

those 3 totally useless groups together make quite a noticeable part of my wordlist table... not to be beaten by a simple stopword list 8(
Paddic
Registered User
Posts: 198
Joined: Tue May 18, 2004 10:58 am

Suggestion for sql

Post by Paddic »

Break the sql query into two. So that the "counting part" is as simple as possible, for example:

Code: Select all

select word_id, count(word_id) as cnt from phpbb_search_wordmatch group by word_id order by cnt desc limit 20
took 30 seconds to run with 7+ million rows. I did not dare to run the complex version :P but it is way slower.

After you get the word_id's and counts, you can get the words by using word_id, something like

Code: Select all

... where word_id in ( <list of id's returned from 1st query> )
Sorry that my examples are a bit short, but I know you are a pro :), so I'm sure you'll get the idea.

One thing that could be given a thought is the problem in phpbb, that stopword file is determined from user's language setting. Of course, if you have only one language on your board, it will not be a problem. My suggestion would be, that only one stopword file is used, regardless of the user language setting.
User avatar
drathbun
Former Team Member
Posts: 12204
Joined: Thu Jun 06, 2002 3:51 pm
Location: TOPICS_TABLE
Contact:

Post by drathbun »

I certainly plan to do this, eventually. :-) I need such a tool for myself. But as I said, I got nearly zero response on my initial posting, so I assumed that nobody else was interested. I currently manage my stopwords list via a few sql queries and directly editing the text file, so it's not like I don't have something already.

But since there does seem to be some interest, what sort of ideas do you have for features? Any specific requests?

I can say that I will not write code that will pull the stopwords into the database. I plan to leave it as a text file. But I do plan to offer an admin panel page to edit / manage the list. Anything else?
I blog about phpBB: phpBBDoctor blog
Still using phpbb2? So am I! Click below for details
Image
TStarGermany
Registered User
Posts: 16
Joined: Fri Nov 11, 2005 7:58 pm
Contact:

Post by TStarGermany »

i'd like to see some intelligence in the creation and administration of stopwords...
- beeing able to tell the difference between real/nonsense acronyms and abbreviations (SUV vs. etc)
- eliminating misspelled words (impossible?)
- splitting up words that were artificially put together (10$,200km...)

now that i think about it.. that's not to be done with PHP/SQL i'm afraid :\
R45
Registered User
Posts: 2830
Joined: Tue Nov 27, 2001 10:42 pm

Post by R45 »

Think it's a great idea for a MOD. I would support having it in a table in the database, you can just have the script write over the text file when changes are made. (Just to make it easier for you, I would think it would be easier to design the MOD referencing a table rather than having to open and parse the text file to do anything).

Perhaps handle the synonyms file one time, if you feel like :)
User avatar
jvini
Registered User
Posts: 300
Joined: Tue Sep 24, 2002 11:20 pm
Contact:

Post by jvini »

Would love this, my search_wordlist has reached about 5.5 million and I'm looking for some way to optimize the Search...considering just disabling it all together since it is a HUGE strain on my board.

Would love this panel though. :D
User avatar
poyntesm
Registered User
Posts: 1671
Joined: Tue Jan 18, 2005 11:19 am
Location: Dublin, Ireland
Contact:

Post by poyntesm »

Hmm...must have missed the first post. This is a great idea and a good example of a new MOD that needs doing. Go for it.
User avatar
drathbun
Former Team Member
Posts: 12204
Joined: Thu Jun 06, 2002 3:51 pm
Location: TOPICS_TABLE
Contact:

Post by drathbun »

I'll put it on my list. :-) To be honest, I've sort of "moved on" from this idea to others for the moment, so it will probably be a while before any actual development (and thus alpha or beta code) would become available. But I do intend to continue reading the topic and discussing ideas.

For example... storing words in a database to avoid parsing the text file was suggested, and I respectully think that's not needed. There really isn't any "parsing" to do, as the stopwords file is simply a list of words, one word per line. It's not like there is any structure to manage. I think setting up a database table would be overkill.

Checking for misspelled words is a great idea, but I would have to think about how that would be done. I would need access to some sort of dictionary (obviously) and some way to manage that file as well.

I do intend to also offer changes to "fix" the stopwords / searchwords management process. If you look at the sample output I put back in the first post, you'll see some shorter words (1 or 2 characters long). By definition those are not supposed to be indexed. Somewhere along the way the regex that parses out the post into words was changed, and shorter words are now making their way into the database. I have a board that is running an older version of the regex and the searchword database has no words fewer than 3 or greater than 20 characters, which is what it is supposed to do. So I'll wrap that "fix" into this MOD as well. At least I will publish the changes and let some of the regex and/or security experts look at the code and see if we can figure out why it was changed (if for performance or security) and why it's broken.

But the core function for the MOD was intended to simply be an ACP entry that allows me to open the stopwords file and add words while at the same time removing those words from the phpbb_search_wordlist and phpbb_search_wordmatch tables so that I don't have to do it manually. :-) I will regularly run the query I posted above and delete most (if not all) of the words on the top 10. My search_wordmatch table is now over 5 million rows. :shock:
I blog about phpBB: phpBBDoctor blog
Still using phpbb2? So am I! Click below for details
Image
User avatar
vrflyer
Registered User
Posts: 503
Joined: Mon Feb 17, 2003 8:27 pm

Post by vrflyer »

Def. watching this one... 8)
Always Under Construction !
Post Reply

Return to “[2.0.x] MODs in Development”