Recreating The Search Index

Do not post support requests, bug reports or feature requests. Discuss phpBB here. Non-phpBB related discussion goes in General Discussion!
Get Involved
User avatar
warmweer
Jr. Extension Validator
Posts: 11674
Joined: Fri Jul 04, 2003 6:34 am
Location: Van Allen Bel ... gium

Re: Recreating The Search Index

Post by warmweer »

ChinaGal wrote: Tue Feb 11, 2020 6:44 pm Understood, thank you very much. So I will put it at 2%. Otherwise, I don't think it would take more than a few hours to reindex our site. Its still very new.
Unless I've misinterpreted the common word threshold all these years, putting it at 2% increases the number of common words (compared to the default 5%)

Common word threshold:
Words which are contained in a greater percentage of all posts will be regarded as common. Common words are ignored in search queries. Set to zero to disable. Only takes effect if there are more than 100 posts. If you want words that are currently regarded as common to be reconsidered you have to recreate the index.

The way I understand it: if 3% is your threshold, then words that occur in more than 3% of the posts are considered too common, therefore if you reduce the threshold to 2%, a word only needs to be present in 2% of the posts to be considered too common.

BTW in the very beginning I also lowered the threshshold instead of increasing it. non-Native English speakers will probably misinterpret this easily.
Spelling is freeware, which means you can use it for free.
On the other hand, it is not open source, which means you cannot change it or publish it in a modified form.


Time flies like an arrow, but fruit flies like a banana.
User avatar
ChinaGal
Registered User
Posts: 185
Joined: Fri Jan 24, 2020 10:02 am

Re: Recreating The Search Index

Post by ChinaGal »

Thank you for the clarification on all that.

Well, here is the latest. I deleted the index and created a new one. I was preparing to go to bed and let it run. Was about to set my computer not to go to sleep after a certain period of time, and then suddenly the whole thing was done in less than 60 seconds. That was fast! - By the way, I had my threshold set to 2%.

Then I went back in and did a search for those keywords I was needing to get working in the search. Unfortunately they still were giving me the same error messages as before and I couldn't search by them.

Then I figured, since it only took a minute to reindex, I would try and run it again with a threshold of 0%. Nothing to lose I figured. Well, it took less than 60 seconds again and the stats changed a lot on the reindex numbers obviously, but now all the words we need are there! I am so pleased. This is great!

Also, I think I am going to keep the threshold at 0% for now as it doesn't seem to be slowing things down as far as I can see. For us, the search results need to be very precise. It might also make a difference between someone finding what they are looking for on the site and not. Prior to the changes I made to the search settings, and the reindexing, the search results were very hit and miss.
User avatar
ChinaGal
Registered User
Posts: 185
Joined: Fri Jan 24, 2020 10:02 am

Re: Recreating The Search Index

Post by ChinaGal »

By the way, I think the total amount of server space taken up by our site grew by about 3/4 of 1MB after the reindexing, which is understandable, as the index tables probably grew in size quite a bit as a result of moving the threshold from the default of 5% to 0%, but we can definitely live with that. So no complaints there at all.
KYPREO
Registered User
Posts: 392
Joined: Fri Feb 02, 2018 9:56 am

Re: Recreating The Search Index

Post by KYPREO »

phpBB Native Fulltext with 0% common word threshold is perfectly acceptable for a small board. However, watch the size of the search index as your board grows larger.

mySQL fulltext is arguably a better option for a medium sized forum, both in terms of search index size and the fact you can do phrase searching (this is not possible with phpBB Native Fulltext).

Anything over a million posts and you'd be silly not to use Sphinx.

To give you an idea of the differences, I have done quite a bit of experimentation on this in the last 6 months. Here are my observations based on my board (~1.9 million posts, although has been well over 2 million in the past):

* phpBB Native Fulltext search index was 3.5-4GB alone, over 50% of the total forum database size. This is with a common word threshold of 5%. In a test environment it would take over 24 hours to rebuild the search index. In a live environment this would take much much longer. In fact, we first moved from phpBB2 to 3 and this search engine was first introduced, it took 2 weeks to build the search index. :o Search queries with large numbers of results would take over 30 seconds. Many of the most important keywords on our forums were counted witihn the 5% threshold and so the search function became largely useless over time.

* mySQL Fulltext had much better granularity as a search backend. You could actually find the posts you were after and you can search for phrases. Search index size was also much much better at around 1.5GB and only took a few hours to rebuild. However, the search index was still stuck with a minimum word length of 3 characters. More complex search also took longer than phpBB native - well over 30 seconds, so it is not suitable for deployment on a large board

* SphinxSearch - with a minimum character length of 1 (ie every single word indexed) and infinite inline wildcards, the search index is under 1GB. It takes less than 10 minutes to rebuild from nothing. In fact it is scheduled to completely rebuild the index every day at around 2am when the board is quiet. There is a delta index running every 5 minutes to update search results as soon as they are posted. This delta index takes less than 10ms to build. Searches with extraordinary complexity all take less than 100ms. Most searches are done in a few thousands of a second. I can do phrase searching, proximity searching, quorum searching, exact searching, searching for grammatical equivalents (lemmatization) and custom mapping of search terms. Sphinx for the win....assuming your hosting allows you to install on your server and you are prepared for the learning curve to set it up.
phpBB user since 2002
www.AusRotary.com
User avatar
ChinaGal
Registered User
Posts: 185
Joined: Fri Jan 24, 2020 10:02 am

Re: Recreating The Search Index

Post by ChinaGal »

@KYPREO - Thank you for that nice breakdown of all the index types.

At 0% threshold, I am sure the size of the search engine will grow larger and consume more server space, but that is fine for us right now. Maybe as time goes on we won't need it to be so granular and we can increase it to 1%-2%.

I don't think our forum will ever get up to anything above 100,000 topics, at least not for 5-7 years I am thinking. 1,500-2,000 posts a month seems like what we might be adding as we go forward, but it is hard to know for sure I guess.

I am just curious what type of site you are running that gets that kind of post volume of nearly 2 million posts? It is a site that has been online for quite a long time too?

By the way, I also saw your detailed trouble shooting post in the SEO Sitemap Extension support area. Another good post. I just installed that extension too. And it seems to be working fine I am happy to say. The dynamic sitemap comes up in a browser when you enter the URL.
KYPREO
Registered User
Posts: 392
Joined: Fri Feb 02, 2018 9:56 am

Re: Recreating The Search Index

Post by KYPREO »

ChinaGal wrote: Wed Feb 12, 2020 8:50 am I don't think our forum will ever get up to anything above 100,000 topics, at least not for 5-7 years I am thinking. 1,500-2,000 posts a month seems like what we might be adding as we go forward, but it is hard to know for sure I guess.

I am just curious what type of site you are running that gets that kind of post volume of nearly 2 million posts? It is a site that has been online for quite a long time too?
It's a niche automotive forum. It is slow-paced now as most automotive forums have become, but it has been going for 19 years (18 years on phpBB) and at its height it was not uncommon to have well over 1,000 posts a day. Around 500,000 million posts have been deleted over the journey to keep the database smaller, but even still the board has an average of 300 posts per day over 18 years. I'm currently hosting around 150,000 images as well.
By the way, I also saw your detailed trouble shooting post in the SEO Sitemap Extension support area. Another good post. I just installed that extension too. And it seems to be working fine I am happy to say. The dynamic sitemap comes up in a browser when you enter the URL.
I'm glad it was helpful!
phpBB user since 2002
www.AusRotary.com
User avatar
warmweer
Jr. Extension Validator
Posts: 11674
Joined: Fri Jul 04, 2003 6:34 am
Location: Van Allen Bel ... gium

Re: Recreating The Search Index

Post by warmweer »

KYPREO wrote: Wed Feb 12, 2020 9:46 pm ...Around 500,000 million posts have been deleted over the journey to keep the database smaller, but even still the board has an average of 300 posts per day over 18 years. I'm currently hosting around 150,000 images as well.
300 posts/day x 365 x 18 = ± 2 000 000 posts and you've deleted 500,000 million posts.
That means only 0.4% of all posts were kept :lol: YOu must have millions of unhappy members.

And I was flabbergasted when I learnt that the JLAforums have 500 million posts.
Now even more flabbergasted learning that you evaporated a thousand JLA forums. ;)
Spelling is freeware, which means you can use it for free.
On the other hand, it is not open source, which means you cannot change it or publish it in a modified form.


Time flies like an arrow, but fruit flies like a banana.
KYPREO
Registered User
Posts: 392
Joined: Fri Feb 02, 2018 9:56 am

Re: Recreating The Search Index

Post by KYPREO »

:lol: When you think half a million and write 500,000 million. :lol:
phpBB user since 2002
www.AusRotary.com
User avatar
ChinaGal
Registered User
Posts: 185
Joined: Fri Jan 24, 2020 10:02 am

Re: Recreating The Search Index

Post by ChinaGal »

Wow, that is a big site though. And all the work to delete old posts takes a lot of time too. Don't worry, the type on the post deletion numbers didn't throw me off.

Just to give me an idea, how much server space does the entire site take up, including all the file attachments?

And how much might it take up with 2 million posts, without any file attachments?
KYPREO
Registered User
Posts: 392
Joined: Fri Feb 02, 2018 9:56 am

Re: Recreating The Search Index

Post by KYPREO »

ChinaGal wrote: Thu Feb 13, 2020 4:34 am Wow, that is a big site though. And all the work to delete old posts takes a lot of time too. Don't worry, the type on the post deletion numbers didn't throw me off.

Just to give me an idea, how much server space does the entire site take up, including all the file attachments?

And how much might it take up with 2 million posts, without any file attachments?
3GB database, ~1GB Sphinx plus search index, >15GB images. In an ideal world, that many images should be hosted solely on remote storage like Amazon S3 and then served through a CDN. That's a work in progress.
phpBB user since 2002
www.AusRotary.com
User avatar
ChinaGal
Registered User
Posts: 185
Joined: Fri Jan 24, 2020 10:02 am

Re: Recreating The Search Index

Post by ChinaGal »

Wow, that's a lot. I bet backups take a long time. I guess it is worth it though to deal with all that when you have a very successful forum like yours.
KYPREO
Registered User
Posts: 392
Joined: Fri Feb 02, 2018 9:56 am

Re: Recreating The Search Index

Post by KYPREO »

ChinaGal wrote: Thu Feb 13, 2020 6:10 am Wow, that's a lot. I bet backups take a long time. I guess it is worth it though to deal with all that when you have a very successful forum like yours.
Database backups only take 10 minutes at most. I have written a Powershell script that runs mysqldump and saves directly into Gzip, then securely uploads to an Amazon S3 bucket, logs any errors then sends an email confirmation with details of the backup process (success/failure/nature of error). This is scheduled to run daily. The Amazon bucket is configured to expire backups after 7 days. In the event of a catastrophe, i would be able to download 7 days worth of database backups without access to the server. Because each backup is only 500mb compressed, i never exceed the quotas for the Amazon free tier.

I plan on sharing the script but i don't think many people run Windows servers so it may not be useful.🤷‍♂️
phpBB user since 2002
www.AusRotary.com

Return to “phpBB Discussion”