Assistance with code to fix search functionality when using Sphinx backend

Need some custom code changes to the phpBB core simple enough that you feel doesn't require an extension? Then post your request here so that community members can provide some assistance.

NOTE: NO OFFICIAL SUPPORT IS PROVIDED IN THIS SUB-FORUM
Forum rules
READ: phpBB.com Board-Wide Rules and Regulations

NOTE: NO OFFICIAL SUPPORT IS PROVIDED IN THIS SUB-FORUM
KYPREO
Registered User
Posts: 162
Joined: Fri Feb 02, 2018 9:56 am
Contact:

Assistance with code to fix search functionality when using Sphinx backend

Post by KYPREO » Tue Dec 03, 2019 1:12 am

INTRODUCTION

As per this ticket https://tracker.phpbb.com/browse/PHPBB3-16234 search functionality when using the Sphinx engine backend has been broken since phpBB 3.2.2.

This problem emanates from a fix to this issue: https://tracker.phpbb.com/browse/PHPBB3-15367 Essentially, the problem previously was that use of a / or $ would throw errors, as those are special characters used by Sphinx's extended search syntax and Sphinx would return an error if those characters were used in an unexpected way.

The fix for that issue used the SphinxAPI EscapeString function to escape every possible special character, which essentially breaks Sphinx's extended search syntax and means it only ever performs an all word search or any word search. None of ", *, +, |, NOT, AND, OR behave as expected.

As such, EscapeString is a very crude way of addressing Sphinx syntax errors. A much better option is to analyse phpBB search queries and turn them into a query Sphinx can understand. This will mean you can take advantage of the extended functionality Sphinx offers over phpBB native fulltext and mySQL fulltext.

I have forked phpBB on GitHub and will submit a pull request for code to address the ticket I created.

So far, I have fixed:

- use of exact phrase searching eg "this is a test"

- ensuring the words NOT and OR will search for the exact word NOT and OR when placed inside a phrase, and not turned into a - and | operator, as is currently the case. Eg "search not working" currently returns "search - working.

This is a problem going back to 3.1.5 / June 2015: https://tracker.phpbb.com/browse/PHPBB3-13958 and I have a fix for it.

- fixed use or NOT (-), OR (|) and bracket operators, which is currently broken.

- hyphenated words are now properly parsed and turned into search queries Sphinx can process. Currently, phpBB turns hyphens into minuses (ie NOT search operator) and therefore processes know-it-all as know -it -all (ie know, but not it and not all). With my code fix, know-it-all now becomes ("know it all")|knowitall) so it will look for all instances of the hyphenated string as both a phrase with spaces and as a single word.

So I've gotten to a point where search queries behave in accordance with the instructional text in search.php.

However, there are a number of additional Sphinx extended characters that could be leveraged to take advantage of Sphinx's capabilities, namely:
  • proximity search operator:

    eg "hello world"~10

    Finds the words hello and world within 10 words of each other
  • quorum matching operator:

    eg "the world is a wonderful place"/3

    Finds 3 words in the phrase list, eg it will return a post with "the world is" or "a wonderful place" etc.
PROBLEM

To check proper syntax, I thought of doing preg_replace to only escape instances of / or ~ where the string is:
1.not preceded by a phrase in quotation marks; and
2. not followed by a 1-2 digit number, either at the end of a phrase or followed by whitespace.

So, to do a preg_replace, this needs to pass (ie incorrect syntax and operator needs to be escaped or replaced with whitespace):
  • http://www
  • this is a test /3
  • /abcd
  • /123456
  • one quotation mark only"~3
This needs to fail (ie correct syntax and search operator preserved)
  • "this is a test"/2
  • "needle haystack"~10
PARTIAL SOLUTION

The following regex will find a / or ~ not followed by a 2 digit number (either at the end of a string or followed white space):

Code: Select all

/(\/|~)(?!([1-9]{1,2}($|\s)))/
Passes :
  • /489
  • /werg
  • /www.
Fails:
  • ~10
  • /2 some other queries
Now the tricky thing is test for the character not being immediately preceded by a phrase within quotation marks. I wanted to do a negative lookbehind, but you can't because you cannot stick a substring of unlimited length within a lookbehind.

Alternative solution

Instead, I was thinking I could invert the regex, so I use preg_match to test for correct syntax, then if it fails, it is followed by str_replace.

Eg:

This regex will identify the correct Sphinx syntax:

Code: Select all

/".+"(\/|~)(?=([1-9]{1,2}($|\s)))/
Passes:
"the world is a wonderful place"/3
"hello world"~10

Fails:
"this is missing a quotation mark/10
"letters shouldn't followed numbers"~1qwerty

Turned into a negative preg_match and str_replace:

Code: Select all

if(!preg_match('/".+"(\/|~)(?=([1-9]{1,2}($|\s)))/', $keywords)
	{
		str_replace(array( '/', '~' ), array(' '\/', '\~'), $keywords)
	}
I think this will allow one instance of correct syntax to pass, but if there is a mix of correct and incorrect usage, it won't work. So long as there is at least one usage of correct syntax, the $keywords string will evaluate as correct (even if other usages of the special characters are incorrect).

Is this the best way to do it, or is there another neater solution just using preg_replace and successfully testing for any incorrect usage of a / or ~?

Forgive me if I am making any amateur mistakes - I am new to all this but really keen to improve search functionality for both my board and the phpBB community.
phpBB user since 2002
www.AusRotary.com

User avatar
AmigoJack
Registered User
Posts: 5659
Joined: Tue Jun 15, 2010 11:33 am
Location: グリーン ヒル ゾーン
Contact:

Re: Assistance with code to fix search functionality when using Sphinx backend

Post by AmigoJack » Tue Dec 03, 2019 8:36 am

Thanks for finally looking into this issue.

I pretty much lost track of phpBB's code since 3.1 - with 3.0 the main aspect of the search implementation was separating the input text into keywords, which would also mean "hello world" ~10 is split into the three "keywords" (better: tokens) "hello, world" and ~10. For me the most reliable way was to work with these tokens afterwards (i.e. splicing "hello and world" back together into the phrase it needs to be). Likewise tokens like [~/][0-9]+ can be easily spotted, too. Maybe that way it works out better than trying to use regexes only? Of course, this also means the split-into-tokens process is bullitproof by ensuring "hello world"~10 (without space) is split into 3 tokens, too.
The worst thing about censorship is ███████████
Affin wrote:
Tue Nov 20, 2018 9:51 am
The problem is probably not my English but you do not want to understand correctly.
...
We will not come anybody anyway, nevertheless, it's best to shit this.

KYPREO
Registered User
Posts: 162
Joined: Fri Feb 02, 2018 9:56 am
Contact:

Re: Assistance with code to fix search functionality when using Sphinx backend

Post by KYPREO » Tue Dec 03, 2019 9:46 pm

Thanks very much. I'll study how the phpBB native search works with splitting up the keywords and see if that gives me some clues.

There is a function called "split_keywords" which does this. Funnily enough, there is an equivalent function called "split_keywords" function in the equivalent Sphinx code, but it is really misnamed as it doesn't actually do any splitting of anything. The query is passed through to Sphinx as a single string and Sphinx does the splitting up etc internally. The problem with the current implementation of Sphinx is that phpBB manipulates the query into something that does not match user intention (eg turning "search not working" into search - working) or strips out search operators that are valid and Sphinx can actually use.
phpBB user since 2002
www.AusRotary.com

User avatar
AmigoJack
Registered User
Posts: 5659
Joined: Tue Jun 15, 2010 11:33 am
Location: グリーン ヒル ゾーン
Contact:

Re: Assistance with code to fix search functionality when using Sphinx backend

Post by AmigoJack » Wed Dec 04, 2019 7:41 am

As per https://github.com/phpbb/phpbb/blob/master/phpBB/search.php neither the event "core.search_modify_param_before", nor "core.search_modify_param_after" can be used to modify the splitting. In https://github.com/phpbb/phpbb/blob/master/phpBB/phpbb/search/fulltext_sphinx.php the event "core.search_sphinx_keywords_modify_options" is also no help, as it doesn't provide access to search_query for whatever reason.

I think it's impossible for an extension - only modifying the code may help (or re-implementing the whole Sphinx code thru your extension by extending the existing class fulltext_sphinx and then "just" overwriting public function split_keywords(&$keywords, $terms) to using your own code - that's what I would try).
The worst thing about censorship is ███████████
Affin wrote:
Tue Nov 20, 2018 9:51 am
The problem is probably not my English but you do not want to understand correctly.
...
We will not come anybody anyway, nevertheless, it's best to shit this.

User avatar
mrgoldy
Jr. Extension Validator
Posts: 1213
Joined: Tue Oct 06, 2009 7:34 pm
Location: The Netherlands
Name: Gijs
Contact:

Re: Assistance with code to fix search functionality when using Sphinx backend

Post by mrgoldy » Wed Dec 04, 2019 10:04 am

I think it would be acceptable to create a PR for the core for this one, is it not?

Otherwise if you want it as an extension and you need those variables, you can create a PR to alter the events to include the required data.

What AmigoJack said is also possible for an extension, extending a class and 'decorating' the parent. See https://symfony.com/doc/3.4/service_con ... ation.html

Also, there currently is a PR in the works by rubencm about refactoring search classes, https://github.com/phpbb/phpbb/pull/5440
Perhaps this helps you out or does something similar to what you're after.
I am not too familar with Sphinx so I will let you be the judge of that.


I personally am a bit unclear on what exactly you want help with in regards to the regex.
You want some help writing a solid or correct regexp? Or am I mistaken?

If you need any assistance with PR's or the likes, feel free to ask.

KYPREO
Registered User
Posts: 162
Joined: Fri Feb 02, 2018 9:56 am
Contact:

Re: Assistance with code to fix search functionality when using Sphinx backend

Post by KYPREO » Wed Dec 04, 2019 9:36 pm

Thank you both.
mrgoldy wrote:
Wed Dec 04, 2019 10:04 am
I think it would be acceptable to create a PR for the core for this one, is it not?
AmigoJack wrote:
Wed Dec 04, 2019 7:41 am
"just" overwriting public function split_keywords(&$keywords, $terms) to using your own code - that's what I would try).
Yep, I'm definitely not doing an extension. Since this is essentially a bug fix and feature enhancement, I am intending to create a PR for the core and fix the problem there. The solution I have so far modifies public function split_keywords(&$keywords, $terms) in fulltext_sphinx.php, as well as the errant uses of the Sphinx API EscapeString function.
mrgoldy wrote:
Wed Dec 04, 2019 10:04 am
I personally am a bit unclear on what exactly you want help with in regards to the regex.
You want some help writing a solid or correct regexp? Or am I mistaken?
Correct.

I need a solid regexp to match the characters / or ~ used with incorrect syntax, to then escape that character or replace it with whitespace, so that the search does not throw an error.

This requires a regexp to find / or ~ NOT surrounded by the correct characters.

The problem I am having doing this is that lookbehinds must be fixed length.

This (?<!")(\/|~) will match a / or ~ not immediately preceded by a ". Ideally, I'd want a negative lookbehind for a complete phrases surrounded by "", eg (?<!".+")(\/|~) but that won't work, as the .+ makes the lookbehind non-fixed width.
phpBB user since 2002
www.AusRotary.com

User avatar
mrgoldy
Jr. Extension Validator
Posts: 1213
Joined: Tue Oct 06, 2009 7:34 pm
Location: The Netherlands
Name: Gijs
Contact:

Re: Assistance with code to fix search functionality when using Sphinx backend

Post by mrgoldy » Wed Dec 04, 2019 9:46 pm

Okay, and to better understand it.
This regex is going over the text a user enters into the search's input field.
It should then remove or replace all / and ~ characters that are not 'correct'.
And they would be correct if they were preceded by a quoted string ("my string") and followed by a number, which in turn is followed by a whitespace.
So only these should not be matched: "my string"/10 and "another keyword"~5.

KYPREO
Registered User
Posts: 162
Joined: Fri Feb 02, 2018 9:56 am
Contact:

Re: Assistance with code to fix search functionality when using Sphinx backend

Post by KYPREO » Wed Dec 04, 2019 9:53 pm

The closest I have gotten since my first post is this:

/^(\/|~)|(\/|~)(?!([1-9]{1,2}($|\s)))|(?:[^"])(\/|~)|^[^"].+(\/|~)/

Finds a / or ~ where:
* it is found at the start of the search query
* it is not followed by a 1-2 digit number at the end of a phrase or whitespace
* it is not immediately preceded by a "
* it is not at the end of a search query beginning with a "and followed by any character any number of times.

Examples:

Passes and finds a / or ~ in:

/34
this is a test phrase"~3
"this is missing the last quotation mark/3
"this uses too high a number"/999
"this places invalid characters after the numbers"/10something

Fails (and therefore no replacement necessary in)

"this is a test phrase"~3
"find two of 123 and 456"/2

I think the only permutation of a valid Sphinx query that is producing a false positive is where there is a keyword followed a phrase with the / or ~ operator, eg

keyword1 | keyword2 | "keyword3 keyword4"~10

This is a valid search query, but my regex above finds the ~.

I'm getting closer, but not quite there yet.

Here's a link to my testing of the regex above: https://regex101.com/r/WNzSFV/1
phpBB user since 2002
www.AusRotary.com

KYPREO
Registered User
Posts: 162
Joined: Fri Feb 02, 2018 9:56 am
Contact:

Re: Assistance with code to fix search functionality when using Sphinx backend

Post by KYPREO » Wed Dec 04, 2019 9:54 pm

mrgoldy wrote:
Wed Dec 04, 2019 9:46 pm
Okay, and to better understand it.
This regex is going over the text a user enters into the search's input field.
It should then remove or replace all / and ~ characters that are not 'correct'.
And they would be correct if they were preceded by a quoted string ("my string") and followed by a number, which in turn is followed by a whitespace.
So only these should not be matched: "my string"/10 and "another keyword"~5.
Spot on. Except the number can be followed by whitespace or appear at the end of the string it will still be valid.
phpBB user since 2002
www.AusRotary.com

User avatar
mrgoldy
Jr. Extension Validator
Posts: 1213
Joined: Tue Oct 06, 2009 7:34 pm
Location: The Netherlands
Name: Gijs
Contact:

Re: Assistance with code to fix search functionality when using Sphinx backend

Post by mrgoldy » Wed Dec 04, 2019 10:45 pm

Code: Select all

$strings = [
	'/34',
	'this is a test phrase"~3',
	'"this is missing the last quotation mark/3',
	'"this uses too high a number"/999',
	'"this places invalid characters after the numbers"/10something',
	'"this is a test phrase"~3',
	'"find two of 123 and 456"/2',
	'keyword1 | keyword2 | "keyword3 keyword4"~10',
];


foreach ($strings as $string)
{
	$string = strrev($string);

	$string = preg_replace('#[/~](?!"[^"]+")#', '', $string);

	$string = strrev($string);

	$string = preg_replace('#[/~](?!\d{1,2}(\s|$))#', '', $string);

	print_r($string);
	print_r('<br>');
}
Seems to do it for me.
First remove all ~ and / that are not preceeded (flipped the string for easier looking) by a proper quoted string.
Then remove all ~ and / that are not followed by one or two digits followed by a white space or end of string.

KYPREO
Registered User
Posts: 162
Joined: Fri Feb 02, 2018 9:56 am
Contact:

Re: Assistance with code to fix search functionality when using Sphinx backend

Post by KYPREO » Wed Dec 04, 2019 10:54 pm

I would never have thought to reverse the string! 1000 times thank you.🙏 This looks like it does the trick. I will give it a go with the rest of the code
phpBB user since 2002
www.AusRotary.com

User avatar
AmigoJack
Registered User
Posts: 5659
Joined: Tue Jun 15, 2010 11:33 am
Location: グリーン ヒル ゾーン
Contact:

Re: Assistance with code to fix search functionality when using Sphinx backend

Post by AmigoJack » Thu Dec 05, 2019 7:30 am

mrgoldy wrote:
Wed Dec 04, 2019 10:45 pm
strrev();
Don't use it directly, as it isn't character safe - you could end up splitting Unicode characters in bytes that may match then.
The worst thing about censorship is ███████████
Affin wrote:
Tue Nov 20, 2018 9:51 am
The problem is probably not my English but you do not want to understand correctly.
...
We will not come anybody anyway, nevertheless, it's best to shit this.

KYPREO
Registered User
Posts: 162
Joined: Fri Feb 02, 2018 9:56 am
Contact:

Re: Assistance with code to fix search functionality when using Sphinx backend

Post by KYPREO » Thu Dec 05, 2019 12:22 pm

OK this is what works:

Code: Select all

			$keywords = strrev($keywords);
			$keywords = preg_replace(array('#\/(?!"[^"]+")#', '#~(?!"[^"]+")#'), array('\/', '\~'), $keywords);
			$keywords = strrev($keywords);
			$keywords = preg_replace(array('#(/|\\\/)(?![1-9](\s|$))#', '#~(?!\d{1,2}(\s|$))#'), array('\/', '\~'), $keywords);	
Rather than delete the problematic characters, I escaped them so Sphinx will search for the characters literally. This would ensure that someone searching, for example, for a URL http://www.etc would be able to find that text.

If I need to encode $keywords before reversing it to preserve Unicode characters, then let me know. The PHP page linked above suggested UTF8 strings were fine, and the comment saying Unicode wasn't was downvoted.

Subject to that Unicode issue, I now have the complex search functionality sorted. Quorum and proximity searches now work!

Here is the full amended split_keywords function for Sphinx searching...

Code: Select all

	public function split_keywords(&$keywords, $terms)
	{
		// Keep quotes and new lines
		$keywords = str_replace(array('&quot;', "\n"), array('"', ' '), trim($keywords));

		/*
		* Find hyphenated words and replace with string to search for either the exact phrase with spaces or as a single word without spaces
		* Eg search for "know-it-all" becomes ("know it all"|"knowitall") 
		*
		*/
		$keywords = preg_replace(array('/((?:\p{L}|\p{N})+)-((?:\p{L}|\p{N})+)(?:-((?:\p{L}|\p{N})+))?/i'), array('("$1-$2-$3"|$1$2$3*)'), $keywords);
		$keywords = preg_replace(array('#(\S)-#', '# "\|#'), array('$1 ', '"|'), $keywords);

		if ($terms == 'all')
		{
			/**
			* Find and escape / and ~ characters that do not follow correct syntax as quorum matching and proximity search operators. 
			*
			* Examples of allowable usage:
			*
			* Quorum matching:  "the world is a wonderful place"/3 -- finds 3 of the words within the phrase. Number must be between 1 and 9.
			* Proximity search: "hello world"~10 -- finds hello and world within 10 words of each other. Number can be between 1 and 99
			*
			* Characters not meeting the above criteria are escaped and searched literally by Sphinx backend
			*/
			$keywords = strrev($keywords);
			$keywords = preg_replace(array('#\/(?!"[^"]+")#', '#~(?!"[^"]+")#'), array('\/', '\~'), $keywords);
			$keywords = strrev($keywords);
			$keywords = preg_replace(array('#(/|\\\/)(?![1-9](\s|$))#', '#~(?!\d{1,2}(\s|$))#'), array('\/', '\~'), $keywords);	
			
			// Find, cleanup and, if necessary, escape verbal operators and special characters used by Sphinx search
			$match		= array('#\sor\s(?=([^"]*"[^"]*")*[^"]*$)#i', '#\snot\s(?=([^"]*"[^"]*")*[^"]*$)#i', '#@#', '#\^#', '#\$#', '#\=#');
			$replace	= array(' | ', ' -', '\@', '\^', '\$', '\=');

			$keywords = preg_replace($match, $replace, $keywords);
			$this->sphinx->SetMatchMode(SPH_MATCH_EXTENDED);
		}
		else
		{
			$match = array ( '\\', '(',')','|','-','!','@','~','&', '/', '^', '$', '=' );
			$replace = array ( ' ', ' ', ' ', ' ',' ',' ',' ',' ', ' ', ' ', ' ', ' ', ' ' );

			$keywords = str_replace ($match, $replace, $keywords);
			$this->sphinx->SetMatchMode(SPH_MATCH_ANY);
		}

		if (strlen($keywords) > 0)
		{
			$this->search_query = str_replace('"', '&quot;', $keywords);
			return true;
		}

		return false;
	}

I'm now going to tackle the next 2 bits that are broken:

1. highlighting of words on the results page - this falls to pieces with phrase searching and search operators, such that found words/phrases are often not highlighted.

2. the instructional text in language/en/search.php which currently doesn't explain that phrase searching is possible with the right backend. There is actually a function set up to define whether phrase searching is enabled in the chosen search backend. Under the current code, if phrase searching is disabled, a search containing quotation marks will return a custom page saying phrase searching is disabled. However, only the postgreSQL search mode defines whether or not phrase searching is disabled - the other search modes are silent on this. What this means is that the instructional text could say something like Use quotation marks, eg "this phrase" to search for an exact match (if enabled on this board). Then if the search backend does not support it, it will tell a user who tries to use quotation marks. For this to work, it just needs the phrase search disabled flag added to phpBB Fulltext native, since that does not support phrase searching. For Sphinx and mySQL search mode, phrase searching works.

Board administrators with Sphinx search (there are probably not many out there) could make further custom amendments to language/en/search.php to explain proximity and quorum searching. I know I will.
phpBB user since 2002
www.AusRotary.com

User avatar
AmigoJack
Registered User
Posts: 5659
Joined: Tue Jun 15, 2010 11:33 am
Location: グリーン ヒル ゾーン
Contact:

Re: Assistance with code to fix search functionality when using Sphinx backend

Post by AmigoJack » Thu Dec 05, 2019 3:44 pm

KYPREO wrote:
Thu Dec 05, 2019 12:22 pm
The PHP page linked above suggested UTF8 strings were fine, and the comment saying Unicode wasn't was downvoted.
I linked to the non-downvoted comment and it provided its own function, hence the text "this works" (not "it works"). Unicode will break - in case you need a demo I'll write one and we can all be sure.

/includes/utf/utf_tools.php does not provide a counterpart function, but since mb_*() is used mostly you can use the given example yourself. However, /vendor/patchwork/utf8/src/Patchwork/Utf8.php comes with Utf8:: strrev() which should serve what you need.
The worst thing about censorship is ███████████
Affin wrote:
Tue Nov 20, 2018 9:51 am
The problem is probably not my English but you do not want to understand correctly.
...
We will not come anybody anyway, nevertheless, it's best to shit this.

User avatar
mrgoldy
Jr. Extension Validator
Posts: 1213
Joined: Tue Oct 06, 2009 7:34 pm
Location: The Netherlands
Name: Gijs
Contact:

Re: Assistance with code to fix search functionality when using Sphinx backend

Post by mrgoldy » Thu Dec 05, 2019 8:15 pm

Does unicode have a place in a search string? As in, isn't it just possible to strip that before hand?

Post Reply

Return to “phpBB Custom Coding”