PHP Regex quick question

Discussion forum for Extension Writers regarding Extension Development.
axe70
Registered User
Posts: 254
Joined: Sun Nov 17, 2002 10:55 am
Location: Italy

Re: PHP Regex quick question

Post by axe70 »

P.S. I just re-added something that got lost during on-the-fly tests and that I assume should stay in place, so the complete regexp (at the moment) is this:

$s = preg_match_all('/(https?)(:\/\/)(.[^< ]*?)(\.?|\?){1}(\.jpg|\.jpeg|\.gif|\.png|[a-z][\.|=]jpg|[\.|=]jpeg|[\.|=]gif|[\.|=]png){1}/ui', $string, $matches, PREG_SET_ORDER);
The example in my last post, linked right below, has been edited to use this same regexp:
viewtopic.php?p=15605486#p15605486
Does anybody know how it could be reduced?
JLA
Registered User
Posts: 580
Joined: Tue Nov 16, 2004 5:23 pm
Location: USA
Name: JLA FORUMS

Post by JLA »

axe70 wrote:
Fri Oct 09, 2020 11:25 pm
P.s just re-added something lost during fly tests [...]
Can you please explain your changes?

Post by axe70 »

Yes, if I'm not wrong:

'/(https?)(:\/\/)(.[^< ]*?)(\.?|\?){1}(\.jpg|\.jpeg|\.gif|\.png|[a-z][\.|=]jpg|[\.|=]jpeg|[\.|=]gif|[\.|=]png){1}/ui'
a match needs to start with
http or https
then ://
then any one character, followed by as few characters as possible that are neither < nor a space
then an optional dot, or a single ?
then .jpg, .jpeg, .gif or .png; or one a-z letter (0-9 should maybe be added) followed by a dot or = and jpg; or a dot or = followed by jpeg, gif or png
exactly one time
/ui means UTF-8 mode (u) and case-insensitive matching (i)

But reading it again, I realized it can fail if an image name contains a space, like http://www.google.com/te s_t.JPG, or if the name of the image contains numbers, or if the parameter is not pic= but something like http://www.google.com/dodacom?pic32=gif or something else; then it will not match.
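A small sketch may help check the reading above. It is only an illustration: the variable names and the shorter $reduced pattern are hypothetical, and $reduced is not an exact equivalent of the regexp in this post. It shows two things: inside square brackets the | is a literal pipe, so [\.|=] also accepts |, and =jpg is only accepted by the original when a letter comes right before the =.

```php
<?php
// Inside a character class "|" is a literal pipe, not alternation,
// so [\.|=] accepts ".", "|" or "=".
$pipeIsLiteral = preg_match('/[\.|=]png/', 'x|png');

// The regexp quoted in this post.
$original = '/(https?)(:\/\/)(.[^< ]*?)(\.?|\?){1}(\.jpg|\.jpeg|\.gif|\.png|[a-z][\.|=]jpg|[\.|=]jpeg|[\.|=]gif|[\.|=]png){1}/ui';

// Hypothetical shorter variant (an assumption, NOT an exact equivalent):
// "~" as delimiter avoids escaping "/", [.=] drops the literal pipe,
// and no letter is required before "=jpg".
$reduced = '~https?://[^< ]+?[.=](?:jpe?g|gif|png)~ui';

// A digit sits right before "=jpg", so the [a-z] in the original cannot match.
$url = 'http://www.google.com/dodacom?pic32=jpg';

$originalHits = preg_match($original, $url);
$reducedHits  = preg_match($reduced, $url);

var_dump($pipeIsLiteral, $originalHits, $reducedHits);
```

Whether the looser behaviour of the shorter pattern is wanted depends on which URLs should count as images, so treat it as a starting point for testing rather than a drop-in replacement.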

Then, given a string like this:

$string = ' blah blah blah ANYTHINGCANBEHEREhttps://www.google.com/testyryr533443t.jpgANYTHINGCANBEHERE 
hjhj https://www.google.com/test.php fsfsf http://www.google.com/te s_t.JPG ANYTHING<https://pbs.twimg.com/media/EjgPYArWoAAUgPO?format=gif&name=900x900>http://www.google.com/test.gifhttp://www.google.com/dodacom?pic32=gif&stuff=lalala  httpS://www.google.com/whatever.png
https://www.xxxxxxxxxx.com/xxxxxx/pngx-xxxx/xxxxxx/x-xxxr-xxxx-gifx-xxx-xxxxxx-xxxxxx-xxxxx-x-xx-xxxx-xxx-xxpng http://www.google.com/dodacom?pic=jpg&stuff=lalala<https://pbs.twimg.com/media/113344WoAAUgPO?format=jpeg&name=900x900>
fehfiwhfeiw120270 blabla http://www.thexxx.xxx/xxxx/203012131/XXXX/444444444/-1/xxxxxxxxxxxx4444%3FTitle%3Dxxx-xxxxxxy-xxf-xxxxxl-xxxx-xxxxx&xx=xx&xx=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=xxxxxxxxxxxxxxxxx-xxxxxxxxxxxxePng';
this updated regexp:

$s = preg_match_all('/(https?:\/\/)(.[^< ]*?\/).[^< =]*?[\w ]+?(\.jpg|\.jpeg|\.gif|\.png|[\w][\.|=]jpg|[\.|=]jpeg|[\.|=]gif|[\.|=]png){1}/ui', $string, $matches, PREG_SET_ORDER);
$res = array_column($matches, 0);
$res = array_map('strtolower',$res);
print_r($res);
exit;
now returns this:

Array
(
    [0] => https://www.google.com/testyryr533443t.jpg
    [1] => http://www.google.com/te s_t.jpg
    [2] => https://pbs.twimg.com/media/ejgpyarwoaaugpo?format=gif
    [3] => http://www.google.com/test.gif
    [4] => http://www.google.com/dodacom?pic32=gif
    [5] => https://www.google.com/whatever.png
    [6] => http://www.google.com/dodacom?pic=jpg
    [7] => https://pbs.twimg.com/media/113344woaaugpo?format=jpeg
)
I don't know whether your image names can contain a space or an underscore, or how they are stored in that case, but just in case, this should now cover more possibilities.
If you want something removed, or you find an error, let me know!
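One detail visible in the result above: array_map('strtolower', ...) lower-cases the whole URL, so EjgPYArWoAAUgPO comes back as ejgpyarwoaaugpo. Paths and query strings can be case-sensitive, so the lower-cased URL may no longer point to the same image. A minimal sketch, assuming only the scheme and host should be normalised (the function name lowercase_scheme_and_host is hypothetical):

```php
<?php
// Lower-case only "scheme://host"; leave path and query untouched.
function lowercase_scheme_and_host(string $url): string
{
    return preg_replace_callback(
        '~^[a-z][a-z0-9+.-]*://[^/?#]+~i',
        function (array $m): string {
            return strtolower($m[0]);
        },
        $url
    );
}

$urls = [
    'httpS://www.google.com/whatever.png',
    'https://pbs.twimg.com/media/EjgPYArWoAAUgPO?format=gif',
];

$normalized = array_map('lowercase_scheme_and_host', $urls);
print_r($normalized);
```

If the lower-casing is only there to compare or de-duplicate results, another option is to keep the original match and use the lower-cased copy as an array key.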

Post by JLA »

axe70 wrote:
Sat Oct 10, 2020 11:32 pm
yes, if i'm not wrong: [...]
Thinking about this some more, I think it could be like this:

Our matches would always start with
http:// or https:// or ftp://, or with www. alone, without the http:// https:// ftp:// (I didn't think of the www-only case before).

After that, there could be any valid URL characters (including numbers, of course) to match. Spaces would not be valid URL characters, right?

The URL could end in .jpg .png .jpeg .gif, etc., but it might not end that way.

(Like https://www.xxx.xxx/blahblahblah?blahblahblahjpg or https://www.xxx.xxx/blahblahblah?blahblahblah?23h4d3rblahblahblah=dfowodofefodjpg or https://www.xxx.xxx/blahblahblah?blahblahblah?23h4d3rblahblahblah=jpgdfowodofefod)

There could also be a query sequence: a ?, then some "valid" query string characters (whatever is valid in a genuine query string), then =, then one of the image endings.

The old regex seemed to match on the full URL plus a query sequence ending in jpg, png, gif, etc. I now imagine that is a real possibility, so we would match on it and validate it later in case it is not a valid image URL. Full items ending in jpg, png, gif, etc. seem like a safe bet. The ultimate goal is to examine the string and pull out as many valid image URLs as possible. We just need to think about the real-world ways users could enter them (if valid) so we can capture them with our regex.

So many possibilities to think of - giving me a Saturday night headache. LOL
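To make the rules in this post concrete, here is a rough sketch, not a finished solution. The pattern and the sample text are assumptions made up for illustration: a candidate starts with http://, https://, ftp:// or a bare www., contains no whitespace or angle brackets, and stops at the first .ext or =ext for a known image extension.

```php
<?php
// Assumed rules (illustration only): scheme or bare "www." at the start,
// no whitespace or "<" / ">" inside, lazy match up to the first
// ".ext" or "=ext" where ext is jpg, jpeg, gif or png.
$pattern = '~(?:(?:https?|ftp)://|www\.)[^\s<>]+?[.=](?:jpe?g|gif|png)~i';

$text = 'go to www.example.com/pic.png or ftp://files.example.com/a.gif '
      . 'or https://www.example.com/view?format=jpeg&name=small today';

preg_match_all($pattern, $text, $matches);
$found = $matches[0];

print_r($found);
```

The cases with an ending glued to other text (such as ...=jpgdfowodofefod) are deliberately left open here, since the thread has not settled whether they should count as matches.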

Post by axe70 »

... following up on your interesting last post: since it turned out to be hard (for me) to solve, I looked into it for a few minutes at a time, in between other things, over these days ... so now

given a string like this, which adds your latest cases:

$string = ' blah blah blah ANYTHINGCANBEHEREhttps://www.google.com/testyryr533443t.jpgANYTHINGCANBEHERE 
hjhj https://www.google.com/test.php fsfsf http://www.google.com/te s_t.JPG ANYTHING<https://pbs.twimg.com/media/EjgPYArWoAAUgPO?format=gif&name=900x900>http://www.google.com/test.gifhttp://www.google.com/dodacom?pic32=gif&stuff=lalala  httpS://www.google.com/whatever.png
https://www.xxxxxxxxxx.com/xxxxxx/pngx-xxxx/xxxxxx/x-xxxr-xxxx-gifx-xxx-xxxxxx-xxxxxx-xxxxx-x-xx-xxxx-xxx-xxpng http://www.google.com/dodacom?pic=jpg&stuff=lalala<https://pbs.twimg.com/media/113344WoAAUgPO?format=jpeg&name=900x900>
fehfiwhfeiw120270 blabla http://www.thexxx.xxx/xxxx/203012131/XXXX/444444444/-1/xxxxxxxxxxxx4444%3FTitle%3Dxxx-xxxxxxy-xxf-xxxxxl-xxxx-xxxxx&xx=xx&xx=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=xxxxxxxxxxxxxxxxx-xxxxxxxxxxxxePng
https://www.xxx.xxx/blahblahblah?blahblahblahjpg 
or https://www.xxx.xxx/blahblahblah?blahblahblah?23h4d5rblahblahblah=gifwodofefodjpg 
or https://www.xxx.xxx/blahblahblah?blahblahblah?23h4d8rblahblahblah=jpgdfowodofefod';
with this regexp:

$s = preg_match_all('/((ftp|https?):\/\/)(.[^< ]*?\/).[^< =\?]*?[\w ]+?((\.jpg|\.gif|\.png|\.jpeg){1}|(\?[a-z0-9]+)={1}(jpg|jpeg|gif|png){1})/ui', $string, $matches, PREG_SET_ORDER);
$res = array_column($matches, 0);
$res = array_map('strtolower',$res);
print_r($res);
exit;
the result will again be this:

Array
(
    [0] => https://www.google.com/testyryr533443t.jpg
    [1] => http://www.google.com/te s_t.jpg
    [2] => https://pbs.twimg.com/media/ejgpyarwoaaugpo?format=gif
    [3] => http://www.google.com/test.gif
    [4] => http://www.google.com/dodacom?pic32=gif
    [5] => https://www.google.com/whatever.png
    [6] => http://www.google.com/dodacom?pic=jpg
    [7] => https://pbs.twimg.com/media/113344woaaugpo?format=jpeg
)
I assume it may still not be perfect? Probably not, so now I ask myself how it can fail. I expect an error report soon.
Besides, a subsequent regexp could check each single result in various ways and discard unwanted matches, which would make the trick perfect, but we want the whole trick in a single regexp!

Any correction, improvement, hint or suggestion about this would be really appreciated.
Last edited by kinerity on Wed Oct 14, 2020 1:07 pm, edited 1 time in total.
Reason: Removed link. Please don't promote your WordPress phpBB in an unrelated topic.
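The idea mentioned in this post, a second pass that checks each result and discards unwanted matches, could look roughly like the sketch below. It is an assumption-laden illustration: the function name looks_like_image_url is hypothetical, and it uses PHP's parse_url(), pathinfo() and parse_str() instead of a second regexp, which is a swapped-in technique and not what was posted in the thread.

```php
<?php
// Second-pass check: accept a candidate only if its path ends in an image
// extension, or if one of its query parameter values is an image extension.
function looks_like_image_url(string $url): bool
{
    $extensions = ['jpg', 'jpeg', 'gif', 'png'];

    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        return false;
    }

    // Case 1: .../whatever.png
    $ext = strtolower(pathinfo($parts['path'] ?? '', PATHINFO_EXTENSION));
    if (in_array($ext, $extensions, true)) {
        return true;
    }

    // Case 2: ...?format=gif
    parse_str($parts['query'] ?? '', $query);
    foreach ($query as $value) {
        if (is_string($value) && in_array(strtolower($value), $extensions, true)) {
            return true;
        }
    }

    return false;
}

$candidates = [
    'https://pbs.twimg.com/media/EjgPYArWoAAUgPO?format=gif',
    'https://www.google.com/test.php',
    'https://www.google.com/whatever.PNG',
    'https://www.xxx.xxx/blahblahblah?blahblahblahjpg',
];

$images = array_values(array_filter($candidates, 'looks_like_image_url'));
print_r($images);
```

With a filter like this in place, the first regexp can stay deliberately loose and only needs to collect candidates.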

Post by axe70 »

axe70 wrote:
Wed Oct 14, 2020 12:46 pm
... following your interesting last one [...]
P.S.:
Note that the part (\?[a-z0-9]+)={1} could be changed to (\?[\w]+)={1} to match more characters.
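A quick way to see the difference is a parameter name with an underscore, which \w accepts and [a-z0-9] does not. The two patterns below are cut-down versions of that one part only, and the sample query string is made up for illustration.

```php
<?php
// Only the "?name=ext" part of the regexp, in both variants.
$narrow = '/\?[a-z0-9]+=(?:jpg|jpeg|gif|png)/i';
$wide   = '/\?[\w]+=(?:jpg|jpeg|gif|png)/i';

// Hypothetical query string with an underscore in the parameter name.
$query = '?my_pic=jpg';

$narrowHit = preg_match($narrow, $query); // the "_" stops [a-z0-9]+
$wideHit   = preg_match($wide, $query);   // \w also covers "_"

var_dump($narrowHit, $wideHit);
```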

Post by axe70 »

This should be foolproof:

$s = preg_match_all('/((ftp|https?):\/\/)(.[^< ]*?\/)[^< =\?]*?[\w ]+?((\.jpg|\.gif|\.png|\.jpeg){1}|(\?[\w]+=){1}(jpg|jpeg|gif|png){1})/ui', $string, $matches, PREG_SET_ORDER);
... until you discover that it is not perfect

Post by JLA »

JLA wrote:
Mon Oct 05, 2020 6:35 pm
Hi everyone - lots of activity on this, I see. Sorry for the delay in replying. I am currently testing all the proposed solutions. First, a little more info to help clarify the situation. The project I'm working on deals with a string that contains some user input.

1. This input can at times include some https (and, I imagine, http and ftp too) URLs (not href URLs) that are enclosed in < and > symbols. URLs can sometimes include query strings, so we must think about that. As long as the URL has valid characters, it should be fine, provided it has a < at the start and a > at the end and meets the other conditions. When we remove a < or >, it should be replaced with a white space to preserve its position and the position of all the other data in the string.

2. This user input can also include < and > symbols, for any reason, that are not related to these enclosed URLs. We have to account for this and not touch these symbols or their positions in the string.

3. We also have to account for user error and possibly malicious/disruptive user input. For example, a user could by chance enter the following (shown with the desired result):

<https://blahblahblah> - this is a match, we want to remove the < and > from the URL

http://blahblahblah>  this is not a match because the URL does not start with a < so we do not touch this

<<http://blahblahblah>> - this is a match but we only remove the 1st < and > and leave the other which results in < http://blahblahblah >

<http://blahblah<>blahblah> - this is not a match because the URL contains invalid character before the 1st > after the 1st < - we do not touch this
4. Another thing to consider: since we are speaking of valid URLs, at this point it complicates things too much to check whether the actual URL is valid (having something like a .com or .net or .eu, etc.). Our test string below really doesn't cover this, and I think checking it would complicate things too much. If a user screws up in that way, then that is their problem. We only want to make sure that what we look for starts with <https:// and that between the <https:// and the 1st > there are only characters that would normally be valid in a URL (or a non-malicious query string). Does the non-malicious query string again complicate things too much? If so, then perhaps between <https:// and the 1st > we just want to make sure that there is not another <. If there is, it would not be a match. Open to opinions on this.

These are some examples and hopefully the intent is clear.
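The four example lines above can be turned into a small self-check. This is a sketch under assumptions, not one of the solutions from the thread: the function name unwrap_bracketed_urls is hypothetical, and the pattern additionally forbids whitespace inside the URL, which is a choice made here for illustration.

```php
<?php
// Replace the "<" and ">" around a bracketed URL with one space each,
// so every other character keeps its position in the string.
function unwrap_bracketed_urls(string $text): string
{
    // "<", scheme, "://", then one or more characters that are not
    // "<", ">" or whitespace, then ">".
    $result = preg_replace('~<((?:https?|ftp)://[^<>\s]+)>~i', ' $1 ', $text);

    // preg_replace() returns null on error; fall back to the input then.
    return $result ?? $text;
}

$cases = [
    '<https://blahblahblah>',
    'http://blahblahblah>',
    '<<http://blahblahblah>>',
    '<http://blahblah<>blahblah>',
];

foreach ($cases as $case) {
    $out = unwrap_bracketed_urls($case);
    // The length never changes, because each bracket becomes one space.
    echo $case, ' => "', $out, '" (', strlen($case), ' / ', strlen($out), ")\n";
}
```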

So now on to the test string of

$string = 'Today at <-thiscanbeanyassortmentofanyrandomchars-><http://thiscanbeanyassortmentofanyrandomchars>it\'s raining but it is ok <b style="color:red">for me because</b> i\'m a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah.     Blahblah<blahblah><blahblah>. blah blah
blah blah blah<https://blahblahblah>blahblahblah>
Blah<https://blahblah> 5555 <https://blahblah<>blahblah>blahblahblah  <https://blahblah>blublublu> <https://blah> <blah>
blah blah blah<blah><https://blahblahblah><blahblahblah <https://blahblahmissing <https://anotherblahblah>';
We will go from most recent solution to earliest

October 5, 2020 at 9:03am axe70 suggested

$replaced = preg_replace('#( ?)\<{1}(ftp|https?)(:\/\/)(.[^<]*?)\>{1}( ?)#ui', ' \2\3\4 ', $string);
which gives a result where everything seems good. We will call this **SUCCESS #1**

Today at <-thiscanbeanyassortmentofanyrandomchars-> http://thiscanbeanyassortmentofanyrandomchars it's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah.     Blahblah<blahblah><blahblah>. blah blah
blah blah blah https://blahblahblah blahblahblah>
Blah https://blahblah 5555 <https://blahblah<>blahblah>blahblahblah  https://blahblah blublublu> https://blah <blah>
blah blah blah<blah> https://blahblahblah <blahblahblah <https://blahblahmissing https://anotherblahblah 

In another solution from the same post:

$replaced = preg_replace('#( ?)\<{1}(ftp|https?)(:\/\/)(.[^<]*?)\>{1}( ?)#ui', '\1\2\3\4\5', $string);

Not a good result: it looks like positions in the string changed because the matching < and > were not replaced with whitespace. Does anyone see anything else?? **Failure #1**

Today at <-thiscanbeanyassortmentofanyrandomchars->http://thiscanbeanyassortmentofanyrandomcharsit's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah.     Blahblah<blahblah><blahblah>. blah blah
blah blah blahhttps://blahblahblahblahblahblah>
Blahhttps://blahblah 5555 <https://blahblah<>blahblah>blahblahblah  https://blahblahblublublu> https://blah <blah>
blah blah blah<blah>https://blahblahblah<blahblahblah <https://blahblahmissing https://anotherblahblah

Another solution by axe70 on October 5, 2020:

$s = preg_match_all('/( ?)<{1}(ftp|https?):\/\/(.[^<]*?)>{1}( ?)/ui', $string, $matches, PREG_SET_ORDER);
$s = preg_replace('/<{1}(ftp|https?):\/\/(.[^<]*?)>{1}/ui', '#W3JB007PH#', $string, -1 ,$cr);
if($cr > 0){
  foreach($matches as $m => $mv){
   // $mv could be used to manipulate each match as more like, doing magic things here
      $s = preg_replace('/\#W3JB007PH\#/u', ' ' . $matches[$m][2] . '://' . $matches[$m][3] . ' ', $s, 1); // one x time, in order, add spaces before and after placeholder
   // $s = preg_replace('/\#W3JB007PH\#/u', $matches[$m][1] . $matches[$m][2] . '://' . $matches[$m][3] . $matches[$m][4], $s, 1); // one x time, re-add spaces only if found on the string
	 //print_r($s);echo'<br />';
  }
}

 echo $s; 
 
 
This also seems good. We will call this **SUCCESS #2**. Missing anything??

Today at <-thiscanbeanyassortmentofanyrandomchars-> http://thiscanbeanyassortmentofanyrandomchars it's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah.     Blahblah<blahblah><blahblah>. blah blah
blah blah blah https://blahblahblah blahblahblah>
Blah https://blahblah  5555 <https://blahblah<>blahblah>blahblahblah   https://blahblah blublublu>  https://blah  <blah>
blah blah blah<blah> https://blahblahblah <blahblahblah <https://blahblahmissing  https://anotherblahblah 

Another solution proposed by AbaddonOrmuz on October 5, 2020

$replaced = preg_replace('#<(https?://[^<>]+)>#', '\1', $text);
Not a good result: it looks like positions in the string changed because the matching < and > were not replaced with whitespace. Does anyone see anything else?? **Failure #2**

Today at <-thiscanbeanyassortmentofanyrandomchars->http://thiscanbeanyassortmentofanyrandomcharsit's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah.     Blahblah<blahblah><blahblah>. blah blah
blah blah blahhttps://blahblahblahblahblahblah>
Blahhttps://blahblah 5555 <https://blahblah<>blahblah>blahblahblah  https://blahblahblublublu> https://blah <blah>
blah blah blah<blah>https://blahblahblah<blahblahblah <https://blahblahmissing https://anotherblahblah

Another solution proposed by axe70 on October 5, 2020 at 3:58am

$s = preg_match_all('/( ?)<{1}(ftp|https?):\/\/(.[^<]*?)>{1}( ?)/ui', $string, $matches, PREG_SET_ORDER);
$s = preg_replace('/<{1}(ftp|https?):\/\/(.[^<]*?)>{1}/ui', '#W3JB007PH#', $string, -1 ,$cr);
if($cr > 0){
  foreach($matches as $m => $mv){
   // $mv could be used to manipulate each match as more like, doing magic things here
      $s = preg_replace('/\#W3JB007PH\#/u', ' ' . $matches[$m][2] . '://' . $matches[$m][3] . ' ', $s, 1); // one x time, in order, add spaces before and after placeholder
   // $s = preg_replace('/\#W3JB007PH\#/u', $matches[$m][1] . $matches[$m][2] . '://' . $matches[$m][3] . $matches[$m][4], $s, 1); // one x time, re-add spaces only if found on the string
	 //print_r($s);echo'<br />';
  }
}

 echo $s;
Seems to be another success. Anything missing?? **SUCCESS #3**

Today at <-thiscanbeanyassortmentofanyrandomchars-> http://thiscanbeanyassortmentofanyrandomchars it's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah.     Blahblah<blahblah><blahblah>. blah blah
blah blah blah https://blahblahblah blahblahblah>
Blah https://blahblah  5555 <https://blahblah<>blahblah>blahblahblah   https://blahblah blublublu>  https://blah  <blah>
blah blah blah<blah> https://blahblahblah <blahblahblah <https://blahblahmissing  https://anotherblahblah 

In another solution by axe70 on October 5 at 7:21

$s = preg_match_all('/( ?)<{1}(ftp|http|https?):\/\/(.[^<]*?)>{1}( ?)/', $string, $matches, PREG_SET_ORDER);
$s = preg_replace('/<{1}(ftp|http|https?):\/\/(.[^<]*?)>{1}/', '##my007placeolder##', $string, -1 ,$cr);

if($cr > 0){
 foreach($matches as $m => $mv){
      $s = preg_replace('/\#\#my007placeolder\#\#/', ' ' . $matches[$m][2] . '://' . $matches[$m][3] . ' ', $s, 1); // one x time, in order, adding a space in this case, at right/left
   // $s = preg_replace('/\#\#my007placeolder\#\#/', $matches[$m][1] . $matches[$m][2] . '://' . $matches[$m][3] . $matches[$m][4], $s, 1); //  respect what found on string, re-adding spaces or not based on if there are or not

	 //print_r($s);echo'<br />'; // demo of the "why"
  }
}

echo $s;
Seems to be another success. Anything missing?? **SUCCESS #4**

Today at <-thiscanbeanyassortmentofanyrandomchars-> http://thiscanbeanyassortmentofanyrandomchars it's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah.     Blahblah<blahblah><blahblah>. blah blah
blah blah blah https://blahblahblah blahblahblah>
Blah https://blahblah  5555 <https://blahblah<>blahblah>blahblahblah   https://blahblah blublublu>  https://blah  <blah>
blah blah blah<blah> https://blahblahblah <blahblahblah <https://blahblahmissing  https://anotherblahblah 



There were a couple of other solutions by axe70, but we will skip those and move to the original solution he proposed, which we had found to be working, applied here to our test string:

$s = preg_match_all('/ ?<(ftp|http|https?):\/\/(.*?)\> ?/', $string, $matches, PREG_SET_ORDER); 
$s = preg_replace('/ ?<(ftp|http|https?):\/\/(.*?)\> ?/', '##my007placeolder##', $string, -1 ,$cr);
$pn = 0;
if($cr > 0){
	foreach($matches as $m){
	 $s = preg_replace('/\#\#my007placeolder\#\#/', ' ' . $matches[$pn][1] . '://' . $matches[$pn][2] . ' ', $s, 1); // one x time, in order
	 //print_r($s);echo'<br />';
	 $pn++;
  }
}

echo $s;
**Failure #3**

Today at <-thiscanbeanyassortmentofanyrandomchars-> http://thiscanbeanyassortmentofanyrandomchars it's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah.     Blahblah<blahblah><blahblah>. blah blah
blah blah blah https://blahblahblah blahblahblah>
Blah https://blahblah 5555 https://blahblah< blahblah>blahblahblah  https://blahblah blublublu> https://blah <blah>
blah blah blah<blah> https://blahblahblah <blahblahblah https://blahblahmissing <https://anotherblahblah 
This one results in a failure because

This line in our string starts off
Blah<https://blahblah> 5555 <https://blahblah<>blahblah>blahblahblah  <https://blahblah>blublublu> <https://blah> <blah>

and was changed to
Blah https://blahblah 5555 https://blahblah< blahblah>blahblahblah  https://blahblah blublublu> https://blah <blah>

and this line
blah blah blah<blah><https://blahblahblah><blahblahblah <https://blahblahmissing <https://anotherblahblah>

was incorrectly changed to
blah blah blah<blah> https://blahblahblah <blahblahblah https://blahblahmissing <https://anotherblahblah 
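The difference between this failing attempt and the working ones comes down to (.*?) versus a class that excludes angle brackets. A minimal sketch on one fragment of the test string (the variable names are made up for illustration, and the two patterns are simplified versions of the ones in the thread):

```php
<?php
$line = '5555 <https://blahblah<>blahblah>blahblahblah';

// Lazy dot: expands until the first ">", and is free to swallow the "<"
// that comes before it.
$lazyDot = preg_match('~<(https?)://(.*?)>~', $line, $m);
$captured = $m[2]; // the captured "URL" ends with a stray "<"

// Excluding "<" and ">" inside the URL: nothing here is a clean <url>.
$strict = preg_match('~<(https?)://([^<>]+)>~', $line);

var_dump($lazyDot, $captured, $strict);
```

This is why the first pattern turns <https://blahblah<>blahblah> into https://blahblah< blahblah>, while the stricter class leaves it untouched as required.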
*
*
*
*
*
So, it appears that we have four successful options. The question is: is anything missing in any of these four, and what is the consensus on the best one to use considering the requirements and scenarios I've tried to explain?

OPTION 1

$replaced = preg_replace('#( ?)\<{1}(ftp|https?)(:\/\/)(.[^<]*?)\>{1}( ?)#ui', ' \2\3\4 ', $string);
OPTION 2

$s = preg_match_all('/( ?)<{1}(ftp|https?):\/\/(.[^<]*?)>{1}( ?)/ui', $string, $matches, PREG_SET_ORDER);
$s = preg_replace('/<{1}(ftp|https?):\/\/(.[^<]*?)>{1}/ui', '#W3JB007PH#', $string, -1 ,$cr);
if($cr > 0){
  foreach($matches as $m => $mv){
   // $mv could be used to manipulate each match as more like, doing magic things here
      $s = preg_replace('/\#W3JB007PH\#/u', ' ' . $matches[$m][2] . '://' . $matches[$m][3] . ' ', $s, 1); // one x time, in order, add spaces before and after placeholder
   // $s = preg_replace('/\#W3JB007PH\#/u', $matches[$m][1] . $matches[$m][2] . '://' . $matches[$m][3] . $matches[$m][4], $s, 1); // one x time, re-add spaces only if found on the string
	 //print_r($s);echo'<br />';
  }
}

 echo $s;  
OPTION 3

$s = preg_match_all('/( ?)<{1}(ftp|https?):\/\/(.[^<]*?)>{1}( ?)/ui', $string, $matches, PREG_SET_ORDER);
$s = preg_replace('/<{1}(ftp|https?):\/\/(.[^<]*?)>{1}/ui', '#W3JB007PH#', $string, -1 ,$cr);
if($cr > 0){
  foreach($matches as $m => $mv){
   // $mv could be used to manipulate each match as more like, doing magic things here
      $s = preg_replace('/\#W3JB007PH\#/u', ' ' . $matches[$m][2] . '://' . $matches[$m][3] . ' ', $s, 1); // one x time, in order, add spaces before and after placeholder
   // $s = preg_replace('/\#W3JB007PH\#/u', $matches[$m][1] . $matches[$m][2] . '://' . $matches[$m][3] . $matches[$m][4], $s, 1); // one x time, re-add spaces only if found on the string
	 //print_r($s);echo'<br />';
  }
}

 echo $s;
OPTION 4

$s = preg_match_all('/( ?)<{1}(ftp|http|https?):\/\/(.[^<]*?)>{1}( ?)/', $string, $matches, PREG_SET_ORDER);
$s = preg_replace('/<{1}(ftp|http|https?):\/\/(.[^<]*?)>{1}/', '##my007placeolder##', $string, -1 ,$cr);

if($cr > 0){
 foreach($matches as $m => $mv){
      $s = preg_replace('/\#\#my007placeolder\#\#/', ' ' . $matches[$m][2] . '://' . $matches[$m][3] . ' ', $s, 1); // one x time, in order, adding a space in this case, at right/left
   // $s = preg_replace('/\#\#my007placeolder\#\#/', $matches[$m][1] . $matches[$m][2] . '://' . $matches[$m][3] . $matches[$m][4], $s, 1); //  respect what found on string, re-adding spaces or not based on if there are or not

	 //print_r($s);echo'<br />'; // demo of the "why"
  }
}

echo $s;
Thank you all so much for your time and involvement on this. Hopefully these efforts can be of use to others in the future too.

Something odd happened today when using our last solution:

 $axe70s = preg_replace('#( ?)<{1}(ftp|http|https)(:\/\/)(.[^<]*?)>{1}( ?)#ui',  ' \2\3\4 ', $message);
		
$message = $axe70s;
With this particular regex, when we saw a string that had some unescaped double quotes and single quotes in the same string, there was strange behavior: the string was returned empty, and only from the point after this regex. We changed to this other solution and now it seems normal. Not sure at this point why the 1st one failed.

This is working (for now):

$s = preg_match_all('/( ?)<{1}(ftp|http|https?):\/\/(.[^<]*?)>{1}( ?)/', $message, $matches, PREG_SET_ORDER);
$s = preg_replace('/<{1}(ftp|http|https?):\/\/(.[^<]*?)>{1}/', '##my007placeolder##', $message, -1 ,$cr);

if($cr > 0){
 foreach($matches as $m => $mv){
      $s = preg_replace('/\#\#my007placeolder\#\#/', ' ' . $matches[$m][2] . '://' . $matches[$m][3] . ' ', $s, 1); // one x time, in order, adding a space in this case, at right/left
 
  }
}

$message = $s;

Post by axe70 »

If, going on, you've found a way to accomplish it, that's ok!
JLA wrote:
for this particular regex and when we saw a string that had some unescaped double quotes and single quotes in same string - strange behavior. String was returned as empty - only at the point after this regex. Changed to this other solution and now seems normal. Not sure why at this point why 1st failed
I assume the regexp fails if the string contains something like this character: £. To resolve it, where the regexp ends with
/ui OR #ui
just remove the u, so that the regexp does not fail when there are characters like this.
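One plausible explanation for the empty result, offered as a sketch and not as a confirmed diagnosis of that particular message: with the u modifier, preg_replace() returns null when the subject is not valid UTF-8, and null behaves like an empty string once it is used as text. The pattern below is a simplified stand-in and the sample message is made up; "\xA3" is the £ sign in ISO-8859-1, which on its own is not valid UTF-8.

```php
<?php
// Simplified stand-in for the bracket-removing regexp, without modifiers yet.
$pattern = '#<((?:ftp|https?)://[^<]+?)>#i';

// Hypothetical message containing a Latin-1 encoded pound sign.
$message = "it costs \xA3 5 <https://example.com/a.png> ok";

// With "u": the subject is rejected as malformed UTF-8.
$withU = preg_replace($pattern . 'u', ' $1 ', $message);
$errorWithU = preg_last_error();

// Without "u": the subject is treated as plain bytes and the call succeeds.
$withoutU = preg_replace($pattern, ' $1 ', $message);

var_dump($withU, $errorWithU === PREG_BAD_UTF8_ERROR, $withoutU);
```

Checking for a null return (or preg_last_error()) right after the call makes this kind of failure visible, whichever of the two variants ends up being used.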

Anyway, I just took a quick look, stress-testing the regexp against the latest string. Everything works fine, in any test and any possible condition, until you hit a sequence like this:

https://www.google.com/test.phphttp://www.google.com/test.jpg
The regexp does not fail if it is
https://www.google.com/test.gifhttp://www.google.com/test.jpg
because https://www.google.com/test.gif matches,
but it will fail in any other case where the structure of the string is respected but the first URL does not end with .jpg|.gif|.png|.jpeg
So
https://www.google.com/test.WATHEVERHEREhttp://www.google.com/test.jpg
will also match and be returned as a single result.

Given a string like this

$string = ' blah blah blah ANYTHINGCANBEHEREhttps://www.google.com/testyryr533443t.jpgANYTHINGCANBEHERE 
hjhj https://www.google.com/test.phpkhttps://www.google.com/test.phpkhttps://www.google.com/test.phpkhttp://www.google.com/test.jpghttps://www.google.com/test2.gif
https://www.google.com/test.php fsfsf http://www.google.com/te s_t.JPGhttp://www.google.com/te £s_t.JPG ANYTHING<https://pbs.twimg.com/media/EjgPYArWoAAUgPO?format=gif&name=900x900>http://www.google.com/test.gifhttp://www.google.com/dodacom?pic32=gif&stuff=lalala  httpS://www.google.com/whatever.png
https://www.xxxxxxxxxx.com/xxxxxx/pngx-xxxx/xxxxxx/x-xxxr-xxxx-gifx-xxx-xxxxxx-xxxxxx-xxxxx-x-xx-xxxx-xxx-xxpng<http://www.google.com/dodacom?pic=jpg&stuff=lalala<https://pbs.twimg.com/media/113344WoAAUgPO?format=jpeg&name=900x900>
fehfiwhfeiw120270 blabla http://www.thexxx.xxx/xxxx/203012131/XXXX/444444444/-1/xxxxxxxxxxxx4444%3FTitle%3Dxxx-xxxxxxy-xxf-xxxxxl-xxxx-xxxxx&xx=xx&xx=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=xxxxxxxxxxxxxxxxx-xxxxxxxxxxxxePng
https://www.xxx.xxx/blahblahblah?blahblahblahjpg 
or https://www.xxx.xxx/blahblahblah?blahblahblah?23h4d5rblahblahblah=gifwodofefodjpg 
or https://www.xxx.xxx/blahblahblah?blahblahblah?23h4d8rblahblahblah=jpgdfowodofefod
<br />
rergergge ffwefftpftp://www.google.com/dodacom/EjgPYArWoAAUgPO?format=gif
<b>testme</b>https://my-site.com/media/1234007abc?format=jpg<i>iui</i>';
this reduced regexp (the same as in my last post):

Code: Select all

$s = preg_match_all('/(ftp|https?:\/\/)(.[^< =\?]*?)[\w ]*?((\.jpg|\.gif|\.png|\.jpeg)|\?[\w]+=(jpg|jpeg|gif|png){1})/i', $string, $matches, PREG_SET_ORDER);
will return this (full matches only, lowercased and trimmed as done in the complete script further down):

Code: Select all

Array
(
    [0] => https://www.google.com/testyryr533443t.jpg
    [1] => https://www.google.com/test.phpkhttps://www.google.com/test.phpkhttps://www.google.com/test.phpkhttp://www.google.com/test.jpg
    [2] => https://www.google.com/test2.gif
    [3] => http://www.google.com/te s_t.jpg
    [4] => https://pbs.twimg.com/media/ejgpyarwoaaugpo?format=gif
    [5] => http://www.google.com/test.gif
    [6] => http://www.google.com/dodacom?pic32=gif
    [7] => https://www.google.com/whatever.png
    [8] => http://www.google.com/dodacom?pic=jpg
    [9] => https://pbs.twimg.com/media/113344woaaugpo?format=jpeg
    [10] => ftpftp://www.google.com/dodacom/ejgpyarwoaaugpo?format=gif
    [11] => https://my-site.com/media/1234007abc?format=jpg
)
which is wrong at index 1 and index 10.

The index 1 example could contain any number of glued URLs of the same format, from one upward, but the fact is that the only valid one will always be the last one. We can be sure of this because, as said, the regexp ends its match as soon as it reaches a valid image format. So the wrong parts can only occur before it, exactly like:

Code: Select all

[1] => https://www.google.com/test.phphttp://www.google.com/test.phphttps://www.google.com/test.phpkhttp://www.google.com/test.jpg
in fact the subsequent https://www.google.com/test2.gif is correctly matched after the above, as a single match,
or

Code: Select all

 [10] => ftpftp://www.google.com/dodacom/ejgpyarwoaaugpo?format=gif
where ftp is repeated.
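
A small sketch of what seems to cause the repeated ftp: in (ftp|https?:\/\/) the :// belongs only to the http branch, so a bare "ftp" is enough to start a match. The regrouped pattern below is only a hypothetical variant for comparison, not something tested against the full string.

```php
<?php
$sample = 'ffwefftpftp://www.google.com/'; // fragment of the test string above

// Grouping as in the reduced regexp: "ftp" alone OR "http(s)://".
preg_match('/(ftp|https?:\/\/)/i', $sample, $loose, PREG_OFFSET_CAPTURE);

// Hypothetical regrouping: "://" is required after either scheme.
preg_match('/(ftp|https?):\/\//i', $sample, $strict, PREG_OFFSET_CAPTURE);

print_r($loose[0]);  // 'ftp' at offset 5: the first, bare "ftp"
print_r($strict[0]); // 'ftp://' at offset 8: the real start of the URL
```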

So, to purge these possible errors, the only valid way I've found at the moment is this:

Code: Select all

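$ary_res = []; // initialized up front, so print_r() below also works when $res is empty

if (!empty($res)) {
    foreach ($res as $p => $r) {
        // Split on every scheme, keeping the captured scheme (http, https or ftp) in the result.
        $ss0 = preg_split("/(ftp|https?):\/\//", $r, -1, PREG_SPLIT_DELIM_CAPTURE);
        $ss1 = array_pop($ss0); // the last element is always the correctly matched part
        $ss2 = array_pop($ss0); // once that is removed, the last element is always its related http, https or ftp

        $ary_res[] = $ss2 . '://' . $ss1; // rebuild
    }
}

print_r($ary_res);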
The result will be this:

Code: Select all

Array
(
    [0] => https://www.google.com/testyryr533443t.jpg
    [1] => http://www.google.com/test.jpg
    [2] => https://www.google.com/test2.gif
    [3] => http://www.google.com/te s_t.jpg
    [4] => https://pbs.twimg.com/media/ejgpyarwoaaugpo?format=gif
    [5] => http://www.google.com/test.gif
    [6] => http://www.google.com/dodacom?pic32=gif
    [7] => https://www.google.com/whatever.png
    [8] => http://www.google.com/dodacom?pic=jpg
    [9] => https://pbs.twimg.com/media/113344woaaugpo?format=jpeg
    [10] => ftp://www.google.com/dodacom/ejgpyarwoaaugpo?format=gif
    [11] => https://my-site.com/media/1234007abc?format=jpg
)
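
To make the split step easier to follow, here is a minimal sketch on one glued match only. The $glued value is taken from the failing example above and is already lowercase, as in the script (the split pattern has no i modifier, so it relies on that).

```php
<?php
// One glued match, already lowercased as in the script above.
$glued = 'https://www.google.com/test.phphttp://www.google.com/test.jpg';

// The capture group keeps each scheme as its own element; "://" itself is dropped.
$parts = preg_split('/(ftp|https?):\/\//', $glued, -1, PREG_SPLIT_DELIM_CAPTURE);
print_r($parts);
// elements: '', 'https', 'www.google.com/test.php', 'http', 'www.google.com/test.jpg'

$rest   = array_pop($parts); // 'www.google.com/test.jpg'
$scheme = array_pop($parts); // 'http'
$url    = $scheme . '://' . $rest;

echo $url, "\n"; // http://www.google.com/test.jpg
```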

Is there a more convenient way? I don't know! I assume the regexp could cover all needs if fixed in a way I can't imagine right now, or be replaced by a new one that, at the moment, I can't imagine either.
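
One possible direction, offered only as a sketch: forbid the URL body from running into a new scheme by putting a negative lookahead in front of every character. The pattern below is a hypothetical variant, not the regexp from this thread. It assumes URLs contain no whitespace (so the "te s_t.JPG" case is deliberately not covered) and it has only been checked against the cut-down sample shown, not the full test string.

```php
<?php
// Hypothetical cut-down sample containing the two problem cases.
$string = 'a https://www.google.com/test.phphttp://www.google.com/test.jpg b '
        . 'ffwefftpftp://www.google.com/dodacom/EjgPYArWoAAUgPO?format=gif c';

$pattern = '/(?:ftp|https?):\/\/'                 // scheme, "://" required for both
         . '(?:(?!(?:ftp|https?):\/\/)[^<\s])+?'  // URL body, never crossing into a new scheme
         . '(?:\.(?:jpe?g|gif|png)|\?\w+=(?:jpe?g|gif|png))/i'; // image extension or ?key=format

preg_match_all($pattern, $string, $matches);
print_r($matches[0]);
// http://www.google.com/test.jpg
// ftp://www.google.com/dodacom/EjgPYArWoAAUgPO?format=gif
```

If this holds up on real data, the preg_split clean-up would no longer be needed, but the split approach above stays the safer fallback.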

To run the script and see the results, here is the complete .php file, summarized:

Code: Select all

<?php
$string = ' blah blah blah ANYTHINGCANBEHEREhttps://www.google.com/testyryr533443t.jpgANYTHINGCANBEHERE 
hjhj https://www.google.com/test.phpkhttps://www.google.com/test.phpkhttps://www.google.com/test.phpkhttp://www.google.com/test.jpghttps://www.google.com/test2.gif
https://www.google.com/test.php fsfsf http://www.google.com/te s_t.JPGhttp://www.google.com/te £s_t.JPG ANYTHING<https://pbs.twimg.com/media/EjgPYArWoAAUgPO?format=gif&name=900x900>http://www.google.com/test.gifhttp://www.google.com/dodacom?pic32=gif&stuff=lalala  httpS://www.google.com/whatever.png
https://www.xxxxxxxxxx.com/xxxxxx/pngx-xxxx/xxxxxx/x-xxxr-xxxx-gifx-xxx-xxxxxx-xxxxxx-xxxxx-x-xx-xxxx-xxx-xxpng<http://www.google.com/dodacom?pic=jpg&stuff=lalala<https://pbs.twimg.com/media/113344WoAAUgPO?format=jpeg&name=900x900>
fehfiwhfeiw120270 blabla http://www.thexxx.xxx/xxxx/203012131/XXXX/444444444/-1/xxxxxxxxxxxx4444%3FTitle%3Dxxx-xxxxxxy-xxf-xxxxxl-xxxx-xxxxx&xx=xx&xx=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=xxxxxxxxxxxxxxxxx-xxxxxxxxxxxxePng
https://www.xxx.xxx/blahblahblah?blahblahblahjpg 
or https://www.xxx.xxx/blahblahblah?blahblahblah?23h4d5rblahblahblah=gifwodofefodjpg 
or https://www.xxx.xxx/blahblahblah?blahblahblah?23h4d8rblahblahblah=jpgdfowodofefod
<br />
rergergge ffwefftpftp://www.google.com/dodacom/EjgPYArWoAAUgPO?format=gif
<b>testme</b>https://my-site.com/media/1234007abc?format=jpg<i>iui</i>';
print_r($string); 
echo '<br /><br />';

$s = preg_match_all('/(ftp|https?:\/\/)(.[^< =\?]*?)[\w ]*?((\.jpg|\.gif|\.png|\.jpeg)|\?[\w]+=(jpg|jpeg|gif|png){1})/i', $string, $matches, PREG_SET_ORDER);

echo'<pre>';

$res = array_column($matches, 0);
$res = array_map('strtolower',$res);
$res = array_map('trim',$res);
print_r($res);

$ary_res = []; // initialized up front, so print_r() below also works when $res is empty

if (!empty($res)) {
    foreach ($res as $p => $r) {
        // Split on every scheme, keeping the captured scheme (http, https or ftp) in the result.
        $ss0 = preg_split("/(ftp|https?):\/\//", $r, -1, PREG_SPLIT_DELIM_CAPTURE);
        $ss1 = array_pop($ss0); // the last element should always be the correctly matched image part
        $ss2 = array_pop($ss0); // once that is removed, the last element is always its related http, https or ftp

        $ary_res[] = $ss2 . '://' . $ss1; // so rebuild
    }
}

print_r($ary_res);

exit;
and with this it should never fail.