PHP Regex quick question

Discussion forum for Extension Writers regarding Extension Development.
axe70
Registered User
Posts: 279
Joined: Sun Nov 17, 2002 10:55 am
Location: Italy

Re: PHP Regex quick question

Post by axe70 »

With your regexp,
<https://blahblah>blublublu>
is turned into
https://blahblahblublublu>
which is wrong, isn't it?

Nice example, although it does not allow doing further things with each match, as this does:

Code: Select all

$s = preg_match_all('/( ?)<{1}(ftp|https?):\/\/(.[^<]*?)>{1}( ?)/ui', $string, $matches, PREG_SET_ORDER);
$s = preg_replace('/<{1}(ftp|https?):\/\/(.[^<]*?)>{1}/ui', '#W3JB007PH#', $string, -1 ,$cr);
if($cr > 0){
  foreach($matches as $m => $mv){
   // $mv could be used to manipulate each match as more like, doing magic things here
      $s = preg_replace('/\#W3JB007PH\#/u', ' ' . $matches[$m][2] . '://' . $matches[$m][3] . ' ', $s, 1); // one x time, in order, add spaces before and after placeholder
   // $s = preg_replace('/\#W3JB007PH\#/u', $matches[$m][1] . $matches[$m][2] . '://' . $matches[$m][3] . $matches[$m][4], $s, 1); // one x time, re-add spaces only if found on the string
	 //print_r($s);echo'<br />';
  }
}

 echo $s;
P.S. In the meantime: ftp|https? is of course better than ftp|http|https?, where the separate http alternative adds nothing, so I changed it.
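
If the goal is to manipulate each match, one possible alternative to the placeholder loop above is `preg_replace_callback`, which passes every match to a function in a single pass. This is only a sketch: the simplified pattern (`[^<>]+` for the URL body) and the sample `$string` are assumptions for illustration, not code from the thread.

```php
<?php
// Sample input (assumption for illustration).
$string = 'Blah<https://blahblah> 5555 <https://blahblah>blublublu>';

// "<", a scheme, "://", then anything except "<" or ">", then ">".
$pattern = '#<((?:ftp|https?)://[^<>]+)>#i';

$result = preg_replace_callback(
    $pattern,
    function (array $m): string {
        // $m[1] holds the URL without the angle brackets;
        // any per-match manipulation could happen here.
        return ' ' . $m[1] . ' ';
    },
    $string
);

echo $result, PHP_EOL;
```

Because each bracket is swapped for exactly one space, the string keeps its original length in this sketch.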
Last edited by axe70 on Mon Oct 05, 2020 1:31 pm, edited 1 time in total.
AbaddonOrmuz
Recognised Extension Developer
Posts: 1046
Joined: Wed Dec 25, 2013 9:06 pm
Location: /dev/null
Name: Alfredo

Re: PHP Regex quick question

Post by AbaddonOrmuz »

axe70 wrote:
Mon Oct 05, 2020 10:26 am
With your regexp,
<https://blahblah>blublublu>
is turned into
https://blahblahblublublu>
which is wrong, isn't it?
Mmm... I didn't see that in the OP.
axe70 wrote:
Mon Oct 05, 2020 10:26 am
Nice example, although it does not allow doing further things with each match, as this does:
And what's the point of that? The OP doesn't mention he wants to manipulate each match.

Re: PHP Regex quick question

Post by axe70 »

You are right, maybe it is not required in this case, so for what was asked we can easily agree that your solution is faster.

Anyway, given something like this:
<https://blahblah>blublublu>
your regexp returns this:
https://blahblahblublublu>
instead of this:
https://blahblah blublublu>
assuming that is the needed result (and from what I have understood, it is).

Changing your regexp to $replaced = preg_replace('#<(https?://[^<]+)>#', '\1', $string); does not work either, because it seems to fail in other cases.
So does the regexp/preg_replace need to be more complex in your case to achieve the same?
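
To make the difference concrete, here is a small sketch comparing the three variants discussed so far on the sample string. The variable names are only illustrative. The likely reason `[^<]+` misbehaves is that it is greedy and also accepts `>`, so it can run on to the last `>` instead of the first.

```php
<?php
$string = '<https://blahblah>blublublu>';

// '[^<]+' is greedy and also accepts '>', so it runs up to the LAST '>'.
$greedy = preg_replace('#<(https?://[^<]+)>#', '\1', $string);

// Excluding '>' as well stops the URL at the first closing bracket,
// but the brackets are dropped without leaving a space behind.
$joined = preg_replace('#<(https?://[^<>]+)>#', '\1', $string);

// Same pattern, with each removed bracket replaced by a space.
$spaced = preg_replace('#<(https?://[^<>]+)>#', ' \1 ', $string);

echo $greedy, PHP_EOL, $joined, PHP_EOL, $spaced, PHP_EOL;
```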

P.S. In the meantime: ftp|https? is of course better than ftp|http|https?, where the separate http alternative adds nothing, so I changed it in the last working snippet below.

Re: PHP Regex quick question

Post by axe70 »

This could be your version, modified so that it works:

Code: Select all

$replaced = preg_replace('#( ?)\<{1}(ftp|https?)(:\/\/)(.[^<]*?)\>{1}( ?)#ui', ' \2\3\4 ', $string);
It seems to work as expected, and of course it is faster for this purpose!

P.S. I left the ( ?) groups in. They can be removed, because in this example groups 1 and 5 are ignored and replaced by a fixed white space.

So the version that keeps white spaces only where they were found can be this:

Code: Select all

$replaced = preg_replace('#( ?)\<{1}(ftp|https?)(:\/\/)(.[^<]*?)\>{1}( ?)#ui', '\1\2\3\4\5', $string);
So nice. It can be improved or changed further: the group that matches :// can be omitted and the :// re-added literally, like the white spaces, or it can be included in the URL group (then the regexp needs to be a little different).
[EDITED]
JLA
Registered User
Posts: 589
Joined: Tue Nov 16, 2004 5:23 pm
Location: USA
Name: JLA FORUMS

Re: PHP Regex quick question

Post by JLA »

Hi everyone - lots of activity on this I see. Sorry for the delay in replying. I am currently testing all the proposed solutions. First, a little more info to help clarify the situation. The project I'm working on deals with a string that contains some user input.

1. This input can at times include some https (and, I imagine, http and ftp too) URLs (not href URLs) that are enclosed in < and > symbols. URLs can sometimes include query strings, so we must think about that. As long as the URL has valid characters, it should be fine, provided it has a < at the start and a > at the end and meets the other conditions. When we remove a < or >, it should be replaced with a white space to preserve its position and the position of all the other data in the string.

2. This user input can also include < and > symbols for any reason in the string that are not related to these enclosed URLs. We have to account for this and not touch these symbols and their positions in the string

3. We also have to account for user error and possibly malicious/disruptive user input. For example, a user could by chance enter the following (shown with the desired result):

Code: Select all

<https://blahblahblah> - this is a match, we want to remove the < and > from the URL

http://blahblahblah>  this is not a match because the URL does not start with a < so we do not touch this

<<http://blahblahblah>> - this is a match but we only remove the 1st < and > and leave the other which results in < http://blahblahblah >

<http://blahblah<>blahblah> - this is not a match because the URL contains invalid character before the 1st > after the 1st < - we do not touch this
4. Another thing to consider: although we are speaking of valid URLs, at this point it complicates things too much to check whether the actual URL is valid (having something like a .com or .net or .eu, etc.). Our test string below really doesn't cover this. If a user screws up in that way, then that is their problem. We only want to make sure that what we look for starts with <https:// and that between the <https:// and the 1st > there are only characters that would normally be valid in a URL (or a non-malicious query string). Does the non-malicious query string again complicate things too much? If so, then perhaps between <https:// and the 1st > we just want to make sure that there is not another <; if there is, it would not be a match. Open to opinions on this.

These are some examples and hopefully the intent is clear.
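
The four example cases above can be checked with a short sketch. The helper name `unwrapBracketedUrls` is hypothetical, and the pattern takes the simpler route from point 4: the URL body may contain anything except `<`, `>` and white space (excluding white space is an extra assumption here). It does not validate the URL itself.

```php
<?php
/**
 * Hypothetical helper: replaces the "<" and ">" around an ftp/http/https URL
 * with spaces, so the string length and all other positions stay the same.
 */
function unwrapBracketedUrls(string $text): string
{
    // URL body: one or more characters that are not "<", ">" or white space.
    return preg_replace('#<((?:ftp|https?)://[^<>\s]+)>#i', ' $1 ', $text);
}

// Input => desired result, taken from the examples in the post.
$cases = [
    '<https://blahblahblah>'      => ' https://blahblahblah ',
    'http://blahblahblah>'        => 'http://blahblahblah>',
    '<<http://blahblahblah>>'     => '< http://blahblahblah >',
    '<http://blahblah<>blahblah>' => '<http://blahblah<>blahblah>',
];

foreach ($cases as $input => $expected) {
    $output = unwrapBracketedUrls($input);
    echo $input, ' => ', $output, ($output === $expected ? ' (ok)' : ' (unexpected)'), PHP_EOL;
}
```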

So now on to the test string of

Code: Select all

$string = 'Today at <-thiscanbeanyassortmentofanyrandomchars-><http://thiscanbeanyassortmentofanyrandomchars>it\'s raining but it is ok <b style="color:red">for me because</b> i\'m a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah.     Blahblah<blahblah><blahblah>. blah blah
blah blah blah<https://blahblahblah>blahblahblah>
Blah<https://blahblah> 5555 <https://blahblah<>blahblah>blahblahblah  <https://blahblah>blublublu> <https://blah> <blah>
blah blah blah<blah><https://blahblahblah><blahblahblah <https://blahblahmissing <https://anotherblahblah>';
We will go from most recent solution to earliest

October 5, 2020 at 9:03am axe70 suggested

Code: Select all

$replaced = preg_replace('#( ?)\<{1}(ftp|https?)(:\/\/)(.[^<]*?)\>{1}( ?)#ui', ' \2\3\4 ', $string);
which appears to give a good result. We will call this **SUCCESS #1**

Code: Select all

Today at <-thiscanbeanyassortmentofanyrandomchars-> http://thiscanbeanyassortmentofanyrandomchars it's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah.     Blahblah<blahblah><blahblah>. blah blah
blah blah blah https://blahblahblah blahblahblah>
Blah https://blahblah 5555 <https://blahblah<>blahblah>blahblahblah  https://blahblah blublublu> https://blah <blah>
blah blah blah<blah> https://blahblahblah <blahblahblah <https://blahblahmissing https://anotherblahblah 

In another solution from same post

Code: Select all


$replaced = preg_replace('#( ?)\<{1}(ftp|https?)(:\/\/)(.[^<]*?)\>{1}( ?)#ui', '\1\2\3\4\5', $string);

Not a good result, because the positions in the string changed: the matching < and > were not replaced with white space. Does anyone see anything else? **Failure #1**

Code: Select all

Today at <-thiscanbeanyassortmentofanyrandomchars->http://thiscanbeanyassortmentofanyrandomcharsit's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah.     Blahblah<blahblah><blahblah>. blah blah
blah blah blahhttps://blahblahblahblahblahblah>
Blahhttps://blahblah 5555 <https://blahblah<>blahblah>blahblahblah  https://blahblahblublublu> https://blah <blah>
blah blah blah<blah>https://blahblahblah<blahblahblah <https://blahblahmissing https://anotherblahblah

Another solution by axe70 on October 5, 2020

Code: Select all


$s = preg_match_all('/( ?)<{1}(ftp|https?):\/\/(.[^<]*?)>{1}( ?)/ui', $string, $matches, PREG_SET_ORDER);
$s = preg_replace('/<{1}(ftp|https?):\/\/(.[^<]*?)>{1}/ui', '#W3JB007PH#', $string, -1 ,$cr);
if($cr > 0){
  foreach($matches as $m => $mv){
   // $mv could be used to manipulate each match as more like, doing magic things here
      $s = preg_replace('/\#W3JB007PH\#/u', ' ' . $matches[$m][2] . '://' . $matches[$m][3] . ' ', $s, 1); // one x time, in order, add spaces before and after placeholder
   // $s = preg_replace('/\#W3JB007PH\#/u', $matches[$m][1] . $matches[$m][2] . '://' . $matches[$m][3] . $matches[$m][4], $s, 1); // one x time, re-add spaces only if found on the string
	 //print_r($s);echo'<br />';
  }
}

 echo $s; 
 
 
This also seems good. We will call this **SUCCESS #2**. Missing anything?

Code: Select all

Today at <-thiscanbeanyassortmentofanyrandomchars-> http://thiscanbeanyassortmentofanyrandomchars it's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah.     Blahblah<blahblah><blahblah>. blah blah
blah blah blah https://blahblahblah blahblahblah>
Blah https://blahblah  5555 <https://blahblah<>blahblah>blahblahblah   https://blahblah blublublu>  https://blah  <blah>
blah blah blah<blah> https://blahblahblah <blahblahblah <https://blahblahmissing  https://anotherblahblah 

Another solution proposed by AbaddonOrmuz on October 5, 2020

Code: Select all

$replaced = preg_replace('#<(https?://[^<>]+)>#', '\1', $text);
Not a good result, because the positions in the string changed: the matching < and > were not replaced with white space. Does anyone see anything else? **Failure #2**

Code: Select all

Today at <-thiscanbeanyassortmentofanyrandomchars->http://thiscanbeanyassortmentofanyrandomcharsit's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah.     Blahblah<blahblah><blahblah>. blah blah
blah blah blahhttps://blahblahblahblahblahblah>
Blahhttps://blahblah 5555 <https://blahblah<>blahblah>blahblahblah  https://blahblahblublublu> https://blah <blah>
blah blah blah<blah>https://blahblahblah<blahblahblah <https://blahblahmissing https://anotherblahblah

Another solution proposed by axe70 on October 5, 2020 at 3:58am

Code: Select all

$s = preg_match_all('/( ?)<{1}(ftp|https?):\/\/(.[^<]*?)>{1}( ?)/ui', $string, $matches, PREG_SET_ORDER);
$s = preg_replace('/<{1}(ftp|https?):\/\/(.[^<]*?)>{1}/ui', '#W3JB007PH#', $string, -1 ,$cr);
if($cr > 0){
  foreach($matches as $m => $mv){
   // $mv could be used to manipulate each match as more like, doing magic things here
      $s = preg_replace('/\#W3JB007PH\#/u', ' ' . $matches[$m][2] . '://' . $matches[$m][3] . ' ', $s, 1); // one x time, in order, add spaces before and after placeholder
   // $s = preg_replace('/\#W3JB007PH\#/u', $matches[$m][1] . $matches[$m][2] . '://' . $matches[$m][3] . $matches[$m][4], $s, 1); // one x time, re-add spaces only if found on the string
	 //print_r($s);echo'<br />';
  }
}

 echo $s;
Seems to be another success. Anything missing?? **SUCCESS #3**

Code: Select all

Today at <-thiscanbeanyassortmentofanyrandomchars-> http://thiscanbeanyassortmentofanyrandomchars it's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah.     Blahblah<blahblah><blahblah>. blah blah
blah blah blah https://blahblahblah blahblahblah>
Blah https://blahblah  5555 <https://blahblah<>blahblah>blahblahblah   https://blahblah blublublu>  https://blah  <blah>
blah blah blah<blah> https://blahblahblah <blahblahblah <https://blahblahmissing  https://anotherblahblah 

In another solution by axe70 on October 5 at 7:21

Code: Select all

$s = preg_match_all('/( ?)<{1}(ftp|http|https?):\/\/(.[^<]*?)>{1}( ?)/', $string, $matches, PREG_SET_ORDER);
$s = preg_replace('/<{1}(ftp|http|https?):\/\/(.[^<]*?)>{1}/', '##my007placeolder##', $string, -1 ,$cr);

if($cr > 0){
 foreach($matches as $m => $mv){
      $s = preg_replace('/\#\#my007placeolder\#\#/', ' ' . $matches[$m][2] . '://' . $matches[$m][3] . ' ', $s, 1); // one x time, in order, adding a space in this case, at right/left
   // $s = preg_replace('/\#\#my007placeolder\#\#/', $matches[$m][1] . $matches[$m][2] . '://' . $matches[$m][3] . $matches[$m][4], $s, 1); //  respect what found on string, re-adding spaces or not based on if there are or not

	 //print_r($s);echo'<br />'; // demo of the "why"
  }
}

echo $s;
Seems to be another success. Anything missing?? **SUCCESS #4**

Code: Select all

Today at <-thiscanbeanyassortmentofanyrandomchars-> http://thiscanbeanyassortmentofanyrandomchars it's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah.     Blahblah<blahblah><blahblah>. blah blah
blah blah blah https://blahblahblah blahblahblah>
Blah https://blahblah  5555 <https://blahblah<>blahblah>blahblahblah   https://blahblah blublublu>  https://blah  <blah>
blah blah blah<blah> https://blahblahblah <blahblahblah <https://blahblahmissing  https://anotherblahblah 



There were a couple of other solutions by axe70, but we will skip those and move to the original solution he proposed, which we had found to be working, applied here to our test string

Code: Select all

$s = preg_match_all('/ ?<(ftp|http|https?):\/\/(.*?)\> ?/', $string, $matches, PREG_SET_ORDER); 
$s = preg_replace('/ ?<(ftp|http|https?):\/\/(.*?)\> ?/', '##my007placeolder##', $string, -1 ,$cr);
$pn = 0;
if($cr > 0){
	foreach($matches as $m){
	 $s = preg_replace('/\#\#my007placeolder\#\#/', ' ' . $matches[$pn][1] . '://' . $matches[$pn][2] . ' ', $s, 1); // one x time, in order
	 //print_r($s);echo'<br />';
	 $pn++;
  }
}

echo $s;
**Failure #3**

Code: Select all

Today at <-thiscanbeanyassortmentofanyrandomchars-> http://thiscanbeanyassortmentofanyrandomchars it's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah.     Blahblah<blahblah><blahblah>. blah blah
blah blah blah https://blahblahblah blahblahblah>
Blah https://blahblah 5555 https://blahblah< blahblah>blahblahblah  https://blahblah blublublu> https://blah <blah>
blah blah blah<blah> https://blahblahblah <blahblahblah https://blahblahmissing <https://anotherblahblah 
This one results in a failure because

Code: Select all

This line in our string starts off
Blah<https://blahblah> 5555 <https://blahblah<>blahblah>blahblahblah  <https://blahblah>blublublu> <https://blah> <blah>

and was changed to
Blah https://blahblah 5555 https://blahblah< blahblah>blahblahblah  https://blahblah blublublu> https://blah <blah>

and this line
blah blah blah<blah><https://blahblahblah><blahblahblah <https://blahblahmissing <https://anotherblahblah>

was incorrectly changed to
blah blah blah<blah> https://blahblahblah <blahblahblah https://blahblahmissing <https://anotherblahblah 
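
A likely cause of this failure is the `(.*?)` part: it accepts any character, including `<`, so a match can start at one `<` and run across other brackets to the next `>`. The sketch below reproduces both symptoms on a shortened line (an assumption, built from the test string) and compares them with a negated character class.

```php
<?php
// Shortened sample built from the two problem spots in the test string.
$line = '<https://blahblah<>blahblah> <https://blahblahmissing <https://anotherblahblah>';

// Pattern from the failing attempt: '.*?' accepts any character, including '<'.
$loose = preg_replace('/ ?<(ftp|http|https?):\/\/(.*?)\> ?/', ' $1://$2 ', $line);

// Variant with a negated class: the URL body may not contain '<' or '>'.
$strict = preg_replace('/<((?:ftp|https?):\/\/[^<>]+)>/', ' $1 ', $line);

echo $loose, PHP_EOL, $strict, PHP_EOL;
```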
*
*
*
*
*
So, it appears that we have four successful options. The question is whether anything is missing in any of these four, and what the consensus is on the best one to use, considering the requirements and scenarios I've tried to explain.

OPTION 1

Code: Select all

$replaced = preg_replace('#( ?)\<{1}(ftp|https?)(:\/\/)(.[^<]*?)\>{1}( ?)#ui', ' \2\3\4 ', $string);
OPTION 2

Code: Select all

$s = preg_match_all('/( ?)<{1}(ftp|https?):\/\/(.[^<]*?)>{1}( ?)/ui', $string, $matches, PREG_SET_ORDER);
$s = preg_replace('/<{1}(ftp|https?):\/\/(.[^<]*?)>{1}/ui', '#W3JB007PH#', $string, -1 ,$cr);
if($cr > 0){
  foreach($matches as $m => $mv){
   // $mv could be used to manipulate each match as more like, doing magic things here
      $s = preg_replace('/\#W3JB007PH\#/u', ' ' . $matches[$m][2] . '://' . $matches[$m][3] . ' ', $s, 1); // one x time, in order, add spaces before and after placeholder
   // $s = preg_replace('/\#W3JB007PH\#/u', $matches[$m][1] . $matches[$m][2] . '://' . $matches[$m][3] . $matches[$m][4], $s, 1); // one x time, re-add spaces only if found on the string
	 //print_r($s);echo'<br />';
  }
}

 echo $s;  
OPTION 3

Code: Select all

$s = preg_match_all('/( ?)<{1}(ftp|https?):\/\/(.[^<]*?)>{1}( ?)/ui', $string, $matches, PREG_SET_ORDER);
$s = preg_replace('/<{1}(ftp|https?):\/\/(.[^<]*?)>{1}/ui', '#W3JB007PH#', $string, -1 ,$cr);
if($cr > 0){
  foreach($matches as $m => $mv){
   // $mv could be used to manipulate each match as more like, doing magic things here
      $s = preg_replace('/\#W3JB007PH\#/u', ' ' . $matches[$m][2] . '://' . $matches[$m][3] . ' ', $s, 1); // one x time, in order, add spaces before and after placeholder
   // $s = preg_replace('/\#W3JB007PH\#/u', $matches[$m][1] . $matches[$m][2] . '://' . $matches[$m][3] . $matches[$m][4], $s, 1); // one x time, re-add spaces only if found on the string
	 //print_r($s);echo'<br />';
  }
}

 echo $s;
OPTION 4

Code: Select all

$s = preg_match_all('/( ?)<{1}(ftp|http|https?):\/\/(.[^<]*?)>{1}( ?)/', $string, $matches, PREG_SET_ORDER);
$s = preg_replace('/<{1}(ftp|http|https?):\/\/(.[^<]*?)>{1}/', '##my007placeolder##', $string, -1 ,$cr);

if($cr > 0){
 foreach($matches as $m => $mv){
      $s = preg_replace('/\#\#my007placeolder\#\#/', ' ' . $matches[$m][2] . '://' . $matches[$m][3] . ' ', $s, 1); // one x time, in order, adding a space in this case, at right/left
   // $s = preg_replace('/\#\#my007placeolder\#\#/', $matches[$m][1] . $matches[$m][2] . '://' . $matches[$m][3] . $matches[$m][4], $s, 1); //  respect what found on string, re-adding spaces or not based on if there are or not

	 //print_r($s);echo'<br />'; // demo of the "why"
  }
}

echo $s;
Thanks, everyone, so much for your time and involvement on this. Hopefully these efforts can be of use to others in the future too.
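
One difference between the options may matter for the "preserve positions" requirement. Option 1 captures an optional space on each side of the match but does not put it back, so when a bracketed URL already has a space next to it, the result comes out shorter. The placeholder options match only the brackets. The sketch below measures this on one fragment of the test string; the `$plain` pattern is a simplified stand-in for the placeholder options, not code from the thread.

```php
<?php
// Fragment of the test string with spaces next to some bracketed URLs.
$string = 'blahblahblah  <https://blahblah>blublublu> <https://blah> <blah>';

// Option 1 from the thread: optional spaces are captured but not put back.
$option1 = preg_replace('#( ?)\<{1}(ftp|https?)(:\/\/)(.[^<]*?)\>{1}( ?)#ui', ' \2\3\4 ', $string);

// Stand-in without the ( ?) groups: only the brackets themselves are replaced.
$plain = preg_replace('#<((?:ftp|https?)://[^<>]+)>#ui', ' \1 ', $string);

printf(
    "original: %d bytes\noption 1: %d bytes\nplain:    %d bytes\n",
    strlen($string),
    strlen($option1),
    strlen($plain)
);
```

On this fragment, Option 1 should come out three bytes shorter than the input, while the bracket-only variant keeps the length unchanged.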
HaioPaio
Registered User
Posts: 124
Joined: Mon Jan 08, 2018 7:39 pm

Re: PHP Regex quick question

Post by HaioPaio »

I'm sitting here in my chair, watching all this, mouth wide open, Scotch in my right hand is getting warm...
I had started looking into Regex some time ago. First time I see how long the way is to the first bivouac on the long way to the summit.
Frightens me a bit.

Re: PHP Regex quick question

Post by axe70 »

The fastest is:
echo $replaced = preg_replace('#( ?)<{1}(ftp|http|https)(:\/\/)(.[^<]*?)>{1}( ?)#ui', ' \2\3\4 ', $string);
or, without adding white spaces where there are none (which is not what you want):
echo $replaced = preg_replace('#( ?)<{1}(ftp|http|https)(:\/\/)(.[^<]*?)>{1}( ?)#ui', '\1\2\3\4\5', $string);

for your requirements.
Can you test the first one and let me know where it fails, if it does?
Please use exactly the first one here; it is slightly modified, because I made a mistake when copying and pasting.
Last edited by axe70 on Tue Oct 06, 2020 1:02 pm, edited 2 times in total.

Re: PHP Regex quick question

Post by JLA »

axe70 wrote:
Mon Oct 05, 2020 9:54 pm
the faster is:
echo $replaced = preg_replace('#( ?)<{1}(ftp|https?)(:\/\/)(.[^<]*?)>{1}( ?)#ui', ' \2\3\4 ', $string);
or without adding white spaces if there are not (that's not what you want):
echo $replaced = preg_replace('#( ?)<{1}(ftp|https?)(:\/\/)(.[^<]*?)>{1}( ?)#ui', '\1\2\3\4\5', $string);

for your requirements.
Can you test the first and let know in case, where it fail.
Take in case exactly the first here, that's little modified, on copy paste, i had do some mistake
on this

Code: Select all

$replaced = preg_replace('#( ?)<{1}(ftp|https?)(:\/\/)(.[^<]*?)>{1}( ?)#ui',  '\1\2\3\4\5', $string);
This fails: Failure

Code: Select all

Today at <-thiscanbeanyassortmentofanyrandomchars->http://thiscanbeanyassortmentofanyrandomcharsit's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah.     Blahblah<blahblah><blahblah>. blah blah
blah blah blahhttps://blahblahblahblahblahblah>
Blahhttps://blahblah 5555 <https://blahblah<>blahblah>blahblahblah  https://blahblahblublublu> https://blah <blah>
blah blah blah<blah>https://blahblahblah<blahblahblah <https://blahblahmissing https://anotherblahblah
Reason (a couple of examples below):

Code: Select all

This line:
Today at <-thiscanbeanyassortmentofanyrandomchars-><http://thiscanbeanyassortmentofanyrandomchars>it\'s raining but it is ok <b style="color:red">for me because</b> i\'m a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>

changed to:
Today at <-thiscanbeanyassortmentofanyrandomchars->http://thiscanbeanyassortmentofanyrandomcharsit's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a> 

Missing white space at beginning of match: chars->http://thiscanbeanyassortmentofanyrandomcharsit's raini

This line:
blah blah blah<https://blahblahblah>blahblahblah>

changed to: blah blah blahhttps://blahblahblahblahblahblah> - white space missing at beginning 
Using this

Code: Select all

$replaced = preg_replace('#( ?)<{1}(ftp|https?)(:\/\/)(.[^<]*?)>{1}( ?)#ui',  ' \2\3\4 ', $string);
Results: Success

Code: Select all

Today at <-thiscanbeanyassortmentofanyrandomchars-> http://thiscanbeanyassortmentofanyrandomchars it's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah.     Blahblah<blahblah><blahblah>. blah blah
blah blah blah https://blahblahblah blahblahblah>
Blah https://blahblah 5555 <https://blahblah<>blahblah>blahblahblah  https://blahblah blublublu> https://blah <blah>
blah blah blah<blah> https://blahblahblah <blahblahblah <https://blahblahmissing https://anotherblahblah 
A note - speed is not important for us on this. The priority is accuracy. Seeing as we have multiple options, we just hope we are not missing anything in the ones that show success.

Thanks again!!!!

Re: PHP Regex quick question

Post by axe70 »

Hi all! First of all, @AbaddonOrmuz showed the fastest way, even though his regexp did not meet all of your requirements, only (and not completely) the one in your first post. But it is the fastest way to do this, as long as you do not need further, more complex manipulation of the string.

I will try to summarize (sorry for my English), so you may understand the power of regexps and why you need to know how they work: you cannot easily do things like this without them. Regexp can be considered a language in itself.

This is the line of code that does what you want, adding white spaces at the start and end of the match (the match being the part of the string you search for and want to change):

Code: Select all

$replaced = preg_replace('#( ?)<{1}(ftp|http|https)(:\/\/)(.[^<]*?)>{1}( ?)#ui',  ' \2\3\4 ', $string);
Put in words, the regexp '#( ?)<{1}(ftp|http|https)(:\/\/)(.[^<]*?)>{1}( ?)#ui' works like this:

The starting and closing characters, # here or / in the regexps I wrote before, are the delimiters of the regexp; what is between them is the regexp itself.
Everything enclosed in (...) captures part of the match. There are 5 such groups in this regexp, so they become 1 2 3 4 5, each containing its part of the match, in order.
( ?) captures a white space that may or may not be present in the string, which is what the ? means. I put one at the start and one at the end, so they are groups 1 and 5. Each will contain a white space or nothing, depending on whether there actually is a white space at the start or end of the matched string.
Using ' \2\3\4 ' as replacement always adds a white space at the start and end of the match (what you want). Using '\1\2\3\4\5' instead, if groups 1 and 5 did not match a white space they contain nothing, and you get what you call the wrong result. It is not wrong as such; it just does not do what you want: a white space that was not found is not added, while ' \2\3\4 ' always adds one at the start and end of each match.
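
The group numbering described above can be inspected directly with `preg_match`. This is a small sketch; the sample text and the variable names `$forced` and `$asFound` are assumptions for illustration.

```php
<?php
$pattern = '#( ?)<{1}(ftp|http|https)(:\/\/)(.[^<]*?)>{1}( ?)#ui';
$sample  = 'Blah<https://blahblah> 5555';   // sample text (assumption)

// $m[0] is the whole match, $m[1] to $m[5] are the five groups in order.
preg_match($pattern, $sample, $m);
print_r($m);

// Fixed spaces on both sides, whatever groups 1 and 5 contained.
$forced = preg_replace($pattern, ' \2\3\4 ', $sample);

// Spaces only where groups 1 and 5 actually found one.
$asFound = preg_replace($pattern, '\1\2\3\4\5', $sample);

echo $forced, PHP_EOL, $asFound, PHP_EOL;
```

Here group 1 is empty (there is no space before the `<`) and group 5 holds the space after the `>`, which is why the second replacement glues the URL onto "Blah".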

Sorry for this basic and maybe unconventional or not totally correct explanation.

There is much more to say about regexps, and anyway I'm not a regexp ninja. Regexps can be complicated and long, and can do magic things with strings. You need to know how they work, because sooner or later you will find that without them you cannot easily solve many problems like this one.
Regexps are a must and, I would like to add, a state of mind.
Regexps can be used for several things, all important: in some cases you only want to know whether something exists in a string; for example, they are often used to detect whether strings are "secure".

If anything more is needed to finalize your code with regular expressions, just let me know!

Re: PHP Regex quick question

Post by axe70 »

I will not keep answering by PM, except in particular cases
... I do not know why this thread is in this forum, but I assume it can be useful anyway ...

Question:

Code: Select all

$string = ' blah blah blah ANYTHINGCANBEHEREhttps://www.google.com/test.jpgANYTHINGCANBEHERE ANYTHING<https://pbs.twimg.com/media/EjgPYArWoAAUgPO?format=jpg&name=900x900>http://www.google.com/test.gifhttp://www.google.com/dodacom?pic=gif&stuff=lalala  httpS://www.google.com/whatever.png';
needs to return this:

Code: Select all

Array
(
    [0] => https://www.google.com/test.jpg
    [1] => https://pbs.twimg.com/media/ejgpyarwoaaugpo?format=jpg
    [2] => http://www.google.com/test.gif
    [3] => http://www.google.com/dodacom?pic=gif
    [4] => https://www.google.com/whatever.png
)
so i answered with this:

Code: Select all

$s = preg_match_all('/( ?)<?(http|https)(:\/\/)(.[^<]*?)(\.?|\?){1}(jpg|jpeg|gif|png|[a-z]=jpg|jpeg|gif|png){1}( ?)/ui', $string, $matches, PREG_SET_ORDER);

if(count($matches) > 0){
  foreach($matches as $m => $mv){
   $res[] = $mv[0][0] == '<' ? strtolower(substr(trim($mv[0]), 1)) : strtolower(trim($mv[0]));
  }
}

echo'<pre>';
print_r($res);
Does anybody know another way to accomplish this?

Question:
In particular, there is one problem that I have not yet understood how to resolve on the fly. I will come back to it.
The regexp above returns the correct matches, but
<https://pbs.twimg.com/media/EjgPYArWoAAUgPO?format=jpg&name=900x900>
is returned with
< at the start, hence my subsequent $res[] = $mv[0][0] == '<' ? strtolower(substr($mv[0], 1)) : strtolower(trim($mv[0]));

Does anybody know how to exclude the < at the start from the match in this particular case?

[EDITED]

Re: PHP Regex quick question

Post by axe70 »

How stupid of me:

Code: Select all

$s = preg_match_all('/(https?)(:\/\/)(.[^<]*?)(\.?|\?){1}(jpg|jpeg|gif|png|[a-z]=jpg|jpeg|gif|png){1}/ui', $string, $matches, PREG_SET_ORDER);

if(count($matches) > 0){
  foreach($matches as $m => $mv){
   //$res[] = $mv[0][0] == '<' ? strtolower(substr(trim($mv[0]), 1)) : strtolower(trim($mv[0])); // so this can be omitted 
   $res[] = strtolower(trim($mv[0]));
  }
}

echo'<pre>';
print_r($res);
The regexp could also drop the [^<], since there is no need for it (at least it seems so to me at the moment).
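
One thing worth checking with this pattern: because the separator group `(\.?|\?)` can match nothing at all, the extension letters may be found anywhere in a URL, not only after a `.` or `=`. The sketch below shows this with a made-up URL (`example.com` and the path are assumptions) that is not an image.

```php
<?php
$pattern = '/(https?)(:\/\/)(.[^<]*?)(\.?|\?){1}(jpg|jpeg|gif|png|[a-z]=jpg|jpeg|gif|png){1}/ui';

// Hypothetical URL that is not an image but contains the letters "png".
$string = 'see http://example.com/pngtastic/page.html for details';

preg_match_all($pattern, $string, $matches, PREG_SET_ORDER);

// Collect the full match (index 0) of every match set.
$found = array_column($matches, 0);
print_r($found);
```

The pattern stops at the letters "png" inside the path and returns a truncated URL. Requiring a literal `.` or `=` directly before the extension is one way to avoid that.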
[EDITED]
Last edited by axe70 on Tue Oct 06, 2020 3:41 pm, edited 1 time in total.

Re: PHP Regex quick question

Post by JLA »

axe70 wrote:
Tue Oct 06, 2020 3:10 pm
What stupid:

Code: Select all

$s = preg_match_all('/(http|https)(:\/\/)(.[^<]*?)(\.?|\?){1}(jpg|jpeg|gif|png|[a-z]=jpg|jpeg|gif|png){1}/ui', $string, $matches, PREG_SET_ORDER);

if(count($matches) > 0){
  foreach($matches as $m => $mv){
   //$res[] = $mv[0][0] == '<' ? strtolower(substr(trim($mv[0]), 1)) : strtolower(trim($mv[0])); // so this can be omitted 
   $res[] = strtolower(trim($mv[0]));
  }
}

echo'<pre>';
print_r($res);
the regexp could remove also this [^<] since there is no need of it (almost at moment it seem to me)
[EDITED]
In this script **currently** on the test string

Code: Select all

$pattern = '`.*?((http|https)://[\w#$&+,\/:;=?@.-]+.jpg)[^\w#$&+,\/:;=?@.-]*?`i';
   preg_match_all($pattern,$string,$images);
   
print_r($images);
Returns matches in $images[1] of the array but now only for jpg

Code: Select all


Array
(
    [0] => Array
        (
            [0] =>  blah blah blah ANYTHINGCANBEHEREhttps://www.google.com/test.jpg
            [1] => ANYTHINGCANBEHERE ANYTHING<https://pbs.twimg.com/media/EjgPYArWoAAUgPO?format=jpg
        )

    [1] => Array
        (
            [0] => https://www.google.com/test.jpg
            [1] => https://pbs.twimg.com/media/EjgPYArWoAAUgPO?format=jpg
        )

    [2] => Array
        (
            [0] => https
            [1] => https
        )

)
So is it possible to improve this, or is there maybe a better way to do it?

From this string it should get all these URLs. We do not want to complicate things. Basically, it should look only at http or https URLs, find the 1st jpg, JPG, gif, GIF, png or PNG, and add that to the array.

Put more simply, this code would pull the following from our string: look for http:// or https:// and go to the 1st .jpg or .JPG or .GIF or .gif or .png or .PNG. If it finds this, it takes that and adds it to the array.

We obviously want to change and improve this to give this result in $images

Code: Select all

https://www.google.com/test.jpg
https://pbs.twimg.com/media/EjgPYArWoAAUgPO?format=jpg
http://www.google.com/test.gif
http://www.google.com/dodacom?pic=gif
httpS://www.google.com/whatever.png
Perhaps, if we do not see .jpg, etc. in the main URL, we should then only look for =jpg or =gif or =png etc. in the query string, if one exists, since there might be URLs that have the characters jpg or gif or png in other kinds of query strings and that are not images but other valid URLs, which we do not want to capture or disturb for later code in a script.
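
One possible way to get that list, sketched under a few assumptions: the URL body may not contain white space, `<` or `>`; the match is lazy, so it ends at the first `.ext` or `=ext`; and "jpeg" is accepted alongside jpg, gif and png. It treats `.ext` and `=ext` the same wherever they occur, so it does not yet tell image query parameters apart from other query strings.

```php
<?php
$string = ' blah blah blah ANYTHINGCANBEHEREhttps://www.google.com/test.jpgANYTHINGCANBEHERE ANYTHING<https://pbs.twimg.com/media/EjgPYArWoAAUgPO?format=jpg&name=900x900>http://www.google.com/test.gifhttp://www.google.com/dodacom?pic=gif&stuff=lalala  httpS://www.google.com/whatever.png';

// Scheme, then as few non-space, non-bracket characters as possible,
// ending at the first ".ext" or "=ext" (ext = jpg, jpeg, gif or png).
// The "i" flag makes both the scheme and the extension case-insensitive.
$pattern = '~https?://[^\s<>]*?[.=](?:jpe?g|gif|png)~i';

preg_match_all($pattern, $string, $m);
$images = $m[0];   // full matches only, in order of appearance

print_r($images);
```

The matches keep their original letter case; wrapping each one in `strtolower()` would give the lowercased form shown earlier in the thread.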
User avatar
JLA
Registered User
Posts: 589
Joined: Tue Nov 16, 2004 5:23 pm
Location: USA
Name: JLA FORUMS
Contact:

Re: PHP Regex quick question

Post by JLA »

I made a change to the test string:

Code: Select all

$string = ' blah blah blah ANYTHINGCANBEHEREhttps://www.google.com/test.jpgANYTHINGCANBEHERE https://www.google.com/test.php http://www.google.com/test.JPG ANYTHING<https://pbs.twimg.com/media/EjgPYArWoAAUgPO?format=jpg&name=900x900>http://www.google.com/test.gifhttp://www.google.com/dodacom?pic=gif&stuff=lalala  httpS://www.google.com/whatever.png';
Ran your regex as follows:

Code: Select all

$pattern = '/(http|https)(:\/\/)(.[^<]*?)(\.?|\?){1}(jpg|jpeg|gif|png|[a-z]=jpg|jpeg|gif|png){1}/ui';

 preg_match_all($pattern,$string,$images);
   
print_r($images);
This gives a bad result: match [1] runs from the test.php URL through to test.JPG.

Code: Select all

Array
(
    [0] => Array
        (
            [0] => https://www.google.com/test.jpg
            [1] => https://www.google.com/test.php http://www.google.com/test.JPG
            [2] => https://pbs.twimg.com/media/EjgPYArWoAAUgPO?format=jpg
            [3] => http://www.google.com/test.gif
            [4] => http://www.google.com/dodacom?pic=gif
            [5] => httpS://www.google.com/whatever.png
        )

    [1] => Array
        (
            [0] => https
            [1] => https
            [2] => https
            [3] => http
            [4] => http
            [5] => httpS
        )

    [2] => Array
        (
            [0] => ://
            [1] => ://
            [2] => ://
            [3] => ://
            [4] => ://
            [5] => ://
        )

    [3] => Array
        (
            [0] => www.google.com/test
            [1] => www.google.com/test.php http://www.google.com/test
            [2] => pbs.twimg.com/media/EjgPYArWoAAUgPO?forma
            [3] => www.google.com/test
            [4] => www.google.com/dodacom?pic=
            [5] => www.google.com/whatever
        )

    [4] => Array
        (
            [0] => .
            [1] => .
            [2] => 
            [3] => .
            [4] => 
            [5] => .
        )

    [5] => Array
        (
            [0] => jpg
            [1] => JPG
            [2] => t=jpg
            [3] => gif
            [4] => gif
            [5] => png
        )

)
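Two things in that pattern seem to explain the output above. The part `.[^<]*?` may cross whitespace, so a URL with no image extension keeps going into the next URL. And in `(jpg|jpeg|gif|png|[a-z]=jpg|jpeg|gif|png)` the `[a-z]=` prefix belongs only to the following `jpg`, which is why group 5 shows `t=jpg` and group 3 ends in `forma`. A minimal reproduction of the first point, using a made-up two-URL string and simplified patterns (both are assumptions for illustration only):

```php
<?php
// Two URLs separated by a space; only the second one is an image.
$text = 'https://www.google.com/test.php http://www.google.com/test.JPG';

// [^<] also matches the space, so the lazy run can cross into the next URL.
preg_match('~https?://[^<]*?\.(?:jpg|gif|png)~i', $text, $loose);

// [^\s<] stops at whitespace, so the .php URL cannot match at all.
preg_match('~https?://[^\s<]*?\.(?:jpg|gif|png)~i', $text, $strict);

echo $loose[0], "\n";   // the whole string, both URLs glued together
echo $strict[0], "\n";  // only the test.JPG URL
```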
axe70
Registered User
Posts: 279
Joined: Sun Nov 17, 2002 10:55 am
Location: Italy
Contact:

Re: PHP Regex quick question

Post by axe70 »

String:

Code: Select all

$string = ' blah blah blah ANYTHINGCANBEHEREhttps://www.google.com/test.jpgANYTHINGCANBEHERE ANYTHING<https://pbs.twimg.com/media/EjgPYArWoAAUgPO?format=jpg&name=900x900>http://www.google.com/test.gifhttp://www.google.com/dodacom?pic=gif&stuff=lalala  httpS://www.google.com/whatever.png';

Code: Select all

$s = preg_match_all('/(https?)(:\/\/)(.[^<]*?)(\.?|\?){1}(jpg|jpeg|gif|png|[a-z]=jpg|jpeg|gif|png){1}/ui', $string, $matches, PREG_SET_ORDER);
echo'<pre>';
print_r($matches);
returns:

Code: Select all

Array
(
    [0] => Array
        (
            [0] => https://www.google.com/test.jpg
            [1] => https
            [2] => ://
            [3] => www.google.com/test
            [4] => .
            [5] => jpg
        )

    [1] => Array
        (
            [0] => https://pbs.twimg.com/media/EjgPYArWoAAUgPO?format=jpg
            [1] => https
            [2] => ://
            [3] => pbs.twimg.com/media/EjgPYArWoAAUgPO?forma
            [4] => 
            [5] => t=jpg
        )

    [2] => Array
        (
            [0] => http://www.google.com/test.gif
            [1] => http
            [2] => ://
            [3] => www.google.com/test
            [4] => .
            [5] => gif
        )

    [3] => Array
        (
            [0] => http://www.google.com/dodacom?pic=gif
            [1] => http
            [2] => ://
            [3] => www.google.com/dodacom?pic=
            [4] => 
            [5] => gif
        )

    [4] => Array
        (
            [0] => httpS://www.google.com/whatever.png
            [1] => httpS
            [2] => ://
            [3] => www.google.com/whatever
            [4] => .
            [5] => png
        )

)
The full matches are in [i][0], so then:

Code: Select all

$res = []; // initialise first, so print_r($res) is safe even when nothing matched
foreach($matches as $m => $mv){
   // $mv[0] is the full match for this URL
   $res[] = strtolower(trim($mv[0]));
}
result:

Code: Select all

Array
(
    [0] => https://www.google.com/test.jpg
    [1] => https://pbs.twimg.com/media/ejgpyarwoaaugpo?format=jpg
    [2] => http://www.google.com/test.gif
    [3] => http://www.google.com/dodacom?pic=gif
    [4] => https://www.google.com/whatever.png
)
I do not know at the moment whether something shorter or more precise exists to accomplish this.
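One side effect visible in the result above: strtolower() also lowercases the path, so EjgPYArWoAAUgPO becomes ejgpyarwoaaugpo, and URL paths can be case-sensitive on the server. If only the scheme and host should be normalised, something like the following might do. The function name lowercase_scheme_and_host is hypothetical, and the sketch assumes the URL has no user:password@ part.

```php
<?php
// Lowercase only "scheme://host" and leave path and query string untouched.
function lowercase_scheme_and_host(string $url): string
{
    return preg_replace_callback(
        '~^[a-z][a-z0-9+.-]*://[^/?#]*~i',   // scheme, "://", then everything up to /, ? or #
        function (array $m): string {
            return strtolower($m[0]);
        },
        $url
    );
}

echo lowercase_scheme_and_host('httpS://www.google.com/whatever.png'), "\n";
echo lowercase_scheme_and_host('https://pbs.twimg.com/media/EjgPYArWoAAUgPO?format=jpg'), "\n";
```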

[EDITED]
Last edited by axe70 on Tue Oct 06, 2020 4:11 pm, edited 2 times in total.
JLA
Registered User
Posts: 589
Joined: Tue Nov 16, 2004 5:23 pm
Location: USA
Name: JLA FORUMS
Contact:

Re: PHP Regex quick question

Post by JLA »

OK, I tested this and seem to have found a fault in it.

Using this updated test string

Code: Select all

$string = ' blah blah blah ANYTHINGCANBEHEREhttps://www.google.com/test.jpgANYTHINGCANBEHERE https://www.google.com/test.php http://www.google.com/test.JPG ANYTHING<https://pbs.twimg.com/media/EjgPYArWoAAUgPO?format=jpg&name=900x900>http://www.google.com/test.gifhttp://www.google.com/dodacom?pic=gif&stuff=lalala  httpS://www.google.com/whatever.png';
Using your code

Code: Select all

$s = preg_match_all('/(https?)(:\/\/)(.[^<]*?)(\.?|\?){1}(jpg|jpeg|gif|png|[a-z]=jpg|jpeg|gif|png){1}/ui', $string, $matches, PREG_SET_ORDER);
echo'<pre>';
print_r($matches);

if(count($matches) > 0){
  foreach($matches as $m => $mv){
   $res[] = strtolower(trim($mv[0]));
  }
}

print_r ($res);
This gives failed results; entry [1] spans two URLs:

Code: Select all

<pre>Array
(
    [0] => Array
        (
            [0] => https://www.google.com/test.jpg
            [1] => https
            [2] => ://
            [3] => www.google.com/test
            [4] => .
            [5] => jpg
        )

    [1] => Array
        (
            [0] => https://www.google.com/test.php http://www.google.com/test.JPG
            [1] => https
            [2] => ://
            [3] => www.google.com/test.php http://www.google.com/test
            [4] => .
            [5] => JPG
        )

    [2] => Array
        (
            [0] => https://pbs.twimg.com/media/EjgPYArWoAAUgPO?format=jpg
            [1] => https
            [2] => ://
            [3] => pbs.twimg.com/media/EjgPYArWoAAUgPO?forma
            [4] => 
            [5] => t=jpg
        )

    [3] => Array
        (
            [0] => http://www.google.com/test.gif
            [1] => http
            [2] => ://
            [3] => www.google.com/test
            [4] => .
            [5] => gif
        )

    [4] => Array
        (
            [0] => http://www.google.com/dodacom?pic=gif
            [1] => http
            [2] => ://
            [3] => www.google.com/dodacom?pic=
            [4] => 
            [5] => gif
        )

    [5] => Array
        (
            [0] => httpS://www.google.com/whatever.png
            [1] => httpS
            [2] => ://
            [3] => www.google.com/whatever
            [4] => .
            [5] => png
        )

)
Array
(
    [0] => https://www.google.com/test.jpg
    [1] => https://www.google.com/test.php http://www.google.com/test.jpg
    [2] => https://pbs.twimg.com/media/ejgpyarwoaaugpo?format=jpg
    [3] => http://www.google.com/test.gif
    [4] => http://www.google.com/dodacom?pic=gif
    [5] => https://www.google.com/whatever.png
)
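One way to avoid an entry like [1] above might be to work in two passes: first cut the text into candidate URLs, then trim each candidate down to its image part and drop the ones that have none. This is only a sketch under assumptions: the function name extract_image_urls is made up, whitespace, "<" and ">" are assumed to end a URL, and a new http:// or https:// is assumed to start the next one even with no space in between.

```php
<?php
function extract_image_urls(string $text): array
{
    // Pass 1: a candidate runs from http(s):// up to whitespace, "<", ">",
    // or the start of another URL (the negative lookahead splits glued URLs).
    preg_match_all('~https?://(?:(?!https?://)[^\s<>])+~i', $text, $found);

    // Pass 2: keep the candidate up to the first image extension before any "?",
    // or failing that up to the first "=jpg" style value in the query string.
    $trim = '~^(?:[^?]*?\.(?:jpe?g|gif|png)|[^?]+\?.*?=(?:jpe?g|gif|png))~i';

    $images = [];
    foreach ($found[0] as $candidate) {
        if (preg_match($trim, $candidate, $m)) {
            $images[] = $m[0];   // candidates without an image part are skipped
        }
    }
    return $images;
}

// The updated test string from this post.
$string = ' blah blah blah ANYTHINGCANBEHEREhttps://www.google.com/test.jpgANYTHINGCANBEHERE https://www.google.com/test.php http://www.google.com/test.JPG ANYTHING<https://pbs.twimg.com/media/EjgPYArWoAAUgPO?format=jpg&name=900x900>http://www.google.com/test.gifhttp://www.google.com/dodacom?pic=gif&stuff=lalala  httpS://www.google.com/whatever.png';

$images = extract_image_urls($string);
print_r($images);
```

With this string the test.php URL should drop out and test.JPG should come back as its own entry, with the original letter case kept.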