Hi everyone, lots of activity on this I see. Sorry for the delay in replying. I am currently testing all the proposed solutions. First, a little more info to help clarify the situation. The project I'm working on processes a string that contains user input.
1. This input can at times include https URLs (I imagine http and ftp too), not href URLs, that are enclosed in < and > symbols. URLs can sometimes include query strings, so we must account for that. As long as the URL has valid characters, it should be fine, provided it has a < at the start and a > at the end and meets the other conditions. When we remove a < or >, it should be replaced with a single space to preserve its position and the position of all the other data in the string.
2. This user input can also include < and > symbols, for any reason, that are not related to these enclosed URLs. We have to account for this and not touch these symbols or their positions in the string.
3. We also have to account for user error and possibly malicious/disruptive user input. For example, a user could by chance enter any of the following (shown with the desired result):
<https://blahblahblah> - this is a match, we want to remove the < and > from the URL
http://blahblahblah> this is not a match because the URL does not start with a < so we do not touch this
<<http://blahblahblah>> - this is a match, but we only replace the inner < and > (the ones touching the URL) and leave the outer ones, which results in < http://blahblahblah >
<http://blahblah<>blahblah> - this is not a match because the URL contains an invalid character (another <) between the 1st < and the 1st >, so we do not touch this
4. Another thing to consider: even though we are speaking of valid URLs, at this point it complicates things too much to check whether the actual URL is valid (having something like a .com, .net, .eu, etc.). Our test string below doesn't really cover this. If a user screws up in that way, that is their problem. We only want to make sure that what we look for starts with <https:// and that between the <https:// and the 1st > there are only characters that would normally be valid in a URL or a non-malicious query string. Does checking for a non-malicious query string again complicate things too much? If so, then perhaps between <https:// and the 1st > we just want to make sure that there is not another <; if there is, it would not be a match. Open to opinions on this.
These are some examples and hopefully the intent is clear.
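For what it's worth, the rules in points 1 to 4 can be sketched as a single preg_replace. This is only a minimal sketch, not one of the solutions proposed in this thread: the function name strip_url_brackets() is made up for illustration, and it assumes the simpler rule from point 4 (no <, > or whitespace between the scheme and the closing >) rather than a full check for valid URL characters.

```php
<?php
// Hypothetical helper (name invented for this sketch).
// Replaces the < and > around an enclosed URL with one space each,
// so every other character keeps its position in the string.
function strip_url_brackets(string $input): string
{
    // <          literal opening bracket
    // (...)      scheme (http, https or ftp) followed by ://
    // [^<>\s]+   one or more characters that are not <, > or whitespace (assumption)
    // >          literal closing bracket
    // Fall back to the untouched input if preg_replace reports an error (null).
    return preg_replace('#<((?:https?|ftp)://[^<>\s]+)>#i', ' $1 ', $input) ?? $input;
}

// The four cases from point 3.
$examples = [
    '<https://blahblahblah>',
    'http://blahblahblah>',
    '<<http://blahblahblah>>',
    '<http://blahblah<>blahblah>',
];

foreach ($examples as $example) {
    echo $example, ' => ', strip_url_brackets($example), PHP_EOL;
}
```

Only the first and third examples should change; the other two come back untouched, and the length of each string stays the same.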
So now on to the test string:
$string = 'Today at <-thiscanbeanyassortmentofanyrandomchars-><http://thiscanbeanyassortmentofanyrandomchars>it\'s raining but it is ok <b style="color:red">for me because</b> i\'m a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah. Blahblah<blahblah><blahblah>. blah blah
blah blah blah<https://blahblahblah>blahblahblah>
Blah<https://blahblah> 5555 <https://blahblah<>blahblah>blahblahblah <https://blahblah>blublublu> <https://blah> <blah>
blah blah blah<blah><https://blahblahblah><blahblahblah <https://blahblahmissing <https://anotherblahblah>';
We will go from the most recent solution to the earliest.
On October 5, 2020 at 9:03am axe70 suggested
$replaced = preg_replace('#( ?)\<{1}(ftp|https?)(:\/\/)(.[^<]*?)\>{1}( ?)#ui', ' \2\3\4 ', $string);
which appears to give a good result. We will call this
**SUCCESS #1**
Today at <-thiscanbeanyassortmentofanyrandomchars-> http://thiscanbeanyassortmentofanyrandomchars it's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah. Blahblah<blahblah><blahblah>. blah blah
blah blah blah https://blahblahblah blahblahblah>
Blah https://blahblah 5555 <https://blahblah<>blahblah>blahblahblah https://blahblah blublublu> https://blah <blah>
blah blah blah<blah> https://blahblahblah <blahblahblah <https://blahblahmissing https://anotherblahblah
Another solution from the same post:
$replaced = preg_replace('#( ?)\<{1}(ftp|https?)(:\/\/)(.[^<]*?)\>{1}( ?)#ui', '\1\2\3\4\5', $string);
Not a good result: the positions in the string changed because the matching < and > were not replaced with whitespace. Does anyone see anything else?
**Failure #1**
Today at <-thiscanbeanyassortmentofanyrandomchars->http://thiscanbeanyassortmentofanyrandomcharsit's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah. Blahblah<blahblah><blahblah>. blah blah
blah blah blahhttps://blahblahblahblahblahblah>
Blahhttps://blahblah 5555 <https://blahblah<>blahblah>blahblahblah https://blahblahblublublu> https://blah <blah>
blah blah blah<blah>https://blahblahblah<blahblahblah <https://blahblahmissing https://anotherblahblah
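A quick way to check the "positions must not change" requirement without comparing the output by eye is sketched below. The helper name positions_preserved() is hypothetical, and the sketch assumes a plain byte-by-byte comparison is good enough for the test string.

```php
<?php
// Hypothetical helper (name invented for this sketch): returns true when
// $after has the same length as $before and the only characters that differ
// are a < or > in $before that became a space in $after.
function positions_preserved(string $before, string $after): bool
{
    if (strlen($before) !== strlen($after)) {
        return false; // something was removed or added, so positions shifted
    }
    for ($i = 0, $n = strlen($before); $i < $n; $i++) {
        if ($before[$i] === $after[$i]) {
            continue;
        }
        $wasBracket = ($before[$i] === '<' || $before[$i] === '>');
        if (!$wasBracket || $after[$i] !== ' ') {
            return false;
        }
    }
    return true;
}

$before = 'blah<https://blahblahblah>blah';

// Replacement from the solution above that only puts the captured groups back.
$removed = preg_replace('#( ?)\<{1}(ftp|https?)(:\/\/)(.[^<]*?)\>{1}( ?)#ui', '\1\2\3\4\5', $before);

var_dump(positions_preserved($before, $removed));                         // bool(false)
var_dump(positions_preserved($before, 'blah https://blahblahblah blah')); // bool(true)
```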
Another solution by axe70 on October 5, 2020
$s = preg_match_all('/( ?)<{1}(ftp|https?):\/\/(.[^<]*?)>{1}( ?)/ui', $string, $matches, PREG_SET_ORDER);
$s = preg_replace('/<{1}(ftp|https?):\/\/(.[^<]*?)>{1}/ui', '#W3JB007PH#', $string, -1 ,$cr);
if($cr > 0){
foreach($matches as $m => $mv){
// $mv could be used to manipulate each match as more like, doing magic things here
$s = preg_replace('/\#W3JB007PH\#/u', ' ' . $matches[$m][2] . '://' . $matches[$m][3] . ' ', $s, 1); // one x time, in order, add spaces before and after placeholder
// $s = preg_replace('/\#W3JB007PH\#/u', $matches[$m][1] . $matches[$m][2] . '://' . $matches[$m][3] . $matches[$m][4], $s, 1); // one x time, re-add spaces only if found on the string
//print_r($s);echo'<br />';
}
}
echo $s;
This also seems good. We will call this
**SUCCESS #2** Missing anything?
Today at <-thiscanbeanyassortmentofanyrandomchars-> http://thiscanbeanyassortmentofanyrandomchars it's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah. Blahblah<blahblah><blahblah>. blah blah
blah blah blah https://blahblahblah blahblahblah>
Blah https://blahblah 5555 <https://blahblah<>blahblah>blahblahblah https://blahblah blublublu> https://blah <blah>
blah blah blah<blah> https://blahblahblah <blahblahblah <https://blahblahmissing https://anotherblahblah
Another solution proposed by AbaddonOrmuz on October 5, 2020
$replaced = preg_replace('#<(https?://[^<>]+)>#', '\1', $text);
Not a good result: the positions in the string changed because the matching < and > were not replaced with whitespace. Does anyone see anything else?
**Failure #2**
Today at <-thiscanbeanyassortmentofanyrandomchars->http://thiscanbeanyassortmentofanyrandomcharsit's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah. Blahblah<blahblah><blahblah>. blah blah
blah blah blahhttps://blahblahblahblahblahblah>
Blahhttps://blahblah 5555 <https://blahblah<>blahblah>blahblahblah https://blahblahblublublu> https://blah <blah>
blah blah blah<blah>https://blahblahblah<blahblahblah <https://blahblahmissing https://anotherblahblah
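The pattern itself may still be usable here; it is only the replacement that drops the brackets instead of swapping them for spaces. A possible variation (not what was posted) is the same pattern with a space on each side of \1, sketched here on one line of the test string. Note that the pattern as posted does not cover ftp and is case-sensitive.

```php
<?php
$text = 'blah blah blah<https://blahblahblah>blahblahblah>';

// Same pattern as posted; only the replacement differs (' \1 ' instead of '\1').
$replaced = preg_replace('#<(https?://[^<>]+)>#', ' \1 ', $text);

echo $replaced, PHP_EOL;                       // blah blah blah https://blahblahblah blahblahblah>
var_dump(strlen($replaced) === strlen($text)); // bool(true)
```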
Another solution proposed by axe70 on October 5, 2020 at 3:58am
$s = preg_match_all('/( ?)<{1}(ftp|https?):\/\/(.[^<]*?)>{1}( ?)/ui', $string, $matches, PREG_SET_ORDER);
$s = preg_replace('/<{1}(ftp|https?):\/\/(.[^<]*?)>{1}/ui', '#W3JB007PH#', $string, -1 ,$cr);
if($cr > 0){
foreach($matches as $m => $mv){
// $mv could be used to manipulate each match as more like, doing magic things here
$s = preg_replace('/\#W3JB007PH\#/u', ' ' . $matches[$m][2] . '://' . $matches[$m][3] . ' ', $s, 1); // one x time, in order, add spaces before and after placeholder
// $s = preg_replace('/\#W3JB007PH\#/u', $matches[$m][1] . $matches[$m][2] . '://' . $matches[$m][3] . $matches[$m][4], $s, 1); // one x time, re-add spaces only if found on the string
//print_r($s);echo'<br />';
}
}
echo $s;
Seems to be another success. Anything missing??
**SUCCESS #3**
Today at <-thiscanbeanyassortmentofanyrandomchars-> http://thiscanbeanyassortmentofanyrandomchars it's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah. Blahblah<blahblah><blahblah>. blah blah
blah blah blah https://blahblahblah blahblahblah>
Blah https://blahblah 5555 <https://blahblah<>blahblah>blahblahblah https://blahblah blublublu> https://blah <blah>
blah blah blah<blah> https://blahblahblah <blahblahblah <https://blahblahmissing https://anotherblahblah
Another solution by axe70 on October 5 at 7:21
$s = preg_match_all('/( ?)<{1}(ftp|http|https?):\/\/(.[^<]*?)>{1}( ?)/', $string, $matches, PREG_SET_ORDER);
$s = preg_replace('/<{1}(ftp|http|https?):\/\/(.[^<]*?)>{1}/', '##my007placeolder##', $string, -1 ,$cr);
if($cr > 0){
foreach($matches as $m => $mv){
$s = preg_replace('/\#\#my007placeolder\#\#/', ' ' . $matches[$m][2] . '://' . $matches[$m][3] . ' ', $s, 1); // one x time, in order, adding a space in this case, at right/left
// $s = preg_replace('/\#\#my007placeolder\#\#/', $matches[$m][1] . $matches[$m][2] . '://' . $matches[$m][3] . $matches[$m][4], $s, 1); // respect what found on string, re-adding spaces or not based on if there are or not
//print_r($s);echo'<br />'; // demo of the "why"
}
}
echo $s;
Seems to be another success. Anything missing??
**SUCCESS #4**
Today at <-thiscanbeanyassortmentofanyrandomchars-> http://thiscanbeanyassortmentofanyrandomchars it's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah. Blahblah<blahblah><blahblah>. blah blah
blah blah blah https://blahblahblah blahblahblah>
Blah https://blahblah 5555 <https://blahblah<>blahblah>blahblahblah https://blahblah blublublu> https://blah <blah>
blah blah blah<blah> https://blahblahblah <blahblahblah <https://blahblahmissing https://anotherblahblah
There were a couple of other solutions by axe70, but we will skip those and move to the original solution he proposed, which we had found to be working, now applied to our test string here:
$s = preg_match_all('/ ?<(ftp|http|https?):\/\/(.*?)\> ?/', $string, $matches, PREG_SET_ORDER);
$s = preg_replace('/ ?<(ftp|http|https?):\/\/(.*?)\> ?/', '##my007placeolder##', $string, -1 ,$cr);
$pn = 0;
if($cr > 0){
foreach($matches as $m){
$s = preg_replace('/\#\#my007placeolder\#\#/', ' ' . $matches[$pn][1] . '://' . $matches[$pn][2] . ' ', $s, 1); // one x time, in order
//print_r($s);echo'<br />';
$pn++;
}
}
echo $s;
**Failure #3**
Today at <-thiscanbeanyassortmentofanyrandomchars-> http://thiscanbeanyassortmentofanyrandomchars it's raining but it is ok <b style="color:red">for me because</b> i'm a <a href="https://www.google.com/search?q=fish&oq=snail">snail</a>
blah blah. Blahblah<blahblah><blahblah>. blah blah
blah blah blah https://blahblahblah blahblahblah>
Blah https://blahblah 5555 https://blahblah< blahblah>blahblahblah https://blahblah blublublu> https://blah <blah>
blah blah blah<blah> https://blahblahblah <blahblahblah https://blahblahmissing <https://anotherblahblah
This one results in a failure because (.*?) accepts any character, including another <, so a match can run past a second < up to the next >.
This line in our string starts off as
Blah<https://blahblah> 5555 <https://blahblah<>blahblah>blahblahblah <https://blahblah>blublublu> <https://blah> <blah>
and was incorrectly changed to
Blah https://blahblah 5555 https://blahblah< blahblah>blahblahblah https://blahblah blublublu> https://blah <blah>
and this line
blah blah blah<blah><https://blahblahblah><blahblahblah <https://blahblahmissing <https://anotherblahblah>
was incorrectly changed to
blah blah blah<blah> https://blahblahblah <blahblahblah https://blahblahmissing <https://anotherblahblah
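The difference between this original pattern and the later ones can be reproduced on just the problem fragment. A small sketch (the variable names are only for illustration) comparing the (.*?) group used here with the (.[^<]*?) group from the later solutions:

```php
<?php
$fragment = '5555 <https://blahblah<>blahblah>blahblahblah';

// Original pattern: (.*?) accepts any character, including another <.
$loose = '/ ?<(ftp|http|https?):\/\/(.*?)\> ?/';

// Later pattern: [^<] refuses to step over another <, so no match is possible here.
$strict = '/( ?)<{1}(ftp|https?):\/\/(.[^<]*?)>{1}( ?)/ui';

var_dump(preg_match($loose, $fragment, $m1));  // int(1)
var_dump($m1[2]);                              // string(9) "blahblah<"
var_dump(preg_match($strict, $fragment, $m2)); // int(0)
```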
*
*
*
*
*
So, it appears that we have four successful options (although Options 2 and 3 below are the same code as pasted here). The question is: is anything missing in any of these, and what is the consensus on the best one to use considering the requirements and scenarios I've tried to explain?
OPTION 1
$replaced = preg_replace('#( ?)\<{1}(ftp|https?)(:\/\/)(.[^<]*?)\>{1}( ?)#ui', ' \2\3\4 ', $string);
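One thing that might be worth double-checking for this option: the ( ?) groups make an existing space next to the brackets part of the match, and the replacement ' \2\3\4 ' does not put groups 1 and 5 back. A small sketch to test it (the sample string is made up for illustration):

```php
<?php
$sample = 'a <https://x> b'; // made-up input with a space on both sides of the brackets

$replaced = preg_replace(
    '#( ?)\<{1}(ftp|https?)(:\/\/)(.[^<]*?)\>{1}( ?)#ui',
    ' \2\3\4 ',
    $sample
);

var_dump($replaced);         // string(13) "a https://x b"
var_dump(strlen($sample));   // int(15)
var_dump(strlen($replaced)); // int(13)
```

If that matters for the position requirement, a replacement such as '\1 \2\3\4 \5' would keep the captured spaces as well. That is only a suggestion and was not proposed in the thread.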
OPTION 2
$s = preg_match_all('/( ?)<{1}(ftp|https?):\/\/(.[^<]*?)>{1}( ?)/ui', $string, $matches, PREG_SET_ORDER);
$s = preg_replace('/<{1}(ftp|https?):\/\/(.[^<]*?)>{1}/ui', '#W3JB007PH#', $string, -1 ,$cr);
if($cr > 0){
foreach($matches as $m => $mv){
// $mv could be used to manipulate each match as more like, doing magic things here
$s = preg_replace('/\#W3JB007PH\#/u', ' ' . $matches[$m][2] . '://' . $matches[$m][3] . ' ', $s, 1); // one x time, in order, add spaces before and after placeholder
// $s = preg_replace('/\#W3JB007PH\#/u', $matches[$m][1] . $matches[$m][2] . '://' . $matches[$m][3] . $matches[$m][4], $s, 1); // one x time, re-add spaces only if found on the string
//print_r($s);echo'<br />';
}
}
echo $s;
OPTION 3
$s = preg_match_all('/( ?)<{1}(ftp|https?):\/\/(.[^<]*?)>{1}( ?)/ui', $string, $matches, PREG_SET_ORDER);
$s = preg_replace('/<{1}(ftp|https?):\/\/(.[^<]*?)>{1}/ui', '#W3JB007PH#', $string, -1 ,$cr);
if($cr > 0){
foreach($matches as $m => $mv){
// $mv could be used to manipulate each match as more like, doing magic things here
$s = preg_replace('/\#W3JB007PH\#/u', ' ' . $matches[$m][2] . '://' . $matches[$m][3] . ' ', $s, 1); // one x time, in order, add spaces before and after placeholder
// $s = preg_replace('/\#W3JB007PH\#/u', $matches[$m][1] . $matches[$m][2] . '://' . $matches[$m][3] . $matches[$m][4], $s, 1); // one x time, re-add spaces only if found on the string
//print_r($s);echo'<br />';
}
}
echo $s;
OPTION 4
$s = preg_match_all('/( ?)<{1}(ftp|http|https?):\/\/(.[^<]*?)>{1}( ?)/', $string, $matches, PREG_SET_ORDER);
$s = preg_replace('/<{1}(ftp|http|https?):\/\/(.[^<]*?)>{1}/', '##my007placeolder##', $string, -1 ,$cr);
if($cr > 0){
foreach($matches as $m => $mv){
$s = preg_replace('/\#\#my007placeolder\#\#/', ' ' . $matches[$m][2] . '://' . $matches[$m][3] . ' ', $s, 1); // one x time, in order, adding a space in this case, at right/left
// $s = preg_replace('/\#\#my007placeolder\#\#/', $matches[$m][1] . $matches[$m][2] . '://' . $matches[$m][3] . $matches[$m][4], $s, 1); // respect what found on string, re-adding spaces or not based on if there are or not
//print_r($s);echo'<br />'; // demo of the "why"
}
}
echo $s;
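Options 2 to 4 all use a placeholder so that each match can be handled one at a time. As an alternative that was not proposed in this thread, PHP's preg_replace_callback() hands each match to a function in a single pass, which would avoid the placeholder. A minimal sketch, reusing the Option 2 pattern without the ( ?) groups; the sample string is a shortened piece of the test string:

```php
<?php
$string = 'Blah<https://blahblah> 5555 <https://blahblah<>blahblah>blahblahblah <https://blah> <blah>';

$result = preg_replace_callback(
    '/<(ftp|https?):\/\/(.[^<]*?)>/ui',
    function (array $m): string {
        // $m[1] is the scheme, $m[2] is everything after ://
        // Each match could be inspected or changed here before it is put back.
        return ' ' . $m[1] . '://' . $m[2] . ' ';
    },
    $string
);

echo $result, PHP_EOL;
var_dump(strlen($result) === strlen($string)); // bool(true)
```

Whether this is nicer than the placeholder loop is a matter of taste; it should match the same URLs, since the pattern is unchanged.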
Thanks, everyone, so much for your time and involvement on this. Hopefully these efforts can be of use to others in the future too.