BBCode-safe truncating of text

fberci
Translator
Posts: 77
Joined: Sat Jan 01, 2005 12:09 pm
Location: Budapest, Hungary
Name: Bertalan Fodor

BBCode-safe truncating of text

Post by fberci »

I plan to write an Advanced Syndication MOD, and as part of that I have written the following function. It truncates text to a given length and closes any BBCode tags left open, so the parsed result is always valid XHTML. I have already seen other implementations of this kind of function (they were not as strict), and I think others could use this one as well, so I am sharing it.

Code:

/**
* BBCode-safe truncating of text
*
* @param string $text Text containing BBCode tags to be truncated
* @param string $uid BBCode uid
* @param int $max_length Text length limit
* @param string $bitfield BBCode bitfield (optional)
* @param bool $enable_bbcode Whether BBCode is enabled (true by default)
* @return string
*/
function trim_text ($text, $uid, $max_length, $bitfield = '', $enable_bbcode = true)
{
	// If there is any custom BBCode that can have space in its argument, turn this on, 
	// but else I suggest turning this off as it adds one additional (cached) SQL query
	$check_custom_bbcodes = true;

	if ($enable_bbcode && $check_custom_bbcodes)
	{
		global $db;
		static $custom_bbcodes = array();

		// Get all custom bbcodes
		if (empty($custom_bbcodes))
		{
			$sql = 'SELECT bbcode_id, bbcode_tag
				FROM ' . BBCODES_TABLE;
			$result = $db->sql_query($sql, 108000);

			while ($row = $db->sql_fetchrow($result))
			{
				// There can be problems only with tags having an argument
				if (substr($row['bbcode_tag'], -1, 1) == '=')
				{
					$custom_bbcodes[$row['bbcode_id']] = array('[' . $row['bbcode_tag'], ':' . $uid . ']');
				}
			}
			$db->sql_freeresult($result);
		}
	}

	// First truncate the text
	if (utf8_strlen($text) > $max_length)
	{
		$text = utf8_substr($text, 0, $max_length);

		// Do not cut the text in the middle of a word
		$text = substr($text, 0, strrpos ($text, ' '));

		// Append three dots indicating that this is not the real end of the text
		$text .= ' …';
		
		if (!$enable_bbcode)
		{
			return $text;
		}
	}
	else
	{
		return $text;
	}

	// Some tags may contain spaces inside the tags themselves.
	// If there is any tag that had been started but not ended
	// cut the string off before it begins and add three dots
	// to the end of the text again as this has been just cut off too.
	$unsafe_tags = array(
		array('<', '>'),
		array('[quote=&quot;', "&quot;:$uid]"),
	);

	// If bitfield is given only check for tags that are surely existing in the text
	if (!empty($bitfield))
	{
		// Get all used tags
		$bitfield = new bitfield($bitfield);
		$bbcodes_set = $bitfield->get_all_set();

		// Add custom BBCodes having a parameter and being used
		// to the array of potential tags that can be cut apart.
		foreach ($custom_bbcodes as $bbcode_id => $bbcode_name)
		{
			if (in_array($bbcode_id, $bbcodes_set))
			{
				$unsafe_tags[] = $bbcode_name;
			}
		}
	}
	// Do the check for all possible tags
	else
	{
		$unsafe_tags += $custom_bbcodes;
	}

	foreach($unsafe_tags as $tag)
	{
		if (($start_pos = strrpos($text, $tag[0])) > strrpos($text, $tag[1]))
		{
			$text = substr($text, 0, $start_pos) . ' …';
		}
	}

	// Get all of the BBCodes the text contains.
	// If it does not contain any then just skip this step.
	// Preg expression is borrowed from strip_bbcode()
	if (preg_match_all("#\[(\/?)([a-z0-9\*\+\-]+)(?:=(&quot;.*&quot;|[^\]]*))?(?::[a-z])?(?:\:$uid)\]#", $text, $matches, PREG_PATTERN_ORDER) != 0)
	{
		$open_tags = array();

		for ($i = 0, $size = sizeof($matches[0]); $i < $size; ++$i)
		{
			$bbcode_name = &$matches[2][$i];
			$opening = ($matches[1][$i] == '/') ? false : true;

			// If a new BBCode is opened add it to the array of open BBCodes
			if ($opening)
			{
				$open_tags[] = array(
					'name' => $bbcode_name,
					'plus' => ($opening && $bbcode_name == 'list') ? (!empty($matches[3][$i]) ? ':o' : ':u') : '',
				);
			}
			// If a BBCode is closed remove it from the array of open BBCodes.
			// As always only the last opened open tag can be closed
			// we only need to remove the last element of the array.
			else
			{
				array_pop($open_tags);
			}
		}

		// Sort open BBCode tags so the most recently opened will be the first (because it has to be closed first)
		krsort ($open_tags);

		// Close remaining open BBCode tags
		foreach ($open_tags as $tag)
		{
			$text .= '[/' . $tag['name'] . $tag['plus'] . ':' . $uid . ']';	
		}
	}

	return $text;
}
Last edited by fberci on Fri Feb 22, 2008 10:14 pm, edited 1 time in total.
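
The tag-closing step at the end of the function above can be looked at in isolation. What follows is only a minimal sketch, not the posted function: close_open_bbcodes() is a hypothetical helper, the uid 'x1' is made up, and the pattern is a simplified one that only recognises [name:uid], [name=arg:uid] and [/name:uid]. It keeps a stack of open tags and closes whatever is left in reverse order.

```php
<?php
/**
 * Hypothetical, simplified helper: appends closing tags for BBCodes that are
 * still open. Only handles [name:uid], [name=arg:uid] and [/name:uid].
 */
function close_open_bbcodes($text, $uid)
{
	$pattern = '#\[(/?)([a-z0-9]+)(?:=[^\]]*)?:' . preg_quote($uid, '#') . '\]#';
	preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

	$open_tags = array();
	foreach ($matches as $match)
	{
		if ($match[1] === '/')
		{
			// A closing tag ends the most recently opened tag
			array_pop($open_tags);
		}
		else
		{
			// Remember the name of the tag that was just opened
			$open_tags[] = $match[2];
		}
	}

	// Close the remaining tags, most recently opened first
	foreach (array_reverse($open_tags) as $name)
	{
		$text .= '[/' . $name . ':' . $uid . ']';
	}

	return $text;
}

echo close_open_bbcodes('[b:x1]bold [i:x1]both[/i:x1] still bold', 'x1'), "\n";
// [b:x1]bold [i:x1]both[/i:x1] still bold[/b:x1]
```

Like the posted code, the sketch assumes tags are properly nested, so a closing tag always belongs to the last opened one.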
EXreaction
Former Team Member
Posts: 5666
Joined: Sun Aug 21, 2005 9:31 pm
Location: Wisconsin, U.S.
Name: Nathan

Re: BBCode-safe truncating of text

Post by EXreaction »

I'll give it a try soon and let you know how it works. :)

Re: BBCode-safe truncating of text

Post by EXreaction »

Works great, nice work. :)
ToonArmy
Former Team Member
Posts: 4608
Joined: Sat Mar 06, 2004 5:29 pm
Location: Worcestershire, UK
Name: Chris Smith

Re: BBCode-safe truncating of text

Post by ToonArmy »

First, apologies for the year+ topic necro.

Nathan suggested I use this code, and I've noticed a couple of issues.

Code:

$unsafe_tags += $custom_bbcodes; 
You should use array_merge() here; the + operator does not merge non-associative arrays the way you might think.
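
To illustrate the difference, here is a small sketch. The keys and tag names are made up: the custom BBCode ids are deliberately chosen so that one of them collides with an existing integer key, which may or may not happen with the real bbcode_id values in BBCODES_TABLE.

```php
<?php
// A plain list with default integer keys 0 and 1, shaped like $unsafe_tags
$unsafe_tags = array(
	array('<', '>'),
	array('[b', ']'),
);

// Hypothetical custom BBCodes keyed by id; key 0 collides with the list above
$custom_bbcodes = array(
	0 => array('[foo=', ']'),
	5 => array('[bar=', ']'),
);

// The + operator keeps the left-hand value whenever a key already exists,
// so the entry stored under key 0 on the right-hand side is dropped
$union = $unsafe_tags + $custom_bbcodes;

// array_merge() renumbers integer keys and appends every element
$merged = array_merge($unsafe_tags, $custom_bbcodes);

echo count($union), "\n";  // 3
echo count($merged), "\n"; // 4
```

With + the result depends on which keys happen to be in use, while array_merge() always appends, which is what the function wants here.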

Secondly, the following code:

Code:

   foreach($unsafe_tags as $tag)
   {
      if (($start_pos = strrpos($text, $tag[0])) > strrpos($text, $tag[1]))
      {
         $text = substr($text, 0, $start_pos) . ' …';
      }
   } 
$unsafe_tags contains both string values and arrays. This code works with the arrays, but for the strings it compares the positions of the first and second characters of the custom BBCode tag, which is not what is desired. Also, if you could explain why, it'd save me working it out. Thanks :)

Here is the modified code I've got based off Nathan's:

Code:

<?php
/**
* BBCode-safe truncating of text
*
* Originally from {@link http://www.phpbb.com/community/viewtopic.php?f=71&t=670335}
* slightly modified to trim at either the first found end line or space by EXreaction.
*
* Modified by Chris Smith to trim to a specified number of paragraphs and/or a maximum
* number of characters, and provide configurable stopping positions. Made some performance
* improvements as well.
*
* @author fberci (http://www.phpbb.com/community/memberlist.php?mode=viewprofile&u=158767)
* @author EXreaction (http://www.phpbb.com/community/memberlist.php?mode=viewprofile&u=202401)
* @author Chris Smith <toonarmy@phpbb.com> (http://www.phpbb.com/community/memberlist.php?mode=viewprofile&u=108642)
* @param string	$text			Text containing BBCode tags to be truncated
* @param string	$uid			BBCode uid
* @param int	$max_length		Text length limit
* @param int	$max_paragraphs	Maximum number of paragraphs permitted
* @param array	$stops			Characters to stop max length search at
* @param string	$replacement	Replacement suffix for the removed text
* @param string	$bitfield		BBCode bitfield (optional)
* @param bool	$enable_bbcode	Whether BBCode is enabled (true by default)
* @return string Resulting trimmed text
*/
function trim_text($text, $uid, $max_length, $max_paragraphs = 0, $stops = array(' ', "\n"), $replacement = '...', $bitfield = '', $enable_bbcode = true)
{
	if ($enable_bbcode)
	{
		static $custom_bbcodes = array();

		// Get all custom bbcodes
		if (empty($custom_bbcodes))
		{
			global $db;

			$sql = 'SELECT bbcode_id, bbcode_tag
				FROM ' . BBCODES_TABLE;
			$result = $db->sql_query($sql, 3600);

			while ($row = $db->sql_fetchrow($result))
			{
				// There can be problems only with tags having an argument
				if (substr($row['bbcode_tag'], -1, 1) == '=')
				{
					$custom_bbcodes[$row['bbcode_id']] = array('[' . $row['bbcode_tag'], ':' . $uid . ']');
				}
			}
			$db->sql_freeresult($result);
		}
	}

	$trimmed = false;

	// Paragraph trimming
	if ($max_paragraphs && $max_paragraphs < preg_match_all('#\n\s*\n#m', $text, $matches))
	{
		$find = $matches[0][$max_paragraphs - 1];
		// Grab all the matches preceding the paragraph to trim at, find
		// those that match the trim marker, and count them so they can be skipped.
		$skip = sizeof(array_intersect(array_slice($matches[0], 0, $max_paragraphs - 1), array($find)));
		$pos = 0;

		do
		{
			$pos = utf8_strpos($text, $find, $pos + 1);
			$skip--;
		}
		while ($skip >= 0);

		$text = utf8_substr($text, 0, $pos);

		$trimmed = true;
	}

	// First truncate the text
	if ($max_length && utf8_strlen($text) > $max_length)
	{
		$pos = 0;
		$length = 0;

		if (!is_array($stops[0]))
		{
			$stops = array($stops);
		}

		foreach ($stops as $stop_group)
		{
			if (!is_array($stop_group))
			{
				continue;
			}

			foreach ($stop_group as $k => $v)
			{
				$find = (is_string($v)) ? $v : $k;
				$include = is_bool($v) && $v;

				if (($_pos = utf8_strpos(utf8_substr($text, $max_length), $find)) !== false)
				{
					if ($_pos < $pos || !$pos)
					{
						// This is a better find, it cuts the text shorter
						$pos = $_pos;
						$length = $include ? utf8_strlen($find) : 0;
					}
				}
			}

			if ($pos)
			{
				// Include the length of the search string if requested
				$max_length += $pos + $length;
				break;
			}
		}

		// Trim off spaces, this will miss UTF8 spacers :(
		$text = rtrim(utf8_substr($text, 0, $max_length));

		$trimmed = true;
	}

	// No BBCode or no trimming return
	if (!$enable_bbcode || !$trimmed)
	{
		return $text . ($trimmed ? $replacement : '');
	}

	// Some tags may contain spaces inside the tags themselves.
	// If there is any tag that had been started but not ended
	// cut the string off before it begins and add three dots
	// to the end of the text again as this has been just cut off too.
	$unsafe_tags = array(
		array('<', '>'),
		array('[quote=&quot;', "&quot;:$uid]"),
	);

	// If bitfield is given only check for tags that are surely existing in the text
	if (!empty($bitfield))
	{
		// Get all used tags
		$bitfield = new bitfield($bitfield);

		// isset() provides better performance
		$bbcodes_set = array_flip($bitfield->get_all_set());

		// Add custom BBCodes having a parameter and being used
		// to the array of potential tags that can be cut apart.
		foreach ($custom_bbcodes as $bbcode_id => $bbcode_name)
		{
			if (isset($bbcodes_set[$bbcode_id]))
			{
				$unsafe_tags[] = $bbcode_name;
			}
		}
	}
	// Do the check for all possible tags
	else
	{
		$unsafe_tags = array_merge($unsafe_tags, $custom_bbcodes);
	}

	// @todo Fix this block
	foreach ($unsafe_tags as $tag)
	{
		if (($start_pos = strrpos($text, $tag[0])) > strrpos($text, $tag[1]))
		{
			$text = substr($text, 0, $start_pos) . ' ...';
		}
	}

	// Get all of the BBCodes the text contains.
	// If it does not contain any then just skip this step.
	// Preg expression is borrowed from strip_bbcode()
	if (preg_match_all("#\[(\/?)([a-z0-9_\*\+\-]+)(?:=(&quot;.*&quot;|[^\]]*))?(?::[a-z])?(?:\:$uid)\]#", $text, $matches, PREG_PATTERN_ORDER) != 0)
	{
		$open_tags = array();

		for ($i = 0, $size = sizeof($matches[0]); $i < $size; ++$i)
		{
			$bbcode_name = &$matches[2][$i];
			$opening = ($matches[1][$i] == '/') ? false : true;

			// If a new BBCode is opened add it to the array of open BBCodes
			if ($opening)
			{
				$open_tags[] = array(
					'name' => $bbcode_name,
					'plus' => ($opening && $bbcode_name == 'list' && !empty($matches[3][$i])) ? ':o' : '',
				);
			}
			// If a BBCode is closed remove it from the array of open BBCodes.
			// As always only the last opened open tag can be closed
			// we only need to remove the last element of the array.
			else
			{
				array_pop($open_tags);
			}
		}

		// Sort open BBCode tags so the most recently opened will be the first (because it has to be closed first)
		krsort ($open_tags);

		// Close remaining open BBCode tags
		foreach ($open_tags as $tag)
		{
			$text .= '[/' . $tag['name'] . $tag['plus'] . ':' . $uid . ']';
		}
	}

	// Append the replacement
	return $text . $replacement;
}
Edits: updated the code

Re: BBCode-safe truncating of text

Post by fberci »

Sorry for the late reply; I have meant to reply to this for a long time but just hadn't gotten to it.

It took me some time to figure out (or remember), but it turns out I made a very bad decision when naming the $bbcode_name variable: it is actually an array (populated at the beginning of the function), so there is no problem with the foreach.
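
In other words, every element of $unsafe_tags is a pair: the start of an opening tag and the text that ends it. A minimal sketch of the check done on each pair is below. ends_inside_tag() is a hypothetical helper, not part of the posted function, and the uid is made up.

```php
<?php
/**
 * Hypothetical helper: reports whether the last occurrence of $open
 * comes after the last occurrence of $close, i.e. whether the text
 * appears to end in the middle of an opening tag.
 */
function ends_inside_tag($text, $open, $close)
{
	$start_pos = strrpos($text, $open);
	if ($start_pos === false)
	{
		// The opening marker never occurs
		return false;
	}

	$end_pos = strrpos($text, $close);

	// Either there is no closing marker at all,
	// or the last one precedes the last opening marker
	return $end_pos === false || $start_pos > $end_pos;
}

$uid = 'abc123'; // made-up uid
$pair = array('[url=', ':' . $uid . ']');

var_dump(ends_inside_tag('see [url=http://exa', $pair[0], $pair[1]));                     // bool(true)
var_dump(ends_inside_tag('see [url=http://example.com:abc123]here', $pair[0], $pair[1])); // bool(false)
```

The explicit === false checks are a choice of this sketch; the posted code compares the two strrpos() results directly.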

However, this version of the function doesn't handle custom BBCodes well when their parameter contains spaces (in some cases it can cut the text earlier than it should), but I have already figured out a solution for this.

Re: BBCode-safe truncating of text

Post by ToonArmy »

And how did you fix it?

Re: BBCode-safe truncating of text

Post by fberci »

Here is the updated code:

Code:

<?php
/**
* BBCode-safe truncating of text
*
* Originally from {@link http://www.phpbb.com/community/viewtopic.php?f=71&t=670335}
* slightly modified to trim at either the first found end line or space by EXreaction.
*
* Modified by Chris Smith to trim to a specified number of paragraphs and/or a maximum
* number of characters, and provide configurable stopping positions. Made some performance
* improvements as well.
*
* Just like phpBB3, this function doesn't support embedding BBCodes in BBCode parameters,
* except for [quote].
*
* @author fberci (http://www.phpbb.com/community/memberlist.php?mode=viewprofile&u=158767)
* @author EXreaction (http://www.phpbb.com/community/memberlist.php?mode=viewprofile&u=202401)
* @author Chris Smith <toonarmy@phpbb.com> (http://www.phpbb.com/community/memberlist.php?mode=viewprofile&u=108642)
* @param string	$text			Text containing BBCode tags to be truncated
* @param string	$uid			BBCode uid
* @param int	$max_length		Text length limit
* @param int	$max_paragraphs	Maximum number of paragraphs permitted
* @param array	$stops			Characters to stop max length search at
* @param string	$replacement	Replacement suffix for the removed text
* @param string	$bitfield		BBCode bitfield (optional)
* @param bool	$enable_bbcode	Whether BBCode is enabled (true by default)
* @return string Resulting trimmed text
*/
function trim_text($text, $uid, $max_length, $max_paragraphs = 0, $stops = array(' ', "\n"), $replacement = '…', $bitfield = '', $enable_bbcode = true)
{
	$orig_text = $text;
	
	if ($enable_bbcode)
	{
		static $custom_bbcodes = array();

		// Get all custom bbcodes
		if (empty($custom_bbcodes))
		{
			global $db;

			$sql = 'SELECT bbcode_id, bbcode_tag, second_pass_match
				FROM ' . BBCODES_TABLE;
			$result = $db->sql_query($sql, 3600);

			while ($row = $db->sql_fetchrow($result))
			{
				// There can be problems only with tags having an argument
				if (substr($row['bbcode_tag'], -1, 1) == '=')
				{
					$custom_bbcodes[$row['bbcode_id']] = array('[' . $row['bbcode_tag'], ':' . $uid . ']', str_replace('$uid', $uid, $row['second_pass_match']));
				}
			}
			$db->sql_freeresult($result);
		}
	}

	$trimmed = false;

	// Paragraph trimming
	if ($max_paragraphs && $max_paragraphs < preg_match_all('#\n\s*\n#m', $text, $matches))
	{
		$find = $matches[0][$max_paragraphs - 1];
		// Grab all the matches preceding the paragraph to trim at, find
		// those that match the trim marker, and count them so they can be skipped.
		$skip = sizeof(array_intersect(array_slice($matches[0], 0, $max_paragraphs - 1), array($find)));
		$pos = 0;

		do
		{
			$pos = utf8_strpos($text, $find, $pos + 1);
			$skip--;
		}
		while ($skip >= 0);

		$text = utf8_substr($text, 0, $pos);

		$trimmed = true;
	}

	// First truncate the text
	if ($max_length && utf8_strlen($text) > $max_length)
	{
		$pos = 0;
		$length = 0;

		if (!is_array($stops[0]))
		{
			$stops = array($stops);
		}

		foreach ($stops as $stop_group)
		{
			if (!is_array($stop_group))
			{
				continue;
			}

			foreach ($stop_group as $k => $v)
			{
				$find = (is_string($v)) ? $v : $k;
				$include = is_bool($v) && $v;

				if (($_pos = utf8_strpos(utf8_substr($text, $max_length), $find)) !== false)
				{
					if ($_pos < $pos || !$pos)
					{
						// This is a better find, it cuts the text shorter
						$pos = $_pos;
						$length = $include ? utf8_strlen($find) : 0;
					}
				}
			}

			if ($pos)
			{
				// Include the length of the search string if requested
				$max_length += $pos + $length;
				break;
			}
		}

		// Trim off spaces, this will miss UTF8 spacers :(
		$text = rtrim(utf8_substr($text, 0, $max_length));

		$trimmed = true;
	}

	// No BBCode or no trimming return
	if (!$enable_bbcode || !$trimmed)
	{
		return $text . ($trimmed ? $replacement : '');
	}

	// Some tags may contain spaces inside the tags themselves.
	// If there is any tag that had been started but not ended
	// cut the string off before it begins.
	$unsafe_tags = array(
		array('<', '>'),
		array('[quote=&quot;', "&quot;:$uid]"), // 3rd parameter true here too for now
	);

	// If bitfield is given only check for those tags that are surely existing in the text
	if (!empty($bitfield))
	{
		// Get all used tags
		$bitfield = new bitfield($bitfield);

		// isset() provides better performance
		$bbcodes_set = array_flip($bitfield->get_all_set());

		// Add custom BBCodes having a parameter and being used
		// to the array of potential tags that can be cut apart.
		foreach ($custom_bbcodes as $bbcode_id => $bbcode_tag)
		{
			if (isset($bbcodes_set[$bbcode_id]))
			{
				$unsafe_tags[] = $bbcode_tag;
			}
		}
	}
	// Else do the check for all possible tags
	else
	{
		$unsafe_tags = array_merge($unsafe_tags, $custom_bbcodes);
	}

	foreach ($unsafe_tags as $tag)
	{
		// Ooops, we are in the middle of an opening BBCode or HTML tag,
		// truncate the string before the opening tag
		if (($start_pos = strrpos($text, $tag[0])) > strrpos($text, $tag[1]))
		{
			// Wait, is this really an opening tag or does it just look like one?
			$match = array();
			if (isset($tag[2]) && preg_match($tag[2], substr($orig_text, $start_pos), $match, PREG_OFFSET_CAPTURE) != 0 && $match[0][1] === 0)
			{
				$text = rtrim(substr($text, 0, $start_pos));
			}
		}
	}

	$text = $text . $replacement;

	// Get all of the BBCodes the text contains.
	// If it does not contain any then just skip this step.
	// Preg expression is borrowed from strip_bbcode()
	if (preg_match_all("#\[(\/?)([a-z0-9_\*\+\-]+)(?:=(&quot;.*&quot;|[^\]]*))?(?::[a-z])?(?:\:$uid)\]#", $text, $matches, PREG_PATTERN_ORDER) != 0)
	{
		$open_tags = array();

		for ($i = 0, $size = sizeof($matches[0]); $i < $size; ++$i)
		{
			$bbcode_name = &$matches[2][$i];
			$opening = ($matches[1][$i] == '/') ? false : true;

			// If a new BBCode is opened add it to the array of open BBCodes
			if ($opening)
			{
				$open_tags[] = array(
					'name' => $bbcode_name,
					'plus' => ($opening && $bbcode_name == 'list' && !empty($matches[3][$i])) ? ':o' : '',
				);
			}
			// If a BBCode is closed remove it from the array of open BBCodes.
			// As always only the last opened open tag can be closed,
			// so we only need to remove the last element of the array.
			else
			{
				array_pop($open_tags);
			}
		}

		// Sort open BBCode tags so the most recently opened will be the first (because it has to be closed first)
		krsort ($open_tags);

		// Close remaining open BBCode tags
		foreach ($open_tags as $tag)
		{
			$text .= '[/' . $tag['name'] . $tag['plus'] . ':' . $uid . ']';
		}
	}

	return $text;
}
PS: One of you modified the function so that it cuts the text after the limit; I assume that was intentional.
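
To show what I mean, here is a small sketch of the two behaviours on a plain ASCII string, using byte-based functions instead of the utf8_* ones and a single space as the only stop character. The sample text and limit are made up.

```php
<?php
$text = 'The quick brown fox jumps over the lazy dog';
$max_length = 12;

// First version: cut at the limit, then back up to the last space
$head = substr($text, 0, $max_length);
$last_space = strrpos($head, ' ');
$before = ($last_space === false) ? $head : substr($head, 0, $last_space);

// Later versions: look for the first stop character after the limit
$offset = strpos(substr($text, $max_length), ' ');
$after = ($offset === false) ? $text : rtrim(substr($text, 0, $max_length + $offset));

echo $before, "\n"; // The quick
echo $after, "\n";  // The quick brown
```

So the first approach never exceeds the limit, while the second one can run past it by up to one word.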

Re: BBCode-safe truncating of text

Post by EXreaction »

That code still has some issues...

Code:

test<!--…