Truncating text that contains HTML while ignoring tags

PHP Html/CSS

白猪掌柜的 2019-10-09 15:51:47

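I want to truncate some text (loaded from a database or a text file), but it contains HTML, so the tags are counted and less text is returned. The cut can also leave tags unclosed or only partially closed (so Tidy may not work properly, and there is still less content). How can I truncate based on the text alone (and probably stop when I reach a table, since that could cause more complex problems)?

substr("Hello, my <strong>name</strong> is Sam. I&acute;m a web developer.",0,26)."..."

would result in:

Hello, my <strong>name</st...

What I want is:

Hello, my <strong>name</strong> is Sam. I&acute;m...

How can I do this?

Although my question is about how to do this in PHP, it would be good to know how to do it in C# as well. Either should be fine, since I think I could port the method across (unless it is a built-in method).

Also note that I have included an HTML entity, &acute;, which must be treated as a single character (rather than 7 characters, as in this example).

strip_tags is a fallback, but I would lose formatting and links, and the HTML entities would still be a problem.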
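To make the problem concrete, here is a minimal sketch that contrasts cutting the raw markup with measuring only the visible text. The sample string and the variable names are assumptions chosen for illustration; it only measures the text and does not truncate it yet.

```php
<?php
// Sample input assumed for illustration.
$html = 'Hello, my <strong>name</strong> is Sam. I&acute;m a web developer.';

// Cutting the raw markup counts the tag characters too,
// so the cut lands in the middle of the closing </strong> tag.
$naive = substr($html, 0, 26) . '...';

// What a reader actually sees: drop the tags, then decode the entities.
$visible = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Count UTF-8 characters (not bytes); preg_match_all() returns the number of matches.
$visibleLength = preg_match_all('/./su', $visible);

echo $naive, "\n";
echo $visibleLength, "\n";
```

The difference between the two counts is the amount by which a plain substr() call under-delivers text.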

3 Answers

潇湘沐

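I have written a function that truncates HTML just as you suggest, but instead of printing the result it keeps everything in a string variable. It also handles HTML entities.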

/**
 * Truncate an HTML string and then clean up the end of the HTML.
 * Truncates by counting characters outside of HTML tags.
 *
 * @author alex lockwood, alex dot lockwood at websightdesign
 * @param string $str the string to truncate
 * @param int $len the number of characters
 * @param string $end the end string for truncation
 * @return string $truncated_html
 */
function truncateHTML($str, $len, $end = '…') {
    // match HTML tags and entities
    $tagPattern = '/(<\/?)([\w]*)(\s*[^>]*)>?|&[\w#]+;/i';
    preg_match_all($tagPattern, $str, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);

    $i = 0;
    $openTags = array();
    $warnings = array();

    // Loop through each match that starts within $len and add its length to $len,
    // tracking opened and closed tags along the way.
    //   $matches[$i][0] = the whole tag string (the only field set for HTML entities)
    // If the match is not an &htmlentity; the following also apply:
    //   $matches[$i][1] = the start of the tag, either '<' or '</'
    //   $matches[$i][2] = the tag name
    //   $matches[$i][3] = the rest of the tag
    //   $matches[$i][$j][0] = the string
    //   $matches[$i][$j][1] = the string offset
    // Check that the match exists before reading its offset.
    while (!empty($matches[$i]) && $matches[$i][0][1] < $len) {
        $len = $len + strlen($matches[$i][0][0]);
        if (substr($matches[$i][0][0], 0, 1) == '&') {
            $len = $len - 1; // an entity counts as a single character
        }

        // if $matches[$i][2] is undefined then it is an HTML entity; ignore those for tag counting
        // also ignore empty/singleton tags for tag counting
        if (!empty($matches[$i][2][0]) && !in_array($matches[$i][2][0], array('br', 'img', 'hr', 'input', 'param', 'link'))) {
            // double check: neither a closing tag nor a self-closing tag
            if (substr($matches[$i][3][0], -1) != '/' && substr($matches[$i][1][0], -1) != '/') {
                $openTags[] = $matches[$i][2][0];
            } elseif (end($openTags) == $matches[$i][2][0]) {
                array_pop($openTags);
            } else {
                $warnings[] = "html has some tags mismatched in it: $str";
            }
        }
        $i++;
    }

    // build the closing tags for everything still open, innermost first
    $closeTagString = '';
    if (!empty($openTags)) {
        $openTags = array_reverse($openTags);
        foreach ($openTags as $t) {
            $closeTagString .= "</" . $t . ">";
        }
    }

    // find the first space at or after the adjusted length
    $lastWord = (strlen($str) > $len) ? strpos($str, ' ', $len) : false;
    if ($lastWord === false) {
        // nothing to cut: the string is short enough or has no later word boundary
        return $str;
    }

    // truncate at that word boundary
    $str = substr($str, 0, $lastWord);
    // find the last character
    $last_character = substr($str, -1, 1);
    // add the end text (none after a full stop; a trailing comma is dropped first)
    if ($last_character == '.') {
        $truncated_html = $str;
    } elseif ($last_character == ',') {
        $truncated_html = substr($str, 0, -1) . $end;
    } else {
        $truncated_html = $str . $end;
    }

    // restore any open tags
    $truncated_html .= $closeTagString;
    return $truncated_html;
}

Replied 2019-10-09

30秒到达战场

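The 100% accurate, but pretty difficult approach:

Iterate over the characters using the DOM
Use DOM methods to remove the remaining elements
Serialize the DOM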
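The three DOM steps above could be sketched roughly as follows. This is a minimal illustration rather than the answerer's code: the function names truncateNode and truncateHtmlDom are hypothetical, it assumes PHP's dom extension is available and the input is UTF-8, and it cuts at an exact character count without looking for a word boundary.

```php
<?php
// Hypothetical helpers for illustration; the names are not from the answer.

// Walk $node in document order, spending $remaining on text characters.
function truncateNode(DOMNode $node, int &$remaining, string $end): void
{
    // Copy the child list first, because nodes are removed while looping.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            $node->removeChild($child); // budget spent: drop everything after the cut
        } elseif ($child instanceof DOMText) {
            if ($child->length > $remaining) {
                // Cut this text node at the limit and append the end marker.
                $child->deleteData($remaining, $child->length - $remaining);
                $child->appendData($end);
                $remaining = 0;
            } else {
                $remaining -= $child->length;
            }
        } elseif ($child->hasChildNodes()) {
            truncateNode($child, $remaining, $end);
        }
    }
}

function truncateHtmlDom(string $html, int $limit, string $end = '...'): string
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc = new DOMDocument();
    // The XML declaration is a common workaround to make libxml read the input
    // as UTF-8; the wrapper <div> gives the fragment a single root element.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->getElementsByTagName('div')->item(0);
    $remaining = $limit;
    truncateNode($root, $remaining, $end);

    // Serialize only the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo truncateHtmlDom('Tom &amp; <b>Jerry</b> went <i>home</i>', 9), "\n";
```

Because the parser decodes entities into single characters and the serializer closes every element it writes, entities and unclosed tags need no special handling here. The cost is a full parse and a re-serialization that may normalize the markup.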

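The easy brute-force approach:

Split the string into tags (not elements) and text fragments using preg_split('/(<tag>)/') with PREG_SPLIT_DELIM_CAPTURE.
Measure the text length you want (the text is every second element of the split result; you might use html_entity_decode() to help measure it accurately)
Cut the string (trim &[^\s;]+$ at the end to get rid of a possibly chopped entity)
Fix it with HTML Tidy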
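The first three brute-force steps might look like the sketch below. The function name truncateHtmlSplit and the tag pattern /(<[^>]*>)/ are assumptions for illustration. As a variation on the answer's suggestion, the text fragments are tokenized so that each entity counts as one character, which avoids chopping an entity in the first place instead of trimming it afterwards. Valid UTF-8 input is assumed.

```php
<?php
// Hypothetical helper illustrating the split-and-count idea; not from the answer.
function truncateHtmlSplit(string $html, int $limit, string $end = '...'): string
{
    // With a capturing group and PREG_SPLIT_DELIM_CAPTURE the tags are kept:
    // even indexes hold text fragments, odd indexes hold tags.
    $pieces = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $out = '';
    $remaining = $limit;

    foreach ($pieces as $index => $piece) {
        if ($index % 2 === 1) {
            $out .= $piece; // a tag: copy it through without counting it
            continue;
        }
        // Tokenize the text so that an entity such as &amp; is one token.
        preg_match_all('/&[\w#]+;|./su', $piece, $m);
        $tokens = $m[0];
        if (count($tokens) <= $remaining) {
            $out .= $piece;
            $remaining -= count($tokens);
            continue;
        }
        // The limit falls inside this fragment: keep what fits and stop.
        $out .= implode('', array_slice($tokens, 0, $remaining)) . $end;
        break;
    }
    // Tags opened before the cut are still unclosed at this point.
    return $out;
}

echo truncateHtmlSplit('Tom &amp; <b>Jerry</b> went home', 9), "\n";
// prints: Tom &amp; <b>Jer...
```

The result still contains the unclosed <b> tag, which is why the last step hands the string to HTML Tidy (for example tidy_repair_string(), if the tidy extension is installed) to close whatever was left open.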

Replied 2019-10-09
