How to replace all HTML tags with an empty string in Go (golang)

慕尼黑的夜晚无繁华 2023-06-12 16:47:56

I am trying to replace all HTML tags in a Go string with an empty string ("") using a regular expression pattern such as ^[^.\/]*$/g, which is meant to match all closing tags, for example </div>.

My solution:

package main

import (
	"fmt"
	"regexp"
)

const Template = `^[^.\/]*$/g`

func main() {
	r := regexp.MustCompile(Template)
	s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"
	res := r.ReplaceAllString(s, "")
	fmt.Println(res)
}

But the output is the same as the source string. What is wrong? Please help. Thanks.

The expected result should be: "afsdf4534534!@@!!#345345afsdf4534534!@@!!#"

4 Answers

波斯汪

Contributed 1811 experience points · received 4+ upvotes

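For those who come here looking for a quick solution, there is a library that does this: bluemonday.

The bluemonday package provides a way to describe a whitelist of HTML elements and attributes as a policy, and to apply that policy to untrusted strings from users that may contain markup. All elements and attributes not on the whitelist are stripped.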

package main

import (
	"fmt"

	"github.com/microcosm-cc/bluemonday"
)

func main() {
	// Do this once for each unique policy, and use the policy for the life of the program.
	// Policy creation/editing is not safe to use in multiple goroutines.
	p := bluemonday.StripTagsPolicy()

	// The policy can then be used to sanitize lots of input, and it is safe to use the policy in multiple goroutines.
	html := p.Sanitize(
		`<a onblur="alert(secret)" href="http://www.google.com">Google</a>`,
	)

	// Output:
	// Google
	fmt.Println(html)
}

https://play.golang.org/p/jYARzNwPToZ

Answered 2023-06-12

蝴蝶刀刀

Contributed 1801 experience points · received 8+ upvotes

The problem with RegEx

This is a very simple RegEx replace method that removes HTML tags from well-formed HTML in a string.

strip_html_regex.go

package main

import "regexp"

const regex = `<.*?>`

// This method uses a regular expression to remove HTML tags.
func stripHtmlRegex(s string) string {
	r := regexp.MustCompile(regex)
	return r.ReplaceAllString(s, "")
}

Note: this does not work with malformed HTML. Do not use it.
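As a side note on the pattern from the question, the sketch below illustrates why `^[^.\/]*$/g` leaves the input unchanged. Go's regexp package has no `/g` suffix (ReplaceAllString already replaces every match), so `/g` is read as literal text that would have to follow the end-of-text anchor `$`, which can never happen. The pattern `<[^>]*>` and the helper name stripWellFormed are assumptions made for illustration only, and the caveat about malformed HTML applies to them as well.

```go
package main

import (
	"fmt"
	"regexp"
)

// Both patterns are compiled once at package level.
// MustCompile panics if a pattern is invalid.
var (
	// The pattern from the question. "/g" is literal text placed after "$",
	// so this expression can never match anything.
	questionPattern = regexp.MustCompile(`^[^.\/]*$/g`)

	// Hypothetical alternative: "<", any run of non-">" characters, then ">".
	tagPattern = regexp.MustCompile(`<[^>]*>`)
)

// stripWellFormed removes every substring that looks like a tag.
// It is only reasonable for well-formed markup.
func stripWellFormed(s string) string {
	return tagPattern.ReplaceAllString(s, "")
}

func main() {
	s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

	fmt.Println(questionPattern.MatchString(s)) // false
	fmt.Println(stripWellFormed(s))
}
```

Compiling the pattern once, rather than inside the function on every call, is a common choice when the function is called often.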

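A better approach

Because a string in Go can be treated as a slice of bytes, it is easy to walk through the string and find the parts that are not inside an HTML tag. When we identify a valid part of the string, we can simply take a slice of that part and append it using a strings.Builder.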

strip_html.go

package main

import "strings"

const (
	htmlTagStart = 60 // Unicode `<`
	htmlTagEnd   = 62 // Unicode `>`
)

// Aggressively strips HTML tags from a string.
// It will only keep anything between `>` and `<`.
func stripHtmlTags(s string) string {
	// Set up a string builder and allocate enough memory for the new string.
	var builder strings.Builder
	builder.Grow(len(s))

	in := false // True if we are inside an HTML tag.
	end := 0    // The byte index just past the previous end tag character `>`

	for i, c := range s {
		switch c {
		case htmlTagStart:
			// Only act on the first `<` of a tag.
			// This makes sure we strip out `<<br>`, not just `<br>`,
			// and that the preceding text is written only once.
			if !in {
				// Write the valid text between the previous `>` and this `<`.
				builder.WriteString(s[end:i])
				in = true
			}
		case htmlTagEnd:
			in = false
			end = i + 1
		}
	}

	// Keep the trailing text unless the string ends inside an unclosed tag.
	if !in {
		builder.WriteString(s[end:])
	}
	return builder.String()
}

If we run both functions on the OP's text and on some malformed HTML, you will see that the results are not consistent.

main.go

package main

import "fmt"

func main() {
	s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"
	res := stripHtmlTags(s)
	fmt.Println(res)

	// Malformed HTML examples
	fmt.Println("\n:: stripHTMLTags ::\n")
	fmt.Println(stripHtmlTags("Do something <strong>bold</strong>."))
	fmt.Println(stripHtmlTags("h1>I broke this</h1>"))
	fmt.Println(stripHtmlTags("This is <a href='#'>>broken link</a>."))
	fmt.Println(stripHtmlTags("I don't know ><where to <<em>start</em> this tag<."))

	// Regex Malformed HTML examples
	fmt.Println(":: stripHtmlRegex ::\n")
	fmt.Println(stripHtmlRegex("Do something <strong>bold</strong>."))
	fmt.Println(stripHtmlRegex("h1>I broke this</h1>"))
	fmt.Println(stripHtmlRegex("This is <a href='#'>>broken link</a>."))
	fmt.Println(stripHtmlRegex("I don't know ><where to <<em>start</em> this tag<."))
}

Output:

afsdf4534534!@@!!#345345afsdf4534534!@@!!#

:: stripHTMLTags ::

Do something bold.
I broke this
This is broken link.
start this tag

:: stripHtmlRegex ::

Do something bold.
h1>I broke this
This is >broken link.
I don't know >start this tag<.

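Note: the RegEx method does not consistently remove all HTML tags. To be honest, I am not good enough at RegEx to write a pattern that handles stripping HTML correctly.

Benchmarks

Besides being safer and more aggressive at stripping malformed HTML tags, stripHtmlTags is about 4 times faster than stripHtmlRegex.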

> go test -run=Calculate -bench=.
goos: windows
goarch: amd64
BenchmarkStripHtmlRegex-8     51516    22726 ns/op
BenchmarkStripHtmlTags-8     230678     5135 ns/op
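The benchmark functions themselves are not shown in this answer. A minimal sketch of how such a comparison might be set up with the standard testing package follows. The sample input, the function names stripRegex and stripScan, and the use of testing.Benchmark from a plain program are all assumptions, and this stand-in compiles the pattern once instead of on every call, so the numbers it prints should not be expected to match the figures above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"testing"
)

// Compiled once; an assumption of this sketch.
var tagRegex = regexp.MustCompile(`<.*?>`)

// stripRegex is a stand-in for the regex-based approach.
func stripRegex(s string) string {
	return tagRegex.ReplaceAllString(s, "")
}

// stripScan is a stand-in for the scanning approach: it keeps only the
// text found outside "<" ... ">" pairs.
func stripScan(s string) string {
	var b strings.Builder
	b.Grow(len(s))
	in, end := false, 0
	for i := 0; i < len(s); i++ {
		switch s[i] {
		case '<':
			if !in {
				b.WriteString(s[end:i])
				in = true
			}
		case '>':
			in, end = false, i+1
		}
	}
	if !in {
		b.WriteString(s[end:])
	}
	return b.String()
}

func main() {
	// Hypothetical input: one small fragment repeated to get a longer string.
	input := strings.Repeat("Do something <strong>bold</strong>. ", 100)

	for name, fn := range map[string]func(string) string{
		"regex": stripRegex,
		"scan":  stripScan,
	} {
		// testing.Benchmark picks b.N itself and reports ns/op.
		res := testing.Benchmark(func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				fn(input)
			}
		})
		fmt.Printf("%-5s %s\n", name, res)
	}
}
```

In a real project the same loops would more commonly live in a _test.go file as BenchmarkXxx functions and be run with go test -bench=.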

Answered 2023-06-12

萧十郎

Contributed 1815 experience points · received 13+ upvotes

If you want to replace all HTML tags, use a package that strips HTML tags.

Using a regular expression to match HTML tags is not a good idea.

package main

import (
	"fmt"

	// The package name is strip, so the import is aliased explicitly.
	strip "github.com/grokify/html-strip-tags-go"
)

func main() {
	text := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"
	stripped := strip.StripTags(text)

	fmt.Println(text)
	fmt.Println(stripped)
}

Answered 2023-06-12

至尊宝的传说

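Contributed 1789 experience points · received 10+ upvotes

We tried this in production, but in some edge cases none of the proposed solutions really worked. If you need something robust, check the unexported method in Go's internal library (the html-strip-tags-go package basically exports it under a BSD-3 license). Alternatively, https://github.com/microcosm-cc/bluemonday is a very popular library (also BSD-3), and it is the one we ended up using.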

=================================================

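The only difference here comes from how len evaluates strings that contain UTF-8 characters. It counts bytes, and each character takes between 1 and 4 bytes, so len("è") actually evaluates to 2. To work around this, we convert the string to a []rune.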

https://go.dev/play/p/xo7Mrx5qw-_J

// Aggressively strips HTML tags from a string.
// It will only keep anything between `>` and `<`.
func stripHTMLTags(s string) string {
	// Supports UTF-8, since some characters take more than 1 byte, e.g. len("è") -> 2.
	d := []rune(s)

	// Set up a string builder and allocate enough memory for the new string.
	var builder strings.Builder
	builder.Grow(len(s))

	in := false // True if we are inside an HTML tag.
	end := 0    // The rune index just past the previous end tag character `>`

	for i, c := range d {
		switch c {
		case htmlTagStart:
			// Only act on the first `<` of a tag.
			// This makes sure we strip out `<<br>`, not just `<br>`.
			if !in {
				// i and end are rune positions, so slice d rather than s.
				builder.WriteString(string(d[end:i]))
				in = true
			}
		case htmlTagEnd:
			in = false
			end = i + 1
		}
	}

	// Keep the trailing text unless the string ends inside an unclosed tag.
	if !in {
		builder.WriteString(string(d[end:]))
	}
	return builder.String()
}

Answered 2023-06-12
