
Unmarshal HTML nested in XML

长风秋雁 2022-08-01 16:44:57

I received an XML file from a third party, and one of its XML tags contains HTML elements. I can't figure out how to unmarshal it to get the href URL. XML example:

    <SOME_HTML> <a href="http://www.google.com" target="_blank"> google</a></SOME_HTML>

This is what I have so far, but nothing is added to the struct:

    type Href struct {
        Link string `xml:"href"`
    }

    type Link struct {
        URL []Href `xml:"a"`
    }

    type XmlFile struct {
        HTMLTag []Link `xml:"SOME_HTML"`
    }

    myFile := []byte(`<?xml version="1.0" encoding="utf-8"?><SOME_HTML> <a href="http://www.google.com" target="_blank"> google</a></SOME_HTML>`)
    var output XmlFile
    err := xml.Unmarshal(myFile, &output)
    fmt.Println(output) // {[]}


3 Answers

青春有我

Contributed 1784 experience points · received over 8 likes

You can do it like this (https://play.golang.org/p/MJzAVLBFfms):

    package main

    import (
        "encoding/xml"
        "fmt"
        "log"
    )

    // aElement captures the href attribute of an <a> element.
    type aElement struct {
        Href string `xml:"href,attr"`
    }

    // content maps the <a> child of the root element.
    type content struct {
        A aElement `xml:"a"`
    }

    func main() {
        test := `<SOME_HTML><a href="http://www.google.com" target="_blank">google</a></SOME_HTML>`

        var result content
        if err := xml.Unmarshal([]byte(test), &result); err != nil {
            log.Fatal(err)
        }
        fmt.Println(result)
    }
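To see why the struct in the question stays empty, here is a minimal sketch contrasting the two tag forms. Without `,attr`, encoding/xml looks for a child element named href; with `,attr`, it reads the attribute of that name. The type names `elemTagged` and `attrTagged` and the helper `decodeBoth` are hypothetical, introduced only for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// elemTagged mirrors the question's mapping: `xml:"href"` matches a child
// element <href>, not an attribute, so Link stays empty for this input.
type elemTagged struct {
	A struct {
		Link string `xml:"href"`
	} `xml:"a"`
}

// attrTagged adds `,attr`, which tells encoding/xml to read the attribute.
type attrTagged struct {
	A struct {
		Link string `xml:"href,attr"`
	} `xml:"a"`
}

const sample = `<SOME_HTML><a href="http://www.google.com" target="_blank">google</a></SOME_HTML>`

// decodeBoth (hypothetical helper) unmarshals the same input into both shapes.
func decodeBoth(doc string) (elemLink, attrLink string, err error) {
	var e elemTagged
	if err = xml.Unmarshal([]byte(doc), &e); err != nil {
		return "", "", err
	}
	var a attrTagged
	if err = xml.Unmarshal([]byte(doc), &a); err != nil {
		return "", "", err
	}
	return e.A.Link, a.A.Link, nil
}

func main() {
	elemLink, attrLink, err := decodeBoth(sample)
	if err != nil {
		fmt.Println("unmarshal failed:", err)
		return
	}
	fmt.Printf("element tag: %q\n", elemLink)
	fmt.Printf("attr tag:    %q\n", attrLink)
}
```

Only the `,attr` variant picks up the URL; the element-tagged field is left as an empty string without any error being reported, which is why the original code failed silently.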

2022-08-01

潇湘沐

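Contributed 1816 experience points · received over 6 likes

This parses everything in the XML, assuming the HTML may contain multiple a tags, or other tags as well (such as div).

If you don't need that, simply replace XmlFile.Links with XmlFile.Link of type Link (not []Link).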

    package main

    import (
        "encoding/xml"
        "fmt"
        "log"
    )

    func main() {
        // Link maps an <a> element: its attributes and its text content.
        type Link struct {
            XMLName xml.Name `xml:"a"`
            URL     string   `xml:"href,attr"`
            Target  string   `xml:"target,attr"`
            Content string   `xml:",chardata"`
        }

        // Div maps a <div> element.
        type Div struct {
            XMLName xml.Name `xml:"div"`
            Classes string   `xml:"class,attr"`
            Content string   `xml:",chardata"`
        }

        // XmlFile maps the <SOME_HTML> root and collects every <a> and <div> child.
        type XmlFile struct {
            XMLName xml.Name `xml:"SOME_HTML"`
            Links   []Link   `xml:"a"`
            Divs    []Div    `xml:"div"`
        }

        myFile := []byte(`<?xml version="1.0" encoding="utf-8"?>
    <SOME_HTML>
        <a href="http://www.google.com" target="_blank">google</a>
        <a href="http://www.facebook.com" target="_blank">facebook</a>
    </SOME_HTML>`)

        var output XmlFile
        err := xml.Unmarshal(myFile, &output)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(output)
    }
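The single-link variant mentioned above could look roughly like this. It is a sketch that assumes the document holds exactly one a element; the helper name `parseSingle` is hypothetical, and the `Link` type is copied from the answer's code.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Link maps an <a> element, as in the answer above.
type Link struct {
	XMLName xml.Name `xml:"a"`
	URL     string   `xml:"href,attr"`
	Target  string   `xml:"target,attr"`
	Content string   `xml:",chardata"`
}

// XmlFile holds a single Link instead of a []Link.
// Assumption: the input contains exactly one <a> element.
type XmlFile struct {
	XMLName xml.Name `xml:"SOME_HTML"`
	Link    Link     `xml:"a"`
}

// parseSingle (hypothetical helper) unmarshals a document with one <a> child.
func parseSingle(doc []byte) (XmlFile, error) {
	var out XmlFile
	err := xml.Unmarshal(doc, &out)
	return out, err
}

func main() {
	doc := []byte(`<?xml version="1.0" encoding="utf-8"?>
<SOME_HTML><a href="http://www.google.com" target="_blank">google</a></SOME_HTML>`)

	out, err := parseSingle(doc)
	if err != nil {
		fmt.Println("unmarshal failed:", err)
		return
	}
	fmt.Println(out.Link.URL, out.Link.Target, out.Link.Content)
}
```

With this shape the URL is reachable directly as `out.Link.URL`, without indexing into a slice.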

Playground

Edit: added more tags to the XML to show how different tag types can be parsed.

2022-08-01

萧十郎

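Contributed 1815 experience points · received over 13 likes

You can parse the sample you posted with a regular XML parser, but there are many deviations from XML syntax that are commonly accepted as valid HTML.

The simplest example I can think of: every HTML interpreter I know of treats <br> (an unclosed tag) the same as the self-closing tag <br />.

If you don't know how the HTML on the other side of the service is generated, you are better off using an HTML parser.

For example, there is the golang.org/x/net/html package, which provides several functions for parsing HTML: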
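As a rough illustration of the unclosed <br> problem, the sketch below feeds two fragments to encoding/xml using only the standard library. The `paragraph` type and the `parseStrict` helper are hypothetical names used for this example.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// paragraph collects the character data of the root element.
type paragraph struct {
	Text string `xml:",chardata"`
}

// parseStrict (hypothetical helper) runs xml.Unmarshal, which parses in
// strict mode, and returns whatever error it reports.
func parseStrict(doc string) error {
	var p paragraph
	return xml.Unmarshal([]byte(doc), &p)
}

func main() {
	// Well-formed XML: <br /> is self-closing.
	fmt.Println("closed <br />:", parseStrict(`<p>line1 <br /> line2</p>`))

	// Common in HTML, but not well-formed XML: <br> is never closed.
	fmt.Println("unclosed <br>:", parseStrict(`<p>line1 <br> line2</p>`))
}
```

The first call returns a nil error, while the second returns a syntax error because </p> arrives while <br> is still open.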

https://play.golang.org/p/3hUogiwdRPO

    package main

    import (
        "fmt"
        "strings"

        "golang.org/x/net/html"
    )

    // findFirstHref walks the node tree depth-first and returns the href
    // of the first <a> element it finds, or "" if there is none.
    func findFirstHref(n *html.Node, indent string) string {
        if n.Type == html.ElementNode {
            fmt.Println(" * scanning:" + indent + n.Data)
        }

        if n.Type == html.ElementNode && n.Data == "a" {
            for _, a := range n.Attr {
                if a.Key == "href" {
                    return a.Val
                }
            }
        }

        for c := n.FirstChild; c != nil; c = c.NextSibling {
            href := findFirstHref(c, indent+" ")
            if href != "" {
                return href
            }
        }
        return ""
    }

    func main() {
        doc1, err := html.Parse(strings.NewReader(sample1))
        if err != nil {
            fmt.Println(err)
        } else {
            fmt.Println("href in sample1:", findFirstHref(doc1, ""))
        }

        doc2, err := html.Parse(strings.NewReader(sample2))
        if err != nil {
            fmt.Println(err)
        } else {
            fmt.Println("href in sample2:", findFirstHref(doc2, ""))
        }
    }

    const (
        sample1 = `<?xml version="1.0" encoding="utf-8"?>
    <SOME_HTML>
        <a href="http://www.google.com" target="_blank"> google</a>
    </SOME_HTML>`

        // sample2 is an invalid XML document (it has unclosed "<br>" tags).
        // Assumption: its opening <a> tag mirrors the one in sample1.
        sample2 = `
    <p> line1 <br> line2
    <a href="http://www.google.com" target="_blank">
    Some <br> text
    </a>
    </p>
    `
    )

2022-08-01
