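Golang HTML charset decoding

I am trying to decode HTML pages that are not UTF-8 encoded, for example one declaring <meta http-equiv="Content-Type" content="text/html; charset=gb2312">. Is there a library that can do this? I could not find one online. PS: Of course, I could use goquery and iconv-go to extract the charset and decode the HTML page myself, but I would rather not reinvent the wheel.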


2 Answers

扬帆大鱼


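The Go project provides official extension packages for this: charset (golang.org/x/net/html/charset) and encoding (golang.org/x/text/encoding).

The following code ensures that the html package can parse the document correctly: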

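import (
	"bufio"
	"io"

	"golang.org/x/net/html"
	"golang.org/x/net/html/charset"
	"golang.org/x/text/encoding/htmlindex"
)

// detectContentCharset peeks at the first 1024 bytes of r without consuming
// them and returns the name of the detected encoding.
func detectContentCharset(r *bufio.Reader) string {
	// Peek also returns an error when the document is shorter than 1024
	// bytes, but it still returns the bytes it has, so the error is ignored.
	data, _ := r.Peek(1024)
	// The third result only reports whether the guess is certain; a charset
	// declared in a <meta> tag is reported as uncertain, so it is not checked.
	if _, name, _ := charset.DetermineEncoding(data, ""); name != "" {
		return name
	}
	return "utf-8"
}

// Decode parses the HTML body using the specified encoding and
// returns the HTML document. An empty label triggers detection.
func Decode(body io.Reader, label string) (*html.Node, error) {
	// Read through one buffered reader so the peeked bytes are not lost.
	r := bufio.NewReader(body)
	if label == "" {
		label = detectContentCharset(r)
	}
	e, err := htmlindex.Get(label)
	if err != nil {
		return nil, err
	}
	body = r
	if name, _ := htmlindex.Name(e); name != "utf-8" {
		body = e.NewDecoder().Reader(r)
	}
	return html.Parse(body)
}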

2022-01-04

交互式爱情


goquery can do what you need. For example:

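package main

import (
	"fmt"
	"log"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	d, err := goquery.NewDocument("http://www.google.com")
	if err != nil {
		log.Fatal(err)
	}
	dc := d.Find("head").Find("meta[http-equiv]")
	// Attr returns the value and a bool, not an error.
	// The value looks like "text/html; charset=gb2312".
	c, ok := dc.Attr("content")
	fmt.Println(c, ok)
	// ...
}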

More operations are available on the Document struct.
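The content attribute holds a full media type string, not just the charset, so one more step is needed to isolate the charset name. A minimal sketch using only the standard library's mime.ParseMediaType could look like the following; charsetFromContent is a hypothetical helper name, and lowercasing the result is an assumption made here for easier comparison, not something goquery requires.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContent (hypothetical helper) extracts the charset parameter
// from the content attribute of a <meta http-equiv="Content-Type"> tag.
// It returns "" when the value cannot be parsed or has no charset.
func charsetFromContent(content string) string {
	_, params, err := mime.ParseMediaType(content)
	if err != nil {
		return ""
	}
	// Parameter names are already lowercased by ParseMediaType; the value
	// is lowercased here as well so "GB2312" and "gb2312" compare equal.
	return strings.ToLower(params["charset"])
}

func main() {
	fmt.Println(charsetFromContent("text/html; charset=gb2312"))
}
```

The returned name can then be handed to whichever decoder is used to convert the page to UTF-8.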

2022-01-04
