3 Answers
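Contributor with 1790 experience points, more than 9 upvotes
The Unicode character U+FEFF is the byte order mark, or BOM. It is used to distinguish big-endian from little-endian UTF-16 encoding. If you decode the web page with the right codec, Python will remove it for you. Example: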
#!python2
#coding: utf8
u = u'ABC'
e8 = u.encode('utf-8') # encode without BOM
e8s = u.encode('utf-8-sig') # encode with BOM
e16 = u.encode('utf-16') # encode with BOM
e16le = u.encode('utf-16le') # encode without BOM
e16be = u.encode('utf-16be') # encode without BOM
print 'utf-8 %r' % e8
print 'utf-8-sig %r' % e8s
print 'utf-16 %r' % e16
print 'utf-16le %r' % e16le
print 'utf-16be %r' % e16be
print 'utf-8 w/ BOM decoded with utf-8 %r' % e8s.decode('utf-8')
print 'utf-8 w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
print 'utf-16 w/ BOM decoded with utf-16 %r' % e16.decode('utf-16')
print 'utf-16 w/ BOM decoded with utf-16le %r' % e16.decode('utf-16le')
Note that EF BB BF is the BOM encoded as UTF-8. UTF-8 does not require it; it serves only as a signature (usually on Windows).
Output:
utf-8 'ABC'
utf-8-sig '\xef\xbb\xbfABC'
utf-16 '\xff\xfeA\x00B\x00C\x00' # Adds BOM and encodes using native processor endian-ness.
utf-16le 'A\x00B\x00C\x00'
utf-16be '\x00A\x00B\x00C'
utf-8 w/ BOM decoded with utf-8 u'\ufeffABC' # doesn't remove BOM if present.
utf-8 w/ BOM decoded with utf-8-sig u'ABC' # removes BOM if present.
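utf-16 w/ BOM decoded with utf-16 u'ABC' # uses the BOM to pick the byte order, then removes it.
utf-16 w/ BOM decoded with utf-16le u'\ufeffABC' # doesn't remove BOM if present.
Note that the utf-16 codec relies on the BOM being present. Without one, Python cannot tell whether the data is big-endian or little-endian and falls back to the platform's native byte order, which may be wrong for the data.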
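For cases where the encoding is not known ahead of time, the BOM rules above can be sketched as a small sniffing helper. This is a minimal Python 3 sketch, not part of the original answer: `decode_with_bom` and its `default` parameter are hypothetical names, and it deliberately ignores UTF-32 (whose little-endian BOM starts with the same two bytes as the UTF-16 one).

```python
import codecs

def decode_with_bom(data, default="utf-8"):
    """Hypothetical helper: choose a codec from a leading BOM, else use `default`."""
    if data.startswith(codecs.BOM_UTF8):
        # 'utf-8-sig' strips the EF BB BF signature while decoding.
        return data.decode("utf-8-sig")
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # 'utf-16' reads the BOM to pick the byte order, then drops it.
        return data.decode("utf-16")
    # No BOM found: fall back to the caller's assumption.
    return data.decode(default)

print(repr(decode_with_bom(b"\xef\xbb\xbfABC")))           # UTF-8 with signature
print(repr(decode_with_bom(b"\xff\xfeA\x00B\x00C\x00")))   # UTF-16, little-endian BOM
print(repr(decode_with_bom(b"\xfe\xff\x00A\x00B\x00C")))   # UTF-16, big-endian BOM
print(repr(decode_with_bom(b"ABC")))                       # no BOM at all
```

For a web page, the charset declared in the HTTP headers or the HTML itself is normally a better source of truth than sniffing; the helper only covers the BOM cases discussed here.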
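Contributor with 2041 experience points, more than 4 upvotes
That character is the BOM, or "byte order mark". It usually arrives as the first few bytes of a file and tells you how to interpret the encoding of the rest of the data. You can simply remove the character and carry on. However, since the error says you were trying to convert to 'ascii', you should probably choose a different encoding for whatever you are trying to do.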
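Removing the character by hand might look like the following. This is a rough Python 3 sketch under stated assumptions: `strip_bom` is a hypothetical helper, and the sample bytes stand in for whatever data was actually received.

```python
BOM = "\ufeff"

def strip_bom(text):
    """Hypothetical helper: drop a single leading U+FEFF if present."""
    return text[1:] if text.startswith(BOM) else text

# Decoding with plain 'utf-8' keeps the BOM as the first character.
decoded = b"\xef\xbb\xbfABC".decode("utf-8")
cleaned = strip_bom(decoded)

try:
    # U+FEFF has no ASCII representation, so this raises.
    decoded.encode("ascii")
except UnicodeEncodeError as exc:
    print("ascii encoding failed:", exc.reason)

# Once the BOM is gone, this particular text is plain ASCII.
print(cleaned.encode("ascii"))
# For text that may contain other non-ASCII characters, 'utf-8' is a safer target.
print(cleaned.encode("utf-8"))
```

Stripping only fixes the BOM itself; any other non-ASCII character in the data will still fail to encode as 'ascii', which is why choosing a wider target encoding is usually the more robust fix.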