I am writing a script that goes through a list of links and parses information from each page. It works for most sites, but on some it chokes with:

    UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 13: ordinal not in range(128)

It stops in client.py of urllib on Python 3. The exact link is:

    http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html

There are quite a few similar posts here, but none of the answers seems to work for me. My code is:

    import socket
    from urllib import request
    from urllib.error import HTTPError, URLError

    def __request(link, debug=0):
        try:
            html = request.urlopen(link, timeout=35).read()  # made this long as I was getting lots of timeouts
            unicode_html = html.decode('utf-8', 'ignore')
        # NOTE the except HTTPError must come first, otherwise except URLError will also catch an HTTPError.
        except HTTPError as e:
            if debug:
                print('The server couldn\'t fulfill the request for ' + link)
                print('Error code: ', e.code)
            return ''
        except URLError as e:
            if isinstance(e.reason, socket.timeout):
                print('timeout')
            return ''
        else:
            return unicode_html

This is how the request function is called:

    link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'
    page = __request(link)

The traceback is:

    Traceback (most recent call last):
      File "<string>", line 250, in run_nodebug
      File "C:\reader\get_news.py", line 276, in <module>
        main()
      File "C:\reader\get_news.py", line 255, in main
        body = get_article_body(item['link'],debug=0)
      File "C:\reader\get_news.py", line 155, in get_article_body
        page = __request('na',url)
      File "C:\reader\get_news.py", line 50, in __request
        html = request.urlopen(link, timeout=35).read()
      File "C:\Python33\Lib\urllib\request.py", line 156, in urlopen
        return opener.open(url, data, timeout)
      File "C:\Python33\Lib\urllib\request.py", line 469, in open
        response = self._open(req, data)
      File "C:\Python33\Lib\urllib\request.py", line 487, in _open

Any help is appreciated. This is driving me crazy; I think I have tried every combination of x.decode and similar.
3 Answers
MM们
Contributed 1886 experience points, received more than 2 upvotes
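I wasn't sure whether other parts of the URL could cause the same problem, so I split the URL apart and rebuilt it:

    from urllib import parse

    url_tuple = parse.urlsplit(link)
    # Percent-encode the path (index 2) and the fragment (index 4);
    # scheme, host and query string are passed through unchanged.
    # urlunsplit only inserts "?" and "#" when those parts are non-empty.
    encode_link = parse.urlunsplit((
        url_tuple[0],
        url_tuple[1],
        parse.quote(url_tuple[2]),
        url_tuple[3],
        parse.quote(url_tuple[4]),
    ))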
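As a rough sketch of the same idea wrapped in a reusable function: the helper name `encode_url` and the choice of `safe` characters are my own assumptions, not part of the answer above. Keeping `%` in `safe` is meant to avoid double-encoding links that are already percent-encoded, and the query string is quoted as well in case it also contains non-ASCII characters. The host name is left untouched, so a non-ASCII domain would still need separate (IDNA) handling.

```python
from urllib import parse


def encode_url(link):
    """Percent-encode non-ASCII characters in the path, query and fragment.

    Hypothetical helper for illustration; scheme and host are left as-is.
    """
    parts = parse.urlsplit(link)
    return parse.urlunsplit((
        parts.scheme,
        parts.netloc,
        # Keep "/" as a separator and "%" so existing escapes survive.
        parse.quote(parts.path, safe="/%"),
        # Keep the usual query delimiters intact.
        parse.quote(parts.query, safe="=&%+"),
        parse.quote(parts.fragment, safe="%"),
    ))


link = "http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html"
encoded = encode_url(link)

# The encoded link is pure ASCII, so this no longer raises UnicodeEncodeError.
encoded.encode("ascii")
print(encoded)
```

The encoded link can then be passed to `request.urlopen` in place of the original one.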