The Baidu Baike spider source code I wrote following the tutorial (slightly modified):
https://github.com/effortjohn/baike_spider
2016-02-13
# coding:utf-8
import urllib2
import cookielib

url = "http://www.baidu.com"

print u'Method 3'
# Keep a CookieJar so cookies the server sets are stored and sent back.
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()
# Read the body once: a second read() on the same response returns ''.
html = response3.read()
print len(html)
print cj   # the cookies collected during the request
print html
2016-02-12
My code, with a few errors corrected; it runs.
# coding:utf-8
import urllib2
import cookielib  # needed by the cookie-based method above

url = "http://www.baidu.com"

print u'Method 1'
# Simplest form: fetch the URL directly with urlopen.
response1 = urllib2.urlopen(url)
print response1.getcode()    # HTTP status code, 200 on success
print len(response1.read())
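The note shows only the first method; the tutorial's second method (not copied here) presumably wraps the URL in a urllib2.Request so request headers can be attached. A sketch of that variant, reusing the imports and url above (the User-Agent value is my own choice, not from the note):

print u'Method 2'
# Build a Request object so request headers can be set explicitly.
request = urllib2.Request(url)
request.add_header('user-agent', 'Mozilla/5.0')  # pretend to be a browser
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())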
2016-02-12
When copying code from the notes, watch the indentation: in the _get_new_urls function I put return new_urls inside the for loop, so the function returned after the first link and the whole program stopped after crawling just one page.
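A minimal sketch of that mistake and its fix, assuming the tutorial's BeautifulSoup-based parser (the /view/\d+\.htm pattern is the tutorial's Baike article-URL format):

# coding:utf-8
import re
import urlparse

def _get_new_urls(page_url, soup):
    new_urls = set()
    # Collect every in-site link matching the Baike article pattern.
    links = soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
    for link in links:
        new_full_url = urlparse.urljoin(page_url, link['href'])
        new_urls.add(new_full_url)
        # BUG: a "return new_urls" indented here exits after the first link.
    return new_urls  # correct: return only after the loop finishes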
2016-02-12
Top-voted answer / Effortjohn
In the html_outputer code, between writing <html> and <body>, also write <head><meta charset="utf-8"></head>, like this: fout = open('output.html','w') fout.write("<html>") fout.write("<head>") ...
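Filling out that answer into a complete output_html (a sketch modeled on the tutorial's html_outputer; the datas list of {url, title, summary} dicts is assumed from the tutorial, and only the <head> line is the actual fix):

# coding:utf-8

class HtmlOutputer(object):
    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data is not None:
            self.datas.append(data)

    def output_html(self):
        fout = open('output.html', 'w')
        fout.write("<html>")
        # The fix: declare utf-8 before <body> so browsers decode the Chinese text.
        fout.write("<head><meta charset=\"utf-8\"></head>")
        fout.write("<body>")
        fout.write("<table>")
        for data in self.datas:
            fout.write("<tr>")
            fout.write("<td>%s</td>" % data['url'])
            fout.write("<td>%s</td>" % data['title'].encode('utf-8'))
            fout.write("<td>%s</td>" % data['summary'].encode('utf-8'))
            fout.write("</tr>")
        fout.write("</table>")
        fout.write("</body>")
        fout.write("</html>")
        fout.close()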
2016-02-10