The source code of the Baidu Baike spider I wrote following the tutorial (with minor modifications):
https://github.com/effortjohn/baike_spider
2016-02-13
import urllib2
import cookielib

url = "http://www.baidu.com"

print u'Method 3'
cj = cookielib.CookieJar()                      # jar that stores cookies set by the server
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)                  # make urlopen use the cookie-aware opener
response3 = urllib2.urlopen(url)
print response3.getcode()
content = response3.read()  # read the body once; a second read() on the same response returns ''
print len(content)
print cj
print content
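
For anyone running this under Python 3: urllib2 and cookielib still exist there, just under different names in the standard library. A minimal equivalent sketch (same logic, Python 3 module names):

import urllib.request
import http.cookiejar

url = "http://www.baidu.com"
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
urllib.request.install_opener(opener)
response = urllib.request.urlopen(url)
print(response.getcode())
content = response.read()   # again, read the body only once
print(len(content))
print(cj)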
2016-02-12
My code, with a few errors corrected; it runs now.
# coding:utf-8
import urllib2
import cookielib   # not used in this snippet; kept from the tutorial template

url = "http://www.baidu.com"

print u'Method 1'
response1 = urllib2.urlopen(url)     # plain GET, no custom headers or cookie handling
print response1.getcode()            # HTTP status code, 200 on success
print len(response1.read())          # length of the response body in bytes
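
The numbering here skips method 2. Judging from methods 1 and 3 it would presumably be the variant that wraps the URL in a urllib2.Request and adds a User-Agent header; this is an assumption based on the surrounding code, not on the original notes:

import urllib2

url = "http://www.baidu.com"
request = urllib2.Request(url)
request.add_header('User-Agent', 'Mozilla/5.0')  # present ourselves as a browser
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())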
2016-02-12
Watch the indentation when copying code from the notes: in the _get_new_urls function I had put return new_urls inside the for loop, so the function returned after a single iteration and the whole program stopped after crawling just one link. See the sketch below.
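
A minimal sketch of the corrected method, assuming the tutorial's HtmlParser structure; the BeautifulSoup usage and the /view/ link pattern are assumptions for illustration, not taken from the original code:

import re
import urlparse  # Python 2; use urllib.parse in Python 3

def _get_new_urls(self, page_url, soup):
    # collect every Baike entry link on the page before returning
    new_urls = set()
    # assumption: entry links look like /view/12345.htm
    links = soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
    for link in links:
        new_urls.add(urlparse.urljoin(page_url, link['href']))
        # BUG: if return new_urls is indented to this level, the function
        # returns after the first link and the crawl stops early
    return new_urls  # correct: return only after the loop has finished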
2016-02-12