- URL manager (a minimal sketch follows below).
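  A minimal sketch, assuming a two-set design (URLs waiting to be crawled vs. already crawled); the class and method names are illustrative, not from the original notes:

    class UrlManager(object):
        """Tracks which URLs still need crawling and which are done."""

        def __init__(self):
            self.new_urls = set()  # waiting to be crawled
            self.old_urls = set()  # already crawled

        def add_new_url(self, url):
            # Ignore empty URLs and anything we have already seen
            if url and url not in self.new_urls and url not in self.old_urls:
                self.new_urls.add(url)

        def add_new_urls(self, urls):
            for url in urls or []:
                self.add_new_url(url)

        def has_new_url(self):
            return len(self.new_urls) != 0

        def get_new_url(self):
            # Hand out one pending URL and mark it as crawled
            url = self.new_urls.pop()
            self.old_urls.add(url)
            return url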
- Sequence diagram.
- Dynamic run flow of a simple crawler (a runnable sketch follows below).
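  A runnable sketch of that flow, under assumptions not in the original notes: bs4 is installed, and download/parse/crawl are hypothetical helper names standing in for the downloader, parser, and scheduler the notes describe:

    import urllib.request
    from bs4 import BeautifulSoup

    def download(url):
        # Fetch the raw page bytes
        return urllib.request.urlopen(url).read()

    def parse(page_url, html):
        # Extract outgoing links and one piece of data (here, the page title)
        soup = BeautifulSoup(html, 'html.parser')
        links = [a.get('href') for a in soup.find_all('a')
                 if a.get('href', '').startswith('http')]
        data = soup.title.string if soup.title else page_url
        return links, data

    def crawl(root_url, limit=3):
        # The dynamic run loop: take a new URL, download, parse, queue new URLs
        new_urls, old_urls = {root_url}, set()
        while new_urls and len(old_urls) < limit:
            url = new_urls.pop()
            old_urls.add(url)
            try:
                links, data = parse(url, download(url))
            except Exception:
                continue  # skip pages that fail to download or parse
            new_urls |= set(links) - old_urls
            print(url, data)

    crawl('http://www.baidu.com')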
- A Python 3.4.4 version of the page downloader:

    import urllib.request
    from http.cookiejar import CookieJar

    url = 'http://www.baidu.com'

    print('Method 1: plain urlopen')
    res1 = urllib.request.urlopen(url)
    print(res1.getcode())
    print(len(res1.read()))

    print('Method 2: Request object with a user-agent header')
    request = urllib.request.Request(url, headers={'user-agent': 'Mozilla/5.0'})
    res2 = urllib.request.urlopen(request)
    print(res2.getcode())
    print(len(res2.read()))

    print('Method 3: opener with cookie handling')
    cj = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    urllib.request.install_opener(opener)
    res3 = urllib.request.urlopen(url)
    print(res3.getcode())
    print(cj)
    print(res3.read())
- In Python 3.5, the method-three code should be:

    import urllib.request
    import http.cookiejar

    cj = http.cookiejar.CookieJar()
    pro = urllib.request.HTTPCookieProcessor(cj)
    opener = urllib.request.build_opener(pro)
    urllib.request.install_opener(opener)
    response = urllib.request.urlopen('http://www.baidu.com')
- Method three suits special pages with extra requirements, e.g. sites that demand cookies or other handler support (see the sketch below).
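  A minimal sketch of wiring extra handlers into the method-three opener; the proxy address is a placeholder, not a value from the original notes:

    import urllib.request
    from http.cookiejar import CookieJar

    cj = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cj),  # pages that require cookies
        urllib.request.ProxyHandler({'http': 'http://127.0.0.1:8080'}),  # placeholder proxy
    )
    urllib.request.install_opener(opener)
    response = urllib.request.urlopen('http://www.baidu.com')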
- In Python 3.5, the method-two code should be:

    import urllib.parse
    import urllib.request

    request = urllib.request.Request('http://www.baidu.com')
    # The POST body belongs on the Request object and must be bytes
    request.data = urllib.parse.urlencode({'a': '1'}).encode('utf-8')
    request.add_header('User-Agent', 'Mozilla/5.0')
    response = urllib.request.urlopen(request)
- Method two: add our data (headers, request body) to the Request object.
- The code for method one:

    import urllib.request

    response = urllib.request.urlopen('http://www.baidu.com')
    print(response.getcode())  # prints 200 on success
    cont = response.read()
- urllib2 page download, method one.
- Run flow.
- Crawl strategy: decide three things up front: 1. URL format; 2. data format; 3. page encoding (a quick way to check these is sketched below).
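  A small sketch of checking points 1 and 3 before writing the crawler; it assumes bs4 and uses Baidu only because the other notes do:

    import urllib.request
    from bs4 import BeautifulSoup

    html = urllib.request.urlopen('http://www.baidu.com').read()
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.original_encoding)          # 3. the page encoding bs4 detected
    for a in soup.find_all('a', limit=5):  # 1. sample the URL format of links
        print(a.get('href'))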
- The Python 2 (urllib2) version of the three download methods:

    # coding:utf8
    import urllib2
    import cookielib

    url = "http://www.baidu.com"

    print "Method 1"
    response1 = urllib2.urlopen(url)
    print response1.getcode()
    print response1.read()

    print "Method 2"
    request = urllib2.Request(url)
    request.add_header("user-agent", "Mozilla/5.0")
    response2 = urllib2.urlopen(request)
    print response2.getcode()
    print len(response2.read())

    print "Method 3"
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    urllib2.install_opener(opener)
    response3 = urllib2.urlopen(url)
    print response3.getcode()
    print cj
    print len(response3.read())
-
soup = BeautifulSoup(html_doc,'html.parser',from_encoding='utf-8') print '獲取所有的鏈接' links=soup.find_all('a') for link in links: print link.name,link['href'],link.get_text() print '获取lacie链接' link_node=soup.find('a',href='http://example.com/lacie') print link_node.name,link_node['href'],link_node.get_text() print '正则匹配' link_node=soup.find('a',href=re.compile(r'ill')) print link_node.name,link_node['href'],link_node.get_text() print 'p段落名字' link_node=soup.find('p',class_='title') print link_node.name,link_node.get_text()查看全部
- Node information: a tag's name, attributes, and text (a small example follows below).
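  A minimal example of the three kinds of node information bs4 exposes; the sample tag is made up for illustration:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<p class="title"><b>Hello</b></p>', 'html.parser')
    node = soup.find('p')
    print(node.name)        # tag name: 'p'
    print(node['class'])    # attribute value: ['title']
    print(node.get_text())  # text content: 'Hello'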