Developing a Simple Crawler in Python: Study Notes


天天_

Method 3 (with cookie handling):

import urllib.request
import http.cookiejar

cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
urllib.request.install_opener(opener)
response3 = urllib.request.urlopen('https://www.baidu.com/')
print(response3.getcode())
print(len(response3.read()))

Source: Three ways to download web pages with urllib2, the Python crawler downloader

2018-10-23

天天_

Method 2 (with a Request object and a custom header):

import urllib.request

req = urllib.request.Request('https://www.baidu.com/')
req.add_header('User-Agent','Mozilla/5.0')
response = urllib.request.urlopen(req)
print(response.getcode())
print(response)
cont = response.read()
print(cont)

Source: Three ways to download web pages with urllib2, the Python crawler downloader

2018-10-23

天天_

Method 1 (plain urlopen):
import urllib.request

response = urllib.request.urlopen('https://www.baidu.com/')
print(response.getcode())
cont = response.read()
print(cont)

Source: Three ways to download web pages with urllib2, the Python crawler downloader

2018-10-23

幕布斯6498529 01:47

断

Source: Three ways to download web pages with urllib2, the Python crawler downloader
2018-10-22
慕婉清6014571

The bs4 parser:

Source: BeautifulSoup syntax
2018-10-22

touch_the_dream 00:40

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup   # import the BeautifulSoup HTML parsing library
import re                       # import the regular expression library

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# 1. Create the BeautifulSoup object
# html_doc is already a str, so from_encoding is omitted; it applies only to bytes input
soup = BeautifulSoup(html_doc,        # the HTML document string
                     'html.parser')   # the HTML parser

# 2. Search for nodes (find_all, find)
print('All links:')
links = soup.find_all('a')
for link in links:
    # 3. Access the node's content
    print(link.name, link['href'], link.get_text())

print("Lacie's link:")
link_node = soup.find('a', href='http://example.com/lacie')
print(link_node.name, link_node['href'], link_node.get_text())

print('Regular expression match:')
link_node = soup.find('a', href=re.compile(r"ill"))
print(link_node.name, link_node['href'], link_node.get_text())

print('Text of the title paragraph:')
p_node = soup.find('p', class_="title")
print(p_node.name, p_node.get_text())

Source: BeautifulSoup example test

2018-10-19

慕函数1556660 01:16

Python crawlers are simple.

Source: Ways to implement the URL manager of a Python crawler
2018-10-16
小猪码农 01:44

URL manager

Source: Ways to implement the URL manager of a Python crawler
2018-10-15
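
A URL manager typically tracks which URLs are still waiting to be crawled and which have already been handed out, so that no page is fetched twice. The following is a minimal in-memory sketch built on two Python sets. The class name `UrlManager` and its method names are assumptions chosen for illustration and are not taken from the course material.

```python
class UrlManager:
    """Hypothetical in-memory URL manager backed by two sets."""

    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already handed out for crawling

    def add_new_url(self, url):
        # Ignore empty values and URLs that were seen before
        if not url:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move one URL from the waiting set to the crawled set
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


manager = UrlManager()
manager.add_new_urls(['http://example.com/a', 'http://example.com/b'])
manager.add_new_url('http://example.com/a')   # duplicate, ignored
print(len(manager.new_urls))
```

Sets keep the duplicate check cheap. A larger crawler might store the same two collections in a database or a cache instead, at the cost of extra setup.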
幕布斯0465714

Simple crawler architecture: run flow

Source: Dynamic run flow of the simple Python crawler architecture
2018-10-15
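
The run flow named in this note can be sketched as a loop: take a pending URL, download the page, parse it for new links, queue the links that have not been seen, and repeat. The sketch below runs offline under stated assumptions: `FAKE_PAGES`, `download`, `parse`, and `crawl` are hypothetical names, the "downloader" reads from an in-memory dictionary instead of the network, and the "parser" is a simple regular expression, not BeautifulSoup.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> HTML. A real crawler would download
# these pages over HTTP instead.
FAKE_PAGES = {
    'http://example.com/a': '<a href="http://example.com/b">b</a> '
                            '<a href="http://example.com/c">c</a>',
    'http://example.com/b': '<a href="http://example.com/a">a</a>',
    'http://example.com/c': 'no links here',
}


def download(url):
    # Stand-in downloader: returns None for unknown URLs
    return FAKE_PAGES.get(url)


def parse(html):
    # Stand-in parser: collect absolute links with a regular expression
    return re.findall(r'href="(http[^"]+)"', html)


def crawl(root_url, max_pages=10):
    pending = deque([root_url])   # URLs waiting to be crawled
    seen = {root_url}             # every URL that was ever queued
    crawled = []                  # URLs actually processed, in order
    while pending and len(crawled) < max_pages:
        url = pending.popleft()
        html = download(url)
        if html is None:
            continue              # download failed, move on
        crawled.append(url)
        for link in parse(html):
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return crawled


print(crawl('http://example.com/a'))
```

The `max_pages` limit plays the role of a stopping condition, so the loop ends even when pages keep linking to new pages.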
幕布斯0465714

Simple crawler architecture

Source: Simple Python crawler architecture
2018-10-15
慕圣405381 00:41

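import urllib2  # Python 2 module; in Python 3 the equivalent is urllib.request

Source: Three ways to download web pages with urllib2, the Python crawler downloader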
2018-10-15
慕圣405381 00:41

A status code of 200 means the request succeeded.

Source: Three ways to download web pages with urllib2, the Python crawler downloader
2018-10-15
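
As a small illustration of the status code mentioned in this note, the standard library's `http.HTTPStatus` enum maps a numeric code to its reason phrase. The helper `describe_status` below is a hypothetical name introduced for this sketch. It treats the whole 2xx range as success, which is the usual HTTP convention and not something stated in the note.

```python
from http import HTTPStatus


def describe_status(code):
    # Hypothetical helper: turn a numeric status code into a short description.
    # HTTPStatus(code) raises ValueError for codes it does not know.
    status = HTTPStatus(code)
    outcome = 'success' if 200 <= code < 300 else 'not a success'
    return '%d %s (%s)' % (code, status.phrase, outcome)


print(describe_status(200))
print(describe_status(404))
```

In the download snippets in these notes, the value to check would be the one returned by `response.getcode()`.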
笑丶忘
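1. Introduction to crawlers
2. Simple crawler architecture
3. URL manager
4. Web page downloader (urllib2)
5. Web page parser (BeautifulSoup)
6. Complete example: crawling 1,000 pages related to the Baidu Baike "Python" entry

Source: Course introduction to developing a simple crawler in Python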
2018-10-14

尼古拉皮

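# Python 2 version; in Python 3 these modules are urllib.request and http.cookiejar
import urllib2
import cookielib

cj = cookielib.CookieJar()                                       # container that stores cookies
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))   # opener that handles cookies
urllib2.install_opener(opener)                                   # make urlopen use this opener
response = urllib2.urlopen("http://www.baidu.com/")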

Source: Three ways to download web pages with urllib2, the Python crawler downloader

2018-10-13

尼古拉皮

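# Python 2 version
import urllib
import urllib2

url = 'http://www.baidu.com/'   # assumed target; the original note left `url` undefined
request = urllib2.Request(url)
# add_data() takes one already-encoded string, not a separate key and value
request.add_data(urllib.urlencode({'a': '1'}))
request.add_header('User-Agent', 'Mozilla/5.0')
response = urllib2.urlopen(request)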

Source: Three ways to download web pages with urllib2, the Python crawler downloader

2018-10-13
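
For readers on Python 3, where `Request.add_data` no longer exists, the same idea (a request carrying form data and a custom header) can be sketched with `urllib.request`. The target URL and the form field `a=1` are assumptions carried over from the notes for illustration. The sketch only builds and inspects the request, so it runs without network access.

```python
import urllib.parse
import urllib.request

url = 'https://www.baidu.com/'   # assumed target URL for illustration
# Form data must be URL-encoded and converted to bytes
data = urllib.parse.urlencode({'a': '1'}).encode('utf-8')

request = urllib.request.Request(url, data=data)
request.add_header('User-Agent', 'Mozilla/5.0')

# Inspect the request without sending it
print(request.get_method())               # a request with data defaults to POST
print(request.data)
print(request.get_header('User-agent'))   # header names are stored capitalized
```

To send the request, pass it to `urllib.request.urlopen(request)`, as in the Python 3 snippets earlier in these notes.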


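Course prerequisites: this is an advanced course on development in Python. You should already know:
1. Python syntax
2. The basics of HTML
3. The basics of regular expressions

What the instructor says you will learn:
1. What crawler technology is and why it is valuable
2. Crawler architecture
3. The key modules of a crawler: the URL manager, the HTML downloader, and the HTML parser
4. A hands-on project that crawls 1,000 Baidu Baike entry pages: designing the crawl strategy, writing the code, and running the crawler
5. A minimal, extensible crawler codebase that you can modify to crawl any web page on the internet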
