Developing a Simple Web Crawler in Python: Study Notes


Foamed

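Implementation options:
1. In memory: Python sets, one set() for URLs waiting to be crawled and one set() for URLs already crawled. A set drops duplicates automatically.
2. Relational database (a table): urls(url, is_crawled)
3. Cache database: Redis, which supports the set type, holding the to-crawl and crawled sets (high performance, the most common choice)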
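
Option 1 can be sketched roughly as follows. This is a minimal illustration, not the course's code: the class name `UrlManager` and its method names are assumptions chosen for the example.

```python
class UrlManager:
    """In-memory URL manager (hypothetical names): pending set plus crawled set."""

    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        """Queue url unless it is empty, already queued, or already crawled."""
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        """Queue several URLs; None or an empty iterable is ignored."""
        for url in urls or ():
            self.add_new_url(url)

    def has_new_url(self):
        return bool(self.new_urls)

    def get_new_url(self):
        """Remove one pending URL, mark it as crawled, and return it."""
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


manager = UrlManager()
manager.add_new_urls([
    "http://example.com/a",
    "http://example.com/a",  # duplicate, dropped by the set
    "http://example.com/b",
])
first = manager.get_new_url()
manager.add_new_url(first)  # ignored: this URL was already crawled
print(len(manager.new_urls), len(manager.old_urls))  # prints: 1 1
```

Keeping the crawled URLs in their own set is what stops a page from being queued again after it has been fetched.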

Source: Ways to Implement the URL Manager in a Python Crawler
2019-10-02
Foamed

URL manager: prevents fetching the same page twice and prevents crawling in loops.

Source: URL Management in a Python Crawler
2019-10-02
Foamed 01:40

Just look at the diagram in the lecture.

Source: Dynamic Run Flow of the Simple Python Crawler Architecture
2019-10-02
Foamed

Crawler scheduler → URL manager → page downloader → page parser → valuable data
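
The flow above can be sketched as one scheduler loop. This is only an illustration under stated assumptions: `PAGES`, `download`, `parse`, and `crawl` are hypothetical names, the "site" is an in-memory dict so the sketch runs without network access, and the parser uses the standard library's `html.parser` rather than BeautifulSoup.

```python
from html.parser import HTMLParser

# Hypothetical in-memory site; the pages link to each other in a cycle.
PAGES = {
    "/a": '<title>A</title><a href="/b">b</a><a href="/c">c</a>',
    "/b": '<title>B</title><a href="/a">a</a>',
    "/c": '<title>C</title><a href="/b">b</a>',
}


def download(url):
    """Downloader stand-in: return the HTML for url, or None if it is missing."""
    return PAGES.get(url)


class _LinkTitleParser(HTMLParser):
    """Collects every <a href> value and the text of <title>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse(html):
    """Parser: return (new links, valuable data); here the data is the title."""
    parser = _LinkTitleParser()
    parser.feed(html)
    parser.close()
    return parser.links, parser.title


def crawl(root_url, max_pages=10):
    """Scheduler: URL manager -> downloader -> parser -> collected data."""
    new_urls = {root_url}  # URL manager: pending URLs
    old_urls = set()       # URL manager: crawled URLs
    collected = {}
    while new_urls and len(old_urls) < max_pages:
        url = new_urls.pop()
        old_urls.add(url)
        html = download(url)
        if html is None:
            continue
        links, title = parse(html)
        collected[url] = title
        new_urls.update(link for link in links if link not in old_urls)
    return collected


print(crawl("/a") == {"/a": "A", "/b": "B", "/c": "C"})  # prints: True
```

Although the three pages link to each other in a cycle, each one is fetched once because crawled URLs are never queued again.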

Source: Simple Python Crawler Architecture
2019-10-02
Foamed

Crawl the information you want.

Source: The Value of Crawler Technology
2019-10-02
Miller_Xu
import re  # regular expression module

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')  # (document string, parser)

print('All links:')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())  # (tag name, URL, text)

print('One specific link (the Lacie link):')
# link_node = soup.find('a', id="link2") gives the same result
link_node = soup.find('a', href='http://example.com/lacie')  # note: find returns one node, find_all returns a list
print(link_node.name, link_node['href'], link_node.get_text())

print('Fuzzy match with a regular expression:')
# The r prefix makes a raw string, so a backslash in the pattern is written once instead of twice
link_node = soup.find('a', href=re.compile(r"ill"))
print(link_node.name, link_node['href'], link_node.get_text())

print('Paragraph text (selected by class):')
p_node = soup.find('p', class_="story")
print(p_node.name, p_node.get_text())

Output:
```
All links:
a http://example.com/elsie Elsie
a http://example.com/lacie Lacie
a http://example.com/tillie Tillie
One specific link (the Lacie link):
a http://example.com/lacie Lacie
Fuzzy match with a regular expression:
a http://example.com/tillie Tillie
Paragraph text (selected by class):
p Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
```
Source: BeautifulSoup Example Test
2019-09-17
qq_慕丝2371519 02:41

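Downloading a web page, method 1: the simplest approach
import urllib2  # Python 2 module; in Python 3 the equivalent is urllib.request
url = 'http://example.com'  # placeholder page address
response = urllib2.urlopen(url)
print response.getcode()  # status code; 200 means the download succeeded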
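
For readers on Python 3, where `urllib2` no longer exists, the same idea might look like the sketch below. The helper name `download` and the 10-second default timeout are assumptions made for this example, not part of the course.

```python
import urllib.request


def download(url, timeout=10):
    """Fetch url and return (status_code, body_bytes).

    response.status holds the same value that getcode() returned in the
    urllib2 version. For 4xx/5xx responses urlopen raises
    urllib.error.HTTPError instead of returning a response.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.read()


# Example use (needs network access, so it is left commented out):
# status, body = download("http://example.com")
# print(status, len(body))
```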

Source: Three Ways to Download Web Pages with urllib2 in a Python Crawler
2019-09-17
慕妹0316382

New

Source: Course Introduction: Developing a Simple Web Crawler in Python
2019-09-13
慕妹3509545

Basic components.

Source: Dynamic Run Flow of the Simple Python Crawler Architecture
2019-09-13
西奈奈子 00:02

Hands-on urllib2 demo

Source: urllib2 Example Code Demo for a Python Crawler
2019-09-09
西奈奈子 00:15

Installing and testing BeautifulSoup

Source: Introducing and Installing the BeautifulSoup Module
2019-09-09
我要吃肉123_ 04:08

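Accessing node information
node.name        # tag name of the node
node['href']     # value of the node's href attribute
node.get_text()  # text inside the node

Source: BeautifulSoup Syntax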
2019-08-31
我要吃肉123_ 03:19

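Searching for nodes (find_all, find)
soup.find_all('a')                                        # every <a> tag
soup.find_all('a', href='/view/123.htm')                  # <a> tags with exactly this href
soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))    # <a> tags whose href matches the regex
soup.find_all('div', class_='abc', string='python')       # <div class="abc"> tags whose text is "python"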
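
The four searches above can be tried on a small document. This is a sketch under assumptions: the HTML below is made up for the example, and the `string` argument assumes Beautiful Soup 4.4 or later (older releases call it `text`).

```python
import re

from bs4 import BeautifulSoup

# Made-up document for illustration only.
html_doc = """
<div class="abc">python</div>
<div class="abc">java</div>
<a href="/view/123.htm">Entry 123</a>
<a href="/view/456.htm">Entry 456</a>
<a href="/about.htm">About</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")

all_links = soup.find_all("a")
exact = soup.find_all("a", href="/view/123.htm")
view_links = soup.find_all("a", href=re.compile(r"/view/\d+\.htm"))
python_divs = soup.find_all("div", class_="abc", string="python")
first_link = soup.find("a")  # find returns only the first match

print(len(all_links))
print([a["href"] for a in view_links])
print(python_divs[0].get_text())
print(first_link["href"])
```

`class_` carries a trailing underscore because `class` is a reserved word in Python.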

Source: BeautifulSoup Syntax
2019-08-31
我要吃肉123_ 02:22

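Creating a BeautifulSoup object
from bs4 import BeautifulSoup
# Build a BeautifulSoup object from an HTML document string
soup = BeautifulSoup(
    html_doc,              # HTML document string
    'html.parser',         # HTML parser
    from_encoding='utf8'   # encoding of the HTML document
)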
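
One detail worth illustrating: `from_encoding` only matters when the markup is passed as bytes. If the markup is already a `str`, Beautiful Soup ignores the argument and emits a warning. The sketch below uses a made-up Latin-1 document to show the bytes case.

```python
from bs4 import BeautifulSoup

# Made-up page encoded as Latin-1 bytes, as it might arrive from a download.
raw_bytes = "<html><body><p>café</p></body></html>".encode("latin-1")

# Tell Beautiful Soup which encoding to decode the bytes with.
soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="latin-1")

text = soup.p.get_text()
print(text)
```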

Source: BeautifulSoup Syntax
2019-08-31
我要吃肉123_ 03:17

Structured parsing: the DOM tree

Source: Introduction to Web Page Parsers for Python Crawlers
2019-08-31


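Course prerequisites: this is an advanced Python development course. You should already know: 1. Python syntax; 2. basic HTML; 3. basic regular expressions.

What the instructor says you will learn: 1. What crawler technology is and why it is valuable. 2. Crawler architecture. 3. The key modules of a crawler: URL manager, HTML downloader, and HTML parser. 4. A hands-on project that crawls 1000 Baidu Baike entry pages: choosing the crawl strategy, writing the code, and running the crawler. 5. A minimal, extensible crawler codebase that, with modifications, can crawl any web page on the internet.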
