Python Web Crawler Resources: A Beginner's Tutorial

Tags: Python crawlers

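Overview

This article introduces the basic concepts of Python web crawlers, the development workflow, and the most commonly used libraries, covering everything from environment setup to a worked example. It explains how to send HTTP requests with the requests library and parse HTML documents with BeautifulSoup, and it gives examples of using proxies and dealing with anti-crawling measures. The sample code throughout is meant to give readers a working grasp of crawler development in Python.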

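Introduction to Python Crawlers

Basic Concepts

A crawler is an automated tool for collecting data from the web, used mainly to extract specific information from web pages, databases, or files on the internet. Crawlers typically imitate a browser: they send HTTP requests to the target site and parse the returned HTML document to extract the information they need. Crawlers are widely used in data mining, information retrieval, and web monitoring.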

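The Basic Crawling Workflow

A crawler works through the following steps:

Send a request: the crawler sends an HTTP request to the target site for a specific page or resource.
Receive the response: the crawler receives the server's response, which usually contains an HTML document or another type of resource.
Parse the document: the crawler parses the returned document and extracts the data it needs.
Store the data: the crawler saves the extracted data to a local file or a database.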
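
The four steps above can be sketched as a small pipeline. This is only an illustration built on the standard library, not code from the article: the function names (fetch, parse, store, crawl) and the decision to extract just the page title are assumptions made for the example.

```python
import json
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def fetch(url):
    # Steps 1 and 2: send the request and receive the response body.
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def parse(html):
    # Step 3: parse the document and extract the data of interest.
    parser = TitleParser()
    parser.feed(html)
    return {"title": parser.title.strip()}


def store(record, path):
    # Step 4: store the extracted data (here, as a JSON file).
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False)


def crawl(url, path, fetcher=fetch):
    # `fetcher` is injectable so the pipeline can be tried without network access.
    record = parse(fetcher(url))
    store(record, path)
    return record
```

Calling crawl("https://www.example.com", "page.json") would run all four steps against a live site; passing a different fetcher lets you exercise the parsing and storage steps on a canned HTML string.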

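Why Python Suits Crawler Development

Python is a high-level language with a clean syntax and a rich library ecosystem, which makes it a natural choice for writing crawlers. Its main advantages are:

Strong library support: libraries such as requests and BeautifulSoup handle HTTP requests and HTML parsing.
Concise syntax: Python code is quick to write and easy to read.
Cross-platform: Python runs on Windows, Linux, macOS, and other operating systems.
Community support: a large user community and active developers provide plenty of resources and help when problems arise.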

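Installation and Environment Setup

Setting Up Python

Before writing any crawler code, make sure Python is installed:

Install Python: download the installer for your system from the official Python website (https://www.python.org/) and follow the setup wizard.
Configure the PATH: in the Windows installer, check the "Add Python to PATH" option so that Python can be run directly from the command line.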

Installing the Required Libraries

Many third-party libraries support crawler development; requests, BeautifulSoup, and lxml are among the most common. Install them with pip:

pip install requests
pip install beautifulsoup4
pip install lxml

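Setting Up the Working Environment

Setting up the working environment involves these steps:

Install an editor: choose a Python editor or IDE such as PyCharm or VS Code.
Create a virtual environment: use virtualenv or venv so that this project's packages do not interfere with other Python projects.
Activate the virtual environment: run the activation command for your platform.

Example commands:

# Create a virtual environment
python -m venv myenv

# Activate it (Windows)
myenv\Scripts\activate

# Activate it (Linux/macOS)
source myenv/bin/activate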

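Fetching Data with the Requests Library

About Requests

requests is a widely used HTTP library. It supports GET, POST, and the other HTTP methods, and it handles cookies and sessions.

Sending a GET Request

The basic steps for sending a GET request with requests are:

Import the requests library.
Send the request with requests.get().
Handle the response.

Example code:

import requests

url = "https://www.example.com"
# Set a timeout so the call cannot hang indefinitely
response = requests.get(url, timeout=10)

# Print the response status code
print(response.status_code)

# Print the response body
print(response.text)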
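
GET requests often carry query parameters, and a real crawler should check for failed responses. The sketch below goes beyond the original example: the /search path and the q and page parameters are hypothetical. It builds the request with requests.Request(...).prepare() so the final URL can be inspected without contacting a server; the commented lines show the equivalent live call.

```python
import requests

# Hypothetical endpoint and parameters, used only for illustration.
url = "https://www.example.com/search"
params = {"q": "python", "page": 2}

# Prepare the request without sending it, to see how params are encoded.
prepared = requests.Request("GET", url, params=params).prepare()
print(prepared.url)

# The equivalent live request, with a timeout and an error check:
# response = requests.get(url, params=params, timeout=10)
# response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
```

Passing a dict through params is usually safer than concatenating the query string by hand, because requests takes care of URL encoding.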

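Sending a POST Request

The basic steps for sending a POST request with requests are:

Import the requests library.
Send the request with requests.post().
Handle the response.

Example code:

import requests

url = "https://www.example.com"
payload = {"key1": "value1", "key2": "value2"}
# data= sends the payload as form fields
response = requests.post(url, data=payload, timeout=10)

# Print the response status code
print(response.status_code)

# Print the response body
print(response.text)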
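
Whether a server expects form fields or a JSON body decides which keyword argument to use. As a small sketch (using the same placeholder URL and payload, and preparing the requests without sending them), the following compares data= with json=:

```python
import json

import requests

url = "https://www.example.com"  # placeholder URL, as in the example above
payload = {"key1": "value1", "key2": "value2"}

# data= sends the payload as a URL-encoded form body.
form_request = requests.Request("POST", url, data=payload).prepare()
print(form_request.headers["Content-Type"])
print(form_request.body)

# json= serializes the payload as JSON and sets the matching Content-Type.
json_request = requests.Request("POST", url, json=payload).prepare()
print(json_request.headers["Content-Type"])
print(json.loads(json_request.body))
```

In a live call the same choice is made with requests.post(url, data=payload) or requests.post(url, json=payload).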

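Sending a PUT Request

The basic steps for sending a PUT request with requests are:

Import the requests library.
Send the request with requests.put().
Handle the response.

Example code:

import requests

url = "https://www.example.com"
payload = {"key1": "value1", "key2": "value2"}
response = requests.put(url, data=payload, timeout=10)

# Print the response status code
print(response.status_code)

# Print the response body
print(response.text)

Sending a DELETE Request

The basic steps for sending a DELETE request with requests are:

Import the requests library.
Send the request with requests.delete().
Handle the response.

Example code:

import requests

url = "https://www.example.com"
response = requests.delete(url, timeout=10)

# Print the response status code
print(response.status_code)

# Print the response body
print(response.text)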

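Handling Cookies and Sessions

requests offers two tools for working with cookies and sessions:

The requests.cookies.RequestsCookieJar class holds cookies.
The requests.Session class keeps cookies and other settings across requests.

Example code:

import json
import requests

# Read the cookies set by a response
response = requests.get("https://www.example.com", timeout=10)
cookies = response.cookies

# Send those cookies with a later request
response_with_cookies = requests.get("https://www.example.com", cookies=cookies, timeout=10)

# A Session stores cookies and resends them automatically
session = requests.Session()
response = session.get("https://www.example.com", timeout=10)

# Print the response body
print(response.text)

# Save the cookies. RequestsCookieJar has no save()/load() methods,
# so write them to a JSON file instead (cookies.json is an arbitrary name)
with open("cookies.json", "w", encoding="utf-8") as f:
    json.dump(requests.utils.dict_from_cookiejar(session.cookies), f)

# Load the cookies back into the session
with open("cookies.json", encoding="utf-8") as f:
    session.cookies.update(json.load(f))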
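
To see what a Session actually contributes, it helps to inspect a request before it is sent. The sketch below is an illustration only: the cookie name and value, the Accept-Language header, and the /profile path are all made up. It uses Session.prepare_request() so nothing goes over the network.

```python
import requests

session = requests.Session()

# Headers and cookies set on the session apply to every request it sends.
session.headers.update({"Accept-Language": "en-US"})
session.cookies.set("token", "abc")  # hypothetical cookie name and value

# Build (but do not send) a request to inspect what the session would transmit.
request = requests.Request("GET", "https://www.example.com/profile")
prepared = session.prepare_request(request)
print(prepared.headers.get("Cookie"))
print(prepared.headers.get("Accept-Language"))
```

This is why logging in once through a Session is usually enough: later session.get() calls carry the stored cookies without any extra code.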

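Parsing HTML with BeautifulSoup

About BeautifulSoup

BeautifulSoup is a library for parsing HTML and XML and extracting data from the result. It copes well with complex HTML structures and makes it easy to pull out the data you need.

Parsing an HTML Document

The basic steps for parsing an HTML document with BeautifulSoup are:

Import BeautifulSoup.
Create a BeautifulSoup object that represents the HTML document.
Use its methods to navigate the document and extract data.

Example code: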

from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
    <title>Example Page</title>
</head>
<body>
    <h1>Heading</h1>
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Print the whole HTML document, indented
print(soup.prettify())

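Extracting the Data You Need

The basic steps for extracting data with BeautifulSoup are:

Use soup.find() or soup.find_all() to locate specific tags.
Use the .text attribute to get the text inside a tag.
Use tag['name'] to read the value of an attribute.

Example code: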

from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
    <title>Example Page</title>
</head>
<body>
    <h1 id="heading">Main Heading</h1>
    <p class="content">Paragraph 1</p>
    <p class="content">Paragraph 2</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Extract the page title
title = soup.title.text
print("Title:", title)

# Extract the text of every paragraph
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.text)

# Extract the id attribute of the h1 tag
h1_id = soup.h1['id']
print("H1 ID:", h1_id)
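
Besides find() and find_all(), BeautifulSoup also accepts CSS selectors through select() and select_one(), which can be more compact when filtering by class or id. The sketch below is an addition to the original example; the extra footer paragraph is made up to show that the class filter excludes it.

```python
from bs4 import BeautifulSoup

html_doc = """
<html>
<body>
    <h1 id="heading">Main Heading</h1>
    <p class="content">Paragraph 1</p>
    <p class="content">Paragraph 2</p>
    <p class="footer">Footer text</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# select() returns every element matching a CSS selector.
content = [p.get_text(strip=True) for p in soup.select("p.content")]
print(content)

# select_one() returns the first match, or None if nothing matches.
heading = soup.select_one("#heading")
print(heading.get_text(strip=True))
print(soup.select_one("div.missing"))
```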

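Working with Tags and Attributes

The basic steps for working with tags and attributes in BeautifulSoup are:

Use .find() or .find_all() to locate specific tags.
Use the .attrs attribute to get all of a tag's attributes as a dictionary.
Use the .get() method to read the value of one attribute.

Example code: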

from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
    <title>Example Page</title>
</head>
<body>
    <a href="https://www.example.com" target="_blank">Link</a>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Find the first a tag
link = soup.find('a')

# Read the value of the href attribute
href = link.get('href')
print("Link:", href)

# Get all attributes as a dictionary
attrs = link.attrs
print("Attributes:", attrs)
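
On real pages many href values are relative, so a crawler usually has to resolve them against the page's own URL before following them. A minimal sketch, assuming a made-up base URL and link set, using urllib.parse.urljoin from the standard library:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://www.example.com/news/"  # assumed URL of the page being parsed
html_doc = """
<a href="/about">About</a>
<a href="story.html">Story</a>
<a href="https://other.example.org/">External</a>
<a name="top">No href here</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")

links = []
# href=True matches only <a> tags that actually have an href attribute.
for a in soup.find_all("a", href=True):
    links.append(urljoin(base_url, a["href"]))

print(links)
```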

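Proxies and Anti-Crawling Measures

What Proxies Do and How to Configure Them

A proxy hides your real IP address, which lowers the risk of being banned by the target site. There are two common setups:

Set an HTTP proxy: pass the proxy through the proxies parameter.
Switch proxies dynamically: draw from a proxy pool so that the outgoing IP changes.

Example code: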

import requests

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}

response = requests.get("https://www.example.com", proxies=proxies)

print(response.text)
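
The example above uses one fixed proxy. Dynamic switching can be sketched as a simple round-robin pool; the ProxyPool class, its method name, and the proxy addresses below are all hypothetical, and a production pool would also need to drop proxies that stop responding.

```python
from itertools import cycle

# Hypothetical proxy addresses; replace them with proxies you are allowed to use.
PROXY_ADDRESSES = [
    "http://10.10.1.10:3128",
    "http://10.10.1.11:3128",
    "http://10.10.1.12:3128",
]


class ProxyPool:
    """Hands out proxies in round-robin order."""

    def __init__(self, addresses):
        if not addresses:
            raise ValueError("at least one proxy address is required")
        self._cycle = cycle(addresses)

    def next_proxies(self):
        # Return a dict in the shape expected by requests' proxies= argument.
        address = next(self._cycle)
        return {"http": address, "https": address}


pool = ProxyPool(PROXY_ADDRESSES)
first = pool.next_proxies()
second = pool.next_proxies()
print(first, second)

# Usage with requests (not executed here):
# response = requests.get("https://www.example.com", proxies=pool.next_proxies(), timeout=10)
```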

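Dealing with IP Bans

Common ways to deal with IP bans are:

Use proxies: spread requests across several proxy IPs so that no single IP gets banned.
Increase the request interval: leave a reasonable gap between requests so that request frequency does not trigger a ban.

Example code: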

import time
import requests

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}

for i in range(10):
    response = requests.get("https://www.example.com", proxies=proxies)
    print(response.text)
    time.sleep(1)  # wait 1 second between requests
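
A fixed one-second pause is the simplest policy. Two common refinements, sketched below under assumptions of my own (the function names, default delays, and retry count are not from the article), are a randomized gap between requests and an exponentially growing wait after a failure. The fetch and sleep arguments are injectable so the logic can be tried without a network or real waiting.

```python
import random
import time


def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponentially growing pauses on failure.

    `fetch` is any callable that raises an exception when the request fails,
    for example a small wrapper around requests.get plus raise_for_status().
    """
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise  # out of retries: let the caller see the error
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))


def polite_pause(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    # A randomized gap between requests looks less mechanical than a fixed one.
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

In a crawl loop, polite_pause() would replace the fixed time.sleep(1), and fetch_with_backoff() would wrap each request.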

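Dealing with Captchas

Common ways to deal with captchas are:

Use a captcha-recognition service: call a recognition service to solve the captcha automatically.
Enter the captcha manually: add logic that pauses and lets a person type the captcha in.

Example code:

from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://www.example.com/login"
driver = webdriver.Chrome()

driver.get(url)
# Selenium 4 removed find_element_by_id; use find_element(By.ID, ...) instead.
# The element IDs here are placeholders for the target page's real ones.
driver.find_element(By.ID, "username").send_keys("your_username")
driver.find_element(By.ID, "password").send_keys("your_password")

# Wait for the user to read the captcha and type it in
captcha = driver.find_element(By.ID, "captcha")
captcha_input = input("Enter the captcha: ")
captcha.send_keys(captcha_input)

# Submit the form
driver.find_element(By.ID, "submit").click()

driver.quit()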

Setting the User-Agent

Common ways to handle the User-Agent header are:

Set the User-Agent: pass it in the request headers through the headers parameter.
Imitate a real browser: copy a genuine User-Agent string from a browser's developer tools or a packet-capture tool.

Example code:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

response = requests.get("https://www.example.com", headers=headers, timeout=10)

print(response.text)
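
If a single User-Agent gets rejected, one option is to choose from a small list on each request. The following is a rough sketch: build_headers is a hypothetical helper, and the User-Agent strings are examples that should be replaced with current values copied from real browsers.

```python
import random

# Example User-Agent strings; treat them as placeholders and copy current ones
# from real browsers when you need them.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15",
]


def build_headers(rng=random):
    # Pick a User-Agent at random and pair it with common browser headers.
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    }


headers = build_headers()
print(headers["User-Agent"])

# Usage with requests (not executed here):
# response = requests.get("https://www.example.com", headers=build_headers(), timeout=10)
```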

Worked Example: Crawling a Web Page

Choosing a Target Site

Pick a suitable site to crawl. This example uses a news site, with https://news.example.com standing in as the target.

Writing the Complete Crawler

The following complete crawler fetches the news page and prints each story:

import requests
from bs4 import BeautifulSoup

def fetch_news(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    soup = BeautifulSoup(response.text, 'html.parser')

    # Assumes each story sits in <div class="news-item"> with an <h3>, <a>, and <p>
    news_list = soup.find_all('div', class_='news-item')
    for news in news_list:
        title = news.find('h3').text
        link = news.find('a')['href']
        content = news.find('p').text
        print(f"Title: {title}\nLink: {link}\nContent: {content}\n")

if __name__ == "__main__":
    url = "https://news.example.com"
    fetch_news(url)
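
The crawler above fails with an exception if any news item lacks an h3, a, or p tag, because find() returns None for a missing tag. One way to make it sturdier, sketched here under the same assumed news-item markup, is to separate parsing from fetching and skip incomplete items. parse_news is a hypothetical helper, and the sample HTML is invented for the demonstration.

```python
from bs4 import BeautifulSoup


def parse_news(html):
    """Return a list of dicts, one for each complete news item in `html`.

    Assumes the same hypothetical markup as the crawler above:
    <div class="news-item"> containing an <h3>, an <a href>, and a <p>.
    """
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for news in soup.find_all("div", class_="news-item"):
        title = news.find("h3")
        link = news.find("a", href=True)
        content = news.find("p")
        # find() returns None when a tag is missing, so skip incomplete items
        # instead of failing with AttributeError or TypeError.
        if title is None or link is None or content is None:
            continue
        items.append({
            "title": title.get_text(strip=True),
            "link": link["href"],
            "content": content.get_text(strip=True),
        })
    return items


sample_html = """
<div class="news-item"><h3>First story</h3><a href="/a">Read</a><p>Summary A</p></div>
<div class="news-item"><h3>Broken item with no link or summary</h3></div>
"""
print(parse_news(sample_html))
```

Keeping the parser free of network calls also makes it easy to test against saved copies of a page.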

Saving the Crawled Data

The next version writes the crawled data to a local file. Example code:

import requests
from bs4 import BeautifulSoup

def fetch_news(url, output_file):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    soup = BeautifulSoup(response.text, 'html.parser')

    news_list = soup.find_all('div', class_='news-item')
    # Write one block of text per story, encoded as UTF-8
    with open(output_file, 'w', encoding='utf-8') as f:
        for news in news_list:
            title = news.find('h3').text
            link = news.find('a')['href']
            content = news.find('p').text
            f.write(f"Title: {title}\nLink: {link}\nContent: {content}\n\n")

if __name__ == "__main__":
    url = "https://news.example.com"
    output_file = "news_data.txt"
    fetch_news(url, output_file)
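
Plain text is easy to write but awkward to load back for analysis. As one possible alternative, the sketch below stores the records as CSV using the standard library. The function names and the three column names (title, link, content) are assumptions chosen to mirror the fields extracted above.

```python
import csv


def save_news_csv(items, output_file):
    """Write news items (dicts with title, link, content) to a CSV file."""
    fieldnames = ["title", "link", "content"]
    # newline="" lets the csv module control line endings on every platform.
    with open(output_file, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(items)


def load_news_csv(input_file):
    # Read the rows back as a list of dicts, mainly to verify the output.
    with open(input_file, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

For example, save_news_csv([{"title": "T", "link": "/t", "content": "C"}], "news_data.csv") produces a file that spreadsheet tools can open directly.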

Problems and Solutions

Request rejected: adjust the User-Agent in the request headers to imitate different browsers.
Captchas: use a captcha-recognition service or enter the captcha manually.
IP bans: use a pool of proxy IPs and rotate through them.

Example code:

import requests
from bs4 import BeautifulSoup

def fetch_news_with_proxy(url, proxy):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    }
    # Route both HTTP and HTTPS traffic through the same proxy
    proxies = {
        "http": proxy,
        "https": proxy,
    }
    response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    soup = BeautifulSoup(response.text, 'html.parser')

    news_list = soup.find_all('div', class_='news-item')
    for news in news_list:
        title = news.find('h3').text
        link = news.find('a')['href']
        content = news.find('p').text
        print(f"Title: {title}\nLink: {link}\nContent: {content}\n")

if __name__ == "__main__":
    url = "https://news.example.com"
    proxy = "http://10.10.1.10:3128"
    fetch_news_with_proxy(url, proxy)

With the steps above you can build a simple crawler end to end: setting up the environment, sending requests, parsing HTML, and saving the data. These basic skills are the foundation for developing more complex crawlers.


Author: 呼唤远方
