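Python Regular Expressions: A Hands-On Introductory Tutorial

Tags:

Python, regular expressions

Overview

This article explains how to use regular expressions in Python. It covers the basic syntax, common operations, and more advanced techniques, then applies them in several small projects such as text parsing and data cleaning.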

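Python regular expression basics

What a regular expression is

A regular expression (regex or regexp for short) is a powerful text-processing tool for finding, replacing, and splitting strings. It offers a flexible notation for describing characters, strings, and the relationships between them. In programming, regular expressions are widely used for text processing, data cleaning, and parsing HTML and log files.

The re module in Python

Python ships with the re module, which provides functions for matching, searching, replacing, splitting, and compiling regular expressions.

Common functions:

re.search(pattern, string): scans the string for the first substring that matches the pattern and returns a match object, or None if nothing matches.
re.match(pattern, string): tries to match the pattern at the beginning of the string only and returns a match object, or None if it does not match there.
re.findall(pattern, string): finds all non-overlapping matches and returns them as a list.
re.sub(pattern, repl, string): replaces every substring that matches the pattern with repl.
re.split(pattern, string): splits the string into a list wherever the pattern matches.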

Example code:

import re

# Search the string for a substring that matches the pattern
pattern = r'hello'
string = 'hello world'
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 5), match='hello'>

# Match the pattern at the beginning of the string
pattern = r'^hello'
string = 'hello world'
result = re.match(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 5), match='hello'>

# Find all matching substrings and return them as a list
pattern = r'\d+'
string = 'one1two2three3'
result = re.findall(pattern, string)
print(result)
# Output: ['1', '2', '3']

# Replace every match with the given string
pattern = r'\d+'
string = 'one1two2three3'
result = re.sub(pattern, 'X', string)
print(result)
# Output: oneXtwoXthreeX

# Split the string into a list wherever the pattern matches
pattern = r'\s+'
string = 'one two three'
result = re.split(pattern, string)
print(result)
# Output: ['one', 'two', 'three']

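Basic regular expression syntax

Regular expressions provide a set of symbols for writing patterns that match specific characters or character sets in text. The most commonly used symbols and patterns are:

.: matches any single character except the newline \n (unless the re.DOTALL flag is set).
^: matches the start of the string.
$: matches the end of the string, or just before a newline at the end of the string.
[...]: matches any one of the characters listed in the brackets.
[^...]: matches any one character not listed in the brackets.
\d: matches a digit (equivalent to [0-9] under the re.ASCII flag; by default it also matches other Unicode digits).
\D: matches a non-digit (the complement of \d).
\w: matches a letter, digit, or underscore (equivalent to [a-zA-Z0-9_] under the re.ASCII flag; by default it also matches Unicode word characters).
\W: matches any character that is not a letter, digit, or underscore (the complement of \w).
\s: matches a whitespace character (equivalent to [ \t\n\r\f\v] under the re.ASCII flag).
\S: matches a non-whitespace character (the complement of \s).
a|b: matches a or b.
{m}: matches exactly m repetitions of the preceding element.
{m,n}: matches from m to n repetitions of the preceding element.
*: matches zero or more repetitions of the preceding element.
+: matches one or more repetitions of the preceding element.
?: matches zero or one repetition of the preceding element.
(...): captures the subexpression inside the parentheses as a group.
(?P<name>...): a named capturing group.
(?P=name): a backreference to the named group, matching the same text that group matched.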

Example code:

import re

# Match any single character
pattern = r'.'
string = 'a'
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 1), match='a'>

# Match the start of the string
pattern = r'^hello'
string = 'hello world'
result = re.match(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 5), match='hello'>

# Match the end of the string
pattern = r'world$'
string = 'hello world'
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(6, 11), match='world'>

# Match any one character listed in the brackets
pattern = r'[abc]'
string = 'a'
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 1), match='a'>

# Match any one character not listed in the brackets
pattern = r'[^abc]'
string = 'd'
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 1), match='d'>

# Match a digit
pattern = r'\d'
string = '1'
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 1), match='1'>

# Match a non-digit
pattern = r'\D'
string = 'a'
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 1), match='a'>

# Match a letter, digit, or underscore
pattern = r'\w'
string = 'a'
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 1), match='a'>

# Match a character that is not a letter, digit, or underscore
pattern = r'\W'
string = ' '
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 1), match=' '>

# Match a whitespace character
pattern = r'\s'
string = ' '
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 1), match=' '>

# Match a non-whitespace character
pattern = r'\S'
string = 'a'
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 1), match='a'>

# Match `a` or `b`
pattern = r'a|b'
string = 'a'
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 1), match='a'>

# Match exactly m repetitions of the preceding element
pattern = r'\d{3}'
string = '123'
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 3), match='123'>

# Match from m to n repetitions of the preceding element
pattern = r'\d{1,3}'
string = '123'
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 3), match='123'>

# Match zero or more repetitions of the preceding element
pattern = r'\d*'
string = '123'
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 3), match='123'>

# Match one or more repetitions of the preceding element
pattern = r'\d+'
string = '123'
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 3), match='123'>

# Match zero or one repetition of the preceding element
pattern = r'\d?'
string = '123'
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 1), match='1'>

# Capture the subexpression inside the parentheses
pattern = r'(hello) world'
string = 'hello world'
result = re.search(pattern, string)
print(result.group(1))
# Output: hello

# Named capturing group: retrieve the captured text by name
pattern = r'(?P<word>hello) world'
string = 'hello world'
result = re.search(pattern, string)
print(result.group('word'))
# Output: hello

# Backreference to a named group: (?P=word) must repeat the text captured by 'word'
pattern = r'(?P<word>hello) (?P=word)'
string = 'hello hello'
result = re.search(pattern, string)
print(result.group('word'))
# Output: hello
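
One detail worth checking when relying on the shorthand classes: for str patterns in Python 3, \d, \w, and \s are Unicode-aware by default, so the [0-9]-style equivalences only hold under the re.ASCII flag. A minimal sketch follows; the sample string is made up for illustration.

```python
import re

# Sample text mixing two full-width digits (U+FF11, U+FF12) with one ASCII digit
text = '１２3'

# Default (Unicode) mode: \d matches any Unicode decimal digit
unicode_digits = re.findall(r'\d', text)
print(unicode_digits)
# Output: ['１', '２', '3']

# re.ASCII restricts \d to [0-9]
ascii_digits = re.findall(r'\d', text, flags=re.ASCII)
print(ascii_digits)
# Output: ['3']
```

If the input is expected to contain only ASCII digits, passing re.ASCII (or writing [0-9] explicitly) avoids surprises when cleaning data.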

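Common regular expression operations

Finding matching strings

The search and match functions in the re module find matching strings. search looks through the whole string for the first matching substring, whereas match only attempts a match at the beginning of the string.

Example code:

import re

# Search the string for a substring that matches the pattern
pattern = r'hello'
string = 'hello world'
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 5), match='hello'>

# Match the pattern at the beginning of the string
pattern = r'^hello'
string = 'hello world'
result = re.match(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 5), match='hello'>

# Find all matching substrings and return them as a list
pattern = r'\d+'
string = 'one1two2three3'
result = re.findall(pattern, string)
print(result)
# Output: ['1', '2', '3']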
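
The difference between search and match only shows up when the pattern does not sit at the start of the string. The following sketch uses a made-up sample string to illustrate it, and adds re.fullmatch, which requires the whole string to match.

```python
import re

text = 'say hello world'

# search scans the whole string, so it finds 'hello' at index 4
found = re.search(r'hello', text)
print(found.span())
# Output: (4, 9)

# match only tries position 0, so it returns None here
at_start = re.match(r'hello', text)
print(at_start)
# Output: None

# fullmatch requires the entire string to match the pattern
whole = re.fullmatch(r'\w+', 'hello')
partial = re.fullmatch(r'\w+', 'hello world')
print(whole is not None, partial is not None)
# Output: True False
```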

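Replacing strings

The sub function in the re module replaces text. It finds every substring that matches the pattern and replaces it with the given string.

Example code:

import re

# Replace every match with the given string
pattern = r'\d+'
string = 'one1two2three3'
result = re.sub(pattern, 'X', string)
print(result)
# Output: oneXtwoXthreeX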
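
Beyond a fixed replacement string, re.sub also accepts a callable, a count limit, and group references in the replacement. The sketch below shows all three; the sample strings are illustrative only.

```python
import re

text = 'one1two2three3'

# A callable replacement receives each match object and returns the new text
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), text)
print(doubled)
# Output: one2two4three6

# count limits how many matches are replaced
first_only = re.sub(r'\d+', 'X', text, count=1)
print(first_only)
# Output: oneXtwo2three3

# \1 and \2 in the replacement refer to the captured groups
swapped = re.sub(r'(\w+)@(\w+)', r'\2@\1', 'user@host')
print(swapped)
# Output: host@user
```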

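Splitting strings

The split function in the re module splits a string into a list wherever the pattern matches.

Example code:

import re

# Split the string into a list wherever the pattern matches
pattern = r'\s+'
string = 'one two three'
result = re.split(pattern, string)
print(result)
# Output: ['one', 'two', 'three']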
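
re.split becomes more useful than str.split once several different separators are involved. A short sketch, with made-up input, that also shows the maxsplit argument and what happens when the separator pattern contains a capturing group:

```python
import re

text = 'one, two;three   four'

# Split on runs of commas, semicolons, or whitespace
parts = re.split(r'[,;\s]+', text)
print(parts)
# Output: ['one', 'two', 'three', 'four']

# maxsplit stops after the given number of splits
limited = re.split(r'[,;\s]+', text, maxsplit=1)
print(limited)
# Output: ['one', 'two;three   four']

# A capturing group keeps the separators in the result
with_seps = re.split(r'([,;])', 'a,b;c')
print(with_seps)
# Output: ['a', ',', 'b', ';', 'c']
```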

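Advanced regular expression usage

Capturing groups with parentheses

Parentheses in a regular expression create capturing groups. The captured text is available through the group method of the match object; for example, result.group(1) returns the first captured group.

Example code:

import re

# Capture the subexpression inside the parentheses
pattern = r'(hello) world'
string = 'hello world'
result = re.search(pattern, string)
print(result.group(1))
# Output: hello

# Named capturing group with a backreference to it
pattern = r'(?P<word>hello) (?P=word)'
string = 'hello hello'
result = re.search(pattern, string)
print(result.group('word'))
# Output: hello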
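
When a pattern has several groups, the match object can return them all at once. The following sketch uses a hypothetical date pattern to show group(0), groups(), and groupdict(), plus a non-capturing group (?:...) for grouping without capturing.

```python
import re

pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
result = re.search(pattern, 'released on 2023-01-15')

# group(0) is the whole match
print(result.group(0))
# Output: 2023-01-15

# groups() returns every captured group as a tuple
print(result.groups())
# Output: ('2023', '01', '15')

# groupdict() returns the named groups as a dictionary
print(result.groupdict())
# Output: {'year': '2023', 'month': '01', 'day': '15'}

# (?:...) groups for repetition without creating a capture
non_capturing = re.search(r'(?:ab)+(c)', 'ababc')
print(non_capturing.groups())
# Output: ('c',)
```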

Using quantifiers to match a given number of repetitions

Quantifiers specify how many times the preceding element may repeat. The common quantifiers are {m}, {m,n}, *, +, and ?.

Example code:

import re

# Match exactly m repetitions of the preceding element
pattern = r'\d{3}'
string = '123'
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 3), match='123'>

# Match from m to n repetitions of the preceding element
pattern = r'\d{1,3}'
string = '123'
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 3), match='123'>

# Match zero or more repetitions of the preceding element
pattern = r'\d*'
string = '123'
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 3), match='123'>

# Match one or more repetitions of the preceding element
pattern = r'\d+'
string = '123'
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 3), match='123'>

# Match zero or one repetition of the preceding element
pattern = r'\d?'
string = '123'
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 1), match='1'>
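
Quantifiers are greedy by default: they take as much text as they can. Appending ? to a quantifier makes it non-greedy. The sketch below, on a made-up HTML-like string, shows the difference, and also shows that * can succeed with an empty match.

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ runs to the last '>' in the string
greedy = re.findall(r'<.+>', text)
print(greedy)
# Output: ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .+? stops at the first '>' it can
lazy = re.findall(r'<.+?>', text)
print(lazy)
# Output: ['<b>', '</b>', '<i>', '</i>']

# \d* matches zero digits at position 0, so search returns an empty match
empty = re.search(r'\d*', 'abc123')
print(repr(empty.group()))
# Output: ''
```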

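Understanding precedence and escape characters

Some characters have a special meaning in a regular expression, such as ^, $, ., *, +, and ?. To match one of these characters literally, escape it with a backslash \.

Example code:

import re

# Escape a special character to match it literally
pattern = r'\.'
string = 'a.b'
result = re.search(pattern, string)
print(result)
# Output: <re.Match object; span=(1, 2), match='.'>

# Unescaped, ^ and $ act as anchors for the start and end of the string
pattern = r'^hello$'
string = 'hello'
result = re.match(pattern, string)
print(result)
# Output: <re.Match object; span=(0, 5), match='hello'>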
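
Two follow-ups to the points above, as a minimal sketch with made-up strings. First, re.escape escapes all metacharacters in a literal string, which is handy when the text to search for comes from user input. Second, on precedence: alternation (|) binds more loosely than concatenation and anchors, so grouping is needed when anchors should apply to every alternative.

```python
import re

# re.escape turns a literal string into a pattern that matches it exactly
literal = '1+1=2 (approx.)'
found = re.search(re.escape(literal), 'we know 1+1=2 (approx.) holds')
print(found.group())
# Output: 1+1=2 (approx.)

# '^cat|dog$' means '^cat' OR 'dog$', so it matches the start of 'catalog'
loose = re.search(r'^cat|dog$', 'catalog')
print(loose.group())
# Output: cat

# Grouping makes both anchors apply to each alternative
strict = re.search(r'^(?:cat|dog)$', 'catalog')
print(strict)
# Output: None
```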

Project: text parsing

Extracting IP addresses from a log file

A log file usually contains many records, one access entry per line. Each entry might follow a format such as [IP address] - [access time] - [requested URL]. The goal is to extract the IP addresses from these entries.

Example code:

import re

def extract_ip_addresses(log_file):
    # Four groups of 1-3 digits separated by dots (does not check the 0-255 range)
    ip_pattern = r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b'
    ip_addresses = []
    with open(log_file, 'r', encoding='utf-8') as file:
        for line in file:
            matches = re.findall(ip_pattern, line)
            ip_addresses.extend(matches)
    return ip_addresses

# Suppose we have a log file named 'log.txt' with the following content:
'''
192.168.1.1 - [10/Jan/2021:00:00:00 +0000] "GET /index.html HTTP/1.1" 200 2326
192.168.1.2 - [10/Jan/2021:00:00:01 +0000] "GET /about.html HTTP/1.1" 200 1234
192.168.1.3 - [10/Jan/2021:00:00:02 +0000] "GET /contact.html HTTP/1.1" 200 1234
'''

log_file = 'log.txt'
ip_addresses = extract_ip_addresses(log_file)
print(ip_addresses)
# Output: ['192.168.1.1', '192.168.1.2', '192.168.1.3']
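
The pattern above accepts any four groups of one to three digits, so a string like 999.1.1.1 would also be returned. One possible way to filter such candidates, sketched here as an assumption rather than part of the original project, is to pass each regex hit through the standard library's ipaddress module. The function name extract_valid_ips and the sample text are hypothetical.

```python
import re
import ipaddress

IP_CANDIDATE = re.compile(r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b')

def extract_valid_ips(text):
    """Return regex candidates that ipaddress accepts as IPv4 addresses."""
    valid = []
    for candidate in IP_CANDIDATE.findall(text):
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            # Raised for out-of-range octets such as 999
            continue
        valid.append(candidate)
    return valid

sample = 'ok 192.168.1.1, bad 999.1.1.1, ok 10.0.0.255'
valid_ips = extract_valid_ips(sample)
print(valid_ips)
# Output: ['192.168.1.1', '10.0.0.255']
```

Keeping the regex loose and delegating validation to ipaddress keeps the pattern readable; the alternative is a much longer pattern that encodes the 0-255 range directly.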

Extracting link addresses from an HTML document

In an HTML document, links usually appear as <a> tags in the form <a href="link address">link text</a>. The goal is to extract every link address from the document.

Example code:

import re

def extract_links(html_file):
    # Captures the href value of <a href="..."> tags that have no other attributes
    link_pattern = r'<a\s+href="([^"]+)"\s*>'
    links = []
    with open(html_file, 'r', encoding='utf-8') as file:
        for line in file:
            matches = re.findall(link_pattern, line)
            links.extend(matches)
    return links

# Suppose we have an HTML file named 'index.html' with the following content:
'''
<html>
<head>
<title>Sample page</title>
</head>
<body>
<h1>Welcome to the sample page</h1>
<p>This is sample text.</p>
<a href="https://www.example.com">Sample link</a>
<a href="https://www.example.org">Another sample link</a>
</body>
</html>
'''

html_file = 'index.html'
links = extract_links(html_file)
print(links)
# Output: ['https://www.example.com', 'https://www.example.org']
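
The pattern above only matches tags written exactly as <a href="...">. Real pages often put other attributes before href, use single quotes, or write the tag in upper case. The sketch below shows one more tolerant pattern; LINK_PATTERN and extract_links_from_text are hypothetical names, and for arbitrary HTML a real parser is generally more robust than any regex.

```python
import re

# Hypothetical, more tolerant pattern: allows other attributes before href,
# single or double quotes, and any letter case
LINK_PATTERN = re.compile(
    r'''<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1''',
    re.IGNORECASE | re.DOTALL,
)

def extract_links_from_text(html):
    # Group 1 is the quote character, group 2 is the link address
    return [m.group(2) for m in LINK_PATTERN.finditer(html)]

sample = '''
<a class="nav" href="https://www.example.com">one</a>
<A HREF='https://www.example.org' target="_blank">two</A>
'''
found_links = extract_links_from_text(sample)
print(found_links)
# Output: ['https://www.example.com', 'https://www.example.org']
```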

Project: data cleaning

Cleaning phone number formats

In practice, phone numbers appear in different formats, such as 13812345678, 138 1234 5678, and 138-1234-5678. The goal is to normalize them to a single format, such as 13812345678.

Example code:

import re

def clean_phone_numbers(phone_numbers):
    cleaned_numbers = []
    for number in phone_numbers:
        # Remove every non-digit character
        cleaned_number = re.sub(r'\D', '', number)
        cleaned_numbers.append(cleaned_number)
    return cleaned_numbers

# Suppose we have a list of phone numbers
phone_numbers = ['138 1234 5678', '138-1234-5678', '13812345678']
cleaned_numbers = clean_phone_numbers(phone_numbers)
print(cleaned_numbers)
# Output: ['13812345678', '13812345678', '13812345678']
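
Stripping non-digits normalizes the format but does not check that the result is a plausible number. A possible follow-up step is sketched below. It assumes, for illustration only, that a valid number has 11 digits, starts with 1, and has 3-9 as its second digit; MOBILE_PATTERN and clean_and_validate are hypothetical names.

```python
import re

# Assumption: 11 digits, first digit 1, second digit 3-9
MOBILE_PATTERN = re.compile(r'1[3-9]\d{9}')

def clean_and_validate(numbers):
    """Split inputs into normalized valid numbers and rejected raw inputs."""
    valid, rejected = [], []
    for raw in numbers:
        digits = re.sub(r'\D', '', raw)
        # fullmatch ensures there are no extra digits before or after
        if MOBILE_PATTERN.fullmatch(digits):
            valid.append(digits)
        else:
            rejected.append(raw)
    return valid, rejected

valid, rejected = clean_and_validate(['138-1234-5678', '12345', '138 1234 567x'])
print(valid)
# Output: ['13812345678']
print(rejected)
# Output: ['12345', '138 1234 567x']
```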

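Cleaning date formats

Dates also come in many formats, such as 2023-01-01, 01/01/2023, and Jan 1, 2023. The goal is to normalize them to a single format, such as YYYY-MM-DD. Because each input format stores the year, month, and day in different groups, the replacement is built by a function rather than a fixed template.

Example code:

import re

# One alternative per supported input format; 01/01/2023 is read as MM/DD/YYYY,
# and only the month name 'Jan' is handled in this example
DATE_PATTERN = re.compile(
    r'(?P<iso>\d{4}-\d{2}-\d{2})'
    r'|\b(?P<m>\d{2})/(?P<d>\d{2})/(?P<y>\d{4})\b'
    r'|\bJan (?P<jd>\d{1,2}), (?P<jy>\d{4})\b'
)

def to_iso(match):
    # Build YYYY-MM-DD from whichever alternative matched
    if match.group('iso'):
        return match.group('iso')
    if match.group('y'):
        return f"{match.group('y')}-{match.group('m')}-{match.group('d')}"
    return f"{match.group('jy')}-01-{int(match.group('jd')):02d}"

def clean_dates(dates):
    return [DATE_PATTERN.sub(to_iso, date) for date in dates]

# Suppose we have a list of dates
dates = ['2023-01-01', '01/01/2023', 'Jan 1, 2023']
cleaned_dates = clean_dates(dates)
print(cleaned_dates)
# Output: ['2023-01-01', '2023-01-01', '2023-01-01']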
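
The "Jan 1, 2023" style only makes sense if other months are handled too. One way to generalize it, sketched here under the assumption that months appear as three-letter English abbreviations, is to build the alternation from a lookup table. MONTHS, MONTH_DATE, and month_name_to_iso are hypothetical names.

```python
import re

# Map 'Jan'..'Dec' to 1..12
MONTHS = {name: number for number, name in enumerate(
    ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
     'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], start=1)}

# Build the month alternation from the table keys
MONTH_DATE = re.compile(
    r'\b(?P<month>' + '|'.join(MONTHS) + r') (?P<day>\d{1,2}), (?P<year>\d{4})\b'
)

def month_name_to_iso(text):
    """Rewrite 'Mon D, YYYY' dates inside text as YYYY-MM-DD."""
    def to_iso(match):
        month = MONTHS[match.group('month')]
        day = int(match.group('day'))
        return f"{match.group('year')}-{month:02d}-{day:02d}"
    return MONTH_DATE.sub(to_iso, text)

print(month_name_to_iso('Jan 1, 2023'))
# Output: 2023-01-01
print(month_name_to_iso('due Dec 25, 2024 at noon'))
# Output: due 2024-12-25 at noon
```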

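Debugging and optimizing regular expressions

Common problems and debugging techniques

When a regular expression fails to match or returns unexpected results, the following approaches help:

Use an online regex tester such as regex101: enter the pattern and a test string, then inspect the matches.
Break a complex pattern into smaller parts and test them one at a time.
Pass the re.DEBUG flag to print how the pattern is parsed and compiled.
Try a different regular expression library, or implement the same logic in plain code.

Example code:

import re

def debug_pattern(pattern, string):
    # re.DEBUG prints the parsed structure of the pattern when it is compiled
    result = re.search(pattern, string, re.DEBUG)
    print(result)

# Sample pattern and string
pattern = r'\d'
string = '1'
debug_pattern(pattern, string)
# Prints a dump of the compiled pattern (the exact format varies by Python version),
# followed by: <re.Match object; span=(0, 1), match='1'>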
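
The advice to break a complex pattern into parts can be followed directly in code. As a minimal sketch, the re.VERBOSE flag lets a pattern span several lines with comments, and each piece can be checked on its own with re.fullmatch. The pattern is the IPv4-like one from the log project; the sample strings are made up.

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and # starts a comment
IP_PATTERN = re.compile(r'''
    \b
    (?:[0-9]{1,3}\.){3}   # three groups of 1-3 digits, each followed by a dot
    [0-9]{1,3}            # final group of 1-3 digits
    \b
''', re.VERBOSE)

# Check each piece separately before trusting the whole pattern
assert re.fullmatch(r'[0-9]{1,3}', '192')
assert re.fullmatch(r'(?:[0-9]{1,3}\.){3}', '192.168.1.')

match = IP_PATTERN.search('client 192.168.1.1 connected')
print(match.group())
# Output: 192.168.1.1
```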

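Performance optimization

Regular expressions can be slow, especially on large volumes of data. Some ways to improve performance:

Precompile a pattern with re.compile when it is used repeatedly, so it is not looked up or parsed again on each call.
Make patterns as specific as possible, for example [^"]+ instead of .*; a non-greedy quantifier changes what is matched but is not automatically faster.
Use finditer instead of findall when the matches can be processed one at a time, which avoids building an unnecessary list.
Avoid overly complex patterns; split the work into several simpler steps where possible.

Example code:

import re

# Precompile the regular expression
pattern = re.compile(r'\d+')
string = 'one1two2three3'
matches = pattern.findall(string)
print(matches)
# Output: ['1', '2', '3']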
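
The finditer suggestion from the list above can be sketched as follows. finditer returns an iterator of match objects, so matches are produced lazily and processing can stop early. The sample text is made up.

```python
import re

NUMBER = re.compile(r'\d+')
text = 'one1two22three333'

# finditer yields match objects one at a time instead of building a list
for match in NUMBER.finditer(text):
    print(match.group(), match.span())
# Output:
# 1 (3, 4)
# 22 (7, 9)
# 333 (14, 17)

# Stop at the first number with at least two digits, without scanning further
first_long = next(m.group() for m in NUMBER.finditer(text) if len(m.group()) >= 2)
print(first_long)
# Output: 22
```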

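Summary

Regular expressions are a powerful text-processing tool, widely used for matching, finding, replacing, and splitting text. This article covered how to use them in Python, introduced several advanced techniques, and applied them in small text-parsing and data-cleaning projects.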
