
Removing duplicates with fuzzy matching in pandas

Python

HUH函数 2021-11-16 15:45:02

I have a DataFrame with information about people, but it contains duplicate rows whose addresses differ slightly. How can I remove duplicates based on fuzzy matching (or some other way of detecting similarity), while making sure that rows with similar addresses are only dropped when the first name and last name also match?

Sample data:

   First name  Last name  Address
0  John        Doe        ABC 9
1  John        Doe        KFT 2
2  Michael     John       ABC 9
3  Mary        Jane       PEP 9/2
4  Mary        Jane       PEP, 9-2
5  Gary        Young      verylongstreetname 1
6  Gary        Young      1 verylongstretname

(The street name in row 6 is misspelled on purpose.)

Code for the sample data:

import pandas as pd

df = pd.DataFrame([
    ['John', 'Doe', 'ABC 9'],
    ['John', 'Doe', 'KFT 2'],
    ['Michael', 'John', 'ABC 9'],
    ['Mary', 'Jane', 'PEP 9/2'],
    ['Mary', 'Jane', 'PEP, 9-2'],
    ['Gary', 'Young', 'verylongstreetname 1'],
    ['Gary', 'Young', '1 verylongstretname']
], columns=['First name', 'Last name', 'Address'])

Expected output:

   First name  Last name  Address
0  John        Doe        ABC 9
1  John        Doe        KFT 2
2  Michael     John       ABC 9
3  Mary        Jane       PEP 9/2
4  Gary        Young      verylongstreetname 1


2 answers

九州编程

Contributed 1785 experience points, received more than 4 likes

Use str.replace to remove all non-word characters, then call drop_duplicates.

temp_address = df['Address'].copy()  # save the original addresses before cleaning
df['Address'] = df['Address'].str.replace(r'\W', '', regex=True)  # strip non-word characters
df.drop_duplicates(inplace=True)

Output:

   First name  Last name  Address
0  John        Doe        ABC9
1  John        Doe        KFT2
2  Michael     John       ABC9
3  Mary        Jane       PEP92

Restore the original addresses:

# for each cleaned address, take the first saved address whose first token occurs in it
df['Address'] = df['Address'].apply(lambda x: [w for w in temp_address if w.split(' ')[0] in x][0])

Output:

   First name  Last name  Address
0  John        Doe        ABC 9
1  John        Doe        KFT 2
2  Michael     John       ABC 9
3  Mary        Jane       PEP 9/2
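The save-and-restore step can be avoided by deduplicating on a separate normalized key and leaving the Address column alone. This is only a minimal sketch of that idea, not the code from this answer; the name `address_key` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["John", "Doe", "ABC 9"],
        ["John", "Doe", "KFT 2"],
        ["Michael", "John", "ABC 9"],
        ["Mary", "Jane", "PEP 9/2"],
        ["Mary", "Jane", "PEP, 9-2"],
        ["Gary", "Young", "verylongstreetname 1"],
        ["Gary", "Young", "1 verylongstretname"],
    ],
    columns=["First name", "Last name", "Address"],
)

# Hypothetical helper key: the address with every non-word character removed.
address_key = df["Address"].str.replace(r"\W", "", regex=True)

# Flag rows whose (first name, last name, normalized address) was already seen.
is_duplicate = df.assign(address_key=address_key).duplicated(
    subset=["First name", "Last name", "address_key"]
)

# Keep the first row of each group; the Address column itself is never modified.
deduped = df[~is_duplicate]
print(deduped)
```

Note that normalization alone only catches punctuation and spacing differences. Both Gary Young rows survive here, because the typo and the reordered house number still produce different keys.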

OK, here is one approach:

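df['Address'] = df['Address'].str.replace(r'\W', ' ', regex=True)  # replace with a space this time

def check_simi(d):
    # d holds the addresses of one (first name, last name) group
    temp = []
    for w in d:
        temp.extend(w.split(' '))
    temp = [t for t in temp if t]  # drop empty strings left by repeated spaces
    flag = len(temp) / 2
    # if the unique tokens are exactly half of all tokens, treat the group as duplicates
    if len(set(temp)) == flag:
        return int(d.index[0])
    else:
        return -1  # sentinel: nothing to drop in this group

indexes = df.groupby(['First name', 'Last name'])['Address'].apply(check_simi)
indexes = [int(i) for i in indexes if i >= 0]
df.drop(indexes)  # returns a new DataFrame; df itself is unchanged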

   First name  Last name  Address
0  John        Doe        ABC 9
1  John        Doe        KFT 2
2  Michael     John       ABC 9
4  Mary        Jane       PEP 9 2
6  Gary        Young      1 verylongstreetname

PS: see https://github.com/seatgeek/fuzzywuzzy for a cleaner approach. I did not use it because my network does not allow it.
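If installing a third-party package is not an option, a rough similarity score can be built from the standard library alone. The sketch below is an assumption-laden stand-in, not fuzzywuzzy's implementation: `token_sort_similarity` is a hypothetical helper that strips punctuation, lowercases, sorts the tokens, and compares the results with `difflib.SequenceMatcher`.

```python
import re
from difflib import SequenceMatcher


def token_sort_similarity(a: str, b: str) -> float:
    """Return a 0-100 score that ignores punctuation, case, and token order."""

    def normalize(text: str) -> str:
        # Replace non-word characters with spaces, lowercase, then sort the tokens.
        tokens = re.sub(r"\W", " ", text).lower().split()
        return " ".join(sorted(tokens))

    return 100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Punctuation-only difference, typo plus reordering, and unrelated addresses.
print(token_sort_similarity("PEP 9/2", "PEP, 9-2"))
print(token_sort_similarity("verylongstreetname 1", "1 verylongstretname"))
print(token_sort_similarity("ABC 9", "KFT 2"))
```

Sorting the tokens first is what lets "verylongstreetname 1" and "1 verylongstretname" line up; the remaining difference is just the missing letter.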

2021-11-16

holdtom

Contributed 1805 experience points, received more than 10 likes

Solved.

Based on @iamklaus's answer, I wrote this code:

from fuzzywuzzy import fuzz

def remove_duplicates_inplace(df, groupby=[], similarity_field='', similar_level=85):
    def check_simi(d):
        # d holds the similarity_field values of one group
        dupl_indexes = []
        # compare every pair once; mark the later row of a similar pair for removal
        for i in range(len(d.values) - 1):
            for j in range(i + 1, len(d.values)):
                if fuzz.token_sort_ratio(d.values[i], d.values[j]) >= similar_level:
                    dupl_indexes.append(d.index[j])
        return dupl_indexes

    indexes = df.groupby(groupby)[similarity_field].apply(check_simi)
    for index_list in indexes:
        df.drop(index_list, inplace=True)

remove_duplicates_inplace(df, groupby=['firstname', 'lastname'], similarity_field='address')

Output:

   firstname  lastname  address
0  John       Doe       ABC 9
1  John       Doe       KFT 2
2  Michael    John      ABC 9
3  Mary       Jane      PEP 9/2
5  Gary       Young     verylongstreetname 1
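For anyone who would rather not mutate the DataFrame in place, here is a hedged variant of the same idea that returns a new frame and only compares each row against rows already kept in its group. `drop_similar` and `stand_in_scorer` are hypothetical names; the scorer is a standard-library approximation so the example runs without extra packages, and `fuzz.token_sort_ratio` could be passed as `scorer` instead if fuzzywuzzy is installed.

```python
import re
from difflib import SequenceMatcher

import pandas as pd


def stand_in_scorer(a, b):
    # Assumption: a stdlib approximation of a token-sort ratio (0-100).
    def norm(s):
        return " ".join(sorted(re.sub(r"\W", " ", s).lower().split()))

    return 100 * SequenceMatcher(None, norm(a), norm(b)).ratio()


def drop_similar(df, groupby, similarity_field, similar_level=85, scorer=stand_in_scorer):
    """Return a copy of df without rows similar to an earlier kept row of the same group."""
    to_drop = []
    for _, group in df.groupby(groupby, sort=False):
        kept = []  # field values of the rows kept so far in this group
        for idx, value in group[similarity_field].items():
            if any(scorer(value, k) >= similar_level for k in kept):
                to_drop.append(idx)
            else:
                kept.append(value)
    return df.drop(index=to_drop)


df = pd.DataFrame(
    [
        ["John", "Doe", "ABC 9"],
        ["John", "Doe", "KFT 2"],
        ["Michael", "John", "ABC 9"],
        ["Mary", "Jane", "PEP 9/2"],
        ["Mary", "Jane", "PEP, 9-2"],
        ["Gary", "Young", "verylongstreetname 1"],
        ["Gary", "Young", "1 verylongstretname"],
    ],
    columns=["firstname", "lastname", "address"],
)

result = drop_similar(df, ["firstname", "lastname"], "address")
print(result)
```

Comparing only against kept rows means a row is never dropped merely because it resembles another row that was itself removed, and each index is collected at most once.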

2021-11-16
