
Running fuzzywuzzy ratio on a single pandas column

Python

慕婉清6462132 2023-07-27 14:08:07

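I have a large set of full-name samples.

datafile.csv:

full_name, dob
Jerry Smith,21/01/2010
Morty Smith,18/06/2008
Rick Sanchez,27/04/1993
Jery Smith,27/12/2012
Morti Smith,13/03/2012

I am trying to use fuzz.ratio to check whether any of the names in column['fullname'] are similar to one another, but the code takes a very long time, mainly because of the nested for loops.

Sample code:

dataframe = pd.read_csv('datafile.csv')
_list = []
for row1 in dataframe['fullname']:
    for row2 in dataframe['fullname']:
        x = fuzz.ratio(row1, row2)
        if x > 90:
            _list.append([row1, row2, x])
print(_list)

Is there a better way to iterate over a single pandas column to get the ratio for potentially duplicated data?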


4 Answers

宝慕林4294392

Contributed 2021 experience points, received over 8 likes

import pandas as pd
from io import StringIO
from fuzzywuzzy import process

s = """full_name,dob
Jerry Smith,21/01/2010
Morty Smith,18/06/2008
Rick Sanchez,27/04/1993
Jery Smith,27/12/2012
Morti Smith,13/03/2012"""

df = pd.read_csv(StringIO(s))

# 1 - use fuzzywuzzy.process.extract with a list comprehension
# 2 - you still have to iterate once, but this method avoids apply, which can be very slow
# 3 - convert the list comprehension results to a dataframe
# Note that the results are limited to one match per row; adjust the code as you see fit
df2 = pd.DataFrame(
    [process.extract(df['full_name'][i], df[~df.index.isin([i])]['full_name'], limit=1)[0]
     for i in range(len(df))],
    index=df.index, columns=['match_name', 'match_percent', 'match_index'])

# join the new dataframe to the original
final = df.join(df2)

Resulting final dataframe:

      full_name         dob   match_name  match_percent  match_index
0   Jerry Smith  21/01/2010   Jery Smith             95            3
1   Morty Smith  18/06/2008  Morti Smith             91            4
2  Rick Sanchez  27/04/1993  Morti Smith             43            4
3    Jery Smith  27/12/2012  Jerry Smith             95            0
4   Morti Smith  13/03/2012  Morty Smith             91            1
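For anyone who wants to try the "best match among the other rows" idea without installing anything, here is a minimal sketch using only the standard library. It swaps in difflib.get_close_matches as a stand-in for process.extract, so the scoring is not fuzzywuzzy's and the results may differ on other data. The helper name best_other_match and the 0.6 cutoff are assumptions made for illustration.

```python
from difflib import get_close_matches

names = ["Jerry Smith", "Morty Smith", "Rick Sanchez", "Jery Smith", "Morti Smith"]


def best_other_match(position, candidates, cutoff=0.6):
    """Return the closest name to candidates[position], ignoring that row itself.

    Hypothetical helper: difflib is used here in place of fuzzywuzzy.
    """
    others = candidates[:position] + candidates[position + 1:]
    # n=1 keeps only the single best candidate; cutoff is on a 0..1 scale
    found = get_close_matches(candidates[position], others, n=1, cutoff=cutoff)
    return found[0] if found else None


best = {name: best_other_match(i, names) for i, name in enumerate(names)}
for name, match in best.items():
    print(f"{name!r} -> {match!r}")
```

Unlike limit=1 in the answer's code, the cutoff here means a name with no reasonably close neighbour gets no match at all instead of a weak one.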

2023-07-27

GCT1015

Contributed 1827 experience points, received over 4 likes

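In general, two things can help you improve performance:

1. Reduce the number of comparisons
2. Use a faster way to match the strings

In your implementation you perform many unnecessary comparisons, because you always compare A <-> B and later B <-> A. You also compare A <-> A, which always scores 100. Skipping both cuts the number of comparisons by more than 50%. And since you only want to keep matches with a score above 90, that threshold can be used to speed up each comparison (the code below passes it to rapidfuzz as score_cutoff).

Your code can implement both changes as follows, which should be much faster. When tested on my machine, your code ran for about 12 seconds, while this improved version took only 1.7 seconds.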

import pandas as pd
from io import StringIO
from rapidfuzz import fuzz

# generate a bigger list of examples to show the performance benefits
s = "fullname,dob"
s += '''
Jerry Smith,21/01/2010
Morty Smith,18/06/2008
Rick Sanchez,27/04/1993
Jery Smith,27/12/2012
Morti Smith,13/03/2012''' * 500

dataframe = pd.read_csv(StringIO(s))

# only create the data series once
full_names = dataframe['fullname']

# collect the matching pairs
_list = []
for index, row1 in full_names.items():
    # skip elements that are already compared
    for row2 in full_names.iloc[index + 1:]:
        # use a score_cutoff to improve the runtime for bad matches;
        # fuzz.ratio returns 0 when the score falls below the cutoff
        score = fuzz.ratio(row1, row2, score_cutoff=90)
        if score:
            _list.append([row1, row2, score])
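To give a feel for why a known threshold can make each comparison cheaper, here is a rough standard-library sketch. It is not how rapidfuzz implements score_cutoff; it only shows the same idea with difflib, whose quick_ratio() and real_quick_ratio() are cheap upper bounds on ratio(). The function name ratio_with_cutoff and the 0..1 scale are assumptions for illustration.

```python
from difflib import SequenceMatcher


def ratio_with_cutoff(a, b, cutoff=0.9):
    """Return the similarity in 0..1, or 0.0 if it cannot reach `cutoff`.

    Hypothetical helper that mimics the idea of a score cutoff with difflib.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # Both calls are cheap upper bounds on ratio(): if either is already
    # below the cutoff, the expensive exact computation can be skipped.
    if matcher.real_quick_ratio() < cutoff or matcher.quick_ratio() < cutoff:
        return 0.0
    score = matcher.ratio()
    return score if score >= cutoff else 0.0


print(ratio_with_cutoff("Jerry Smith", "Jery Smith"))    # close pair, non-zero
print(ratio_with_cutoff("Jerry Smith", "Rick Sanchez"))  # rejected early, 0.0
```

Most pairs in a real name list are poor matches, so rejecting them from a cheap bound is where the time is saved.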

2023-07-27

慕码人8056858

Contributed 1803 experience points, received over 6 likes

You can first build a table of the fuzzy scores:

import pandas as pd
from io import StringIO
from fuzzywuzzy import fuzz

data = StringIO("""
Jerry Smith
Morty Smith
Rick Sanchez
Jery Smith
Morti Smith
""")

df = pd.read_csv(data, names=['full_name'])

# add one column per name, holding that name's score against every row
for index, row in df.iterrows():
    df[row['full_name']] = df['full_name'].apply(lambda x: fuzz.ratio(row['full_name'], x))

print(df.to_string())

Output:

      full_name  Jerry Smith  Morty Smith  Rick Sanchez  Jery Smith  Morti Smith
0   Jerry Smith          100           73            26          95           64
1   Morty Smith           73          100            26          76           91
2  Rick Sanchez           26           26           100          27           35
3    Jery Smith           95           76            27         100           67
4   Morti Smith           64           91            35          67          100

Then find the best matches for a chosen name:

data_rows = df[df['Jerry Smith'] > 90]
print(data_rows)

Output:

     full_name  Jerry Smith  Morty Smith  Rick Sanchez  Jery Smith  Morti Smith
0  Jerry Smith          100           73            26          95           64
3   Jery Smith           95           76            27         100           67
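If you want every likely duplicate pair rather than the matches for one chosen name, you only need the part of the table above its diagonal, because the table is symmetric and the diagonal is always 100. A small sketch in plain Python, with the scores copied from the output above; the function name pairs_above is made up for illustration.

```python
names = ["Jerry Smith", "Morty Smith", "Rick Sanchez", "Jery Smith", "Morti Smith"]

# score table copied from the printed output above (row i, column j)
scores = [
    [100, 73, 26, 95, 64],
    [73, 100, 26, 76, 91],
    [26, 26, 100, 27, 35],
    [95, 76, 27, 100, 67],
    [64, 91, 35, 67, 100],
]


def pairs_above(names, scores, threshold=90):
    """Collect (name_a, name_b, score) for cells above the diagonal only."""
    pairs = []
    for i in range(len(names)):
        # j starts at i + 1, which skips self-matches and mirrored cells
        for j in range(i + 1, len(names)):
            if scores[i][j] > threshold:
                pairs.append((names[i], names[j], scores[i][j]))
    return pairs


duplicates = pairs_above(names, scores)
print(duplicates)
```

Each pair is reported once, so Jerry/Jery does not show up a second time as Jery/Jerry.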

2023-07-27

千万里不及你

Contributed 1784 experience points, received over 9 likes

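This way of comparing does double the work, because running fuzz.ratio between "Jerry Smith" and "Morti Smith" is the same as running it between "Morti Smith" and "Jerry Smith".

If you iterate over the sub-array of remaining rows instead, you will get this done faster.

import pandas as pd
from fuzzywuzzy import fuzz, process

dataframe = pd.read_csv('datafile.csv')
_list = []
for i_dataframe in range(len(dataframe) - 1):
    comparison_fullname = dataframe['fullname'].iloc[i_dataframe]
    # compare only against later rows; limit=None keeps every candidate (default is 5)
    # with a Series as choices, each result is a (name, score, index) tuple
    remaining = dataframe['fullname'].iloc[i_dataframe + 1:]
    for entry_fullname, entry_score, _ in process.extract(
            comparison_fullname, remaining, scorer=fuzz.ratio, limit=None):
        if entry_score >= 90:
            _list.append((comparison_fullname, entry_fullname, entry_score))
print(_list)

This prevents any duplicated work.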
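To see how much work this saves, here is a small sketch that counts the comparisons. It uses itertools.combinations to visit each unordered pair exactly once, and a difflib-based scorer as a stand-in for fuzz.ratio so that it runs with the standard library alone; the similarity function is an assumption for illustration and its scores may not match fuzzywuzzy's on other data.

```python
from difflib import SequenceMatcher
from itertools import combinations

names = ["Jerry Smith", "Morty Smith", "Rick Sanchez", "Jery Smith", "Morti Smith"]


def similarity(a, b):
    """Stand-in scorer on a 0..100 scale (difflib, not fuzzywuzzy)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


matches = []
comparisons = 0
# combinations(names, 2) yields each unordered pair once and never pairs
# a name with itself, unlike the nested loop over the full column
for first, second in combinations(names, 2):
    comparisons += 1
    score = similarity(first, second)
    if score >= 90:
        matches.append((first, second, score))

print(comparisons, "comparisons instead of", len(names) ** 2)
print(matches)
```

For n names this is n * (n - 1) / 2 comparisons instead of n * n, so 10 instead of 25 for the five sample rows.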

2023-07-27
