
How can I make text processing in a pandas DataFrame column faster for large text data?

Python

FFIVE 2023-10-26 16:37:04

I have a large text file (chat.txt) of more than 1 GB in the following format:

john|12-02-1999|hello#,there#,how#,are#,you#,tom$ 
tom|12-02-1999|hey#,john$,hows#, it#, goin#
mary|12-03-1999|hello#,boys#,fancy#,meetin#,ya'll#,here#
......
john|12-02-2000|well#,its#,been#,nice#,catching#,up#,with#,you#,and#, mary$
mary|12-03-2000|catch#,you#,on#,the#,flipside#,tom$,and#,john$

I want to process this text and, for each user separately, count how often certain keywords occur (say 500 words: hello, nice, like, ..., dinner, no). The process also involves removing the trailing special character from each word.

The output should look like this:

user   hello   nice   like    .....   dinner   No
Tom    10000   500    300     .....   6000     0
John   6000    1200   200     .....   3000     5
Mary   23      9000   10000   .....   100      9000

This is my current Python solution:

import pandas as pd
from collections import Counter

chat_data = pd.read_csv("chat.txt", sep="|", names=["user", "date", "words"])
user_lst = chat_data.user.unique()

# Step 1: join all words of each user into one comma-separated string
user_grouped_data = pd.DataFrame(columns=["user", "words"])
user_grouped_data['user'] = user_lst
for i, row in user_grouped_data.iterrows():
    id = row["user"]
    temp = chat_data[chat_data["user"] == id]
    user_grouped_data.loc[i, "words"] = ",".join(temp["words"].tolist())

# Step 2: strip the trailing character from each word and count per user
result = pd.DataFrame(columns=["user", "hello", "nice", "like", "...500 other keywords...", "dinner", "no"])
result["user"] = user_lst
for i, row in result.iterrows():
    id = row["user"]
    temp = user_grouped_data[user_grouped_data["user"] == id]
    words = temp.values.tolist()[0][1]
    word_lst = words.split(",")
    word_lst = [item[0:-1] for item in word_lst]
    t_dict = Counter(word_lst)
    keys = t_dict.keys()
    for word in keys:
        result.at[i, word] = t_dict.get(word)

result.to_csv("user_word_counts.csv")

This works fine for small data, but once chat_data exceeds 1 GB the solution becomes very slow and unusable.

Can any of the following parts be improved so that the data is processed faster?

1. Grouping the text data by user
2. Cleaning the text in each row by removing the trailing special characters
3. Counting the words and assigning the counts to the right columns


3 Answers

千万里不及你

Contributed 1784 experience points, received over 9 likes

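You can split() the comma-separated column into lists, explode() the DataFrame on that list column, groupby() the user name and the exploded word values, then unstack() or pivot_table() the result into the desired format. Finish by cleaning up the multi-index columns with droplevel(), reset_index(), etc.

Everything below is a vectorized pandas method, so hopefully it is fast. Note: the three columns are named [0, 1, 2] in the code below because I read the data from the clipboard and passed header=None.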

Input:

import pandas as pd

df = pd.DataFrame({
    0: {0: 'john', 1: 'tom', 2: 'mary', 3: 'john', 4: 'mary'},
    1: {0: '12-02-1999', 1: '12-02-1999', 2: '12-03-1999',
        3: '12-02-2000', 4: '12-03-2000'},
    2: {0: 'hello#,there#,how#,are#,you#,tom$ ',
        1: 'hey#,john$,hows#, it#, goin#',
        2: "hello#,boys#,fancy#,meetin#,ya'll#,here#",
        3: 'well#,its#,been#,nice#,catching#,up#,with#,you#,and#, mary$',
        4: 'catch#,you#,on#,the#,flipside#,tom$,and#,john$'}})

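Code:

# Strip the '#' and '$' markers, then split each string into a list of words.
# Raw strings avoid invalid-escape warnings for '\#' and '\$'.
df[2] = df[2].replace([r'\#', r'\$'], '', regex=True).str.split(',')
df = (df.explode(2)                  # one row per (user, word)
        .groupby([0, 2])[2].count()  # occurrences of each word per user
        .rename('Count')
        .reset_index()
        .set_index([0, 2])
        .unstack(1)                  # words become columns
        .fillna(0))
df.columns = df.columns.droplevel()  # drop the 'Count' column level
df = df.reset_index()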

Out[1]:
2     0   goin   it   mary  and  are  been  boys  catch  catching  ...   on  \
0  john    0.0  0.0    1.0  1.0  1.0   1.0   0.0    0.0       1.0  ...  0.0
1  mary    0.0  0.0    0.0  1.0  0.0   0.0   1.0    1.0       0.0  ...  1.0
2   tom    1.0  1.0    0.0  0.0  0.0   0.0   0.0    0.0       0.0  ...  0.0

2  the  there  tom  tom   up  well  with  ya'll  you
0  0.0    1.0  0.0   1.0  1.0   1.0   1.0    0.0  2.0
1  1.0    0.0  1.0   0.0  0.0   0.0   0.0    1.0  1.0
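In the output above, goin, it and mary sort before and, and tom appears twice, because some tokens in the sample data carry a leading or trailing space. A minimal sketch, assuming you want those variants merged, strips whitespace together with the markers before splitting. The small frame and the column names user, words and word are illustrative, not taken from the answer.

```python
import pandas as pd

# Illustrative sample with the same stray spaces as the question's data.
df = pd.DataFrame({
    "user": ["john", "tom", "john"],
    "words": ["hello#,tom$ ", "hey#, it#, goin#", "well#, mary$,tom$"],
})

# Remove '#', '$' and any whitespace in one vectorized pass, then split.
tokens = df["words"].str.replace(r"[#$\s]", "", regex=True).str.split(",")

counts = (
    df.assign(word=tokens)
      .explode("word")              # one row per (user, word)
      .groupby(["user", "word"])
      .size()                       # occurrences per user and word
      .unstack(fill_value=0)        # words become columns, missing pairs are 0
)
print(counts)
```

Passing fill_value=0 to unstack() keeps the counts as integers, so no fillna() or astype(int) step is needed afterwards.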

You can also use .pivot_table() instead of .unstack(), which saves the line df.columns = df.columns.droplevel():

# Raw strings avoid invalid-escape warnings for '\#' and '\$'.
df[2] = df[2].replace([r'\#', r'\$'], '', regex=True).str.split(',')
df = (df.explode(2)
        .groupby([0, 2])[2].count()
        .rename('Count')
        .reset_index()
        .pivot_table(index=0, columns=2, values='Count')
        .fillna(0)
        .astype(int)
        .reset_index())

Out[45]:
2     0   goin   it   mary  and  are  been  boys  catch  catching  ...  on  \
0  john      0    0      1    1    1     1     0      0         1  ...   0
1  mary      0    0      0    1    0     0     1      1         0  ...   1
2   tom      1    1      0    0    0     0     0      0         0  ...   0

2  the  there  tom  tom   up  well  with  ya'll  you
0    0      1    0     1   1     1     1      0    2
1    1      0    1     0   0     0     0      1    1
2    0      0    0     0   0     0     0      0    0

[3 rows x 31 columns]
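The question only needs counts for a fixed list of keywords, while the pipeline above creates a column for every distinct word. One possible refinement, sketched here under the assumption that the keyword list is known up front, filters the exploded rows with isin() before grouping and then reindexes so that every keyword and every user appears even with a count of zero. The three-item keywords list and the column names are hypothetical stand-ins.

```python
import pandas as pd

keywords = ["hello", "nice", "dinner"]  # hypothetical stand-in for the ~500 keywords

df = pd.DataFrame({
    "user": ["john", "tom", "mary", "john"],
    "words": ["hello#,there#,tom$", "hey#,john$", "hello#,boys#", "its#,been#,nice#"],
})

long = (
    df.assign(word=df["words"].str.replace(r"[#$\s]", "", regex=True).str.split(","))
      .explode("word")
)
# Drop non-keywords early so the groupby works on far fewer rows.
long = long[long["word"].isin(keywords)]

all_users = pd.Index(df["user"].unique(), name="user")
result = (
    long.groupby(["user", "word"]).size()
        .unstack(fill_value=0)
        .reindex(columns=keywords, fill_value=0)  # keyword order; absent keywords become 0
        .reindex(all_users, fill_value=0)         # keep users without any keyword hit
        .reset_index()
)
print(result)
```

Filtering before the groupby matters on large inputs because the number of groups is bounded by users times keywords rather than users times all distinct words.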

2023-10-26

九州编程

Contributed 1785 experience points, received over 4 likes

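If you can use scikit-learn, this is easy with CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

# Strip '#', '$' and whitespace; regex=True must be explicit on pandas >= 2.0.
s = df['words'].str.replace(r"#|\$|\s+", "", regex=True)
# token_pattern=None avoids the unused-parameter warning when a tokenizer is given.
model = CountVectorizer(tokenizer=lambda x: x.split(','), token_pattern=None)
df_final = (pd.DataFrame(model.fit_transform(s).toarray(),
                         columns=model.get_feature_names_out(),  # get_feature_names() before scikit-learn 1.0
                         index=df.user)
              .groupby(level=0, sort=False).sum())  # replaces .sum(level=0), removed in pandas 2.0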

Out[279]:
      and  are  been  boys  catch  catching  fancy  flipside  goin  hello  \
user
john    1    1     1     0      0         1      0         0     0      1
tom     0    0     0     0      0         0      0         0     1      0
mary    1    0     0     1      1         0      1         1     0      1

      here  hey  how  hows  it  its  john  mary  meetin  nice  on  the  there  \
user
john     0    0    1     0   0    1     0     1       0     1   0    0      1
tom      0    1    0     1   1    0     1     0       0     0   0    0      0
mary     1    0    0     0   0    0     1     0       1     0   1    1      0

      tom  up  well  with  ya'll  you
user
john    1   1     1     1      0    2
tom     0   0     0     0      0    0
mary    1   0     0     0      1    1
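Calling toarray() on every chat line may use a lot of memory on a 1 GB input. A hedged sketch of one way around that: pass the keyword list through CountVectorizer's vocabulary parameter so only those terms are counted, and sum the lines of each user with a sparse indicator matrix before converting anything to a dense array. The keywords list, the sample frame, and the names indicator, per_user and df_keywords are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer

keywords = ["hello", "nice", "you"]  # hypothetical stand-in for the real keyword list

df = pd.DataFrame({
    "user": ["john", "tom", "john"],
    "words": ["hello#,there#,you#", "hey#,john$", "nice#,you#,you#"],
})

s = df["words"].str.replace(r"[#$\s]", "", regex=True)
model = CountVectorizer(
    tokenizer=lambda x: x.split(","),
    token_pattern=None,   # unused when a tokenizer is given
    vocabulary=keywords,  # count only these terms, in this column order
)
X = model.transform(s)    # sparse matrix: one row per chat line

# Sparse (users x lines) indicator; multiplying by X sums the lines of each user.
codes, users = pd.factorize(df["user"])
indicator = csr_matrix(
    (np.ones(len(codes), dtype=np.int64), (codes, np.arange(len(codes)))),
    shape=(len(users), len(codes)),
)
per_user = indicator @ X

# Only the small (users x keywords) result is made dense.
df_keywords = pd.DataFrame(per_user.toarray(), index=users, columns=keywords)
print(df_keywords)
```

With a fixed vocabulary no fitting pass is needed, so transform() can be called directly.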

2023-10-26

江户川乱折腾

Contributed 1851 experience points, received over 5 likes

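I'm not sure how fast this approach is on a large DataFrame, but you can give it a try. First, remove the special characters and split each string into a list of words, stored in a new column:

from itertools import chain
from collections import Counter
# regex=True must be explicit on pandas >= 2.0, where the default is a literal match
df['lists'] = df['words'].str.replace(r"#|\$", "", regex=True).str.split(",")

Now group by user, chain each user's lists into a single iterable, and count the occurrences with Counter: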

df.groupby('user')['lists'].apply(chain.from_iterable)\
                           .apply(Counter)\
                           .apply(pd.Series)\
                           .fillna(0).astype(int)
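If the file is too large to load into a DataFrame comfortably, the same Counter idea can be applied while streaming the file line by line with only the standard library. This is a minimal sketch, assuming one user|date|words record per line and a known keyword list. The function name count_keywords, the keywords list and the in-memory sample standing in for chat.txt are all hypothetical.

```python
import csv
import io
from collections import Counter, defaultdict

keywords = ["hello", "nice", "you"]  # hypothetical stand-in for the real keyword list


def count_keywords(lines, keywords):
    """Return {user: Counter} of keyword occurrences from 'user|date|words' lines."""
    wanted = set(keywords)
    counts = defaultdict(Counter)
    for line in lines:
        parts = line.rstrip("\n").split("|", 2)
        if len(parts) != 3:
            continue  # skip malformed lines
        user, _date, words = parts
        # Strip surrounding whitespace, then the trailing '#' or '$' marker.
        tokens = (w.strip().rstrip("#$") for w in words.split(","))
        counts[user].update(t for t in tokens if t in wanted)
    return counts


# In-memory stand-in for open("chat.txt"); any iterable of lines works.
sample = io.StringIO(
    "john|12-02-1999|hello#,there#,how#,are#,you#,tom$ \n"
    "tom|12-02-1999|hey#,john$,hows#, it#, goin#\n"
    "john|12-02-2000|well#,its#,been#,nice#,catching#,up#,with#,you#,and#, mary$\n"
)
counts = count_keywords(sample, keywords)

# Write one row per user with the keywords as columns.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["user", *keywords])
for user, c in counts.items():
    writer.writerow([user, *(c[k] for k in keywords)])
print(out.getvalue())
```

Memory use then depends on the number of users and keywords rather than on the file size, at the cost of a Python-level loop over the lines.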

2023-10-26
