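I am trying to group the 10 most frequently used words by category. I have already seen this answer, but I can't quite adapt it to get the output I want.

    category | sentence
    A        | cat runs over big dog
    A        | dog runs over big cat
    B        | random sentences include words
    C        | including this one

Desired output:

    category | word/frequency
    A        | runs: 2
             | cat: 2
             | dog: 2
             | over: 2
             | big: 2
    B        | random: 1
    C        | including: 1

Since my DataFrame is very large, I only want the top 10 most frequently occurring words. I have also looked at this answer:

    df.groupby('subreddit').agg(lambda x: nltk.FreqDist([w for wordlist in x for w in wordlist]))

but this method also returns counts of individual letters.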
3 Answers
元芳怎么了
Contributed 1798 experience points · received more than 7 likes
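import pandas as pd

# Split each sentence into one column per word, keeping df's index for alignment
df1 = pd.DataFrame(df.sentence.str.split(' ').tolist(), index=df.index)
# Add the category column, which the split did not carry over
df1['category'] = df['category']
# Melt the word columns into a single 'value' column (one row per word)
df1 = pd.melt(df1, id_vars='category', value_vars=df1.columns[:-1].tolist())
# Group by category and word, then count (padding values from shorter sentences are dropped)
df1 = df1.groupby(['category', 'value']).size().reset_index(name='count')
# Keep the 10 most frequent words within each category
# (nlargest(10, ...) alone would keep only the 10 largest counts overall)
df1 = df1.sort_values('count', ascending=False).groupby('category').head(10)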
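
For comparison, here is a minimal sketch of the same idea using `Series.str.split` and `DataFrame.explode`. It assumes pandas 0.25 or later and uses the column names `category` and `sentence` from the question. The helper `top_words` and the sample frame are only illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "category": ["A", "A", "B", "C"],
    "sentence": [
        "cat runs over big dog",
        "dog runs over big cat",
        "random sentences include words",
        "including this one",
    ],
})


def top_words(frame, n=10):
    """Hypothetical helper: the n most frequent words within each category."""
    # One row per word: split on whitespace, then explode the resulting lists
    words = (
        frame.assign(word=frame["sentence"].str.split())
             .explode("word")
             .dropna(subset=["word"])
    )
    # Count occurrences of every (category, word) pair
    counts = (
        words.groupby(["category", "word"])
             .size()
             .reset_index(name="count")
    )
    # Sort by count within each category and keep the first n rows per category
    return (
        counts.sort_values(["category", "count"], ascending=[True, False])
              .groupby("category")
              .head(n)
              .reset_index(drop=True)
    )


result = top_words(df)
print(result)
```

Because the sentences are split into whole words before counting, this counts words rather than letters. Ties within a category are kept in whatever order the sort leaves them, so if a specific tie-breaking rule matters, add the `word` column to the sort keys.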