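Although you have already created df2, it is easier to work with the stacked Series before calling value_counts. That lets you filter the words directly and then use str.join to put back only the ones you want to keep.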
# One word per row, indexed by (original row label, word position)
s = df['column1'].str.split(expand=True).stack()
# Keep only words with frequency above specified threshold
cutoff = 5
s = s[s.groupby(s).transform('size') >= cutoff]
# Alignment based on original Index
df['column1'] = s.groupby(level=0).agg(' '.join)
                                            column1  column2
0  better better rights rights rights rights rights     2015
1                                     better rights     2016
2                                            better     2015
3                                            better     2014
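One thing the snippet above does not cover: a row whose words all fall below the cutoff disappears from the grouped result, so the assignment leaves NaN in that row. Below is a minimal sketch of the same stack / transform('size') idea wrapped in a function, assuming you would rather get an empty string there. The name keep_frequent_words is made up for illustration, and the explicit dropna() is only a precaution, since newer pandas versions may keep the padding values that stack() used to drop.

```python
import pandas as pd


def keep_frequent_words(col: pd.Series, cutoff: int) -> pd.Series:
    """Hypothetical helper: keep only words whose total count in `col` is >= cutoff."""
    # One word per row, indexed by (original row label, word position).
    # dropna() removes the padding that split(expand=True) adds to shorter rows.
    words = col.str.split(expand=True).stack().dropna()
    # Broadcast each word's overall frequency back onto every occurrence
    freq = words.groupby(words).transform('size')
    kept = words[freq >= cutoff]
    # Re-join the surviving words per original row
    joined = kept.groupby(level=0).agg(' '.join)
    # Rows that lost every word are missing from `joined`; restore them as ''
    return joined.reindex(col.index).fillna('')


df = pd.DataFrame({'column1': ['better spotted better rights rights rights fresh fresh rights rights',
                               'better rights reserved', 'better', 'better horse'],
                   'column2': [2015, 2016, 2015, 2014]})

result = keep_frequent_words(df['column1'], cutoff=5)
print(result)
```

Because the result is reindexed to the original index, it can be assigned straight back with df['column1'] = result.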
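If you do want to use the value_counts Series, you can subset it and build ListKeywords just by specifying the cutoff. However, we already had to split the 'column1' Series to get the counts, so splitting it a second time here is rather inefficient.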
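# Word counts over the whole column
df2 = df['column1'].str.split(expand=True).stack().value_counts()
cutoff = 5
# Words that occur at least `cutoff` times
ListKeywords = df2[df2 >= cutoff].index
# Index(['rights', 'better'], dtype='object')
# Split every row again and keep only the frequent words
df['column1'].apply(lambda x: ' '.join([i for i in x.split(' ') if i in ListKeywords]))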
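For completeness, here is a self-contained sketch of this second approach that stores the result instead of only displaying it. Two things in it are my own assumptions rather than part of the answer: the keywords are converted to a plain set for membership tests, and the output goes into a new column named filtered so the original text stays available.

```python
import pandas as pd

df = pd.DataFrame({'column1': ['better spotted better rights rights rights fresh fresh rights rights',
                               'better rights reserved', 'better', 'better horse'],
                   'column2': [2015, 2016, 2015, 2014]})

cutoff = 5

# Word counts over the whole column (value_counts ignores missing values)
counts = df['column1'].str.split(expand=True).stack().value_counts()

# Assumption: a plain set is used for the membership test below
keywords = set(counts[counts >= cutoff].index)

# 'filtered' is a hypothetical column name chosen for this sketch
df['filtered'] = df['column1'].apply(
    lambda text: ' '.join(word for word in text.split() if word in keywords)
)
print(df['filtered'])
```

Unlike the groupby version, this one naturally yields an empty string for a row with no frequent words, at the cost of splitting every row a second time.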
Starting data
import pandas as pd

df = pd.DataFrame({'column1': ['better spotted better rights rights rights fresh fresh rights rights',
                               'better rights reserved', 'better', 'better horse'],
                   'column2': [2015, 2016, 2015, 2014]})