1 Answer
Here is one way to achieve what you are after.
Define a custom function:
from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_bigram(group):
    # groupby().apply() passes each group as a DataFrame, so concatenating the
    # four text columns yields one combined document per row
    corpus = group['txt_main'] + group['txt_pro'] + group['txt_con'] + group['txt_adviceMgmt']
    n = 2  # the top n bigrams to return
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)  # total count of each bigram across the group
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]
Then call groupby with apply, passing the defined function:
import pandas as pd

df['date'] = pd.to_datetime(df['date'])
df['quarter'] = df['date'].dt.to_period('Q')  # calendar quarter, e.g. 2011Q3
newdf = df.groupby(['stock_symbol', 'quarter']).apply(get_top_n_bigram).to_frame(name='frequencies')
print(newdf)
                                                           frequencies
stock_symbol quarter
AMG          2011Q3               [(smart driven, 2), (driven risk, 2)]
             2013Q1          [(asset management, 2), (smart working, 1)]
             2014Q1            [(audit firm, 3), (employment agency, 2)]
MMM          2017Q2                       [(working 3m, 1), (3m time, 1)]
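For reference, here is a minimal sketch of how the pieces fit together end to end. The sample data below is made up purely for illustration (the column names come from the question, the text values are assumptions), and produces a result of the same shape as above:

import pandas as pd

# Hypothetical sample data for illustration only; column names follow the question
df = pd.DataFrame({
    'stock_symbol': ['AMG', 'AMG', 'MMM'],
    'date': ['2011-08-01', '2011-09-15', '2017-05-02'],
    'txt_main': ['smart driven team ', 'smart driven culture ', 'working 3m shifts '],
    'txt_pro': ['driven risk takers ', 'driven risk friendly ', '3m time off '],
    'txt_con': ['long hours ', 'long hours ', 'slow reviews '],
    'txt_adviceMgmt': ['keep hiring smart people', 'reward risk taking', 'value 3m time'],
})

df['date'] = pd.to_datetime(df['date'])
df['quarter'] = df['date'].dt.to_period('Q')
newdf = df.groupby(['stock_symbol', 'quarter']).apply(get_top_n_bigram).to_frame(name='frequencies')
print(newdf)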