
Computing a TF-IDF matrix without stop words in Python


繁华开满天机 2023-08-15 16:32:47
I am trying to compute a TF-IDF matrix without stop words. This is my code:

def removeStopWords(documents):
    stop_words = set(stopwords.words('italian'))
    english_stop_words = set(stopwords.words('english'))
    stop_words.update(list(set(english_stop_words)))
    for d in documents:
        document = d['document']
        word_tokens = word_tokenize(document)
        filtered_sentence = ''
        for w in word_tokens:
            if not inStopwords(w, stop_words):
                filtered_sentence = w + ' ' + filtered_sentence
        d['document'] = filtered_sentence[:-1]
    return calculateTFIDF(documents)

def calculateTFIDF(corpus):
    tfidf = TfidfVectorizer()
    x = tfidf.fit_transform(corpus)
    df_tfidf = pd.DataFrame(x.toarray(), columns=tfidf.get_feature_names())
    return {c: s[s > 0] for c, s in zip(df_tfidf, df_tfidf.T.values)}

But when I return the matrix (in the form {word: value}), it still contains some stop words, for example "when" or "il". How can I fix this? Thanks.

1 Answer

一只萌萌小番薯


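There is a better way to remove stop words from a TF-IDF computation. TfidfVectorizer has a stop_words parameter, to which you can pass the collection of words you want to exclude.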
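As for why filtering by hand can still let stop words through: one possible cause is a case-sensitive comparison. The question does not show inStopwords, so this is only a guess. TfidfVectorizer lowercases text by default, so a capitalized "When" or "Il" that survives the manual filter would show up as "when" or "il" in the matrix. The sketch below uses a hypothetical hand-written stop-word set and plain whitespace splitting instead of NLTK, purely to illustrate the effect.

```python
# Hypothetical, hand-written stop-word set; the question uses NLTK's lists.
STOP_WORDS = {"when", "il", "the", "was"}


def filter_case_sensitive(text):
    """Drop tokens only when they match a stop word exactly (case matters)."""
    return [w for w in text.split() if w not in STOP_WORDS]


def filter_case_insensitive(text):
    """Lowercase each token before comparing it with the stop-word set."""
    return [w for w in text.split() if w.lower() not in STOP_WORDS]


sentence = "When the shop was closed Il barbiere went home"

leaky = filter_case_sensitive(sentence)
clean = filter_case_insensitive(sentence)

print(leaky)  # "When" and "Il" are still present
print(clean)  # the capitalized stop words are gone as well
```

Passing the stop words to the vectorizer itself avoids this pitfall, because the vectorizer compares them against tokens after its own lowercasing and tokenization.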


from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

documents = ['I went to the barbershop when my hair was long.', 'The barbershop was closed.']

# create set of stopwords to remove (Italian and English combined)
stop_words = set(stopwords.words('italian'))
english_stop_words = set(stopwords.words('english'))
stop_words.update(english_stop_words)

# check whether a word is already in the stop words
print('when' in stop_words)  # True
print('il' in stop_words)  # True

# if it is not, add the word to the set
print('went' in stop_words)  # False
stop_words.add('went')

# create tf-idf from the original, unfiltered documents
# (recent scikit-learn versions expect a list here rather than a set)
tfidf = TfidfVectorizer(stop_words=sorted(stop_words))
x = tfidf.fit_transform(documents)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df_tfidf = pd.DataFrame(x.toarray(), columns=tfidf.get_feature_names_out())

print({c: s[s > 0] for c, s in zip(df_tfidf, df_tfidf.T.values)})
# {'barbershop': array([0.44943642, 0.57973867]), 'closed': array([0.81480247]), 'hair': array([0.6316672]), 'long': array([0.6316672])}
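To see where those numbers come from, here is a minimal sketch that recomputes them without scikit-learn. It assumes TfidfVectorizer's default settings (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row). The helper name tfidf_rows and the pre-tokenized docs list are made up for illustration; the tokens are what remains of the two example sentences once the stop words are removed.

```python
import math

# Tokens left after stop-word removal in the two example documents.
docs = [["barbershop", "hair", "long"], ["barbershop", "closed"]]


def tfidf_rows(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    # Document frequency: in how many documents each term appears.
    df = {t: sum(t in d for d in docs) for t in vocab}
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for d in docs:
        # Raw term count times IDF, then scale the row to unit L2 length.
        weights = {t: d.count(t) * idf[t] for t in set(d)}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        rows.append({t: v / norm for t, v in weights.items()})
    return rows


rows = tfidf_rows(docs)
for row in rows:
    print({t: round(v, 4) for t, v in sorted(row.items())})
```

"barbershop" appears in both documents, so its IDF is the lowest and it gets the smallest weight in each row, while "closed", "hair" and "long" each appear in only one document and are weighted higher.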



Answered 2023-08-15