I'm following this tutorial to do some sentiment analysis, and I'm fairly sure my code so far is identical to it. However, my BOW values differ significantly from the tutorial's.

https://www.tensorscience.com/nlp/sentiment-analysis-tutorial-in-python-classifying-reviews-on-movies-and-products

Here is my code so far.

    import nltk
    import pandas as pd
    import string
    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    def openFile(path):
        #param path: path/to/file.ext (str)
        #Returns contents of file (str)
        with open(path) as file:
            data = file.read()
        return data

    imdb_data = openFile('C:/Users/Flengo/Desktop/sentiment/data/imdb_labelled.txt')
    amzn_data = openFile('C:/Users/Flengo/Desktop/sentiment/data/amazon_cells_labelled.txt')
    yelp_data = openFile('C:/Users/Flengo/Desktop/sentiment/data/yelp_labelled.txt')

    datasets = [imdb_data, amzn_data, yelp_data]

    combined_dataset = []
    # separate samples from each other
    for dataset in datasets:
        combined_dataset.extend(dataset.split('\n'))
    # separate each label from each sample
    dataset = [sample.split('\t') for sample in combined_dataset]

    df = pd.DataFrame(data=dataset, columns=['Reviews', 'Labels'])
    df = df[df["Labels"].notnull()]
    df = df.sample(frac=1)
    labels = df['Labels']

    vectorizer = TfidfVectorizer(min_df=15)
    bow = vectorizer.fit_transform(df['Reviews'])
    len(vectorizer.get_feature_names())

    selected_features = SelectKBest(chi2, k=200).fit(bow, labels).get_support(indices=True)
    vectorizer = TfidfVectorizer(min_df=15, vocabulary=selected_features)
    bow = vectorizer.fit_transform(df['Reviews'])
    bow

This is my result. This is the tutorial's result. I've been trying to work out what is going wrong, but haven't made any progress.
1 Answer
LEATH
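The problem is that you are passing the selected column indices as the vocabulary. `get_support(indices=True)` returns integer positions, whereas `vocabulary` expects the terms themselves, so look the terms up first.
Try this: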
    import numpy as np  # needed for the index lookup below

    selected_features = SelectKBest(chi2, k=200).fit(bow, labels).get_support(indices=True)
    # map the selected column indices back to their terms
    # (get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
    vocabulary = np.array(vectorizer.get_feature_names())[selected_features]
    vectorizer = TfidfVectorizer(min_df=15, vocabulary=vocabulary) # you need to supply a real vocab here
    bow = vectorizer.fit_transform(df['Reviews'])
    bow

    <3000x200 sparse matrix of type '<class 'numpy.float64'>'
        with 12916 stored elements in Compressed Sparse Row format>
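To see the index-versus-term distinction in isolation, here is a minimal sketch on a toy corpus. The four reviews, their labels, `k=4`, and the variable names are made up for illustration and are not from the tutorial's data. The `getattr` fallback is only there so the sketch should run on both older and newer scikit-learn releases.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy data, invented purely for illustration.
reviews = [
    "good phone great battery",
    "great movie good acting",
    "bad phone terrible battery",
    "terrible movie bad acting",
]
labels = ["1", "1", "0", "0"]

# First pass: learn the full vocabulary and weight every term.
vectorizer = TfidfVectorizer()
bow = vectorizer.fit_transform(reviews)

# get_support(indices=True) yields integer column positions, not words.
selected = SelectKBest(chi2, k=4).fit(bow, labels).get_support(indices=True)

# Newer scikit-learn exposes get_feature_names_out(); older releases
# only have get_feature_names(). Use whichever is available.
get_names = getattr(vectorizer, "get_feature_names_out", None) or vectorizer.get_feature_names
names = np.asarray(get_names())

# Translate positions into the actual terms before reusing them.
vocabulary = names[selected].tolist()

# Second pass: restrict the vectorizer to the selected terms.
vectorizer_selected = TfidfVectorizer(vocabulary=vocabulary)
bow_selected = vectorizer_selected.fit_transform(reviews)

print(selected)            # integer indices
print(vocabulary)          # the terms those indices point to
print(bow_selected.shape, bow_selected.nnz)
```

If the second matrix has the expected number of columns but few or no stored elements, that is a hint the vocabulary entries are not matching any tokens, which is what happens when integers are passed in place of words.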