2 Answers
Contributed 1,798 experience points · earned 3+ upvotes
You can use scikit-learn's CountVectorizer for this:
from sklearn.feature_extraction.text import CountVectorizer
from gensim import matutils
from gensim.models.ldamodel import LdaModel

text = ['computer time graph',
        'survey response eps',
        'human system computer',
        'machinelearning is very hot topic',
        'python win the race for simplicity as compared to other programming language']

# suppose these are the words you want to use as your vocab
vocabulary = ['machine', 'python', 'learning', 'human', 'system', 'hot', 'time']

vect = CountVectorizer(vocabulary=vocabulary)
x = vect.fit_transform(text)
feature_names = vect.get_feature_names_out()  # use get_feature_names() on older scikit-learn

# now you can use gensim's matutils helper function;
# documents_columns=False because CountVectorizer puts documents in rows, not columns
model = LdaModel(matutils.Sparse2Corpus(x, documents_columns=False),
                 num_topics=3,
                 id2word=dict(enumerate(feature_names)))

# print the topics
model.show_topics()

# to see the vocab that is being used
print(list(feature_names))
# ['machine', 'python', 'learning', 'human', 'system', 'hot', 'time']  -- only the features you chose are included
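For intuition, `Sparse2Corpus` with `documents_columns=False` simply streams each row of the document-term count matrix as a list of `(term_id, count)` pairs, which is gensim's bag-of-words format. A minimal pure-Python sketch of that conversion (using a dense matrix for illustration, not gensim's actual implementation):

```python
def rows_to_corpus(counts):
    """Turn a documents-x-terms count matrix (list of lists) into
    gensim-style bag-of-words: one list of (term_id, count) per document."""
    return [[(j, c) for j, c in enumerate(row) if c > 0] for row in counts]

# With vocabulary ['machine', 'python', 'learning', 'human', 'system', 'hot', 'time'],
# the document 'human system computer' counts one 'human' (id 3) and one 'system' (id 4):
counts = [[0, 0, 0, 1, 1, 0, 0]]
print(rows_to_corpus(counts))  # [[(3, 1), (4, 1)]]
```

Note that 'computer' disappears here because it is not in the restricted vocabulary, which is exactly the filtering effect the answer relies on.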
Contributed 1,872 experience points · earned 3+ upvotes
LDA's approach to topic modeling is to treat each document as a mixture of topics, in certain proportions, and each topic as a collection of keywords, again in certain proportions.
Once you give the algorithm the number of topics, it rearranges the topic distribution within documents and the keyword distribution within topics until it arrives at a good combination of topic-keyword distributions.
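The mixture view above can be sketched numerically: each topic is a probability distribution over words, each document a probability distribution over topics, and the document's expected word distribution is their product. A toy illustration with made-up numbers (not gensim's API):

```python
import numpy as np

# 2 topics over a 3-word vocabulary; each row is a topic-word distribution (sums to 1)
topic_word = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6]])

# one document that is 80% topic 0 and 20% topic 1 (document-topic proportions)
doc_topic = np.array([0.8, 0.2])

# expected word distribution for the document: a weighted mix of the topic rows
word_dist = doc_topic @ topic_word
print(word_dist)  # ≈ [0.58, 0.22, 0.20]
```

Training LDA is essentially the inverse problem: given only the observed word counts, recover plausible `topic_word` and `doc_topic` distributions.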
The two main inputs to the LDA topic model are the dictionary/vocabulary (id2word) and the corpus.
You can build them with something like this:
import gensim.corpora as corpora

# data_lemmatized is assumed to be your preprocessed text:
# a list of documents, each a list of lemmatized tokens

# Create Dictionary/Vocabulary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term-document frequency (bag-of-words representation)
corpus = [id2word.doc2bow(text) for text in texts]