
nltk NaiveBayesClassifier training for sentiment analysis


达令说 2019-12-26 09:58:18
I am training the NaiveBayesClassifier in Python using sentences, and it gives me the error below. I do not know what the error might be, and any help would be great. I have tried many other input formats, but the error persists. The code is given below:

from text.classifiers import NaiveBayesClassifier
from text.blob import TextBlob

train = [('I love this sandwich.', 'pos'),
         ('This is an amazing place!', 'pos'),
         ('I feel very good about these beers.', 'pos'),
         ('This is my best work.', 'pos'),
         ("What an awesome view", 'pos'),
         ('I do not like this restaurant', 'neg'),
         ('I am tired of this stuff.', 'neg'),
         ("I can't deal with this", 'neg'),
         ('He is my sworn enemy!', 'neg'),
         ('My boss is horrible.', 'neg')]

test = [('The beer was good.', 'pos'),
        ('I do not enjoy my job', 'neg'),
        ("I ain't feeling dandy today.", 'neg'),
        ("I feel amazing!", 'pos'),
        ('Gary is a friend of mine.', 'pos'),
        ("I can't believe I'm doing this.", 'neg')]

classifier = nltk.NaiveBayesClassifier.train(train)

I am including the traceback below.

Traceback (most recent call last):
  File "C:\Users\5460\Desktop\train01.py", line 15, in <module>
    all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
  File "C:\Users\5460\Desktop\train01.py", line 15, in <genexpr>
    all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 87, in word_tokenize
    return _word_tokenize(text)
  File "C:\Python27\lib\site-packages\nltk\tokenize\treebank.py", line 67, in tokenize
    text = re.sub(r'^\"', r'``', text)
  File "C:\Python27\lib\re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

3 Answers





We start with input sentences, each paired with a sentiment tag:

training_data = [('I love this sandwich.', 'pos'),
                 ('This is an amazing place!', 'pos'),
                 ('I feel very good about these beers.', 'pos'),
                 ('This is my best work.', 'pos'),
                 ("What an awesome view", 'pos'),
                 ('I do not like this restaurant', 'neg'),
                 ('I am tired of this stuff.', 'neg'),
                 ("I can't deal with this", 'neg'),
                 ('He is my sworn enemy!', 'neg'),
                 ('My boss is horrible.', 'neg')]

Taking individual words as features, we collect every word that occurs in the training data and call the result vocabulary:

from nltk.tokenize import word_tokenize
from itertools import chain

vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))

Essentially, vocabulary here is the same as @275365's all_words:

>>> all_words = set(word.lower() for passage in training_data for word in word_tokenize(passage[0]))
>>> vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))
>>> print(vocabulary == all_words)
True

For each data point (i.e. each sentence and its pos/neg tag), we want to say whether a feature (i.e. a word from the vocabulary) is present or not.

>>> sentence = word_tokenize('I love this sandwich.'.lower())
>>> print({i: True for i in vocabulary if i in sentence})
{'this': True, 'i': True, 'sandwich': True, 'love': True, '.': True}


We also want to tell the classifier which vocabulary words are absent from the sentence, so for each data point we list every word in the vocabulary and say whether it is present:

>>> sentence = word_tokenize('I love this sandwich.'.lower())
>>> x = {i: True for i in vocabulary if i in sentence}
>>> y = {i: False for i in vocabulary if i not in sentence}
>>> x.update(y)
>>> print(x)
{'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'good': False, 'best': False, '!': False, 'these': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'ca': False, 'do': False, 'sandwich': True, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'i': True, 'stuff': False, 'place': False, 'my': False, 'awesome': False, 'view': False}


Since that loops through the vocabulary twice, it is more efficient to build the same dictionary in a single pass:

>>> sentence = word_tokenize('I love this sandwich.'.lower())
>>> x = {i: (i in sentence) for i in vocabulary}
>>> print(x)
{'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'good': False, 'best': False, '!': False, 'these': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'ca': False, 'do': False, 'sandwich': True, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'i': True, 'stuff': False, 'place': False, 'my': False, 'awesome': False, 'view': False}

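So for each sentence, we want to tell the classifier which words are present and which are absent, and also give it the pos/neg tag. We can call the result feature_set: a list of tuples, each made up of an x (as shown above) and its tag.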

>>> feature_set = [({i: (i in word_tokenize(sentence.lower())) for i in vocabulary}, tag) for sentence, tag in training_data]
>>> print(feature_set)
[({'this': True, 'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'sandwich': True, 'ca': False, 'best': False, '!': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'these': False, 'of': False, 'work': False, "n't": False, 'i': True, 'stuff': False, 'place': False, 'my': False, 'view': False}, 'pos'), ...]
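
The vocabulary and feature-set steps above can be wrapped in small reusable helpers. The following is a minimal sketch, not the answerer's code: `simple_tokenize`, `build_vocabulary`, `featurize` and `build_feature_set` are hypothetical names, and the regex tokenizer is only a stand-in for `word_tokenize` so that the snippet runs without NLTK's tokenizer data. It splits text differently from `word_tokenize` (for example, it keeps "can't" as a single token).

```python
import re

def simple_tokenize(text):
    # Stand-in for nltk's word_tokenize (an assumption for this sketch):
    # lowercase, then match words (optionally with an inner apostrophe)
    # or single punctuation characters.
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?|[^\sa-z0-9]", text.lower())

def build_vocabulary(labeled_sentences, tokenize=simple_tokenize):
    # Every distinct token seen anywhere in the training sentences.
    return {token for sentence, _ in labeled_sentences
            for token in tokenize(sentence)}

def featurize(sentence, vocabulary, tokenize=simple_tokenize):
    # Tokenize once, then record presence/absence of each vocabulary word.
    tokens = set(tokenize(sentence))
    return {word: (word in tokens) for word in vocabulary}

def build_feature_set(labeled_sentences, vocabulary, tokenize=simple_tokenize):
    # One (feature dict, tag) tuple per labeled sentence.
    return [(featurize(sentence, vocabulary, tokenize), tag)
            for sentence, tag in labeled_sentences]

training_data = [('I love this sandwich.', 'pos'),
                 ("I can't deal with this", 'neg')]
vocabulary = build_vocabulary(training_data)
feature_set = build_feature_set(training_data, vocabulary)
```

Tokenizing each sentence once inside `featurize`, rather than once per vocabulary word as the one-line comprehension does, avoids repeated work on larger datasets. Passing `tokenize=word_tokenize` should reproduce the pipeline from the walkthrough, provided the sentence is lowercased first.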


Then we feed the features and tags in feature_set into the classifier to train it:

from nltk import NaiveBayesClassifier as nbc

classifier = nbc.train(feature_set)


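Now we have a trained classifier. To tag a new sentence, we first featurize it against the same vocabulary the classifier was trained on:

>>> test_sentence = "This is the best band I've ever heard! foobar"
>>> featurized_test_sentence = {i: (i in word_tokenize(test_sentence.lower())) for i in vocabulary}

Then we pass the featurized sentence to the classifier and ask it to classify:

>>> classifier.classify(featurized_test_sentence)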
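
One consequence of featurizing against a fixed vocabulary is that words never seen during training contribute nothing to the prediction. The small sketch below makes this visible. It assumes a toy vocabulary and a plain `str.split` tokenizer (both hypothetical, chosen so the snippet needs no NLTK data), so the tokens differ slightly from what `word_tokenize` would produce.

```python
# Toy vocabulary standing in for the one built from training_data (assumption).
vocabulary = {'this', 'is', 'the', 'best', 'my', 'work'}

test_sentence = "This is the best band I've ever heard! foobar"
# Simplified tokenizer for illustration only, not word_tokenize.
tokens = set(test_sentence.lower().split())

# Only vocabulary words become keys of the feature dict.
featurized = {word: (word in tokens) for word in vocabulary}

# Tokens present in the sentence that have no key in the feature dict.
out_of_vocabulary = sorted(tokens - vocabulary)

print(featurized)
print(out_of_vocabulary)
```

Here `foobar` ends up in `out_of_vocabulary` and never reaches the classifier, which is why a sentence made up mostly of unseen words is classified largely on the label priors and on which known words are absent.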



Here is the full code without the walkthrough:

from nltk import NaiveBayesClassifier as nbc
from nltk.tokenize import word_tokenize
from itertools import chain

training_data = [('I love this sandwich.', 'pos'),
                 ('This is an amazing place!', 'pos'),
                 ('I feel very good about these beers.', 'pos'),
                 ('This is my best work.', 'pos'),
                 ("What an awesome view", 'pos'),
                 ('I do not like this restaurant', 'neg'),
                 ('I am tired of this stuff.', 'neg'),
                 ("I can't deal with this", 'neg'),
                 ('He is my sworn enemy!', 'neg'),
                 ('My boss is horrible.', 'neg')]

# All distinct lowercased tokens in the training sentences.
vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))

# One (feature dict, tag) tuple per training sentence.
feature_set = [({i: (i in word_tokenize(sentence.lower())) for i in vocabulary}, tag) for sentence, tag in training_data]

classifier = nbc.train(feature_set)

test_sentence = "This is the best band I've ever heard!"
featurized_test_sentence = {i: (i in word_tokenize(test_sentence.lower())) for i in vocabulary}

print("test_sent: " + test_sentence)
print("tag: " + classifier.classify(featurized_test_sentence))
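
The question also defines a held-out `test` list that the script above never uses. As a rough sketch of how it could be scored, the snippet below featurizes those sentences against the training vocabulary and calls `nltk.classify.accuracy`. It assumes NLTK is installed. To stay runnable without the tokenizer data, it uses a simplified `str.split` tokenizer (the hypothetical `tokenize` helper) in place of `word_tokenize`, so the result will not match the full pipeline exactly.

```python
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy

def tokenize(text):
    # Simplified stand-in for word_tokenize (assumption): lowercase + whitespace split.
    return text.lower().split()

def featurize(sentence, vocabulary):
    # Presence/absence of every vocabulary word in the sentence.
    tokens = set(tokenize(sentence))
    return {word: (word in tokens) for word in vocabulary}

training_data = [('I love this sandwich.', 'pos'),
                 ('This is an amazing place!', 'pos'),
                 ('I feel very good about these beers.', 'pos'),
                 ('This is my best work.', 'pos'),
                 ("What an awesome view", 'pos'),
                 ('I do not like this restaurant', 'neg'),
                 ('I am tired of this stuff.', 'neg'),
                 ("I can't deal with this", 'neg'),
                 ('He is my sworn enemy!', 'neg'),
                 ('My boss is horrible.', 'neg')]

test_data = [('The beer was good.', 'pos'),
             ('I do not enjoy my job', 'neg'),
             ("I ain't feeling dandy today.", 'neg'),
             ("I feel amazing!", 'pos'),
             ('Gary is a friend of mine.', 'pos'),
             ("I can't believe I'm doing this.", 'neg')]

# The vocabulary comes from the training data only; test sentences are
# featurized against it, so unseen test words are ignored.
vocabulary = {tok for sentence, _ in training_data for tok in tokenize(sentence)}
train_set = [(featurize(s, vocabulary), tag) for s, tag in training_data]
test_set = [(featurize(s, vocabulary), tag) for s, tag in test_data]

classifier = NaiveBayesClassifier.train(train_set)
score = accuracy(classifier, test_set)  # fraction of test sentences labeled correctly
print("accuracy: %.2f" % score)
```

With only ten training sentences and six test sentences, the score says very little about how the classifier would behave on real data. It is mainly a check that the training and test inputs have the shape `NaiveBayesClassifier` expects.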

2019-12-26