I am using NLTK to remove stop words from a sentence.
For example: "I would love to fly again via American Airlines"
Result: "Love to fly American Airlines"
I tried the following code:
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# Tokenizing the text
txt = "I love to fly with American Airlines"
stopWords = set(stopwords.words("english"))
words = word_tokenize(txt)

# Creating a frequency table to keep the
# score of each word
freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1

# Creating a dictionary to keep the score
# of each sentence
sentences = sent_tokenize(txt)
sentenceValue = dict()
for sentence in sentences:
    for word, freq in freqTable.items():
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq

sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]

# Average value of a sentence from the original text
average = int(sumValues / len(sentenceValue))

# Storing sentences into our summary.
summary = ''
for sentence in sentences:
    if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        summary += " " + sentence
print("Summary: " + summary)
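This results in an empty string, I think because the sentence is too short for NLTK to work with. I am just looking into whether there is a simpler approach; I was planning to train a model for this.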
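One possibly simpler approach is to filter the tokens directly instead of scoring sentences. That also avoids the likely cause of the empty summary: with a single sentence, its score equals the average, so the `> 1.2 * average` test can never pass. The sketch below is only an illustration. The function name `remove_stop_words`, the regex tokenizer, and the small `DEMO_STOP_WORDS` set are assumptions made for this example and are not part of NLTK.

```python
import re

# Illustrative stop-word set (an assumption for this sketch). With NLTK
# installed and its "stopwords" corpus downloaded, one could pass
# set(stopwords.words("english")) instead.
DEMO_STOP_WORDS = {"i", "would", "again", "via", "with", "the", "a", "an"}


def remove_stop_words(text, stop_words):
    """Return text with every token found in stop_words dropped."""
    # Simple regex tokenizer: runs of letters, digits, or apostrophes.
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    # Compare in lowercase, but keep the original casing of kept tokens.
    kept = [tok for tok in tokens if tok.lower() not in stop_words]
    return " ".join(kept)


result = remove_stop_words(
    "I would love to fly again via American Airlines", DEMO_STOP_WORDS
)
print(result)  # love to fly American Airlines
```

Swapping in NLTK's English list would probably give a different result from the one above, since that list includes common words such as "to". If "to" should stay, one option is to subtract it from the set before filtering.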