
How to do lemmatization using NLTK or pywsd

Python

三国纷争 2022-09-13 19:43:44

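I know my explanation is long, but I felt it was necessary. Hopefully someone out there has the patience and a helpful soul :) I'm working on a sentiment analysis project at the moment and I'm stuck on the preprocessing part. I imported the CSV file, turned it into a dataframe, and converted the variables/columns to the correct data types. Then I tokenized like this, selecting the variable to tokenize (the tweet content) in the dataframe (df_tweet1):

    # Tokenization
    tknzr = TweetTokenizer()
    tokenized_sents = [tknzr.tokenize(str(i)) for i in df_tweet1['Tweet Content']]
    for i in tokenized_sents:
        print(i)

The output is a list of lists containing the words (tokens).

Then I perform stop word removal:

    # Stop word removal
    from nltk.corpus import stopwords

    stop_words = set(stopwords.words("english"))
    # add words that aren't in the NLTK stopwords list
    new_stopwords = ['!', ',', ':', '&', '%', '.', '’']
    new_stopwords_list = stop_words.union(new_stopwords)

    clean_sents = []
    for m in tokenized_sents:
        stop_m = [i for i in m if str(i).lower() not in new_stopwords_list]
        clean_sents.append(stop_m)

The output is the same, but without the stop words.

The next two steps are the ones that confuse me (part-of-speech tagging and lemmatization). I tried two things:

1) Converting the previous output into a list of strings

    new_test = [' '.join(x) for x in clean_sents]

because I thought this would let me perform both steps in one go with this code:

    from pywsd.utils import lemmatize_sentence

    text = new_test
    lemm_text = lemmatize_sentence(text, keepWordPOS=True)

I got this error: TypeError: expected string or bytes-like object

2) Performing POS tagging and lemmatization separately. First the POS tagging, using clean_sents as input:

    # PART-OF-SPEECH
    def process_content(clean_sents):
        try:
            tagged_list = []
            for lst in clean_sents[:500]:
                for item in lst:
                    words = nltk.word_tokenize(item)
                    tagged = nltk.pos_tag(words)
                    tagged_list.append(tagged)
            return tagged_list
        except Exception as e:
            print(str(e))

    output_POS_clean_sents = process_content(clean_sents)

The output is a list of lists, with a tag attached to each word. Then I want to lemmatize this output, but how? I tried two modules, and both gave me errors:

    from pywsd.utils import lemmatize_sentence

    lemmatized = [[lemmatize_sentence(output_POS_clean_sents) for word in s] for s in output_POS_clean_sents]

    # AND

    from nltk.stem.wordnet import WordNetLemmatizer

    lmtzr = WordNetLemmatizer()
    lemmatized = [[lmtzr.lemmatize(word) for word in s] for s in output_POS_clean_sents]
    print(lemmatized)

The errors were, respectively:

    TypeError: expected string or bytes-like object
    AttributeError: 'tuple' object has no attribute 'endswith'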


2 answers

撒科打诨

Contributed 1934 experience points, received more than 2 upvotes

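If you are using a dataframe, I suggest storing the results of the preprocessing steps in a new column. That way you can always check the output, and you can always build a list of lists to use as model input with one line of code afterwards. Another advantage of this approach is that you can easily visualize the preprocessing pipeline and add other steps wherever you need them without getting confused.

Regarding your code, it can be optimized (for example, you could perform stop word removal and tokenization at the same time), and I see some confusion about the steps you performed. For instance, you lemmatize multiple times, with different libraries, and there is no point in doing that. In my opinion nltk works just fine; personally I use other libraries to preprocess tweets only to deal with emojis, URLs and hashtags, all the things that are specific to tweets.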

    import nltk
    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.stem.wordnet import WordNetLemmatizer
    from nltk.tokenize import TweetTokenizer

    # define new column to store the processed tweets
    # (object dtype, so that each cell can hold a list)
    df_tweet1['Tweet Content Clean'] = pd.Series(index=df_tweet1.index, dtype=object)

    tknzr = TweetTokenizer()
    lmtzr = WordNetLemmatizer()

    stop_words = set(stopwords.words("english"))
    new_stopwords = ['!', ',', ':', '&', '%', '.', '’']
    new_stopwords_list = stop_words.union(new_stopwords)

    # iterate through each tweet
    for ind, row in df_tweet1.iterrows():

        # get initial tweet: 'This is the initial tweet'
        tweet = row['Tweet Content']

        # tokenisation, stopwords removal and lemmatisation all at once
        # out: ['initial', 'tweet']
        tweet = [lmtzr.lemmatize(i) for i in tknzr.tokenize(tweet) if i.lower() not in new_stopwords_list]

        # pos tag, no need to lemmatise again after.
        # out: [('initial', 'JJ'), ('tweet', 'NN')]
        tweet = nltk.pos_tag(tweet)

        # save processed tweet into the new column
        # (.at sets a single cell, so the whole list is stored as one value)
        df_tweet1.at[ind, 'Tweet Content Clean'] = tweet

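So overall, all you need are 4 lines: one to get the tweet string, two to preprocess the text, and another to store the tweet. You can add extra processing steps, as long as you pay attention to the output of each step (e.g. tokenization returns a list of strings while POS tagging returns a list of tuples, which is why you ran into trouble).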
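To make the point about output shapes concrete: if lemmatization is done after POS tagging, as in the question's second attempt, each (word, tag) tuple has to be unpacked before the word reaches the lemmatizer. The following is only a sketch. `penn_to_wordnet`, `lemmatize_tagged` and `toy_lemmatize` are hypothetical helpers, not part of NLTK, and the toy lemmatizer merely stands in for a real one so that the example runs without any corpus download.

```python
from typing import Callable, List, Tuple

Tagged = List[Tuple[str, str]]


def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag (the kind nltk.pos_tag returns) to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, used as the fallback


def lemmatize_tagged(tagged: Tagged, lemmatize: Callable[[str, str], str]) -> Tagged:
    """Unpack each (word, tag) tuple and lemmatize only the word."""
    return [(lemmatize(word, penn_to_wordnet(tag)), tag) for word, tag in tagged]


# Assumption: a tiny lookup table standing in for a real lemmatizer.
TOY_LEMMAS = {("dogs", "n"): "dog", ("running", "v"): "run"}


def toy_lemmatize(word: str, pos: str) -> str:
    return TOY_LEMMAS.get((word, pos), word)


tagged_tweet = [("dogs", "NNS"), ("running", "VBG"), ("fast", "RB")]
lemmatized = lemmatize_tagged(tagged_tweet, toy_lemmatize)
print(lemmatized)  # [('dog', 'NNS'), ('run', 'VBG'), ('fast', 'RB')]
```

With NLTK installed, `WordNetLemmatizer().lemmatize` accepts the same `(word, pos)` arguments and could be passed in place of `toy_lemmatize`.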

If you want, you can then create a list of lists containing all the tweets in the dataframe:

    # out: [[('initial', 'JJ'), ('tweet', 'NN')], [second tweet], [third tweet]]
    all_tweets = [tweet for tweet in df_tweet1['Tweet Content Clean']]
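If a later step needs plain words or one string per tweet rather than (word, tag) tuples, the list of lists can be reshaped with ordinary comprehensions. A small sketch, where `tagged_tweets` is made-up sample data in the same shape as `all_tweets`:

```python
# Made-up sample data, shaped like all_tweets above.
tagged_tweets = [
    [("initial", "JJ"), ("tweet", "NN")],
    [("second", "JJ"), ("tweet", "NN")],
]

# Keep only the word from each (word, tag) tuple.
words_only = [[word for word, _tag in tweet] for tweet in tagged_tweets]

# Join the words of each tweet back into a single string.
joined = [" ".join(words) for words in words_only]

print(words_only)  # [['initial', 'tweet'], ['second', 'tweet']]
print(joined)      # ['initial tweet', 'second tweet']
```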

Answered 2022-09-13

烙印99

Contributed 1829 experience points, received more than 13 upvotes

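For the first part, new_test is a list of strings. lemmatize_sentence expects a single string, so passing new_test raises an error like the one you got. You have to pass each string separately and then build a list from the lemmatized strings. So:

    text = new_test
    lemm_text = [lemmatize_sentence(sentence, keepWordPOS=True) for sentence in text]

should create a list of lemmatized sentences.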
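The same failure and fix can be reproduced without pywsd. In this sketch, `split_words` is a hypothetical stand-in for any function that expects one string; it relies on `re.findall`, which raises the same kind of TypeError when handed a list.

```python
import re
from typing import List


def split_words(sentence: str) -> List[str]:
    """Stand-in for a function that accepts exactly one string."""
    return [word.lower() for word in re.findall(r"\w+", sentence)]


new_test = ["Dogs are running", "Cats sleep"]

# Passing the whole list fails, because the regex needs a string.
try:
    split_words(new_test)
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

# Passing one string at a time works.
per_sentence = [split_words(sentence) for sentence in new_test]

print(error_message)
print(per_sentence)  # [['dogs', 'are', 'running'], ['cats', 'sleep']]
```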

I actually once did a project that looks similar to what you are doing. I wrote the following function to lemmatize strings:

    import re

    import lemmy


    def remove_stopwords(lst):
        """Drop every word that appears in stopwords.txt (one stop word per line)."""
        with open('stopwords.txt', 'r') as sw:
            # read the stopwords file
            stopwords = set(sw.read().split('\n'))
        return [word for word in lst if word not in stopwords]


    def lemmatize_strings(body_text, language='da', remove_stopwords_=True):
        """Lemmatize a string or a list of strings, i.e. reduce each word to its
        dictionary form. Also removes punctuation.

        -- body_text: string or list of strings
        -- language: language of the passed string(s), e.g. 'en', 'da' etc.
        -- remove_stopwords_: whether to drop the words listed in stopwords.txt
        """
        single_string = isinstance(body_text, str)
        if single_string:
            body_text = [body_text]  # wrap a single string in a list

        if not hasattr(body_text, '__iter__'):
            raise TypeError('Passed argument should be a sequence.')

        lemmatizer = lemmy.load(language)  # load lemmatizing dictionary
        lemma_list = []  # list to store each lemmatized string
        word_regex = re.compile('[a-zA-Z0-9æøåÆØÅ]+')  # letters and digits, i.e. all possible words

        for string in body_text:
            # remove punctuation and split words
            matches = word_regex.findall(string)

            # lowercase the words unless they are all caps
            lemmatized_string = [word.lower() if not word.isupper() else word for word in matches]

            # remove words that are in the stopwords file
            if remove_stopwords_:
                lemmatized_string = remove_stopwords(lemmatized_string)

            # lemmatize each word and choose the shortest of the suggested lemmas
            lemmatized_string = [min(lemmatizer.lemmatize('', word), key=len) for word in lemmatized_string]

            # remove stop words again, in case a lemma is itself a stop word
            if remove_stopwords_:
                lemmatized_string = remove_stopwords(lemmatized_string)

            lemma_list.append(' '.join(lemmatized_string))

        # return a list if a list was passed, else return a string
        return lemma_list[0] if single_string else lemma_list
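To see what the regex and casing steps of this function do on their own, here is a small sketch that needs neither lemmy nor a stopwords file. `split_and_normalize` is a hypothetical helper that repeats only those two steps.

```python
import re

# Same character class as in the function above.
word_regex = re.compile("[a-zA-Z0-9æøåÆØÅ]+")


def split_and_normalize(text):
    """Split on anything that is not a letter or digit, then lowercase
    every word that is not written entirely in capitals."""
    matches = word_regex.findall(text)
    return [word if word.isupper() else word.lower() for word in matches]


tokens = split_and_normalize("NASA's rockets aren't LOUD in 2022!")
print(tokens)  # ['NASA', 's', 'rockets', 'aren', 't', 'LOUD', 'in', '2022']
```

Note that apostrophes count as separators, so contractions are split into fragments such as 's' and 't'. Whether those fragments disappear afterwards depends on what the stopwords file contains.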

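Take a look if you like, but don't feel obligated. I'd be more than happy if it helps you get any ideas; I spent a lot of time trying to figure it out myself!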

Let me know :-)

Answered 2022-09-13
