
How to lemmatize using NLTK or pywsd

三国纷争 2022-09-13 19:43:44
I know my explanation is long, but I think it's necessary. Hoping someone with patience and a helpful soul reads this :) I'm working on a sentiment analysis project atm and I'm stuck at the preprocessing part. I imported the csv file, converted it into a dataframe and converted the variables/columns to the correct data types. Then I tokenised like this, selecting the variable to tokenise (the tweet content) from the dataframe (df_tweet1):

# Tokenization
tknzr = TweetTokenizer()
tokenized_sents = [tknzr.tokenize(str(i)) for i in df_tweet1['Tweet Content']]
for i in tokenized_sents:
    print(i)

The output is a list of lists containing the words (tokens). Then I removed stop words:

# Stop word removal
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))

# add words that aren't in the NLTK stopwords list
new_stopwords = ['!', ',', ':', '&', '%', '.', '’']
new_stopwords_list = stop_words.union(new_stopwords)

clean_sents = []
for m in tokenized_sents:
    stop_m = [i for i in m if str(i).lower() not in new_stopwords_list]
    clean_sents.append(stop_m)

The output is the same, but without the stop words. The next two steps (POS tagging and lemmatisation) confuse me. I tried two things:

1) Converting the previous output into a list of strings:

new_test = [' '.join(x) for x in clean_sents]

because I thought this would let me do both steps at once with:

from pywsd.utils import lemmatize_sentence

text = new_test
lemm_text = lemmatize_sentence(text, keepWordPOS=True)

I got this error: TypeError: expected string or bytes-like object

2) Doing POS tagging and lemmatisation separately. First the POS tagging, using clean_sents as input:

# PART-OF-SPEECH
def process_content(clean_sents):
    try:
        tagged_list = []
        for lst in clean_sents[:500]:
            for item in lst:
                words = nltk.word_tokenize(item)
                tagged = nltk.pos_tag(words)
                tagged_list.append(tagged)
        return tagged_list
    except Exception as e:
        print(str(e))

output_POS_clean_sents = process_content(clean_sents)

The output is a list of lists of words with their tags attached. Then I wanted to lemmatise this output, but how? I tried two modules, and both gave me errors:

from pywsd.utils import lemmatize_sentence
lemmatized = [[lemmatize_sentence(output_POS_clean_sents) for word in s]
              for s in output_POS_clean_sents]

# AND

from nltk.stem.wordnet import WordNetLemmatizer
lmtzr = WordNetLemmatizer()
lemmatized = [[lmtzr.lemmatize(word) for word in s]
              for s in output_POS_clean_sents]
print(lemmatized)

The errors were, respectively:

TypeError: expected string or bytes-like object
AttributeError: 'tuple' object has no attribute 'endswith'

2 Answers

撒科打诨


If you are using a dataframe, I suggest storing the result of each preprocessing step in a new column. This way you can always inspect the output, and afterwards you can always build a list of lists to feed to the model in a one-liner. Another advantage of this approach is that the preprocessing pipeline is easy to visualise, and you can add extra steps when needed without getting confused.
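A minimal sketch of this column-per-step idea (the toy data, column names and the simplistic tokenisation here are purely illustrative, not part of the original pipeline):

```python
import pandas as pd

# toy dataframe standing in for df_tweet1 (illustrative data)
df = pd.DataFrame({'Tweet Content': ['This is the initial tweet!', 'Another tweet.']})

# one column per preprocessing step, so every intermediate stage can be inspected
df['tokens'] = df['Tweet Content'].apply(lambda t: t.lower().split())
df['no_punct'] = df['tokens'].apply(lambda ts: [t.strip('!.,') for t in ts])

print(df['no_punct'].tolist())
```

Each new column is just another `apply`, so reordering or removing a step never forces you to rerun everything from the raw text.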


Regarding your code, it can be optimised (for example, you can do stop-word removal and tokenisation at the same time), and the steps you perform are a bit muddled. For instance you lemmatise multiple times, with different libraries as well, and there is no point in doing that. In my opinion nltk works just fine; personally I use other libraries to preprocess tweets only to handle emojis, urls and hashtags, the things specifically related to tweets.


# I won't write all the imports, you get them from your code

# define new column to store the processed tweets

df_tweet1['Tweet Content Clean'] = pd.Series(index=df_tweet1.index)


tknzr = TweetTokenizer()

lmtzr = WordNetLemmatizer()


stop_words = set(stopwords.words("english"))

new_stopwords = ['!', ',', ':', '&', '%', '.', '’']

new_stopwords_list = stop_words.union(new_stopwords)


# iterate through each tweet

for ind, row in df_tweet1.iterrows():


    # get initial tweet: ['This is the initial tweet']

    tweet = row['Tweet Content']


    # tokenisation, stopwords removal and lemmatisation all at once

    # out: ['initial', 'tweet']

    tweet = [lmtzr.lemmatize(i) for i in tknzr.tokenize(tweet) if i.lower() not in new_stopwords_list]


    # pos tag, no need to lemmatise again after.

    # out: [('initial', 'JJ'), ('tweet', 'NN')]

    tweet = nltk.pos_tag(tweet)


    # save processed tweet into the new column

    df_tweet1.loc[ind, 'Tweet Content Clean'] = tweet

So overall you only need 4 lines: one to get the tweet string, two to preprocess the text, and another to store the tweet. You can add extra processing steps, paying attention to the output of each step (for example, tokenisation returns a list of strings, while pos tagging returns a list of tuples, which is why you were running into trouble).
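One detail worth noting if you ever do want to lemmatise after tagging instead of before: WordNetLemmatizer treats every word as a noun by default, so the Penn Treebank tags that pos_tag produces have to be mapped to WordNet's four POS letters first. A sketch of that mapping (the commented-out nltk call shows how it would plug in; it is not from the answer above):

```python
# Map a Penn Treebank tag (e.g. 'JJ', 'VBD', 'NNS') to the single-letter
# POS that WordNetLemmatizer.lemmatize expects as its second argument.
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return 'a'  # adjective
    if treebank_tag.startswith('V'):
        return 'v'  # verb
    if treebank_tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun (WordNet's default)

# With a tagged tweet like [('initial', 'JJ'), ('tweet', 'NN')] you would do:
# lemmas = [lmtzr.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged]
# Note the unpacking of each (word, tag) tuple: passing the tuple itself to
# lemmatize is what causes "'tuple' object has no attribute 'endswith'".
print(get_wordnet_pos('VBD'))
```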


If you want, you can then create a list of lists containing all the tweets in the dataframe:


# out: [[('initial', 'JJ'), ('tweet', 'NN')], [second tweet], [third tweet]]

all_tweets = [tweet for tweet in df_tweet1['Tweet Content Clean']]


Answered 2022-09-13
烙印99


For the first part: new_test is a list of strings, whereas lemmatize_sentence expects a string, so passing new_test raises an error like the one you got. You have to pass each string separately and then build a list from the lemmatised sentences. So:


text = new_test

lemm_text = [lemmatize_sentence(sentence, keepWordPOS=True) for sentence in text]

should create a list of lemmatised sentences.


Actually, I once worked on a project that seems similar to what you are doing. I wrote the following function to lemmatise strings:


import lemmy, re


def remove_stopwords(lst):

    with open('stopwords.txt', 'r') as sw:

        #read the stopwords file 

        stopwords = sw.read().split('\n')

        return [word for word in lst if word not in stopwords]


def lemmatize_strings(body_text, language = 'da', remove_stopwords_ = True):

    """Function to lemmatize a string or a list of strings, i.e. reduce each word to its base form. Also removes punctuation.


    -- body_text: string or list of strings

    -- language: language of the passed string(s), e.g. 'en', 'da' etc.

    """


    if isinstance(body_text, str):

        body_text = [body_text] #Convert whatever passed to a list to support passing of single string


    if not hasattr(body_text, '__iter__'):

        raise TypeError('Passed argument should be a sequence.')


    lemmatizer = lemmy.load(language) #load lemmatizing dictionary


    lemma_list = [] #list to store each lemmatized string 


    word_regex = re.compile('[a-zA-Z0-9æøåÆØÅ]+') #All characters and digits, i.e. all possible words


    for string in body_text:

        #remove punctuation and split words

        matches = word_regex.findall(string)


        #split words and lowercase them unless they are all caps

        lemmatized_string = [word.lower() if not word.isupper() else word for word in matches]


        #remove words that are in the stopwords file

        if remove_stopwords_:

            lemmatized_string = remove_stopwords(lemmatized_string)


        #lemmatize each word and choose the shortest word of suggested lemmatizations

        lemmatized_string = [min(lemmatizer.lemmatize('', word), key=len) for word in lemmatized_string]


        #remove words that are in the stopwords file

        if remove_stopwords_:

            lemmatized_string = remove_stopwords(lemmatized_string)


        lemma_list.append(' '.join(lemmatized_string))


    return lemma_list if len(lemma_list) > 1 else lemma_list[0] #return list if list was passed, else return string

You can have a look if you want, but don't feel obliged to. I would be really glad if it helps you get some ideas; I spent a lot of time trying to figure it out myself!
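The min(..., key=len) trick in the function above picks the shortest candidate when the lemmatiser suggests several; here is a self-contained illustration of that selection step, with a stub standing in for lemmy (whose lemmatize(pos, word) call returns a list of candidate lemmas):

```python
# Stub mimicking lemmy's lemmatize(pos, word) -> list-of-candidates interface.
# The candidate lists below are made up for illustration.
def fake_lemmatize(pos, word):
    candidates = {
        'running': ['running', 'run'],
        'cats': ['cats', 'cat'],
    }
    return candidates.get(word, [word])

# choose the shortest suggested lemma, as lemmatize_strings does above
words = ['running', 'cats', 'quickly']
lemmas = [min(fake_lemmatize('', w), key=len) for w in words]
print(lemmas)
```

Picking the shortest candidate is a heuristic: it tends to favour the bare stem, but a disambiguating POS tag (the first argument) would be the more principled choice when one is available.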


Let me know :-)


Answered 2022-09-13