2 Answers
Contributor with 1934 experience points, received 2+ upvotes
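If you are working with a DataFrame, I suggest storing the result of each preprocessing step in a new column. That way you can always inspect the output, and you can always build a list of lists to use as model input with a single line of code afterwards. Another advantage of this approach is that you can easily visualize the preprocessing pipeline and add further steps when needed without getting confused.
As for your code, it can be optimized (for example, you can do stopword removal and tokenization at the same time), and the order of your steps is a bit muddled. For example, you lemmatize several times, and with different libraries, which serves no purpose. In my opinion nltk works just fine; personally I use other libraries to preprocess tweets only to handle emojis, URLs and hashtags, everything that is specific to tweets.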
# I won't write all the imports, you get them from your code
# define a new column to store the processed tweets
# (object dtype, so that each cell can hold a list)
df_tweet1['Tweet Content Clean'] = pd.Series(index=df_tweet1.index, dtype=object)
tknzr = TweetTokenizer()
lmtzr = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))
new_stopwords = ['!', ',', ':', '&', '%', '.', '’']
new_stopwords_list = stop_words.union(new_stopwords)
# iterate through each tweet
for ind, row in df_tweet1.iterrows():
    # get initial tweet: 'This is the initial tweet'
    tweet = row['Tweet Content']
    # tokenisation, stopwords removal and lemmatisation all at once
    # out: ['initial', 'tweet']
    tweet = [lmtzr.lemmatize(i) for i in tknzr.tokenize(tweet) if i.lower() not in new_stopwords_list]
    # pos tag, no need to lemmatise again after.
    # out: [('initial', 'JJ'), ('tweet', 'NN')]
    tweet = nltk.pos_tag(tweet)
    # save processed tweet into the new column
    # (.at sets a single cell, so the whole list is stored as one value)
    df_tweet1.at[ind, 'Tweet Content Clean'] = tweet
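So overall you only need 4 lines: one to get the tweet string, two to preprocess the text, and one to store the tweet. You can add extra processing steps, as long as you pay attention to the output of each step (for example, tokenization returns a list of strings, while POS tagging returns a list of tuples, which is why you ran into trouble).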
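If you do want to lemmatize with POS information later, the tuples returned by the POS tagger have to be unpacked first. Here is a minimal sketch, assuming Penn Treebank tags (the kind nltk.pos_tag produces) and WordNet's single-letter POS codes; the helper name to_wordnet_pos is hypothetical and not part of nltk:

```python
# Hypothetical helper (not part of nltk): map a Penn Treebank tag
# to the single-letter POS code that WordNet uses.
def to_wordnet_pos(treebank_tag):
    """Return 'a', 'v', 'n' or 'r'; fall back to noun for any other tag."""
    prefix_to_pos = {"J": "a", "V": "v", "N": "n", "R": "r"}
    return prefix_to_pos.get(treebank_tag[:1], "n")


# A POS-tagged tweet is a list of (token, tag) tuples, not a list of strings.
tagged = [("initial", "JJ"), ("tweet", "NN")]

# Unpack each tuple before handing the pieces to a lemmatizer.
pairs = [(token, to_wordnet_pos(tag)) for token, tag in tagged]
print(pairs)
```

Each pair could then be passed on as lmtzr.lemmatize(token, pos), which is only needed if you decide to lemmatize after tagging rather than before.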
If you want, you can then create a list of lists containing all the tweets in the DataFrame:
# out: [[('initial', 'JJ'), ('tweet', 'NN')], [second tweet], [third tweet]]
all_tweets = df_tweet1['Tweet Content Clean'].tolist()
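If the model only needs the tokens and not the tags, the tuples can be split apart again. A small sketch, assuming all_tweets has the shape shown in the comment (the sample values here are made up for illustration):

```python
# Made-up sample with the same shape as all_tweets:
# one list of (token, tag) tuples per tweet.
all_tweets = [
    [("initial", "JJ"), ("tweet", "NN")],
    [("second", "JJ"), ("tweet", "NN")],
]

# Keep only the token from each (token, tag) tuple.
tokens_only = [[token for token, _tag in tweet] for tweet in all_tweets]

# Keep only the tag, in case the tag sequence is a feature as well.
tags_only = [[tag for _token, tag in tweet] for tweet in all_tweets]

print(tokens_only)
print(tags_only)
```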
Contributor with 1829 experience points, received 13+ upvotes
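First part: new_test is a list of strings. lemmatize_sentence expects a single string, so passing new_test raises an error like the one you got. You have to pass each string separately and then build a list from the lemmatized strings. So:
text = new_test
lemm_text = [lemmatize_sentence(sentence, keepWordPOS=True) for sentence in text]
should create a list of lemmatized sentences.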
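To make the difference concrete, here is a sketch with a stand-in for lemmatize_sentence. The real function body is not shown in this answer, so the stand-in only checks the input type and lowercases the words; the sample sentences are made up as well:

```python
def lemmatize_sentence(sentence, keepWordPOS=False):
    # Stand-in for the asker's function (assumption: it expects one string).
    # It does no real lemmatization, it only lowercases and splits.
    if not isinstance(sentence, str):
        raise TypeError(f"expected str, got {type(sentence).__name__}")
    return sentence.lower().split()


new_test = ["Cats are running", "Dogs barked"]

# Passing the whole list fails, because the function wants a single string.
try:
    lemmatize_sentence(new_test)
    failed = False
except TypeError:
    failed = True

# Passing each string separately works and gives one result per sentence.
lemm_text = [lemmatize_sentence(sentence, keepWordPOS=True) for sentence in new_test]
print(failed, lemm_text)
```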
Actually, I once did a project that looks similar to what you are doing. I wrote the following functions to lemmatize strings:
import re

import lemmy


def remove_stopwords(lst):
    """Return the words in lst that are not listed in stopwords.txt (one stopword per line)."""
    with open('stopwords.txt', 'r') as sw:
        # read the stopwords file into a set for fast membership tests
        stopwords = set(sw.read().split('\n'))
    return [word for word in lst if word not in stopwords]


def lemmatize_strings(body_text, language='da', remove_stopwords_=True):
    """Lemmatize a string or a list of strings, i.e. reduce each word to its base form. Also removes punctuation.
    -- body_text: string or list of strings
    -- language: language of the passed string(s), e.g. 'da'; must be a language lemmy supports
    -- remove_stopwords_: whether to drop the words listed in stopwords.txt
    """
    single_string = isinstance(body_text, str)
    if single_string:
        body_text = [body_text]  # wrap a single string in a list so both cases share one code path
    if not hasattr(body_text, '__iter__'):
        raise TypeError('Passed argument should be a sequence.')
    lemmatizer = lemmy.load(language)  # load lemmatizing dictionary
    lemma_list = []  # list to store each lemmatized string
    word_regex = re.compile('[a-zA-Z0-9æøåÆØÅ]+')  # runs of letters and digits, i.e. all possible words
    for string in body_text:
        # drop punctuation and split into words
        matches = word_regex.findall(string)
        # lowercase the words unless they are all caps
        lemmatized_string = [word.lower() if not word.isupper() else word for word in matches]
        # remove words that are in the stopwords file
        if remove_stopwords_:
            lemmatized_string = remove_stopwords(lemmatized_string)
        # lemmatize each word and choose the shortest of the suggested lemmas
        lemmatized_string = [min(lemmatizer.lemmatize('', word), key=len) for word in lemmatized_string]
        # remove stopwords again, since a lemma can itself be a stopword
        if remove_stopwords_:
            lemmatized_string = remove_stopwords(lemmatized_string)
        lemma_list.append(' '.join(lemmatized_string))
    # return a string if a single string was passed, else a list
    return lemma_list[0] if single_string else lemma_list
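The tokenizing part of the function can be tried on its own without lemmy installed. A short sketch using only the standard library; the sample sentence and the candidate list are made up for illustration and are not real lemmy output:

```python
import re

# Same pattern as in lemmatize_strings: runs of letters and digits.
word_regex = re.compile('[a-zA-Z0-9æøåÆØÅ]+')

text = "NASA launched a rocket, finally!"

# findall drops punctuation and whitespace and returns the words in order.
matches = word_regex.findall(text)

# Lowercase every word unless it is written in all caps.
words = [w if w.isupper() else w.lower() for w in matches]
print(words)

# When a lemmatizer suggests several lemmas, min(..., key=len) picks the shortest.
candidates = ["rockets", "rocket"]  # made-up candidates
shortest = min(candidates, key=len)
print(shortest)
```

Keeping all-caps words untouched means acronyms survive the lowercasing step, but note that a one-letter capital such as "A" also counts as all caps.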
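Feel free to take a look if you like, but don't feel obliged. I'd be very happy if it helps you get any ideas; I spent a lot of time trying to figure it out myself!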
Let me know :-)