How can I optimize preprocessing of all text documents without a for loop that preprocesses a single document on each iteration?

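I want to optimize the code below so that it can efficiently process 3,000 text documents, which will then be fed to a TFIDF Vectorizer and linkage() for clustering. So far I have read the Excel file with Pandas and saved the data frame column into a list variable. I then iterate over each text element in the list, tokenize it, and filter the stop words out of the tokens. The filtered tokens are stored in another variable, which is appended to a list. In the end I have a list of processed text elements built from the original list.

I think the optimization could happen where the list is created and the stop words are filtered out, and where the data is saved into two different variables: documents_no_stopwords and processed_words. It would be great if someone could help or suggest a direction to follow.

    import pandas

    # tokenizer, stop_words and documents_no_stopwords are not shown in the
    # question; they are assumed to be defined elsewhere.

    def preprocessing(word):
        tokens = tokenizer.tokenize(word)
        processed_words = []
        for w in tokens:
            if w in stop_words:
                continue
            else:
                # collect the tokens of this document that are not stop words
                processed_words.append(w)
        # add the filtered document to the list of processed documents
        documents_no_stopwords.append(' '.join(processed_words))
        processed_words = []

    temp = 0
    df = pandas.read_excel('File.xlsx')
    for text in df['text'].tolist():
        temp = temp + 1
        preprocessing(text)
        print(temp)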


1 Answer

冉冉说


First make a set of the stop words, then use a list comprehension to filter the tokens.

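    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # Build the stop-word set once, not on every call.
    stop_words = set(stopwords.words("english"))

    def preprocessing(txt):
        tokens = word_tokenize(txt)
        # Keep only the tokens that are not stop words.
        tokens = [i for i in tokens if i not in stop_words]
        return " ".join(tokens)

    string = "Hey this is Sam. How are you?"
    print(preprocessing(string))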

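Output:

    Hey Sam . How ?

The comparison is case-sensitive and NLTK's English stop words are lowercase, so the capitalized "How" is kept.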

Instead of a for loop, use df.apply as follows:

    df['text'] = df['text'].apply(preprocessing)
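The key change is that preprocessing returns the cleaned string instead of appending to a global list, so the whole collection can be handled in one expression. Here is a minimal sketch of that pattern using plain Python lists. STOP_WORDS, the sample documents, and the whitespace tokenizer are hypothetical stand-ins chosen so the snippet runs without pandas or NLTK data; they are not the asker's data.

```python
# Hypothetical stand-ins (not from the question): a tiny stop-word set
# and whitespace tokenization instead of NLTK's word_tokenize.
STOP_WORDS = {"this", "is", "are", "you", "the", "a"}

def preprocessing(text):
    """Return the text with stop words removed."""
    tokens = text.split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

documents = [
    "this is the first document",
    "are you reading a second document",
]

# One pass over the collection: no global list and no manual counter.
documents_no_stopwords = [preprocessing(doc) for doc in documents]
print(documents_no_stopwords)
```

With a data frame, df['text'].apply(preprocessing) plays the same role as the list comprehension here: it calls the function once per row and collects the return values.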

Why a set is preferred over a list

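stopwords.words() contains duplicate entries: compare len(stopwords.words()) with len(set(stopwords.words())) and the set is a few hundred entries shorter. More importantly for speed, testing membership with "in" on a set is a hash lookup that takes roughly constant time on average, whereas on a list it scans the elements one by one. That is why a set is preferred here.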
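The effect of duplicates can be seen on a small scale. The sketch below uses a made-up word list (an assumption standing in for a stop-word list with repeats) and shows that converting it to a set drops the repeats without changing the result of the filter.

```python
# Hypothetical word list with repeated entries.
words = ["the", "a", "de", "la", "a", "de", "en", "a"]
unique_words = set(words)

# The set keeps one copy of each entry.
print(len(words), len(unique_words))

tokens = ["la", "casa", "a", "b"]

# Filtering gives the same tokens with either container.
from_list = [t for t in tokens if t not in words]
from_set = [t for t in tokens if t not in unique_words]
print(from_list == from_set)
```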

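Here is the performance with a list and with a set, roughly a 100-fold difference (%timeit is an IPython magic; tokens is assumed to be an already tokenized list):

    x = stopwords.words('english')        # list
    y = set(stopwords.words('english'))   # set

    %timeit new = [i for i in tokens if i not in x]
    # 10000 loops, best of 3: 120 µs per loop

    %timeit old = [j for j in tokens if j not in y]
    # 1000000 loops, best of 3: 1.16 µs per loop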
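For readers outside IPython, a similar comparison can be run with the standard-library timeit module. This is a sketch on synthetic data: the 10,000 generated "stop words" and the token list are assumptions made for illustration, so the absolute numbers will differ from the timings quoted above.

```python
import timeit

# Synthetic data (hypothetical): 10,000 fake stop words and 100 tokens.
stop_list = [f"word{i}" for i in range(10_000)]
stop_set = set(stop_list)
tokens = ["word9999", "hello", "word5000", "world"] * 25

def filter_with(container):
    """Drop every token that is found in the given container."""
    return [t for t in tokens if t not in container]

# Time 10 runs of the same filter against each container type.
list_time = timeit.timeit(lambda: filter_with(stop_list), number=10)
set_time = timeit.timeit(lambda: filter_with(stop_set), number=10)

print(f"list: {list_time:.4f} s, set: {set_time:.4f} s")
```

Both calls return the same filtered tokens; only the cost of each membership test differs.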

A list comprehension is also usually faster than an equivalent for loop that appends to a list.
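This can be checked locally with timeit. The sketch below builds the same filtered list both ways; the stop-word set and the token list are hypothetical, and the size of the gap depends on the Python version, so treat the printed numbers as indicative only.

```python
import timeit

# Hypothetical inputs: a small stop-word set and 9,000 tokens.
stop_set = {"this", "is", "are", "you"}
tokens = "Hey this is Sam . How are you ?".split() * 1000

def with_loop():
    """Explicit for loop with append."""
    kept = []
    for t in tokens:
        if t not in stop_set:
            kept.append(t)
    return kept

def with_comprehension():
    """The same filter written as a list comprehension."""
    return [t for t in tokens if t not in stop_set]

loop_time = timeit.timeit(with_loop, number=100)
comp_time = timeit.timeit(with_comprehension, number=100)

print(f"for loop: {loop_time:.4f} s, comprehension: {comp_time:.4f} s")
```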

2021-10-19
