Get every unique word from a tokenized CSV file

Python

DIEA 2022-06-02 17:30:53

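This is the CSV table. It has two columns: one is the summary and the other is the text. Both columns were of type list before I combined them, converted them to a DataFrame, and saved it as a CSV file. By the way, the text in the table has already been cleaned (all marks removed and converted to lowercase).

I want to loop through every cell in the table, split the summaries and texts into words, and tokenize each word. How can I do that? I tried Python's CSV reader and df.apply(word_tokenize). I also tried newList = set(summaries + texts), but then I could not tokenize them. Any solution is welcome, whether it works on the CSV file, the DataFrame, or the lists. Thanks in advance for your help!

Note: the real table has more than 50,000 rows.

=== Some updates ===

Here is the code I tried.

import pandas as pd
data = pd.read_csv('test.csv')
data.head()
newTry = data.apply(lambda x: " ".join(x), axis=1)
type(newTry)
print(newTry)
import nltk
for sentence in newTry:
    new = sentence.split()
    print(new)
    print(set(new))

Please see the output in the screenshot. The list contains duplicate words and some square brackets. How should I remove them? I tried using set, but it only gives the values for one sentence.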


1 Answer

精慕HU


You can use the built-in csv package to read the CSV file, and nltk to tokenize the words:

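import csv

from nltk.tokenize import word_tokenize  # needs NLTK's Punkt tokenizer data installed


def get_data():
    # Yield rows one at a time so a large file is never loaded all at once
    with open("sample_csv.csv", "r", newline="") as records:
        for record in csv.reader(records):
            yield record


data = get_data()
next(data)  # skip header

words = []    # unique words in first-seen order
seen = set()  # set lookup keeps the duplicate check fast on 50,000+ rows

for row in data:
    for sent in row:
        for word in word_tokenize(sent):
            if word not in seen:
                seen.add(word)
                words.append(word)

print(words)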
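On the two symptoms in the question: set(new) is rebuilt inside the loop, so it only ever holds the words of the current row, and the square brackets most likely come from the list columns being written to CSV as text such as "['good food']". The following is only a rough sketch under that assumption. It uses the standard library only, splits on whitespace instead of calling word_tokenize (swap it in if you need NLTK's tokenization), and parse_cell, unique_words, and the sample rows are hypothetical names and data made up for illustration.

```python
import ast
import csv
import io


def parse_cell(cell):
    """Return the words in one CSV cell.

    Assumption: a cell is either plain text or a stringified Python list
    such as "['good food']" (what a list column looks like after saving).
    """
    cell = cell.strip()
    if cell.startswith("[") and cell.endswith("]"):
        try:
            items = ast.literal_eval(cell)  # turn the text back into a list
        except (ValueError, SyntaxError):
            items = [cell]  # not a valid list literal; treat as plain text
    else:
        items = [cell]
    words = []
    for item in items:
        words.extend(str(item).split())
    return words


def unique_words(csv_file):
    """Collect unique words across ALL rows, in first-seen order."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip header
    seen = {}  # dict keys keep insertion order and give fast lookups
    for row in reader:
        for cell in row:
            for word in parse_cell(cell):
                seen.setdefault(word, None)
    return list(seen)


# Hypothetical sample standing in for test.csv
sample = io.StringIO(
    'summary,text\n'
    '"[\'good food\']","[\'the food was good\']"\n'
    '"[\'bad service\']","[\'the service was bad\']"\n'
)
print(unique_words(sample))
```

For a real file, pass an open file object instead of the StringIO sample, for example open("test.csv", newline=""). The key difference from the loop in the question is that one collection is filled across every row, rather than a new set being created per sentence.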

Answered 2022-06-02
