Generating unique IDs for a list of words

I have a list of lists containing word pairs and want to represent the words by ids. The ids should run from 0 to len(set(words)). The list currently looks like this:

    [['pluripotent', 'Scharte'], ['Halswirbel', 'präventiv'], ['Kleiber', 'Blauspecht'], ['Kleiber', 'Scheidung'], ['Nillenlutscher', 'Salzstangenlecker']]

The result should have the same format, but with ids in place of the words. For example:

    [[0, 1], [2, 3], [4, 5], [4, 6], [7, 8]]

So far I have this, but it doesn't give me the correct output:

    def words_to_ids(labels):
        vocabulary = []
        word_to_id = {}
        ids = []
        for word1,word2 in labels:
            vocabulary.append(word1)
            vocabulary.append(word2)
        for i, word in enumerate(vocabulary):
            word_to_id [word] = i
        for word1,word2 in labels:
            ids.append([word_to_id [word1], word_to_id [word1]])
        print(ids)

Output:

    [[0, 0], [2, 2], [6, 6], [6, 6], [8, 8]]

It repeats ids where the words are unique.


2 Answers

富国沪深


You have two errors. First, there is a simple typo here:

    for word1,word2 in labels:
        ids.append([word_to_id [word1], word_to_id [word1]])

You add the id for word1 twice there. Change the second word1 so that it looks up word2.

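Next, you never test whether you have already seen a word, so 'Kleiber' is first given id 4, and that entry is then overwritten with 6 on the next iteration. You need to number the unique words, not every word: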

    counter = 0
    for word in vocabulary:
        if word not in word_to_id:
            word_to_id[word] = counter
            counter += 1
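Putting both fixes together, the function from the question might look like the sketch below. It keeps the question's names; returning the list instead of printing it is an assumption made here so the result is easier to check.

```python
def words_to_ids(labels):
    # Collect every word in order of appearance (duplicates included).
    vocabulary = []
    for word1, word2 in labels:
        vocabulary.append(word1)
        vocabulary.append(word2)

    # Hand out the next free id only the first time a word is seen.
    word_to_id = {}
    counter = 0
    for word in vocabulary:
        if word not in word_to_id:
            word_to_id[word] = counter
            counter += 1

    # Look up word1 and word2 (the question's code looked up word1 twice).
    # Assumption: return the ids rather than print them.
    return [[word_to_id[word1], word_to_id[word2]] for word1, word2 in labels]


labels = [['pluripotent', 'Scharte'], ['Halswirbel', 'präventiv'],
          ['Kleiber', 'Blauspecht'], ['Kleiber', 'Scheidung'],
          ['Nillenlutscher', 'Salzstangenlecker']]
print(words_to_ids(labels))  # [[0, 1], [2, 3], [4, 5], [4, 6], [7, 8]]
```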

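Alternatively, you could simply not add a word to vocabulary if it is already listed there. Incidentally, you don't really need a separate vocabulary list here. A separate loop doesn't buy you anything, so the following works too: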

    word_to_id = {}
    counter = 0
    for words in labels:
        for word in words:
            # only assign an id the first time a word is seen
            if word not in word_to_id:
                word_to_id [word] = counter
                counter += 1
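As a rough illustration, the single-loop approach can be wrapped in a function. The name words_to_ids_single_pass is hypothetical, chosen here only to keep it apart from the question's function. The membership check is what keeps a repeated word from receiving a second id.

```python
def words_to_ids_single_pass(labels):
    # Hypothetical helper name; not part of the original answer.
    word_to_id = {}
    counter = 0
    for words in labels:
        for word in words:
            # Skip words that already have an id, so repeats keep their first id.
            if word not in word_to_id:
                word_to_id[word] = counter
                counter += 1
    # Translate every pair using the finished mapping.
    return [[word_to_id[word] for word in words] for words in labels]


pairs = [['Kleiber', 'Blauspecht'], ['Kleiber', 'Scheidung']]
print(words_to_ids_single_pass(pairs))  # [[0, 1], [0, 2]]
```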

You can simplify the code considerably by using a defaultdict object, with itertools.count() supplying the default values:

    from collections import defaultdict
    from itertools import count

    def words_to_ids(labels):
        word_ids = defaultdict(count().__next__)
        return [[word_ids[w1], word_ids[w2]] for w1, w2 in labels]

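Each time its __next__ method is called, the count() object gives you the next integer in the series, and defaultdict() calls that method each time you try to access a key that is not yet in the dictionary. Together they ensure a unique id for each unique word.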
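To see that interaction in isolation, here is a small sketch that looks words up one at a time. The variable names first, second and again are illustrative only. Note that only indexing with square brackets triggers the default factory; a membership test with `in` or a call to `.get()` does not create an entry.

```python
from collections import defaultdict
from itertools import count

# The factory is the bound __next__ method of a single counter (Python 3).
word_ids = defaultdict(count().__next__)

first = word_ids['Kleiber']       # missing key: factory runs and 0 is stored
second = word_ids['Blauspecht']   # missing key: factory runs and 1 is stored
again = word_ids['Kleiber']       # existing key: factory is not called

print(first, second, again)       # 0 1 0
print('Scheidung' in word_ids)    # False, and no entry is created
print(dict(word_ids))             # {'Kleiber': 0, 'Blauspecht': 1}
```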

2021-10-12
