
How can I use difflib to highlight (only) word errors?

Python

翻阅古今 2023-03-22 16:34:13

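I'm trying to compare the output of a speech-to-text API against a ground-truth transcript. What I want to do is uppercase the words in the ground truth that the speech-to-text API missed or misrecognized. For example:

Truth: The quick brown fox jumps over the lazy dog.
Speech-to-text output: the quick brown box jumps over the dog
Desired result: The quick brown FOX jumps over the LAZY dog.

My first instinct was to strip capitalization and punctuation from the ground truth and use difflib. That gives me an accurate diff, but I can't map the output back to positions in the original text. I want to keep the ground truth's capitalization and punctuation when displaying the result, even though I'm only interested in word errors. Is there a way to express difflib output as word-level changes to the original text?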


3 Answers

慕斯王

Contributed 1864 experience points, received 2+ likes

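I'd also like to suggest a solution using difflib, but I prefer a regex for word detection, since it is more precise and more tolerant of odd characters and similar problems.

I added some odd punctuation to your original strings to show what I mean: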

import re
import difflib

truth = 'The quick! brown - fox jumps, over the lazy dog.'
speech = 'the quick... brown box jumps. over the dog'

truth = re.findall(r"[\w']+", truth.lower())
speech = re.findall(r"[\w']+", speech.lower())

for d in difflib.ndiff(truth, speech):
    print(d)

Output:

  the
  quick
  brown
- fox
+ box
  jumps
  over
  the
- lazy
  dog

Another possible output:

diff = difflib.unified_diff(truth, speech)
print(''.join(diff))

Output:

---
+++
@@ -1,9 +1,8 @@
 the quick brown-fox+box jumps over the-lazy dog
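The diffs above work on lowercased word lists, so the link back to the original string is lost. One way to keep it, sketched here as an assumption rather than as part of this answer, is to use `re.finditer` so each word carries its character span, then uppercase every truth word that falls outside the blocks returned by `SequenceMatcher.get_matching_blocks()`. The names `WORD_RE` and `highlight_missed_words` are hypothetical.

```python
import difflib
import re

WORD_RE = re.compile(r"[\w']+")  # same word pattern as in the answer above


def highlight_missed_words(truth, speech):
    """Return `truth` with every word that has no match in `speech` uppercased."""
    # Keep the match objects so each word's character span in `truth` is known.
    truth_matches = list(WORD_RE.finditer(truth))
    truth_words = [m.group().lower() for m in truth_matches]
    speech_words = [w.lower() for w in WORD_RE.findall(speech)]

    matcher = difflib.SequenceMatcher(None, truth_words, speech_words, autojunk=False)
    matched = set()
    for block in matcher.get_matching_blocks():
        # Each block says truth_words[a:a+size] equals speech_words[b:b+size].
        matched.update(range(block.a, block.a + block.size))

    pieces = []
    cursor = 0
    for index, m in enumerate(truth_matches):
        pieces.append(truth[cursor:m.start()])  # punctuation and spaces, untouched
        word = m.group()
        pieces.append(word if index in matched else word.upper())
        cursor = m.end()
    pieces.append(truth[cursor:])  # trailing text after the last word
    return "".join(pieces)


print(highlight_missed_words(
    "The quick! brown - fox jumps, over the lazy dog.",
    "the quick... brown box jumps. over the dog",
))
```

Because only the matched word spans are rewritten, the punctuation and spacing between words come through exactly as they appear in the ground truth.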

Answered 2023-03-22

HUX布斯

Contributed 1876 experience points, received 6+ likes

Why not split the sentences into words and then run difflib on those words?

import difflib

truth = 'The quick brown fox jumps over the lazy dog.'.lower().strip('.').split()
speech = 'the quick brown box jumps over the dog'.lower().strip('.').split()

for d in difflib.ndiff(truth, speech):
    print(d)
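To get from this word-level diff back to the original sentence, one option is to walk the `ndiff` lines and count positions in the truth list: lines prefixed with two spaces or with `- ` each consume one truth word, while `+ ` and `? ` lines consume none. The sketch below assumes that the original words and their normalized forms are kept as parallel lists; `normalize` and `altered_truth_indices` are hypothetical helper names.

```python
import difflib
import string


def normalize(word):
    # Lowercase and trim surrounding punctuation so "dog." compares equal to "dog".
    return word.lower().strip(string.punctuation)


def altered_truth_indices(truth_words, speech_words):
    """Indices of words in truth_words that ndiff reports as removed or replaced."""
    altered = []
    position = 0  # index of the next word in truth_words
    for line in difflib.ndiff(truth_words, speech_words):
        tag = line[:2]
        if tag == '- ':      # present in truth, missing or different in speech
            altered.append(position)
            position += 1
        elif tag == '  ':    # present in both
            position += 1
        # '+ ' (speech only) and '? ' (hint lines) consume no truth word
    return altered


ground_truth = 'The quick brown fox jumps over the lazy dog.'
hypothesis = 'the quick brown box jumps over the dog'

original_words = ground_truth.split()
altered = set(altered_truth_indices(
    [normalize(w) for w in original_words],
    [normalize(w) for w in hypothesis.split()],
))
highlighted = ' '.join(
    w.upper() if i in altered else w for i, w in enumerate(original_words)
)
print(highlighted)
```

Since uppercasing is applied to the untouched `original_words`, trailing punctuation such as the final period is kept.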

Answered 2023-03-22

神不在的星期二

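Contributed 1963 experience points, received 6+ likes

So I think I've solved this. I realized that difflib's context_diff reports the indices of the lines that contain changes. To get indices into the ground-truth text, I removed capitalization and punctuation, split the text into individual words, and then did the following: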

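import difflib

altered_word_indices = []
# n=0 drops context lines, so each hunk header covers only the changed words.
diff = difflib.context_diff(transformed_ground_truth, transformed_hypothesis, n=0)
for line in diff:
    # Ground-truth hunk headers look like '*** 4 ****' or '*** 4,6 ****' (1-based).
    # Caveat: a hunk that only inserts hypothesis words also emits a header,
    # so the truth word just before the insertion is flagged as well.
    if line.startswith('*** ') and line.endswith(' ****\n'):
        line = line.replace(' ', '').replace('\n', '').replace('*', '')
        if ',' in line:
            split_line = line.split(',')
            for i in range(0, (int(split_line[1]) - int(split_line[0])) + 1):
                altered_word_indices.append((int(split_line[0]) + i) - 1)
        else:
            altered_word_indices.append(int(line) - 1)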

After that, I print the ground truth with the altered words uppercased:

split_ground_truth = ground_truth.split(' ')
for i in range(0, len(split_ground_truth)):
    if i in altered_word_indices:
        print(split_ground_truth[i].upper(), end=' ')
    else:
        print(split_ground_truth[i], end=' ')

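This lets me print "The quick brown FOX jumps over the LAZY dog." (capitalization and punctuation included) instead of "the quick brown FOX jumps over the LAZY dog".

It's not a super elegant solution, and it still needs testing, cleanup, error handling, and so on. But it seems like a decent start and may be useful to others with the same problem. I'll leave the question open for a few days in case someone comes up with a less hacky way to get the same result.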
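As one possibly less hacky route, the same indices can be read from `difflib.SequenceMatcher.get_opcodes()` instead of parsing the text headers of a context diff. This is only a sketch under the same assumptions as above (both inputs already lowercased, stripped of punctuation, and split into words); `changed_truth_indices` is a hypothetical name, and the `strip('.,')` normalization is a simplification chosen for this example.

```python
import difflib


def changed_truth_indices(truth_words, hypothesis_words):
    """Indices of truth words that were replaced or dropped in the hypothesis."""
    matcher = difflib.SequenceMatcher(
        None, truth_words, hypothesis_words, autojunk=False
    )
    indices = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # 'replace' and 'delete' cover truth words i1..i2-1; 'insert' has
        # i1 == i2 and 'equal' means no error, so neither contributes.
        if tag in ('replace', 'delete'):
            indices.extend(range(i1, i2))
    return indices


ground_truth = 'The quick brown fox jumps over the lazy dog.'
hypothesis = 'the quick brown box jumps over the dog'

split_ground_truth = ground_truth.split(' ')
transformed_ground_truth = [w.strip('.,').lower() for w in split_ground_truth]
transformed_hypothesis = hypothesis.lower().split(' ')

altered_word_indices = set(
    changed_truth_indices(transformed_ground_truth, transformed_hypothesis)
)
result = ' '.join(
    w.upper() if i in altered_word_indices else w
    for i, w in enumerate(split_ground_truth)
)
print(result)
```

The opcodes are already 0-based index ranges, so no string parsing or off-by-one adjustment is needed, and words that appear only in the hypothesis do not flag any ground-truth word.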

Answered 2023-03-22
