How to get the offsets of a matched n-gram in a text

I want to match a string (an n-gram) in a text and get its offsets:

    string_to_match = "many workers are very underpaid"
    text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."

As a result I would like a tuple such as ("matched", 44, 75), where 44 is the start and 75 is the end. This is the code I have built so far, but it only works for unigrams:

    def extract_offsets(line, _len=len):
        words = line.split()
        index = line.index
        offsets = []
        append = offsets.append
        running_offset = 0
        for word in words:
            word_offset = index(word, running_offset)
            word_len = _len(word)
            running_offset = word_offset + word_len
            append(("matched", word_offset, running_offset - 1))
        return offsets

    def get_entities(offsets):
        entities = []
        for elm in offsets:
            if elm[0] == "string_to_match":  # here string_to_match is only one word
                entities.append(elm)
        return entities

    offsets = extract_offsets(text)
    entities = get_entities(offsets)  # [("matched", start, end)]

Any tips on making this work for sequences of words, i.e. n-grams?


1 Answer

鸿蒙传说


You can use re.finditer() and call the span() method on each match object to get the start and end indices of the matched substring:

    import re

    def m():
        string_to_match = "many workers are very underpaid"
        text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."
        # re.escape() makes the n-gram match literally, even if it contains regex metacharacters
        for x in re.finditer(re.escape(string_to_match), text):
            # x.span() returns the (start, end) indices of the matched substring as a tuple; end is exclusive
            print(x.group(0), x.span())
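
To get tuples in the ("matched", start, end) shape the question asks for, the same idea could be wrapped in a small helper. This is only a sketch: the name find_ngram_offsets and its label parameter are hypothetical, and the word-boundary lookarounds are an assumption about the desired behaviour (they stop "are" from matching inside "care"). The end offset is exclusive, which is what yields 75 for the example text.

```python
import re

def find_ngram_offsets(ngram, text, label="matched"):
    """Return (label, start, end) for every occurrence of ngram in text.

    end is exclusive, so text[start:end] == ngram.
    """
    # Escape the n-gram so characters such as '.' or '(' are matched literally,
    # and require that it is not glued to a word character on either side
    # (assumption: matches should fall on word boundaries).
    pattern = r"(?<!\w)" + re.escape(ngram) + r"(?!\w)"
    return [(label, m.start(), m.end()) for m in re.finditer(pattern, text)]

text = ("The new york times claimed in a report that many workers "
        "are very underpaid in some africans countries.")
offsets = find_ngram_offsets("many workers are very underpaid", text)
print(offsets)  # [('matched', 44, 75)]
```

If an inclusive end index is preferred, as in the question's extract_offsets, subtract 1 from m.end().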

2022-05-24
