
How to get the offsets of a matched n-gram in a text


噜噜哒 2022-05-24 15:54:34
I want to match a string (an n-gram) in a text and get its offsets:

string_to_match = "many workers are very underpaid"
text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."

As a result, I would like to get a tuple such as ("matched", 44, 75), where 44 is the start and 75 is the end. This is the code I have built, but it only works for unigrams:

def extract_offsets(line, _len=len):
    words = line.split()
    index = line.index
    offsets = []
    append = offsets.append
    running_offset = 0
    for word in words:
        word_offset = index(word, running_offset)
        word_len = _len(word)
        running_offset = word_offset + word_len
        append(("matched", word_offset, running_offset - 1))
    return offsets

def get_entities(offsets):
    entities = []
    for elm in offsets:
        if elm[0] == "string_to_match": # here string_to_match is only one word
            entities.append(elm)
    return entities

offsets = extract_offsets(text)
entities = get_entities(offsets) # [("matched", start, end)]

Any hints on making this work for a sequence of strings, i.e. an n-gram?
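One possible way to generalize the unigram code in the question is to keep the per-word offsets and slide a window of n words over them. The sketch below is only an illustration under some assumptions: it reuses the names `extract_offsets` and `get_entities` from the question but changes their signatures, it stores each word with its offsets instead of the fixed label, and it treats the end offset as exclusive so that the result matches the ("matched", 44, 75) tuple asked for.

```python
def extract_offsets(line):
    """Return (word, start, end) for each whitespace-separated word; end is exclusive."""
    offsets = []
    running_offset = 0
    for word in line.split():
        start = line.index(word, running_offset)
        running_offset = start + len(word)
        offsets.append((word, start, running_offset))
    return offsets


def get_entities(line, ngram):
    """Return ("matched", start, end) for every run of words equal to the n-gram."""
    target = ngram.split()
    n = len(target)
    if n == 0:
        return []
    offsets = extract_offsets(line)
    words = [word for word, _, _ in offsets]
    entities = []
    # Slide a window of n words over the text and compare it with the n-gram.
    for i in range(len(words) - n + 1):
        if words[i:i + n] == target:
            # Start of the first word in the window, end of the last one.
            entities.append(("matched", offsets[i][1], offsets[i + n - 1][2]))
    return entities


text = ("The new york times claimed in a report that many workers "
        "are very underpaid in some africans countries.")
print(get_entities(text, "many workers are very underpaid"))  # [('matched', 44, 75)]
```

Because the comparison is done on whitespace-separated tokens, a word with attached punctuation such as "countries." would not match "countries"; stripping punctuation before comparing would be one way around that, if it matters for the data at hand.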

1 Answer

鸿蒙传说


You can use re.finditer() and call the span() method on each match object to get the start and end indices of the matched substring:


import re

def m():
    string_to_match = "many workers are very underpaid"
    text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."
    # re.escape() makes any regex metacharacters in the n-gram match literally
    matches = re.finditer(re.escape(string_to_match), text)
    for x in matches:
        # x.span() returns the (start, end) indices of the matched substring as a tuple; end is exclusive
        print(x.group(0), x.span())
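Building on the re.finditer() approach above, here is a minimal sketch that returns the matches in the ("matched", start, end) form the question asks for. The function name `find_ngram_offsets` and the `whole_words` parameter are hypothetical, introduced only for illustration; the lookarounds are one assumed way to avoid matching inside a longer word.

```python
import re


def find_ngram_offsets(text, ngram, whole_words=True):
    """Return ("matched", start, end) for each occurrence of ngram; end is exclusive."""
    # Escape the n-gram so characters such as "." or "?" are matched literally.
    pattern = re.escape(ngram)
    if whole_words:
        # Reject matches that are directly preceded or followed by a word character.
        pattern = r"(?<!\w)" + pattern + r"(?!\w)"
    return [("matched", m.start(), m.end()) for m in re.finditer(pattern, text)]


text = ("The new york times claimed in a report that many workers "
        "are very underpaid in some africans countries.")
print(find_ngram_offsets(text, "many workers are very underpaid"))  # [('matched', 44, 75)]
```

With `whole_words=False` the function behaves like a plain substring search, so "work" would also be found inside "workers".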



Answered 2022-05-24