I want to match a string (an n-gram) in a text and get its offsets:

string_to_match = "many workers are very underpaid"
text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."

As a result I want a tuple like ("matched", 44, 75), where 44 is the start and 75 is the end. This is the code I have built, but it only works for unigrams.

def extract_offsets(line, _len=len):
    words = line.split()
    index = line.index
    offsets = []
    append = offsets.append
    running_offset = 0
    for word in words:
        word_offset = index(word, running_offset)
        word_len = _len(word)
        running_offset = word_offset + word_len
        append(("matched", word_offset, running_offset - 1))
    return offsets

def get_entities(offsets):
    entities = []
    for elm in offsets:
        if elm[0] == "string_to_match":  # here string_to_match is only one word
            entities.append(elm)
    return entities

offsets = extract_offsets(text)
entities = get_entities(offsets)  # [("matched", start, end)]

Any hints on making this work for a sequence of words, i.e. an n-gram?
1 Answer
鸿蒙传说
You can use re.finditer() and call the span() method on each match object to get the start and end indices of the matched substring:
import re

def m():
    string_to_match = "many workers are very underpaid"
    text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."
    # re.escape() makes the n-gram match literally, even if it contains regex metacharacters
    matches = re.finditer(re.escape(string_to_match), text)
    for x in matches:
        # x.span() returns the (start, end) indices of the matched substring as a tuple; end is exclusive
        print(x.group(0), x.span())
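
To get tuples in the ("matched", start, end) shape asked for in the question, the same idea can be wrapped in a small helper. This is only a minimal sketch: the function name find_ngram_offsets and its label parameter are made up for illustration and are not part of the original answer. It assumes the n-gram should be matched literally (hence re.escape) and that the end offset is exclusive, as returned by the match object.

```python
import re

def find_ngram_offsets(ngram, text, label="matched"):
    """Return (label, start, end) for every literal occurrence of ngram in text.

    Hypothetical helper: the end offset is exclusive, so text[start:end] == ngram.
    """
    pattern = re.compile(re.escape(ngram))
    return [(label, match.start(), match.end()) for match in pattern.finditer(text)]

string_to_match = "many workers are very underpaid"
text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."

print(find_ngram_offsets(string_to_match, text))
# [('matched', 44, 75)]
```

If an inclusive end offset is preferred, as in the question's own extract_offsets, subtract 1 from match.end().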