2 Answers
Contributed 1875 experience points, received more than 3 likes
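I found that the spaCy matcher returns its matches sorted by their position in the text, even when a term listed earlier in the terminology list is found later in the text than another term. That means I can end each span just before the start index of the next match. Here is code to show what I mean: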
import en_core_web_sm
from spacy.matcher import PhraseMatcher

data = "Species:cat color:orange and white with yellow spots number feet: 4"

nlp = en_core_web_sm.load()
data = data.lower()  # lowercase the text so it matches the lowercase terms

matcher = PhraseMatcher(nlp.vocab)
terminology_list = ["species", "color", "number feet"]
patterns = list(nlp.tokenizer.pipe(terminology_list))
# spaCy v2 call signature; in spaCy v3 this becomes matcher.add("Terms", patterns)
matcher.add("Terms", None, *patterns)

doc = nlp(data)
matches = matcher(doc)

matched_phrases = {}
for idd, (match_id, start, end) in enumerate(matches):
    key_match = doc[start:end]
    # The value runs up to the start of the next match, or to the end of the doc
    if idd != len(matches) - 1:
        end_index = matches[idd + 1][1]
    else:
        end_index = len(doc)
    phrase = doc[end:end_index]
    if phrase.text != '':
        matched_phrases[key_match] = phrase
print(matched_phrases)
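For comparison, the same "end each value where the next key starts" idea can be sketched with only the standard library. This is an illustrative sketch, not part of the answer above: the helper name `extract_fields` is hypothetical, and it assumes the key terms are known in advance and do not also occur inside a value.

```python
import re


def extract_fields(text, terms):
    """Map each known term to the text between it and the next term."""
    # Try longer terms first so a multi-word term is preferred over a shorter one.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(r"\b(" + alternatives + r")\b\s*:?", re.IGNORECASE)

    # finditer yields matches in the order they occur in the text.
    matches = list(pattern.finditer(text))
    fields = {}
    for i, match in enumerate(matches):
        # The value ends where the next key starts, or at the end of the text.
        if i + 1 < len(matches):
            value_end = matches[i + 1].start()
        else:
            value_end = len(text)
        value = text[match.end():value_end].strip()
        if value:
            fields[match.group(1).lower()] = value
    return fields


data = "Species:cat color:orange and white with yellow spots number feet: 4"
print(extract_fields(data, ["species", "color", "number feet"]))
```

Because the multi-word term "number feet" is matched as a whole, it ends up as its own key instead of leaking into the value of "color".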
Contributed 1830 experience points, received more than 9 likes
I have an idea that does not use spaCy.
First, I split the string into tokens:
split = "Species:cat color:orange and white with yellow spots number feet: 4".replace(": ", ":").split()
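Then I iterate over the token list. A token containing a colon starts a new key, and every following token is appended to that key's value until the next key appears: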
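goal = []
key_value = None
for token in split:
    if ":" in token:
        # A token containing ":" starts a new key; store the previous pair first
        if key_value:
            goal.append(key_value)
        key_value = token
    else:
        # No colon: this token continues the current value
        # (assumes the first token of the input contains a key)
        key_value += " " + token
goal.append(key_value)  # store the last pair
goal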
>>>
['Species:cat', 'color:orange and white with yellow spots number', 'feet:4']
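If a dictionary is more convenient than a list of "key:value" strings, the result can be converted in one step. The sketch below starts from the list shown above; the variable name `pairs` is only illustrative.

```python
goal = ['Species:cat', 'color:orange and white with yellow spots number', 'feet:4']

# Split each entry on the first colon only, so a value that itself
# contains ":" would stay in one piece.
pairs = dict(item.split(":", 1) for item in goal)
print(pairs)
```

Note that this approach splits on whitespace, so the two-word key "number feet" is broken apart: "number" stays attached to the value of "color" and the key becomes "feet".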