
Python sequence matcher with a custom matching function


杨__羊羊 2021-11-30 16:43:05
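I have two lists and I want to find the matching elements using Python's difflib SequenceMatcher. It goes like this:

from difflib import SequenceMatcher

def match_seq(list1, list2):
    output = []
    s = SequenceMatcher(None, list1, list2)
    blocks = s.get_matching_blocks()
    for bl in blocks:
        # print(bl, bl.a, bl.b, bl.size)
        for bi in range(bl.size):
            cur_a = bl.a + bi
            cur_b = bl.b + bi
            output.append((cur_a, cur_b))
    return output

So when I run it on two lists like these:

list1 = ["orange", "apple", "lemons", "grapes"]
list2 = ["pears", "orange", "apple", "lemons", "cherry", "grapes"]
for a, b in match_seq(list1, list2):
    print(a, b, list1[a], list2[b])

I get this output:

(0, 1, 'orange', 'orange')
(1, 2, 'apple', 'apple')
(2, 3, 'lemons', 'lemons')
(3, 5, 'grapes', 'grapes')

But suppose I don't want to match only identical items, and instead want to use a matching function (for example, a function that can match orange with oranges or vice versa, or match the equivalent word in another language).

list3 = ["orange", "apple", "lemons", "grape"]
list4 = ["pears", "oranges", "apple", "lemon", "cherry", "grapes"]
list5 = ["peras", "naranjas", "manzana", "limón", "cereza", "uvas"]

Is there any option in difflib's SequenceMatcher, or in any other built-in Python library, that provides this, so that I can match list3 with list4, and also list3 with list5, the same way I did for list1 and list2?

In general, can you think of a solution? I thought of replacing each word in the target list with the possible equivalents I want to match, but that can be problematic because I may need several equivalents for each word, which could disturb the sequence.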

3 Answers

牧羊人nacy

Contributed 1862 experience points, received more than 7 upvotes

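You basically have three options: 1) write your own implementation of diff; 2) hack the difflib module; 3) find a workaround.


Your own implementation

In case 1), you can look at this question and read some books, such as CLRS or Robert Sedgewick's books.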
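
To give an idea of what "your own implementation" could look like, here is a minimal sketch of the textbook longest-common-subsequence dynamic program, with the equality test replaced by a caller-supplied predicate. It is not the algorithm difflib uses, so results can differ from SequenceMatcher on other inputs. The names `lcs_pairs`, `same` and `singular_plural` are made up for this illustration.

```python
def lcs_pairs(a, b, same):
    """Return index pairs (i, j) of a longest common subsequence of a and b,
    where same(x, y) decides whether two elements match."""
    n, m = len(a), len(b)
    # table[i][j] = length of the LCS of a[i:] and b[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if same(a[i], b[j]):
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover the matched pairs.
    pairs = []
    i = j = 0
    while i < n and j < m:
        if same(a[i], b[j]):
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def singular_plural(x, y):
    # Toy matching function (an assumption): ignore any trailing "s".
    return x.rstrip("s") == y.rstrip("s")


list3 = ["orange", "apple", "lemons", "grape"]
list4 = ["pears", "oranges", "apple", "lemon", "cherry", "grapes"]
lcs_result = lcs_pairs(list3, list4, singular_plural)
print(lcs_result)  # [(0, 1), (1, 2), (2, 3), (3, 5)]
```

The table costs O(len(a) * len(b)) time and memory, which is fine for short lists but may be too much for very long ones.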


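Hacking the difflib module

In case 2), look at the source code: get_matching_blocks calls find_longest_match (at line 479 in the version of the source the answer refers to; line numbers differ between Python versions). At the core of find_longest_match is the b2j dictionary, which maps each element of list b to the list of indices where it occurs in b; the elements of list a are looked up in it. If you override this dictionary, you can get the behavior you want. Here is the standard version: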


>>> import difflib
>>> from difflib import SequenceMatcher
>>> list3 = ["orange","apple","lemons","grape"]
>>> list4 = ["pears", "oranges","apple", "lemon", "cherry", "grapes"]
>>> s = SequenceMatcher(None, list3, list4)
>>> s.get_matching_blocks()
[Match(a=1, b=2, size=1), Match(a=4, b=6, size=0)]
>>> [(b.a+i, b.b+i, list3[b.a+i], list4[b.b+i]) for b in s.get_matching_blocks() for i in range(b.size)]
[(1, 2, 'apple', 'apple')]

Here is the hacked version:


>>> s = SequenceMatcher(None, list3, list4)
>>> s.b2j
{'pears': [0], 'oranges': [1], 'apple': [2], 'lemon': [3], 'cherry': [4], 'grapes': [5]}
>>> s.b2j = {**s.b2j, 'orange':s.b2j['oranges'], 'lemons':s.b2j['lemon'], 'grape':s.b2j['grapes']}
>>> s.b2j
{'pears': [0], 'oranges': [1], 'apple': [2], 'lemon': [3], 'cherry': [4], 'grapes': [5], 'orange': [1], 'lemons': [3], 'grape': [5]}
>>> s.get_matching_blocks()
[Match(a=0, b=1, size=3), Match(a=3, b=5, size=1), Match(a=4, b=6, size=0)]
>>> [(b.a+i, b.b+i, list3[b.a+i], list4[b.b+i]) for b in s.get_matching_blocks() for i in range(b.size)]
[(0, 1, 'orange', 'oranges'), (1, 2, 'apple', 'apple'), (2, 3, 'lemons', 'lemon'), (3, 5, 'grape', 'grapes')]

This is not hard to automate, but I don't recommend this solution, because there is a very simple workaround.
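
For completeness, automating the b2j patch might look roughly like the sketch below. It rests on assumptions: `patched_matcher`, `same` and `singular_plural` are hypothetical names, and `b2j` is an implementation detail of SequenceMatcher rather than a documented extension point, so this may break in other Python versions.

```python
from difflib import SequenceMatcher


def patched_matcher(a, b, same):
    """Build a SequenceMatcher for a and b, then extend its b2j dictionary so
    that every element of a maps to the indices of the elements of b that
    same() treats as equivalent. Elements must be hashable."""
    s = SequenceMatcher(None, a, b, autojunk=False)
    b2j = dict(s.b2j)
    for x in set(a):
        # enumerate() yields indices in increasing order, as the matcher expects.
        indices = [j for j, y in enumerate(b) if same(x, y)]
        if indices:
            b2j[x] = indices
    s.b2j = b2j
    return s


def singular_plural(x, y):
    # Toy matching function (an assumption): ignore any trailing "s".
    return x.rstrip("s") == y.rstrip("s")


list3 = ["orange", "apple", "lemons", "grape"]
list4 = ["pears", "oranges", "apple", "lemon", "cherry", "grapes"]
s = patched_matcher(list3, list4, singular_plural)
patched_pairs = [(m.a + i, m.b + i)
                 for m in s.get_matching_blocks()
                 for i in range(m.size)]
print(patched_pairs)  # [(0, 1), (1, 2), (2, 3), (3, 5)]
```

The patch has to be applied before the first call to get_matching_blocks(), because the matcher caches its result.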


Workaround

The idea is to group words by family:


families = [{"pears", "peras"}, {"orange", "oranges", "naranjas"}, {"apple", "manzana"}, {"lemons", "lemon", "limón"}, {"cherry", "cereza"}, {"grape", "grapes"}]

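It is now easy to create a dictionary that maps every word of a family to one of those words (let's call it the main word). Because sets are unordered, which word becomes the main word is arbitrary and may change from run to run; all that matters is that every word of a family ends up mapped to the same word:


>>> d = {w:main for main, *alternatives in map(list, families) for w in alternatives}
>>> d
{'pears': 'peras', 'orange': 'naranjas', 'oranges': 'naranjas', 'manzana': 'apple', 'lemon': 'lemons', 'limón': 'lemons', 'cherry': 'cereza', 'grape': 'grapes'}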

Note that main, *alternatives in map(list, families) uses the star operator to unpack each family into a main word (the first element of the list) and a list of alternatives:


>>> head, *tail = [1,2,3,4,5]
>>> head
1
>>> tail
[2, 3, 4, 5]

You can then convert the lists so that they use only main words:


>>> list3=["orange","apple","lemons","grape"]
>>> list4=["pears", "oranges","apple", "lemon", "cherry", "grapes"]
>>> list5=["peras", "naranjas", "manzana", "limón", "cereza", "uvas"]
>>> [d.get(w, w) for w in list3]
['naranjas', 'apple', 'lemons', 'grapes']
>>> [d.get(w, w) for w in list4]
['peras', 'naranjas', 'apple', 'lemons', 'cereza', 'grapes']
>>> [d.get(w, w) for w in list5]
['peras', 'naranjas', 'apple', 'lemons', 'cereza', 'uvas']

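The expression d.get(w, w) returns d[w] if w is a key, otherwise w itself. So words that belong to a family are converted to that family's main word, and all other words are left unchanged.


These converted lists are easy to compare with difflib.


Important: the time complexity of the list conversion is negligible compared with the diff algorithm, so you should not notice any difference.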
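
Since the families are sets, the main word chosen by unpacking can vary between runs. If reproducible output matters, one possible variation (a sketch, not part of the answer's code) is to pick the alphabetically smallest word of each family as its representative. The name `build_canonical_map` is hypothetical.

```python
def build_canonical_map(families):
    """Map every word of every family to a deterministic representative:
    the smallest word of the family in string order."""
    mapping = {}
    for family in families:
        main = min(family)
        for word in family:
            mapping[word] = main
    return mapping


families = [{"pears", "peras"}, {"orange", "oranges", "naranjas"},
            {"apple", "manzana"}, {"lemons", "lemon", "limón"},
            {"cherry", "cereza"}, {"grape", "grapes", "uvas"}]
canon = build_canonical_map(families)

list3 = ["orange", "apple", "lemons", "grape"]
canonical_list3 = [canon.get(w, w) for w in list3]
print(canonical_list3)  # ['naranjas', 'apple', 'lemon', 'grape']
```

Unlike the dictionary in the answer, this one also contains the representative itself as a key, which changes nothing for d.get(w, w) style lookups.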


Full code

As a bonus, here is the full code:


from difflib import SequenceMatcher


def match_seq(list1, list2):
    """A generator that yields matches of list1 vs list2"""
    s = SequenceMatcher(None, list1, list2)
    for block in s.get_matching_blocks():
        for i in range(block.size):
            yield block.a + i, block.b + i  # no need to store the matches, just yield them


def create_convert(*families):
    """Return a converter function that converts a list
    to the same list with only main words"""
    d = {w:main for main, *alternatives in map(list, families) for w in alternatives}
    return lambda L: [d.get(w, w) for w in L]


families = [{"pears", "peras"}, {"orange", "oranges", "naranjas"}, {"apple", "manzana"}, {"lemons", "lemon", "limón"}, {"cherry", "cereza"}, {"grape", "grapes", "uvas"}]
convert = create_convert(*families)

list3=["orange","apple","lemons","grape"]
list4=["pears", "oranges","apple", "lemon", "cherry", "grapes"]
list5=["peras", "naranjas", "manzana", "limón", "cereza", "uvas"]

print("list3 vs list4")
for a,b in match_seq(convert(list3), convert(list4)):
    print(a,b, list3[a],list4[b])

# list3 vs list4
# 0 1 orange oranges
# 1 2 apple apple
# 2 3 lemons lemon
# 3 5 grape grapes

print("list3 vs list5")
for a,b in match_seq(convert(list3), convert(list5)):
    print(a,b, list3[a],list5[b])

# list3 vs list5
# 0 1 orange naranjas
# 1 2 apple manzana
# 2 3 lemons limón
# 3 5 grape uvas


Posted 2021-11-30
慕的地10843

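Contributed 1785 experience points, received more than 8 upvotes

Here is an approach that uses a class inheriting from UserString and overriding __eq__() and __hash__(), so that strings considered synonyms compare as equal: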


import collections
from difflib import SequenceMatcher


class SynonymString(collections.UserString):
    def __init__(self, seq, synonyms, inverse_synonyms):
        super().__init__(seq)
        self.synonyms = synonyms                  # base word -> list of synonyms
        self.inverse_synonyms = inverse_synonyms  # synonym -> base word

    def __eq__(self, other):
        # Equal if this string is listed as a synonym of `other`
        if self.synonyms.get(other) and self.data in self.synonyms.get(other):
            return True
        return self.data == other

    def __hash__(self):
        # Synonyms must hash like their base word so that dict lookups find them
        if str(self.data) in self.inverse_synonyms:
            return hash(self.inverse_synonyms[self.data])
        return hash(self.data)


def match_seq_syn(list1, list2, synonyms):
    inverse_synonyms = {
        string: key for key, value in synonyms.items() for string in value
    }

    list1 = [SynonymString(s, synonyms, inverse_synonyms) for s in list1]
    list2 = [SynonymString(s, synonyms, inverse_synonyms) for s in list2]

    output = []
    s = SequenceMatcher(None, list1, list2)
    blocks = s.get_matching_blocks()

    for bl in blocks:
        for bi in range(bl.size):
            cur_a = bl.a + bi
            cur_b = bl.b + bi
            output.append((cur_a, cur_b))
    return output


list3 = ["orange", "apple", "lemons", "grape"]
list5 = ["peras", "naranjas", "manzana", "limón", "cereza", "uvas"]

synonyms = {
    "orange": ["oranges", "naranjas"],
    "apple": ["manzana"],
    "pears": ["peras"],
    "lemon": ["lemons", "limón"],
    "cherry": ["cereza"],
    "grape": ["grapes", "uvas"],
}

for a, b in match_seq_syn(list3, list5, synonyms):
    print(a, b, list3[a], list5[b])

Result (comparing list3 and list5):


0 1 orange naranjas
1 2 apple manzana
2 3 lemons limón
3 5 grape uvas
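
The same idea of making __eq__() and __hash__() agree can be expressed with a small wrapper that compares words by a canonical key. This is a sketch under assumptions, not the answer's code: the class `Canonical`, the function `match_seq_canonical` and the dictionary `canon` are hypothetical names.

```python
from difflib import SequenceMatcher


class Canonical:
    """Wrap a word so that equality and hashing both use its canonical key."""

    def __init__(self, word, canon):
        self.word = word
        self.key = canon.get(word, word)  # words without synonyms are their own key

    def __eq__(self, other):
        if isinstance(other, Canonical):
            return self.key == other.key
        return NotImplemented

    def __hash__(self):
        # Equal objects must have equal hashes, so hash the same key.
        return hash(self.key)


def match_seq_canonical(list1, list2, synonyms):
    # Invert the synonyms dictionary: synonym -> base word
    canon = {syn: base for base, syns in synonyms.items() for syn in syns}
    a = [Canonical(w, canon) for w in list1]
    b = [Canonical(w, canon) for w in list2]
    s = SequenceMatcher(None, a, b)
    return [(m.a + i, m.b + i)
            for m in s.get_matching_blocks()
            for i in range(m.size)]


synonyms = {
    "orange": ["oranges", "naranjas"],
    "apple": ["manzana"],
    "pears": ["peras"],
    "lemon": ["lemons", "limón"],
    "cherry": ["cereza"],
    "grape": ["grapes", "uvas"],
}
list3 = ["orange", "apple", "lemons", "grape"]
list5 = ["peras", "naranjas", "manzana", "limón", "cereza", "uvas"]
canonical_pairs = match_seq_canonical(list3, list5, synonyms)
print(canonical_pairs)  # [(0, 1), (1, 2), (2, 3), (3, 5)]
```

Keeping both methods tied to one key makes it easier to see that two objects that compare equal always have the same hash, which dictionary-based code such as SequenceMatcher relies on.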


Posted 2021-11-30
呼唤远方

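Contributed 1856 experience points, received more than 11 upvotes

So, suppose you fill lists with the elements that should match each other. I did not use any library, only generators. I'm not sure about the efficiency, and I only tried this code once, but I think it should work fine.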


orange_list = ["orange", "oranges"] # Fill this with orange matching words
pear_list = ["pear", "pears"]
lemon_list = ["lemon", "lemons"]
apple_list = ["apple", "apples"]
grape_list = ["grape", "grapes"]

lists = [orange_list, pear_list, lemon_list, apple_list, grape_list] # Put your matching lists inside this list


def match_seq_bol(list1, list2):
    output = []
    # enumerate() gives the right index even when a word occurs more than once
    for i, x in enumerate(list1):
        for lst in lists:
            if x not in lst:
                continue
            # Generator of (index, word) pairs from list2 that are in the same group as x
            matches = ((j, y) for j, y in enumerate(list2) if y in lst)
            for j, y in matches:
                output.append((i, j, x, y))
    return output


list3=["orange","apple","lemons","grape"]
list4=["pears", "oranges","apple", "lemon", "cherry", "grapes"]

print(match_seq_bol(list3, list4))

match_seq_bol() stands for "match sequences based on lists".


The output for matching list3 and list4 will be:


[
    (0, 1, 'orange', 'oranges'),
    (1, 2, 'apple', 'apple'),
    (2, 3, 'lemons', 'lemon'),
    (3, 5, 'grape', 'grapes')
]
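
On the efficiency question, a possible variation is to index the groups once with dictionaries instead of scanning every group for every word. The sketch below assumes each word belongs to at most one group, and `match_by_group` is a hypothetical name. Like the code above, it pairs words by group membership only and does not take the order of the two sequences into account.

```python
from collections import defaultdict


def match_by_group(list1, list2, groups):
    """Pair every word of list1 with every word of list2 in the same group."""
    # word -> index of the group it belongs to
    group_of = {word: g for g, words in enumerate(groups) for word in words}
    # group index -> positions in list2 holding a word of that group
    positions = defaultdict(list)
    for j, y in enumerate(list2):
        if y in group_of:
            positions[group_of[y]].append(j)
    output = []
    for i, x in enumerate(list1):
        if x in group_of:
            for j in positions.get(group_of[x], []):
                output.append((i, j, x, list2[j]))
    return output


groups = [["orange", "oranges"], ["pear", "pears"], ["lemon", "lemons"],
          ["apple", "apples"], ["grape", "grapes"]]
list3 = ["orange", "apple", "lemons", "grape"]
list4 = ["pears", "oranges", "apple", "lemon", "cherry", "grapes"]
group_matches = match_by_group(list3, list4, groups)
print(group_matches)
```

Each word is looked up in a dictionary, so the work grows with the lengths of the two lists plus the number of reported pairs rather than with the product of list lengths and group count.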


Posted 2021-11-30