
Find the longest repeated sequence in a string

Python, regular expressions

开心每一天1111 2019-12-26 14:19:14

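I need to find the longest sequence in a string, with the caveat that the sequence must be repeated three or more times. For example, if my string is: fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld then I would like the value "helloworld" to be returned. I know a few ways of accomplishing this, but the problem I am facing is that the actual string is very large, so I am really looking for a method that can do it in a reasonable amount of time.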


3 answers

慕田峪9158850

Contributed 1794 experience points, received more than 8 likes

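This problem is a variant of the longest repeated substring problem, and it can be solved in O(n) time with a suffix tree. The idea (as suggested by Wikipedia) is to build the suffix tree (O(n) time), annotate every node with the number of leaves beneath it (O(n) time using a DFS), and then find the deepest node, measured by the length of the string it spells out, that has at least three leaf descendants (O(n) time using a DFS). Each leaf corresponds to one suffix, so a node with three or more leaves below it spells a substring that occurs at least three times. The overall algorithm takes O(n) time.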

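That said, suffix trees are notoriously hard to build, so before attempting this you will probably want to find a Python library that implements suffix trees for you. A quick Google search turned up this library, though I am not sure whether it is a good implementation.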
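If a suffix tree library is not at hand, the same idea can be sketched with sorted suffixes instead. This is a swapped-in technique, not the O(n) suffix tree algorithm described above: any substring that occurs at least three times is a common prefix of three suffixes that sit next to each other in sorted order, so it is enough to compare the first and last suffix in every window of three. The function name `longest_repeated` and the parameter `min_occ` are made up for this illustration, and the naive sort used here is far from linear time on large inputs.

```python
def longest_repeated(s, min_occ=3):
    """Longest substring of s occurring at least min_occ times (overlaps allowed)."""
    n = len(s)
    # Naive suffix array: sort start positions by the suffix they begin.
    # Simple, but slicing every suffix costs O(n^2) memory in the worst case.
    sa = sorted(range(n), key=lambda i: s[i:])
    best = ""
    # Slide a window of min_occ adjacent suffixes over the sorted order.
    for j in range(n - min_occ + 1):
        a, b = sa[j], sa[j + min_occ - 1]
        # The common prefix of the first and last suffix in the window
        # is shared by every suffix in between.
        k = 0
        while a + k < n and b + k < n and s[a + k] == s[b + k]:
            k += 1
        if k > len(best):
            best = s[a:a + k]
    return best


text = ("fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1a"
        "helloworldgafgfa4564534321fadghelloworld")
print(longest_repeated(text))  # helloworld
```

For very large strings, a proper suffix array construction would replace the `sorted` call; the window scan itself stays the same.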

Hope this helps!

2019-12-26

犯罪嫌疑人X

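Contributed 2080 experience points, received more than 4 likes

Use a defaultdict to tally every substring that starts at each position in the input string. The OP was not clear on whether overlapping matches should be counted; this brute-force approach counts them.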

from collections import defaultdict

def getsubs(loc, s):
    # yield every substring of s that starts at index loc, longest first
    substr = s[loc:]
    i = -1
    while substr:
        yield substr
        substr = s[loc:i]
        i -= 1

def longestRepetitiveSubstring(r, minocc=3):
    occ = defaultdict(int)
    # tally all occurrences of all substrings
    for i in range(len(r)):
        for sub in getsubs(i, r):
            occ[sub] += 1
    # filter out all substrings with fewer than minocc occurrences
    occ_minocc = [k for k, v in occ.items() if v >= minocc]
    if occ_minocc:
        maxkey = max(occ_minocc, key=len)
        return maxkey, occ[maxkey]
    else:
        raise ValueError("no repetitions of any substring of '%s' with %d or more occurrences" % (r, minocc))

print(longestRepetitiveSubstring("fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld"))

Prints:

('helloworld', 3)
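To make the point about overlapping matches concrete, here is a small sketch comparing the two ways of counting. The helper `count_overlapping` is a made-up name for this illustration; `str.count` is the built-in, which counts non-overlapping occurrences only.

```python
def count_overlapping(s, sub):
    """Count occurrences of sub in s, letting matches overlap (sub must be non-empty)."""
    count = 0
    start = s.find(sub)
    while start != -1:
        count += 1
        # Resume one character after the last match start, so overlaps are found.
        start = s.find(sub, start + 1)
    return count


print(count_overlapping("aaaa", "aa"))  # 3 (matches start at 0, 1 and 2)
print("aaaa".count("aa"))               # 2 (non-overlapping only)
```

With the tally above, "aa" would qualify as occurring three times in "aaaa"; under non-overlapping counting it would not.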

2019-12-26

ibeautiful

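Contributed 1993 experience points, received more than 6 likes

Let's start from the longest candidate length, count frequencies, and stop as soon as the most frequent element appears 3 or more times.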

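from collections import Counter

a = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
times = 3
seq = None  # stays None if no substring occurs `times` or more times

# try lengths from longest to shortest; the upper bound len(a) // times
# assumes the occurrences do not overlap (// keeps range() valid on Python 3)
for n in range(len(a) // times, 0, -1):
    substrings = [a[i:i+n] for i in range(len(a) - n + 1)]
    freqs = Counter(substrings)
    top, count = freqs.most_common(1)[0]
    if count >= times:
        seq = top
        break

if seq is not None:
    print("sequence '%s' of length %s occurs %s or more times" % (seq, n, times))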

Result:

>>> sequence 'helloworld' of length 10 occurs 3 or more times

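Edit: if you suspect you are dealing with random input, where the common substring should be short, it is better (if you need speed) to start with short substrings and stop as soon as no substring of the current length occurs at least 3 times: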

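from collections import Counter

a = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
times = 3
seq = ''  # stays empty if not even a single character occurs `times` times

# try lengths from shortest to longest and stop at the first length that fails
for n in range(1, len(a) // times + 1):
    substrings = [a[i:i+n] for i in range(len(a) - n + 1)]
    freqs = Counter(substrings)
    top, count = freqs.most_common(1)[0]
    if count < times:
        n -= 1  # the previous length was the last one that succeeded
        break
    seq = top

print("sequence '%s' of length %s occurs %s or more times" % (seq, n, times))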

Same result as above.
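Both loops rely on the same property: if a substring of length n occurs at least three times, so does its prefix of length n - 1. That makes the set of workable lengths contiguous, so a binary search over the length is another option when the answer could be either short or long. The following is only a sketch under that assumption; `most_common_of_length` and `longest_repeated_bisect` are hypothetical names, and occurrences are counted with overlaps allowed, as `Counter` does in the code above.

```python
from collections import Counter


def most_common_of_length(s, n):
    """Return (substring, count) for the most frequent length-n substring of s."""
    counts = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    return counts.most_common(1)[0]


def longest_repeated_bisect(s, times=3):
    """Binary-search the largest length at which some substring occurs >= times."""
    lo, hi = 0, len(s)  # length lo is known to work; lengths above hi do not
    best = ""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        sub, count = most_common_of_length(s, mid)
        if count >= times:
            lo, best = mid, sub  # mid works, look for something longer
        else:
            hi = mid - 1         # mid fails, so every longer length fails too
    return best


a = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
print(longest_repeated_bisect(a))  # helloworld
```

This tries only about log2(len(a)) lengths instead of scanning them one by one, although each try still builds every substring of that length.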

2019-12-26
