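If I understand correctly, your problem is a variant of the gaps-and-islands problem. Each monotonic (increasing or decreasing) subsequence with an acceptable gap forms an island. For example, given a series s: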
s    island
--   ------
0    1
0    1
1    1
3    2    # gap > 1, form new island
4    2
2    3    # gap < -1 (stops increasing), form new island
1    3
0    3
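In general: a new island starts whenever the difference between the current row and the previous row falls outside the range [-1, 1].
Apply this gaps-and-islands algorithm to Query Segment Id and Reference Segment Id: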
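The rule can be checked on the example series s with a few lines of pandas. This is a minimal sketch; the variable names `gap`, `starts_new_island` and `island` are illustrative and not part of the original answer.

```python
import pandas as pd

s = pd.Series([0, 0, 1, 3, 4, 2, 1, 0])

# Difference to the previous row; the first row has no predecessor, so it is NaN.
gap = s.diff()

# True wherever the gap falls outside [-1, 1]. NaN fails the between() test,
# so the first row also counts as the start of an island.
starts_new_island = ~gap.between(-1, 1)

# A running count of island starts gives every row its island label.
island = starts_new_island.cumsum()

print(pd.DataFrame({"s": s, "gap": gap, "island": island}))
```

The `island` column should match the labels in the table above: 1, 1, 1, 2, 2, 3, 3, 3.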
Query Segment Id   Q Island   Reference Segment Id   R Island   Q-R Intersection
----------------   --------   --------------------   --------   ----------------
1                  1          1                      1          (1, 1)
2                  1          2                      1          (1, 1)
3                  1          3                      1          (1, 1)
0                  2          4                      1          (2, 1)
1                  2          5                      1          (2, 1)
2                  2          6                      1          (2, 1)
3                  2          7                      1          (2, 1)
4                  2          8                      1          (2, 1)
0                  3          9                      1          (3, 1)
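The q and r ranges you are looking for are now the Query Segment Id and Reference Segment Id values at the start and end of each Q-R Intersection. One final caveat: ignore intersections of length 1 (such as the last one).
Code: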
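import numpy as np
import pandas as pd

columns = ['Query Segment Id', 'Reference Segment Id']
# data_with_multiple_contiguous_sequences is the input data from the question
df = pd.DataFrame(data_with_multiple_contiguous_sequences, columns=columns)

def get_island(col):
    # Start a new island wherever the step from the previous row is outside [-1, 1]
    return (~col.diff().between(-1, 1)).cumsum()

df[['Q Island', 'R Island']] = df[['Query Segment Id', 'Reference Segment Id']].apply(get_island)

# One row per Q-R intersection, holding the first and last id on each side.
# Intersections with Count == 1 are turned into NaN and then dropped.
result = df.groupby(['Q Island', 'R Island']) \
    .agg(**{
        'Q Start': ('Query Segment Id', 'first'),
        'Q End': ('Query Segment Id', 'last'),
        'R Start': ('Reference Segment Id', 'first'),
        'R End': ('Reference Segment Id', 'last'),
        'Count': ('Query Segment Id', 'count')
    }) \
    .replace({'Count': 1}, {'Count': np.nan}) \
    .dropna()

# Pack the start/end pairs into (start, end) tuples
result['Q'] = result[['Q Start', 'Q End']].apply(tuple, axis=1)
result['R'] = result[['R Start', 'R End']].apply(tuple, axis=1)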
Result:
                   Q Start  Q End  R Start  R End  Count       Q       R
Q Island R Island
1        1               1      3        1      3      3  (1, 3)  (1, 3)
2        1               0      4        4      8      5  (0, 4)  (4, 8)
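For anyone who wants to try this without the question's data, here is a self-contained sketch that hardcodes the nine sample rows from the intersection table earlier in this answer. It swaps the replace/dropna step for a plain boolean filter on the count, which is an alternative and not the code above; the lowercase column names and the `ranges` list are illustrative assumptions.

```python
import pandas as pd

# The nine sample rows from the intersection table (hardcoded for illustration).
df = pd.DataFrame({
    "Query Segment Id":     [1, 2, 3, 0, 1, 2, 3, 4, 0],
    "Reference Segment Id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

def get_island(col):
    # New island wherever the step from the previous row is outside [-1, 1].
    return (~col.diff().between(-1, 1)).cumsum()

q_island = get_island(df["Query Segment Id"]).rename("Q Island")
r_island = get_island(df["Reference Segment Id"]).rename("R Island")

# One row per (Q Island, R Island) intersection.
result = df.groupby([q_island, r_island]).agg(
    q_start=("Query Segment Id", "first"),
    q_end=("Query Segment Id", "last"),
    r_start=("Reference Segment Id", "first"),
    r_end=("Reference Segment Id", "last"),
    count=("Query Segment Id", "count"),
)

# Keep only intersections longer than one row.
result = result[result["count"] > 1]

# Collect ((q_start, q_end), (r_start, r_end)) pairs.
ranges = [((row.q_start, row.q_end), (row.r_start, row.r_end))
          for row in result.itertuples()]
print(ranges)
```

Filtering with a boolean mask keeps the count column as integers, whereas replacing values with NaN converts it to floats.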