
Finding asc/desc sequences in a Pandas DataFrame

Python

繁星淼淼 2022-10-06 19:29:08

I'm trying to build a tool that helps simplify research work, and it seems I need to detect when I have an increasing sequence in the data of one column together with an asc/desc sequence in another column. Is there a clean way to check whether there are sequences in the rows without having to write a state machine that iterates over the rows, like https://stackoverflow.com/a/52679427/5045375? Such a snippet would have to check whether the values in one column are increasing (no gaps) while the values in the other column are asc/desc (no gaps). I am perfectly able to write this, I am just wondering whether there is something in my pandas toolbox that I am missing.

Here are some examples to clarify my intention,

import pandas as pd
from collections import namedtuple

QUERY_SEGMENT_ID_COLUMN = 'Query Segment Id'
REFERENCE_SEGMENT_ID_COLUMN = 'Reference Segment Id'

def dataframe(data):
    columns = [QUERY_SEGMENT_ID_COLUMN, REFERENCE_SEGMENT_ID_COLUMN]
    return pd.DataFrame(data, columns=columns)

# No sequence in either column. No results
data_without_pattern = [[1, 2], [7, 0], [3, 6]]

# Sequence in first column, but no sequence in second column. No results
data_with_pseodo_pattern_query = [[1, 2], [2, 0], [3, 6]]

# Sequence in second column, but no sequence in first column. No results
data_with_pseudo_pattern_reference = [[1, 2], [7, 3], [3, 4]]

# Broken sequence in first column, sequence in second column. No results
data_with_pseudo_pattern_query_broken = [[1, 2], [3, 3], [7, 4]]

# Sequence occurs in both columns, asc. Expect results
data_with_pattern_asc = [[1, 2], [2, 3], [3, 4]]

# Sequence occurs in both columns, desc. Expect results
data_with_pattern_desc = [[1, 4], [2, 3], [3, 2]]

# There is a sequence, and some noise. Expect results
data_with_pattern_and_noise = [[1, 0], [1, 4], [1, 2], [1, 3], [2, 3], [3, 4]]

In the first example, there is no pattern at all,

print(dataframe(data_without_pattern))

   Query Segment Id  Reference Segment Id
0                 1                     2
1                 7                     0
2                 3                     6

The second example has an ascending sequence of ids in the query column, but none in the reference column,

print(dataframe(data_with_pseodo_pattern_query))

   Query Segment Id  Reference Segment Id
0                 1                     2
1                 2                     0
2                 3                     6


1 Answer

繁星点点滴滴


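If I understand correctly, your problem is a variation of the gaps-and-islands problem. Each monotonic (increasing or decreasing) subsequence with acceptable gaps forms an island. For example, given a Series s: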

s   island
--  ------
0   1
1   1
3   2       # gap > 1, form new island
4   2
2   3       # stop increasing, form new island
1   3
0   3

In general: whenever the gap between the current row and the previous row falls outside the range [-1, 1], a new island is formed.
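The rule above can be sketched on the example Series as follows. This is a minimal illustration that uses the same `diff` / `between` / `cumsum` chain as the answer's code; the variable names `step` and `starts_new` are only illustrative.

```python
import pandas as pd

def get_island(col: pd.Series) -> pd.Series:
    """Label runs of rows whose row-to-row step stays within [-1, 1]."""
    step = col.diff()                  # the first row has no predecessor, so its step is NaN
    starts_new = ~step.between(-1, 1)  # NaN is not "between", so the first row opens island 1
    return starts_new.cumsum()         # every True increments the island label

s = pd.Series([0, 1, 3, 4, 2, 1, 0])
islands = get_island(s)
print(pd.DataFrame({"s": s, "island": islands}))
```

Note that a step of 0 also lies inside [-1, 1], so repeated values stay in the same island under this rule.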

Apply this gaps-and-islands algorithm to Query Segment Id and Reference Segment Id:

Query Segment Id  Q Island  Reference Segment Id  R Island  Q-R Intersection
----------------  --------  --------------------  --------  ----------------
               1         1                     1         1  (1, 1)
               2         1                     2         1  (1, 1)
               3         1                     3         1  (1, 1)
               0         2                     4         1  (2, 1)
               1         2                     5         1  (2, 1)
               2         2                     6         1  (2, 1)
               3         2                     7         1  (2, 1)
               4         2                     8         1  (2, 1)
               0         3                     9         1  (3, 1)

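The q and r ranges you are looking for are now the Query Segment Id and Reference Segment Id at the beginning and end of each Q-R Intersection. One final caveat: ignore intersections of length 1 (like the last one).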
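As a rough sketch, the intersection table and the length-1 caveat could be reproduced like this. The input rows are read off the table above (an assumption that those nine rows are the whole input), and the `Length` column and the `kept` name are hypothetical, introduced only for illustration.

```python
import pandas as pd

# Assumption: the nine rows shown in the table are the complete input.
df = pd.DataFrame({
    "Query Segment Id":     [1, 2, 3, 0, 1, 2, 3, 4, 0],
    "Reference Segment Id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

def get_island(col):
    # A new island starts whenever the step from the previous row is outside [-1, 1].
    return (~col.diff().between(-1, 1)).cumsum()

df["Q Island"] = get_island(df["Query Segment Id"])
df["R Island"] = get_island(df["Reference Segment Id"])

# Pair the two island labels row by row to identify each intersection.
df["Q-R Intersection"] = list(zip(df["Q Island"], df["R Island"]))

# Hypothetical helper column: number of rows in each intersection.
df["Length"] = (
    df.groupby(["Q Island", "R Island"])["Query Segment Id"].transform("size")
)

# Drop intersections that consist of a single row.
kept = df[df["Length"] > 1]
print(kept)
```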

Code:

import pandas as pd

# Input rows reconstructed from the table above
data_with_multiple_contiguous_sequences = [
    [1, 1], [2, 2], [3, 3], [0, 4], [1, 5], [2, 6], [3, 7], [4, 8], [0, 9]
]

columns = ['Query Segment Id', 'Reference Segment Id']
df = pd.DataFrame(data_with_multiple_contiguous_sequences, columns=columns)

def get_island(col):
    # A new island starts whenever the step from the previous row is outside [-1, 1]
    return (~col.diff().between(-1, 1)).cumsum()

df[['Q Island', 'R Island']] = df[['Query Segment Id', 'Reference Segment Id']].apply(get_island)

# One row per Q-R intersection, with the first and last id of each column
result = df.groupby(['Q Island', 'R Island']) \
    .agg(**{
        'Q Start': ('Query Segment Id', 'first'),
        'Q End': ('Query Segment Id', 'last'),
        'R Start': ('Reference Segment Id', 'first'),
        'R End': ('Reference Segment Id', 'last'),
        'Count': ('Query Segment Id', 'count')
    }) \
    .loc[lambda g: g['Count'] > 1]  # ignore intersections of length 1

result['Q'] = result[['Q Start', 'Q End']].apply(tuple, axis=1)
result['R'] = result[['R Start', 'R End']].apply(tuple, axis=1)

Result:

                   Q Start  Q End  R Start  R End  Count       Q       R
Q Island R Island
1        1               1      3        1      3      3  (1, 3)  (1, 3)
2        1               0      4        4      8      5  (0, 4)  (4, 8)
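To see how this approach behaves on the examples from the question, the steps can be wrapped in a small function. This is only a sketch: `find_sequences` is a hypothetical helper name, and it filters with `Count > 1` to skip single-row intersections.

```python
import pandas as pd

Q, R = "Query Segment Id", "Reference Segment Id"

def get_island(col):
    # A new island starts whenever the step from the previous row is outside [-1, 1].
    return (~col.diff().between(-1, 1)).cumsum()

def find_sequences(data):
    """Hypothetical helper: one row per Q-R intersection longer than one row."""
    df = pd.DataFrame(data, columns=[Q, R])
    df["Q Island"] = get_island(df[Q])
    df["R Island"] = get_island(df[R])
    result = df.groupby(["Q Island", "R Island"]).agg(**{
        "Q Start": (Q, "first"), "Q End": (Q, "last"),
        "R Start": (R, "first"), "R End": (R, "last"),
        "Count": (Q, "count"),
    })
    return result[result["Count"] > 1]

# Test data copied from the question.
cases = {
    "without_pattern": [[1, 2], [7, 0], [3, 6]],
    "pseudo_pattern_query": [[1, 2], [2, 0], [3, 6]],
    "pattern_asc": [[1, 2], [2, 3], [3, 4]],
    "pattern_desc": [[1, 4], [2, 3], [3, 2]],
    "pattern_and_noise": [[1, 0], [1, 4], [1, 2], [1, 3], [2, 3], [3, 4]],
}

for name, data in cases.items():
    # Each inner list is [Q Start, Q End, R Start, R End, Count].
    print(name, find_sequences(data).values.tolist())
```

The first two cases yield no rows, while the asc and desc cases each yield one intersection covering all three rows. In the noisy case the reported range is Q (1, 3) and R (2, 4) with a Count of 4 rather than 3, because a step of 0 is within [-1, 1] and the repeated query id 1 therefore does not break the island. If repeats should break a sequence, the tolerance in `get_island` would need to be tightened.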

Answered 2022-10-06
