
A better solution than iterrows for comparing values between different datasets

Python

子衿沉夜 2023-03-16 15:28:31

I have been using iterrows() to compare column values between two datasets and merge the rows that meet certain conditions, but it takes a very long time. Is there a better way to do this without iterating? Here is the full function:

def find_peak_matches(lncRNA, CAGE):
    """isolates CAGE peaks that match an lncRNA"""
    lncRNA['promoter_start'] = lncRNA['promoter_start'].apply(pd.to_numeric).astype('int32')
    lncRNA['promoter_stop'] = lncRNA['promoter_stop'].apply(pd.to_numeric).astype('int32')
    CAGE['peak_start'] = CAGE['peak_start'].apply(pd.to_numeric).astype('int32')
    CAGE['peak_stop'] = CAGE['peak_stop'].apply(pd.to_numeric).astype('int32')
    peak_matches = pd.DataFrame()
    for i, row in lncRNA.iterrows():
        mask = (
            (CAGE['chr'] == row['chr']) &
            (row['promoter_start'] <= CAGE['peak_start']) &
            (row['promoter_stop'] >= CAGE['peak_stop'])
        )  # finds peaks in lncRNA promoters
        matches = CAGE[mask].dropna()  # isolates only the peak matches
        if len(matches) == 0:  # if no matches found, continue
            continue
        merged = pd.merge(row.to_frame().T, matches, on=['chr'])  # merges rows that meet mask conditions
        peak_matches = pd.concat([peak_matches, merged], ignore_index=True)  # creates a new df from all the merged rows
    logging.debug('found peak matches')
    return (peak_matches)

Here is a sample lncRNA dataset:

name  chr  promoter_start  promoter_stop  info
lnc1  1    1               10             x
lnc2  1    11              20             y
lnc3  1    21              30             z

Sample CAGE dataset:

ID     chr  peak_start  peak_stop
peak1  1    3           7
peak2  1    15          17
peak3  1    4           6
peak4  2    6           9

The real datasets contain about 200,000 rows each, so my current code takes far too long. I am trying to merge rows where the peak and the promoter have the same chr value and the peak start/stop falls between the promoter start/stop. Any suggestions for optimizing this? I'm fairly new to Python, so I don't know what the best way of doing things is.


2 Answers

呼啦一阵风


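You might have better luck doing all of the heavy lifting before the data goes into pandas. In plain Python you have far more control over how many comparisons are made, although you do give up some of the NumPy acceleration that pandas provides. I wrote the example below with namedtuples for convenience and did all of the comparisons before building the data frame. On 200K x 200K rows of fake data it finishes in about 30 seconds on my machine and yields roughly 10 million matching rows; that count depends entirely on the spread of the random data I used. YMMV.

There is probably more performance "left on the floor" here. Some smarter sorting (beyond the binning by 'chr' that I did) could likely take this further.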

import pandas as pd
from collections import namedtuple, defaultdict
from random import randint

# structures
rna = namedtuple('rna', 'name chr promoter_start promoter_stop info')
cage = namedtuple('cage', 'ID chr peak_start peak_stop')
row = namedtuple('row', 'name chr promoter_start promoter_stop info ID peak_start peak_stop')

# some data entry from post to check...
rnas = [rna('inc1',1,1,10,'x'), rna('inc2',1,11,20,'y'), rna('inc3',1,21,30,'z')]
cages = [cage('peak1',1,3,7), cage('peak2',1,15,17), cage('peak3',1,4,6), cage('peak4',2,6,9)]

# keep every (rna, cage) pair on the same chr where the peak lies inside the promoter
result_rows = [row(r.name, r.chr, r.promoter_start, r.promoter_stop, r.info, c.ID, c.peak_start, c.peak_stop)
               for r in rnas for c in cages if
               r.chr == c.chr and
               r.promoter_start <= c.peak_start and
               r.promoter_stop >= c.peak_stop]

df = pd.DataFrame(data=result_rows)
print(df)
print()

# stress test
# big fake data (start and stop are drawn independently, so some intervals have start > stop)
rnas = [rna('xx', randint(1,1000), randint(1,50), randint(10,150), 'yy') for t in range(200_000)]
cages = [cage('pk', randint(1,1000), randint(1,50), randint(10,150)) for t in range(200_000)]

# group by chr to expedite comparisons
rna_dict = defaultdict(list)
cage_dict = defaultdict(list)
for r in rnas:
    rna_dict[r.chr].append(r)
for c in cages:
    cage_dict[c.chr].append(c)

print('fake data made')

# use the chr's that are keys in the rna dictionary and make all comparisons...
result_rows = []
for k in rna_dict.keys():
    # .get(k, []) avoids a TypeError when a chr has RNAs but no CAGE peaks
    result_rows.extend([row(r.name, r.chr, r.promoter_start, r.promoter_stop, r.info, c.ID, c.peak_start, c.peak_stop)
                        for r in rna_dict.get(k) for c in cage_dict.get(k, []) if
                        r.promoter_start <= c.peak_start and
                        r.promoter_stop >= c.peak_stop])

df = pd.DataFrame(data=result_rows)
print(df.head(5))
print(df.info())

Output:

   name  chr  promoter_start  promoter_stop info     ID  peak_start  peak_stop
0  inc1    1               1             10    x  peak1           3          7
1  inc1    1               1             10    x  peak3           4          6
2  inc2    1              11             20    y  peak2          15         17

fake data made
  name  chr  promoter_start  promoter_stop info  ID  peak_start  peak_stop
0   xx  804              34             35   yy  pk          36         11
1   xx  804              34             35   yy  pk          39         11
2   xx  804              34             35   yy  pk          37         14
3   xx  804              34             35   yy  pk          34         28
4   xx  804              34             35   yy  pk          39         20
RangeIndex: 10280046 entries, 0 to 10280045
Data columns (total 8 columns):
name              object
chr               int64
promoter_start    int64
promoter_stop     int64
info              object
ID                object
peak_start        int64
peak_stop         int64
dtypes: int64(5), object(3)
memory usage: 627.4+ MB
None
[Finished in 35.4s]
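The "smarter sorting" idea mentioned in this answer could look something like the sketch below. It is only one possible approach and was not benchmarked in this thread. Peaks are binned by chr and sorted by peak_start, so a binary search (the standard-library bisect module) can skip every peak that starts before the promoter. The function name match_sorted is made up for illustration, and the early break assumes every peak has peak_start <= peak_stop, which real coordinates normally satisfy but the random fake data above does not.

```python
from bisect import bisect_left
from collections import defaultdict, namedtuple

rna = namedtuple('rna', 'name chr promoter_start promoter_stop info')
cage = namedtuple('cage', 'ID chr peak_start peak_stop')


def match_sorted(rnas, cages):
    """Return (rna, cage) pairs where the peak lies inside the promoter.

    Assumption: peak_start <= peak_stop for every peak.
    """
    # Bin peaks by chr, then sort each bin by peak_start.
    by_chr = defaultdict(list)
    for c in cages:
        by_chr[c.chr].append(c)
    starts = {}
    for k, peaks in by_chr.items():
        peaks.sort(key=lambda c: c.peak_start)
        starts[k] = [c.peak_start for c in peaks]

    pairs = []
    for r in rnas:
        peaks = by_chr.get(r.chr)
        if not peaks:
            continue
        # Index of the first peak whose start is >= promoter_start.
        i = bisect_left(starts[r.chr], r.promoter_start)
        for c in peaks[i:]:
            if c.peak_start > r.promoter_stop:
                break  # sorted by start, so no later peak can fit either
            if c.peak_stop <= r.promoter_stop:
                pairs.append((r, c))
    return pairs


# Sample data from the question.
rnas = [rna('lnc1', 1, 1, 10, 'x'), rna('lnc2', 1, 11, 20, 'y'), rna('lnc3', 1, 21, 30, 'z')]
cages = [cage('peak1', 1, 3, 7), cage('peak2', 1, 15, 17),
         cage('peak3', 1, 4, 6), cage('peak4', 2, 6, 9)]

pairs = match_sorted(rnas, cages)
for r, c in pairs:
    print(r.name, c.ID)
```

How much this helps depends on the data: when promoters are short relative to the chromosome, each RNA only touches a handful of peaks instead of every peak on the same chr.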

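From DataFrame --> namedtuple

A couple of options below. I was looking into the same thing and picked up a few examples. You can use DataFrame.itertuples, as shown below, to peel the rows out and put them into namedtuples. It appears, however, to do positional matching only, so be careful: note that the second example is misaligned. Pandas also seems to build its own named row type, which would probably work just as well (the last example). I didn't tinker with it much, but the rows appear to be addressable by field name, which is just as good as a namedtuple.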

In [22]: df
Out[22]:
   name  chr  promoter_start  promoter_stop info
0  lnc1    1               1             10    x
1  lnc2    1              11             20    y
2  lnc3    1              21             30    z

In [23]: rna = namedtuple('rna', 'name chr promoter_start promoter_stop info')

In [24]: rows = [rna(*t) for t in df.itertuples(index=False)]

In [25]: rows
Out[25]:
[rna(name='lnc1', chr=1, promoter_start=1, promoter_stop=10, info='x'),
 rna(name='lnc2', chr=1, promoter_start=11, promoter_stop=20, info='y'),
 rna(name='lnc3', chr=1, promoter_start=21, promoter_stop=30, info='z')]

In [26]: rna = namedtuple('rna', 'name chr info promoter_start promoter_stop')  # note: wrong

In [27]: rows = [rna(*t) for t in df.itertuples(index=False)]

In [28]: rows
Out[28]:
[rna(name='lnc1', chr=1, info=1, promoter_start=10, promoter_stop='x'),
 rna(name='lnc2', chr=1, info=11, promoter_start=20, promoter_stop='y'),
 rna(name='lnc3', chr=1, info=21, promoter_start=30, promoter_stop='z')]

In [29]: # note the above is mis-aligned!!!

In [32]: rows = [t for t in df.itertuples(name='row', index=False)]

In [33]: rows
Out[33]:
[row(name='lnc1', chr=1, promoter_start=1, promoter_stop=10, info='x'),
 row(name='lnc2', chr=1, promoter_start=11, promoter_stop=20, info='y'),
 row(name='lnc3', chr=1, promoter_start=21, promoter_stop=30, info='z')]

In [34]: type(rows[0])
Out[34]: pandas.core.frame.row

In [35]: rows[0].chr
Out[35]: 1

In [36]: rows[0].info
Out[36]: 'x'
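One way to sidestep the positional mismatch shown in In [26] to Out[28] is to build each namedtuple from keyword arguments, so fields are matched by column name rather than by position. The following is a minimal sketch, assuming the column names are valid Python identifiers and exactly match the namedtuple fields; it uses DataFrame.to_dict(orient='records'), which was not part of the original answer.

```python
from collections import namedtuple

import pandas as pd

df = pd.DataFrame({
    'name': ['lnc1', 'lnc2', 'lnc3'],
    'chr': [1, 1, 1],
    'promoter_start': [1, 11, 21],
    'promoter_stop': [10, 20, 30],
    'info': ['x', 'y', 'z'],
})

# Field order deliberately differs from the column order of df.
rna = namedtuple('rna', 'name chr info promoter_start promoter_stop')

# Each record is a {column_name: value} dict, so ** matches by name.
rows = [rna(**record) for record in df.to_dict(orient='records')]

print(rows[0].info, rows[0].promoter_start, rows[0].promoter_stop)
```

A missing or extra column raises a TypeError at construction time, which is arguably preferable to silently swapping values.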

2023-03-16

守着星空守着你


You can merge the lncRNA and CAGE dataframes on chr and then filter the result with df.query.

For example:

df = lncRNA.merge(CAGE, on='chr')
df = df.query('(promoter_start <= peak_start) & (promoter_stop >= peak_stop)')
print(df)

Prints:

   name  chr  promoter_start  promoter_stop info     ID  peak_start  peak_stop
0  lnc1    1               1             10    x  peak1           3          7
2  lnc1    1               1             10    x  peak3           4          6
4  lnc2    1              11             20    y  peak2          15         17
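With about 200,000 rows per table, merging on chr alone produces every promoter/peak pair on each chromosome before filtering, which can be large. A hedged variation is to run the same merge-then-filter one chromosome at a time, so only one chromosome's cross product is in memory at once. The sketch below is self-contained and uses the sample data from the question; the function name find_peak_matches_by_chr is made up for illustration, and whether it is actually faster or lighter than the single merge depends on the data.

```python
import pandas as pd

lncRNA = pd.DataFrame({
    'name': ['lnc1', 'lnc2', 'lnc3'],
    'chr': [1, 1, 1],
    'promoter_start': [1, 11, 21],
    'promoter_stop': [10, 20, 30],
    'info': ['x', 'y', 'z'],
})
CAGE = pd.DataFrame({
    'ID': ['peak1', 'peak2', 'peak3', 'peak4'],
    'chr': [1, 1, 1, 2],
    'peak_start': [3, 15, 4, 6],
    'peak_stop': [7, 17, 6, 9],
})


def find_peak_matches_by_chr(lncRNA, CAGE):
    """Merge and filter one chromosome at a time."""
    cage_groups = {chrom: part for chrom, part in CAGE.groupby('chr')}
    parts = []
    for chrom, rna_part in lncRNA.groupby('chr'):
        cage_part = cage_groups.get(chrom)
        if cage_part is None:
            continue  # no peaks on this chromosome
        merged = rna_part.merge(cage_part, on='chr')
        inside = ((merged['promoter_start'] <= merged['peak_start'])
                  & (merged['promoter_stop'] >= merged['peak_stop']))
        parts.append(merged[inside])
    if not parts:
        return pd.DataFrame()
    return pd.concat(parts, ignore_index=True)


result = find_peak_matches_by_chr(lncRNA, CAGE)
print(result)
```

The boolean mask here is equivalent to the df.query expression above; it is written out only to keep the sketch independent of query's expression parser.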

2023-03-16
