
Why is looking up the reversed index so slow in pandas?

Python

慕运维8079593 2023-03-30 16:41:31

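I have a pandas DataFrame that stores network data; it looks like this:

from_id, to_id, count
X, Y, 3
Z, Y, 4
Y, X, 2
...

I'm trying to add a new column, inverse_count, that holds the count value from the row whose from_id and to_id are swapped relative to the current row.

I'm taking the following approach. I thought it would be fast, but it is much slower than I expected, and I don't understand why.

def get_inverse_val(x):
    # Takes the inverse of the index for a given row
    # When passed to apply with axis = 1, the index becomes the name
    try:
        return df.loc[(x.name[1], x.name[0]), 'count']
    except KeyError:
        return 0

df = df.set_index(['from_id', 'to_id'])
df['inverse_count'] = df.apply(get_inverse_val, axis = 1)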


2 Answers

HUX布斯

Contributed 1876 experience points, received 6+ likes

Why not just do a simple merge for this?

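import pandas as pd

df = pd.DataFrame({'from_id': ['X', 'Z', 'Y'], 'to_id': ['Y', 'Y', 'X'], 'count': [3,4,2]})

# Self-merge: match each (from_id, to_id) pair against the reversed (to_id, from_id) pair
pd.merge(
    left = df,
    right = df,
    how = 'left',
    left_on = ['from_id', 'to_id'],
    right_on = ['to_id', 'from_id']
)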

  from_id_x to_id_x  count_x from_id_y to_id_y  count_y
0         X       Y        3         Y       X      2.0
1         Z       Y        4       NaN     NaN      NaN
2         Y       X        2         X       Y      3.0

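Here we merge (from, to) against (to, from) to get the reversed matching pairs; count_y is the inverse count, and it is NaN where no reversed pair exists. In general, you should avoid apply() because it is slow. (To understand why, note that it is not a vectorized operation: with axis=1 it calls your Python function once per row, and here every call also performs a separate df.loc lookup.)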
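The merge above returns six columns, while the question asks for a single inverse_count column with 0 where no reversed pair exists. A minimal sketch of one way to get there follows. The `reverse` and `result` names are illustrative and not from the original answer, and the sketch assumes every (from_id, to_id) pair occurs at most once; with duplicate pairs a merge would multiply rows.

```python
import pandas as pd

df = pd.DataFrame({
    "from_id": ["X", "Z", "Y"],
    "to_id": ["Y", "Y", "X"],
    "count": [3, 4, 2],
})

# Build a lookup table in which every edge is flipped: swapping the two id
# column names makes each row describe the reverse direction.
reverse = df.rename(columns={
    "from_id": "to_id",
    "to_id": "from_id",
    "count": "inverse_count",
})

# A left merge keeps every original row; rows without a reversed edge get NaN.
result = df.merge(reverse, on=["from_id", "to_id"], how="left")

# Match the question's behaviour of returning 0 when the reverse is missing.
result["inverse_count"] = result["inverse_count"].fillna(0).astype(int)

print(result)
```

Renaming before the merge avoids the _x/_y suffixes, so the result keeps the original column names plus inverse_count.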

Posted 2023-03-30

慕斯709654

Contributed 1840 experience points, received 5+ likes

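You can use .set_index twice to create two DataFrames with opposite index orders, then use assign to create your inverse_count column. The assignment aligns on the index values, so each (from_id, to_id) row picks up the count stored under the same pair in the (to_id, from_id) index.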

df = (df.set_index(['from_id','to_id'])
        .assign(inverse_count=df.set_index(['to_id','from_id'])['count'])
        .reset_index())

  from_id to_id  count  inverse_count
0       X     Y      3            2.0
1       Z     Y      4            NaN
2       Y     X      2            3.0
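The index alignment that this answer relies on can also be written out explicitly with Series.reindex, which additionally allows filling missing pairs with 0 as the question wanted. This is only a sketch under the assumption that (from_id, to_id) pairs are unique; the `counts` and `reversed_keys` names are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "from_id": ["X", "Z", "Y"],
    "to_id": ["Y", "Y", "X"],
    "count": [3, 4, 2],
})

# Series of counts keyed by the (from_id, to_id) pair.
counts = df.set_index(["from_id", "to_id"])["count"]

# For every row, the key we want to look up is the swapped pair (to_id, from_id).
reversed_keys = pd.MultiIndex.from_arrays([df["to_id"], df["from_id"]])

# reindex performs all lookups in one vectorized call; pairs that do not
# exist in the index receive fill_value instead of NaN.
df["inverse_count"] = counts.reindex(reversed_keys, fill_value=0).to_numpy()

print(df)
```

Converting to a NumPy array before assignment sidesteps a second round of index alignment, since the values are already in row order.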

Since the question is about speed, let's look at performance on a larger dataset:

Setup:

import pandas as pd
import string
import itertools

df = pd.DataFrame(list(itertools.permutations(string.ascii_uppercase, 2)), columns=['from_id', 'to_id'])
df['count'] = df.index % 25 + 1
print(df)

    from_id to_id  count
0         A     B      1
1         A     C      2
2         A     D      3
3         A     E      4
4         A     F      5
..      ...   ...    ...
645       Z     U     21
646       Z     V     22
647       Z     W     23
648       Z     X     24
649       Z     Y     25

Using set_index:

%timeit (df.set_index(['from_id','to_id'])
           .assign(inverse_count=df.set_index(['to_id','from_id'])['count'])
           .reset_index())

6 ms ± 24.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using merge:

%timeit pd.merge(
    left = df,
    right = df,
    how = 'left',
    left_on = ['from_id', 'to_id'],
    right_on = ['to_id', 'from_id'] )

1.73 ms ± 57.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So the merge approach looks like the faster option: about 1.73 ms versus 6 ms per loop on this 650-row dataset.
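The %timeit magic only works inside IPython or Jupyter. A rough sketch of the same comparison as a plain script, using the standard timeit module and also including the original apply-based approach, might look like the following. The helper names (with_apply, with_set_index, with_merge) are made up for illustration, and absolute timings will vary by machine and pandas version, so no expected numbers are shown.

```python
import itertools
import string
import timeit

import pandas as pd

# Same benchmark data as in the setup: all ordered pairs of distinct letters.
pairs = list(itertools.permutations(string.ascii_uppercase, 2))
df = pd.DataFrame(pairs, columns=["from_id", "to_id"])
df["count"] = df.index % 25 + 1


def with_apply(frame):
    """Row-by-row lookup, as in the question."""
    indexed = frame.set_index(["from_id", "to_id"])

    def lookup(row):
        # row.name is the (from_id, to_id) tuple of the current row.
        try:
            return indexed.loc[(row.name[1], row.name[0]), "count"]
        except KeyError:
            return 0

    return indexed.apply(lookup, axis=1).to_numpy()


def with_set_index(frame):
    """Index alignment via two set_index calls."""
    inverse = frame.set_index(["to_id", "from_id"])["count"]
    out = frame.set_index(["from_id", "to_id"]).assign(inverse_count=inverse)
    return out["inverse_count"].to_numpy()


def with_merge(frame):
    """Self-merge on the swapped key columns."""
    merged = pd.merge(frame, frame, how="left",
                      left_on=["from_id", "to_id"],
                      right_on=["to_id", "from_id"])
    return merged["count_y"].to_numpy()


for func in (with_apply, with_set_index, with_merge):
    # Average wall-clock time per call over 20 runs.
    seconds = timeit.timeit(lambda: func(df), number=20) / 20
    print(f"{func.__name__}: {seconds * 1000:.2f} ms per call")
```

Checking that all three functions return the same values before comparing their timings is a useful sanity step, since a faster method that produces different results is not a valid replacement.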

Posted 2023-03-30
