
Transferring values between two columns in a pandas DataFrame

Python

炎炎设计 2023-10-26 14:40:32

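I have a pandas DataFrame like this:

p    q
0.5  0.5
0.6  0.4
0.3  0.7
0.4  0.6
0.9  0.1

How can I move the larger value of each row into column p and the smaller value into column q, so that the result looks like this?

p    q
0.5  0.5
0.6  0.4
0.7  0.3
0.6  0.4
0.9  0.1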


4 Answers

陪伴而非守候

Contributed 1757 experience points, received 8+ likes

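You can build the two conditional arrays with np.where() and then assign them back to the DataFrame:

import numpy as np

# larger value of each row
s1 = np.where(df['p'] < df['q'], df['q'], df['p'])
# smaller value of each row
s2 = np.where(df['p'] > df['q'], df['q'], df['p'])
df['p'] = s1
df['q'] = s2

Out[1]:

     p    q
0  0.5  0.5
1  0.6  0.4
2  0.7  0.3
3  0.6  0.4
4  0.9  0.1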

You can also use .where():

s1 = df['p'].where(df['p'] > df['q'], df['q'])
s2 = df['p'].where(df['p'] < df['q'], df['q'])
df['p'] = s1
df['q'] = s2
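For anyone who wants to run the .where() version end to end, here is a minimal sketch. The sample frame is rebuilt from the question, and the names `larger` and `smaller` are my own, not from the answer.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({"p": [0.5, 0.6, 0.3, 0.4, 0.9],
                   "q": [0.5, 0.4, 0.7, 0.6, 0.1]})

# Series.where keeps the caller's value where the condition is True
# and takes the value from the second argument where it is False.
larger = df["p"].where(df["p"] > df["q"], df["q"])   # row-wise maximum
smaller = df["p"].where(df["p"] < df["q"], df["q"])  # row-wise minimum

# Compute both Series before assigning, so the second one
# does not see the already-updated 'p' column.
df["p"] = larger
df["q"] = smaller
print(df)
```

Computing both results before either assignment is the important detail here; assigning `p` first and then deriving `q` from the updated column would lose the smaller values.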

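I timed the answers on frames ranging from 100 rows to 1 million rows. The answers that need to pass axis=1 can be 10,000 times slower!

Erfan's numpy answer appears to be the fastest on large datasets, running in milliseconds.
My .where() answer also performs very well and keeps the execution time in the millisecond range (I assume np.where() would give similar results).
I expected MHDG7's answer to be the slowest, but it is actually faster than Alexander's.
I suspect Alexander's answer is slow because it needs axis=1. In fact, both MHDG7's and Alexander's answers work row by row (with axis=1), which slows things down considerably on large DataFrames.

As you can see, a one-million-row DataFrame takes minutes to process. And if you have a DataFrame with 10 million to 100 million rows, those one-liners could take hours.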

from timeit import timeit

import numpy as np
import pandas as pd

# Base 5-row frame from the question; it is repeated below to build larger inputs.
df = pd.DataFrame({'p': [0.5, 0.6, 0.3, 0.4, 0.9],
                   'q': [0.5, 0.4, 0.7, 0.6, 0.1]})

def df_where(df):
    s1 = df['p'].where(df['p'] > df['q'], df['q'])
    s2 = df['p'].where(df['p'] < df['q'], df['q'])
    df['p'] = s1
    df['q'] = s2
    return df

def agg_maxmin(df):
    df[['p', 'q']] = df[['p', 'q']].agg([max, min], axis=1)
    return df

def np_flip(df):
    df = pd.DataFrame(np.flip(np.sort(df), axis=1), columns=df.columns)
    return df

def lambda_x(df):
    df = df.apply(lambda x: [x['p'],x['q']] if x['p']>x['q'] else [x['q'],x['p']],axis=1,result_type='expand')
    return df

# One row per repetition count, one column per function under test.
res = pd.DataFrame(
    index=[20, 200, 2000, 20000, 200000],
    columns='df_where agg_maxmin np_flip lambda_x'.split(),
    dtype=float
)

for i in res.index:
    # 5 rows x i repetitions: 100 to 1,000,000 rows
    d = pd.concat([df]*i)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        print(stmt, d.shape)
        res.at[i, j] = timeit(stmt, setp, number=1)

res.plot(loglog=True);

2023-10-26

慕容森

Contributed 1853 experience points, received 18+ likes

Use numpy.sort to sort each row in ascending order along the horizontal axis, then flip the array along axis=1:

df = pd.DataFrame(np.flip(np.sort(df), axis=1), columns=df.columns)

     p    q
0  0.5  0.5
1  0.6  0.4
2  0.7  0.3
3  0.6  0.4
4  0.9  0.1
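The one-liner above does two things at once, so here is a rough step-by-step sketch of the same idea. It assumes the frame holds only the two numeric columns p and q; the intermediate names (`arr`, `ascending`, `descending`, `result`) and the explicit `index=df.index` are my additions.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({"p": [0.5, 0.6, 0.3, 0.4, 0.9],
                   "q": [0.5, 0.4, 0.7, 0.6, 0.1]})

arr = df.to_numpy()

# Step 1: sort every row independently, smallest value first.
ascending = np.sort(arr, axis=1)

# Step 2: reverse every row, so the largest value lands in the first column.
descending = np.flip(ascending, axis=1)

# Step 3: wrap the array back into a DataFrame with the original labels.
result = pd.DataFrame(descending, columns=df.columns, index=df.index)
print(result)
```

Because the sort runs over every column of the array, this only matches the question's intent when p and q are the only columns; with extra columns you would select `df[['p', 'q']]` first.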

2023-10-26

至尊宝的传说

Contributed 1789 experience points, received 10+ likes

Use agg, passing a list of functions (max and min) and specifying axis=1 so that the functions are applied across the columns of each row.

df[['p', 'q']] = df[['p', 'q']].agg([max, min], axis=1)

>>> df

     p    q
0  0.5  0.5
1  0.6  0.4
2  0.7  0.3
3  0.6  0.4
4  0.9  0.1

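The simple solution is not always the most efficient one (the one above, for example). The following solution is noticeably faster. It masks the rows of the DataFrame where column p is less than column q, then swaps the values in those rows.

mask = df['p'].lt(df['q'])
df.loc[mask, ['p', 'q']] = df.loc[mask, ['q', 'p']].to_numpy()

>>> df

     p    q
0  0.5  0.5
1  0.6  0.4
2  0.7  0.3
3  0.6  0.4
4  0.9  0.1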
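Here is a small runnable sketch of the mask-and-swap approach on the question's data. The name `swapped` is mine. As I understand it, the .to_numpy() call matters because assigning a DataFrame through .loc aligns on column labels, which would put each value back where it started; take that as my reading rather than something the answer states.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({"p": [0.5, 0.6, 0.3, 0.4, 0.9],
                   "q": [0.5, 0.4, 0.7, 0.6, 0.1]})

# True only for the rows that are in the wrong order (p smaller than q).
mask = df["p"].lt(df["q"])

# Take those rows with the columns reversed, as a plain array without labels.
swapped = df.loc[mask, ["q", "p"]].to_numpy()

# Write the reversed values back into the same rows, positionally.
df.loc[mask, ["p", "q"]] = swapped
print(df)
```

Only the rows selected by the mask are touched, which is presumably why this scales better than the row-wise agg call.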

2023-10-26

千巷猫影

Contributed 1829 experience points, received 7+ likes

You can use the apply function:

df[['p','q']] = df.apply(lambda x: [x['p'],x['q']] if x['p']>x['q'] else [x['q'],x['p']],axis=1,result_type='expand' )
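To make it clearer what the apply call returns, here is a minimal sketch that keeps the expanded result in its own variable before relabelling it. The helper name `order_pair` and the variable `expanded` are hypothetical, introduced only for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({"p": [0.5, 0.6, 0.3, 0.4, 0.9],
                   "q": [0.5, 0.4, 0.7, 0.6, 0.1]})

def order_pair(row):
    """Return the row's two values as [larger, smaller]."""
    if row["p"] > row["q"]:
        return [row["p"], row["q"]]
    return [row["q"], row["p"]]

# axis=1 calls order_pair once per row; result_type='expand' turns each
# returned list into separate columns (labelled 0 and 1 by default).
expanded = df.apply(order_pair, axis=1, result_type="expand")
expanded.columns = ["p", "q"]
print(expanded)
```

Since the function runs once per row in Python, this is the kind of axis=1 solution that slows down sharply on large frames.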

2023-10-26
