
How do I use pandas groupby to compute a row-selection criterion for each unique id?

Python

慕村225694 2023-10-06 10:58:29

Is there an alternative to looping over DataFrame groups? I have a dataset with 13 million rows and 1,214 stations (unique IDs):

# copy the data to the clipboard, and read in with
# df = pd.read_clipboard(sep=',', index_col=[0])
,tmc_code,measurement_tstamp,travel_time_minutes
0,133-04199,2019-01-01 18:15:00,2.01
1,133-04199,2019-01-01 18:20:00,2.01
2,133-04198,2019-01-01 18:25:00,9.23
3,133-04191,2019-01-01 20:35:00,2.88
4,133-04191,2019-01-01 20:40:00,2.62
5,133-04190,2019-01-01 20:40:00,1.3
6,133-04193,2019-01-01 20:20:00,4.96
7,133-04193,2019-01-01 20:25:00,4.96
8,133-04192,2019-01-01 20:30:00,5.05
9,133-04192,2019-01-01 20:35:00,5.14
10,133-04195,2019-01-01 19:45:00,9.52
11,133-04195,2019-01-01 19:50:00,10.69
12,133-04195,2019-01-01 19:55:00,9.37
13,133-04194,2019-01-01 20:10:00,5.96
14,133-04194,2019-01-01 20:15:00,5.96
15,133-04194,2019-01-01 20:20:00,5.96
16,133P04359,2019-01-01 22:25:00,0.66
17,133P04359,2019-01-01 22:30:00,0.78
18,133P04359,2019-01-01 23:25:00,0.8
19,133P04126,2019-01-01 23:10:00,0.01
20,133P04125,2019-01-01 23:10:00,0.71

Some extreme maximum values are physically impossible, so to trim them I am trying to use the 95th percentile plus the mode as a threshold and filter out the extremes. Stations produce different travel_time values (because of length and traffic patterns), so the percentile and the mode must be computed per station.

This works, but it is very slow.

df_clean_tmc = df.groupby(['tmc_code'], as_index=False)['travel_time_seconds'].apply(lambda x: x[x['travel_time_seconds'] < (x['travel_time_seconds'].quantile(.95) + x['travel_time_seconds'].apply(lambda x: stats.mode(x)[0]))])

I also tried this, but it is slow, and the result shows no filtering at all: it is the same size as the original dataframe. I suspect the second apply is wrong, but the groupby object has no "mode" function, and stats.mode works correctly in individual groupby tests.

I also tried this:

df_clean_tmc = df.groupby(['tmc_code'], as_index=False)
np.where(df_clean_tmc['travel_time_seconds'] < (df_clean_tmc['travel_time_seconds'].quantile(.95)
+ df_clean_tmc['travel_time_seconds'].apply(lambda x: stats.mode(x)[0]),df['travel_time_seconds']))

but got a type error:

TypeError: '<' not supported between instances of 'DataFrameGroupBy' and 'tuple'

What is a more efficient and more appropriate way to achieve this?


1 Answer

qq_笑_17

Contributed 1,818 experience points, received over 7 likes

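Based on my test results, an improvement of several orders of magnitude is unlikely without low-level tools such as numba or even Cython. This can be seen from the time the aggregation alone requires.

However, two key optimizations are still possible:

- Reduce the number of explicit passes over the data, mainly the df[df['col'] == val] filtering. In my implementation, your for loop is replaced by (1) aggregating everything at once with .groupby().agg(), and (2) checking the threshold through a lookup table (dict). I am not sure whether a more efficient approach exists, but it would always involve at least one pass over the data and could save a few more seconds at most.
- Access df["col"].values instead of df["col"] whenever possible. (Note that this does not copy the data, which is easy to verify with the tracemalloc module turned on.)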
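The no-copy remark can also be checked without tracemalloc. The following is a minimal sketch, assuming a single float64 column; the tiny DataFrame and the variable names are illustrative and are not part of the original benchmark.

```python
import numpy as np
import pandas as pd

# Illustrative data only: three values from the question's sample.
df = pd.DataFrame({"travel_time_minutes": [2.01, 2.01, 9.23]})

# Two separate accesses to .values on the same float64 column.
values_a = df["travel_time_minutes"].values
values_b = df["travel_time_minutes"].values

# If .values copied the data, the two arrays would own distinct buffers.
shares_buffer = np.shares_memory(values_a, values_b)
print(type(values_a).__name__, shares_buffer)
```

For columns that are not backed by a plain NumPy array (for example extension dtypes), .values may behave differently, so this check applies only to the numeric case used here.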

Benchmark code:

Generated 15M records from your example.

import pandas as pd
import numpy as np
from datetime import datetime

# check memory footprint
# import tracemalloc
# tracemalloc.start()

# data
df = pd.read_csv("/mnt/ramdisk/in.csv", index_col="idx")
del df['measurement_tstamp']
df.reset_index(drop=True, inplace=True)
df["travel_time_minutes"] = df["travel_time_minutes"].astype(np.float64)

# repeat
cols = df.columns
df = pd.DataFrame(np.repeat(df.values, 500000, axis=0))
df.columns = cols

# Aggregation starts
t0 = datetime.now()
print(f"Program begins....")

# 1. aggregate everything at once
df_agg = df.groupby("tmc_code").agg(
    mode=("travel_time_minutes", pd.Series.mode),
    q95=("travel_time_minutes", lambda x: np.quantile(x, .95))
)
t1 = datetime.now()
print(f" Aggregation: {(t1 - t0).total_seconds():.2f}s")

# 2. construct a lookup table for the thresholds
threshold = {}
for tmc_code, row in df_agg.iterrows():  # slow but only 1.2k rows
    threshold[tmc_code] = np.max(row["mode"]) + row["q95"]
t2 = datetime.now()  # doesn't matter
print(f" Computing Threshold: {(t2 - t1).total_seconds():.2f}s")

# 3. filtering
def f(tmc_code, travel_time_minutes):
    return travel_time_minutes <= threshold[tmc_code]

df = df[list(map(f, df["tmc_code"].values, df["travel_time_minutes"].values))]
t3 = datetime.now()
print(f" Filter: {(t3 - t2).total_seconds():.2f}s...")
print(f"Program ends in {(datetime.now() - t0).total_seconds():.2f}s")

# memory footprint
# current, peak = tracemalloc.get_traced_memory()
# print(f"Current memory usage is {current / 10**6}MB; Peak was {peak / 10**6}MB")
# tracemalloc.stop()
print()

Results (3 runs, times in seconds):

| No. | old   | new   | new(aggr) | new(filter) |
|-----|-------|-------|-----------|-------------|
| 1   | 24.55 | 14.04 | 9.87      | 4.16        |
| 2   | 23.84 | 13.58 | 9.66      | 3.92        |
| 3   | 24.81 | 14.37 | 10.02     | 4.34        |
| avg | 24.40 | 14.00 |           |             |

=> ~74% faster (24.40 s / 14.00 s ≈ 1.74x; wall-clock time drops by about 43%)

Tested with Python 3.7 and pandas 1.1.2.
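The filtering step could possibly be expressed without a Python-level map by looking up each row's threshold with Series.map. This is a different technique from the dict lookup used in the benchmark, and it has not been timed here, so treat it as a sketch. The function name filter_by_station_threshold, its parameters, and the sample data are assumptions made for illustration.

```python
import pandas as pd


def filter_by_station_threshold(df, key="tmc_code",
                                value="travel_time_minutes", q=0.95):
    """Keep rows whose value is <= (per-group quantile + per-group mode)."""
    grouped = df.groupby(key)[value]
    # Per-station quantile, indexed by station id.
    quantile = grouped.quantile(q)
    # Per-station mode; on ties take the largest, mirroring
    # np.max(row["mode"]) in the benchmark code.
    mode = grouped.agg(lambda s: s.mode().max())
    threshold = quantile + mode
    # Broadcast each station's threshold back onto its rows.
    limits = df[key].map(threshold)
    keep = df[value].to_numpy() <= limits.to_numpy()
    return df[keep]


# Hypothetical sample: station "A" has one implausible spike.
sample = pd.DataFrame({
    "tmc_code": ["A", "A", "A", "A", "B", "B", "B"],
    "travel_time_minutes": [1.0, 1.0, 1.0, 100.0, 5.0, 5.0, 6.0],
})
cleaned = filter_by_station_threshold(sample)
print(cleaned)
```

In this sample the threshold for station "A" works out to 85.15 + 1.0 with pandas' default linear interpolation, so only the 100.0 row is dropped. Whether this beats the dict lookup on 13 million rows would need to be measured on the real data.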

2023-10-06
