Unless memory overhead becomes the bottleneck, I'd expect that approach to be slower. That said, have you tried doing a `groupby` on `df1` and subsetting `df2` by each group's `col2` values? See the example below for what I mean.
I suppose another option would be to look at a map-reduce framework (e.g., pyspark)?
import numpy as np
import pandas as pd

# two toy datasets
df1 = pd.DataFrame({i: np.random.choice(np.arange(10), size=20) for i in range(2)}).rename(columns={0: 'col1', 1: 'col2'})
df2 = pd.DataFrame({i: np.random.choice(np.arange(10), size=5) for i in range(2)}).rename(columns={0: 'colOther', 1: 'col2'})
# make sure we don't use values of col2 that df2 doesn't contain
df1 = df1[df1['col2'].isin(df2['col2'])]
# index on col2 for faster lookups with .loc
df2_col2_idx = df2.set_index('col2')
# iterate over the groups rather than merge
for i, group in df1.groupby('col1'):
    subset = df2_col2_idx.loc[group['col2'], :]
    # apply some function to the subset here
    # note 'i' is the col1 value for this group
    print(i, subset.colOther.mean())
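To sanity-check that the per-group loop gives the same numbers as a plain merge, here is a minimal sketch comparing the two on the toy frames (the fixed seed is an assumption added for reproducibility):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: seeded so the comparison is reproducible
df1 = pd.DataFrame({'col1': np.random.choice(np.arange(10), size=20),
                    'col2': np.random.choice(np.arange(10), size=20)})
df2 = pd.DataFrame({'colOther': np.random.choice(np.arange(10), size=5),
                    'col2': np.random.choice(np.arange(10), size=5)})
df1 = df1[df1['col2'].isin(df2['col2'])]
df2_col2_idx = df2.set_index('col2')

# per-group loop: mean of colOther over the matching df2 rows
loop_means = {i: df2_col2_idx.loc[group['col2'], 'colOther'].mean()
              for i, group in df1.groupby('col1')}

# merge-based equivalent for comparison
merged = df1.merge(df2, on='col2')
merge_means = merged.groupby('col1')['colOther'].mean().to_dict()
```

Because `.loc` with a duplicated index returns every matching row per label, the loop reproduces the merge's row expansion, so the two sets of means agree.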
Update: incorporating @max's suggestion from the comments to use `apply` on the groups:
df1.groupby('col1').apply(lambda x: df2_col2_idx.loc[x['col2'], 'colOther'].agg(some_aggregating_function))
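Spelled out against the toy frames above, with `'mean'` standing in for the aggregating-function placeholder and a fixed seed (both assumptions for illustration):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: seeded for reproducibility
df1 = pd.DataFrame({'col1': np.random.choice(np.arange(10), size=20),
                    'col2': np.random.choice(np.arange(10), size=20)})
df2 = pd.DataFrame({'colOther': np.random.choice(np.arange(10), size=5),
                    'col2': np.random.choice(np.arange(10), size=5)})
df1 = df1[df1['col2'].isin(df2['col2'])]
df2_col2_idx = df2.set_index('col2')

# one aggregate per col1 group, as a Series indexed by col1
result = df1.groupby('col1').apply(
    lambda x: df2_col2_idx.loc[x['col2'], 'colOther'].agg('mean'))
```

This is the same computation as the explicit loop, just collected into a pandas object instead of printed.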