
Group by in Python and spread each group's values across the columns of a single row

Python

明月笑刀无情 2021-09-02 11:03:41

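I have data like this:

ID  Value
1   ABC
1   BCD
1   AKB
2   CAB
2   AIK
3   KIB

I want to perform an operation that gives me something like this:

ID  Value1  Value2  Value3
1   ABC     BCD     AKB
2   CAB     AIK
3   KIB

I have used SAS, where retain and by gave us this result. In Python I have not found a way to do it. I know I need a group by followed by something else, but I don't know what. In PySpark, group by with collect_list returns the values as an array, but I want to do this in a pandas DataFrame.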


3 Answers

守着一只汪

Contributed 1872 experience points, received more than 3 likes

Use set_index with cumcount to build a MultiIndex, then reshape with unstack:

df1 = (df.set_index(['ID', df.groupby('ID').cumcount()])['Value']
         .unstack()
         .rename(columns=lambda x: 'Value{}'.format(x + 1))
         .reset_index())

On Python 3.6+ you can use f-strings to rename the columns:

df1 = (df.set_index(['ID', df.groupby('ID').cumcount()])['Value']
         .unstack()
         .rename(columns=lambda x: f'Value{x+1}')
         .reset_index())
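To see why this chain works, here is a minimal sketch that runs it step by step on the sample data from the question. The intermediate names `pos` and `wide` are only illustrative and do not appear in the original answer.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3],
    "Value": ["ABC", "BCD", "AKB", "CAB", "AIK", "KIB"],
})

# Position of each row inside its ID group: 0, 1, 2, 0, 1, 0
pos = df.groupby("ID").cumcount()

# Index by (ID, position), then move the position level into the columns
wide = df.set_index(["ID", pos])["Value"].unstack()

# Turn the integer column labels 0, 1, 2 into Value1, Value2, Value3
df1 = wide.rename(columns=lambda x: f"Value{x + 1}").reset_index()
print(df1)
```

The cumcount step is what gives every row a unique (ID, position) pair, which unstack needs in order to place each value in its own cell.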

Another idea is to create lists per group and pass them to the DataFrame constructor:

s = df.groupby('ID')['Value'].apply(list)
df1 = (pd.DataFrame(s.values.tolist(), index=s.index)
         .rename(columns=lambda x: 'Value{}'.format(x + 1))
         .reset_index())
print(df1)
   ID Value1 Value2 Value3
0   1    ABC    BCD    AKB
1   2    CAB    AIK    NaN
2   3    KIB    NaN    NaN
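As a self-contained sketch of the list-based variant, the snippet below rebuilds the question's sample data and prints the intermediate Series of lists before it is handed to the constructor. Depending on the pandas version, the padded cells may print as None rather than NaN; both count as missing.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3],
    "Value": ["ABC", "BCD", "AKB", "CAB", "AIK", "KIB"],
})

# One Python list per ID, keeping the original row order
s = df.groupby("ID")["Value"].apply(list)
print(s)

# Lists of unequal length are padded with missing values
df1 = (pd.DataFrame(s.values.tolist(), index=s.index)
         .rename(columns=lambda x: f"Value{x + 1}")
         .reset_index())
print(df1)
```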

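Performance: it depends on the number of rows and on the number of unique values in the ID column:

import numpy as np
import pandas as pd

np.random.seed(45)
a = np.sort(np.random.randint(1000, size=10000))
b = np.random.choice(list('abcde'), size=10000)
df = pd.DataFrame({'ID': a, 'Value': b})
# print(df)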

In [26]: %%timeit
    ...: (df.set_index(['ID', df.groupby('ID').cumcount()])['Value']
    ...:    .unstack()
    ...:    .rename(columns=lambda x: f'Value{x+1}')
    ...:    .reset_index())
    ...:
8.96 ms ± 628 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [27]: %%timeit
    ...: s = df.groupby('ID')['Value'].apply(list)
    ...: (pd.DataFrame(s.values.tolist(), index=s.index)
    ...:    .rename(columns=lambda x: 'Value{}'.format(x + 1))
    ...:    .reset_index())
    ...:
105 ms ± 7.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# jpp solution
In [28]: %%timeit
    ...: def group_gen(df):
    ...:     for key, x in df.groupby('ID'):
    ...:         x = x.set_index('ID').T
    ...:         x.index = pd.Index([key], name='ID')
    ...:         x.columns = [f'Value{i}' for i in range(1, x.shape[1]+1)]
    ...:         yield x
    ...:
    ...: pd.concat(group_gen(df)).reset_index()
    ...:
3.23 s ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
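Outside IPython, a similar comparison can be run with the standard timeit module. This is only a sketch under the same random-data assumptions as above: the function names `unstack_way` and `lists_way` are made up for illustration, the slow groupby + concat variant is left out to keep the run short, and timings will differ by machine and pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Same synthetic data as in the benchmark above
np.random.seed(45)
df = pd.DataFrame({
    "ID": np.sort(np.random.randint(1000, size=10000)),
    "Value": np.random.choice(list("abcde"), size=10000),
})


def unstack_way(df):
    # set_index + cumcount + unstack
    return (df.set_index(["ID", df.groupby("ID").cumcount()])["Value"]
              .unstack()
              .rename(columns=lambda x: f"Value{x + 1}")
              .reset_index())


def lists_way(df):
    # per-group lists passed to the DataFrame constructor
    s = df.groupby("ID")["Value"].apply(list)
    return (pd.DataFrame(s.values.tolist(), index=s.index)
              .rename(columns=lambda x: f"Value{x + 1}")
              .reset_index())


for fn in (unstack_way, lists_way):
    # Average wall-clock time over 5 calls
    seconds = timeit.timeit(lambda: fn(df), number=5) / 5
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```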

Answered 2021-09-02

慕桂英4014372

Contributed 1871 experience points, received more than 13 likes

groupby + concat

One approach is to iterate over the groupby object and concatenate the resulting DataFrames:

def group_gen(df):
    for key, x in df.groupby('ID'):
        x = x.set_index('ID').T
        x.index = pd.Index([key], name='ID')
        x.columns = [f'Value{i}' for i in range(1, x.shape[1]+1)]
        yield x

res = pd.concat(group_gen(df)).reset_index()
print(res)
   ID Value1 Value2 Value3
0   1    ABC    BCD    AKB
1   2    CAB    AIK    NaN
2   3    KIB    NaN    NaN
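To make the per-group step concrete, the sketch below applies the same transformation to a single group (ID 2) from the question's sample data. Picking the group with a boolean mask is only for illustration; the answer itself gets each group from the groupby iterator.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3],
    "Value": ["ABC", "BCD", "AKB", "CAB", "AIK", "KIB"],
})

key = 2
x = df[df["ID"] == key]          # the two rows belonging to ID 2

# Transpose: one row labelled 'Value', one column per original row
x = x.set_index("ID").T

# Relabel the single row with the group key and number the columns
x.index = pd.Index([key], name="ID")
x.columns = [f"Value{i}" for i in range(1, x.shape[1] + 1)]
print(x)
```

Each group yields a one-row frame like this with as many ValueN columns as it has rows; pd.concat then aligns the frames by column name and fills the gaps with missing values.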

Answered 2021-09-02
