Pandas: group by the first column and join comma-separated entries from the second column

I have a Pandas data frame with about 5 million rows and 2 columns, "top_level_domain" and "category". I want to create a new data frame with distinct top_level_domain values and a comma-separated category column that contains only unique categories. Some rows in this data frame already hold comma-separated categories. Other domains, such as google, have repeated categories, but I only want each one once.

Data frame: df1

        top_level_domain      category
    1   google.com            Search Engines
    2   service-now.com       Business, Software/Hardware
    3   google-analytics.com  Internet Services
    4   live.com              None Assigned
    5   google.com            Content Server
    6   google.com            Search Engines
    7   inspectlet.com        Internet Services
    8   doubleclick.net       Online Shopping, Web Ads
    9   google.com            Search Engines
    10  doubleclick.net       Ads

Expected output: df2

        top_level_domain      category
    1   google.com            Search Engines, Content Server
    2   service-now.com       Business, Software/Hardware
    3   google-analytics.com  Internet Services
    4   live.com              None Assigned
    7   inspectlet.com        Internet Services
    8   doubleclick.net       Online Shopping, Web Ads, Ads

What is the best way to achieve this? I have tried all the examples in "Pandas groupby multiple columns, list of multiple columns" and others like the one below, but I still get duplicates in the category column.

    distinct_category = distinct_category.groupby('top_level_domain')['category'].agg(lambda x: ', '.join(set(x))).reset_index()

But I get duplicates in the column:

    1  zoho.com        Online Shopping, Interactive Web Applications, Interactive Web Applications, Interactive Web Applications, Motor Vehicles
    1  zohopublic.com  Internet Services, Motor Vehicles, Internet Services, Online Shopping, Internet Services


3 Answers

小唯快跑啊

Contributed 1863 experience points, received 2+ likes

First expand your data frame so that each row contains only one category:

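import numpy as np
import pandas as pd

# Split every cell into a list of categories and count the items per row.
split = df['category'].str.split(', ')
lens = split.str.len()
# Repeat each domain once per category and flatten the lists into one column.
df = pd.DataFrame({'top_level_domain': np.repeat(df['top_level_domain'].values, lens),
                   'category': np.concatenate(split.values)})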
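As an alternative to the np.repeat / np.concatenate expansion above, newer pandas versions (0.25 and later) offer DataFrame.explode, which does the same row expansion in one call. The following is only a minimal sketch on a few rows taken from the question; the variable name `expanded` is an assumption made for illustration.

```python
import pandas as pd

# A few rows taken from the question's sample data.
df = pd.DataFrame({
    "top_level_domain": ["google.com", "doubleclick.net", "google.com", "doubleclick.net"],
    "category": ["Search Engines", "Online Shopping, Web Ads", "Content Server", "Ads"],
})

# Turn each cell into a list of categories, then give every list item its own row.
expanded = (
    df.assign(category=df["category"].str.split(", "))
      .explode("category")
      .reset_index(drop=True)
)
print(expanded)
```

Each domain is repeated once per category it carried, so a later drop_duplicates() compares single categories rather than whole comma-separated strings.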

Then drop the duplicates and use agg with str.join:

# Keep one row per (domain, category) pair, then join each domain's categories.
res = (df.drop_duplicates()
         .groupby('top_level_domain')['category']
         .agg(', '.join))
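To see the two steps working together, here is a sketch that runs them end to end on the ten sample rows from the question. The names `df1`, `expanded` and `df2` are chosen for illustration, and `sort=False` plus `reset_index()` are additions (not part of the answer) so that domains stay in first-appearance order and the result is a two-column data frame like the expected output.

```python
import numpy as np
import pandas as pd

# The sample data from the question.
df1 = pd.DataFrame({
    "top_level_domain": [
        "google.com", "service-now.com", "google-analytics.com", "live.com",
        "google.com", "google.com", "inspectlet.com", "doubleclick.net",
        "google.com", "doubleclick.net",
    ],
    "category": [
        "Search Engines", "Business, Software/Hardware", "Internet Services",
        "None Assigned", "Content Server", "Search Engines", "Internet Services",
        "Online Shopping, Web Ads", "Search Engines", "Ads",
    ],
}, index=range(1, 11))

# Step 1: one row per (domain, single category).
split = df1["category"].str.split(", ")
lens = split.str.len()
expanded = pd.DataFrame({
    "top_level_domain": np.repeat(df1["top_level_domain"].to_numpy(), lens),
    "category": np.concatenate(split.to_numpy()),
})

# Step 2: drop repeated pairs, then join the remaining categories per domain.
df2 = (
    expanded.drop_duplicates()
            .groupby("top_level_domain", sort=False)["category"]
            .agg(", ".join)
            .reset_index()
)
print(df2)
```

On this sample, google.com ends up with "Search Engines, Content Server" and doubleclick.net with "Online Shopping, Web Ads, Ads".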

2021-10-12

扬帆大鱼

Contributed 1799 experience points, received 9+ likes

The following code worked for me:

df = df.groupby('top_level_domain')['category'].agg([('category', ', '.join)]).reset_index()
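Note that the one-liner above joins each group's strings as they are, so a category that appears in several rows, or inside a longer comma-separated cell, would still be repeated. This is also a plausible reason why set(x) in the question did not help: it only removes identical whole strings. A possible variation, sketched here with a hypothetical helper `join_unique` and made-up zoho.com rows that mimic the question's output, splits every cell before de-duplicating:

```python
import pandas as pd

def join_unique(values):
    """Join the distinct categories found in a group's cells, in first-seen order."""
    seen = {}  # dict keys keep insertion order, so this acts as an ordered set
    for cell in values.dropna():
        for item in cell.split(","):
            item = item.strip()
            if item:
                seen[item] = None
    return ", ".join(seen)

# Hypothetical rows: three different strings that share one category.
df = pd.DataFrame({
    "top_level_domain": ["zoho.com", "zoho.com", "zoho.com"],
    "category": [
        "Online Shopping, Interactive Web Applications",
        "Interactive Web Applications",
        "Interactive Web Applications, Motor Vehicles",
    ],
})

result = (
    df.groupby("top_level_domain", sort=False)["category"]
      .agg(join_unique)
      .reset_index()
)
print(result)
```

Because the helper works per group, it avoids building the expanded data frame, though on millions of rows a Python-level function per group may be slower than the vectorized expand-and-drop approach.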

2021-10-12
