
Tricky cascading grouping in pandas

Python

汪汪一只猫 2023-07-11 18:15:25

I'm trying to solve an odd problem in pandas. Suppose I have a bunch of objects that can be grouped in different ways. Here is what our DataFrame looks like:

    df = pd.DataFrame([
        {'obj': 'Ball',    'group1_id': None, 'group2_id': '7'},
        {'obj': 'Balloon', 'group1_id': '92', 'group2_id': '7'},
        {'obj': 'Person',  'group1_id': '14', 'group2_id': '11'},
        {'obj': 'Bottle',  'group1_id': '3',  'group2_id': '7'},
        {'obj': 'Thought', 'group1_id': '3',  'group2_id': None},
    ])

    obj      group1_id  group2_id
    Ball     None       7
    Balloon  92         7
    Person   14         11
    Bottle   3          7
    Thought  3          None

I want to group things together based on any of the groups. Here it is annotated:

    obj      group1_id  group2_id  # annotated
    Ball     None       7          # group2_id = 7
    Balloon  92         7          # group1_id = 92 OR group2_id = 7
    Person   14         11         # group1_id = 14 OR group2_id = 11
    Bottle   3          7          # group1_id = 3 OR group2_id = 7
    Thought  3          None       # group1_id = 3

Once combined, our output should look like this:

    count  objs                              composite_id
    4      [Ball, Balloon, Bottle, Thought]  g1=3,92|g2=7
    1      [Person]                          g1=14|g2=11

Note that the first three objects are grouped because they share group2_id=7. The fourth object, Thought, joins them because it matches another item (Bottle) via group1_id=3, and that item carries group2_id=7.

Note: for this question, assume an item will only ever belong to one combined group (there will never be a case where it could belong to two).

How can I do this in pandas?


2 Answers

郎朗坤

Contributed 1921 experience points · more than 9 upvotes

This isn't odd at all: it's a network (graph) problem.

import pandas as pd
import networkx as nx

# Handle the missing values first: fill each missing id with the other id
# from the same row, so that rows are not classified into the wrong group.
df['key1'] = df['group1_id'].fillna(df['group2_id'])
df['key2'] = df['group2_id'].fillna(df['group1_id'])

# Build the network: every row becomes an edge between its two ids.
G = nx.from_pandas_edgelist(df, 'key1', 'key2')

# Each connected component is one combined group; number the components
# and map every id to the number of the component it belongs to.
components = list(nx.connected_components(G))
d = {node: i for i, component in enumerate(components) for node in component}

# Use the dict above to map ids of the same group to the same number,
# then group by that number.
out = df.groupby(df.key1.map(d)).agg(
    objs=('obj', list),
    Count=('obj', 'count'),
    g1=('group1_id', lambda x: set(x[x.notnull()].tolist())),
    g2=('group2_id', lambda x: set(x[x.notnull()].tolist())),
)

# Note: the composite id is not converted to a string here. The ids are kept
# in separate columns (g1 and g2), which is easier to read.

Out[53]:

                                  objs  Count       g1    g2
key1
0     [Ball, Balloon, Bottle, Thought]      4  {92, 3}   {7}
1                             [Person]      1     {14}  {11}
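The question asked for a single composite_id string such as g1=3,92|g2=7, while the result above keeps the ids as sets in two columns. A minimal sketch of the conversion follows. The helper name `composite_id` is hypothetical (it is not part of the answer), and the numeric sort assumes every id is a numeric string, as in the sample data.

```python
def composite_id(g1_ids, g2_ids):
    """Format two sets of ids as 'g1=<ids>|g2=<ids>'."""
    # Assumption: ids are numeric strings, so sort them by numeric value
    # to get a stable order (sets have no defined order).
    g1 = ",".join(sorted(g1_ids, key=int))
    g2 = ",".join(sorted(g2_ids, key=int))
    return f"g1={g1}|g2={g2}"


# The two groups from the output above
first = composite_id({"92", "3"}, {"7"})
second = composite_id({"14"}, {"11"})
print(first)   # g1=3,92|g2=7
print(second)  # g1=14|g2=11
```

With the `out` frame from the answer, this could probably be applied row by row, for example `out.apply(lambda r: composite_id(r['g1'], r['g2']), axis=1)`.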

Answered 2023-07-11

红糖糍粑

Contributed 1815 experience points · more than 6 upvotes

Here is a more verbose solution, in which I build a "first key" mapping for the grouped sets:

from collections import defaultdict

import pandas as pd

# using four id fields instead of 2
# (the sample frame in the question only has the first two)
grouping_fields = ['group1_id', 'group2_id', 'group3_id', 'group4_id']

# keep only the rows that have at least one id
id_fields = df.loc[df[grouping_fields].notnull().any(axis=1), grouping_fields]

# build a set of all similarly-grouped items
# and use the 'first seen' as the grouping key for that
FIRST_SEEN_TO_ALL = defaultdict(set)
KEY_TO_FIRST_SEEN = {}

for row in id_fields.to_dict('records'):
    # NaN is truthy (it is a non-zero float), so a plain boolean check
    # does not drop it; filter it out explicitly
    keys = [key for key in row.values() if pd.notna(key) and key]
    # reuse the group of any key seen before; otherwise this row's first key
    # starts a new group. Overwriting an existing mapping here would split
    # rows that were already grouped.
    row_id = next((KEY_TO_FIRST_SEEN[k] for k in keys if k in KEY_TO_FIRST_SEEN), keys[0])
    for key in keys:
        KEY_TO_FIRST_SEEN.setdefault(key, row_id)
        FIRST_SEEN_TO_ALL[row_id].add(key)
    # limitation: a row that links two groups that were both seen earlier
    # does not merge them

def fetch_group_id(row):
    # return the group key of the first id that has one
    for key in filter(None, row.dropna()):
        first_seen_key = KEY_TO_FIRST_SEEN.get(key)
        if first_seen_key:
            return first_seen_key

df['group_super'] = df[grouping_fields].apply(fetch_group_id, axis=1)
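The mapping above resolves groups in a single pass over the rows, so a row that bridges two groups that were already seen separately may need extra handling. One common alternative is a union-find (disjoint-set) structure, sketched below in plain Python. This is not the answerer's method. The names `find`, `union`, and `row_nodes` are hypothetical helpers, missing ids are assumed to be `None`, and each id is paired with its column name so that an id in group1_id is never confused with the same value in group2_id.

```python
from collections import defaultdict

# Sample rows from the question (assumption: missing ids are None)
rows = [
    {"obj": "Ball", "group1_id": None, "group2_id": "7"},
    {"obj": "Balloon", "group1_id": "92", "group2_id": "7"},
    {"obj": "Person", "group1_id": "14", "group2_id": "11"},
    {"obj": "Bottle", "group1_id": "3", "group2_id": "7"},
    {"obj": "Thought", "group1_id": "3", "group2_id": None},
]
GROUPING_FIELDS = ["group1_id", "group2_id"]

parent = {}  # node -> parent node; a root is its own parent


def find(node):
    """Return the root of node's set, registering the node if it is new."""
    parent.setdefault(node, node)
    while parent[node] != node:
        parent[node] = parent[parent[node]]  # path halving
        node = parent[node]
    return node


def union(a, b):
    """Merge the sets that contain a and b."""
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a


def row_nodes(row):
    """All (column, id) pairs present in a row."""
    return [(f, row[f]) for f in GROUPING_FIELDS if row[f] is not None]


# First pass: every row links all of its ids into one set
for row in rows:
    nodes = row_nodes(row)
    for node in nodes:
        find(node)
    for other in nodes[1:]:
        union(nodes[0], other)

# Second pass: collect the objects under the root of their first id
groups = defaultdict(list)
for row in rows:
    nodes = row_nodes(row)
    if nodes:
        groups[find(nodes[0])].append(row["obj"])

grouped_objs = sorted(groups.values(), key=len, reverse=True)
print(grouped_objs)  # [['Ball', 'Balloon', 'Bottle', 'Thought'], ['Person']]
```

Because merging happens on roots rather than on the first key seen, the order of the rows should not affect which objects end up together.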

Answered 2023-07-11
