A cross merge of 4K rows is manageable (it produces roughly 16M rows). Let's try a cross merge followed by a query:
import pandas as pd

n = 2

# dummy key so that every row is merged with every other row
df['dummy'] = 1
# dense rank of 'member' serves as the member group number
df['rank'] = df['member'].rank(method='dense')
# cross merge, then keep pairs whose groups are 1..n ranks apart
new_df = (df.merge(df, on='dummy')
            .query('rank_x < rank_y <= rank_x + @n')
         )
# Euclidean distance
dist = (new_df[['x_x','y_x','z_x']].sub(new_df[['x_y','y_y','z_y']].values)**2).sum(1)**.5
# output dataframe with member labels
pd.DataFrame({'member1': new_df['member_x'], 'member2': new_df['member_y'],
              'dist': dist})
Output:
member1 member2 dist
2 0 1 2.449490
3 0 1 1.414214
4 0 2 1.414214
5 0 2 1.732051
12 0 1 2.236068
13 0 1 3.000000
14 0 2 2.236068
15 0 2 2.828427
24 1 2 3.162278
25 1 2 3.000000
26 1 5 8.485281
27 1 5 4.690416
34 1 2 1.414214
35 1 2 1.000000
36 1 5 5.477226
37 1 5 6.164414
46 2 5 5.477226
47 2 5 6.164414
48 2 6 3.000000
49 2 6 1.414214
56 2 5 5.744563
57 2 5 6.557439
58 2 6 4.000000
59 2 6 1.000000
68 5 6 5.744563
69 5 6 6.633250
78 5 6 5.916080
79 5 6 5.830952
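For reference, here is a self-contained sketch of the same cross-merge idea on a small made-up frame. The toy data and the helper name `pairwise_group_distances` are assumptions for illustration only. It also swaps the dummy key for `how='cross'`, which should be available in pandas 1.2 and later.

```python
import numpy as np
import pandas as pd

def pairwise_group_distances(df, n=2):
    """Distances between points whose member groups are 1..n dense ranks apart.

    Hypothetical helper; assumes columns 'member', 'x', 'y', 'z'.
    """
    ranked = df.assign(rank=df['member'].rank(method='dense'))
    # how='cross' builds the Cartesian product without a dummy key
    pairs = ranked.merge(ranked, how='cross', suffixes=('_x', '_y'))
    keep = (pairs['rank_x'] < pairs['rank_y']) & (pairs['rank_y'] <= pairs['rank_x'] + n)
    pairs = pairs[keep]
    diff = (pairs[['x_x', 'y_x', 'z_x']].to_numpy()
            - pairs[['x_y', 'y_y', 'z_y']].to_numpy())
    return pd.DataFrame({
        'member1': pairs['member_x'].to_numpy(),
        'member2': pairs['member_y'].to_numpy(),
        'dist': np.sqrt((diff ** 2).sum(axis=1)),
    })

# made-up data: two points in member 0, one point each in members 1, 2 and 5
toy = pd.DataFrame({
    'member': [0, 0, 1, 2, 5],
    'x': [0.0, 1.0, 0.0, 3.0, 0.0],
    'y': [0.0, 0.0, 2.0, 4.0, 0.0],
    'z': [0.0, 0.0, 0.0, 0.0, 1.0],
})
result = pairwise_group_distances(toy, n=1)
print(result)
```

With `n=1` each group is paired only with the next group by rank, so the toy frame yields four rows.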
Option 2: if the DataFrame is larger, a loop may work well enough:
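import numpy as np
from scipy.spatial.distance import cdist

ret = []
for i in sorted(set(df['rank'])):
    this_group = df['rank'] == i
    # groups with rank in (i, i+n], matching the filter used in the cross merge;
    # the string form of `inclusive` needs pandas 1.3 or later
    other_groups = df['rank'].between(i, i + n, inclusive='right')
    t = df.loc[this_group, ['x','y','z']].values
    o = df.loc[other_groups, ['x','y','z']].values
    ret.append(cdist(t, o).ravel())
dist = np.concatenate(ret)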
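The loop above collects only the distances. If the member labels are needed as well, one possible variant is sketched below. The helper name `loop_group_distances` and the toy data are assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

def loop_group_distances(df, n=2):
    """Loop over member groups and pair each with the next n groups by dense rank.

    Hypothetical helper; assumes columns 'member', 'x', 'y', 'z'.
    """
    rank = df['member'].rank(method='dense')
    coords = ['x', 'y', 'z']
    parts = []
    for i in sorted(rank.unique()):
        this_group = df[rank == i]
        others = df[(rank > i) & (rank <= i + n)]
        if others.empty:
            continue  # the last group has nothing after it
        d = cdist(this_group[coords].to_numpy(), others[coords].to_numpy())
        parts.append(pd.DataFrame({
            # d.ravel() is row-major: each row of this_group meets every row of others
            'member1': np.repeat(this_group['member'].to_numpy(), len(others)),
            'member2': np.tile(others['member'].to_numpy(), len(this_group)),
            'dist': d.ravel(),
        }))
    return pd.concat(parts, ignore_index=True)

# made-up data: two points in member 0, one point each in members 1, 2 and 5
toy = pd.DataFrame({
    'member': [0, 0, 1, 2, 5],
    'x': [0.0, 1.0, 0.0, 3.0, 0.0],
    'y': [0.0, 0.0, 2.0, 4.0, 0.0],
    'z': [0.0, 0.0, 0.0, 0.0, 1.0],
})
result = loop_group_distances(toy, n=1)
print(result)
```

Only one group's distance matrix is in memory at a time, which is the main reason to prefer the loop when the full cross merge would be too large.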