3 Answers

Contributor with 1,828 experience points, 4+ upvotes
Let's try get_dummies followed by dot:
df.animals.str.get_dummies(',').T.dot(df.id.astype(str)+',').str[:-1]
Out[307]:
cat        1,3,4
dog        1,2,4
dolphin      3,5
hamster        5
dtype: object
If the result should follow the order of the animals list, add reindex:
df.animals.str.get_dummies(',').T.dot(df.id.astype(str)+',').str[:-1].reindex(animals)
Out[308]:
cat        1,3,4
dog        1,2,4
hamster        5
dolphin      3,5
dtype: object
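
To see why the dot product yields comma-joined IDs, the arithmetic can be mimicked in plain Python. This is only an illustrative sketch, not how pandas computes it internally: the `indicator` rows are written out by hand from the sample data, and `dot_strings` is a hypothetical helper name. It relies on the fact that multiplying a string by 0 gives an empty string and multiplying by 1 gives the string itself.

```python
# Illustration only: mimics the arithmetic behind the dot trick in plain Python.
ids = ["1,", "2,", "3,", "4,", "5,"]  # stands in for df.id.astype(str) + ','

# Hand-written stand-in for get_dummies(',').T on the sample data:
# one row per animal, one 0/1 flag per DataFrame row.
indicator = {
    "cat":     [1, 0, 1, 1, 0],
    "dog":     [1, 1, 0, 1, 0],
    "dolphin": [0, 0, 1, 0, 1],
    "hamster": [0, 0, 0, 0, 1],
}


def dot_strings(flags, strings):
    """Hypothetical helper: 'dot product' of 0/1 flags with strings."""
    # flag * s is '' when flag == 0 and s when flag == 1,
    # so joining the products concatenates only the matching IDs.
    return "".join(flag * s for flag, s in zip(flags, strings))


# [:-1] drops the trailing comma, like .str[:-1] in the one-liner.
result = {animal: dot_strings(flags, ids)[:-1] for animal, flags in indicator.items()}
print(result)
```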

Contributor with 1,862 experience points, 7+ upvotes
A NumPy-based approach, aimed at performance:
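import numpy as np
import pandas as pd

def list_occ(df):
    # Relies on a global list `animals` that contains every item found in df.
    id_col = 'id'
    item_col = 'animals'
    sidx = np.argsort(animals)
    s = [i.split(',') for i in df[item_col]]
    d = np.concatenate(s)
    # Position of each flattened item within `animals`
    p = sidx[np.searchsorted(animals, d, sorter=sidx)]
    # Number of occurrences of each animal
    C = np.bincount(p, minlength=len(animals))
    l = list(map(len, s))
    # Row position of each flattened item
    r = np.repeat(np.arange(len(l)), l)
    # IDs ordered by animal position first, then by row position
    v = df[id_col].values[r[np.lexsort((r, p))]]
    out = pd.DataFrame({'ids': np.split(v, C[:-1].cumsum())}, index=animals)
    return out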
Sample run:
In [41]: df
Out[41]:
   id          animals
0   1          dog,cat
1   2              dog
2   3      cat,dolphin
3   4          cat,dog
4   5  hamster,dolphin

In [42]: animals
Out[42]: ['cat', 'dog', 'hamster', 'dolphin']

In [43]: list_occ(df)
Out[43]:
               ids
cat      [1, 3, 4]
dog      [1, 2, 4]
hamster        [5]
dolphin     [3, 5]
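
For checking the vectorized function on small inputs, a plain-Python equivalent can serve as a reference. This is a minimal sketch, not part of the original answer: `list_occ_reference` is a hypothetical name, and it takes `(id, items)` pairs and the `animals` list as explicit arguments rather than reading a DataFrame and a global.

```python
from collections import defaultdict


def list_occ_reference(rows, animals):
    """Hypothetical reference version.

    rows: iterable of (id, 'a,b,c') pairs.
    animals: list giving the desired output order.
    """
    ids_by_animal = defaultdict(list)
    for row_id, items in rows:
        for item in items.split(","):
            ids_by_animal[item].append(row_id)
    # Follow the order of `animals`; animals that never occur map to [].
    return {animal: ids_by_animal.get(animal, []) for animal in animals}


rows = [
    (1, "dog,cat"),
    (2, "dog"),
    (3, "cat,dolphin"),
    (4, "cat,dog"),
    (5, "hamster,dolphin"),
]
animals = ["cat", "dog", "hamster", "dolphin"]
expected = list_occ_reference(rows, animals)
print(expected)
```

With a DataFrame at hand, the pairs could be produced with `zip(df['id'], df['animals'])` and the result compared row by row against the output of `list_occ(df)`.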
Benchmarking
Using the given sample and simply scaling up the number of items.
# Setup
N = 100 # scale factor
s = [i.split(',') for i in df['animals']]
df_big = pd.DataFrame({'animals':[[j+str(ID) for j in i] for i in s for ID in range(1,N+1)]})
df_big['id'] = range(1, len(df_big)+1)
animals = np.unique(np.concatenate(df_big.animals)).tolist()
df_big['animals'] = [','.join(i) for i in df_big.animals]
df = df_big
Timings:
# Using given df & scaling it up by replicating elems with progressive IDs
In [9]: N = 100 # scale factor
...: s = [i.split(',') for i in df['animals']]
...: df_big = pd.DataFrame({'animals':[[j+str(ID) for j in i] for i in s for ID in range(1,N+1)]})
...: df_big['id'] = range(1, len(df_big)+1)
...: animals = np.unique(np.concatenate(df_big.animals)).tolist()
...: df_big['animals'] = [','.join(i) for i in df_big.animals]
...: df = df_big
# @BEN_YO's soln-1
In [10]: %timeit df.animals.str.get_dummies(',').T.dot(df.id.astype(str)+',').str[:-1]
163 ms ± 2.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# @BEN_YO's soln-2
In [11]: %timeit df.animals.str.get_dummies(',').T.dot(df.id.astype(str)+',').str[:-1].reindex(animals)
166 ms ± 4.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# @Andy L.'s soln
%timeit (df.astype(str).assign(animals=df.animals.str.split(',')).explode('animals').groupby('animals').id.agg(','.join).reset_index())
13.4 ms ± 74 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# list_occ (the NumPy-based function from this answer)
In [12]: %timeit list_occ(df)
2.81 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Contributor with 1,785 experience points, 4+ upvotes
Use str.split, explode, and agg with ','.join:
df_final = (df.astype(str).assign(animals=df.animals.str.split(','))
              .explode('animals').groupby('animals').id.agg(','.join)
              .reset_index())
Out[155]:
   animals     id
0      cat  1,3,4
1      dog  1,2,4
2  dolphin    3,5
3  hamster      5
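
If the IDs are needed as integers rather than a joined string, a variant of the same split/explode/groupby chain can aggregate into lists instead. This is a sketch under assumptions, not the answerer's code: it rebuilds the sample frame inline, skips the `astype(str)` step, and needs a pandas version that provides `DataFrame.explode` (0.25 or later).

```python
import pandas as pd

# Sample data rebuilt from the thread.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "animals": ["dog,cat", "dog", "cat,dolphin", "cat,dog", "hamster,dolphin"],
})

ids_by_animal = (
    df.assign(animals=df["animals"].str.split(","))  # 'dog,cat' -> ['dog', 'cat']
      .explode("animals")                            # one row per (id, animal) pair
      .groupby("animals")["id"]
      .agg(list)                                     # collect the IDs per animal
)
print(ids_by_animal)
```

The groupby sorts its keys, so the index comes out alphabetically; appending `.reindex(animals)` with the `animals` list would restore a custom order, as in the first answer.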