1 Answer
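Using a few tricks:
- Use pd.factorize() to convert the categorical columns into an integer code per category.
- Compute a value (factor) f that represents each group/subgroup pair.
- Randomise f a little with np.random.uniform(), using a minimum and maximum close to 1.
- Once there is a value that represents the grouping, sort_values() and reset_index() give a clean, ordered index.
- Finally, assign the bins by taking the integer remainder of the index.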
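The steps above can be traced on a tiny frame before reading the full code. This is only an illustrative sketch: the toy data, the seed, and the names toy, toy_binned and rng are assumptions made for the example and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): two groups, two subgroups, two rows per pair.
toy = pd.DataFrame({
    "group":    ["A", "A", "B", "B", "A", "B", "A", "B"],
    "subgroup": ["a", "b", "a", "b", "a", "a", "b", "b"],
})

# Step 1: factorize maps each distinct label to an integer code,
# numbered in order of first appearance.
g_codes, g_labels = pd.factorize(toy["group"])
s_codes, s_labels = pd.factorize(toy["subgroup"])

# Steps 2 and 3: combine the codes so each (group, subgroup) pair gets its own
# value, and jitter the subgroup part so rows inside a pair are shuffled.
rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
f = (g_codes + 1) * 10 + (s_codes + 1) * rng.uniform(0.99, 1.01, len(toy))

# Step 4: sort by f and renumber the rows.
# Step 5: deal the rows into bins round-robin with the index remainder.
bins = 2
toy_binned = (
    toy.assign(f=f)
    .sort_values("f")
    .reset_index(drop=True)
    .assign(gc=lambda d: d.index % bins)
    .drop(columns="f")
)
print(toy_binned)
```

Because rows of the same pair end up next to each other after sorting, the remainder spreads each pair across the bins as evenly as possible; here every pair contributes one row to each of the two bins.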
import random
import numpy as np
import pandas as pd

group = list("ABCD")
subgroup = list("abcdef")
# 300 rows, each with a random group, subgroup and value
df = pd.DataFrame([{"group": random.choice(group),
                    "subgroup": random.choice(subgroup),
                    "value": random.randint(1, 3)} for i in range(300)])
bins = 6
dfc = df.assign(
    # take into account concentration of group and subgroup
    # randomise a bit....
    f=((pd.factorize(df["group"])[0] + 1) * 10
       + (pd.factorize(df["subgroup"])[0] + 1)
       * np.random.uniform(0.99, 1.01, len(df))
       ),
).sort_values("f").reset_index(drop=True).assign(
    gc=lambda dfa: dfa.index % bins
).drop(columns="f")
# check distribution ... used plot for SO
dfc.groupby(["gc","group","subgroup"]).count().unstack(0).plot(kind="barh")
# every group same size...
# dfc.groupby("gc").count()
# now it's easy to get each of the cuts.... 0 through 5
# dfcut0 = dfc.query("gc==0").drop(columns="gc").copy().reset_index(drop=True)
# dfcut0
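
As a numeric alternative to the plot, the balance of the cuts can be checked directly. The sketch below rebuilds the same kind of data with a seeded NumPy generator (an assumption made for repeatability; the answer itself uses the random module) and introduces the names bin_sizes, per_pair and spread purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seed chosen arbitrarily for the sketch
n = 300
df = pd.DataFrame({
    "group": rng.choice(list("ABCD"), n),
    "subgroup": rng.choice(list("abcdef"), n),
    "value": rng.integers(1, 4, n),  # upper bound is exclusive, so values are 1..3
})

bins = 6
# Same factor as in the answer: group code * 10 + jittered subgroup code.
f = ((pd.factorize(df["group"])[0] + 1) * 10
     + (pd.factorize(df["subgroup"])[0] + 1) * rng.uniform(0.99, 1.01, n))
dfc = (
    df.assign(f=f)
    .sort_values("f")
    .reset_index(drop=True)
    .assign(gc=lambda d: d.index % bins)
    .drop(columns="f")
)

# Rows per cut: 300 rows dealt round-robin into 6 bins gives 50 each.
bin_sizes = dfc.groupby("gc").size()
print(bin_sizes.tolist())

# Rows per (group, subgroup) pair in each cut, with missing combinations as 0.
per_pair = (
    dfc.groupby(["group", "subgroup", "gc"]).size().unstack("gc", fill_value=0)
)
# Largest difference between cuts for any single pair.
spread = per_pair.max(axis=1) - per_pair.min(axis=1)
print(spread.max())
```

Each pair occupies a contiguous run of rows after sorting, so its counts in any two cuts differ by at most one. This holds while the pair values cannot overlap, which is the case here because there are fewer than ten subgroups and the group code is multiplied by 10.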