
Splitting a dataset into training and test sets according to the class distribution


holdtom 2022-10-25 16:12:03
I want to run a machine learning algorithm 10 times on a given dataset with the following class distribution:

np.unique(x[:,24], return_counts=True)
(array([1., 2.]), array([700, 300]))

This means that 70% of my data belongs to class 1 and 30% to class 2. Below is a snapshot of my data. The last column gives the class label (1 or 2):

1,6,4,12,5,5,3,4,1,67,3,2,1,2,1,0,0,1,0,0,1,0,0,1,1
2,48,2,60,1,3,2,2,1,22,3,1,1,1,1,0,0,1,0,0,1,0,0,1,2
4,12,4,21,1,4,3,3,1,49,3,1,2,1,1,0,0,1,0,0,1,0,1,0,1
1,42,2,79,1,4,3,4,2,45,3,1,2,1,1,0,0,0,0,0,0,0,0,1,1
1,24,3,49,1,3,3,4,4,53,3,2,2,1,1,1,0,1,0,0,0,0,0,1,2
4,36,2,91,5,3,3,4,4,35,3,1,2,2,1,0,0,1,0,0,0,0,1,0,1
4,24,2,28,3,5,3,4,2,53,3,1,1,1,1,0,0,1,0,0,1,0,0,1,1
2,36,2,69,1,3,3,2,3,35,3,1,1,2,1,0,1,1,0,1,0,0,0,0,1
4,12,2,31,4,4,1,4,1,61,3,1,1,1,1,0,0,1,0,0,1,0,1,0,1
2,30,4,52,1,1,4,2,3,28,3,2,1,1,1,1,0,1,0,0,1,0,0,0,2
2,12,2,13,1,2,2,1,3,25,3,1,1,1,1,1,0,1,0,1,0,0,0,1,2
1,48,2,43,1,2,2,4,2,24,3,1,1,1,1,0,0,1,0,1,0,0,0,1,2
2,12,2,16,1,3,2,1,3,22,3,1,1,2,1,0,0,1,0,0,1,0,0,1,1
1,24,4,12,1,5,3,4,3,60,3,2,1,1,1,1,0,1,0,0,1,0,1,0,2
1,15,2,14,1,3,2,4,3,28,3,1,1,1,1,1,0,1,0,1,0,0,0,1,1
1,24,2,13,2,3,2,2,3,32,3,1,1,1,1,0,0,1,0,0,1,0,1,0,2
4,24,4,24,5,5,3,4,2,53,3,2,1,1,1,0,0,1,0,0,1,0,0,1,1
1,30,0,81,5,2,3,3,3,25,1,3,1,1,1,0,0,1,0,0,1,0,0,1,1
2,24,2,126,1,5,2,2,4,44,3,1,1,2,1,0,1,1,0,0,0,0,0,0,2
4,24,2,34,3,5,3,2,3,31,3,1,2,2,1,0,0,1,0,0,1,0,0,1,1
4,9,4,21,1,3,3,4,3,48,3,3,1,2,1,1,0,1,0,0,1,0,0,1,1
1,6,2,26,3,3,3,3,1,44,3,1,2,1,1,0,0,1,0,1,0,0,0,1,1
1,10,4,22,1,2,3,3,1,48,3,2,2,1,2,1,0,1,0,1,0,0,1,0,1
2,12,4,18,2,2,3,4,2,44,3,1,1,1,1,0,1,1,0,0,1,0,0,1,1
4,10,4,21,5,3,4,1,3,26,3,2,1,1,2,0,0,1,0,0,1,0,0,1,1
1,6,2,14,1,3,3,2,1,36,1,1,1,2,1,0,0,1,0,0,1,0,1,0,1
4,6,0,4,1,5,4,4,3,39,3,1,1,1,1,0,0,1,0,0,1,0,1,0,1
3,12,1,4,4,3,2,3,1,42,3,2,1,1,1,0,0,1,0,1,0,0,0,1,1
2,7,2,24,1,3,3,2,1,34,3,1,1,1,1,0,0,0,0,0,1,0,0,1,1
1,60,3,68,1,5,3,4,4,63,3,2,1,2,1,0,0,1,0,0,1,0,0,1,2
2,18,2,19,4,2,4,3,1,36,1,1,1,2,1,0,0,1,0,0,1,0,0,1,1
1,24,2,40,1,3,3,2,3,27,2,1,1,1,1,0,0,1,0,0,1,0,0,1,1

The complete dataset can be found here.

I want to split the data into 90% for training and 10% for testing. However, every split must preserve the class proportions (e.g., in the training and validation splits, 70% of the data must belong to class 1 and 30% to class 2).

I know how to simply divide the data into training and test sets, but I don't know how to make the split respect the class distribution described above. How can I do this in Python?

2 Answers

慕码人8056858

Contributed 1803 experience points · received over 6 likes

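You can use RepeatedStratifiedKFold, which, as its name suggests, repeats a stratified K-fold cross-validator n times. To repeat the process 10 times, set n_repeats=10; to get a train/test size ratio of roughly 9:1, set n_splits=10: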

from sklearn.model_selection import RepeatedStratifiedKFold

# `a` is assumed to be the data as a 2-D NumPy array with the class label in the last column
X = a[:, :-1]
y = a[:, -1]

rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=2)

for train_index, test_index in rskf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Share of class 1 in this training fold, followed by the fold sizes
    print(f'\nClass 1: {((y_train==1).sum()/len(y_train))*100:.0f}%')
    print(f'\nShape of train: {X_train.shape[0]}')
    print(f'Shape of test: {X_test.shape[0]}')

Output:

Class 1: 73%

Shape of train: 33
Shape of test: 4

Class 1: 73%

Shape of train: 33
Shape of test: 4

Class 1: 73%

Shape of train: 33
Shape of test: 4

Class 1: 73%

Shape of train: 33
Shape of test: 4

...
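To see how this behaves on data shaped like the full dataset in the question, here is a minimal sketch on synthetic data. The feature matrix is random and purely illustrative; only the 700/300 label split is taken from the question. It also shows that n_splits=10 combined with n_repeats=10 yields 100 train/test pairs in total, not 10.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Synthetic stand-in for the real data (assumption): 1000 rows, 24 features,
# 700 samples of class 1 and 300 of class 2, as reported in the question.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 24))
y = np.array([1] * 700 + [2] * 300)

rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=2)

# Total number of train/test pairs: n_splits * n_repeats
n_total = rskf.get_n_splits(X, y)

test_sizes = []
class1_test_counts = []
for train_index, test_index in rskf.split(X, y):
    test_sizes.append(len(test_index))
    # Number of class-1 samples that landed in this test fold
    class1_test_counts.append(int((y[test_index] == 1).sum()))

print("total splits:", n_total)
print("distinct test sizes:", sorted(set(test_sizes)))
print("distinct class-1 counts in test folds:", sorted(set(class1_test_counts)))
```

Because both class counts here are divisible by 10, every test fold should contain 100 samples with the same 70/30 mix. On smaller or less evenly divisible data the proportions are only approximately preserved, which is consistent with the 73% figure in the output above.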




Answered 2022-10-25
精慕HU

Contributed 1845 experience points · received over 8 likes

A well-known way to split data into training and test sets is scikit-learn's train_test_split.

See the API documentation for model_selection.train_test_split.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

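You can vary the random_state argument (the seed) until the proportion between your classes comes out right. Although train_test_split does not enforce the proportions by default, the split usually follows the population proportions approximately.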
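As an alternative to trying different seeds, train_test_split accepts a stratify argument that makes the split follow the class labels directly, and StratifiedShuffleSplit produces a chosen number of independent stratified splits. The sketch below is not part of the original answer and uses synthetic data: the feature values are placeholders, and only the 700/300 label counts come from the question.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic stand-in for the real data (assumption): 1000 rows, 24 features,
# 700 samples of class 1 and 300 of class 2.
X = np.arange(1000 * 24, dtype=float).reshape(1000, 24)
y = np.array([1] * 700 + [2] * 300)

# Single 90/10 split that preserves the class proportions via stratify=y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42
)
single_test_class1 = int((y_test == 1).sum())
print("single split:", len(y_train), len(y_test), single_test_class1)

# Ten independent stratified 90/10 splits, one per run of the algorithm
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.10, random_state=42)
run_summaries = []
for train_index, test_index in sss.split(X, y):
    y_tr, y_te = y[train_index], y[test_index]
    # (train size, test size, class-1 count in train, class-1 count in test)
    run_summaries.append(
        (len(y_tr), len(y_te), int((y_tr == 1).sum()), int((y_te == 1).sum()))
    )

for summary in run_summaries:
    print(summary)
```

Unlike K-fold, the test sets drawn by StratifiedShuffleSplit may overlap between runs, which is usually acceptable when the goal is simply 10 repeated evaluations at a fixed 90/10 ratio.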


Answered 2022-10-25