
Numpy array manipulation based on internal values

Python

胡说叔叔 2021-11-30 19:22:43

I am trying to accomplish an odd task. I need to do the following without sklearn, preferably using numpy:

Given a dataset, split the data into 5 equal "folds", or partitions.

Within each partition, split the data into "training" and "testing" sets, with an 80/20 split.

Here is the catch: the dataset is labeled by class. Take a dataset with 100 instances, where class A has 33 samples and class B has 67 samples. I should create 5 folds of 20 data instances each, where in each fold class A has 6 or 7 of the values (about 1/3) and class B has the rest.

My problem is this: I don't know how to properly return the test and training sets for each fold, even though I can split the data appropriately, and, more importantly, I don't know how to incorporate the proper division of the number of elements per class.

My current code is here. It is commented where I got stuck:

    import numpy

    def csv_to_array(file):
        # Open the file, and load it in delimiting on the ',' for a comma separated value file
        data = open(file, 'r')
        data = numpy.loadtxt(data, delimiter=',')

        # Loop through the data in the array
        for index in range(len(data)):
            # Utilize a try catch to try and convert to float, if it can't convert to float, converts to 0
            try:
                data[index] = [float(x) for x in data[index]]
            except Exception:
                data[index] = 0
            except ValueError:
                data[index] = 0

        # Return the now type-formatted data
        return data

    def five_cross_fold_validation(dataset):
        # print("DATASET", dataset)
        numpy.random.shuffle(dataset)
        num_rows = dataset.shape[0]
        split_mark = int(num_rows / 5)
        folds = []
        temp1 = dataset[:split_mark]
        # print("TEMP1", temp1)
        temp2 = dataset[split_mark:split_mark*2]
        # print("TEMP2", temp2)
        temp3 = dataset[split_mark*2:split_mark*3]
        # print("TEMP3", temp3)
        temp4 = dataset[split_mark*3:split_mark*4]
        # print("TEMP4", temp4)
        temp5 = dataset[split_mark*4:]
        # print("TEMP5", temp5)
        folds.append(temp1)
        folds.append(temp2)
        folds.append(temp3)
        folds.append(temp4)
        folds.append(temp5)
        # folds = numpy.asarray(folds)

        for fold in folds:
            # fold = numpy.asarray(fold)
            num_rows = fold.shape[0]
            split_mark = int(num_rows * .8)
            fold_training = fold[split_mark:]
            fold_testing = fold[:split_mark]


1 Answer

互换的青春


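Edit: I replaced np.random.shuffle(A) with A = np.random.permutation(A). The only difference is that it does not mutate the input array. That makes no difference in this code, but it is generally safer.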
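
To make the difference concrete, here is a minimal sketch on a small made-up array (the array contents, the fixed seed, and the variable names are illustrative assumptions, not part of the original code):

```python
import numpy as np

np.random.seed(0)  # fixed seed only so the sketch is repeatable

A = np.arange(10).reshape(5, 2)
snapshot = A.copy()

# permutation() returns a shuffled copy (rows reordered along axis 0)
B = np.random.permutation(A)
input_untouched = np.array_equal(A, snapshot)  # A itself has not changed

# shuffle() reorders the rows of A in place and returns None
returned = np.random.shuffle(A)

print(input_untouched, returned)
```

After the last call, A holds the same rows as before, but their order may differ from the snapshot.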

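The idea is to sample the input randomly using numpy.random.permutation. Once the rows are shuffled, we just iterate over all the possible test sets (a sliding window of the desired size, here 20% of the input size). The corresponding training set is simply made up of all the remaining rows.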
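
The windowing idea can also be sketched on row indices alone. This is a variant under stated assumptions, not the code used in this answer: the helper name index_folds and its parameters are hypothetical, and np.array_split is swapped in so that a row count that is not a multiple of the fold count still gives exactly five windows of nearly equal size.

```python
import numpy as np

def index_folds(n_rows, n_folds=5, seed=None):
    """Yield (train_idx, test_idx) pairs over a shuffled index range.

    Hypothetical helper: every row index lands in exactly one test window.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)  # shuffled row indices
    for test_idx in np.array_split(order, n_folds):
        # The training rows are all indices outside the current window
        train_idx = np.setdiff1d(order, test_idx)
        yield train_idx, test_idx

folds = list(index_folds(169, n_folds=5, seed=0))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Indexing the data with data[test_idx] and data[train_idx] then gives the two sets for each iteration.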

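Because the input is shuffled first, every subset keeps approximately the original class distribution, even though the subsets are picked sequentially.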

The following code iterates over the test/training set combinations:

    import numpy as np

    def csv_to_array(file):
        with open(file, 'r') as f:
            data = np.loadtxt(f, delimiter=',')
        return data

    def classes_distribution(A):
        """Print the class distributions of array A."""
        # Assumes the last column holds class labels 0 .. nb_classes-1
        nb_classes = np.unique(A[:, -1]).shape[0]
        total_size = A.shape[0]
        for i in range(nb_classes):
            class_size = np.sum(A[:, -1] == i)
            class_p = class_size / total_size
            print(f"\t P(class_{i}) = {class_p:.3f}")

    def random_samples(A, test_set_p=0.2):
        """Split the input array A in two uniformly chosen
        random sets: test/training.
        Repeat this until every row has been yielded once
        as part of a test set."""
        A = np.random.permutation(A)
        sample_size = int(test_set_p * A.shape[0])
        # The last window is shorter when the number of rows
        # is not a multiple of sample_size
        for start in range(0, A.shape[0], sample_size):
            end = start + sample_size
            yield {
                "test": A[start:end],
                "train": np.concatenate((A[:start], A[end:]), axis=0)
            }

    def main():
        ecoli = csv_to_array('ecoli.csv')
        print("Input set shape: ", ecoli.shape)
        print("Input set class distribution:")
        classes_distribution(ecoli)
        print("Training sets class distributions:")
        for iteration in random_samples(ecoli):
            test_set = iteration["test"]
            training_set = iteration["train"]
            classes_distribution(training_set)
            print("---")
            # ... Do whatever with these two sets

    if __name__ == "__main__":
        main()

It produces output of the following form:

    Input set shape: (169, 8)
    Input set class distribution:
    	 P(class_0) = 0.308
    	 P(class_1) = 0.213
    	 P(class_2) = 0.207
    	 P(class_3) = 0.118
    	 P(class_4) = 0.154
    Training sets class distributions:
    	 P(class_0) = 0.316
    	 P(class_1) = 0.206
    	 P(class_2) = 0.199
    	 P(class_3) = 0.118
    	 P(class_4) = 0.162
    ...
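
Shuffling alone keeps the class ratios only approximately equal across subsets. If exact per-class counts are needed in each fold, as in the question's 33/67 example, one possible approach is to shuffle each class separately and then deal the rows round-robin into the folds. The sketch below is an illustration under that assumption, not the method used in the code above; stratified_folds and its parameters are hypothetical names, and the labels array is made up to match the question's example.

```python
import numpy as np

def stratified_folds(labels, n_folds=5, seed=None):
    """Return n_folds index arrays whose per-class counts differ by at most 1.

    Hypothetical helper: shuffles each class on its own, lines the classes
    up one after another, then deals indices round-robin into the folds.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    per_class = [rng.permutation(np.flatnonzero(labels == c))
                 for c in np.unique(labels)]
    order = np.concatenate(per_class)
    # Fold k takes every n_folds-th index, starting at position k
    return [order[k::n_folds] for k in range(n_folds)]

# Made-up labels matching the question: 33 rows of class A (0), 67 of class B (1)
labels = np.array([0] * 33 + [1] * 67)
folds = stratified_folds(labels, n_folds=5, seed=0)

for k, test_idx in enumerate(folds):
    # Using fold k as the test set leaves the other four folds for training (20/80)
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    n_a = int(np.sum(labels[test_idx] == 0))
    print(f"fold {k}: test={len(test_idx)} train={len(train_idx)} class A in test={n_a}")
```

Because each class occupies a contiguous run before dealing, every fold receives either the floor or the ceiling of that class's share, which gives 6 or 7 class A rows per fold of 20 in this example.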

Answered 2021-11-30
