Ridge regression: RMSE on every subset is higher than on the full set

I trained a model on one data set and then tried to apply it to each of its subsets. Mathematically, the total RMSE and MAE (mean absolute error) should lie between the smallest and largest of the per-subset RMSEs and MAEs. However, every per-subset RMSE and MAE is higher than the total RMSE and MAE. This is what I did:

    %pyspark
    import numpy as np
    from sklearn import linear_model
    from sklearn.metrics import mean_absolute_error, mean_squared_error
    from sklearn.preprocessing import RobustScaler

    def preprocessing(features, attributes):
        features_2 = features[attributes]
        y = features['y'].values
        x = features_2.values
        # Scale every column except the first one (the customer count)
        robustScaler = RobustScaler(quantile_range=(25.0,75.0))
        xScaled = robustScaler.fit_transform(x[:,1:x.shape[1]])
        # Clip the scaled values to [-2, 2]
        xScaled[xScaled < -2.0] = -2.0
        xScaled[xScaled > 2.0] = 2.0
        xCustomers = x[:,0]
        xCustomers_reshaped = xCustomers.reshape((x[:,0].size, 1))
        x_TS = xScaled
        x_T0 = xScaled[:,:]
        # Polynomial terms up to degree 3, plus a constant column
        x_T0_all = np.hstack((np.ones((x_T0.shape[0], 1)), x_T0, x_T0**2, x_T0**3))
        xCustR = xCustomers.reshape((x[:,0].size, 1))
        # The same terms multiplied by the customer count
        x_TS_all = np.hstack((xCustR*np.ones((x_TS.shape[0], 1)), xCustR*x_TS, xCustR*(x_TS**2), xCustR*(x_TS**3)))
        x_all = np.hstack((x_T0_all, x_TS_all))
        variable_names = features_2.columns.get_values()[1:].tolist()
        return x_all, variable_names, y

    def trainModel(features,attributes,optAlpha):
        x_all, variable_names, y = preprocessing(features, attributes)
        ridge = linear_model.Ridge(fit_intercept=False, copy_X=True, alpha=optAlpha, solver='auto')
        ridge.fit(x_all, y)
        return ridge

    def useModel(features,ridge,attributes):
        x_all, variable_names, y = preprocessing(features, attributes)
        y_pred = ridge.predict(x_all)
        rmse = np.sqrt(mean_squared_error(y,y_pred))
        mae = mean_absolute_error(y, y_pred)
        print "RMSE on test set: ", round(rmse,2)
        print "MAE on test set: ", round(mae,2)
        return y_pred, y, rmse, mae

    ridge = trainModel(df_features_train, attributes, optAlpha)
    useModel(df_features_train,ridge,attributes)

Output:

    RMSE on test set: 67.05

Any idea what is going wrong?
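The expectation in the question can be checked numerically. When the predictions are the same in both cases, the total MSE is the size-weighted average of the per-subset MSEs, so the total RMSE has to fall between the smallest and largest per-subset RMSE (the same argument applies to MAE). The sketch below uses synthetic data and hypothetical subset boundaries, not the data from the question:

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic targets and predictions (assumption: stand-ins for y and y_pred)
y_true = rng.normal(size=300)
y_pred = y_true + rng.normal(scale=0.5, size=300)

# Hypothetical partition of the rows into three disjoint subsets
subsets = [slice(0, 50), slice(50, 180), slice(180, 300)]


def mse(a, b):
    return float(np.mean((a - b) ** 2))


sizes = np.array([len(y_true[s]) for s in subsets], dtype=float)
subset_mse = np.array([mse(y_true[s], y_pred[s]) for s in subsets])
subset_rmse = np.sqrt(subset_mse)

total_rmse = np.sqrt(mse(y_true, y_pred))
# Size-weighted average of the subset MSEs reproduces the total MSE
pooled_rmse = np.sqrt(np.sum(sizes * subset_mse) / np.sum(sizes))

print("per-subset RMSE:", subset_rmse)
print("total RMSE:", total_rmse, "pooled RMSE:", pooled_rmse)
```

If the per-subset errors all come out higher than the total, the predictions for the subsets cannot be the same as the ones used for the total, which points at the inputs to the model rather than at the metric.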

1 answer

月关宝盒

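I found it myself.

The RobustScaler in the preprocessing step behaves differently on different sets and subsets. Because preprocessing calls fit_transform every time, the scaler re-estimates its median and interquartile range from whatever data it is given.

As a result, the values in a subset are prepared differently from the training data and no longer fit the model.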
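One way to avoid this is to fit the scaler once on the training data and only call transform on the subsets. The following is a minimal sketch on synthetic data, not the original pipeline: the two groups, the coefficients, alpha=1.0 and the helper rmse_on are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import RobustScaler

rng = np.random.RandomState(0)

# Synthetic stand-in data: two groups with clearly different feature scales
x_a = rng.normal(loc=0.0, scale=1.0, size=(200, 3))
x_b = rng.normal(loc=5.0, scale=4.0, size=(200, 3))
x_train = np.vstack((x_a, x_b))
coef = np.array([1.5, -2.0, 0.5])
y_train = x_train.dot(coef) + rng.normal(scale=0.1, size=400)
y_a, y_b = y_train[:200], y_train[200:]

# Fit the scaler ONCE on the full training set and keep it
scaler = RobustScaler(quantile_range=(25.0, 75.0)).fit(x_train)
ridge = Ridge(alpha=1.0).fit(scaler.transform(x_train), y_train)


def rmse_on(x, y, refit_scaler):
    """RMSE of the trained model on (x, y), with or without re-fitting the scaler."""
    if refit_scaler:
        # What the question's code does: a new scaler fitted on the subset
        x_scaled = RobustScaler(quantile_range=(25.0, 75.0)).fit_transform(x)
    else:
        # Reuse the statistics learned from the training set
        x_scaled = scaler.transform(x)
    return float(np.sqrt(mean_squared_error(y, ridge.predict(x_scaled))))


rmse_total = rmse_on(x_train, y_train, refit_scaler=False)
rmse_a_ok = rmse_on(x_a, y_a, refit_scaler=False)
rmse_b_ok = rmse_on(x_b, y_b, refit_scaler=False)
rmse_a_bad = rmse_on(x_a, y_a, refit_scaler=True)

print("total:", rmse_total)
print("subsets, shared scaler:", rmse_a_ok, rmse_b_ok)
print("subset a, re-fitted scaler:", rmse_a_bad)
```

With the shared scaler, the total RMSE lies between the two subset RMSEs again. In the question's code, the equivalent change would be to have preprocessing accept an already fitted scaler instead of creating a new one on each call.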

2021-08-24
