2 Answers
Here is the formula for the R² score:

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

where \hat{y}_i is the prediction for the i-th observation y_i, and \bar{y} is the mean of all observations.
A negative R² therefore means that someone who knew the mean of your y_test sample and always used it as the "prediction" would be more accurate than your model.
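To see this concretely, here is a minimal sketch (my addition, not from the original answer) using sklearn.metrics.r2_score: a model whose predictions are worse than the plain mean gets a negative score.

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
mean_pred = np.full_like(y_true, y_true.mean())  # always predict the mean
bad_pred = np.array([4.0, 1.0, 5.0, 0.0])        # predictions worse than the mean

print(r2_score(y_true, mean_pred))  # 0.0: predicting the mean scores exactly zero
print(r2_score(y_true, bad_pred))   # -5.0: worse than the mean, hence negative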
Turning to your dataset (thanks to @Prayson W. Daniel for the convenient loading script), let's take a quick look at the data.
df.population.plot()
It looks like a logarithmic transformation could help.
import numpy as np
df_log = df.copy()
df_log.population = np.log(df.population)
df_log.population.plot()
Now let's perform the linear regression with OpenTURNS.
import openturns as ot
sam = ot.Sample(np.array(df_log)) # convert DataFrame to openturns Sample
sam.setDescription(['year', 'logarithm of the population'])
linreg = ot.LinearModelAlgorithm(sam[:, 0], sam[:, 1])
linreg.run()
linreg_result = linreg.getResult()
coeffs = linreg_result.getCoefficients()
print("Best fitting line = {} + year * {}".format(coeffs[0], coeffs[1]))
print("R2 score = {}".format(linreg_result.getRSquared()))
ot.VisualTest_DrawLinearModel(sam[:, 0], sam[:, 1], linreg_result)
Output:
Best fitting line = -38.35148311467912 + year * 0.028172928802559845
R2 score = 0.9966261033648469
This is an almost exact fit.
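As a quick sanity check on the fitted slope (my interpretation, not part of the original answer): on the log scale, a slope b means the population is multiplied by exp(b) each year.

import numpy as np
b = 0.028172928802559845   # fitted slope from the output above
print(np.exp(b) - 1)       # ~0.0286, i.e. about 2.9% annual growth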
EDIT
As suggested by @Prayson W. Daniel, here is the model fit transformed back to the original scale.
# Get the original data in openturns Sample format
orig_sam = ot.Sample(np.array(df))
orig_sam.setDescription(df.columns)
# Compute the prediction in the original scale
predicted = ot.Sample(orig_sam) # start by copying the original data
predicted[:, 1] = np.exp(linreg_result.getMetaModel()(predicted[:, 0])) # overwrite with the predicted values
error = np.array((predicted - orig_sam)[:, 1]) # compute error
r2 = 1.0 - (error**2).mean() / df.population.var() # compute the R2 score in the original scale
print("R2 score in original scale = {}".format(r2))
# Plot the model
graph = ot.Graph("Original scale", "year", "population", True, '')
curve = ot.Curve(predicted)
graph.add(curve)
points = ot.Cloud(orig_sam)
points.setColor('red')
graph.add(points)
graph
Output:
R2 score in original scale = 0.9979032805107133
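As a hypothetical cross-check (not in the original answer), the same score can be recomputed with scikit-learn's r2_score. Note that df.population.var() above uses the unbiased ddof=1 estimator, so the two values differ by a factor of (n-1)/n in the error term.

from sklearn.metrics import r2_score
y_true = np.array(orig_sam[:, 1]).ravel()   # openturns Sample -> 1-D numpy array
y_pred = np.array(predicted[:, 1]).ravel()
print(r2_score(y_true, y_pred))             # close to, but not exactly, the value above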
Scikit-learn's LinearRegression score uses the R² score. A negative R² means that the model fits your data very badly. Since R² compares the fit of the model to that of the null hypothesis (a horizontal straight line), R² is negative whenever the model fits worse than a horizontal line.

R2 = 1 - (SUM((y - ypred)**2) / SUM((y - AVG(y))**2))

So if SUM((y - ypred)**2) is larger than SUM((y - AVG(y))**2), R² will be negative.
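As a quick numeric check (my addition), this formula reproduces sklearn's r2_score exactly:

import numpy as np
from sklearn.metrics import r2_score

y = np.array([3.0, 5.0, 7.0, 9.0])
ypred = np.array([2.0, 6.0, 8.0, 8.0])

manual = 1 - ((y - ypred)**2).sum() / ((y - y.mean())**2).sum()
print(manual, r2_score(y, ypred))  # both print 0.8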
Causes and how to fix them
Problem 1: you are performing a random split of time-series data. A random split ignores the temporal dimension.
Solution: preserve the flow of time (see the code below).
Problem 2: the target values are too large.
Solution: unless we use tree-based models, you will have to do some target feature engineering to scale the data into a range the model can learn.
Here is a code example. Using LinearRegression's default parameters and a log|exp transformation of the target values, my attempt yields an R² score of about 87%:
import pandas as pd
import numpy as np
# we need to transform/feature-engineer our target:
# np.log and np.exp from numpy make the target values learnable
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor
# your data, df
# re-reference the year so the feature starts at 0 (1960 -> 0)
df = df.assign(ref_year = lambda x: x.year - 1960)
df.population = df.population.astype(int)
split = int(df.shape[0] * .9)  # chronological split: first ~90% train, last ~10% test
df = df[['ref_year', 'population']]
train_df = df.iloc[:split]
test_df = df.iloc[split:]
X_train = train_df[['ref_year']]
y_train = train_df.population
X_test = test_df[['ref_year']]
y_test = test_df.population
# regressor
regressor = LinearRegression()
lr = TransformedTargetRegressor(
    regressor=regressor,
    func=np.log, inverse_func=np.exp)

lr.fit(X_train, y_train)
print(lr.score(X_test, y_test))
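If you want cross-validation rather than a single chronological split, here is a sketch using sklearn.model_selection.TimeSeriesSplit (my suggestion, not part of the original answer): every fold stays time-ordered, training on the past and testing on the years that follow.

from sklearn.model_selection import TimeSeriesSplit, cross_val_score

tscv = TimeSeriesSplit(n_splits=5)  # each fold trains on earlier years only
scores = cross_val_score(lr, df[['ref_year']], df.population,
                         cv=tscv, scoring='r2')
print(scores)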
For those interested in taking this further, here is one way to read the dataset:
import pandas as pd
import io
df = pd.read_csv(io.StringIO('''year,population
1960,22151278.0
1961,22671191.0
1962,23221389.0
1963,23798430.0
1964,24397022.0
1965,25013626.0
1966,25641044.0
1967,26280132.0
1968,26944390.0
1969,27652709.0
1970,28415077.0
1971,29248643.0
1972,30140804.0
1973,31036662.0
1974,31861352.0
1975,32566854.0
1976,33128149.0
1977,33577242.0
1978,33993301.0
1979,34487799.0
1980,35141712.0
1981,35984528.0
1982,36995248.0
1983,38142674.0
1984,39374348.0
1985,40652141.0
1986,41965693.0
1987,43329231.0
1988,44757203.0
1989,46272299.0
1990,47887865.0
1991,49609969.0
1992,51423585.0
1993,53295566.0
1994,55180998.0
1995,57047908.0
1996,58883530.0
1997,60697443.0
1998,62507724.0
1999,64343013.0
2000,66224804.0
2001,68159423.0
2002,70142091.0
2003,72170584.0
2004,74239505.0
2005,76346311.0
2006,78489206.0
2007,80674348.0
2008,82916235.0
2009,85233913.0
2010,87639964.0
2011,90139927.0
2012,92726971.0
2013,95385785.0
2014,98094253.0
2015,100835458.0
2016,103603501.0
2017,106400024.0
2018,109224559.0
'''))