
Negative accuracy in linear regression

喵喵时光机 2023-07-18 10:37:25
My linear regression model has a negative coefficient of determination R². How can this happen? Any ideas would help. Here is my dataset:

year,population
1960,22151278.0
1961,22671191.0
1962,23221389.0
1963,23798430.0
1964,24397022.0
1965,25013626.0
1966,25641044.0
1967,26280132.0
1968,26944390.0
1969,27652709.0
1970,28415077.0
1971,29248643.0
1972,30140804.0
1973,31036662.0
1974,31861352.0
1975,32566854.0
1976,33128149.0
1977,33577242.0
1978,33993301.0
1979,34487799.0
1980,35141712.0
1981,35984528.0
1982,36995248.0
1983,38142674.0
1984,39374348.0
1985,40652141.0
1986,41965693.0
1987,43329231.0
1988,44757203.0
1989,46272299.0
1990,47887865.0
1991,49609969.0
1992,51423585.0
1993,53295566.0
1994,55180998.0
1995,57047908.0
1996,58883530.0
1997,60697443.0
1998,62507724.0
1999,64343013.0
2000,66224804.0
2001,68159423.0
2002,70142091.0
2003,72170584.0
2004,74239505.0
2005,76346311.0
2006,78489206.0
2007,80674348.0
2008,82916235.0
2009,85233913.0
2010,87639964.0
2011,90139927.0
2012,92726971.0
2013,95385785.0
2014,98094253.0
2015,100835458.0
2016,103603501.0
2017,106400024.0
2018,109224559.0

The code for the LinearRegression model is as follows:

import pandas as pd
from sklearn.linear_model import LinearRegression
data = pd.read_csv("data.csv", header=None)
data = data.drop(0, axis=0)
X = data[0]
Y = data[1]
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, shuffle=False)
lm = LinearRegression()
lm.fit(X_train.values.reshape(-1,1), Y_train.values.reshape(-1,1))
Y_pred = lm.predict(X_test.values.reshape(-1,1))
accuracy = lm.score(Y_test.values.reshape(-1,1), Y_pred)
print(accuracy)

Output:

-3592622948027972.5

2 Answers

慕盖茨4494581

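Here is the formula for the R² score:

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

where \hat{y}_i is the prediction for the i-th observation y_i, and \bar{y} is the mean of all observations.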

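So a negative R² means that if someone knew the mean of your sample y_test and always used it as the "prediction", that "prediction" would be more accurate than your model.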
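To make the comparison with the mean baseline concrete, here is a minimal sketch in plain Python. The helper name r2_score_manual and the toy numbers are assumptions made up for illustration; they are not part of the question's data.

```python
def r2_score_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, computed directly from the definition."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


# Toy observations (hypothetical values, chosen so the arithmetic is easy to follow).
y_true = [1.0, 2.0, 3.0, 4.0]

# Always predicting the sample mean gives SS_res == SS_tot, hence R² = 0.
mean_baseline = [2.5, 2.5, 2.5, 2.5]
r2_baseline = r2_score_manual(y_true, mean_baseline)

# A model that is worse than the mean gives SS_res > SS_tot, hence R² < 0.
bad_model = [4.0, 3.0, 2.0, 1.0]
r2_bad = r2_score_manual(y_true, bad_model)

print(r2_baseline, r2_bad)
```

Nothing in the definition bounds R² from below, so a sufficiently poor prediction can push it to an arbitrarily large negative value.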

Turning to your dataset (thanks to @Prayson W. Daniel for the convenient loading script), let's take a quick look at your data.

df.population.plot()

//img1.sycdn.imooc.com//64b5fb200001e49503720258.jpg

It looks like a log transform might help.

import numpy as np
df_log = df.copy()
df_log.population = np.log(df.population)
df_log.population.plot()

//img1.sycdn.imooc.com//64b5fb2f000171e103750243.jpg

Now let's perform a linear regression with OpenTURNS.

import openturns as ot
sam = ot.Sample(np.array(df_log)) # convert DataFrame to openturns Sample
sam.setDescription(['year', 'logarithm of the population'])
linreg = ot.LinearModelAlgorithm(sam[:, 0], sam[:, 1])
linreg.run()
linreg_result = linreg.getResult()
coeffs = linreg_result.getCoefficients()
print("Best fitting line = {} + year * {}".format(coeffs[0], coeffs[1]))
print("R2 score = {}".format(linreg_result.getRSquared()))
ot.VisualTest_DrawLinearModel(sam[:, 0], sam[:, 1], linreg_result)

Output:

Best fitting line = -38.35148311467912 + year * 0.028172928802559845
R2 score = 0.9966261033648469

//img1.sycdn.imooc.com//64b5fb3e0001e83f05830398.jpg

This is an almost exact fit.
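For readers without OpenTURNS, the same idea can be sketched with NumPy alone. This is an illustration under assumptions, not the answer's method: it uses np.polyfit instead of ot.LinearModelAlgorithm, only seven rows copied from the question's dataset (every tenth year plus 2018), and years shifted to start at 0, so the coefficients will differ somewhat from the output above.

```python
import numpy as np

# Seven rows copied from the question's dataset (every tenth year, plus 2018).
years = np.array([1960, 1970, 1980, 1990, 2000, 2010, 2018], dtype=float)
population = np.array(
    [22151278, 28415077, 35141712, 47887865, 66224804, 87639964, 109224559],
    dtype=float,
)

# Shift the years so the regressor starts at 0 (assumption: purely for readability).
ref_years = years - 1960
log_pop = np.log(population)

# Degree-1 least-squares fit of log(population) against the shifted year.
slope, intercept = np.polyfit(ref_years, log_pop, 1)

# R² of the fit, computed on the log scale.
fitted = intercept + slope * ref_years
ss_res = np.sum((log_pop - fitted) ** 2)
ss_tot = np.sum((log_pop - log_pop.mean()) ** 2)
r2_log = 1.0 - ss_res / ss_tot

print("slope per year:", slope)
print("R2 on the log scale:", r2_log)
```

A straight line on the log scale corresponds to exponential growth on the original scale, which is why the transformed series is so much easier for a linear model.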


Edit

As @Prayson W. Daniel suggested, here is the model fit after transforming back to the original scale.


# Get the original data in openturns Sample format
orig_sam = ot.Sample(np.array(df))
orig_sam.setDescription(df.columns)

# Compute the prediction in the original scale
predicted = ot.Sample(orig_sam) # start by copying the original data
predicted[:, 1] = np.exp(linreg_result.getMetaModel()(predicted[:, 0])) # overwrite with the predicted values
error = np.array((predicted - orig_sam)[:, 1]) # compute error
# note: .mean() divides by n while pandas .var() defaults to ddof=1,
# so the ratio below is (n-1)/n times SS_res/SS_tot
r2 = 1.0 - (error**2).mean() / df.population.var() # compute the R2 score in the original scale
print("R2 score in original scale = {}".format(r2))

# Plot the model
graph = ot.Graph("Original scale", "year", "population", True, '')
curve = ot.Curve(predicted)
graph.add(curve)
points = ot.Cloud(orig_sam)
points.setColor('red')
graph.add(points)
graph

Output:

R2 score in original scale = 0.9979032805107133

//img1.sycdn.imooc.com//64b5fb4c0001571705600396.jpg
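The back-transformation step can also be sketched with NumPy only. The assumptions here: np.polyfit stands in for the OpenTURNS fit, and only seven rows of the question's dataset are used, so the numbers will not match the output above. The sketch also compares the textbook ratio SS_res / SS_tot with the mean-over-variance form used in the snippet above, since pandas .var() defaults to ddof=1.

```python
import numpy as np

# Seven rows copied from the question's dataset (every tenth year, plus 2018).
years = np.array([1960, 1970, 1980, 1990, 2000, 2010, 2018], dtype=float)
population = np.array(
    [22151278, 28415077, 35141712, 47887865, 66224804, 87639964, 109224559],
    dtype=float,
)
ref_years = years - 1960
n = len(population)

# Fit on the log scale, then map the predictions back with exp.
slope, intercept = np.polyfit(ref_years, np.log(population), 1)
predicted = np.exp(intercept + slope * ref_years)

# Textbook R² on the original scale.
ss_res = np.sum((population - predicted) ** 2)
ss_tot = np.sum((population - population.mean()) ** 2)
r2_original = 1.0 - ss_res / ss_tot

# Mean squared error over the ddof=1 variance: the ratio is scaled by (n-1)/n.
r2_mixed = 1.0 - np.mean((population - predicted) ** 2) / np.var(population, ddof=1)

print("R2 (SS_res / SS_tot):", r2_original)
print("R2 (MSE / var, ddof=1):", r2_mixed)
```

With 59 observations the factor (n-1)/n is close to 1, so the two versions differ only slightly.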

2023-07-18
繁华开满天机

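Scikit-learn's LinearRegression score uses the R² score. A negative R² means the model fits your data very badly. Since R² compares the fit of the model with that of the null hypothesis (a horizontal straight line), R² is negative when the model fits worse than a horizontal line.

R² = 1 - (SUM((y - ypred)**2) / SUM((y - AVG(y))**2))

So if SUM((y - ypred)**2) is greater than SUM((y - AVG(y))**2), R² will be negative.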
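One thing worth checking alongside the formula is how score is called. In scikit-learn, score(X, y) takes the features and the true targets, predicts internally, and returns the R² of those predictions. The sketch below uses made-up toy data (an assumption for illustration) to show that lm.score(X_test, y_test) matches r2_score(y_test, lm.predict(X_test)).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Toy data (hypothetical): a line with a small deterministic wiggle.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0 + np.sin(np.arange(20))

# Chronological 75/25 split, no shuffling.
X_train, X_test = X[:15], X[15:]
y_train, y_test = y[:15], y[15:]

lm = LinearRegression().fit(X_train, y_train)

# score(X, y): features first, true targets second.
score_from_method = lm.score(X_test, y_test)

# Equivalent computation with the standalone metric.
score_from_metric = r2_score(y_test, lm.predict(X_test))

print(score_from_method, score_from_metric)
```

In the question's snippet, the arguments passed to score are the test targets and the predictions rather than the test features and the test targets, so the model is asked to predict from population values instead of years. That may well account for the extreme magnitude of the reported score.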


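Causes and fixes

Issue 1: This is time series data. A random split would ignore the time dimension.

Solution: Preserve the time order (see the code below).

Issue 2: The target values are too large.

Solution: Unless we use a tree-based model, you will have to do some target feature engineering to scale the data into a range the model can learn.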
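As a side note on preserving the time order: besides a single chronological cut, scikit-learn provides TimeSeriesSplit, which yields several folds where the training indices always precede the test indices. The following is only a sketch; the choice of n_splits=3 is an arbitrary assumption, and the year range is taken from the question's dataset.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 59 yearly observations, 1960 through 2018, as in the question's dataset.
years = np.arange(1960, 2019).reshape(-1, 1)

# n_splits=3 is an arbitrary choice for illustration.
splitter = TimeSeriesSplit(n_splits=3)

folds = []
for train_idx, test_idx in splitter.split(years):
    folds.append((train_idx, test_idx))
    # Each fold trains on the past and tests on the years that follow.
    print(
        "train up to", years[train_idx[-1], 0],
        "| test from", years[test_idx[0], 0], "to", years[test_idx[-1], 0],
    )
```

Every fold keeps the test years strictly after the training years, so the model is never evaluated on data older than what it was trained on.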


Here is a code example. Using LinearRegression with default parameters and a log|exp transform of the target values, my attempt produced an R² score of about 87%:



import pandas as pd
import numpy as np

# We need to transform (feature engineer) the target.
# np.log and np.exp are used to bring the values into a learnable range.
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor

# your data, df

# transform year to a reference year (years since 1960)
df = df.assign(ref_year=lambda x: x.year - 1960)
df.population = df.population.astype(int)

split = int(df.shape[0] * .9)  # split at 90%, 10%-ish

df = df[['ref_year', 'population']]

# chronological split: the first 90% for training, the rest for testing
train_df = df.iloc[:split]
test_df = df.iloc[split:]

X_train = train_df[['ref_year']]
y_train = train_df.population

X_test = test_df[['ref_year']]
y_test = test_df.population

# regressor
regressor = LinearRegression()

# fit on log(y), and map predictions back with exp
lr = TransformedTargetRegressor(
    regressor=regressor,
    func=np.log, inverse_func=np.exp)

lr.fit(X_train, y_train)
print(lr.score(X_test, y_test))
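To show what TransformedTargetRegressor does with func and inverse_func, here is a small sketch comparing it with the manual equivalent: fit on np.log(y), then apply np.exp to the predictions. It uses only seven rows copied from the question's dataset, an assumption made to keep the example short.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Seven rows copied from the question's dataset, with years shifted to start at 0.
X = np.array([[0], [10], [20], [30], [40], [50], [58]], dtype=float)
y = np.array(
    [22151278, 28415077, 35141712, 47887865, 66224804, 87639964, 109224559],
    dtype=float,
)

# Wrapped version: the target is log-transformed before fitting,
# and predictions are mapped back through exp.
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
wrapped.fit(X, y)

# Manual version of the same two steps.
manual = LinearRegression().fit(X, np.log(y))
manual_pred = np.exp(manual.predict(X))

same = np.allclose(wrapped.predict(X), manual_pred)
print("wrapped and manual predictions agree:", same)
```

Because the wrapper returns predictions on the original scale, its score is also an R² on the original scale, not on the log scale.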

For anyone interested in improving on this, here is one way to load the dataset:


import pandas as pd
import io

df = pd.read_csv(io.StringIO('''year,population
1960,22151278.0
1961,22671191.0
1962,23221389.0
1963,23798430.0
1964,24397022.0
1965,25013626.0
1966,25641044.0
1967,26280132.0
1968,26944390.0
1969,27652709.0
1970,28415077.0
1971,29248643.0
1972,30140804.0
1973,31036662.0
1974,31861352.0
1975,32566854.0
1976,33128149.0
1977,33577242.0
1978,33993301.0
1979,34487799.0
1980,35141712.0
1981,35984528.0
1982,36995248.0
1983,38142674.0
1984,39374348.0
1985,40652141.0
1986,41965693.0
1987,43329231.0
1988,44757203.0
1989,46272299.0
1990,47887865.0
1991,49609969.0
1992,51423585.0
1993,53295566.0
1994,55180998.0
1995,57047908.0
1996,58883530.0
1997,60697443.0
1998,62507724.0
1999,64343013.0
2000,66224804.0
2001,68159423.0
2002,70142091.0
2003,72170584.0
2004,74239505.0
2005,76346311.0
2006,78489206.0
2007,80674348.0
2008,82916235.0
2009,85233913.0
2010,87639964.0
2011,90139927.0
2012,92726971.0
2013,95385785.0
2014,98094253.0
2015,100835458.0
2016,103603501.0
2017,106400024.0
2018,109224559.0
'''))

Result:

//img1.sycdn.imooc.com//64b5fb64000150c006210371.jpg

2023-07-18