2 Answers
This user has contributed 1887 experience points and received more than 5 likes
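There are ways to do this. For example, you could apply an FFT to the dataset with good coverage, remove the high-frequency terms, and see how well the result fits the dataset with poor coverage.
However, I highly doubt this is useful: the dataset with high coverage already fits the dataset with low coverage almost perfectly. Whatever method you apply, the best function that is similar to the high-coverage dataset while also fitting the low-coverage dataset is the high-coverage dataset itself.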
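For readers who still want to try the FFT idea, here is a minimal sketch of the mechanics only. The helper name `lowpass_fft`, the `keep` cutoff, and the synthetic `dense` and `sparse` series are assumptions made for illustration; they are not taken from the question.

```python
import numpy as np

def lowpass_fft(signal, keep):
    """Keep only the `keep` lowest-frequency terms of a real-valued signal."""
    spectrum = np.fft.rfft(signal)
    spectrum[keep:] = 0  # drop the high-frequency terms
    return np.fft.irfft(spectrum, n=len(signal))

# Hypothetical data: a densely sampled noisy series and a sparse, noisy
# sampling of a similar series (stand-ins for good and poor coverage).
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * 2 * np.pi, 400, endpoint=False)
clean = 8 * np.sin(t) - 15
dense = clean + rng.normal(scale=0.5, size=t.size)
sparse_idx = np.sort(rng.choice(t.size, size=40, replace=False))
sparse = clean[sparse_idx] + rng.normal(scale=0.5, size=sparse_idx.size)

smoothed = lowpass_fft(dense, keep=10)

# Compare the raw and the smoothed dense series at the sparse sample points.
rmse_raw = np.sqrt(np.mean((dense[sparse_idx] - sparse) ** 2))
rmse_smooth = np.sqrt(np.mean((smoothed[sparse_idx] - sparse) ** 2))
print(f"RMSE raw: {rmse_raw:.3f}, RMSE smoothed: {rmse_smooth:.3f}")
```

How much is gained depends entirely on the noise level and the chosen cutoff. If the two datasets already agree closely, the difference will be small, which is the point made above.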
This user has contributed 1827 experience points and received more than 8 likes
Let's create a trial dataset to address your problem:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
t = np.linspace(0, 30*2*np.pi, 30*24*2)
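# "30min" is the current alias for 30-minute steps; the older "30T" is deprecated in recent pandas.
td = pd.date_range("2020-01-01", freq='30min', periods=t.size)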
T0 = np.sin(t)*8 - 15 + np.random.randn(t.size)*0.2
T1 = np.sin(t)*7 - 13 + np.random.randn(t.size)*0.1
T2 = np.sin(t)*9 - 10 + np.random.randn(t.size)*0.3
T3 = np.sin(t)*8.5 - 11 + np.random.randn(t.size)*0.5
T = np.vstack([T0, T1, T2, T3]).T
features = pd.DataFrame(T, columns=["s1", "s2", "s3", "s4"], index=td)
It looks like this:
axe = features[:"2020-01-04"].plot()
axe.legend()
axe.grid()
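Whether a linear model is appropriate depends on how strongly the sites are linearly related. A minimal sketch of checking this with a pairwise Pearson correlation matrix follows. It rebuilds a similar trial dataset so that it runs on its own; the seeded generator `rng` is an assumption added for reproducibility.

```python
import numpy as np
import pandas as pd

# Rebuild a trial dataset like the one above, with a fixed seed (assumption).
rng = np.random.default_rng(0)
t = np.linspace(0, 30 * 2 * np.pi, 30 * 24 * 2)
td = pd.date_range("2020-01-01", freq="30min", periods=t.size)
features = pd.DataFrame({
    "s1": np.sin(t) * 8 - 15 + rng.normal(scale=0.2, size=t.size),
    "s2": np.sin(t) * 7 - 13 + rng.normal(scale=0.1, size=t.size),
    "s3": np.sin(t) * 9 - 10 + rng.normal(scale=0.3, size=t.size),
    "s4": np.sin(t) * 8.5 - 11 + rng.normal(scale=0.5, size=t.size),
}, index=td)

# Pairwise Pearson correlation between all sites.
corr = features.corr()
print(corr.round(3))
```

Values close to 1 between the target site and the other sites suggest that a linear regression is a reasonable first attempt.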
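Then, if your time series are strongly linearly correlated, you can predict the missing values by means of an ordinary least squares regression. scikit-learn provides a convenient interface for this kind of computation: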
from sklearn import linear_model
from sklearn.model_selection import train_test_split
# Remove target site from features:
target = features.pop("s4")
# Split dataset into train (actual data) and test (missing temperatures):
x_train, x_test, y_train, y_test = train_test_split(features, target, train_size=0.25, random_state=123)
# Create a Linear Regressor and train it:
reg = linear_model.LinearRegression()
reg.fit(x_train, y_train)
# Assess regression score with test data:
reg.score(x_test, y_test)  # R² on the held-out rows, e.g. 0.9926 (varies from run to run because the noise is unseeded)
# Predict missing values:
ypred = reg.predict(x_test)
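# Sort by timestamp: train_test_split shuffles the rows, and date-string slicing needs a sorted index.
ypred = pd.DataFrame(ypred, index=x_test.index, columns=["s4p"]).sort_index()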
The result looks like this:
axe = features[:"2020-01-04"].plot()
target[:"2020-01-04"].plot(ax=axe)
ypred[:"2020-01-04"].plot(ax=axe, linestyle='None', marker='.')
axe.legend()
axe.grid()
error = (y_test - ypred.squeeze())
axe = error.plot()
axe.legend(["Prediction Error"])
axe.grid()
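The example above simulates missing data with `train_test_split`. With real data, the gaps would more likely appear as `NaN` values in the target column. The following is a minimal sketch of the same regression under that assumption; the frame `df`, the 75 % gap rate, and the seeded generator `rng` are hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data shaped like the trial dataset above.
rng = np.random.default_rng(0)
t = np.linspace(0, 30 * 2 * np.pi, 30 * 24 * 2)
td = pd.date_range("2020-01-01", freq="30min", periods=t.size)
df = pd.DataFrame({
    "s1": np.sin(t) * 8 - 15 + rng.normal(scale=0.2, size=t.size),
    "s2": np.sin(t) * 7 - 13 + rng.normal(scale=0.1, size=t.size),
    "s3": np.sin(t) * 9 - 10 + rng.normal(scale=0.3, size=t.size),
    "s4": np.sin(t) * 8.5 - 11 + rng.normal(scale=0.5, size=t.size),
}, index=td)
truth = df["s4"].copy()  # kept only to judge the fill quality afterwards

# Simulate poor coverage: blank out roughly 75 % of the target site.
missing = rng.random(t.size) < 0.75
df.loc[missing, "s4"] = np.nan

known = df["s4"].notna()
predictors = ["s1", "s2", "s3"]

# Fit on rows where the target was observed.
reg = LinearRegression()
reg.fit(df.loc[known, predictors], df.loc[known, "s4"])

# Fill only the gaps; observed values stay untouched.
filled = df["s4"].copy()
filled[~known] = reg.predict(df.loc[~known, predictors])

print("Mean absolute error on filled gaps:",
      (filled[~known] - truth[~known]).abs().mean())
```

This only works when the other sites have values at the timestamps where the target is missing. Rows with gaps in the predictors would need to be handled separately.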