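The main steps of a machine learning project are:
1. Get the data
Load the data with Pandas and return a Pandas DataFrame object containing all of it.

    import os
    import pandas as pd

    # HOUSING_PATH is assumed to be defined beforehand as the directory that holds housing.csv.
    def load_housing_data(housing_path=HOUSING_PATH):
        csv_path = os.path.join(housing_path, "housing.csv")
        return pd.read_csv(csv_path)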
Use the DataFrame's head() method to view the first five rows of the dataset:

    housing.head()

Use the describe() method to show a summary of the numerical attributes:

    housing.describe()
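Create a test set (stratified sampling based on income category):

    from sklearn.model_selection import StratifiedShuffleSplit

    # Assumes housing already has an income_cat column to stratify on.
    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    for train_index, test_index in split.split(housing, housing["income_cat"]):
        # .loc works here because housing still has its default integer index.
        strat_train_set = housing.loc[train_index]
        strat_test_set = housing.loc[test_index]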
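The split above relies on an income_cat column that this write-up never shows being created. Here is a minimal sketch, assuming the column is built by binning median_income with pd.cut. The bin edges, the synthetic data, and the name demo are illustrative assumptions, not taken from the original.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the housing data (assumption: only median_income matters here).
rng = np.random.default_rng(42)
demo = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=1000)})

# Hypothetical binning: five income categories with illustrative edges.
demo["income_cat"] = pd.cut(
    demo["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(demo, demo["income_cat"]):
    # split() yields positional indices, so iloc is the safe accessor.
    demo_train = demo.iloc[train_index]
    demo_test = demo.iloc[test_index]

# Category proportions in the full data and in the test set should nearly match.
overall = demo["income_cat"].value_counts(normalize=True).sort_index()
in_test = demo_test["income_cat"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"overall": overall, "test": in_test}))
```

If the two printed columns are close, the test set preserves the income distribution, which is the point of stratifying.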
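2. Discover and visualize the data to gain insights
Visualizing geographical data:

    import matplotlib.pyplot as plt

    # Marker size shows district population; color shows median house value.
    housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
                 s=housing["population"]/100, label="population",
                 c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
    )
    plt.legend()

Figure: geographical scatter plot of the data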
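Looking for correlations
Use the corr() method to compute the standard correlation coefficient (also called Pearson's r) between every pair of attributes:

    >>> # With pandas 2.0 or later, call housing.corr(numeric_only=True), because ocean_proximity is a text column.
    >>> corr_matrix = housing.corr()
    >>> corr_matrix["median_house_value"].sort_values(ascending=False)  # correlation of each attribute with the median house value
    median_house_value    1.000000
    median_income         0.687170
    total_rooms           0.135231
    housing_median_age    0.114220
    households            0.064702
    total_bedrooms        0.047865
    population           -0.026699
    longitude            -0.047279
    latitude             -0.142826
    Name: median_house_value, dtype: float64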
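To make the coefficient concrete, the sketch below computes Pearson's r by hand and compares it with what corr() returns. The toy DataFrame and its values are made up for illustration and are not part of the housing dataset.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values, not from the housing dataset).
toy = pd.DataFrame({
    "income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "value": [1.5, 3.5, 4.0, 6.5, 7.0],
})

# Pearson's r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the column means.
dx = toy["income"] - toy["income"].mean()
dy = toy["value"] - toy["value"].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = toy.corr().loc["income", "value"]
print(r_manual, r_pandas)
```

Both numbers should agree. Keep in mind that the coefficient only captures linear relationships.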
Experimenting with attribute combinations

    >>> housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
    >>> housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
    >>> housing["population_per_household"] = housing["population"]/housing["households"]
    >>> corr_matrix = housing.corr()
    >>> corr_matrix["median_house_value"].sort_values(ascending=False)
    median_house_value          1.000000
    median_income               0.687170
    rooms_per_household         0.199343
    total_rooms                 0.135231
    housing_median_age          0.114220
    households                  0.064702
    total_bedrooms              0.047865
    population_per_household   -0.021984
    population                 -0.026699
    longitude                  -0.047279
    latitude                   -0.142826
    bedrooms_per_room          -0.260070
    Name: median_house_value, dtype: float64

As the output shows, the new bedrooms_per_room attribute is more strongly correlated with the median house value than either the total number of rooms or the total number of bedrooms.
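3. Data preprocessing
Handling missing values

    # Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it.
    from sklearn.impute import SimpleImputer

    imputer = SimpleImputer(strategy="median")
    # Create a copy of the data without the text attribute ocean_proximity.
    housing_num = housing.drop("ocean_proximity", axis=1)
    imputer.fit(housing_num)
    X = imputer.transform(housing_num)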
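The following sketch shows what median imputation does on a tiny table. The DataFrame toy_num and its values are made up for illustration, and the example uses SimpleImputer, the current scikit-learn replacement for Imputer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric data with one missing value per column (illustrative values).
toy_num = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 500.0],
    "population": [1000.0, 2000.0, np.nan, 4000.0],
})

imputer = SimpleImputer(strategy="median")
imputer.fit(toy_num)

# statistics_ holds the learned median of each column, ignoring NaN.
print(imputer.statistics_)

# transform returns a NumPy array; wrap it to get a DataFrame back.
toy_filled = pd.DataFrame(imputer.transform(toy_num), columns=toy_num.columns)
print(toy_filled)
```

Fitting on the training set and reusing the same fitted imputer on the test set keeps the medians consistent between the two.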
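Handling text and categorical attributes (using one-hot encoding)

    # CategoricalEncoder never shipped in a stable scikit-learn release; OneHotEncoder now handles text categories directly.
    from sklearn.preprocessing import OneHotEncoder

    cat_encoder = OneHotEncoder()
    # housing_cat is assumed to be the housing["ocean_proximity"] column.
    housing_cat_reshaped = housing_cat.values.reshape(-1, 1)
    housing_cat_1hot = cat_encoder.fit_transform(housing_cat_reshaped)  # SciPy sparse matrix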
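As a small illustration of one-hot encoding, the sketch below encodes a handful of category strings. The array cats is made up for the example, and OneHotEncoder is used because it is the encoder available in released scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative category values; the encoder expects a 2D array (one column here).
cats = np.array(["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"]).reshape(-1, 1)

encoder = OneHotEncoder()
onehot = encoder.fit_transform(cats)  # SciPy sparse matrix by default

# categories_ lists the learned categories, one array per input column.
print(encoder.categories_)

# Each row has a single 1 in the column of its category.
print(onehot.toarray())
```

The sparse result stores only the positions of the ones, which saves memory when there are many categories.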
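Feature scaling
There are two common ways to bring all attributes onto the same scale: min-max scaling (normalization) and standardization.
Transformation pipelines

    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # DataFrameSelector and CombinedAttributesAdder are custom transformers assumed to be defined elsewhere.
    num_attribs = list(housing_num)
    cat_attribs = ["ocean_proximity"]

    num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', SimpleImputer(strategy="median")),  # replaces the removed Imputer class
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

    cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        # Dense one-hot output; sparse_output needs scikit-learn 1.2+ (older versions use sparse=False).
        ('cat_encoder', OneHotEncoder(sparse_output=False)),
    ])

    full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])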
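The pipeline refers to two custom transformers, DataFrameSelector and CombinedAttributesAdder, whose definitions do not appear in this write-up. The sketch below is one hypothetical way to write them, assuming the first selects DataFrame columns and returns a NumPy array, and the second appends the ratio attributes tried in step 2. The column positions and the toy data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: pick DataFrame columns and return a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[self.attribute_names].values


class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    """Hypothetical helper: append the ratio attributes tried in step 2."""

    def __init__(self, rooms_ix=0, bedrooms_ix=1, population_ix=2, households_ix=3):
        # Column positions are assumptions; adjust them to the real column order.
        self.rooms_ix = rooms_ix
        self.bedrooms_ix = bedrooms_ix
        self.population_ix = population_ix
        self.households_ix = households_ix

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        rooms_per_household = X[:, self.rooms_ix] / X[:, self.households_ix]
        population_per_household = X[:, self.population_ix] / X[:, self.households_ix]
        bedrooms_per_room = X[:, self.bedrooms_ix] / X[:, self.rooms_ix]
        return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]


# Toy data with the four columns the adder needs (illustrative values).
toy = pd.DataFrame({
    "total_rooms": [10.0, 20.0],
    "total_bedrooms": [2.0, 5.0],
    "population": [30.0, 40.0],
    "households": [5.0, 10.0],
})
selected = DataFrameSelector(list(toy)).fit_transform(toy)
extended = CombinedAttributesAdder().fit_transform(selected)
print(extended)
```

Inheriting from TransformerMixin supplies fit_transform() for free, and BaseEstimator supplies get_params() and set_params(), which is what lets such classes sit inside a Pipeline.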
Run the pipeline:

    housing_prepared = full_pipeline.fit_transform(housing)
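The std_scaler step of the pipeline applies standardization, one of the two scaling methods named under feature scaling. As a rough comparison of the two, the sketch below scales a single made-up column both ways. The values are chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One illustrative feature column (values chosen for easy arithmetic).
col = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

# Min-max scaling: (x - min) / (max - min), so values land in [0, 1].
minmax = MinMaxScaler().fit_transform(col)

# Standardization: (x - mean) / std, giving zero mean and unit variance.
standard = StandardScaler().fit_transform(col)

print(minmax.ravel())
print(standard.ravel())
```

Standardization does not bound values to a fixed range, but it is much less affected by outliers such as the 10.0 above, which squeezes the other min-max values toward zero.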
4. Select and train a model
Linear regression model

    from sklearn.linear_model import LinearRegression

    # housing_labels is assumed to hold the training targets (median_house_value).
    lin_reg = LinearRegression()
    lin_reg.fit(housing_prepared, housing_labels)

Decision tree model

    from sklearn.tree import DecisionTreeRegressor

    tree_reg = DecisionTreeRegressor()
    tree_reg.fit(housing_prepared, housing_labels)

Random forest model

    from sklearn.ensemble import RandomForestRegressor

    forest_reg = RandomForestRegressor()
    forest_reg.fit(housing_prepared, housing_labels)
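Once the models are fitted, a natural first check is their error on the training set itself. The sketch below does this on synthetic data generated with make_regression, which stands in for housing_prepared and housing_labels. The data, the random_state values, and the small n_estimators are assumptions made to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for housing_prepared / housing_labels.
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=42)

models = [
    ("linear", LinearRegression()),
    ("tree", DecisionTreeRegressor(random_state=42)),
    ("forest", RandomForestRegressor(n_estimators=10, random_state=42)),
]

train_rmse = {}
for name, model in models:
    model.fit(X, y)
    predictions = model.predict(X)
    # RMSE on the same data the model was trained on.
    train_rmse[name] = np.sqrt(mean_squared_error(y, predictions))

print(train_rmse)
```

An unrestricted decision tree typically reaches a training RMSE of zero, which signals overfitting rather than a perfect model. Training error alone is therefore a poor guide, and cross-validation gives a more honest estimate.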
Use Scikit-Learn's cross-validation feature, K-fold cross-validation:

    import numpy as np
    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
    # The scores are negative MSE values, so negate them before taking the square root.
    rmse_scores = np.sqrt(-scores)
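Scikit-Learn's scoring convention is that greater is better, which is why the scorer returns the negative MSE and the code negates it before the square root. The sketch below runs the same call on synthetic data (an assumption standing in for housing_prepared and housing_labels) and summarizes the ten fold scores.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for housing_prepared / housing_labels.
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=42)

scores = cross_val_score(
    DecisionTreeRegressor(random_state=42), X, y,
    scoring="neg_mean_squared_error", cv=10,
)
rmse_scores = np.sqrt(-scores)  # one RMSE per fold

print("Scores:", rmse_scores)
print("Mean:", rmse_scores.mean())
print("Standard deviation:", rmse_scores.std())
```

The standard deviation across folds gives a sense of how precise the estimate is, something a single train/validation split cannot provide.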
5. Fine-tune the model
Grid search

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    param_grid = [
        # 3 x 4 = 12 combinations
        {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
        # 2 x 3 = 6 more combinations, with bootstrapping turned off
        {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
    ]

    forest_reg = RandomForestRegressor()
    grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                               scoring='neg_mean_squared_error')
    grid_search.fit(housing_prepared, housing_labels)
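After fitting, the search object exposes the best hyperparameter combination and the score of every combination it tried. The sketch below shows how these might be read back, using synthetic data with eight features (an assumption, chosen so that max_features up to 8 is valid) in place of housing_prepared and housing_labels.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for housing_prepared / housing_labels.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42), param_grid,
    cv=5, scoring="neg_mean_squared_error",
)
grid_search.fit(X, y)

# Best combination found, and the estimator refitted on all the data with it.
print(grid_search.best_params_)
print(grid_search.best_estimator_)

# RMSE of every combination tried (12 + 6 = 18 in total).
cv_results = grid_search.cv_results_
for mean_score, params in zip(cv_results["mean_test_score"], cv_results["params"]):
    print(np.sqrt(-mean_score), params)
```

With 18 combinations and 5 folds, the search trains 90 models plus one final refit, so the cost grows quickly as the grid gets larger.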
Author: Darkchaox
Link: https://www.jianshu.com/p/373f4abc8c99