【机器学习】Logistic回归——理论推导及算法实现（Python版）

【Logistic回归】

简介
Logistic回归（也称逻辑回归），是最常用的一种二分类算法。回归，一词多用来进行数据拟合，进而预测连续型数据。Logistic回归，是一种对数据进行分类的算法（也可以说是对数据进行预测，只不过预测y是离散型数据），之所以称为回归是由历史原因所决定的，在此不用过多的深究。最后再强调一次：Logistic回归是最经典、最常用的二分类算法。
逻辑回归属于二元分类问题，我们将因变量y的值域限定为{0,1},其中0代表负向类（negative class），1代表正向类(positive class)，相比于线性回归，线性回归因变量y是连续值，随着样本的不同，其变化范围可能严重超出（0，1）区间。
注意： 在这里要区分y和 $h_{\theta} (\chi)$ 的区别，对于线性回归 $y=h_{\theta}(\chi)+\epsilon(\chi)$
其中 $\epsilon(\chi)$ 表示实际值与预测值之间的误差,在忽略误差的情况下 $y=h_{\theta}(\chi)$ 。
对于逻辑回归, $y\neq h_{\theta}(\chi)$
而是 $h_{\theta}(\chi) \rightarrow y$
即：if $h_{\theta}(\chi)\geq 0.5$ ,则 $y=1$ ； if $h_{\theta}(\chi)< 0.5$ ,则 $y=0$
所以 $h_{\theta}(\chi)$ 严格限制在（0，1）区间内，即 $0\leq h_{\theta} (\chi)\leq 1$
总结: Logistic回归和线性回归都用来对数据进行预测，只不过Logistic回归的取值范围为离散型有限集合，例如： $h_{\theta}(x)\epsilon \{0,1\}$ ,而线性回归的取值范围为连续型集合， $h_{\theta}(x)$ 的取值区间是不可控的。特别对于二元分类，Logistic回归， $h_{\theta}(x)\epsilon \{0,1\}$
假说表示
为了找到合适的 $h_{\theta}(\chi)$ 模型，使得 $h_{\theta}(\chi)\epsilon(0,1)$ ，我们假设逻辑回归模型为：
$h_{\theta}(\chi)=g(\theta^{T}X)$
其中，X代表特征向量，g代表逻辑函数（logistic function），g(z)也称作sigmod函数, $g(z)=\frac{1}{1+e^{-z}}$
sigmod函数的图像如下图所示：

我们发现该函数特点明显，Z>0就接近1，Z<0就接近0，Z=0，g(z)=0.5
解释： 将 $h_{\theta}(x)$ 的取值限制在（0，1）区间，是将值得问题转换为概率问题,也就是分类问题。即：h?(?)的作用是，对于给定的输入变量，根据选择的参数计算输出变量(y)=1 的可能性(estimated probablity),即
$h_{\theta}(x)=p(y=1|x;\theta)$
例如，如果对于给定的?，通过已经确定的参数计算得出: $h_{\theta}(x) = 0.7$ 则表示有 70%的几率?为正向类，相应地?为负向类的几率为 1-0.7=0.3。
总结: 逻辑回归假设函数是将线性回归中得到的预测值，作为参数，传递给sigmod函数，完成由值到概率的转换，进而实现分类。
决策边界
决策边界，是用来进行类别划分的边界函数。可以是线性函数也可以是非线性函数。具体表现如下图所示：

线性边界函数，如上图中的蓝色直线，其 $\theta^{T}X=\theta_0+\theta_1x$

线性边界函数，如上图中的黑色曲线，其 $\theta^{T}X=\theta_0+\theta_1x+\theta_2x^{2}+\cdots+\theta_nx_n^{2}x_{n+1}^{2}$
解释在逻辑回归中，我们预测：
- 当 $h_{\theta}(x)\geq0.5$ 时， $y=1$
- 当 $h_{\theta}(x)<0.5$ 时， $y=0$
由于假设函数为：
$h_{\theta}(\chi)=g(\theta^{T}X)$
由 $h_{\theta}(x)\geq0.5$ ,可得 $\theta^{T}X\geq0$ ,进而 $y=1$
$h_{\theta}(x)<0.5$ ,可得 $\theta^{T}X<0$ ,进而 $y=0$
故逻辑回归的决策边界函数的确定，可以通过一下公式来确定：
$\theta^{T}X=0$
而其中，无论 $\theta^{T}X$ 是一阶线性函数（直线边界），函数带有高阶项的非线性函数（曲线边界）。因为无论是高阶还是低阶，我们都可以当做线性回归来处理。
代价函数（损失函数）
对于逻辑回归，我们也可以采用线性回归的方式来构建损失函数，即：代价函数是所有模型误差的平方和 ，但是我们将 $h_{\theta}(\chi)=g(\theta^{T}X)$ 带入其中，发现得到的代价函数是一个非凸函数，存在许多局部最小解，这种函数对于梯度下降算法来说是不可行的。其中线性回归的代价函数为：
$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}(h_{\theta}(x^{(i)})-y^{(i)})^{2}$
以下是吴恩达给出的思路：
重新定义逻辑回归的代价函数为：
$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}Cost(h_\theta(x^{(i)}),y^{(i)})$ 其中
$Cost(h_{\theta}(x),y)=\begin{cases} -\log(h_\theta(x)) \quad\quad\quad if\quad y=1\\ -\log(1-h_\theta(x)) \quad if\quad y=0\\ \end{cases}$
对应的 $Cost(h_{\theta}(x),y)$ 的变化趋势如下所示：

当实际 $y=1$ ,且 $h_\theta(x)=1$ 时，Cost函数为0，即误差为0；
当实际 $y=1$ ，而 $h_\theta(x)\neq1$ 时，误差随着 $h_\theta(x)$ 变小而变大。

当实际 $y=0$ ,且 $h_\theta(x)=0$ 时，Cost函数为0，即误差为0；
当实际 $y=0$ ，而 $h_\theta(x)\neq0$ 时，误差随着 $h_\theta(x)$ 变大而变大。
将 $Cost(h_{\theta}(x),y)$ 合并成一个式子如下：
$Cost(h_\theta(x),y)=-y*log(h_\theta(x))-(1-y)*\log(1-h_\theta(x))$
带入代价函数 $J(\theta)$ 得到：
$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}[-y^{(i)}\log(h_\theta(x^{(i)}))-(1-y^{(i)})\log(1-h_\theta(x^{(i)}))]$
即： $J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))]$
虽然这个公式从这个角度来解释很合理，但由于没有详细的推导过程总是感觉十分突兀。
我认为更合理的推导思路：从概率的角度分析问题
预测函数：
$h_\theta(x)=g(\theta^{T}x)=\frac{1}{1+e^{-\theta^{T}x}}$
对于分类任务：
$\begin{cases} p(y=1|x;\theta)=h_\theta(x) \\ p(y=0|x;\theta)=1-h_\theta(x) \\ \end{cases}$
整合可得：
$p(y|x;\theta)=(h_\theta(x))^{y}(1-h_\theta(x))^{1-y}$
似然函数：求在未知参数 $\theta$ 的情况下，整体样本出现的概率最大。
$L(\theta)=\prod_{i=1}^{m}p(y_i|x_i;\theta)=\prod_{i=1}^{m}(h_\theta(x_i))^{y_i}(1-h_\theta(x_i))^{1-y_i}$
对数似然：
$l(\theta)=\log L(\theta)=\sum_{i=1}^{m}(y_i\log h_\theta(x_i)+(1-y_i)\log(1-h_\theta(x_i)))$
损失函数是一个用来衡量数据拟合程度的函数，通过求损失函数的最小值，来确定参数。对于对数似然， $l(\theta)$ 越大，拟合的就越好，于是问题变成了求 $l(\theta)$ 最大值的问题，为了，采用梯度下降的算法，我们引入
$J(\theta)=-\frac{1}{m}l(\theta)$
将梯度上升转换为梯度下降。故损失函数 $J(\theta)$ 为：
$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}(y_i\log h_\theta(x_i)+(1-y_i)\log(1-h_\theta(x_i)))$
通过对比发现，这两种解释得到的最终的结果都是一样的。
梯度下降算法理论推导
根据线性回归的相关知识，我们知道梯度下降算法为：
$Repeat\{ \theta_j:=\theta_j-\alpha \frac{\partial}{\partial \theta_j}J(\theta)\}$
$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}(y_i\log h_\theta(x_i)+(1-y_i)\log(1-h_\theta(x_i)))$
所以对 $j(\theta)$ 求偏导如下：
$\begin{aligned} \frac{\partial}{\partial \theta_j}J(\theta) &=-\frac{1}{m}\sum_{i=1}^{m} \left(y_i\frac{1}{h_\theta(x_i)}\frac{\partial}{\partial \theta_j}h_\theta(x_i)-(1-y_i)\frac{1}{1-h_\theta(x_i)}\frac{\partial}{\partial \theta_j}h_\theta(x_i)\right)\\ &=-\frac{1}{m}\sum_{i=1}^{m}\left(y_i\frac{1}{g(\theta^{T}x_i)}-(1-y_i)\frac{1}{1-g(\theta^{T}x_i)}\right)\frac{\partial}{\partial \theta_j}g(\theta^{T}x_i)\\ & =-\frac{1}{m}\sum_{i=1}^{m}\left(y_i\frac{1}{g(\theta^{T}x_i)}-(1-y_i)\frac{1}{1-g(\theta^{T}x_i)} \right)g(\theta^{T}x_i)(1-g(\theta^{T}x_i))\frac{\partial}{\partial \theta_j}\theta^{T}x_i\\ &=-\frac{1}{m}\sum_{i=1}^{m}\left(y_i(1-g(\theta^{T}x_i))-(1-y_i)g(\theta^{T}x_i)\right)x_i^{j}\\ &=-\frac{1}{m}\sum_{i=1}^{m}\left(y_i-g(\theta^{T}x_i)\right)x_i^{j}\\ &=-\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)x_i^{j} \end {aligned}$
故梯度下降算法的最终迭代过程为：
$\color{red}{Repeat}\{ \theta_j:=\theta_j-\alpha \frac{1}{m }\sum_{i=1}^{m}(h_\theta(x_i)-y_i)x_i^{j}\}$

代码实现：

#数据采用：链接: https://pan.baidu.com/s/1gtXlB9E2E5XbnwswG2banQ 提取码: 5zju
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import numpy.random 
import time
#定义了三种停止梯度下降循环的方式
STOP_ITER=0
STOP_COST=1
STOP_GRAD=2


def readData(path):
    #读取数据
    data =pd.read_csv(path,header=None,names=['Exam1','Exam2','Admitted'])
    #加工数据：在每一个样本的开始添加一个1字段
    data.insert(0,'One',1)
    return data

#sigmoid函数，将数值转换为0~1之间的概率
def sigmoid(z):
    return 1/(1+np.exp(-z))

#定义预测函数hθ(x),也即我们的预测函数
def model(X,theta):
    return sigmoid(np.dot(X,theta.T))
#损失函数的定义
def cost(X,Y,theta):
    left =np.multiply(-Y,np.log(model(X,theta)))
    right = np.multiply(1 - Y, np.log(1 - model(X, theta)))
    return np.sum(left-right)/len(X)
#计算梯度，相当于一次梯度下降循环 
def gradient(X, Y, theta):
    grad = np.zeros(theta.shape)
    error = (model(X, theta)- Y).ravel()
    for j in range(len(theta.ravel())): #for each parmeter
        term = np.multiply(error, X[:,j])
        grad[0, j] = np.sum(term) / len(X)
    
    return grad

#对数据进行洗牌，避免由于人工搜集数据导致的数据分布不随机，增加数据的一般性
def shuffleData(data):
    np.random.shuffle(data)
    cols = orig_data.shape[1]
    X = data[:,0:cols-1]
    Y = data[:,cols-1:]
    return X,Y
#设定三种不同的停止策略
def stopCriterion(type,value,threshold):
    if type ==STOP_ITER: return value>threshold
    elif type ==STOP_COST: return abs(value[-1]-value[-2])<threshold
    elif type ==STOP_GRAD: return np.linalg.norm(value)<threshold

#梯度下降函数
def descent(data,theta,batchSize,stopType,thresh,alpha):
    #梯度下降求解
    #初始化时间
    init_time =time.time()
    i =0#迭代次数
    k =0 #batch
    X,Y=shuffleData(data)
    grad =np.zeros(theta.shape)
    costs =[cost(X,Y,theta)]#损失值
    
    
    while True:
        grad =gradient(X[k:k+batchSize],Y[k:k+batchSize],theta)
        k+=batchSize #取batch数量个样本数据
        if k >= n:
            k=0
            X,Y =shuffleData(data)#重新洗牌
        theta =theta -alpha*grad #参数更新
        costs.append(cost(X,Y,theta))
        i+=1
        
        if stopType ==STOP_ITER :value=i
        elif stopType ==STOP_COST :value =costs
        elif stopType ==STOP_GRAD :value =grad
        if stopCriterion(stopType,value,thresh):break
    return theta,i-1,costs,grad,time.time()-init_time

#运行函数
def runExpe(data, theta, batchSize, stopType, thresh, alpha):
    #import pdb; pdb.set_trace();
    theta, iter, costs, grad, dur = descent(data, theta, batchSize, stopType, thresh, alpha)
    name = "Original" if (data[:,1]>2).sum() > 1 else "Scaled"
    name += " data - learning rate: {} - ".format(alpha)
    if batchSize==n: strDescType = "Gradient"
    elif batchSize==1:  strDescType = "Stochastic"
    else: strDescType = "Mini-batch ({})".format(batchSize)
    name += strDescType + " descent - Stop: "
    if stopType == STOP_ITER: strStop = "{} iterations".format(thresh)
    elif stopType == STOP_COST: strStop = "costs change < {}".format(thresh)
    else: strStop = "gradient norm < {}".format(thresh)
    name += strStop
    print ("***{}\nTheta: {} - Iter: {} - Last cost: {:03.2f} - Duration: {:03.2f}s".format(
        name, theta, iter, costs[-1], dur))
    fig, ax = plt.subplots(figsize=(12,4))
    ax.plot(np.arange(len(costs)), costs, 'r')
    ax.set_xlabel('Iterations')
    ax.set_ylabel('Cost')
    ax.set_title(name.upper() + ' - Error vs. Iteration')
    return theta
    
#设定阈值
def predict(X, theta):
    return [1 if x >= 0.5 else 0 for x in model(X, theta)]

if __name__ =='__main__':
    n=100
    data =readData('LogiReg_data.txt')
    orig_data = data.as_matrix() 
    theta = np.zeros([1, 3])
    runExpe(orig_data, theta, n, STOP_ITER, thresh=5000, alpha=0.000001)
	#runExpe(orig_data, theta, n, STOP_ITER, thresh=5000, alpha=0.000001)
    #runExpe(orig_data, theta, n, STOP_COST, thresh=0.000001, alpha=0.001)
    #runExpe(orig_data, theta, n, STOP_GRAD, thresh=0.05, alpha=0.001)
    
    #runExpe(orig_data, theta, 1, STOP_ITER, thresh=5000, alpha=0.001)
    #runExpe(orig_data, theta, 1, STOP_ITER, thresh=15000, alpha=0.000002)
    #runExpe(orig_data, theta, 16, STOP_ITER, thresh=15000, alpha=0.001)
    
    #scaled_X = scaled_data[:, :3]
    #y = scaled_data[:, 3]
    #predictions = predict(scaled_X, theta)
    #correct = [1 if ((a == 1 and b == 1) or (a == 0 and b == 0)) else 0 for (a, b) in zip(predictions, y)]
    #accuracy = (sum(map(int, correct)) % len(correct))
    #print ('accuracy = {0}%'.format(accuracy))

结果显示如下图（多个图，只展示一个）：

以下是吴恩达第一次作业的其他高级解法，代码如下：

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Tue Oct 16 16:48:53 2018

@author: lxh
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.optimize as opt
from sklearn.metrics import classification_report

def readData(path,renames):
    data = pd.read_csv(path,names=renames)
    return data


def sigmoid(z):
    return 1/(1+np.exp(-z))


def cost(theta,X,Y):
    left =(-Y)*np.log(sigmoid(X@theta))
    right =(1-Y)*np.log(1-sigmoid(X@theta))
    return np.mean(left-right)

def gradient(theta,X,Y):
    return (X.T@(sigmoid(X@theta)-Y))/len(X)


def predict(theta,X):
    probability =sigmoid(X@theta)
    return [1 if x>=0.5 else 0 for x in probability]


if __name__ =='__main__':
    data = readData('ex2data1.txt',['exam1','exam2','admitted'])
    print(data.head())
    print(data.describe())
    positive = data[data.admitted.isin(['1'])]
    negetive = data[data.admitted.isin(['0'])]
    fig,ax=plt.subplots(figsize=(6,5))
    ax.scatter(positive['exam1'],positive['exam2'],c='b',s=20,label='Admitted')
    ax.scatter(negetive['exam1'],negetive['exam2'],c='r',s=20,label='Not admitted')
    #设置图例显示在图的上方
    box =ax.get_position()
    ax.set_position([box.x0,box.y0,box.width,box.height*0.8])
    ax.legend(loc='center left',bbox_to_anchor=(0.2,1.12),ncol=3)
    #设置横纵坐标名
    ax.set_xlabel('Exam 1 score')
    ax.set_ylabel('Exam 2 score')
    plt.show()
    
    x1=np.arange(-10,10,0.1)
    plt.plot(x1,sigmoid(x1),c='g')
    plt.show()
    
    if 'Ones' not in data.columns:
        data.insert(0,'Ones',1)
        
    X =data.iloc[:,:-1].as_matrix()
    Y =data.iloc[:,-1].as_matrix()
    
    print(X)
    print(Y)
    
    theta =np.zeros(X.shape[1])
    print(cost(theta,X,Y))
    
    print(gradient(theta,X,Y))
    
    #调用opt.fmin_tnc函数实现梯度下降
    result =opt.fmin_tnc(func=cost,x0=theta,fprime=gradient,args=(X,Y))
    print(result)
    #调用opt.minimize函数实现梯度下降
    res =opt.minimize(fun=cost,x0=theta,args=(X,Y),method='TNC',jac=gradient)
    
    print(res)
    
    
    print(cost(result[0],X,Y))
    
    final_theta =result[0]
    predictions =predict(final_theta,X)
    correct = [1 if a==b else 0 for(a,b) in zip(predictions,Y)]
    accuracy =sum(correct)/len(X)
    
    print(accuracy)
    
    print(classification_report(predictions,Y))
    
    
    x1 = np.arange(130,step=0.1)
    x2 = -(final_theta[0]+x1*final_theta[1])/final_theta[2]
    
    
    fig,ax=plt.subplots(figsize=(6,5))
    ax.scatter(positive['exam1'],positive['exam2'],c='b',s=20,label='Admitted')
    ax.scatter(negetive['exam1'],negetive['exam2'],c='r',s=20,label='Not admitted')
    ax.plot(x1,x2)
    ax.set_xlim(0,130)
    ax.set_ylim(0,130)
    ax.set_xlabel('x1')
    ax.set_ylabel('x2')
    ax.set_title('Decision Boundary')
    plt.show()

决策边界展示图如下：

热搜

最近搜索清空

【机器学习】Logistic回归——理论推导及算法实现（Python版）

【Logistic回归】

阅读免费教程