Regression

Course notes: linear regression and logistic regression

1. Linear Regression

The cost function of linear regression is always convex, so it has no local optima other than the global one. Gradient descent with a suitable learning rate therefore converges to the global optimum from any starting point.

There are two ways to minimize J(theta): gradient descent and the normal equation.
Batch gradient descent (BGD) uses all of the training data at each iteration. Other variants exist, such as stochastic gradient descent (SGD), which updates the parameters from one sample at a time.
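
A minimal sketch of the difference between the two update rules. The synthetic data and the helper names `bgd_step` and `sgd_step` are assumptions made for illustration; they are not part of the course material.

```python
import numpy as np

def bgd_step(theta, X, y, lr):
    """One batch gradient descent step: the gradient averages over all m samples."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m
    return theta - lr * grad

def sgd_step(theta, x_i, y_i, lr):
    """One stochastic gradient descent step: the gradient uses a single sample."""
    grad = x_i * (x_i @ theta - y_i)
    return theta - lr * grad

# Synthetic data (an assumption for illustration): y = 1 + 2 * x, no noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
X = np.column_stack([np.ones_like(x), x])  # x_0 = 1 is the intercept column
y = 1 + 2 * x

theta_bgd = np.zeros(2)
for _ in range(2000):
    theta_bgd = bgd_step(theta_bgd, X, y, lr=0.1)

theta_sgd = np.zeros(2)
for epoch in range(200):
    for i in rng.permutation(len(y)):  # visit the samples in random order
        theta_sgd = sgd_step(theta_sgd, X[i], y[i], lr=0.1)
```

Both loops should end up near theta = (1, 2) on this noise-free data. On noisy data SGD typically keeps jittering around the optimum unless the learning rate is decayed.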

1.1 Feature Scaling

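When features have very different value ranges, the loss surface over the parameters looks like an elongated bowl. Gradient descent then oscillates back and forth across the bowl, which slows convergence. Feature scaling maps the features to similar ranges. A common scaling formula, where $\mu_i$ is the mean of feature $i$ and $s_i$ is its range (max minus min) or its standard deviation, is: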

$$
x_i:=\frac{x_i-\mu_i}{s_i}
$$

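$x_0$ needs no scaling because it is always 1. As a rule of thumb, if the absolute values of a feature's maximum and minimum fall roughly within [1/3, 3], the feature is already well scaled and can be left alone. This range is a guideline, not a hard limit.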
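
The scaling formula can be sketched with NumPy as follows. The four rows of raw data are hypothetical values chosen for illustration, and the range is used for $s_i$.

```python
import numpy as np

# Hypothetical raw features: house size (square feet) and number of bedrooms.
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])

mu = X.mean(axis=0)                # per-feature mean
s = X.max(axis=0) - X.min(axis=0)  # per-feature range; the standard deviation also works
X_scaled = (X - mu) / s

# Add the intercept column x_0 = 1 after scaling, so it is left untouched.
X_design = np.column_stack([np.ones(len(X_scaled)), X_scaled])
```

After scaling, every feature has zero mean and a range of exactly 1. The same `mu` and `s` must be reused when scaling new inputs at prediction time.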

1.2 Learning rate

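The learning rate affects how fast linear regression converges. It can be proven that, if the learning rate is small enough, the cost function of linear regression decreases on every iteration. (This conclusion often does not hold for complex networks, where blindly shrinking the learning rate is not advisable; pick a suitable value instead.) If the cost is not monotonically decreasing, the learning rate is very likely too large. A common practice is to try values spaced about 3x apart, such as 0.001, 0.003, 0.01, and so on, watch the cost curve for each, and pick a suitable learning rate.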
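
A rough sketch of such a sweep, assuming synthetic data that is already scaled. The function name `cost_history` is hypothetical, and the value 3.0 is included only to show what a learning rate that is too large looks like.

```python
import numpy as np

def cost_history(X, y, lr, iterations=100):
    """Run batch gradient descent and return J(theta) after each iteration."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(iterations):
        theta = theta - lr * X.T @ (X @ theta - y) / m
        residual = X @ theta - y
        history.append(residual @ residual / (2 * m))
    return np.array(history)

# Synthetic, already-scaled data (an assumption for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
X = np.column_stack([np.ones_like(x), x])
y = 3 - 2 * x + 0.1 * rng.standard_normal(100)

# Candidates spaced roughly 3x apart, plus one value that is too large.
for lr in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 3.0]:
    J = cost_history(X, y, lr)
    monotone = bool(np.all(np.diff(J) <= 0))
    print(f"lr={lr:<6} final J={J[-1]:.4g} monotonically decreasing={monotone}")
```

In practice one would plot each `J` curve instead of printing a summary, and keep the largest learning rate whose curve still decreases smoothly.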

1.3 Feature Design & Polynomial Regression

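You can design your own features instead of using the raw data directly. For example, given a length and a width you can design an area feature and run a univariate rather than a bivariate linear regression. In the same way you can set up polynomial regression, such as $ y=\theta_0+\theta_1 X_1+\theta_2 X_2 $ with $ X_1=x, X_2=x^2 $, which fits a quadratic function. Better feature design fits the samples better and can substantially change both the convergence speed and the final result. Feature scaling becomes essential here, because designed features such as $x^2$ can have a much larger range than $x$.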
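
A small sketch of the quadratic case, assuming noise-free synthetic data. `np.linalg.lstsq` stands in for whichever linear regression solver is used; the point is only that the model stays linear in theta.

```python
import numpy as np

# Synthetic data from a known quadratic (an assumption for illustration).
x = np.linspace(0, 10, 21)
y = 1.0 + 2.0 * x + 0.5 * x**2

# Designed features: X_1 = x and X_2 = x^2, plus the intercept column x_0 = 1.
X = np.column_stack([np.ones_like(x), x, x**2])

# The model is still linear in theta, so any linear regression solver applies.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Without scaling, the designed features have very different ranges:
# X_1 spans 10 units while X_2 spans 100.
ranges = X[:, 1:].max(axis=0) - X[:, 1:].min(axis=0)
```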

1.4 Normal Equation

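Linear algebra gives a fast way to find the optimal parameters. The idea is to write the cost in matrix form, $J(\theta)=\frac{1}{2m}\|X\theta-Y\|^2$, differentiate it with respect to $\theta$, set the derivative to zero, and solve the resulting system of equations. In practice you can use the formula directly: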

$$
\theta=(X^T \cdot X)^{-1} \cdot X^T \cdot Y
$$
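
The formula translates almost literally into NumPy. The sketch below assumes noise-free synthetic data. It also shows `np.linalg.solve` as an alternative that solves $(X^TX)\theta=X^TY$ without forming the inverse explicitly, which is usually the more numerically robust choice.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 4 + 3 * x_1 - 2 * x_2.
rng = np.random.default_rng(0)
features = rng.normal(size=(30, 2))
X = np.column_stack([np.ones(30), features])  # intercept column x_0 = 1
Y = 4 + 3 * features[:, 0] - 2 * features[:, 1]

# Direct translation of the formula.
theta_inv = np.linalg.inv(X.T @ X) @ X.T @ Y

# Solving the linear system (X^T X) theta = X^T Y avoids the explicit inverse.
theta_solve = np.linalg.solve(X.T @ X, X.T @ Y)
```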

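| Gradient Descent | Normal Equation |
| --- | --- |
| Needs many iterations | Solves in one step |
| Needs a learning rate | No learning rate needed |
| Works well for any number of features | Slow when the number of features is large (above about 10,000), because inverting the matrix costs $O(N^3)$ |
| Also applies to other models, such as logistic regression | Does not apply to classification models such as logistic regression |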

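In practice $X^T \cdot X$ may turn out to be singular. In that case the pseudoinverse can be used in place of the inverse and the computation still goes through, but it is better to avoid the situation altogether. A singular matrix usually has one of two causes:

Some features are linearly dependent on others. Remove the redundant features.
There are fewer samples than features. Merge features, delete features, or use regularization.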
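
A small sketch of the first cause, assuming synthetic data in which one feature is exactly twice another. The explicit `tol` and `rcond` thresholds are choices made for this example so that the tiny numerical singular value is treated as zero.

```python
import numpy as np

# x_2 is exactly 2 * x_1, so the columns of X are linearly dependent
# (synthetic data, an assumption for illustration).
x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones(4), x1, 2 * x1])
Y = 1 + 3 * x1

gram = X.T @ X
rank = np.linalg.matrix_rank(gram, tol=1e-8)  # 2 rather than 3, so gram is singular

# The pseudoinverse still returns a least-squares solution (the minimum-norm one).
theta = np.linalg.pinv(gram, rcond=1e-10) @ X.T @ Y
predictions = X @ theta
```

The predictions still match `Y`, but `theta` is only one of infinitely many parameter vectors that do so, which is why removing the redundant feature is the cleaner fix.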

2. Logistic Regression

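Logistic regression looks for a decision boundary. Suppose we have the hypothesis $h_\theta(x)=g(\theta_0+\theta_1 x_1+\theta_2 x_2+\theta_3 x^2_1+\theta_4 x^2_2)$ with $\theta=(-1,0,0,1,1)$ and $g(z)=\frac{1}{1+e^{-z}}$ ($g(z)$ is the sigmoid function, also called the logistic function, which is where logistic regression gets its name). We predict $y=1$ when $h_\theta(x)\ge 0.5$, that is, when $x_1^2+x_2^2\ge 1$, so the boundary $x_1^2+x_2^2=1$ is a circle rather than a straight line.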
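
The circular boundary can be checked numerically. This is a small sketch using the $\theta$ from the example; the function names `sigmoid` and `h` are chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])

def h(x1, x2):
    """Hypothesis with the features [1, x1, x2, x1^2, x2^2]."""
    features = np.array([1.0, x1, x2, x1**2, x2**2])
    return sigmoid(features @ theta)

inside = h(0.0, 0.0)     # centre of the circle: sigmoid(-1), below 0.5
on_circle = h(1.0, 0.0)  # on the boundary: sigmoid(0) = 0.5
outside = h(2.0, 0.0)    # outside the circle: sigmoid(3), above 0.5
```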

An MLP with a single hidden layer seems to use the N hidden neurons of that layer to encode $2^N$ one-hot vectors.

2.1 Cost Function

The cost function of logistic regression is:

$$
Cost(h_\theta(x^{(i)}),y^{(i)})=\begin{cases}
-\log(h_\theta(x^{(i)})) & \text{if } y^{(i)} = 1 \\
-\log(1-h_\theta(x^{(i)})) & \text{if } y^{(i)} = 0
\end{cases}
$$

The cost function differs from the one used in linear regression because applying squared error to the logistic hypothesis gives a non-convex optimization problem with multiple local minima. The root of the difference is that the hypothesis passes $\theta^Tx$ through the sigmoid, which is not a linear function. The new cost function, in contrast, can be proven to be convex in $\theta$.

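Borrowing the way primitive recursive functions in computability theory express a definition by cases (using $y^{(i)}$ as a 0/1 selector), the piecewise function above can be written as a single expression: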

$$
Cost(h_\theta(x^{(i)}),y^{(i)})=-y^{(i)}\log(h_\theta(x^{(i)}))-(1-y^{(i)})\log(1-h_\theta(x^{(i)}))
$$

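In PyTorch, this binary form corresponds to BCELoss. Its multi-class generalization is NLLLoss, which combined with LogSoftmax gives CrossEntropyLoss. The expression does come from maximum likelihood: it is the negative log-likelihood of a Bernoulli model with $p(y=1|x)=h_\theta(x)$.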
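
A quick check that the single-expression form agrees with the case-by-case definition. The function names `cost_piecewise` and `cost_combined` are hypothetical, introduced only for this sketch.

```python
import numpy as np

def cost_piecewise(h, y):
    """Case-by-case definition of the cost."""
    return -np.log(h) if y == 1 else -np.log(1 - h)

def cost_combined(h, y):
    """Single-expression form: y acts as a 0/1 selector between the two cases."""
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

# The two forms agree for both labels and several predicted probabilities.
for h in (0.1, 0.5, 0.9):
    for y in (0, 1):
        assert np.isclose(cost_piecewise(h, y), cost_combined(h, y))
```

A confident wrong prediction (for example h = 0.1 when y = 1) is penalized much more heavily than a confident correct one, which is the behaviour the logarithm is chosen for.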

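Using calculus, we get

$$
\frac{\partial J(\theta)}{\partial \theta_j}=
\frac{1}{m}\sum_{i=1}^mx_j^{(i)}(h_\theta(x^{(i)})-y^{(i)})
$$

where

$$
J(\theta)=\frac{1}{m}\sum_{i=1}^mCost^{(i)}(\theta)
$$

The partial derivative of the cost function has the same form as in linear regression, but note that $h_\theta(x)$ is a completely different hypothesis in the two cases: $\theta^Tx$ for linear regression and $g(\theta^Tx)$ for logistic regression.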
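
The gradient formula can be sanity-checked against central finite differences. This is a sketch on random synthetic data; the function names are chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta): mean cross-entropy cost over the m samples."""
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    """Analytic gradient in vector form: (1/m) * X^T (h - y)."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y = (rng.uniform(size=20) < 0.5).astype(float)
theta = rng.normal(size=3)

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.array([
    (cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
analytic = gradient(theta, X, y)
```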

2.2 One-versus-all

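For a multi-class problem, encode the labels as one-hot codes and use logistic regression to train one classifier for each bit of the code. At prediction time, pick the class whose classifier produces the largest output.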
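
A minimal one-versus-all sketch, assuming three well-separated synthetic clusters and plain batch gradient descent for each binary classifier. The function names and hyperparameters are illustrative choices, not values from the course.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, lr=0.5, iterations=2000):
    """Plain batch gradient descent for one binary logistic regression classifier."""
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        theta -= lr * X.T @ (sigmoid(X @ theta) - y) / len(y)
    return theta

def train_one_vs_all(X, labels, n_classes):
    """Train one classifier per bit of the one-hot code; one row of theta per class."""
    one_hot = np.eye(n_classes)[labels]  # shape (m, n_classes)
    return np.array([train_binary(X, one_hot[:, k]) for k in range(n_classes)])

def predict(thetas, X):
    """Pick the class whose classifier gives the largest output."""
    return np.argmax(sigmoid(X @ thetas.T), axis=1)

# Three well-separated synthetic clusters (an assumption for illustration).
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
labels = np.repeat(np.arange(3), 30)
points = centres[labels] + 0.5 * rng.standard_normal((90, 2))
X = np.column_stack([np.ones(90), points])

thetas = train_one_vs_all(X, labels, n_classes=3)
accuracy = np.mean(predict(thetas, X) == labels)
```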

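2.3 Classifiers

Besides the perceptron, there are three other methods. The perceptron looks for a separating hyperplane.

2.3.1 Least Squares

This is just the normal equation.

2.3.2 Maximum Likelihood Estimation

The cost is the mean squared error, so is this just linear regression? Classification is decided by whether the output is greater or less than 0. It is not exactly plain linear regression, because there is a basis function $\phi$ (which is really the same thing as feature design in linear regression).

The way the solution is derived is different, though.

Assume the output is $t=y(x, w)+\epsilon$, where $\epsilon$ is a zero-mean Gaussian random variable with precision (inverse variance) $\beta$. Since $y(x,w)$ is deterministic, $t$ given $x$ is itself Gaussian, with mean $y(x,w)$ and variance $\beta^{-1}$: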

$$
p(t|x,w,\beta)=N(t|y(x,w),\beta^{-1})
$$

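Our objective is maximum likelihood. Under this model the conditional mean of $t$, which is the best prediction in the squared-error sense, is exactly the model output: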

$$
E[t|x]=\int tp(t|x)dt=y(x,w)
$$

That is, on the training set $X$ we want to maximize the following likelihood:

$$
p(t|X,w,\beta)=\prod_{n=1}^NN(t_n|w^T\phi(x_n),\beta^{-1})
$$

Take the logarithm and maximize it. The optimal $w$ turns out to be exactly the normal equation, and $\beta$ is also easy to derive.

$$
\ln p(t|X,w,\beta)=\sum_{n=1}^N\ln N(t_n|w^T\phi(x_n),\beta^{-1})=\frac{N}{2}\ln\beta-\frac{N}{2}\ln(2\pi)-\beta E_D(w)
$$

$$
E_D(w)=\frac{1}{2}||t-\Phi w||^2
$$

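The above is maximum likelihood estimation, while least squares minimizes the sum of squared errors. The conclusion is that if the error follows a zero-mean Gaussian distribution, maximum likelihood estimation and least squares give the same result.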
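
For reference, both maximizers follow from setting the gradients of the log-likelihood to zero. This is the standard derivation written in the symbols used in this section, assuming $\Phi$ is the design matrix whose $n$-th row is $\phi(x_n)^T$ and that $\Phi^T\Phi$ is invertible:

$$
\nabla_w \ln p(t|X,w,\beta)=\beta\,\Phi^T(t-\Phi w)=0
\quad\Rightarrow\quad
w_{ML}=(\Phi^T\Phi)^{-1}\Phi^Tt
$$

$$
\frac{\partial}{\partial\beta}\ln p(t|X,w_{ML},\beta)=\frac{N}{2\beta}-E_D(w_{ML})=0
\quad\Rightarrow\quad
\frac{1}{\beta_{ML}}=\frac{2}{N}E_D(w_{ML})=\frac{1}{N}||t-\Phi w_{ML}||^2
$$

The first result has the same form as the normal equation with $\Phi$ in place of $X$, and the second says that the estimated noise variance is the mean squared residual.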

2.3.3 Fisher’s linear discriminant

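Project the data onto a direction $w$ so that points of the same class end up as close together as possible. This is again a search for a separating plane, because the projection direction is the normal of that hyperplane. The criterion is given below and is to be maximized (largest gap between the projected class means, smallest within-class variance):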

$$
J(w)=\frac{(m_2-m_1)^2}{s_1^2+s_2^2}
$$

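In matrix form, $J(w)=\frac{w^TS_Bw}{w^TS_Ww}$, where $S_W$ is the within-class covariance and $S_B$ is the between-class covariance.

The maximizer can be found with Lagrange multipliers, treating it as a constrained optimization problem.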
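
Solving that problem gives the well-known closed form $w\propto S_W^{-1}(m_2-m_1)$. The sketch below, on synthetic two-class data with hypothetical function names, computes this direction and compares its criterion value with the naive choice of projecting onto the difference of the class means.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Direction maximizing J(w): w is proportional to S_W^{-1} (m2 - m1)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the two per-class scatter matrices.
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)

def fisher_criterion(w, X1, X2):
    """J(w) = (m2 - m1)^2 / (s1^2 + s2^2), computed on the 1-D projections."""
    p1, p2 = X1 @ w, X2 @ w
    between = (p2.mean() - p1.mean()) ** 2
    within = ((p1 - p1.mean()) ** 2).sum() + ((p2 - p2.mean()) ** 2).sum()
    return between / within

# Two elongated synthetic classes (an assumption for illustration).
rng = np.random.default_rng(0)
cov = np.array([[3.0, 2.5], [2.5, 3.0]])
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X2 = rng.multivariate_normal([2.0, 0.0], cov, size=200)

w_fisher = fisher_direction(X1, X2)
w_means = X2.mean(axis=0) - X1.mean(axis=0)
w_means = w_means / np.linalg.norm(w_means)

J_fisher = fisher_criterion(w_fisher, X1, X2)
J_means = fisher_criterion(w_means, X1, X2)
```

Because the classes are stretched along the diagonal, projecting onto the mean difference mixes them far more than the Fisher direction does, so `J_fisher` should come out larger than `J_means`.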

3. Experiment Code

linear.py

import matplotlib.pyplot as plt
import numpy as np
from datareader import DataReader


class Solution:
    def __init__(self, path, nvar, iteration=1500, lr=0.01):
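        """
        Load the data set and initialise the model.

        :param path: path of the data file passed to DataReader.read
        :param nvar: number of input variables (features), excluding the target y
        :param iteration: number of gradient descent iterations
        :param lr: learning rate
        """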
        data = DataReader.read(path, nvar + 1)  # nvar + y
        self.y = np.array([data[-1]]).transpose()
        self.x = np.array([np.ones((len(self.y),))] \
                          + [np.array(data[i]) for i in range(nvar)]).transpose()
        self.theta = np.zeros((nvar + 1, 1))
        self.iteration = iteration
        self.lr = lr
        self.nvar = nvar

        self.mu = self.x.mean(0)
        self.s = self.x.max(0) - self.x.min(0)
        self.mu[0] = 0
        self.s[0] = 1  # for x_0: (1 - 0) / 1 = 1
        self.feature_normed = False

    def feature_norm(self):
        self.feature_normed = True
        # Broadcasting scales every row at once; mu[0] = 0 and s[0] = 1 leave x_0 unchanged.
        self.x = (self.x - self.mu) / self.s

    def linear_regression(self):
        for i in range(self.iteration):
            yield self.cal_j_theta()
            self.__step()
        yield self.cal_j_theta()

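    def cal_j_theta(self):
        """Cost J(theta) = sum((h - y)^2) / (2m)."""
        deltay = self.predict(self.x) - self.y
        return (deltay * deltay).sum() / 2 / len(self.y)

    def __step(self):
        """One batch gradient descent update: theta -= lr / m * X^T (X theta - y)."""
        self.theta = self.theta - self.x.transpose().dot(self.predict(self.x) - self.y) * self.lr / len(self.y)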

    def norm_equation(self):
        xT = self.x.transpose()
        self.theta = np.linalg.pinv(xT.dot(self.x)).dot(xT).dot(self.y)

    def predict(self, x):
        return x.dot(self.theta)


if __name__ == "__main__":
    s = Solution("ex1data2.txt", 2, lr=0.01)
    s.feature_norm()
    j_thetas = np.array([i for i in s.linear_regression()])
    print("J(theta) of linear regression = %f" % (j_thetas[-1]))
    x = np.arange(len(j_thetas))  # integer iteration index for the x-axis
    plt.scatter(x, j_thetas)
    plt.show()
    # s.norm_equation()
    # print("J(theta) of normal equation = %f"%(s.cal_j_theta()))