线性回归

单变量回归

训练集

$(x, y)$，$x$自变量，$y$因变量

规定$m$是训练集大小，$(x^{(i)}, y^{(i)})$是第i个训练样本

hypothesis函数

$h_\theta(x)=\theta_0+\theta_1x$，即拟合的曲线

cost function

寻找参数$\theta_0 ,\theta_1$，使$J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}))^2$最小，这样的$\theta_0,\theta_1$就是我们选定hypothesis函数的$\theta_0,\theta_1$

多变量回归

概念更新

更新以上概念为：

$n$是特征数目

$x^{(i)}_j$是第i个训练样本的第j个特征

为了方便我们定义$x^{(i)}_0=1$

矩阵形式表示

$x^{(i)}=\left( \begin{matrix} x^{(i)}_0\\ x^{(i)}_1\\ \vdots\\ x^{(i)}_n \end{matrix} \right)_{(n+1)\times 1}$

参数向量$\theta=\left( \begin{matrix} \theta_0\\ \theta_1\\ \vdots\\ \theta_n \end{matrix} \right)_{(n+1)\times 1}$

$X=\left( \begin{matrix} (x^{(1)})^{T}\\ (x^{(2)})^{T}\\ \vdots\\ (x^{(m)})^{T} \end{matrix} \right)_{m\times (n+1)}$

hypothesis函数

$h_\theta(x)=\theta^Tx$，其中$\theta,x$是列向量，定义如上

$h_\theta(X)=X\theta$，且$h_\theta(X)\in \mathbb{R}^{n+1}$

cost function

$\begin{align} J(\theta)&=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}))^2 \\ &= \frac{1}{2m}(h_\theta(X)-y)^T(h_\theta(X)-y)\\ &=\frac{1}{2m}(X\theta-y)^T(X\theta-y)\end{align}$

多项式回归

$n$次多项式，将$x^1,x^2,\cdots,x^n$看作n个特征即可。

正则项

为了防止过拟合，我们可以将$J(\theta)$加入正则项，修改为$J(\theta)=\frac{1}{2m}\left[\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}))^2+\lambda\sum_{i=1}^{n}\theta_i^2\right]$。

Minimize $J(\theta)$

梯度下降法

选择一个学习率$\alpha$，迭代次数$epcho$。每次沿梯度下降最快的方向，即导数方向下降，具体来讲：每次迭代可以表示为$\theta_j=\theta_j-\alpha\frac{\partial J(\theta)}{\theta_j}$。

不难得出：

$\begin{align} & repeat\ epcho\ times\ \{ \\ & \ \ \ \ \ \ \theta_j = \theta_j-\frac{\alpha}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} \\ & \} for\ j = 0\ \cdots \ n, simultaneous \ update \end{align}$

用向量的形式表示：

$\theta = \theta-\frac{\alpha}{m}X^T(X\theta-y)$

如果有正则项的话，导数项会多一项，具体变为：

$\begin{align} & repeat\ epcho\ times\ \{ \\ & \ \ \ \ \ \ \theta_0 = \theta_0-\frac{\alpha}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)} \\ & \ \ \ \ \ \ \theta_j = \theta_j(1-\frac{\alpha\lambda}{m})-\frac{\alpha}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} \\ & \} for\ j = 1\ \cdots \ n, simultaneous \ update \end{align}$

用向量的形式表示：

$\begin{equation}
tmp
=\left(\begin{matrix}
1 & 0 & \cdots\ &0\\
0 & 1-\frac{\alpha\lambda}{m} & \cdots\ & 0\\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots\ & 1-\frac{\alpha\lambda}{m}\\
\end{matrix}\right)
\end{equation}_{(n+1)\times(n+1)}$

$\theta = tmp\times\theta-\frac{\alpha}{m}X^T(X\theta-y)$

正规方程法

实质是直接求最小二乘法中使得方程$\frac{\partial J(\theta)}{\theta_j}=0$的解。

$\theta=(X^TX)^{-1}X^Ty$

尽管$X^TX$可能不可逆，用matlab中的pinv或者python中的np.linalg.pinv可以求伪逆，依然可以得出正解。

如果有正则项，那么

$\begin{equation}
tmp
=\left(\begin{matrix}
0 & 0 & \cdots\ &0\\
0 & 1 & \cdots\ & 0\\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots\ & 1\\
\end{matrix}\right)
\end{equation}_{(n+1)\times(n+1)}$

$\theta=(X^TX+\lambda tmp)^{-1}X^Ty$

事实上$X^TX+\lambda tmp$一定可逆。

共轭梯度法

共轭梯度法可以解决这一类问题：求$Ax=b$的解，其中$A$正定。

设$f(x)=\frac{1}{2} x^{T} A x-b^{T} x$，$f’(x)=Ax-b$。所以最小化$f(x)$的问题等价于求$Ax=b$的解的问题。

接着我们考虑cost function：

$\begin{align}J(\theta)&=\frac{1}{2m}(X\theta-y)^T(X\theta-y)\\ &=\frac{1}{2m} (\theta^TX^T-y^T)(X\theta-y)\\ &= \frac{1}{2m}(\theta^TX^TX\theta-2y^TX\theta+y^Ty)\end{align}$

那么设$A=X^{T} X$，$b=X^Ty$，那么最小化$J(\theta)$的$\theta$就是$A\theta=b$的解。（注：$A$一定正定）

至于真正的共轭梯度法，Click Here and Click Here

主要的思想是，梯度下降法每次沿着一个方向下降，并不会朝这个方向走到最低点走到底，这导致梯度下降法收敛很慢。共轭梯度法找出n个共轭正交向量，每次眼这n个方向走到底，使得剩余残差和之前的所有方向向量正交。理论上n次迭代之后就会找到最优解。

但是事实上并不会真的找到最优解，接着迭代cost function还在下降，待填坑究其原因。

伪代码：

$A=X^{T} X$, $b=X^Ty$

$\theta_0 = $ any

$r0 := b−A\theta_0$ ($r_i$为第i次迭代的误差)

$d0:=r0$ ($d_i$是我们要求的共轭向量，在python实现时别忘copy)

$k:=0 $ (k表示第几次迭代)

repeat

$\alpha_{k} :=\frac{r_{k}^{T} r_{k}}{d_{k}^{T} A d_{k}}$ (该项为学习率，是求出来的)

$x_{k+1} :=x_{k}+\alpha_{k} d_{k}$

$r_{k+1} :=r_{k}-\alpha_{k} A d_{k}$

如果$r_{k+1}$足够小，则提前退出循环（也就是认为已经找到最优解了）

$\beta_{k} :=\frac{r_{k+1}^{T} r_{k+1}}{r_{k}^{T} r_{k}}$
$d_{k+1} :=r_{k+1}+\beta_{k} d_{k}$ （施密特正交法求$d_k$）

$k:=k+1$

end repeat

The result is $x_{k+1}$

python：

def conjugate_gradient(x_train, y_train, epochs=10000, eps=1e-6):
    feature_number, data_number = x_train.shape
    theta = (np.zeros(data_number)).reshape(data_number, 1)
    a = np.dot(x_train.T, x_train)
    b = np.dot(x_train.T, y_train)
    r0 = b - np.dot(a, theta)
    d0 = np.copy(r0)
    for i in range(epochs):
        alpha = np.sum(np.dot(r0.T, r0) / np.dot(np.dot(d0.T, a), d0))
        theta = theta + alpha * d0
        r1 = r0 - alpha * np.dot(a, d0)
        beta = np.sum(np.dot(r1.T, r1) / np.dot(r0.T, r0))
        if np.dot(r1.T, r1) < eps:
            return theta, i + 1
        d1 = r1 + beta * d0
        d0 = np.copy(d1)
        r0 = np.copy(r1)
        # print(cost_function(theta, x_train, y_train))
    return theta, epochs