1. Theory

1.1 Closed-Form Solution

Model: $h_\theta(x)=\theta^Tx$ (stacking all $m$ samples gives the prediction vector $X\theta$)

Loss function: $J(\theta)=\|X\theta-Y\|_2^2$

Objective: $\theta=\arg\min_\theta J(\theta)$

Notation:
$$
\begin{cases}\theta\in\mathbb{R}^{d\times1}\\[2ex]X\in\mathbb{R}^{m\times d}\\[2ex]Y\in\mathbb{R}^{m\times1}\end{cases}
$$
Solving via the normal equation means minimizing $J(\theta)$ directly.

First expand $J(\theta)$:
$$\begin{align*}
J(\theta) &= \|X\theta - Y\|_{2}^{2} \\
&= (X\theta - Y)^{T}(X\theta - Y) \\
&= (\theta^{T}X^{T} - Y^{T})(X\theta - Y) \\
&= \theta^{T}X^{T}X\theta - \theta^{T}X^{T}Y - Y^{T}X\theta + Y^{T}Y \\
&= \theta^{T}X^{T}X\theta - 2Y^{T}X\theta + Y^{T}Y
\end{align*}
$$
Differentiate $J(\theta)$ with respect to $\theta$:
$$\begin{aligned}
\frac{\partial J(\theta)}{\partial\theta}& =\frac{\partial}{\partial\theta}\left(\theta^TX^TX\theta-2Y^TX\theta+Y^TY\right) \\
&=2X^{T}X\theta-2X^{T}Y
\end{aligned}$$
Setting $\frac{\partial J(\theta)}{\partial\theta}=0$ gives:
$$\begin{aligned}
2X^{T}X\theta-2X^{T}Y& =0 \\
X^{T}X\theta & =X^{T}Y \\
\theta & =(X^TX)^{-1}X^TY
\end{aligned}$$
The expression above is the closed-form solution. One caveat: if the feature matrix $X$ is not of full column rank (i.e., some features are linearly dependent), then $X^TX$ is singular and the matrix inversion in the normal equation can fail or become numerically unstable.
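As a minimal sketch (with made-up NumPy data, not the housing dataset used later), the rank-deficient case can be handled by replacing the explicit inverse with the Moore-Penrose pseudoinverse, or by calling a least-squares solver directly:

```python
import numpy as np

# Hypothetical toy data: m = 5 samples, a bias column plus 2 features.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])
Y = rng.normal(size=(5, 1))

# Normal equation via the pseudoinverse: theta = (X^T X)^+ X^T Y,
# which coincides with (X^T X)^{-1} X^T Y when X has full column rank.
theta = np.linalg.pinv(X.T @ X) @ X.T @ Y

# np.linalg.lstsq solves the same least-squares problem and is usually
# the numerically safer choice.
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(theta, theta_lstsq))  # True up to floating-point error
```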
1.2 Gradient Descent Solution

Model: $h_\theta(x)=\sum_{i=1}^d\theta_ix_i$

Note: $x_i$ denotes the $i$-th component of $x$.

Loss function: $J(\theta)=\frac1{2m}\sum_{j=1}^m\left(y^j-h_\theta(x^j)\right)^2$

Note: $x^j$ denotes the $j$-th sample.

Objective: $\theta=\arg\min_\theta J(\theta)$

Notation:
$$
\begin{cases}\theta\in\mathbb{R}^d\\[2ex]x\in\mathbb{R}^d\\[2ex]y\in\mathbb{R}^m\end{cases}
$$
The loss function $J(\theta)$ is quadratic in the parameters $\theta$. Expanding $J(\theta)$:
$$\begin{aligned}
J(\theta)& =\frac{1}{2m}\sum_{j=1}^{m}\Big(y^{j}-h_{\theta}(x^{j})\Big)^{2} \\
&=\frac{1}{2m}\sum_{j=1}^{m}\Bigg(y^{j}-\sum_{i=1}^{d}\theta_{i}x_{i}^{j}\Bigg)^{2}
\end{aligned}$$
Taking the partial derivative of $J(\theta)$ with respect to $\theta_i$:
$$\begin{aligned}
\frac{\partial J(\theta)}{\partial\theta_i}& =\frac{\partial}{\partial\theta_{i}}\frac{1}{2m}\sum_{j=1}^{m}\Bigg(y^{j}-\sum_{k=1}^{d}\theta_{k}x_{k}^{j}\Bigg)^{2} \\
&=\frac{1}{m}\sum_{j=1}^{m}\Bigg( y^{j}-\sum_{k=1}^{d}\theta_{k}x_{k}^{j}\Bigg)(-x_{i}^{j}) \\
&=\frac{1}{m}\sum_{j=1}^{m}\Bigg(\sum_{k=1}^{d}\theta_{k}x_{k}^{j}-y^{j}\Bigg)x_{i}^{j}
\end{aligned}$$
At each step the parameters are updated along the negative gradient:
$$\begin{aligned}
\theta_{i}& :=\theta_i-\alpha\frac{\partial J(\theta)}{\partial\theta_i} \\
&=\theta_i-\alpha\frac1m\sum_{j=1}^m\biggl(\sum_{k=1}^d\theta_kx_k^j-y^j\biggr)x_i^j \\
&=\theta_i+\alpha \frac{1}{m}\sum_{j=1}^m\Bigg( y^j-\sum_{k=1}^d\theta_kx_k^j\Bigg)x_i^j
\end{aligned}$$
Gradient descent procedure:

$\text{Repeat until convergence } \{$
$$\theta_i:=\theta_i+\alpha\:\frac{1}{m}\sum_{j=1}^m\Bigg(y^j-\sum_{k=1}^d\theta_kx_k^j\Bigg)x_i^j$$
$\}$
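For reference, the per-coordinate update above is equivalent to the vectorized form $\theta\leftarrow\theta+\frac{\alpha}{m}X^T(Y-X\theta)$, which is what the implementation in Section 2.4.3 below uses. A small sketch with made-up data confirms that the element-wise and vectorized gradients agree:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 8, 3
X = rng.normal(size=(m, d))        # rows are samples x^j, columns are features
Y = rng.normal(size=(m, 1))
theta = rng.normal(size=(d, 1))

# Element-wise gradient: dJ/dtheta_i = (1/m) * sum_j (sum_k theta_k x_k^j - y^j) * x_i^j
grad_elem = np.array([
    np.mean((X @ theta - Y).ravel() * X[:, i]) for i in range(d)
]).reshape(-1, 1)

# Vectorized gradient: (1/m) * X^T (X theta - Y)
grad_vec = (X.T @ (X @ theta - Y)) / m

print(np.allclose(grad_elem, grad_vec))  # True
```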
2. Python Implementation

2.0 Imports

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
import time

%matplotlib inline
%config InlineBackend.figure_format = 'svg'
```
2.1 Load the Dataset

```python
df = pd.read_csv("./housing.csv")
print(df.head())
df.info()  # info() prints its summary directly and returns None
```
2.2 Data Preprocessing

```python
# Fill missing values in total_bedrooms with the column median
total_bedrooms = df.loc[:, "total_bedrooms"].values.reshape(-1, 1)
filled_df = df.copy()
filled_df.loc[:, "total_bedrooms"] = SimpleImputer(strategy="median").fit_transform(total_bedrooms)
filled_df.info()

# One-hot encode the categorical feature ocean_proximity
code = OneHotEncoder().fit_transform(filled_df.loc[:, "ocean_proximity"].values.reshape(-1, 1))
coded_df = pd.concat([filled_df, pd.DataFrame(code.toarray())], axis=1)
coded_df.drop(["ocean_proximity"], axis=1, inplace=True)
coded_df.columns = list(coded_df.columns[:-5]) + ["ocean_0", "ocean_1", "ocean_2", "ocean_3", "ocean_4"]
coded_df.head(10)
```
2.3 Train/Test Split

```python
# Features: the first 8 numeric columns plus the 5 one-hot columns; label: median_house_value
feature = coded_df.iloc[:, :8].join(coded_df.iloc[:, -5:])
label = coded_df["median_house_value"]
Xtrain, Xtest, Ytrain, Ytest = train_test_split(feature, label, test_size=0.3)
Xtrain.head()
```
2.4 Fitting the Model

2.4.0 Evaluation Metric: R^2

```python
def R2(y, y_pred):
    return 1 - (np.sum((y - y_pred) ** 2) / np.sum((y - np.mean(y)) ** 2))
```
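As a quick sanity check, this hand-rolled metric can be compared against sklearn.metrics.r2_score on the same inputs (a minimal sketch with made-up numbers):

```python
from sklearn.metrics import r2_score
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print(R2(y_true, y_pred), r2_score(y_true, y_pred))  # both about 0.9486
```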
2.4.1 Standardization

```python
def normalize(X):
    # Standardize each column to zero mean and unit variance
    sigma = np.std(X, axis=0)
    mu = np.mean(X, axis=0)
    X = (X - mu) / sigma
    return np.array(X)

X = np.array(Xtrain).reshape(np.size(Xtrain, 0), -1)
y = np.array(Ytrain).T.reshape(-1, 1)
X = normalize(X)
y = normalize(y)
```
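Note that normalize recomputes the mean and standard deviation from whatever array it receives; if the held-out split is standardized later for evaluation, the training-set statistics should be reused. A possible adjustment (a sketch only; Xte and yte are hypothetical names not used elsewhere):

```python
# Reuse the training-set statistics when standardizing the held-out split.
Xtr = np.array(Xtrain).reshape(np.size(Xtrain, 0), -1)
mu, sigma = np.mean(Xtr, axis=0), np.std(Xtr, axis=0)
Xte = (np.array(Xtest).reshape(np.size(Xtest, 0), -1) - mu) / sigma

ytr = np.array(Ytrain).reshape(-1, 1)
y_mu, y_sigma = np.mean(ytr), np.std(ytr)
yte = (np.array(Ytest).reshape(-1, 1) - y_mu) / y_sigma
```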
2.4.2 Closed-Form Solution

```python
def Normal_Equation(X, y):
    # theta = (X^T X)^{-1} X^T y
    return np.linalg.inv(X.T @ X) @ X.T @ y

start_time = time.time()
theta_ne = Normal_Equation(X, y)
print(f"Elapsed time: {time.time() - start_time}")
print(f"R^2: {R2(y, X @ theta_ne)}")

result_cf = pd.DataFrame({"ColumnName": list(Xtrain.columns), "Theta": theta_ne.flatten()})
result_cf
```
2.4.3 Gradient Descent Solution

```python
def MSE_Loss(y, y_pred):
    return np.sum((y_pred - y) ** 2) / (2 * np.size(y))

def GD(X, y, lr=0.01, epochs=5000):
    m, n = X.shape
    theta = np.random.randn(n, 1)
    loss = np.zeros(epochs)
    for epoch in range(epochs):
        # Vectorized gradient: (1/m) * X^T (X theta - y)
        gradient = (1 / m) * (X.T @ (X @ theta - y))
        theta -= lr * gradient
        loss[epoch] = MSE_Loss(y, X @ theta)
    return theta, loss

start_time = time.time()
theta_gd, loss = GD(X, y)
print(f"Elapsed time: {time.time() - start_time}")
print(f"R^2: {R2(y, X @ theta_gd)}")

result_gd = pd.DataFrame({"ColumnName": list(Xtrain.columns), "Theta": theta_gd.flatten()})
result_gd

# Plot the training loss curve
sns.lineplot(x=np.arange(5000), y=loss.flatten(), label='Loss Curve')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Gradient Descent Loss Curve')
```
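Since both solvers minimize the same objective on the same standardized data, their coefficient vectors should agree closely once gradient descent has converged; a quick check (assuming theta_ne and theta_gd from the cells above):

```python
# Compare the normal-equation and gradient-descent coefficients.
print(np.allclose(theta_ne, theta_gd, atol=1e-2))  # expected to be True after convergence
print(np.max(np.abs(theta_ne - theta_gd)))         # largest remaining difference
```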
3. Experimental Results