Softmax Regression

Softmax regression is a generalization of logistic regression. This post walks through the core of softmax regression: the hypothesis, the loss function, optimization, regularization, and a code implementation.

Posted by Yewbs on March 12, 2019

[Header image: Ramon Casas and Pere Romeu on a Tandem, Ramon Casas, 1897]

Preface

Softmax regression is a generalization of logistic regression.

Logistic regression estimates two posterior probabilities:
$P(y=0|x;\theta)$ and $P(y=1|x;\theta)$

Softmax regression instead estimates k probabilities, where k is the number of classes, and takes the class with the largest probability as the prediction.

The Softmax Function

Before writing down the softmax regression formulas, let us first introduce the softmax function.

The softmax function, also called the normalized exponential function, is a generalization of the logistic function. It is widely used in probability-based multi-class methods, including multinomial logistic regression, multiclass linear discriminant analysis, naive Bayes classification, artificial neural networks, and deep learning.

Its expression is:
$ \sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}}, \quad j=1,2,3,\dots,K. $
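
As a quick illustration, here is a minimal NumPy sketch of the softmax function (the function and variable names are mine, not from the original post). Subtracting the row-wise maximum before exponentiating is a standard trick to avoid overflow; it does not change the result, because softmax is invariant to adding a constant to every component:

import numpy as np

def softmax(z):
    # shift each row by its max so np.exp cannot overflow;
    # the shift cancels in the ratio, so the probabilities are unchanged
    z = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

z = np.array([[1.0, 2.0, 3.0]])
print(softmax(z))        # [[0.09003057 0.24472847 0.66524096]]
print(softmax(z).sum())  # 1.0 (up to floating point)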

Hypothesis

Hypothesis:
$ h_\theta(x) = \left[ \begin{matrix} P(y=1|x; \theta) \\ P(y=2|x; \theta) \\ \vdots \\ P(y=k|x; \theta) \end{matrix} \right] = \frac{1}{\sum_{j=1}^{K}e^{\theta_j^Tx}} \left[ \begin{matrix} e^{\theta_1^Tx} \\ e^{\theta_2^Tx} \\ \vdots \\ e^{\theta_k^Tx} \end{matrix} \right] $

  • The normalizing denominator guarantees that all the probabilities sum to 1 (see the sketch below)
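
To make the hypothesis concrete, the sketch below (names such as Theta and X_aug are illustrative, not from the post) stacks the K parameter vectors $\theta_j$ as the columns of a matrix, appends a constant 1 to each input so the bias is included, and checks that every row of $h_\theta(x)$ sums to 1:

import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(0)
m, n, K = 5, 4, 3                      # 5 examples, 4 features, 3 classes
X = rng.normal(size=(m, n))
Theta = rng.normal(size=(n + 1, K))    # one column of parameters per class, bias row included

X_aug = np.c_[X, np.ones(m)]           # append 1 so the bias is absorbed into Theta
h = softmax(X_aug @ Theta, axis=1)     # shape (m, K): row i holds P(y=j | x_i; theta)

print(h.shape)        # (5, 3)
print(h.sum(axis=1))  # every row sums to 1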

Cost

  • As with logistic regression, the log loss is used

Loss:
$ J(\theta) = \mathrm{cost}(h_\theta(x), y) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{K} I\{y^{(i)}=j\} \log{\frac{e^{\theta_j^Tx^{(i)}}}{\sum_{k=1}^{K}e^{\theta_k^Tx^{(i)}}}} \right] + \frac{\lambda}{2}\|\theta\|^2 $

  • Here $I\{y^{(i)}=j\}$ is the indicator function: it is 1 when the condition inside the braces is true and 0 when it is false
  • The term ${\frac{e^{\theta_j^Tx^{(i)}}}{\sum_{k=1}^{K}e^{\theta_k^Tx^{(i)}}}}$ is exactly the $j$-th component of $h_\theta(x^{(i)})$
  • $\frac{\lambda}{2}\|\theta\|^2$ is the added regularization term (a small sketch follows this list)
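
In code, the indicator $I\{y^{(i)}=j\}$ is most easily expressed as a one-hot label matrix, so the double sum becomes an element-wise product. The following is a minimal sketch under the same conventions as above (one_hot, softmax_cost, X_aug and the other names are mine, not the post's):

import numpy as np

def one_hot(y, K):
    # indicator I{y_i = j} laid out as an (m, K) matrix of 0s and 1s
    m = y.shape[0]
    Y = np.zeros((m, K))
    Y[np.arange(m), y] = 1
    return Y

def softmax_cost(Theta, X_aug, y, lam):
    # X_aug: (m, n+1) inputs with a bias column; Theta: (n+1, K) parameters
    logits = X_aug @ Theta
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    Y = one_hot(y, Theta.shape[1])
    data_term = -np.sum(Y * np.log(probs)) / y.shape[0]   # cross-entropy over the batch
    reg_term = lam / 2.0 * np.sum(Theta ** 2)             # (lambda / 2) * ||theta||^2
    return data_term + reg_term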

Optimization

The problem is now to find the value of $\theta$ that minimizes the cost.

Here, gradient descent is used to solve $\mathrm{minimize}(loss)$.

Let:
$W = (\omega_1, \omega_2, \omega_3, \dots, \omega_n, b)^T $
$X = (x_1, x_2, x_3, \dots, x_n, 1)^T $
$Y = (y_1, y_2, y_3, \dots, y_n)^T $

That is, the bias $b$ is absorbed into the weight vector, which is why each input vector is augmented with a constant 1.

For gradient descent, first compute the gradient, i.e., differentiate $J(\theta)$ with respect to $\theta_j$:
$ \frac{\partial J(\theta)}{\partial \theta_j} = - \frac{1}{m} \sum_{i=1}^{m} \left[x^{(i)}\left(I\{y^{(i)} = j\} - P(y^{(i)}=j|x^{(i)}; \theta)\right) \right] + \lambda \theta_j $

where:
$ P(y^{(i)}=j|x^{(i)}; \theta) = \frac{e^{\theta_j^Tx^{(i)}}}{\sum_{n=1}^{K}e^{\theta_n^Tx^{(i)}}} $

Moving along the negative gradient, $\theta$ is updated as:
$ \theta = \theta - \alpha \frac{\partial J(\theta)}{\partial \theta} $
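
Putting the gradient formula and the update rule together, a single iteration of batch gradient descent looks roughly like the sketch below (illustrative names again; the full class-based implementation follows in the next section):

import numpy as np

def softmax_probs(X_aug, Theta):
    logits = X_aug @ Theta
    logits = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def gradient_step(Theta, X_aug, Y_onehot, lam, alpha):
    m = X_aug.shape[0]
    P = softmax_probs(X_aug, Theta)                       # P(y = j | x; theta) for every example
    grad = -X_aug.T @ (Y_onehot - P) / m + lam * Theta    # dJ/dtheta from the formula above
    return Theta - alpha * grad                           # theta <- theta - alpha * gradient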

Code Implementation

# coding=UTF-8
'''
@Author: Yewbs Wei
@Date: 2019-08-06 15:54:59
@LastEditors: Yewbs Wei
@LastEditTime: 2019-08-06 23:22:17
@Description: Softmax Regression implemented with gradient descent
'''
import numpy as np
from sklearn import datasets

class SoftmaxRegression():
    def __init__(self, X_train, Y_train, learning_rate=0.01, epochs=10000, threshold=0.0001, lam=0.1):
        self.X_train = X_train
        self.Y_train = Y_train
        # one column of parameters per class; the extra row holds the bias term
        self.theta = np.random.rand(self.X_train.shape[1] + 1, len(np.unique(self.Y_train)))
        self.grad = None
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.threshold = threshold
        self.cost = None
        self.lam = lam

    def softmax(self, x):
        '''
        @description: Compute the row-wise softmax of x.
        @param {ndarray} x: matrix of logits, one row per example
        @return: matrix of probabilities, same shape as x
        '''
        # subtract the row-wise max before exponentiating to avoid overflow;
        # the shift cancels out, so the probabilities are unchanged
        x = x - np.max(x, axis=1, keepdims=True)
        return np.exp(x) / np.sum(np.exp(x), axis=1, keepdims=True)
        
    def softmax_hypo(self, x):
        '''
        @description: Compute the hypothesis h_theta(x).
            Each input is augmented with a constant 1 (cf. x = (x1, x2, x3, ..., xn, 1))
            so that the bias is absorbed into theta.
        @param {ndarray} x: matrix of inputs, one row per example
        @return: matrix of class probabilities, one row per example
        '''
        xx = np.dot(np.c_[np.ones(x.shape[0]), x], self.theta)
        return self.softmax(xx)

    def onehot(self, y):
        '''
        @description: Convert the label vector y into a one-hot matrix.
        @param {ndarray} y: integer class labels
        @return: (m, K) matrix with a single 1 per row
        '''
        n = len(np.unique(y))
        m = y.shape[0]
        b = np.zeros((m, n))
        for i in range(m):
            b[i, y[i]] = 1
        return b

    def softmax_cost(self, x, y):
        '''
        @description: Compute the regularized cross-entropy cost J(theta).
        @param {ndarray} x, y: training inputs and labels
        @return: None; the cost is stored in self.cost
        '''
        y_onehot = self.onehot(y)
        hypo = self.softmax_hypo(x)
        # the element-wise product keeps only the log-probability of each true class;
        # a small epsilon guards against log(0)
        self.cost = - 1.0 / y.shape[0] * np.sum(y_onehot * np.log(hypo + 1e-12)) \
            + self.lam / 2.0 * np.sum(self.theta * self.theta)

    def softmax_grad(self, x, y):
        '''
        @description: Compute the gradient of J(theta) with respect to theta.
        @param {ndarray} x, y: training inputs and labels
        @return: None; the gradient is stored in self.grad
        '''
        y_onehot = self.onehot(y)
        self.grad = - 1.0 / y.shape[0] * \
            np.dot(np.c_[np.ones(x.shape[0]), x].T, y_onehot - self.softmax_hypo(x)) \
            + self.lam * self.theta
        
    def softmax_grad_desc(self):
        '''
        @description: Optimize theta with batch gradient descent, stopping when the
            change in cost drops below the threshold or the epoch limit is reached.
        @return: None
        '''
        self.softmax_cost(self.X_train, self.Y_train)
        cost_change = 1.0
        i = 1
        while i < self.epochs and cost_change > self.threshold:
            pre_cost = self.cost
            self.softmax_grad(self.X_train, self.Y_train)
            self.theta = self.theta - self.learning_rate * self.grad
            self.softmax_cost(self.X_train, self.Y_train)
            cost_change = abs(self.cost - pre_cost)
            i += 1
            if i % 100 == 0:
                # print the loss every 100 iterations
                print("-------%d ---- loss = %f" % (i, self.cost))
        
    def fit(self):
        '''
        @description: emm, the fit method, as in sklearn
        @return: None
        '''
        self.softmax_grad_desc()

def main():
    # load the handwritten digits dataset
    dataset = datasets.load_digits()

    X = dataset.data[:, :]
    Y = dataset.target[:, None]

    sr = SoftmaxRegression(X, Y)
    sr.fit()


if __name__ == '__main__':
    main()