- This project was made by Binshuai Wang, Shan Gao, Xi Yang, and Ziyi Zhou.
- Homepage: https://github.com/DRKWang/STA208_SPRING2020
- This project was inspired by Y. Tang's *Deep Learning using Linear Support Vector Machines* (2013) and by Abien Fred M. Agarap's *An Architecture Combining Convolutional Neural Network (CNN) and Support Vector Machine (SVM) for Image Classification*.
First, clone the project.
Then go to the repository's directory and open the following notebooks to check the results on three different datasets (MNIST, CIFAR_10, fer_13).
- DNN on MNIST.ipynb
- CNN on CIFAR_10.ipynb
- Facial_Expression_Recognition.ipynb
We also encapsulate the DNN with SVM as a class with an sklearn-like interface:
- Build a model: `model = CNN_softmax_and_SVM.CNN_SVM()`
- Fit the model: `model.fit(x_train, y_train)`
- Evaluate on the test set: `model.evaluate(x_test, y_test)`
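For reference, a minimal end-to-end sketch of this interface on MNIST is shown below. It assumes the class lives in `CNN_softmax_and_SVM.py` on the Python path and that `fit`/`evaluate` take image arrays as shown above; the data loading and scaling here are our own choices, not part of the class.

```python
import tensorflow as tf
import CNN_softmax_and_SVM  # module from this repository

# Load MNIST and scale pixel values to [0, 1] (preprocessing choice is ours).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = CNN_softmax_and_SVM.CNN_SVM()  # build the model
model.fit(x_train, y_train)            # train it
model.evaluate(x_test, y_test)         # report accuracy on the test set
```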
To implement the algorithm of replacing the softmax layer with a linear support vector machine, we took the following steps.
We tried two different methods to construct the DNN and CNN models.
The first is to use the Keras API to create a Keras model. This is a stable and efficient way to implement a neural network, but Keras provides no SVM output layer for a DNN or CNN, so we could not replace the softmax layer with an SVM layer. Even when we tried to define the SVM ourselves to meet this requirement, we still could not obtain the loss and the expressions for the weights and biases. So we turned to the second method [1]: constructing the network by defining each layer, the loss function, and the optimization procedure ourselves.
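The sketch below illustrates what this second method can look like in TensorFlow 2, in the spirit of [1]: every weight is a plain `tf.Variable`, the forward pass returns raw scores, and one optimization step is taken with `tf.GradientTape`. The layer sizes, the momentum value, and the variable names are illustrative assumptions rather than the exact code from our notebooks; the point is that, because we own every variable and the loss function, we can swap the loss between cross-entropy and the SVM objective and read the last layer's weights directly.

```python
import tensorflow as tf

# Hand-defined weights and biases (shapes are illustrative: 784 -> 512 -> 10).
W1 = tf.Variable(tf.random.normal([784, 512], stddev=0.05))
b1 = tf.Variable(tf.zeros([512]))
W2 = tf.Variable(tf.random.normal([512, 10], stddev=0.05))  # final layer: no bias when used as an SVM

def forward(x):
    h = tf.nn.relu(tf.matmul(x, W1) + b1)
    return tf.matmul(h, W2)  # raw scores; softmax or SVM loss is applied on top of these

optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9)

def train_step(x, y, loss_fn):
    # One optimization step with an arbitrary, user-defined loss function.
    with tf.GradientTape() as tape:
        loss = loss_fn(forward(x), y)
    variables = [W1, b1, W2]
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss
```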
In a standard CNN model, the last layer is a softmax function and the output is the probability of each class:
$$p_i = \frac{\exp(a_i)}{\sum_{j}\exp(a_j)}$$
and the predicted class is
$$\hat{i} = \arg\max_i p_i$$
We use cross-entropy loss here.
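As a concrete reference, here is a small sketch of this softmax head and cross-entropy loss in the same low-level style; `logits` are the raw scores $a_i$ from the last layer and `labels` are integer class ids (the names and the clipping constant are our choices).

```python
import tensorflow as tf

def softmax_cross_entropy_loss(logits, labels):
    # p_i = exp(a_i) / sum_j exp(a_j), then cross-entropy against the true class.
    probs = tf.nn.softmax(logits)
    probs = tf.clip_by_value(probs, 1e-9, 1.0)  # avoid log(0)
    onehot = tf.one_hot(labels, depth=logits.shape[-1])
    return -tf.reduce_mean(tf.reduce_sum(onehot * tf.math.log(probs), axis=1))

def predict_class(logits):
    # i_hat = argmax_i p_i (softmax is monotone, so the argmax of the logits is the same)
    return tf.argmax(logits, axis=1)
```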
With Support Vector Machines, we delete the softmax layer and output the raw scores from the last layer (note: no biases). Then we define the soft-margin loss [2]:
$$\min_w w^Tw + C\sum_{n=1}^{N}\max(1-w^T x_n y_n,0)^2$$
The primal form above corresponds to the L1-SVM with the standard hinge loss, but the hinge loss is not differentiable, so we use the L2-SVM (squared hinge loss) instead.
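A sketch of this L2-SVM objective for a single binary SVM with labels in $\{-1, +1\}$ is shown below. The penalty $C$ and the absence of a bias follow the text; the function name and shapes are illustrative.

```python
import tensorflow as tf

def l2_svm_loss(w, x, y, C=2.0):
    """Squared hinge loss: w^T w + C * sum_n max(1 - y_n * w^T x_n, 0)^2.

    x: (N, d) outputs of the last hidden layer, y: (N,) labels in {-1, +1},
    w: (d,) weight vector with no bias term.
    """
    scores = tf.linalg.matvec(x, w)            # w^T x_n for every sample
    hinge = tf.maximum(1.0 - y * scores, 0.0)  # margin violations
    return tf.tensordot(w, w, axes=1) + C * tf.reduce_sum(tf.square(hinge))
```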
To predict the class of a data point:
$$\hat{i} = \arg\max_{y} \,(w^Tx)\,y$$
Here we use only linear SVMs.
To implement the multiclass SVM, we use the one-vs-rest approach: for a $K$-class problem, $K$ linear SVMs are trained independently. The output of the $k$-th SVM is
$$a_k(x) = w_k^Tx$$
and the predicted class is
$$\hat{i} = \arg\max_k a_k(x)$$
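The sketch below spells this out: the $K$ columns of a weight matrix $W$ act as $K$ independent linear SVMs, each trained with the squared hinge loss above against a $\pm 1$ encoding of "class $k$ vs. the rest", and prediction takes the argmax of $a_k(x) = w_k^Tx$. The names and the one-hot encoding step are our own.

```python
import tensorflow as tf

def multiclass_l2_svm_loss(W, x, labels, num_classes, C=2.0):
    # Encode "class k vs. the rest" as +1 / -1 targets for each of the K SVMs.
    targets = 2.0 * tf.one_hot(labels, depth=num_classes) - 1.0  # (N, K) in {-1, +1}
    scores = tf.matmul(x, W)                                     # a_k(x) = w_k^T x, shape (N, K)
    hinge = tf.maximum(1.0 - targets * scores, 0.0)
    return tf.reduce_sum(W * W) + C * tf.reduce_sum(tf.square(hinge))

def multiclass_svm_predict(W, x):
    # Predicted class: argmax_k a_k(x).
    return tf.argmax(tf.matmul(x, W), axis=1)
```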
We also handle the case where the input image is in color. To do so, we add one more parameter to our model: the number of channels. If the data has $k$ color channels, the first layer needs $k$ times as many parameters. Beyond that, there is not much difference between handling grayscale and color images.
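A small illustration of where the channel count enters, written with Keras layers for brevity: only the first convolution's kernel grows with the number of input channels, so an RGB image triples the parameters of that layer and leaves the rest of the network unchanged. The filter count and kernel size mirror the CIFAR_10 table below.

```python
import tensorflow as tf

# First convolution for a grayscale input (1 channel) vs. an RGB input (3 channels).
gray_conv = tf.keras.layers.Conv2D(filters=32, kernel_size=5, activation="relu")
rgb_conv = tf.keras.layers.Conv2D(filters=32, kernel_size=5, activation="relu")
gray_conv.build((None, 32, 32, 1))  # grayscale: 1 input channel
rgb_conv.build((None, 32, 32, 3))   # color: 3 input channels

# Kernel shapes: (5, 5, 1, 32) vs. (5, 5, 3, 32) -- three times the weights, but only in this layer.
print(gray_conv.kernel.shape, rgb_conv.kernel.shape)
```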
1. DNN with SVM vs. DNN with softmax on MNIST.
The hyperparameters used on MNIST were manually assigned, and not through optimization.
Hyperparameters | DNN-Softmax | DNN-SVM |
---|---|---|
Batch size | 200 | 200 |
Learning rate | 1e-3 | 1e-3 |
Steps | 120000 | 120000 |
SVM C | N/A | 2 |
All experiments were conducted on a laptop with an Intel Core(TM) i5-6300HQ CPU @ 2.30GHz x 4, 16GB of DDR3 RAM, and an NVIDIA GeForce GTX 960M GPU with 4GB of GDDR5 memory.
Figure 1. Training accuracy and loss of DNN-Softmax and DNN-SVM on MNIST
We used a simple fully connected model, first reducing the inputs with PCA from 784 dimensions to 70 dimensions. The data is then divided into 300 minibatches of 200 samples each. We trained with stochastic gradient descent with momentum on these 300 minibatches for 400 epochs, for a total of 120K weight updates. To prevent overfitting, which is critical to achieving good results, Gaussian noise with a standard deviation of 1.0 is added to the input.
Two hidden layers of 512 units each are followed by either a softmax or an L2-SVM. The test-set accuracy of DNN-Softmax is 0.9789, and that of DNN-SVM is 0.9757. We observed no overfitting.
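Below is a sketch of the preprocessing described above (PCA from 784 to 70 dimensions, then Gaussian noise with standard deviation 1.0 added to the training inputs). Using scikit-learn's `PCA` and seeding the noise generator are our choices here; the notebooks may implement these steps differently.

```python
import numpy as np
import tensorflow as tf
from sklearn.decomposition import PCA

# Load and flatten MNIST: 28x28 images -> 784-dimensional vectors in [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# PCA: 784 -> 70 dimensions, fitted on the training set only.
pca = PCA(n_components=70)
x_train_pca = pca.fit_transform(x_train)
x_test_pca = pca.transform(x_test)

# Regularize by adding Gaussian noise (standard deviation 1.0) to the training inputs.
rng = np.random.default_rng(0)
x_train_noisy = x_train_pca + rng.normal(loc=0.0, scale=1.0, size=x_train_pca.shape)
```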
2. CNN with SVM vs. CNN with softmax on CIFAR_10.
Feature | Value |
---|---|
data dimension | 32×32 |
optimizer | RMSprop |
training steps | 5000 |
pooling layer | 2×2 to 1×1 |
dropout rate | 0.2 |
channels | 3 |
first conv layer | 32 filters |
second conv layer | 64 filters |
last (dense) layer | 3072 units |
local receptive fields | 5×5 |
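For readability, here is an outline of that architecture written with Keras layers. The notebooks build the equivalent layers by hand as described in the implementation section, and the padding, strides, and exact placement of dropout are our assumptions; the SVM head is the same network with the softmax removed, a bias-free linear output, and the squared hinge loss from above.

```python
import tensorflow as tf

def build_cifar10_cnn(num_classes=10, svm_head=False):
    """CNN roughly matching the table above; padding and strides are assumptions."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32, 32, 3)),                                        # 3 color channels
        tf.keras.layers.Conv2D(32, kernel_size=5, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=2),                                # 2x2 -> 1x1 pooling
        tf.keras.layers.Conv2D(64, kernel_size=5, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=2),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(3072, activation="relu"),                           # "last layer" in the table
    ])
    if svm_head:
        # SVM head: bias-free linear scores trained with the squared hinge loss.
        model.add(tf.keras.layers.Dense(num_classes, use_bias=False))
    else:
        model.add(tf.keras.layers.Dense(num_classes, activation="softmax"))
    return model
```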
Figure 2. Training accuracy and loss of CNN-Softmax and CNN-SVM on CIFAR_10
For the standard CNN-Softmax, the test-set accuracy is 0.691; for CNN-SVM, it is 0.725. In this case, the SVM variant performs noticeably better.
3. CNN with SVM vs. CNN with softmax on fer-13.
Feature | Value |
---|---|
data dimension | 35887×3 |
pixels dimension | 48×48 |
optimizer | RMSprop |
training steps | 3000 |
pooling layer | 2×2 to 1×1 |
dropout rate | 0.4 |
channels | 3 |
Figure 3. Training accuracy and loss of CNN-Softmax and CNN-SVM on fer-13.
For the standard CNN-Softmax, the test-set accuracy is 0.542; for CNN-SVM, it is 0.549. In this case, the SVM variant performs slightly better.
[1]. https://github.com/aymericdamien/TensorFlow-Examples/tree/master/tensorflow_v2
[2]. https://github.com/AFAgarap/cnn-svm/tree/35-implement-cnn-svm-tf2/model