SVD-evolutive-CNN

(PyTorch implementation, done in July 2021)

Toy example of a tool that optimizes neural network layer dimensions during training, according to their singular value decomposition (SVD). The neural network grows if the task is too difficult for the current structure, and shrinks if it is overparametrized for the task.

Layers considered: convolution, dense, residual block.

This tool could likely be extended to transformers, as a generalization of Collaborative Attention.

Usage:

Given a general network architecture, the tool optimizes (i.e., modifies) layer widths during training without loss of accuracy, allowing network weights to be carried over from one step to the next. This makes it possible to reuse the weights of a previously trained network, saving time and energy.
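A hypothetical usage sketch of this workflow, in PyTorch. `optimize_layer_widths` is a placeholder name, not necessarily the function exposed by this repository; it only illustrates the intended loop of training for a while and then letting the tool resize layers while keeping the trained weights.

```python
# Hypothetical sketch only: `optimize_layer_widths` is a placeholder, not the repo's API.
import torch

def optimize_layer_widths(model: torch.nn.Module) -> torch.nn.Module:
    """Placeholder: the real step would apply the SVD-based shrink/grow logic per layer."""
    return model  # no-op here; the actual tool returns a resized model with reused weights

model = torch.nn.Sequential(
    torch.nn.Linear(784, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    # ... usual training loop on the current architecture ...
    if (epoch + 1) % 5 == 0:
        model = optimize_layer_widths(model)              # widths may change, weights are kept
        optimizer = torch.optim.Adam(model.parameters())  # re-bind the optimizer to the new params
```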

Results:

Tested on the MNIST dataset, it gives 98.5% accuracy with a light 6-layer ResNet of only 15k parameters (starting from 2M parameters). This reduction can be reached within an hour on a consumer GPU, automatically and without loss of stability (on this easy dataset). Approximately 50 automatic optimization steps reduce the network by 99% without loss of accuracy.

Tested on the Fashion-MNIST dataset, it gives 90% accuracy with a light 6-layer ResNet of 350k parameters (starting from ~500k parameters). This reduction can be reached within an hour on a consumer GPU, automatically and without loss of stability.

Tested on the CIFAR-10 dataset, it gives 92% accuracy on the train set with 2M parameters and a 9-layer ResNet, which is far too deep for a set of 32x32 images. The test accuracy is not as good, around 75%, but I suspect this is because the model has no batch norm, the learning rate is not fine-tuned, and the data is not augmented. EDIT 1: with basic data augmentation, the test accuracy increases to 85%. EDIT 2: applying the SVD reduction to Id + A @ B on each ResBlock yields a larger model. This work is still in progress; perhaps the thresholds for pruning or expanding the network are too low?

Idea and principle

Given a neural network structure, the tool performs an SVD decomposition of each layer's weight matrix.

On the diagonal matrix of singular values Σ:

  • the tool prunes the lowest (low-energy / low-variance) values, and prunes the corresponding dimensions along the matching vectors of U and V.T, and
  • adds new dimensions, orthogonal to the existing singular vectors, on layers where Σ has high-energy values (a small decision sketch follows this list).
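A minimal sketch (not the repository's code) of this prune/expand decision: inspect the energy carried by each singular value of a layer's weight. The threshold values below are illustrative assumptions.

```python
import torch

def svd_energy(weight: torch.Tensor) -> torch.Tensor:
    """Normalized energy of each singular value (s**2 / sum(s**2)), in decreasing order."""
    w2d = weight.reshape(weight.shape[0], -1)   # conv kernel (out, in, k, k) -> (out, in*k*k)
    s = torch.linalg.svdvals(w2d)
    return s**2 / (s**2).sum()

def suggest_action(weight, prune_thresh=1e-3, expand_thresh=0.5):
    energy = svd_energy(weight)
    if energy[-1] < prune_thresh:   # weakest direction carries almost no energy -> shrink
        return "prune"
    if energy[0] > expand_thresh:   # a single direction dominates -> the layer may be too narrow
        return "expand"
    return "keep"

conv = torch.nn.Conv2d(32, 64, kernel_size=3)
print(suggest_action(conv.weight.detach()))
```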

Formally, given a layer l, an input X, an output Y and a transformation Φ : X -> Y = σ(A @ X + b) on this layer, the SVD decomposes the matrix A as:

A = U @ Σ @ V.T, where U and V are unitary (U @ U.T = U.T @ U = Id) and Σ is diagonal.

If A is in R^(d_out x d_in), with d_out < d_in, then Σ is in R^(d_out x d_out).

We sort the values of Σ in decreasing order. Replacing the last one by 0 defines an approximate matrix Σ_(d_out - 1).

The difference is bounded and small (equal, in spectral norm, to the dropped singular value), and projects A onto a subspace of dimension d_out - 1. Moreover, the projection of this approximation onto the first d_out - 1 dimensions is also of dimension d_out - 1. Thus, without losing the representation power of the neural network, we can restrict the output of this layer to dimension d_out - 1, and thereby reduce the number of parameters.
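A quick numeric check of that statement, in PyTorch: zeroing the smallest singular value changes A by exactly that value in spectral norm, and the result has rank d_out - 1.

```python
import torch

torch.manual_seed(0)
d_out, d_in = 8, 16
A = torch.randn(d_out, d_in)

U, S, Vh = torch.linalg.svd(A, full_matrices=False)  # U: 8x8, S: (8,), Vh: 8x16
S_trunc = S.clone()
S_trunc[-1] = 0.0                                    # drop the weakest direction
A_approx = U @ torch.diag(S_trunc) @ Vh

err = torch.linalg.matrix_norm(A - A_approx, ord=2)  # spectral norm of the difference
print(err.item(), S[-1].item())                      # equal, up to floating-point precision
print(torch.linalg.matrix_rank(A_approx).item())     # d_out - 1 = 7
```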

Now, A = U_(d_out - 1) @ Σ_(d_out - 1) @ V_(d_out - 1, d_in).T is the new Φ(x) on this new output space. We use this new output space as the input space of the next layer, setting A(l+1) to its reduced version of shape (d_out(l+1), d_out(l) - 1), thus reducing the size of the two layers and the total number of parameters of the network.
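A minimal sketch of this two-layer shrink, under the simplifying assumption that there is no non-linearity between layer l and layer l+1 (the general case uses the formulas given just below): the weakest output direction of layer l is removed and U_(d_out - 1) is folded into layer l+1.

```python
import torch

torch.manual_seed(0)
d_in, d_out, d_next = 16, 8, 4
A  = torch.randn(d_out, d_in)    # layer l
A2 = torch.randn(d_next, d_out)  # layer l+1

U, S, Vh = torch.linalg.svd(A, full_matrices=False)
k = d_out - 1                               # the new, smaller width
A_new  = torch.diag(S[:k]) @ Vh[:k]         # new layer l weight, shape (k, d_in)
A2_new = A2 @ U[:, :k]                      # new layer l+1 weight, shape (d_next, k)

x = torch.randn(d_in)
err = (A2 @ (A @ x) - A2_new @ (A_new @ x)).norm()
bound = torch.linalg.matrix_norm(A2, ord=2) * S[-1] * x.norm()
print(err.item() <= bound.item() + 1e-6)    # the error is controlled by the dropped value
```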

On the following layer, the reduced weights are computed as follows:

(see the next_layer_shrinking_equations image in the repository)

Regarding the bias, it is computed as the vector that minimizes:

(see the bias_approx_equations image in the repository)

Symmetrically, on layers where the singular values are high, we can expand the output space from R^(d_out) to R^(d_out + 1) with vectors orthogonal to the original output space, allowing the neural network to find new relevant features and improve its overall accuracy.
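A minimal sketch of such an expansion step (an assumption, not necessarily the repository's exact procedure): append one new output row orthogonal to the existing rows of A, so the extra dimension starts as a genuinely new direction. The following layer would also need a matching new input column (e.g. initialized to zero).

```python
import torch

torch.manual_seed(0)
d_out, d_in = 8, 16
A = torch.randn(d_out, d_in)

candidate = torch.randn(d_in)
Q, _ = torch.linalg.qr(A.T)                 # columns of Q span the row space of A
candidate = candidate - Q @ (Q.T @ candidate)
new_row = candidate / candidate.norm()      # Gram-Schmidt: orthogonal to every row of A

A_expanded = torch.cat([A, 1e-2 * new_row.unsqueeze(0)], dim=0)  # small initial scale
print(A_expanded.shape)                     # (d_out + 1, d_in)
print((A @ new_row).abs().max().item())     # ~0: the new direction is orthogonal to A's rows
```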

Experimental findings to be explained:

  • in ResNet blocks, the intermediate channel size seems to converge to a size significantly smaller (around 3 times) than the input and output sizes, as if the neural network distilled channel information through space and re-channelized it before the addition with the (space-oriented) residual branch.

Directions to improve the model:

  • The "optimize layers" util can be split in two : one utils to manage layers enlargment or layers shrinking, and another tool which layers to enlarge or to shrinking, given a constraint (GPU memory...).
  • To be tested with transformers, batchnorm, separable convolutions...
  • On ResBlocks, my current intuition is that the singular values of (Id + A @ B) indicate the optimal width of the layers, while the singular values of A and B indicate the optimal depth of the network. To be investigated...
  • This is a prototype and needs quite a lot of work to industrialize ;)

Related works:

During my search for related works, I found these articles about neural networks and SVD.

And this hint that Transformers, too, can be compressed efficiently:

To re-use this work:

Please signal your interest through the "issues" of this repository, follow the rules of the given license, and cite me as the author as follows:

Jérome Dejaegher, SVD evolutive CNN, published on GitHub on 29/07/2021
