SVD-evolutive-CNN

(PyTorch implementation, done in July 2021)

Toy example of a tool that optimizes neural network layer dimensions during training, according to their singular value decomposition (SVD). The neural network grows if the task is too difficult for the current structure, and shrinks if it is overparametrized for the task.

Layers considered: convolution, dense, residual block.

This tool could likely be extended to transformers, as a generalization of Collaborative Attention.

Usage:

Given a general network architecture, the tool optimizes (i.e., modifies) layer widths during training without loss of accuracy, allowing network weights to be carried over from one step to the next. This makes it possible to reuse the weights of a previously trained network, saving time and energy.
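A hypothetical usage sketch of this workflow, in PyTorch. `optimize_layer_widths` is a placeholder name, not necessarily the function exposed by this repository; it only illustrates the intended loop of training for a while and then letting the tool resize layers while keeping the trained weights.

```python
# Hypothetical sketch only: `optimize_layer_widths` is a placeholder, not the repo's API.
import torch

def optimize_layer_widths(model: torch.nn.Module) -> torch.nn.Module:
    """Placeholder: the real step would apply the SVD-based shrink/grow logic per layer."""
    return model  # no-op here; the actual tool returns a resized model with reused weights

model = torch.nn.Sequential(
    torch.nn.Linear(784, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    # ... usual training loop on the current architecture ...
    if (epoch + 1) % 5 == 0:
        model = optimize_layer_widths(model)              # widths may change, weights are kept
        optimizer = torch.optim.Adam(model.parameters())  # re-bind the optimizer to the new params
```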

Results:

Tested on the MNIST dataset, it gives 98.5% accuracy with a light 6-layer ResNet of only 15k parameters (starting from 2M parameters). This reduction can be reached within an hour on a consumer GPU, automatically and without loss of stability (on this easy dataset). Approximately 50 automatic optimization steps reduce the network by 99% without loss of accuracy.

Tested on the Fashion-MNIST dataset, it gives 90% accuracy with a light 6-layer ResNet of 350k parameters (starting from ~500k parameters). This reduction can be reached within an hour on a consumer GPU, automatically and without loss of stability.

Tested on the CIFAR-10 dataset, it gives 92% accuracy on the train set with 2M parameters and a 9-layer ResNet, which is far too deep for a set of 32x32 images. The test accuracy is not as good, around 75%, but I suspect this is because the model has no batch norm, the learning rate is not fine-tuned, and the data is not augmented. EDIT 1: with basic data augmentation, the test accuracy increases to 85%. EDIT 2: applying the SVD reduction to Id + A @ B on each ResBlock yields a larger model. This work is still in progress; perhaps the thresholds for pruning or expanding the network are too low?

Idea and principle

Given a neural network structure, the tool performs an SVD decomposition of each layer's weight matrix.

On the diagonal matrix of singular values Σ:

  • the tool prunes the lowest (low-energy / low-variance) values, and prunes the corresponding dimensions along the matching vectors of U and V.T, and
  • adds new dimensions, orthogonal to the existing singular vectors, on layers where Σ has high-energy values (a small decision sketch follows this list).
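A minimal sketch (not the repository's code) of this prune/expand decision: inspect the energy carried by each singular value of a layer's weight. The threshold values below are illustrative assumptions.

```python
import torch

def svd_energy(weight: torch.Tensor) -> torch.Tensor:
    """Normalized energy of each singular value (s**2 / sum(s**2)), in decreasing order."""
    w2d = weight.reshape(weight.shape[0], -1)   # conv kernel (out, in, k, k) -> (out, in*k*k)
    s = torch.linalg.svdvals(w2d)
    return s**2 / (s**2).sum()

def suggest_action(weight, prune_thresh=1e-3, expand_thresh=0.5):
    energy = svd_energy(weight)
    if energy[-1] < prune_thresh:   # weakest direction carries almost no energy -> shrink
        return "prune"
    if energy[0] > expand_thresh:   # a single direction dominates -> the layer may be too narrow
        return "expand"
    return "keep"

conv = torch.nn.Conv2d(32, 64, kernel_size=3)
print(suggest_action(conv.weight.detach()))
```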

Formally, given a layer l, an input X, an output Y and a transformation Φ : X -> Y = σ(A @ X + b) on this layer, the SVD decomposes the matrix A as:

A = U @ Σ @ V.T, where U and V are unitary (U @ U.T = U.T @ U = Id) and Σ is diagonal.

If A is in R^(d_out x d_in), with d_out < d_in, then Σ is in R^(d_out x d_out).

We sort the values of Σ in decreasing order. Replacing the last one by 0 defines an approximate matrix Σ_(d_out - 1).

The difference is bounded and small (equal, in spectral norm, to the dropped singular value), and projects A onto a subspace of dimension d_out - 1. Moreover, the projection of this approximation onto the first d_out - 1 dimensions is also of dimension d_out - 1. Thus, without losing the representation power of the neural network, we can restrict the output of this layer to dimension d_out - 1, and thereby reduce the number of parameters.
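A quick numeric check of that statement, in PyTorch: zeroing the smallest singular value changes A by exactly that value in spectral norm, and the result has rank d_out - 1.

```python
import torch

torch.manual_seed(0)
d_out, d_in = 8, 16
A = torch.randn(d_out, d_in)

U, S, Vh = torch.linalg.svd(A, full_matrices=False)  # U: 8x8, S: (8,), Vh: 8x16
S_trunc = S.clone()
S_trunc[-1] = 0.0                                    # drop the weakest direction
A_approx = U @ torch.diag(S_trunc) @ Vh

err = torch.linalg.matrix_norm(A - A_approx, ord=2)  # spectral norm of the difference
print(err.item(), S[-1].item())                      # equal, up to floating-point precision
print(torch.linalg.matrix_rank(A_approx).item())     # d_out - 1 = 7
```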

Now, A = U_(d_out - 1) @ Σ_(d_out - 1) @ V_(d_out - 1, d_in).T is the new Φ(x) on this new output space. We use this new output space as the input space of the next layer, setting A(l+1) to its reduced version of shape (d_out(l+1), d_out(l) - 1), thus reducing the size of the two layers and the total number of parameters of the network.
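A minimal sketch of this two-layer shrink, under the simplifying assumption that there is no non-linearity between layer l and layer l+1 (the general case uses the formulas given just below): the weakest output direction of layer l is removed and U_(d_out - 1) is folded into layer l+1.

```python
import torch

torch.manual_seed(0)
d_in, d_out, d_next = 16, 8, 4
A  = torch.randn(d_out, d_in)    # layer l
A2 = torch.randn(d_next, d_out)  # layer l+1

U, S, Vh = torch.linalg.svd(A, full_matrices=False)
k = d_out - 1                               # the new, smaller width
A_new  = torch.diag(S[:k]) @ Vh[:k]         # new layer l weight, shape (k, d_in)
A2_new = A2 @ U[:, :k]                      # new layer l+1 weight, shape (d_next, k)

x = torch.randn(d_in)
err = (A2 @ (A @ x) - A2_new @ (A_new @ x)).norm()
bound = torch.linalg.matrix_norm(A2, ord=2) * S[-1] * x.norm()
print(err.item() <= bound.item() + 1e-6)    # the error is controlled by the dropped value
```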

On the following layer, the reduced weights are computed as follows:

(see the next_layer_shrinking_equations image in the repository)

Regarding the bias, it is computed as the vector that minimizes:

(see the bias_approx_equations image in the repository)

Symmetrically, on layers where the singular values are high, we can expand the output space from R^(d_out) to R^(d_out + 1) with vectors orthogonal to the original output space, allowing the neural network to find new relevant features and improve its overall accuracy.
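A minimal sketch of such an expansion step (an assumption, not necessarily the repository's exact procedure): append one new output row orthogonal to the existing rows of A, so the extra dimension starts as a genuinely new direction. The following layer would also need a matching new input column (e.g. initialized to zero).

```python
import torch

torch.manual_seed(0)
d_out, d_in = 8, 16
A = torch.randn(d_out, d_in)

candidate = torch.randn(d_in)
Q, _ = torch.linalg.qr(A.T)                 # columns of Q span the row space of A
candidate = candidate - Q @ (Q.T @ candidate)
new_row = candidate / candidate.norm()      # Gram-Schmidt: orthogonal to every row of A

A_expanded = torch.cat([A, 1e-2 * new_row.unsqueeze(0)], dim=0)  # small initial scale
print(A_expanded.shape)                     # (d_out + 1, d_in)
print((A @ new_row).abs().max().item())     # ~0: the new direction is orthogonal to A's rows
```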

Experimental findings to be explained:

  • in ResNet blocks, the intermediate channel size seems to converge to a size significantly smaller (around 3 times) than the input and output sizes, as if the neural network distilled channel information through space and re-channelized it before the addition with the (space-oriented) residual branch.

Directions to improve the model:

  • The "optimize layers" util can be split in two : one utils to manage layers enlargment or layers shrinking, and another tool which layers to enlarge or to shrinking, given a constraint (GPU memory...).
  • To be tested with transformers, batchnorm, separable convolutions...
  • On ResBlocks, my current intuition is that the singular values of (Id + A @ B) indicate the optimal width of the layers, while the singular values of A and B indicate the optimal depth of the network. To be investigated...
  • This is a prototype and needs quite a lot of work to industrialize ;)

Related works:

During my search for related works, I found these articles about neural networks and SVD.

And this hint that Transformers, too, can be compressed efficiently:

To re-use this work:

Please signal your interest through the "issues" of this repository, follow the rules of the given license, and cite me as the author as follows:

Jérome Dejaegher, SVD evolutive CNN, published on GitHub on 29/07/2021
