Residual networks can be viewed as ensembles of many shallower sub-networks. The idea is to train a residual network so that the knowledge of this ensemble is distilled into its sub-networks within a single training procedure. The advantages of this approach are:
- Improved accuracy of the original ResNet
- Training residual networks of multiple depths in a single, efficient procedure
- A better approach to knowledge distillation than traditional two-stage teacher/student methods
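The core ingredient of such a distillation objective is a loss that pulls a shallower sub-network's predictions toward those of the full network. Below is a minimal NumPy sketch (not the authors' implementation) of the standard temperature-softened KL distillation term, where the full-depth network plays the teacher and a sub-network of reduced depth plays the student; the logit values and the temperature `T=2.0` are illustrative assumptions:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-softened softmax; higher T gives a softer distribution."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Hypothetical logits: full-depth network (teacher) vs. a shallow sub-network.
full_net_logits = [4.0, 1.0, 0.5]
sub_net_logits = [3.5, 1.2, 0.3]

loss = distill_loss(full_net_logits, sub_net_logits)
```

During training, one such term per sub-network depth would be added to the usual cross-entropy loss, so all depths are optimized jointly in the single procedure described above.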