In this paper, we detect actions with transfer learning + RNNs in three steps:
- In the first step, we use a frame selection algorithm to reduce the redundancy across video frames. The algorithm is explained in the paper *Adaptive Frame Selection In Two-Dimensional Convolutional Neural Network Action Recognition*, and its code can be found here (Code); a rough sketch of this kind of selection appears after this list.
- In the next stage, we use this repository (Feature extraction) to extract spatial features from each selected frame with a pre-trained ResNet-50, so that each selected frame is represented by one spatial feature vector; see the extraction sketch below.
- In the last step, we use a temporal pooling method that divides each video into 4 parts to obtain strong spatio-temporal feature vectors for each video. After feature extraction, the RNN models are trained to classify actions. Moreover, leave-one-out cross-validation (LOOCV) yields reliable results because every video of UCF11 is used for both training and evaluation; sketches of the pooling, the RNN classifier, and the LOOCV loop follow this list.
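
The actual adaptive selection algorithm is the one described in the cited paper; as a rough illustration only, here is a minimal sketch of a common frame-difference heuristic. The OpenCV-based reading and the `diff_threshold` value are assumptions, not the paper's method:

```python
# Hypothetical sketch of redundancy-aware frame selection.
# The real adaptive algorithm is described in the cited paper.
import cv2
import numpy as np

def select_frames(video_path, diff_threshold=20.0):
    """Keep a frame only if it differs enough from the last kept frame."""
    cap = cv2.VideoCapture(video_path)
    kept, last = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        # Mean absolute pixel difference against the last kept frame.
        if last is None or np.abs(gray - last).mean() > diff_threshold:
            kept.append(frame)
            last = gray
    cap.release()
    return kept
```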
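For the feature-extraction stage, a minimal sketch using torchvision's pre-trained ResNet-50 with the classification head removed, so each frame yields one 2048-d spatial feature vector. The linked repository may differ in details such as preprocessing; the transform below is the standard ImageNet one:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained ResNet-50; replacing the final fc layer with Identity
# exposes the 2048-d global-average-pooled feature as the output.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(frames):
    """Return one 2048-d spatial feature vector per selected frame."""
    # OpenCV frames are BGR; ImageNet weights expect RGB.
    batch = torch.stack([preprocess(f[:, :, ::-1].copy()) for f in frames])
    return resnet(batch)  # shape: (num_frames, 2048)
```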
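For the last step, a sketch of the temporal pooling and the RNN classifier. The 4-segment split with mean pooling is an interpretation of the description above, and the LSTM layer sizes are illustrative assumptions; `num_classes=11` matches UCF11:

```python
import torch
import torch.nn as nn

def temporal_pool(frame_features, num_segments=4):
    """Split one video's frame features into 4 temporal segments and
    average-pool each, giving a (4, 2048) spatio-temporal tensor.
    (Mean pooling is an assumption; assumes >= 4 frames per video.)"""
    chunks = torch.chunk(frame_features, num_segments, dim=0)
    return torch.stack([c.mean(dim=0) for c in chunks])

class ActionRNN(nn.Module):
    """Illustrative LSTM classifier over the 4 pooled segment vectors."""
    def __init__(self, feat_dim=2048, hidden=512, num_classes=11):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):          # x: (batch, 4, 2048)
        _, (h, _) = self.rnn(x)
        return self.fc(h[-1])      # logits: (batch, num_classes)
```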
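Finally, a hedged sketch of the LOOCV evaluation loop with scikit-learn's `LeaveOneOut`: every video is held out exactly once while the model is trained on the rest. The `train_fn` and `predict_fn` helpers are hypothetical placeholders for the training and inference code:

```python
from sklearn.model_selection import LeaveOneOut

def loocv_accuracy(videos, labels, train_fn, predict_fn):
    """Leave-one-out CV: each video serves once as the test set."""
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(videos):
        # train_fn / predict_fn are hypothetical helpers, not from the repo.
        model = train_fn([videos[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        pred = predict_fn(model, videos[test_idx[0]])
        correct += int(pred == labels[test_idx[0]])
    return correct / len(videos)
```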