IEEE Access (Jan 2024)
Transfer Learning Models for CNN Fusion With Fisher Vector for Codebook Optimization of Foreground Features
Abstract
Human action recognition has become one of the main topics in computer vision due to its high demand and competitiveness in real-world applications. The main goals of human action recognition are to improve classification accuracy and to reduce computational complexity. Previous studies have mainly followed two approaches: hand-crafted feature extraction and deep learning. The hand-crafted approach is simple, which gives it an advantage in computational complexity, but its accuracy is low. Conversely, the deep learning approach achieves high accuracy even on complex datasets, but it suffers from high computational complexity and long training times because it must process huge datasets during training. Other approaches use pre-trained deep learning networks to fuse the two methods. In this paper, we introduce a combination of pre-trained convolutional neural networks (CNNs) for feature extraction, an improved Fisher vector (iFV) codebook, and an optimized support vector machine (SVM) to achieve improved human action recognition. We leveraged three pre-trained CNNs, namely Inception-ResNet-v2, NASNet-Large, and Xception, to extract the features, and then applied the improved Fisher vector codebook to encode them. We subsequently trained an SVM on the encoded features for classification and re-adjusted the SVM weights using five different optimization techniques: SGD, Adadelta, Adam, Adamax, and Nadam. To evaluate performance, we used the UCF101 and HMDB51 datasets. The results demonstrate that the accuracy and computational complexity of our approach are comparable to state-of-the-art techniques.
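The following is a minimal sketch of how the pipeline summarized above (pre-trained CNN features, Fisher-vector encoding, SVM classification) could be wired together. It is an illustrative assumption, not the authors' implementation: the GMM codebook size, the iFV normalization details, and the optimizer-based SVM weight re-adjustment described in the abstract are simplified or omitted, and all variable names are hypothetical.

```python
# Sketch only: pre-trained CNN frame descriptors -> Fisher-vector encoding -> linear SVM.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import LinearSVC
from tensorflow.keras.applications import InceptionResNetV2, NASNetLarge, Xception

# Pre-trained backbones with the classifier heads removed; global average pooling
# turns each video frame into a fixed-length descriptor.
backbones = [
    InceptionResNetV2(weights="imagenet", include_top=False, pooling="avg"),
    NASNetLarge(weights="imagenet", include_top=False, pooling="avg"),
    Xception(weights="imagenet", include_top=False, pooling="avg"),
]

def fisher_vector(descriptors, gmm):
    """Encode a set of frame descriptors of shape (T, D) against a GMM codebook,
    returning simplified first- and second-order Fisher statistics."""
    q = gmm.predict_proba(descriptors)                        # soft assignments, shape (T, K)
    diff = descriptors[:, None, :] - gmm.means_[None]         # shape (T, K, D)
    sigma = np.sqrt(gmm.covariances_)                         # diagonal std devs, shape (K, D)
    d_mu = (q[..., None] * diff / sigma).sum(axis=0)          # first-order statistics
    d_sigma = (q[..., None] * ((diff / sigma) ** 2 - 1)).sum(axis=0)  # second-order statistics
    fv = np.hstack([d_mu.ravel(), d_sigma.ravel()])
    return fv / (np.linalg.norm(fv) + 1e-12)                  # L2 normalization

# Typical usage (descriptor arrays assumed to be prepared elsewhere):
# gmm = GaussianMixture(n_components=64, covariance_type="diag").fit(train_descriptors)
# train_fvs = np.stack([fisher_vector(d, gmm) for d in per_video_descriptors])
# clf = LinearSVC(C=1.0).fit(train_fvs, train_labels)  # SVM on the encoded features
```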
Keywords