IEEE Access (Jan 2024)
Transform Domain Learning for Image Recognition
Abstract
Image and video classification are typically treated as distinct tasks in computer vision: three-dimensional convolutional neural networks (3D CNNs) are commonly employed for video classification, while two-dimensional convolutional neural networks (2D CNNs) are better suited to image classification. To enable image and video recognition to share the same network (3D CNNs), we propose a transform domain learning approach that applies the video recognition model, 3D CNNs, to image recognition. Transform domain learning permits not only RGB images as input to 3D CNNs but also orthogonally transformed image sequences. Furthermore, randomly transformed images can be fed into the networks, where the random transformation is a customized arbitrary transformation. Standard 3D CNNs can thus be applied seamlessly to both images and videos. Experiments show that 3D CNNs classify images effectively with either orthogonally or randomly transformed inputs, and that orthogonally transformed data incurs no loss of accuracy relative to 2D CNNs. To unify the tasks of image and video classification, we mix the image dataset Caltech101 with the video dataset UCF101 and use 3D CNNs to recognize images in the transform domain. The results show that mixed and individual training yield almost identical recognition performance. Moreover, directly transferring a video pre-training model to the image classification task significantly improves performance.
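One way the orthogonal-transform input described above could work is to expand a single image into a sequence of transform-domain frames that a 3D CNN can consume as if it were a video. The following is a minimal sketch under assumptions: the DCT-based frequency-band split, the frame count T, and the function names are illustrative choices, not the paper's exact construction.

```python
import numpy as np

def dct_matrix(N):
    """Orthonormal DCT-II matrix (N x N): satisfies D @ D.T == I."""
    n = np.arange(N)
    D = np.sqrt(2.0 / N) * np.cos(np.pi * (n[None, :] + 0.5) * n[:, None] / N)
    D[0] /= np.sqrt(2.0)
    return D

def image_to_transform_sequence(image, T=8):
    """Expand one image (H, W, C) into a (T, H, W, C) transform-domain
    sequence, one frame per frequency band (hypothetical construction)."""
    H = image.shape[0]
    D = dct_matrix(H)
    # Orthogonal transform along the height axis.
    coeff = np.einsum("kh,hwc->kwc", D, image)
    # Split the coefficients into T disjoint frequency bands.
    bands = np.array_split(np.arange(H), T)
    frames = []
    for idx in bands:
        band = np.zeros_like(coeff)
        band[idx] = coeff[idx]
        # Inverse transform of each band gives one "video frame".
        frames.append(np.einsum("hk,kwc->hwc", D.T, band))
    return np.stack(frames)  # shape (T, H, W, C), ready for a 3D CNN

img = np.random.rand(32, 32, 3)
seq = image_to_transform_sequence(img)
assert seq.shape == (8, 32, 32, 3)
# Because the transform is orthogonal, the band split is lossless:
# the frames sum back to the original image.
assert np.allclose(seq.sum(axis=0), img, atol=1e-8)
```

Because the transform is orthogonal (and hence invertible), no image information is lost in building the sequence, which is consistent with the abstract's claim that transformed inputs do not reduce accuracy.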
Keywords