IEEE Access (Jan 2024)
No-Reference Video Quality Assessment Using Transformers and Attention Recurrent Networks
Abstract
In recent years, numerous studies have investigated methods for video quality assessment (VQA). These studies have predominantly focused on specific types of video degradation tailored to the application of interest. However, natural videos and, more recently, user-generated content (UGC) videos exhibit complex distortions that are difficult to model. Consequently, most current VQA approaches struggle to achieve high performance on these videos. In this paper, we propose a novel Transformer-based architecture that extracts spatial distortion features and spatio-temporal features from videos in two specialized branches. The spatial distortion branch follows a transfer learning strategy in which a standard ViT is pre-trained with a masked autoencoder (MAE) self-supervised learning task and then fine-tuned to predict the distortion type of corrupted images from the CSIQ database. The features from this branch capture degradation at the level of individual frames. The second branch employs a 3D Shifted-Window Transformer (Swin-T) to extract spatio-temporal features across multiple frames; here again, we rely on transfer learning, pre-training the 3D Swin-T on a video dataset for human action recognition to obtain rich features. Finally, a temporal memory block built on an attention recurrent neural network is proposed to predict the final video quality score from the spatio-temporal sequence of features. We evaluate our method on two popular UGC databases, namely KoNViD-1k and LIVE-VQC. The results show that it outperforms state-of-the-art models on the KoNViD-1k database, achieving an SROCC of 0.927 and a PLCC of 0.925, while delivering highly competitive results on the LIVE-VQC database.
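As a rough illustration of the two-branch design summarized above, the sketch below combines per-frame features from a spatial (ViT-style) encoder with clip-level features from a spatio-temporal (3D Swin-style) encoder, and pools the fused sequence with a GRU plus attention pooling as a stand-in for the attention-recurrent temporal memory block. All module names, feature dimensions (spatial_dim, temporal_dim, hidden_dim), and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TwoBranchVQA(nn.Module):
    """Minimal sketch of the two-branch VQA design described in the abstract.

    spatial_branch: per-frame encoder (assumed: a ViT pre-trained with MAE and
        fine-tuned for distortion-type classification), returning one feature
        vector per frame.
    temporal_branch: clip-level encoder (assumed: a 3D Swin-T pre-trained on an
        action-recognition dataset), returning one feature vector per clip.
    The GRU + attention pooling head approximates the paper's attention
    recurrent temporal memory block; layer sizes are arbitrary choices.
    """

    def __init__(self, spatial_branch, temporal_branch,
                 spatial_dim=768, temporal_dim=1024, hidden_dim=512):
        super().__init__()
        self.spatial_branch = spatial_branch      # frame-level distortion features
        self.temporal_branch = temporal_branch    # clip-level spatio-temporal features
        self.fuse = nn.Linear(spatial_dim + temporal_dim, hidden_dim)
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)      # attention weights over time steps
        self.head = nn.Linear(hidden_dim, 1)      # scalar quality score

    def forward(self, frames, clips):
        # frames: (B, T, C, H, W) sampled frames; clips: (B, T, ...) short snippets,
        # one per sampled time step (shapes are assumptions for this sketch).
        B, T = frames.shape[:2]
        f_spat = self.spatial_branch(frames.flatten(0, 1)).view(B, T, -1)
        f_temp = self.temporal_branch(clips.flatten(0, 1)).view(B, T, -1)
        x = torch.relu(self.fuse(torch.cat([f_spat, f_temp], dim=-1)))
        h, _ = self.gru(x)                        # recurrent temporal memory
        w = torch.softmax(self.attn(h), dim=1)    # attention over the sequence
        pooled = (w * h).sum(dim=1)               # attention-weighted temporal pooling
        return self.head(pooled).squeeze(-1)      # predicted quality score per video
```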
Keywords