IEEE Access (Jan 2024)

A Novel Zero-Shot Real World Spatio-Temporal Super-Resolution (ZS-RW-STSR) Model for Video Super-Resolution

  • Ankit Shukla,
  • Avinash Upadhyay,
  • Manoj Sharma,
  • Anil Saini,
  • Nuzhat Fatema,
  • Hasmat Malik,
  • Asyraf Afthanorhan,
  • Mohammad Asef Hossaini

DOI: https://doi.org/10.1109/ACCESS.2024.3406476
Journal volume & issue: Vol. 12, pp. 123969–123984

Abstract


Super-resolution (SR) of degraded, real-world low-resolution (LR) video remains a challenging problem despite the development of deep learning-based SR models. Most existing state-of-the-art networks focus on recovering high-resolution (HR) videos from the corresponding down-sampled LR videos but fail in scenarios with noisy or otherwise degraded low-resolution input. In this article, a novel real-world "zero-shot" video spatio-temporal SR model, a 3D Deep Convolutional Auto-Encoder (3D-CAE)-guided attention-based deep spatio-temporal back-projection network, is proposed. The 3D-CAE extracts noise-free features from real low-resolution video, which are then used by the attention-based deep spatio-temporal back-projection network to reconstruct clean, high-resolution video. In the proposed framework, the low-resolution denoising loss and the high-resolution reconstruction loss are jointly optimized in an end-to-end manner under a zero-shot setting. Further, meta-learning is used to initialize the weights of the proposed model, combining learning on external datasets with internal learning in the zero-shot environment. To maintain temporal coherency, we use the Motion Compensation Transformer (MCT) for motion estimation and the Sub-Pixel Motion Compensation (SPMC) layer for motion compensation. We have evaluated the performance of our proposed model on the REDS and Vid4 datasets. The PSNR value of our model is 25.13 dB on the RealVSR dataset, which is 0.72 dB higher than the next-best performing model, EAVSR+. On MVSR4x, our model achieves 24.61 dB PSNR, 0.67 dB more than the EAVSR+ model. Experimental results demonstrate the effectiveness of the proposed framework on degraded and noisy real low-resolution video compared to existing methods. Furthermore, an ablation study has been conducted to highlight the contributions of the 3D-CAE and the attention layer to the overall network performance.
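The joint objective described in the abstract (a denoising loss on the LR stream plus an HR reconstruction loss, optimized together) can be sketched minimally as follows. This is an illustrative assumption, not the paper's exact formulation: the L1 distance and the weighting factor `lam` are placeholders for whatever losses and balance the authors actually use.

```python
def l1_loss(pred, target):
    """Mean absolute difference between two flat sequences of pixel values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def joint_zero_shot_loss(denoised_lr, clean_lr, sr_out, hr_ref, lam=0.5):
    """Combine the LR denoising term with the HR reconstruction term.

    denoised_lr : 3D-CAE output on the noisy LR video (flattened pixels)
    clean_lr    : denoising target for the LR stream
    sr_out      : super-resolved output of the back-projection network
    hr_ref      : high-resolution reconstruction target
    lam         : hypothetical weighting factor between the two terms
    """
    loss_denoise = l1_loss(denoised_lr, clean_lr)  # 3D-CAE denoising loss
    loss_recon = l1_loss(sr_out, hr_ref)           # HR reconstruction loss
    return loss_recon + lam * loss_denoise

# Toy example with flattened frames: identical inputs give zero total loss.
lr = [0.2, 0.4, 0.6, 0.8]
hr = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
total = joint_zero_shot_loss(lr, lr, hr, hr)  # → 0.0
```

In the zero-shot setting, both terms are computed on the single test video itself (after meta-learned initialization), so the same scalar drives internal learning without any paired external supervision.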

Keywords