IEEE Access (Jan 2024)

A Dual-Branch Spatial–Temporal Learning Network for Video Prediction

  • Huilin Huang
  • Yepeng Guan

DOI
https://doi.org/10.1109/ACCESS.2024.3394209
Journal volume & issue
Vol. 12
pp. 73258–73267

Abstract


Video prediction aims to predict future frames by modeling the complex spatial-temporal correlations among given frames, and it plays an important role in the computer vision community. Although significant progress has been achieved, existing methods still have obvious limitations: (1) several methods rely heavily on external information or inputs (e.g., semantic maps, optical flow) to aid prediction, which hinders their wider application; (2) most existing methods still struggle to model accurate future motion from the given video frames while simultaneously keeping the appearance consistent across frames, resulting in blurry artifacts and low visual quality in the predicted frames. In this work, to predict more accurate future motion and maintain consistent appearance across video frames, we propose a dual-branch video prediction network. Specifically, to predict accurate future motion, we propose a novel motion prediction unit (MPU) that sequentially captures inter-frame motion and intra-frame appearance. To better learn temporal evolution, temporal attention is used in the MPU to enhance feature interactions in the temporal domain, and multi-scale convolution layers enlarge its receptive field. Additionally, to preserve appearance consistency, we design a spatial prediction unit (SPU) that focuses on spatial information by capturing the various appearance features of the given video frames. Moreover, considering that the mean squared error (MSE) loss is more concerned with static features, we introduce a novel divergence regularization that constrains global motion variations to generate naturalistic future frames. Extensive experiments demonstrate that our method performs better than or comparably to state-of-the-art methods on several public benchmarks.
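For intuition, the following is a minimal PyTorch sketch of what an MPU-style block could look like, assuming temporal attention applied over per-pixel feature sequences followed by parallel multi-scale convolutions, as the abstract describes. The class name, layer sizes, and wiring are illustrative assumptions, not the architecture published in the paper.

```python
import torch
import torch.nn as nn

class MotionPredictionUnit(nn.Module):
    """Hypothetical sketch of an MPU-style block: temporal attention to
    enhance feature interactions along the time axis, then multi-scale
    convolutions to enlarge the receptive field. Details are assumptions."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Temporal attention over the time axis of (B, T, C, H, W) features.
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Parallel convolutions at several kernel sizes capture multi-scale context.
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in (1, 3, 5)
        )
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape
        # Flatten spatial positions so attention runs over the temporal axis per pixel.
        seq = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        attn_out, _ = self.temporal_attn(seq, seq, seq)
        x = attn_out.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
        # Apply multi-scale convolutions frame by frame and fuse the scales.
        frames = x.reshape(b * t, c, h, w)
        multi = torch.cat([conv(frames) for conv in self.convs], dim=1)
        return self.fuse(multi).reshape(b, t, c, h, w)
```

In this sketch the block maps a (batch, time, channels, height, width) feature tensor to one of the same shape, so several such units could be stacked; the paper's actual ordering of motion and appearance processing, and its divergence regularization term, are not reproduced here.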

Keywords