Jisuanji kexue (Oct 2021)

Coherent Semantic Spatial-Temporal Attention Network for Video Inpainting

  • LIU Lang, LI Liang, DAN Yuan-hong

DOI
https://doi.org/10.11896/jsjkx.200600130
Journal volume & issue
Vol. 48, no. 10
pp. 239 – 245

Abstract

Existing video inpainting methods often produce blurred textures, distorted structures, and artifacts, while applying image-based inpainting models directly to video leads to temporal inconsistency. From a temporal perspective, a novel coherent semantic spatial-temporal attention (CSSTA) mechanism for video inpainting is proposed: through the attention layer, the model focuses on regions that are occluded in the target frame but visible in adjacent frames, so as to gather visible content with which to fill the hole regions of the target frame. The CSSTA layer can not only model the semantic correlation between hole features but also relate long-range information to the hole regions. To complete semantically coherent hole regions, a novel loss function, Feature Loss, is proposed to replace VGG Loss. The model is built on a two-stage, coarse-to-fine encoder-decoder architecture that collects and refines information from adjacent frames. Experimental results on the YouTube-VOS and DAVIS datasets show that the proposed method runs almost in real time and outperforms three typical video inpainting methods in terms of visual inpainting quality, peak signal-to-noise ratio (PSNR), and structural similarity (SSIM).
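The core idea the abstract describes, attending from occluded target-frame features to visible adjacent-frame features and filling the hole with the attention-weighted result, can be sketched as follows. This is a minimal illustration under assumed names and tensor shapes (the function attention_fill and plain scaled dot-product attention are illustrative assumptions); the actual CSSTA layer also models semantic correlation among the hole features themselves, which this sketch omits.

```python
# Hypothetical sketch of the attention-based fill step: hole features in the
# target frame attend over visible features from adjacent reference frames,
# and the hole is filled with the attention-weighted sum of visible content.
# Not the authors' actual CSSTA implementation.
import torch
import torch.nn.functional as F

def attention_fill(target_feat, ref_feat, hole_mask, ref_visible_mask):
    """
    target_feat:      (C, H, W)    features of the target frame
    ref_feat:         (T, C, H, W) features of T adjacent reference frames
    hole_mask:        (H, W) bool, True where the target frame is occluded
    ref_visible_mask: (T, H, W) bool, True where reference pixels are visible
    """
    C = target_feat.shape[0]
    # Queries: features at occluded target locations; keys: visible reference features.
    queries = target_feat.permute(1, 2, 0)[hole_mask]        # (Nq, C)
    keys = ref_feat.permute(0, 2, 3, 1)[ref_visible_mask]    # (Nk, C)

    # Scaled dot-product similarity between hole queries and visible keys.
    attn = F.softmax(queries @ keys.t() / C ** 0.5, dim=-1)  # (Nq, Nk)

    # Fill each hole location with the attention-weighted sum of visible content.
    filled = target_feat.permute(1, 2, 0).clone()
    filled[hole_mask] = attn @ keys
    return filled.permute(2, 0, 1)                           # (C, H, W)
```

Because every hole query can attend to any visible location in any reference frame, this formulation captures the long-range spatial-temporal correlation the abstract refers to.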
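The abstract states that Feature Loss replaces VGG Loss for semantic coherence but gives no formula. A generic perceptual-style feature loss, taken over the intermediate feature maps of some feature extractor phi (the extractor, layers, and L1 distance below are all assumptions, not the authors' exact formulation), might look like this:

```python
# Hypothetical feature loss: L1 distance between intermediate feature maps of
# the inpainted output and the ground-truth frame under an extractor `phi`.
import torch

def feature_loss(phi, output, target):
    """phi is a callable returning a list of intermediate feature maps."""
    loss = 0.0
    for f_out, f_gt in zip(phi(output), phi(target)):
        loss = loss + torch.mean(torch.abs(f_out - f_gt))
    return loss
```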

Keywords