IET Computer Vision (Jun 2021)

Video captioning via a symmetric bidirectional decoder

  • Shanshan Qi,
  • Luxi Yang

DOI
https://doi.org/10.1049/cvi2.12043
Journal volume & issue
Vol. 15, no. 4
pp. 283 – 296

Abstract

The dominant video captioning methods employ the attentional encoder–decoder architecture, where the decoder is an autoregressive structure that generates sentences from left‐to‐right. However, these methods generally suffer from exposure bias and neglect the guidance of future output contexts that right‐to‐left decoding can provide. Here, the authors propose a new symmetric bidirectional decoder for video captioning. The authors first integrate self‐attentive multi‐head attention with a bidirectional gated recurrent unit to capture the long‐term semantic dependencies in videos. The authors then apply a single decoder to generate accurate descriptions from left‐to‐right and right‐to‐left simultaneously. In each decoding direction, the decoder applies two cross‐attentive multi‐head attention modules so that, at each time step, it considers both the past hidden states from the same decoding direction and the future hidden states from the reverse decoding direction. A symmetric semantic‐guided gated attention module is devised to adaptively suppress irrelevant or misleading content in the past or future output contexts and retain the useful content, which helps avoid under‐description. Experimental evaluations on two widely used benchmark datasets, Microsoft Research Video to Text (MSR‐VTT) and Microsoft Video Description Corpus (MSVD), demonstrate that the proposed method achieves state‐of‐the‐art performance, which validates the benefit of the bidirectional decoder.
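
The idea of attending to both past and future output contexts and then gating them can be illustrated with a small sketch. This is not the authors' implementation: it uses single‐head scaled dot‐product attention instead of multi‐head attention, a scalar sigmoid gate per context, and hypothetical names (`attend`, `gated_fusion`, `gate_weights`) chosen only for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, states):
    """Single-head scaled dot-product attention of one query over state vectors."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, s) / scale for s in states])
    return [sum(w * s[i] for w, s in zip(weights, states))
            for i in range(len(query))]


def gated_fusion(query, past_states, future_states, gate_weights):
    """Fuse past and future contexts, each scaled by its own sigmoid gate.

    Assumption: each gate is sigmoid(gate_weights . [query; context]); the
    paper's gated attention module may be parameterised differently.
    """
    past_ctx = attend(query, past_states)      # same-direction hidden states
    future_ctx = attend(query, future_states)  # reverse-direction hidden states
    g_past = 1.0 / (1.0 + math.exp(-dot(gate_weights, query + past_ctx)))
    g_future = 1.0 / (1.0 + math.exp(-dot(gate_weights, query + future_ctx)))
    return [g_past * p + g_future * f for p, f in zip(past_ctx, future_ctx)]
```

A gate near 0 suppresses its context and a gate near 1 passes it through, which is the role the abstract assigns to the gated attention module; in a trained model `gate_weights` would be learned rather than set by hand.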

Keywords