IEEE Access (Jan 2020)

Understanding Objects in Video: Object-Oriented Video Captioning via Structured Trajectory and Adversarial Learning

  • Fangyi Zhu,
  • Jenq-Neng Hwang,
  • Zhanyu Ma,
  • Guang Chen,
  • Jun Guo

DOI
https://doi.org/10.1109/ACCESS.2020.3021857
Journal volume & issue
Vol. 8
pp. 169146–169159

Abstract


Traditional video captioning produces a holistic description of a video, so detailed descriptions of specific objects may not be available. Moreover, most methods are trained on frame-level features in which objects are entangled, and on ambiguous descriptions, which makes learning vision-language relationships difficult. Without identifying the classes and locations of individual objects, or associating the transition trajectories among them, these data-driven image-based video captioning methods cannot reason about activities from visual features alone, let alone perform well on small samples. We propose a novel task, object-oriented video captioning, which focuses on understanding videos at the object level. We re-annotate an object-oriented video captioning dataset (Object-Oriented Captions) with object-sentence pairs to facilitate more effective cross-modal learning. We then design a video-based structured trajectory network with adversarial learning (STraNet) to effectively analyze activities along the time dimension and proactively capture vision-language connections on small datasets. The proposed STraNet consists of four components: the structured trajectory representation, the attribute explorer, the attribute-enhanced caption generator, and the adversarial discriminator. The high-level structured trajectory representation provides a useful supplement to previous image-based approaches, allowing activities to be reasoned about from the temporal evolution of visual features and the dynamic movement of spatial locations. The attribute explorer captures discriminative features among different objects, with which the subsequent caption generator can produce more informative and accurate descriptions. Finally, by adding an adversarial discriminator to the caption generation task, we improve the learning of inter-relationships between the visual content and the corresponding visual words. To demonstrate its effectiveness, we evaluate the proposed method on the new dataset and compare it with state-of-the-art video captioning methods. The experimental results show that STraNet can precisely describe concurrent objects and their activities in detail.
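To make the four-component design concrete, below is a minimal sketch of how the described pipeline could be wired together. This is an illustrative assumption, not the authors' implementation: all module names (STraNetSketch, traj_encoder, attr_explorer, etc.), dimensions, and interfaces are hypothetical; only the component roles follow the abstract.

```python
# Hypothetical sketch of the four STraNet components from the abstract:
# structured trajectory representation, attribute explorer,
# attribute-enhanced caption generator, adversarial discriminator.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class STraNetSketch(nn.Module):
    def __init__(self, feat_dim=2048, loc_dim=4, hidden=512,
                 num_attrs=300, vocab_size=10000):
        super().__init__()
        # Structured trajectory representation: fuse per-frame appearance and
        # bounding-box location of one tracked object, then encode over time.
        self.traj_encoder = nn.LSTM(feat_dim + loc_dim, hidden, batch_first=True)
        # Attribute explorer: predict object attributes from the trajectory code.
        self.attr_explorer = nn.Sequential(
            nn.Linear(hidden, num_attrs), nn.Sigmoid())
        # Attribute-enhanced caption generator: decode words conditioned on the
        # trajectory code concatenated with the predicted attribute vector.
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.LSTM(hidden + hidden + num_attrs, hidden,
                               batch_first=True)
        self.word_out = nn.Linear(hidden, vocab_size)
        # Adversarial discriminator: score (visual code, sentence code) pairs,
        # distinguishing ground-truth captions from generated ones.
        self.discriminator = nn.Sequential(
            nn.Linear(hidden + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, obj_feats, obj_boxes, captions):
        # obj_feats: (B, T, feat_dim) appearance features of one object track
        # obj_boxes: (B, T, loc_dim) normalized box coordinates per frame
        # captions:  (B, L) word indices of the target sentence
        traj_in = torch.cat([obj_feats, obj_boxes], dim=-1)
        _, (h, _) = self.traj_encoder(traj_in)
        traj_code = h[-1]                             # (B, hidden)
        attrs = self.attr_explorer(traj_code)         # (B, num_attrs)

        words = self.word_emb(captions)               # (B, L, hidden)
        cond = torch.cat([traj_code, attrs], dim=-1)
        cond = cond.unsqueeze(1).expand(-1, words.size(1), -1)
        dec_out, _ = self.decoder(torch.cat([words, cond], dim=-1))
        logits = self.word_out(dec_out)               # (B, L, vocab_size)

        sent_code = dec_out.mean(dim=1)               # crude sentence embedding
        score = self.discriminator(torch.cat([traj_code, sent_code], dim=-1))
        return logits, attrs, score


# Shape check with random tensors (2 object tracks, 16 frames, 12-word captions).
model = STraNetSketch()
logits, attrs, score = model(torch.randn(2, 16, 2048),
                             torch.rand(2, 16, 4),
                             torch.randint(0, 10000, (2, 12)))
```

In an adversarial training loop, the caption cross-entropy loss on `logits` would be combined with a discriminator loss on `score`, following the general GAN-style setup the abstract describes.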

Keywords