IEEE Access (Jan 2023)

Problematic Unordered Queries in Temporal Moment Measurement by Using Natural Language

  • Hafiza Sadia Nawaz,
  • Junyu Dong

DOI
https://doi.org/10.1109/ACCESS.2023.3264443
Journal volume & issue
Vol. 11
pp. 37976 – 37986

Abstract

This study examines the difficulty of measuring temporal moments by using natural language (TMMNL) in untrimmed videos. The purpose of TMMNL is to use a natural language query to locate a specific moment within a lengthy video. It is a challenging task, since other closely related activities may divert attention from the target temporal moment. Existing research has addressed this issue with computer vision techniques such as reinforcement-based, anchor-based, and ranking-based methods. In this research, we not only propose a TMMNL solution and show how to use a natural language query to find the required moment, but also identify a novel issue: if the given natural language query is unordered (without a proper subject, verb, and object), the system has trouble understanding it and the network may perform poorly. Previous methods perform worse when a query is unordered and cannot retrieve the relevant moment, resulting in overall performance degradation. We introduce the novel concept of the-visual, the-action, the-object, and the-connecting words to address the problem of unordered queries in TMMNL. Our proposed network, Graph Convolutions with Latent Variable for Visual-Textual Network (GCL-VTN), has three components: 1) visual graph convolution (visual GC); 2) textual graph convolution (textual GC); and 3) a compatible method for learning embeddings (CMLE). Visual nodes in the visual GC capture regional attributes, object, and actor information, while textual nodes in the textual GC maintain the word sequence using grammar-based query rules. The CMLE integrates the different modalities (moment, query) and the trained grammar-based words into the same embedding space. To align and preserve the query sequence, we also incorporate a stochastic latent variable in the CMLE that has prior and posterior distributions. The posterior distribution deals with both visual and textual data and is used when the query is in the correct sequence, i.e., follows grammar rules; the prior distribution deals only with textual data and is effective when the query is unordered, i.e., does not follow grammar rules. Our GCL-VTN surpasses state-of-the-art methods on the TACoS, Charades-STA, and ActivityNet-Captions benchmarks.
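The latent-variable idea in the CMLE can be pictured with a minimal sketch. The snippet below is not the authors' implementation; it assumes a Gaussian latent variable, PyTorch, and hypothetical module names and dimensions. It only illustrates the abstract's description: the posterior conditions on fused visual-textual features, the prior conditions on textual features alone, and a KL term keeps the two aligned so the text-only prior can stand in when the query is unordered.

```python
# Hypothetical sketch (assumed names/dims, not the paper's code): a CMLE-style module
# with a stochastic latent variable whose posterior uses visual+textual features and
# whose prior uses textual features only.
import torch
import torch.nn as nn

class LatentCMLE(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=300, embed_dim=256, latent_dim=64):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, embed_dim)   # moment (visual) embedding
        self.txt_proj = nn.Linear(txt_dim, embed_dim)   # query (textual) embedding
        # posterior q(z | visual, textual): used when the query follows grammar rules
        self.post_net = nn.Linear(2 * embed_dim, 2 * latent_dim)
        # prior p(z | textual): used when the query is unordered
        self.prior_net = nn.Linear(embed_dim, 2 * latent_dim)

    @staticmethod
    def _sample(params):
        mu, logvar = params.chunk(2, dim=-1)
        eps = torch.randn_like(mu)
        return mu + eps * torch.exp(0.5 * logvar), mu, logvar  # reparameterization

    def forward(self, vis_feat, txt_feat, ordered_query=True):
        v = self.vis_proj(vis_feat)
        t = self.txt_proj(txt_feat)
        prior_params = self.prior_net(t)
        post_params = self.post_net(torch.cat([v, t], dim=-1))
        # choose the distribution according to whether the query is grammatically ordered
        z, mu, logvar = self._sample(post_params if ordered_query else prior_params)
        # KL(posterior || prior) aligns the text-only prior with the visual-textual posterior
        p_mu, p_logvar = prior_params.chunk(2, dim=-1)
        kl = 0.5 * (p_logvar - logvar
                    + (logvar.exp() + (mu - p_mu) ** 2) / p_logvar.exp() - 1).sum(-1)
        return z, v, t, kl.mean()

# Usage example: score candidate moments against queries in the shared embedding space.
model = LatentCMLE()
vis = torch.randn(4, 512)   # 4 candidate moment features
txt = torch.randn(4, 300)   # 4 query features
z, v, t, kl = model(vis, txt, ordered_query=False)   # unordered query -> prior path
score = torch.cosine_similarity(v + z, t + z, dim=-1)
```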

Keywords