IEEE Access (Jan 2024)

Online Hierarchical Linking of Action Tubes for Spatio-Temporal Action Detection Based on Multiple Clues

  • Shaowen Su,
  • Yan Zhang

DOI
https://doi.org/10.1109/ACCESS.2024.3388532
Journal volume & issue
Vol. 12
pp. 54661 – 54672

Abstract

The spatio-temporal action detection task requires a model to output the temporal and spatial positions, together with the action category, of each target action instance in the form of action tubes. However, the current definitions of video-level metrics for this task are not sufficiently clear or unified to fully describe a network model's spatio-temporal detection ability. Furthermore, existing tube-linking methods depend heavily on the quality of the detection stage and lack reliable linking criteria, which results in poor tube-linking performance. To address these issues, this study proposes a hierarchical linking method based on multiple clues. The method first dynamically uses various correlation clues at two levels, including appearance features, spatial overlap, motion prediction, category scores, tube length, and tube confidence status, to reduce the negative impact of unreliable information on the correlation. It then employs inter-class correlation to handle the mutual influence between different categories, followed by joint probabilistic data association to address the mutual influence between correlated objects, ultimately achieving robust and accurate online linking of action tubes. The method is compared experimentally with other correlation methods on the untrimmed UCF24 and MultiSports datasets and demonstrates state-of-the-art tube-linking performance. Ablation experiments are also conducted to explore the impact of the different modules and stages in the proposed tube-linking method.
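
To make the idea of two-level, multi-clue linking more concrete, the following Python sketch shows one possible simplified reading of it. It is not the authors' algorithm: the abstract does not give the scoring rule, thresholds, or matching procedure, so the greedy matcher, the weight `w_app`, the thresholds `high_score`, `thr1`, and `thr2`, and the function names are all hypothetical. The sketch uses only two of the listed clues (appearance similarity and spatial overlap) and omits motion prediction, tube length, tube confidence status, inter-class correlation, and joint probabilistic data association.

```python
from math import sqrt

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def cosine(u, v):
    """Cosine similarity of two appearance feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu, nv = sqrt(sum(x * x for x in u)), sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def greedy_match(tube_ids, det_ids, score, threshold):
    """One-to-one greedy matching in descending score order (assumed matcher)."""
    cands = sorted(((score(t, d), t, d) for t in tube_ids for d in det_ids),
                   reverse=True)
    used_t, used_d, matches = set(), set(), []
    for s, t, d in cands:
        if s < threshold:
            break  # every remaining candidate scores lower still
        if t in used_t or d in used_d:
            continue
        used_t.add(t)
        used_d.add(d)
        matches.append((t, d))
    return matches

def link_frame(tubes, dets, high_score=0.5, w_app=0.5, thr1=0.4, thr2=0.3):
    """Link one frame of detections to existing tubes in two levels.

    tubes: list of {"boxes": [box, ...], "feat": [float, ...]}
    dets:  list of {"box": box, "feat": [float, ...], "score": float}
    All default parameter values are illustrative assumptions.
    """
    high = [i for i, d in enumerate(dets) if d["score"] >= high_score]
    low = [i for i, d in enumerate(dets) if d["score"] < high_score]

    def overlap(t, d):
        return iou(tubes[t]["boxes"][-1], dets[d]["box"])

    def appearance_and_overlap(t, d):
        app = cosine(tubes[t]["feat"], dets[d]["feat"])
        return w_app * app + (1.0 - w_app) * overlap(t, d)

    # Level 1: confident detections, matched on appearance plus overlap.
    level1 = greedy_match(range(len(tubes)), high, appearance_and_overlap, thr1)
    taken = {t for t, _ in level1}
    free = [t for t in range(len(tubes)) if t not in taken]

    # Level 2: low-score detections, matched on overlap only, because their
    # appearance features are treated here as unreliable.
    level2 = greedy_match(free, low, overlap, thr2)

    for t, d in level1:
        tubes[t]["boxes"].append(dets[d]["box"])
        tubes[t]["feat"] = dets[d]["feat"]  # refresh appearance from a reliable match
    for t, d in level2:
        tubes[t]["boxes"].append(dets[d]["box"])

    # Unmatched confident detections start new tubes.
    matched = {d for _, d in level1}
    for d in high:
        if d not in matched:
            tubes.append({"boxes": [dets[d]["box"]], "feat": dets[d]["feat"]})
    return tubes
```

Calling `link_frame` once per frame, starting from an empty list of tubes, grows the tubes online. The point of the second level in this sketch is that a low-score detection can still extend an existing tube when it overlaps well with the tube's last box, without being allowed to start a new tube of its own.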

Keywords