IEEE Access (Jan 2024)
Action Progression Networks for Temporal Action Detection in Videos
Abstract
This study introduces a lightweight, end-to-end trainable Temporal Action Detection (TAD) model that delivers competitive performance. Traditional TAD approaches often rely on frozen pre-trained models for feature extraction, trading end-to-end training for efficiency, and consequently suffer from task misalignment and data distribution shifts. Our method addresses these challenges by processing untrimmed videos on a snippet basis, enabling a snippet-level TAD model that is trained end-to-end. Central to our approach is a novel frame-level label, termed “action progressions,” designed to encode temporal localization information. Predicting action progressions not only allows our snippet-level model to incorporate temporal information effectively but also provides a fine-grained temporal encoding of how actions evolve, improving detection precision. Beyond a streamlined pipeline, our model offers several novel capabilities: 1) it learns directly from raw videos, unlike prevalent TAD methods that depend on frozen, pre-trained feature extractors; 2) it can be trained with both trimmed and untrimmed videos; 3) it is the first TAD model to avoid detecting incomplete actions; and 4) it accurately detects long-lasting actions and those with clear evolutionary patterns. Leveraging these advantages, our model achieves strong performance on benchmark datasets, with average mean Average Precision (mAP) scores of 54.8%, 30.5%, and 78.7% on THUMOS14, ActivityNet-1.3, and DFMAD, respectively.
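To make the frame-level label concrete, the minimal sketch below builds action-progression targets for one untrimmed video, assuming each frame inside an action instance is labeled with its normalized temporal position within that instance and background frames carry a progression of zero. This is a plausible reading of the encoding described above, not the paper's exact definition; the function name and background convention are our own.

import numpy as np

def action_progression_labels(num_frames, instances):
    """Build frame-level action-progression targets for one untrimmed video.

    instances: list of (start_frame, end_frame, class_id) annotations.
    Returns per-frame progression in [0, 1] (0 outside any action) and the
    per-frame class id (-1 for background).
    """
    progression = np.zeros(num_frames, dtype=np.float32)
    class_ids = np.full(num_frames, -1, dtype=np.int64)
    for start, end, cls in instances:
        t = np.arange(start, end + 1)
        # Linear progression: 0 at the action's start, 1 at its end
        # (hypothetical encoding; the paper may use a different scheme).
        progression[t] = (t - start) / max(end - start, 1)
        class_ids[t] = cls
    return progression, class_ids

# Example: a 100-frame video with one action of class 3 spanning frames 20..59.
prog, cls = action_progression_labels(100, [(20, 59, 3)])
assert prog[20] == 0.0 and abs(prog[59] - 1.0) < 1e-6

Under this reading, a complete detection is one whose predicted progressions span the full 0-to-1 range, which suggests how predictions covering only part of that range could be rejected as incomplete actions.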
Keywords