Variable Temporal Length Training for Action Recognition CNNs

Tan-Kun Li; Kwok-Leung Chan; Tardi Tjahjadi

doi:10.3390/s24113403

Sensors (May 2024)

Affiliations

Tan-Kun Li: Department of Electrical Engineering, City University of Hong Kong, Hong Kong, China
Kwok-Leung Chan: Department of Electrical Engineering, City University of Hong Kong, Hong Kong, China
Tardi Tjahjadi: School of Engineering, University of Warwick, Gibbet Hill Road, Coventry CV4 7AL, UK

DOI: https://doi.org/10.3390/s24113403
Journal volume & issue: Vol. 24, no. 11, p. 3403

Abstract

Most current deep learning models are inflexible with respect to input shape. Computer vision models usually work well only on the single fixed shape used during training; on other shapes, their performance degrades significantly. For video-related tasks, the length of each video (i.e., the number of video frames) can vary widely; therefore, video frames are sampled so that every video has the same temporal length. This training method has drawbacks in both the training and testing phases. For instance, a universal temporal length can damage the features in longer videos and prevent the model from flexibly adapting to variable lengths for on-demand inference. To address this, we propose variable length training (VLT), a simple yet effective training paradigm for 3D convolutional neural networks (3D-CNNs) that enables them to process video inputs of variable temporal length. Compared with the standard video training paradigm, our method introduces three extra operations during training: sampling twice, temporal packing, and subvideo-independent 3D convolution. These operations are efficient and can be integrated into any 3D-CNN. In addition, we introduce a consistency loss to regularize the representation space. After training, the model can process videos of varying temporal length in the inference phase without any modification. Our experiments on various popular action recognition datasets demonstrate the superior performance of the proposed method compared with the conventional training paradigm and other state-of-the-art training paradigms.
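
As a rough illustration of the frame sampling mentioned in the abstract, the following minimal Python sketch picks evenly spaced frame indices for a target temporal length, then samples the same video at two lengths and concatenates the results along the time axis. The abstract does not specify how "sampling twice" or "temporal packing" are implemented, so the function names (`uniform_sample_indices`, `sample_twice_and_pack`), the segment-centre sampling rule, the example lengths, and the boundary list are all assumptions made for illustration, not the authors' method.

```python
def uniform_sample_indices(num_frames, length):
    """Return `length` frame indices spread evenly over `num_frames` frames.

    Assumption: one index is taken at the centre of each of `length`
    equal temporal segments. Frames repeat if the video is shorter
    than the requested length.
    """
    if num_frames <= 0 or length <= 0:
        raise ValueError("num_frames and length must be positive")
    # Integer form of floor((i + 0.5) * num_frames / length).
    return [((2 * i + 1) * num_frames) // (2 * length) for i in range(length)]


def sample_twice_and_pack(num_frames, lengths=(8, 16)):
    """Hypothetical sketch: sample one video at two temporal lengths and
    concatenate the index lists along the time axis.

    Returns the packed indices and the subvideo boundaries, which a
    subvideo-aware operation could use to keep the two samples apart.
    """
    packed = []
    boundaries = [0]
    for length in lengths:
        packed.extend(uniform_sample_indices(num_frames, length))
        boundaries.append(len(packed))
    return packed, boundaries


if __name__ == "__main__":
    indices, bounds = sample_twice_and_pack(num_frames=100, lengths=(4, 8))
    print(indices)
    print(bounds)
```

The boundary list is kept alongside the packed indices because any operation meant to treat the two subvideos independently would need to know where one ends and the next begins; how the paper actually handles this is not stated in the abstract.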

Published in Sensors

ISSN: 1424-8220 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Chemical technology
Website: http://www.mdpi.com/journal/sensors
