IET Computer Vision (Dec 2022)

A robust and efficient method for skeleton‐based human action recognition and its application for cross‐dataset evaluation

  • Tien‐Thanh Nguyen,
  • Dinh‐Tan Pham,
  • Hai Vu,
  • Thi‐Lan Le

DOI
https://doi.org/10.1049/cvi2.12119
Journal volume & issue
Vol. 16, no. 8
pp. 709 – 726

Abstract

Skeleton‐based human action recognition has emerged recently thanks to its compactness and robustness to appearance variations. Although impressive results have been obtained in recent years, skeleton‐based action recognition methods must still improve before they can be deployed in real‐time applications. Recently, a lightweight network named Double‐feature Double‐motion Network (DD‐Net) was proposed for skeleton‐based human action recognition. Running at high speed, the DD‐Net achieves state‐of‐the‐art performance on hand and body actions. However, the DD‐Net is suited to actions that strongly correlate with the global trajectories; it cannot distinguish actions that are only weakly connected to them. In this paper, the authors propose TD‐Net, an improved version of the DD‐Net with an additional branch. The new branch takes the normalised coordinates of joints (NCJ) as input to enrich the spatial information. On five datasets for skeleton‐based human activity recognition (MSR‐Action3D, CMDFall, JHMDB, FPHAB, and NTU RGB+D), the TD‐Net consistently outperforms the baseline DD‐Net. The proposed method also outperforms state‐of‐the‐art methods, both hand‐designed and deep learning‐based, on four of these datasets (MSR‐Action3D, CMDFall, JHMDB, and FPHAB). Furthermore, the generalisation ability of the proposed method is confirmed through cross‐dataset evaluation. To illustrate the model's potential for real‐time human action recognition, the authors deployed an application on an edge device. The experimental results show that the application processes up to 40 fps when using MediaPipe for pose estimation, and that recognising an action from a skeleton sequence takes only 0.04 ms.