Cascading Pose Features with CNN-LSTM for Multiview Human Action Recognition

Najeeb ur Rehman Malik; Syed Abdul Rahman Abu-Bakar; Usman Ullah Sheikh; Asma Channa; Nirvana Popescu

doi:10.3390/signals4010002

Signals (Jan 2023)

Affiliations

Najeeb ur Rehman Malik: Computer Vision, Video and Image Processing Lab, ECE Department, Universiti Teknologi Malaysia, Johor Bahru 81310, Malaysia
Syed Abdul Rahman Abu-Bakar: Computer Vision, Video and Image Processing Lab, ECE Department, Universiti Teknologi Malaysia, Johor Bahru 81310, Malaysia
Usman Ullah Sheikh: Computer Vision, Video and Image Processing Lab, ECE Department, Universiti Teknologi Malaysia, Johor Bahru 81310, Malaysia
Asma Channa: Computer Science Department, University POLITEHNICA of Bucharest, 060042 Bucharest, Romania
Nirvana Popescu: Computer Science Department, University POLITEHNICA of Bucharest, 060042 Bucharest, Romania

DOI: https://doi.org/10.3390/signals4010002
Journal volume & issue: Vol. 4, no. 1, pp. 40–55

Abstract

Human Action Recognition (HAR) is a branch of computer vision that deals with identifying human actions at several levels: low level, action level, and interaction level. Many earlier HAR algorithms rely on handcrafted methods for action recognition. However, handcrafted techniques are ineffective at recognizing interaction-level actions, which involve complex scenarios. Traditional deep learning-based approaches, meanwhile, take the entire image as input and then extract large volumes of features, which greatly increases system complexity and results in significantly higher computation time and resource utilization. This research therefore focuses on developing an efficient deep learning-based system for multi-view, interaction-level action recognition that uses 2D skeleton data to achieve higher accuracy while reducing computational complexity. The proposed system extracts 2D skeleton data from the dataset using the OpenPose technique. The extracted 2D skeleton features are then given directly as input to a Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM) architecture for action recognition. To reduce complexity, only the extracted features, rather than the whole image, are passed to the CNN-LSTM architecture, thus eliminating the need for feature extraction from the raw image. The proposed method was compared with existing methods, and the outcomes confirm its potential. The proposed OpenPose-CNNLSTM achieved an accuracy of 94.4% on MCAD (Multi-Camera Action Dataset) and 91.67% on IXMAS (INRIA Xmas Motion Acquisition Sequences). The proposed method also significantly decreases computational complexity by reducing the number of input features to 50.
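
As a rough illustration of how 50 input features per frame could arise, the sketch below assumes OpenPose's BODY_25 output, where each detected person has a flat `pose_keypoints_2d` list of 25 keypoints stored as (x, y, confidence) triples. Dropping the confidence values leaves 25 × 2 = 50 coordinates. The abstract does not state the exact feature layout, so this mapping, the helper names `frame_to_features` and `make_windows`, and the window and stride parameters are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence

NUM_KEYPOINTS = 25                       # assumption: OpenPose BODY_25 layout
FEATURES_PER_FRAME = NUM_KEYPOINTS * 2   # 50 values: (x, y) per keypoint


def frame_to_features(pose_keypoints_2d: Sequence[float]) -> List[float]:
    """Turn a flat [x0, y0, c0, x1, y1, c1, ...] list into [x0, y0, x1, y1, ...].

    The confidence value of each keypoint is discarded (an assumption;
    the abstract only says that 50 input features are used).
    """
    if len(pose_keypoints_2d) != NUM_KEYPOINTS * 3:
        raise ValueError("expected 25 keypoints with (x, y, confidence) each")
    features: List[float] = []
    for i in range(NUM_KEYPOINTS):
        x = pose_keypoints_2d[3 * i]
        y = pose_keypoints_2d[3 * i + 1]
        features.extend((x, y))
    return features


def make_windows(frames: Sequence[List[float]], window: int, stride: int) -> List[List[List[float]]]:
    """Slice per-frame feature vectors into fixed-length, possibly overlapping sequences.

    Each returned window has shape (window, 50) and is the kind of input
    a sequence model such as a CNN-LSTM could consume.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = len(frames) - window
    return [list(frames[s:s + window]) for s in range(0, last_start + 1, stride)]
```

Each window is a short sequence of 50-dimensional skeleton vectors. Feeding such sequences to the classifier, instead of full image frames, is the source of the input-size reduction the abstract describes.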

Published in Signals

ISSN: 2624-6120 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Applied mathematics. Quantitative methods
Website: https://www.mdpi.com/journal/signals
