Hand Activity Recognition From Automatic Estimated Egocentric Skeletons Combining Slow Fast and Graphical Neural Networks

Viet-Duc Le; Van-Nam Hoang; Tien-Thanh Nguyen; Van-Hung Le; Thanh-Hai Tran; Hai Vu; Thi-Lan Le

doi:10.1142/S219688882250035X

Vietnam Journal of Computer Science (Feb 2023)

Affiliations

Viet-Duc Le: School of Electrical and Electronic Engineering (SEEE), Hanoi University of Science and Technology, Hanoi, Vietnam
Van-Nam Hoang: MICA International Research Institute, Hanoi University of Science and Technology, Hanoi, Vietnam
Tien-Thanh Nguyen: School of Electrical and Electronic Engineering (SEEE), Hanoi University of Science and Technology, Hanoi, Vietnam
Van-Hung Le: Tan Trao University, Tuyen Quang, Vietnam
Thanh-Hai Tran: School of Electrical and Electronic Engineering (SEEE), Hanoi University of Science and Technology, Hanoi, Vietnam
Hai Vu: School of Electrical and Electronic Engineering (SEEE), Hanoi University of Science and Technology, Hanoi, Vietnam
Thi-Lan Le: School of Electrical and Electronic Engineering (SEEE), Hanoi University of Science and Technology, Hanoi, Vietnam

DOI: https://doi.org/10.1142/S219688882250035X
Journal volume & issue: Vol. 10, no. 01
pp. 75 – 100

Abstract

In this paper, we present a unified framework for understanding hand actions from first-person video. The proposed framework comprises two main components. The first component estimates three-dimensional (3D) hand joints from RGB images. We propose two network structures derived from the baseline HopeNet network: a traditional multi-layer convolutional neural network (CNN), and a CNN combined with a GraphCNN. Both perform 3D hand pose estimation without the GraphUNet used in the baseline HopeNet method. The second component of the framework recognizes hand actions from the skeleton stream. We first deploy two recent advanced neural networks, PA-ResGCN and the Double-feature Double-motion Network (DDNet). To focus more on changes in hand pose, we improve DDNet with two normalization strategies for the hand joints. Finally, we fuse PA-ResGCN with our improved DDNet to further boost recognition performance. We evaluate the proposed methods on the First-Person Hand Action Benchmark dataset. Experiments show that our model for 3D hand joint estimation achieves the best precision (36.6 mm). Our hand joint normalization strategies improve the accuracy of the original DDNet by 0.71% to 4.05% with the ground-truth hand pose, and the improvement is significantly larger (2.96% to 10.98%) with the estimated hand pose. The late fusion schemes outperform several state-of-the-art methods for hand action recognition, reaching a highest accuracy of 86.67%. These experimental results suggest that the approach is promising and extensible for developing practical first-person vision applications.
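
The abstract does not state which late fusion rule is used, so the following is only a minimal sketch of one common option, score-level fusion by weighted averaging of per-class probabilities from two recognizers. The function names, the weight `weight_a`, and the example logits are all hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion(scores_a, scores_b, weight_a=0.5):
    """Weighted average of two per-class probability vectors.

    weight_a is a hypothetical mixing weight for the first model;
    the second model receives 1 - weight_a.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(scores_a, scores_b)]

def predict(fused):
    """Return the index of the class with the highest fused score."""
    return max(range(len(fused)), key=fused.__getitem__)

# Made-up logits for three action classes, standing in for the outputs
# of a graph-based recognizer and a DDNet-style recognizer.
gcn_probs = softmax([2.0, 1.0, 0.1])
ddnet_probs = softmax([0.5, 2.5, 0.2])

fused = late_fusion(gcn_probs, ddnet_probs)
label = predict(fused)
```

With equal weights, the fused vector is still a valid probability distribution, and the predicted class is the one that the two models jointly favor most. Other late fusion rules (for example, taking the product or the maximum of the scores) fit the same interface.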

Published in Vietnam Journal of Computer Science

ISSN: 2196-8888 (Print); 2196-8896 (Online)
Publisher: World Scientific Publishing
Country of publisher: Singapore
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.worldscientific.com/worldscinet/vjcs
