Head and Hands Tunneling Pipeline for Enhancing Sign Language Recognition

Ganzorig Batnasan; Munkh-Erdene Otgonbold; Qurban Ali Memon; Timothy K. Shih; Munkhjargal Gochoo

doi:10.1109/access.2025.3591123

IEEE Access (Jan 2025)

Affiliations

Ganzorig Batnasan: Department of Computer Science and Software Engineering, UAEU, Al Ain, United Arab Emirates
Munkh-Erdene Otgonbold: Department of Computer Science and Software Engineering, UAEU, Al Ain, United Arab Emirates
Qurban Ali Memon: Department of Electrical and Communication Engineering, UAEU, Al Ain, United Arab Emirates
Timothy K. Shih: College of EECS, National Central University, Taoyuan, Taiwan
Munkhjargal Gochoo: Department of Computer Science and Software Engineering, UAEU, Al Ain, United Arab Emirates

DOI: https://doi.org/10.1109/access.2025.3591123
Journal volume & issue: Vol. 13
pp. 127926 – 127940

Abstract

Sign Language Recognition (SLR) is a challenging fine-grained video classification task that must be invariant to scene and subject, because meaning is conveyed primarily through hand gestures and facial expressions. Vision foundation models such as Vision Transformers (ViTs), when trained on general human action recognition datasets, often struggle to capture the nuanced features of signs. We highlight two main challenges: 1) the loss of critical spatial features in the head and hand regions caused by video downscaling during preprocessing, and 2) the lack of sufficient domain-specific knowledge of sign gestures in ViTs. To address these, we propose a pipeline comprising our Head & Hands Tunneling (H&HT) preprocessor and a 32-frame ViT classifier pre-trained on domain-specific data. The H&HT preprocessor, which incorporates the MediaPipe pose predictor, maximizes the preservation of critical spatial details from the signer’s head and hands in raw sign language videos. When the ViT model is pre-trained on a large-scale, domain-specific SLR dataset, the two parts complement each other. As a result, the 32-frame H&HT pipeline achieves a Top-1 accuracy of 62.82% on the WLASL2000 benchmark, surpassing other 32-frame models and ranking second among the 64-frame models. We also provide benchmarking results on the ASL-Citizen dataset and two revised versions of the WLASL2000 dataset. All weights and code are publicly available.
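
The first challenge named in the abstract, detail lost when a whole frame is downscaled, can be illustrated with a small sketch. This is not the authors' H&HT implementation, which is not described on this page. It assumes normalized (x, y) keypoints of the kind a pose predictor such as MediaPipe returns; the helper names `crop_box` and `downscaled_side`, the margin value, and the 224-pixel target size are hypothetical choices made only for illustration.

```python
from typing import Iterable, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def crop_box(points: Iterable[Tuple[float, float]], width: int, height: int,
             margin: float = 0.5) -> Box:
    """Square pixel box around normalized (x, y) keypoints, clamped to the frame.

    Hypothetical helper: the margin and the square shape are assumptions.
    """
    pts = list(points)
    xs = [x * width for x, _ in pts]
    ys = [y * height for _, y in pts]
    cx = (min(xs) + max(xs)) / 2
    cy = (min(ys) + max(ys)) / 2
    # Use the larger extent so the box is square, then pad it by `margin`.
    half = max(max(xs) - min(xs), max(ys) - min(ys)) * (1 + margin) / 2
    left = max(0, int(round(cx - half)))
    top = max(0, int(round(cy - half)))
    right = min(width, int(round(cx + half)))
    bottom = min(height, int(round(cy + half)))
    return left, top, right, bottom


def downscaled_side(box: Box, frame_side: int, target: int = 224) -> float:
    """Width in pixels the box would span if the frame side were resized to `target`."""
    return (box[2] - box[0]) * target / frame_side


# Two made-up hand keypoints in a 1920x1080 frame.
hand = crop_box([(0.50, 0.50), (0.55, 0.60)], width=1920, height=1080)
full_res_side = hand[2] - hand[0]
after_downscale = downscaled_side(hand, frame_side=1080)
```

In this made-up case the hand region spans 162 pixels at full resolution but only about 34 pixels once the frame's 1080-pixel side is resized to 224. Cropping such a region before resizing is one way to retain more of its detail.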

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
