A Comprehensive Benchmark and Evaluation of Thai Finger Spelling in Multi-Modal Deep Learning Models

Wuttichai Vijitkunsawat; Teeradaj Racharak

doi:10.1109/ACCESS.2024.3486729

IEEE Access (Jan 2024)

Affiliations

Wuttichai Vijitkunsawat: Department of Electronics and Telecommunication Engineering, Rajamangala University of Technology Krungthep, Bangkok, Thailand
Teeradaj Racharak: School of Information Science, Japan Advanced Institute of Science and Technology, Nomi, Ishikawa, Japan

Journal volume & issue: Vol. 12, pp. 158079–158093

Abstract

Sign Language Recognition (SLR) is an intricate and demanding area within computer vision that requires advanced models for accurate interpretation. This research presents a comprehensive analysis and evaluation of the newly benchmarked Thai Finger Spelling (TFS) dataset through seven main experiments. It uses both RGB-based input modalities (evaluated by CNN-LSTM, VGG-LSTM, I3D, Fusion-3, MEMP, DeepSign-CNN, and ChatGPT4) and pose-based input modalities (assessed by Pose-GRU, Pose-TGCN, SPOTER, Bi-RNN, and FNN-LSTM) across one-hand and two-hand poses, covering 90 standard letters. Findings from the one-handed experiments show that models employing pose-based input modalities substantially outperform those using RGB-based modalities for TFS. Indeed, the pose-based models achieve scores higher than 95% in in-sample testing and 66% in out-of-sample testing. The pose-based models show strong resilience to environmental factors such as lighting, background, and clothing, which often affect the performance of RGB-based models. This robustness enhances the effectiveness of pose-based systems in diverse settings, improving the accuracy of sign language interpretation and expanding the applicability of SLR technologies in various contexts. However, scenarios involving two-hand poses add complexity, challenging both RGB-based and pose-based modalities in accurately tracking and distinguishing interactions between two hands, particularly during rapid or overlapping movements. These challenges can lead to occlusions in RGB-based systems and difficulties in mapping spatial relationships in pose-based systems. As a result, out-of-sample performance decreases significantly, to below 50%, for both static-point-on-hand and total two-hand poses. This benchmark research offers comprehensive insights into TFS and guides the development of state-of-the-art models for TFS.

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
