IEEE Access (Jan 2024)
A Comprehensive Benchmark and Evaluation of Thai Finger Spelling in Multi-Modal Deep Learning Models
Abstract
Sign Language Recognition (SLR) is an intricate and demanding area of computer vision that requires advanced models for accurate interpretation. This research presents a comprehensive analysis and evaluation of the newly benchmarked Thai Finger Spelling (TFS) dataset through seven main experiments. The experiments use both RGB-based input modalities (evaluated with CNN-LSTM, VGG-LSTM, I3D, Fusion-3, MEMP, DeepSign-CNN, and ChatGPT4) and pose-based input modalities (evaluated with Pose-GRU, Pose-TGCN, SPOTER, Bi-RNN, and FNN-LSTM) across one-hand and two-hand poses, covering 90 standard letters. Findings from the one-hand experiments show that models using pose-based input substantially outperform those using RGB-based input for TFS: the pose-based models achieve scores higher than 95% in in-sample testing and 66% in out-of-sample testing. The pose-based models are also resilient to environmental factors such as lighting, background, and clothing, which often degrade the performance of RGB-based models. This robustness makes pose-based systems more effective in diverse settings, improving the accuracy of sign language interpretation and broadening the applicability of SLR technologies. However, two-hand poses add complexity, challenging both RGB-based and pose-based modalities to track and distinguish interactions between the two hands accurately, particularly during rapid or overlapping movements. These challenges can lead to occlusions in RGB-based systems and to difficulties in mapping spatial relationships in pose-based systems. As a result, out-of-sample performance drops significantly, to below 50%, for both static-point-on-hand and total two-hand poses. This benchmark offers comprehensive insights into TFS and guides the development of state-of-the-art models for TFS.
Keywords