Sign language recognition using the fusion of image and hand landmarks through multi-headed convolutional neural network

Refat Khan Pathan; Munmun Biswas; Suraiya Yasmin; Mayeen Uddin Khandaker; Mohammad Salman; Ahmed A. F. Youssef

doi:10.1038/s41598-023-43852-x

Scientific Reports (Oct 2023)

Affiliations

Refat Khan Pathan: Department of Computing and Information Systems, School of Engineering and Technology, Sunway University
Munmun Biswas: Department of Computer Science and Engineering, BGC Trust University Bangladesh
Suraiya Yasmin: Department of Computer and Information Science, Graduate School of Engineering, Tokyo University of Agriculture and Technology
Mayeen Uddin Khandaker: Centre for Applied Physics and Radiation Technologies, School of Engineering and Technology, Sunway University
Mohammad Salman: College of Engineering and Technology, American University of the Middle East
Ahmed A. F. Youssef: College of Engineering and Technology, American University of the Middle East

DOI: https://doi.org/10.1038/s41598-023-43852-x
Journal volume & issue: Vol. 13, no. 1
pp. 1 – 11

Abstract

Sign language recognition is a breakthrough for communication within the deaf-mute community and has been a critical research topic for years. Although some previous studies have successfully recognized sign language, they require costly instrumentation, including sensors, dedicated devices, and high-end processing power. Such drawbacks can be overcome with artificial intelligence-based techniques. Because capturing video or images with a camera is much easier in the current era of advanced mobile technology, this study demonstrates a cost-effective technique for detecting American Sign Language (ASL) from an image dataset. The “Finger Spelling, A” dataset is used, covering 24 letters (all except j and z, whose signs involve motion). This dataset was chosen mainly because its images have complex backgrounds with varied environments and scene colors. Two layers of image processing are used: in the first, each image is processed as a whole for training; in the second, the hand landmarks are extracted. A multi-headed convolutional neural network (CNN) model is proposed to train on these two layers and is tested on 30% of the dataset. To avoid overfitting, data augmentation and dynamic learning rate reduction are used. The proposed model achieves 98.981% test accuracy. This study may help in developing an efficient human–machine communication system for the deaf-mute community.
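The abstract mentions dynamic learning rate reduction as one of the measures against overfitting but does not give the schedule. As a rough illustration only, the sketch below assumes a plateau-based rule: the learning rate is cut when the validation loss stops improving. The class name `PlateauLRReducer`, the helper `run_schedule`, and every default value (starting rate, factor, patience, floor) are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PlateauLRReducer:
    """Cut the learning rate when validation loss stops improving.

    All defaults are illustrative assumptions, not values from the paper.
    """
    lr: float = 1e-3        # assumed starting learning rate
    factor: float = 0.5     # assumed multiplicative cut applied on a plateau
    patience: int = 2       # assumed number of non-improving epochs tolerated
    min_lr: float = 1e-6    # assumed lower bound on the learning rate
    best: float = float("inf")
    wait: int = 0

    def step(self, val_loss: float) -> float:
        """Record one epoch's validation loss; return the rate for the next epoch."""
        if val_loss < self.best:
            # Improvement: remember it and reset the plateau counter.
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                # Plateau reached: shrink the rate, but never below min_lr.
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr


def run_schedule(losses):
    """Feed a sequence of validation losses through a fresh reducer."""
    reducer = PlateauLRReducer()
    return [reducer.step(loss) for loss in losses]


if __name__ == "__main__":
    # Made-up validation losses, used only to exercise the rule.
    for epoch, lr in enumerate(run_schedule([1.0, 0.8, 0.9, 0.85, 0.7, 0.75, 0.72, 0.71])):
        print(epoch, lr)
```

In a real training loop, the returned rate would be handed to the optimizer at the start of each epoch. Deep learning frameworks generally provide an equivalent built-in callback or scheduler, which would normally be preferable to a hand-written one.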

Published in Scientific Reports

ISSN: 2045-2322 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://www.nature.com/srep/
