IEEE Access (Jan 2022)

Using Lip Reading Recognition to Predict Daily Mandarin Conversation

  • Muhamad Amirul Haq,
  • Shanq-Jang Ruan,
  • Wen-Jie Cai,
  • Lieber Po-Hung Li

DOI
https://doi.org/10.1109/ACCESS.2022.3175867
Journal volume & issue
Vol. 10
pp. 53481–53489

Abstract

Audio-based automatic speech recognition as a hearing aid is susceptible to background noise and overlapping speech. Consequently, audio-visual speech recognition has been developed to complement the audio input with additional visual information. Moreover, rapid advances in neural networks for visual tasks have made it possible to build robust and reliable lip reading frameworks that recognize speech from visual input alone. In this work, we propose a lip reading recognition model to predict daily Mandarin conversation and collect a new Daily Mandarin Conversation Lip Reading (DMCLR) dataset, consisting of 1,000 videos of 100 daily conversations spoken by ten speakers. Our model consists of a spatiotemporal convolution layer, an SE-ResNet-18 network, and a back-end module composed of a bi-directional gated recurrent unit (Bi-GRU), a 1D convolution, and fully connected layers. The model reaches 94.2% accuracy on the DMCLR dataset, which makes practical Mandarin lip reading applications feasible. Additionally, it achieves 86.6% and 57.2% accuracy on the Lip Reading in the Wild (LRW) and LRW-1000 (Mandarin) datasets, respectively, representing state-of-the-art performance on these two challenging benchmarks.
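The pipeline named in the abstract (spatiotemporal convolution → SE-ResNet-18 → Bi-GRU → 1D convolution → fully connected layers) can be traced as a tensor-shape walk-through. A minimal sketch follows; every concrete size in it (frame count, crop size, kernel, stride, padding, hidden width) is an illustrative assumption, not a value taken from the paper:

```python
# Hypothetical shape walk-through of the described lip-reading stack.
# Kernel/stride/padding and all dimensions below are assumed for
# illustration; the paper does not specify them here.

def conv3d_out(t, h, w, kernel=(5, 7, 7), stride=(1, 2, 2), pad=(2, 3, 3)):
    """Output (T, H, W) of a 3D (spatiotemporal) convolution."""
    def o(n, k, s, p):
        return (n + 2 * p - k) // s + 1
    return (o(t, kernel[0], stride[0], pad[0]),
            o(h, kernel[1], stride[1], pad[1]),
            o(w, kernel[2], stride[2], pad[2]))

def pipeline_shapes(frames=29, height=88, width=88,
                    resnet_feat=512, gru_hidden=1024, num_classes=100):
    """Trace per-stage output shapes for one input clip."""
    shapes = {}
    t, h, w = conv3d_out(frames, height, width)
    shapes["spatiotemporal conv"] = (t, h, w)       # per-frame feature maps
    # The SE-ResNet-18 front-end pools each frame to one feature vector.
    shapes["SE-ResNet-18"] = (t, resnet_feat)
    # A bidirectional GRU concatenates forward and backward hidden states.
    shapes["Bi-GRU"] = (t, 2 * gru_hidden)
    # 1D convolution + fully connected layers map to the class scores
    # (100 daily conversations in the DMCLR dataset).
    shapes["FC output"] = (num_classes,)
    return shapes
```

With the assumed stride of (1, 2, 2), the front-end halves each spatial dimension while preserving the temporal length, so every video frame keeps a feature vector entering the Bi-GRU.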

Keywords