IEEE Access (Jan 2022)

Using Lip Reading Recognition to Predict Daily Mandarin Conversation

  • Muhamad Amirul Haq,
  • Shanq-Jang Ruan,
  • Wen-Jie Cai,
  • Lieber Po-Hung Li

DOI
https://doi.org/10.1109/ACCESS.2022.3175867
Journal volume & issue
Vol. 10
pp. 53481–53489

Abstract

Audio-based automatic speech recognition as a hearing aid is susceptible to background noise and overlapping speech. Consequently, audio-visual speech recognition has been developed to complement the audio input with additional visual information. Moreover, rapid advances in neural networks for visual tasks have made it possible to build robust and reliable lip reading frameworks that recognize speech from visual input alone. In this work, we propose a lip reading recognition model to predict daily Mandarin conversation and collect a new Daily Mandarin Conversation Lip Reading (DMCLR) dataset, consisting of 1,000 videos of 100 daily conversations spoken by ten speakers. Our model consists of a spatiotemporal convolution layer, an SE-ResNet-18 network, and a back-end module composed of a bi-directional gated recurrent unit (Bi-GRU), a 1D convolution, and fully connected layers. The model reaches 94.2% accuracy on the DMCLR dataset, which makes practical Mandarin lip reading applications feasible. Additionally, it achieves 86.6% and 57.2% accuracy on the Lip Reading in the Wild (LRW) and LRW-1000 (Mandarin) datasets, respectively, representing state-of-the-art performance on these two challenging benchmarks.
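The pipeline named in the abstract (spatiotemporal convolution → SE-ResNet-18 → Bi-GRU → 1D convolution → fully connected layers) can be traced as a tensor-shape walk-through. A minimal sketch follows; every concrete size in it (frame count, crop size, kernel, stride, padding, hidden width) is an illustrative assumption, not a value taken from the paper:

```python
# Hypothetical shape walk-through of the described lip-reading stack.
# Kernel/stride/padding and all dimensions below are assumed for
# illustration; the paper does not specify them here.

def conv3d_out(t, h, w, kernel=(5, 7, 7), stride=(1, 2, 2), pad=(2, 3, 3)):
    """Output (T, H, W) of a 3D (spatiotemporal) convolution."""
    def o(n, k, s, p):
        return (n + 2 * p - k) // s + 1
    return (o(t, kernel[0], stride[0], pad[0]),
            o(h, kernel[1], stride[1], pad[1]),
            o(w, kernel[2], stride[2], pad[2]))

def pipeline_shapes(frames=29, height=88, width=88,
                    resnet_feat=512, gru_hidden=1024, num_classes=100):
    """Trace per-stage output shapes for one input clip."""
    shapes = {}
    t, h, w = conv3d_out(frames, height, width)
    shapes["spatiotemporal conv"] = (t, h, w)       # per-frame feature maps
    # The SE-ResNet-18 front-end pools each frame to one feature vector.
    shapes["SE-ResNet-18"] = (t, resnet_feat)
    # A bidirectional GRU concatenates forward and backward hidden states.
    shapes["Bi-GRU"] = (t, 2 * gru_hidden)
    # 1D convolution + fully connected layers map to the class scores
    # (100 daily conversations in the DMCLR dataset).
    shapes["FC output"] = (num_classes,)
    return shapes
```

With the assumed stride of (1, 2, 2), the front-end halves each spatial dimension while preserving the temporal length, so every video frame keeps a feature vector entering the Bi-GRU.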

Keywords