Applied Sciences (Mar 2023)

An Investigation into Audio–Visual Speech Recognition under a Realistic Home–TV Scenario

  • Bing Yin,
  • Shutong Niu,
  • Haitao Tang,
  • Lei Sun,
  • Jun Du,
  • Zhenhua Ling,
  • Cong Liu

DOI
https://doi.org/10.3390/app13074100
Journal volume & issue
Vol. 13, no. 7
p. 4100

Abstract

Robust speech recognition in real-world situations is still an important problem, especially when it is affected by environmental interference and conversational multi-speaker interactions. Supplementing audio information with other modalities, as in audio–visual speech recognition (AVSR), is a promising direction for improving speech recognition. The end-to-end (E2E) framework can learn cross-modal information well; however, such models are not easy to train, especially when the amount of data is relatively small. In this paper, we focus on building an encoder–decoder-based end-to-end audio–visual speech recognition system for realistic scenarios. First, we discuss different pre-training methods, which provide various kinds of initialization for the AVSR framework. Second, we explore different model architectures and audio–visual fusion methods. Finally, we evaluate performance on the corpus from the first Multi-modal Information based Speech Processing (MISP) challenge, which was recorded in a real home TV room. Through system fusion, our final system achieves a 23.98% character error rate (CER), outperforming the champion system of the first MISP challenge (CER = 25.07%).
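The abstract mentions audio–visual fusion within an encoder–decoder framework without specifying a mechanism here. As a minimal sketch of one common option, the snippet below shows frame-level concatenation of audio and visual encoder outputs followed by a linear projection; this is an illustrative PyTorch example, not the authors' exact method, and the module name and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Illustrative fusion module: concatenate frame-aligned audio and
    visual encoder outputs, then project to a shared dimension.
    (Hypothetical sketch; the paper compares several fusion methods.)"""

    def __init__(self, audio_dim: int, video_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(audio_dim + video_dim, out_dim)

    def forward(self, audio_feats: torch.Tensor,
                video_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, T, audio_dim); video_feats: (batch, T, video_dim)
        # Both streams are assumed resampled to a common frame rate T.
        fused = torch.cat([audio_feats, video_feats], dim=-1)
        return self.proj(fused)

# Usage: 256-dim audio and 128-dim lip-region features fused to 256 dims.
fusion = ConcatFusion(audio_dim=256, video_dim=128, out_dim=256)
a = torch.randn(4, 100, 256)   # e.g., audio encoder output
v = torch.randn(4, 100, 128)   # e.g., visual encoder output
out = fusion(a, v)             # shape: (4, 100, 256)
```

Concatenation fusion is only one baseline; attention-based cross-modal fusion, which lets one stream query the other, is another option explored in this line of work.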

Keywords