Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki (May 2016)

ANALYSIS OF MULTIMODAL FUSION TECHNIQUES FOR AUDIO-VISUAL SPEECH RECOGNITION

  • D. V. Ivanko
  • I. S. Kipyatkova
  • A. L. Ronzhin
  • A. A. Karpov

DOI
https://doi.org/10.17586/2226-1494-2016-16-3-387-401
Journal volume & issue
Vol. 16, no. 3
pp. 387 – 401

Abstract

This paper presents an analytical review of recent achievements in audio-visual (AV) fusion (integration) of multimodal information. We discuss the main challenges and report on approaches that address them. One of the most important tasks in AV integration is to understand how the modalities interact and influence each other. The paper addresses this problem in the context of AV speech processing and speech recognition. In the first part of the review, we set out the basic principles of AV speech recognition and classify the audio and visual features of speech. Special attention is paid to systematizing existing techniques and AV data fusion methods. In the second part, based on our analysis of the research area, we provide a consolidated list of tasks and applications that use AV fusion, and indicate the methods, techniques, and audio and video features they use. We propose a classification of AV integration approaches and discuss their advantages and disadvantages. We draw conclusions and offer our assessment of future directions in AV fusion. In further research, we plan to implement a system for audio-visual recognition of continuous Russian speech using advanced multimodal fusion methods.

Keywords