Multimodal Corpus Design for Audio-Visual Speech Recognition in Vehicle Cabin

Alexey Kashevnik; Igor Lashkov; Alexandr Axyonov; Denis Ivanko; Dmitry Ryumin; Artem Kolchin; Alexey Karpov

doi:10.1109/ACCESS.2021.3062752

IEEE Access (Jan 2021)

Affiliations

Alexey Kashevnik: St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), Saint Petersburg, Russia
Igor Lashkov: St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), Saint Petersburg, Russia
Alexandr Axyonov: St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), Saint Petersburg, Russia
Denis Ivanko: St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), Saint Petersburg, Russia
Dmitry Ryumin: St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), Saint Petersburg, Russia
Artem Kolchin: Information Technologies and Programming Faculty, ITMO University, Saint Petersburg, Russia
Alexey Karpov: St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), Saint Petersburg, Russia

Journal volume & issue: Vol. 9
pp. 34986 – 35003

Abstract

This paper introduces a new methodology for creating in-the-wild multimodal corpora for audio-visual speech recognition in driver monitoring systems, designed with the driver's comfort in mind. The methodology is universal and can be used to record corpora for different languages. We analyze speech recognition systems and voice interfaces for driver monitoring systems that rely on both audio and video data. Multimodal speech recognition makes it possible to rely on audio data when video data are useless (e.g., at nighttime) and on video data in acoustically noisy conditions (e.g., on highways). Our methodology identifies the main steps and requirements for multimodal corpus design, including the development of a new framework for audio-visual corpus creation. We identify the main research questions related to the speech corpus creation task and discuss them in detail. We also consider the main use cases that require speech recognition in a vehicle cabin for interaction with a driver monitoring system, as well as another important use case in which the system detects a dangerous state of driver drowsiness and starts a question-answer game to prevent dangerous situations. Finally, based on the proposed methodology, we developed a mobile application for recording a corpus for the Russian language. Using this application, we created the RUSAVIC corpus, which is at the moment a unique audio-visual corpus for the Russian language recorded in in-the-wild conditions.

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
