Компьютерная оптика / Computer Optics (Dec 2022)

Method for visual analysis of driver's face for automatic lip-reading in the wild

  • A.A. Axyonov,
  • D.A. Ryumin,
  • A.M. Kashevnik,
  • D.V. Ivanko,
  • A.A. Karpov

DOI
https://doi.org/10.18287/2412-6179-CO-1092
Journal volume & issue
Vol. 46, no. 6
pp. 955–962

Abstract

The paper proposes a method of visual analysis for automatic speech recognition of a vehicle driver. Speech recognition in acoustically noisy conditions is one of the major challenges of artificial intelligence. The problem of effective automatic lip-reading in the vehicle environment has not yet been solved due to the presence of various kinds of interference (frequent turns of the driver's head, vibration, varying lighting conditions, etc.). In addition, the problem is aggravated by the lack of available databases on this topic. MediaPipe Face Mesh is used to find and extract the region of interest (ROI). We have developed an end-to-end neural network architecture for the analysis of visual speech. Visual features are extracted from a single image using a convolutional neural network (CNN) in conjunction with a fully connected layer. The extracted features are fed into a Long Short-Term Memory (LSTM) neural network. Because of the small amount of training data, we apply a transfer learning method. The experiments on visual analysis and speech recognition demonstrate great potential for solving the problem of automatic lip-reading. The experiments were performed on the in-house multi-speaker audio-visual dataset RUSAVIC. The maximum recognition accuracy over 62 commands is 64.09 %. The results can be used in various automatic speech recognition systems, especially in acoustically noisy road conditions (high speed, open windows or a sunroof, background music, poor noise insulation, etc.).
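The abstract outlines a concrete pipeline: MediaPipe Face Mesh locates the face, a mouth ROI is cropped, a CNN with a fully connected layer encodes each frame, and an LSTM classifies the spoken command. Below is a minimal sketch of the ROI-extraction step; the lip landmark indices and the padding factor are illustrative assumptions, not the exact values used by the authors.

```python
# Sketch of mouth-ROI extraction with MediaPipe Face Mesh.
# LIP_LANDMARKS and the padding factor are assumptions for illustration.
import cv2
import mediapipe as mp

LIP_LANDMARKS = [61, 291, 0, 17, 78, 308, 13, 14]  # assumed subset of lip points

def extract_mouth_roi(frame_bgr, pad=0.15):
    """Detect the face mesh and crop a padded bounding box around the lips."""
    h, w = frame_bgr.shape[:2]
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=True,
                                         max_num_faces=1) as mesh:
        result = mesh.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_face_landmarks:
        return None  # no face detected in this frame
    lm = result.multi_face_landmarks[0].landmark
    xs = [lm[i].x * w for i in LIP_LANDMARKS]
    ys = [lm[i].y * h for i in LIP_LANDMARKS]
    dx, dy = pad * (max(xs) - min(xs)), pad * (max(ys) - min(ys))
    x0, x1 = int(max(min(xs) - dx, 0)), int(min(max(xs) + dx, w))
    y0, y1 = int(max(min(ys) - dy, 0)), int(min(max(ys) + dy, h))
    return frame_bgr[y0:y1, x0:x1]
```

A matching sketch of the described CNN + fully connected + LSTM classifier is shown below in PyTorch. An ImageNet-pretrained ResNet-18 stands in for the paper's transfer-learning setup; the backbone choice, feature dimensions, and use of the last LSTM time step are assumptions, while only the 62-class output comes from the abstract.

```python
# Sketch of a per-frame CNN + FC feature extractor feeding an LSTM.
# Hyperparameters and the ResNet-18 backbone are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class LipReader(nn.Module):
    def __init__(self, num_classes=62, feat_dim=256, hidden=256):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        backbone.fc = nn.Identity()          # keep the 512-d CNN features
        self.cnn = backbone                  # pretrained weights = transfer learning
        self.fc = nn.Linear(512, feat_dim)   # fully connected layer after the CNN
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):                # clips: (batch, time, 3, H, W)
        b, t = clips.shape[:2]
        x = self.cnn(clips.flatten(0, 1))    # per-frame CNN features
        x = self.fc(x).view(b, t, -1)        # (batch, time, feat_dim)
        out, _ = self.lstm(x)                # model temporal lip dynamics
        return self.head(out[:, -1])         # classify from the last time step
```

For example, a clip batch of shape (2, 16, 3, 112, 112) passed through LipReader() yields a (2, 62) tensor of command logits.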

Keywords