IEEE Access (Jan 2020)

WaveNet With Cross-Attention for Audiovisual Speech Recognition

  • Hui Wang,
  • Fei Gao,
  • Yue Zhao,
  • Licheng Wu

DOI
https://doi.org/10.1109/ACCESS.2020.3024218
Journal volume & issue
Vol. 8
pp. 169160 – 169168

Abstract


In this paper, WaveNet with cross-attention is proposed for Audio-Visual Automatic Speech Recognition (AV-ASR) to address the multimodal feature fusion and frame alignment problems between the two data streams. WaveNet is usually used for speech generation and speech recognition; here we extend it to audiovisual speech recognition and introduce a cross-attention mechanism at different places in WaveNet for feature fusion. The proposed cross-attention mechanism identifies, for each acoustic feature frame, the correlated frames of the visual feature. The experimental results show that WaveNet with cross-attention reduces the Tibetan single-syllable error by about 4.5% and the English word error by about 39.8% relative to audio-only speech recognition, and reduces the Tibetan single-syllable error by about 35.1% and the English word error by about 21.6% relative to the conventional feature concatenation method for AV-ASR.
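To make the fusion idea concrete, the sketch below (not the authors' released code) illustrates cross-attention in which each acoustic frame acts as a query over the visual feature frames, and the attended visual context is concatenated with the audio stream before a small WaveNet-style dilated convolution stack. Layer sizes, the fusion point, and all module names are illustrative assumptions.

    # Minimal PyTorch sketch of audio-query / visual-key-value cross-attention
    # fused into a toy WaveNet-style stack. Dimensions are assumptions, not the
    # paper's configuration.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class CrossAttentionFusion(nn.Module):
        """Scaled dot-product attention: audio frames query the visual frames."""

        def __init__(self, audio_dim, visual_dim, attn_dim):
            super().__init__()
            self.q = nn.Linear(audio_dim, attn_dim)
            self.k = nn.Linear(visual_dim, attn_dim)
            self.v = nn.Linear(visual_dim, attn_dim)

        def forward(self, audio, visual):
            # audio:  (batch, T_a, audio_dim)  acoustic feature frames
            # visual: (batch, T_v, visual_dim) visual frames; T_v may differ from T_a
            q, k, v = self.q(audio), self.k(visual), self.v(visual)
            scores = torch.matmul(q, k.transpose(1, 2)) / (q.size(-1) ** 0.5)
            weights = F.softmax(scores, dim=-1)         # each audio frame weighs all visual frames
            context = torch.matmul(weights, v)          # (batch, T_a, attn_dim), aligned to audio
            return torch.cat([audio, context], dim=-1)  # fused feature per acoustic frame


    class DilatedBlock(nn.Module):
        """One WaveNet-style gated residual block with a dilated 1-D convolution."""

        def __init__(self, channels, dilation):
            super().__init__()
            self.filter = nn.Conv1d(channels, channels, 3, padding=dilation, dilation=dilation)
            self.gate = nn.Conv1d(channels, channels, 3, padding=dilation, dilation=dilation)
            self.res = nn.Conv1d(channels, channels, 1)

        def forward(self, x):  # x: (batch, channels, T)
            h = torch.tanh(self.filter(x)) * torch.sigmoid(self.gate(x))
            return x + self.res(h)


    class AVWaveNetSketch(nn.Module):
        """Toy AV-ASR model: cross-attention fusion, then a small dilated stack."""

        def __init__(self, audio_dim=40, visual_dim=128, attn_dim=64,
                     channels=64, num_classes=30):
            super().__init__()
            self.fusion = CrossAttentionFusion(audio_dim, visual_dim, attn_dim)
            self.proj = nn.Conv1d(audio_dim + attn_dim, channels, 1)
            self.blocks = nn.ModuleList(DilatedBlock(channels, 2 ** i) for i in range(4))
            self.out = nn.Conv1d(channels, num_classes, 1)

        def forward(self, audio, visual):
            fused = self.fusion(audio, visual).transpose(1, 2)  # (batch, C, T_a)
            x = self.proj(fused)
            for block in self.blocks:
                x = block(x)
            return self.out(x).transpose(1, 2)  # per-frame class logits, e.g. for a CTC loss


    if __name__ == "__main__":
        model = AVWaveNetSketch()
        audio = torch.randn(2, 100, 40)    # 100 acoustic frames
        visual = torch.randn(2, 25, 128)   # 25 video frames (lower frame rate)
        print(model(audio, visual).shape)  # torch.Size([2, 100, 30])

Because the attention output has one row per acoustic frame, the fused representation stays on the audio time axis, which is one way to sidestep explicit frame alignment between the two streams.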

Keywords