Complex & Intelligent Systems (May 2024)

Sla-former: conformer using shifted linear attention for audio-visual speech recognition

  • Yewei Xiao,
  • Jian Huang,
  • Xuanming Liu,
  • Aosu Zhu

DOI
https://doi.org/10.1007/s40747-024-01451-x
Journal volume & issue
Vol. 10, no. 4
pp. 5721–5741

Abstract

Conformer-based models have proven highly effective in audio-visual speech recognition, integrating auditory and visual inputs to significantly enhance speech recognition accuracy. However, the softmax attention mechanism widely used within conformer models encounters scalability issues: its spatial and temporal complexity grows quadratically with sequence length. To address these challenges, this paper introduces the Shifted Linear Attention Conformer, an evolved iteration of the conformer architecture that adopts shifted linear attention as a scalable alternative to softmax attention. We conduct a thorough analysis of the factors constraining the efficiency of linear attention. To mitigate these issues, we propose a straightforward yet potent mapping function and an efficient rank restoration module, enhancing the effectiveness of self-attention while maintaining low computational complexity. Furthermore, we integrate an advanced attention-shifting technique that facilitates token manipulation within attentional mechanisms, thereby improving information flow across groups. This three-part approach enhances attention computation and is particularly beneficial for processing longer sequences. Our model achieves exceptional word error rates of 1.9% and 1.5% on the Lip Reading Sentences 2 and Lip Reading Sentences 3 datasets, respectively, demonstrating state-of-the-art performance in audio-visual speech recognition tasks.
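To make the complexity claim concrete, the sketch below contrasts standard softmax attention, which materializes an n×n score matrix, with kernelized linear attention, which regroups the computation so cost scales linearly in sequence length. This is a generic illustration of the linear-attention idea, not the paper's method: the authors' specific mapping function, rank restoration module, and attention-shifting technique are not reproduced here, and the ReLU-style feature map `phi` is an assumed stand-in.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard softmax attention: the n x n score matrix makes time
    # and memory scale quadratically with sequence length n.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Kernelized linear attention: a positive feature map phi lets us
    # regroup the computation as phi(Q) @ (phi(K)^T V), costing
    # O(n * d^2) instead of O(n^2 * d).
    # NOTE: this ReLU-based phi is an illustrative assumption, not the
    # mapping function proposed in the paper.
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                      # d x d summary, independent of n
    normalizer = Qp @ Kp.sum(axis=0)   # per-query normalization term
    return (Qp @ kv) / normalizer[:, None]

rng = np.random.default_rng(0)
n, d = 128, 16
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)
print(out.shape)  # (128, 16)
```

Because `phi(K).T @ V` is a fixed d×d summary, doubling the sequence length only doubles the work, whereas softmax attention would quadruple it; this is the scalability gap the shifted linear attention design targets.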

Keywords