Scientific Reports (Jul 2025)

An Mcformer encoder integrating Mamba and Cgmlp for improved acoustic feature extraction

  • Nurmemet Yolwas,
  • Yongchao Li,
  • Lixu Sun,
  • Jian Peng,
  • Zhiwu Sun,
  • Yajie Wei,
  • Yineng Cai

DOI
https://doi.org/10.1038/s41598-025-04979-1
Journal volume & issue
Vol. 15, no. 1
pp. 1–13

Abstract

Currently, attention models based on the Conformer architecture have become mainstream in speech recognition because they combine self-attention mechanisms with convolutional networks. However, further research indicates that Conformer still has limitations in capturing global information. To address this limitation, the Mcformer encoder is introduced: it places the Mamba module in parallel with multi-head attention blocks to strengthen the model's global context modeling, and it employs a Convolutional Gated Multilayer Perceptron (Cgmlp) structure whose depthwise convolutional layers improve the extraction of local features. Experimental results on the Aishell-1 and Common Voice zh 14 public datasets and the TED-LIUM 3 English public dataset demonstrate that, without a language model, the Mcformer encoder achieves character error rates (CER) of 4.15%/4.48% on the Aishell-1 validation/test sets and 13.28%/13.06% on the Common Voice zh 14 validation/test sets. With a language model, these CERs decrease further to 3.88%/4.08% and 11.89%/11.29%, respectively. On the TED-LIUM 3 dataset, the word error rates (WER) on the validation and test sets are 7.26% and 6.95%, respectively, without a language model. These results substantiate the efficacy of Mcformer in improving speech recognition performance.
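
To make the architecture described in the abstract concrete, the following is a minimal PyTorch sketch of one Mcformer-style encoder block, written from the abstract alone. The class names, hidden sizes, kernel size, and the summation used to fuse the parallel Mamba and attention branches are all assumptions rather than details from the paper, and a linear layer stands in for a real Mamba state-space layer (e.g. mamba_ssm.Mamba) so the sketch stays self-contained.

import torch
import torch.nn as nn

class CgMLP(nn.Module):
    """Convolutional gated MLP sketch: one projected branch gates the
    other through a depthwise convolution over time. Hidden size and
    kernel size are assumptions, not values from the paper."""
    def __init__(self, d_model, d_hidden=1024, kernel_size=31):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.up = nn.Linear(d_model, 2 * d_hidden)
        self.dw_conv = nn.Conv1d(d_hidden, d_hidden, kernel_size,
                                 padding=kernel_size // 2, groups=d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):                      # x: (batch, time, d_model)
        a, b = self.up(self.norm(x)).chunk(2, dim=-1)
        b = self.dw_conv(b.transpose(1, 2)).transpose(1, 2)
        return self.down(a * b)                # gated local features

class McformerBlock(nn.Module):
    """Encoder block sketch: Mamba and multi-head attention run in
    parallel on the same normalized input and their outputs are summed
    into the residual stream; the paper's exact fusion may differ."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Placeholder for a real Mamba SSM layer, kept self-contained here.
        self.mamba = nn.Linear(d_model, d_model)
        self.cgmlp = CgMLP(d_model)

    def forward(self, x):
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out + self.mamba(h)       # parallel global branches
        return x + self.cgmlp(x)               # local feature extraction

# Usage: a batch of 2 utterances, 100 frames, 256-dim features.
# y = McformerBlock()(torch.randn(2, 100, 256))   # shape (2, 100, 256)

The sketch only illustrates the structure the abstract names, i.e. parallel global branches (self-attention plus Mamba) followed by a convolution-gated MLP for local features; normalization placement and fusion details in the published model may differ.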

Keywords