Exploring the power of pure attention mechanisms in blind room parameter estimation

Chunxi Wang; Maoshen Jia; Meiran Li; Changchun Bao; Wenyu Jin

doi:10.1186/s13636-024-00344-8

EURASIP Journal on Audio, Speech, and Music Processing (Apr 2024)

Affiliations

Chunxi Wang: Beijing Key Laboratory of Computational Intelligence and Intelligent System, Faculty of Information Technology
Maoshen Jia: Beijing Key Laboratory of Computational Intelligence and Intelligent System, Faculty of Information Technology
Meiran Li: Beijing Key Laboratory of Computational Intelligence and Intelligent System, Faculty of Information Technology
Changchun Bao: Beijing Key Laboratory of Computational Intelligence and Intelligent System, Faculty of Information Technology
Wenyu Jin: AcousticDSP Consulting LLC

Journal volume & issue: Vol. 2024, no. 1
pp. 1 – 18

Abstract

Dynamic parameterization of acoustic environments has drawn widespread attention in the field of audio processing. Precise representation of local room acoustic characteristics is crucial when designing audio filters for various audio rendering applications. Key parameters in this context include reverberation time (RT$_{60}$) and geometric room volume. In recent years, neural networks have been extensively applied to the task of blind room parameter estimation. However, it remains an open question whether pure attention mechanisms can achieve superior performance in this task. To address this question, this study performs blind room parameter estimation from monaural noisy speech signals. Various model architectures are investigated, including a proposed attention-based model. This model is a convolution-free Audio Spectrogram Transformer that uses patch splitting, attention mechanisms, and cross-modality transfer learning from a pretrained Vision Transformer. Experimental results suggest that the proposed model, relying purely on attention mechanisms without convolution, exhibits significantly improved performance across various room parameter estimation tasks, especially with the help of dedicated pretraining and data augmentation schemes. Additionally, the model demonstrates greater adaptability and robustness than existing methods when handling variable-length audio inputs.

Published in EURASIP Journal on Audio, Speech, and Music Processing

ISSN: 1687-4722 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Science: Physics: Acoustics. Sound; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://asmp-eurasipjournals.springeropen.com
