MaxMViT-MLP: Multiaxis and Multiscale Vision Transformers Fusion Network for Speech Emotion Recognition

Kah Liang Ong; Chin Poo Lee; Heng Siong Lim; Kian Ming Lim; Ali Alqahtani

doi:10.1109/access.2024.3360483

IEEE Access (Jan 2024)

Affiliations

Kah Liang Ong: Faculty of Information Science and Technology, Multimedia University, Melaka, Malaysia
Chin Poo Lee: Faculty of Information Science and Technology, Multimedia University, Melaka, Malaysia
Heng Siong Lim: Faculty of Engineering and Technology, Multimedia University, Melaka, Malaysia
Kian Ming Lim: Faculty of Information Science and Technology, Multimedia University, Melaka, Malaysia
Ali Alqahtani: Department of Computer Science, King Khalid University, Abha, Saudi Arabia

DOI: https://doi.org/10.1109/access.2024.3360483
Journal volume & issue: Vol. 12
pp. 18237 – 18250

Abstract

Vision Transformers, known for their innovative architectural design and modeling capabilities, have gained significant attention in computer vision. This paper presents a dual-path approach that leverages the strengths of the Multi-Axis Vision Transformer (MaxViT) and the Improved Multiscale Vision Transformer (MViTv2). The approach first encodes speech signals into Constant-Q Transform (CQT) spectrograms and Mel spectrograms computed with the Short-Time Fourier Transform (Mel-STFT). The CQT spectrogram is fed into the MaxViT model and the Mel-STFT spectrogram into the MViTv2 model, each of which extracts informative features from its input. These features are integrated and passed into a Multilayer Perceptron (MLP) for final classification. This hybrid model is named the “MaxViT and MViTv2 Fusion Network with Multilayer Perceptron (MaxMViT-MLP).” The MaxMViT-MLP model achieves accuracies of 95.28% on the Emo-DB dataset, 89.12% on the RAVDESS dataset, and 68.39% on the IEMOCAP dataset, substantiating the advantages of integrating multiple audio feature representations and Vision Transformers in speech emotion recognition.
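
The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two transformer backbones are replaced by random stand-in feature vectors, "integrated" is assumed to mean concatenation, and the dimensions, weight initialization, class count, and all function names are hypothetical choices made only for illustration.

```python
import math
import random


def dense(inputs, weights, biases):
    """Affine layer: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_head(features, w1, b1, w2, b2):
    """One ReLU hidden layer followed by a softmax over emotion classes."""
    hidden = [max(0.0, h) for h in dense(features, w1, b1)]
    return softmax(dense(hidden, w2, b2))


def random_layer(n_out, n_in, rng):
    """Untrained placeholder weights (assumption: small Gaussian init)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
# Hypothetical sizes; the real feature dimensions depend on the backbones.
CQT_DIM, MEL_DIM, HIDDEN_DIM, NUM_CLASSES = 8, 6, 16, 7

# Stand-ins for the MaxViT features (CQT path) and MViTv2 features (Mel-STFT path).
cqt_features = [rng.gauss(0.0, 1.0) for _ in range(CQT_DIM)]
mel_features = [rng.gauss(0.0, 1.0) for _ in range(MEL_DIM)]

# Fusion, assumed here to be simple concatenation of the two feature vectors.
fused = cqt_features + mel_features

w1, b1 = random_layer(HIDDEN_DIM, len(fused), rng)
w2, b2 = random_layer(NUM_CLASSES, HIDDEN_DIM, rng)
probs = mlp_head(fused, w1, b1, w2, b2)
predicted_class = max(range(NUM_CLASSES), key=probs.__getitem__)
```

With trained weights and real backbone outputs, `probs` would hold the per-emotion probabilities and `predicted_class` the index of the most likely emotion. Here the weights are random, so the output only demonstrates the data flow.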

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
