IEEE Access (Jan 2023)
SCQT-MaxViT: Speech Emotion Recognition With Constant-Q Transform and Multi-Axis Vision Transformer
Abstract
Speech emotion recognition presents a significant challenge within the field of affective computing, requiring the analysis and detection of emotions conveyed through speech signals. However, existing approaches often rely on traditional signal processing techniques and handcrafted features, which may not effectively capture the nuanced aspects of emotional expression. In this paper, an approach named “SCQT-MaxViT” is proposed for speech emotion recognition, combining signal processing, computer vision, and deep learning techniques. The method utilizes the Constant-Q Transform (CQT) to convert speech waveforms into spectrograms; the logarithmically spaced frequency bins of the CQT provide fine frequency resolution in the low-frequency range, enabling the model to capture intricate emotional details. Additionally, the Multi-axis Vision Transformer (MaxViT) is employed for further representation learning and classification of the CQT spectrograms. MaxViT incorporates a multi-axis self-attention mechanism that facilitates both local and global interactions within the network, enhancing the model's ability to learn meaningful features. Furthermore, the training data is augmented with random time masking to improve the model's generalization capability. The proposed SCQT-MaxViT method achieves accuracies of 88.68% on the Emo-DB dataset, 77.54% on the RAVDESS dataset, and 62.49% on the IEMOCAP dataset, exhibiting promising performance in capturing and recognizing emotions in speech signals.
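As a concrete illustration of the front end the abstract describes, the minimal sketch below converts a speech waveform into a log-scaled CQT spectrogram and applies random time masking. This is not the authors' implementation: librosa is assumed for the CQT, and the sample rate, hop length, bin count, and mask width are illustrative placeholders rather than values reported in the paper.

```python
# Minimal sketch of the CQT + time-masking front end (assumed setup, not the
# paper's exact configuration): librosa for the Constant-Q Transform, with
# illustrative values for sr, hop_length, n_bins, and max_width.
import numpy as np
import librosa

def cqt_spectrogram(path, sr=22050, hop_length=512, n_bins=84):
    """Load a speech waveform and convert it to a log-scaled CQT spectrogram."""
    y, _ = librosa.load(path, sr=sr)
    cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length, n_bins=n_bins))
    return librosa.amplitude_to_db(cqt, ref=np.max)  # shape: (n_bins, n_frames)

def random_time_mask(spec, max_width=20, rng=None):
    """Zero out a randomly placed block of time frames (random time masking)."""
    rng = rng or np.random.default_rng()
    n_frames = spec.shape[1]
    width = int(rng.integers(1, max_width + 1))
    start = int(rng.integers(0, max(1, n_frames - width)))
    masked = spec.copy()
    masked[:, start:start + width] = spec.min()  # fill with the dB floor value
    return masked
```

The masked spectrogram, replicated to three channels and resized, could then be fed to any MaxViT classifier, for example a pretrained MaxViT from the timm library with its classification head resized to the number of emotion classes; this pairing is an assumption for illustration, and the exact MaxViT variant and training setup are those described in the paper's methodology.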
Keywords