Speech Emotion Recognition Based on Self-Attention Weight Correction for Acoustic and Text Features

Jennifer Santoso; Takeshi Yamada; Kenkichi Ishizuka; Taiichi Hashimoto; Shoji Makino

doi:10.1109/ACCESS.2022.3219094

IEEE Access (Jan 2022)

Affiliations

Jennifer Santoso: Degree Programs in Systems and Information Engineering, University of Tsukuba, Ibaraki, Japan
Takeshi Yamada: RevComm, Inc, Tokyo, Japan
Kenkichi Ishizuka: RevComm, Inc, Tokyo, Japan
Taiichi Hashimoto: RevComm, Inc, Tokyo, Japan
Shoji Makino: Degree Programs in Systems and Information Engineering, University of Tsukuba, Ibaraki, Japan

DOI: https://doi.org/10.1109/ACCESS.2022.3219094
Journal volume & issue: Vol. 10, pp. 115732–115743

Abstract

Speech emotion recognition (SER) is essential for understanding a speaker’s intention. Recently, some groups have attempted to improve SER performance using a bidirectional long short-term memory (BLSTM) network to extract features from speech sequences and a self-attention mechanism to focus on the important parts of those sequences. SER performance improves further when the information in speech is combined with text, which can be obtained automatically using an automatic speech recognizer (ASR). However, ASR performance deteriorates in the presence of emotion in speech. Although there is a method to improve ASR performance on emotional speech, it requires fine-tuning the ASR, which has a high computational cost. Fine-tuning also leads to the loss of cues important for determining the presence of emotion in speech segments, cues that can be helpful in SER. To solve these problems, we propose a BLSTM-and-self-attention-based SER method using self-attention weight correction (SAWC) with confidence measures. This method is applied to the acoustic and text feature extractors in SER to adjust the importance weights of speech segments and words with a high possibility of ASR error. Our proposed SAWC reduces the importance of words with speech recognition errors in the text feature while emphasizing the importance of the speech segments containing these words in the acoustic features. Our experimental results on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset reveal that our proposed method achieves a weighted average accuracy of 76.6%, outperforming other state-of-the-art methods. Furthermore, we investigated the behavior of our proposed SAWC in each of the feature extractors.
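
The idea of correcting attention weights with ASR confidence can be illustrated with a small sketch. The abstract does not give the correction formulas, so everything below is an assumption made for illustration only: the function names are hypothetical, the text-side rule multiplies each weight by its confidence score, the acoustic-side rule multiplies each weight by (2 - confidence), and the confidence scores are assumed to be already aligned one-to-one with the attention weights. It is not the authors' implementation.

```python
from typing import List


def normalize(weights: List[float]) -> List[float]:
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def correct_text_weights(attn: List[float], conf: List[float]) -> List[float]:
    """Reduce the weight of words with low ASR confidence.

    Assumed rule (not taken from the paper): a_i * c_i, then renormalize.
    """
    if len(attn) != len(conf):
        raise ValueError("attn and conf must have the same length")
    return normalize([a * c for a, c in zip(attn, conf)])


def correct_acoustic_weights(attn: List[float], conf: List[float]) -> List[float]:
    """Emphasize speech segments aligned with low-confidence words.

    Assumed rule (not taken from the paper): a_i * (2 - c_i), then renormalize.
    """
    if len(attn) != len(conf):
        raise ValueError("attn and conf must have the same length")
    return normalize([a * (2.0 - c) for a, c in zip(attn, conf)])


if __name__ == "__main__":
    # Hypothetical values: the second word is a likely recognition error.
    attention = [0.5, 0.3, 0.2]
    confidence = [0.9, 0.2, 0.8]
    print("text:    ", correct_text_weights(attention, confidence))
    print("acoustic:", correct_acoustic_weights(attention, confidence))
```

With these made-up inputs, the low-confidence second item loses weight on the text side and gains weight on the acoustic side, which is the qualitative behavior the abstract describes for SAWC.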

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
