Taiyuan Ligong Daxue Xuebao (Sep 2023)

Speech Emotion Recognition Based on Multi-task Deep Feature Extraction and MKPCA Feature Fusion

  • Baoyun LI,
  • Xueying ZHANG,
  • Juan LI,
  • Lixia HUANG,
  • Guijun CHEN,
  • Ying SUN

DOI
https://doi.org/10.16355/j.tyut.1007-9432.2023.05.004
Journal volume & issue
Vol. 54, no. 5
pp. 782–788

Abstract

Purposes Speech emotion recognition allows computers to understand the emotional information contained in human speech and is an important part of intelligent human-computer interaction. Feature extraction and fusion are key stages of a speech emotion recognition system and have a major impact on recognition results. To address the limited emotional information carried by traditional acoustic features, this paper proposes a deep feature extraction method that optimizes acoustic features through multi-task learning.

Methods The proposed acoustic deep features characterize speech emotion better and carry richer emotional information than the original acoustic features. Exploiting the complementarity between acoustic and spectrogram features, spectrogram features are then extracted with a convolutional neural network. Finally, multi-kernel principal component analysis (MKPCA) is applied to fuse the two feature sets and reduce their dimensionality; the resulting fused features effectively improve the system's recognition performance.

Findings Experiments were carried out on the EMODB and CASIA speech databases. With a DNN classifier, the multi-kernel fusion of the acoustic deep features and the spectrogram features achieves the highest recognition rates, 92.71% and 88.25%, respectively, improving on direct feature concatenation by 2.43% and 2.83%, respectively.
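As a rough illustration of the fusion step described above, the sketch below shows one way MKPCA fusion could be implemented with scikit-learn. It is a minimal sketch, not the authors' implementation: the base kernel types (RBF and polynomial), one kernel per feature view, equal kernel weights, and the 64-component output dimension are all assumptions not specified in the abstract.

    # Minimal MKPCA fusion sketch (assumptions: RBF + polynomial base kernels,
    # one kernel per feature view, equal weights; none of these are specified
    # in the paper's abstract).
    import numpy as np
    from sklearn.decomposition import KernelPCA
    from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

    def mkpca_fuse(acoustic_feats, spectrogram_feats, n_components=64,
                   weights=(0.5, 0.5)):
        # Compute one base kernel per feature view over the same utterances.
        K_acoustic = rbf_kernel(acoustic_feats)                     # (n, n)
        K_spectro = polynomial_kernel(spectrogram_feats, degree=2)  # (n, n)

        # Combine the base kernels into a single multi-kernel matrix.
        K = weights[0] * K_acoustic + weights[1] * K_spectro

        # Kernel PCA on the precomputed multi-kernel matrix performs the
        # joint feature fusion and dimension reduction in one step.
        kpca = KernelPCA(n_components=n_components, kernel="precomputed")
        return kpca.fit_transform(K)

    # Usage with random stand-in features (100 utterances, two feature views).
    rng = np.random.default_rng(0)
    fused = mkpca_fuse(rng.normal(size=(100, 128)),
                       rng.normal(size=(100, 256)))
    print(fused.shape)  # (100, 64)

The kernel weights act as a tunable trade-off between the two feature views; in practice they would be selected on a validation set, and at test time cross-kernels between training and test utterances would replace the square kernel matrices used here.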

Keywords