Speech emotion recognition using wavelet packet reconstruction with attention-based deep recurrent neutral networks

Hao Meng; Tianhao Yan; Hongwei Wei; Xun Ji

doi:10.24425/bpasts.2020.136300

Bulletin of the Polish Academy of Sciences: Technical Sciences (Feb 2021)

Affiliations

Hao Meng: Key laboratory of Intelligent Technology and Application of Marine Equipment (Harbin Engineering University), Ministry of Education, Harbin, 150001, China
Tianhao Yan: Key laboratory of Intelligent Technology and Application of Marine Equipment (Harbin Engineering University), Ministry of Education, Harbin, 150001, China
Hongwei Wei: Key laboratory of Intelligent Technology and Application of Marine Equipment (Harbin Engineering University), Ministry of Education, Harbin, 150001, China
Xun Ji: College of Marine Electrical Engineering, Dalian Maritime University, Dalian, 116026, China

Journal volume & issue: Vol. 69, No. 1

Abstract

Speech emotion recognition (SER) is a complicated and challenging task in human-computer interaction because it is difficult to find a feature set that fully discriminates between emotional states. The FFT is commonly used to process the raw signal when extracting low-level descriptor features such as short-time energy, fundamental frequency, formants, and MFCCs (mel-frequency cepstral coefficients). However, these features are built in the frequency domain and ignore information from the temporal domain. In this paper, we propose a novel framework that combines a multi-layer wavelet sequence set obtained from wavelet packet reconstruction (WPR) with a conventional feature set to form a mixed feature set, which is used for emotion recognition with recurrent neural networks (RNNs) based on the attention mechanism. In addition, because silent frames have a detrimental effect on SER, we adopt voice activity detection based on the autocorrelation function to eliminate emotionally irrelevant frames. We show that the proposed algorithm significantly outperforms the traditional feature set in predicting spontaneous emotional states on the IEMOCAP corpus and the EMODB database, respectively, and that it achieves better classification in both speaker-independent and speaker-dependent experiments. Notably, we obtain accuracies of 62.52% and 77.57% in the speaker-independent (SI) experiments and 66.90% and 82.26% in the speaker-dependent (SD) experiments.
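
The abstract names voice activity detection based on the autocorrelation function but gives no parameters. The following is only a minimal sketch of that general idea, not the authors' implementation: a frame is kept when its mean energy and its normalized autocorrelation peak both exceed a threshold. The function names, frame length, hop size, lag range, and both thresholds are illustrative assumptions (chosen here for a 16 kHz signal).

```python
def frame_signal(signal, frame_len, hop):
    """Split a list of samples into overlapping frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def normalized_autocorr_peak(frame, min_lag, max_lag):
    """Largest autocorrelation value over [min_lag, max_lag], divided by frame energy."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0  # an all-zero frame has no periodic structure
    best = 0.0
    for lag in range(min_lag, min(max_lag, len(frame) - 1) + 1):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        best = max(best, r / energy)
    return best


def voiced_frames(signal, frame_len=400, hop=160, min_lag=40, max_lag=200,
                  peak_threshold=0.3, min_energy=1e-4):
    """Return (frames, keep), where keep[i] is True for frames judged to contain speech.

    All default values are assumptions for illustration, not taken from the paper.
    """
    frames = frame_signal(signal, frame_len, hop)
    keep = []
    for frame in frames:
        mean_energy = sum(x * x for x in frame) / len(frame)
        peak = normalized_autocorr_peak(frame, min_lag, max_lag)
        keep.append(mean_energy > min_energy and peak > peak_threshold)
    return frames, keep
```

Frames flagged False would be dropped before feature extraction. The energy check is included because very quiet frames can still produce a noisy autocorrelation peak.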

Published in Bulletin of the Polish Academy of Sciences: Technical Sciences

ISSN: 2300-1917 (Online)
Publisher: Polish Academy of Sciences
Country of publisher: Poland
LCC subjects: Technology: Technology (General)
Website: https://journals.pan.pl/bpasts
