ACG-EmoCluster: A Novel Framework to Capture Spatial and Temporal Information from Emotional Speech Enhanced by DeepCluster

Huan Zhao; Lixuan Li; Xupeng Zha; Yujiang Wang; Zhaoxin Xie; Zixing Zhang

doi:10.3390/s23104777

Sensors (May 2023)

Affiliations

Huan Zhao: College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
Lixuan Li: College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
Xupeng Zha: College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
Yujiang Wang: College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
Zhaoxin Xie: MicroStrategy, Hangzhou 310000, China
Zixing Zhang: College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China

DOI: https://doi.org/10.3390/s23104777
Journal volume & issue: Vol. 23, no. 10, p. 4777

Abstract

Speech emotion recognition (SER) is the task of learning a mapping from speech features to emotion labels. Speech data have higher information saturation than images and stronger temporal coherence than text, which makes it challenging to learn speech features completely and effectively with feature extractors designed for images or text. In this paper, we propose ACG-EmoCluster, a novel semi-supervised framework for extracting spatial and temporal features from speech. The framework comprises a feature extractor that extracts spatial and temporal features simultaneously and a clustering classifier that enhances the speech representations through unsupervised learning. Specifically, the feature extractor combines an Attn–Convolution neural network with a Bidirectional Gated Recurrent Unit (BiGRU). The Attn–Convolution network has a global spatial receptive field and can be generalized to the convolution block of any neural network according to the data scale. The BiGRU is conducive to learning temporal information on a small-scale dataset, thereby alleviating data dependence. Experimental results on the MSP-Podcast dataset demonstrate that ACG-EmoCluster captures effective speech representations and outperforms all baselines in both supervised and semi-supervised SER tasks.
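The clustering idea mentioned in the abstract can be illustrated with a small sketch. The abstract does not describe the clustering classifier in detail, so the code below is not the authors' implementation. It shows only the generic DeepCluster-style step of grouping unlabeled feature vectors with k-means and treating the cluster indices as pseudo-labels, which a network could then be trained to predict. The function names, the toy two-dimensional features, and the evenly spaced centroid initialization are all assumptions made for illustration. A real system would cluster the extractor's learned embeddings and would typically use random or k-means++ initialization.

```python
# Minimal pure-Python k-means that produces pseudo-labels for unlabeled
# feature vectors. All names here are hypothetical, not from the paper.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_pseudo_labels(features, centroids):
    """Label each feature vector with the index of its nearest centroid."""
    return [
        min(range(len(centroids)),
            key=lambda k: squared_distance(f, centroids[k]))
        for f in features
    ]


def update_centroids(features, labels, centroids):
    """Move each centroid to the mean of the vectors assigned to it."""
    updated = []
    for k, old in enumerate(centroids):
        members = [f for f, lab in zip(features, labels) if lab == k]
        if not members:
            # An empty cluster keeps its previous centroid.
            updated.append(list(old))
            continue
        updated.append([
            sum(m[d] for m in members) / len(members)
            for d in range(len(old))
        ])
    return updated


def kmeans_pseudo_labels(features, num_clusters, num_iters=10):
    """Run k-means and return (pseudo_labels, centroids).

    Assumption: centroids start at evenly spaced samples so the result
    is deterministic; this is a simplification for the sketch.
    """
    n = len(features)
    centroids = [list(features[i * n // num_clusters])
                 for i in range(num_clusters)]
    labels = assign_pseudo_labels(features, centroids)
    for _ in range(num_iters):
        centroids = update_centroids(features, labels, centroids)
        new_labels = assign_pseudo_labels(features, centroids)
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids


# Toy stand-ins for utterance embeddings: two well-separated groups.
toy_features = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
]
pseudo_labels, centroids = kmeans_pseudo_labels(toy_features, num_clusters=2)
print(pseudo_labels)
```

In a semi-supervised setting, pseudo-labels of this kind would supply a training signal for unlabeled utterances alongside the emotion labels available for the labeled ones. How ACG-EmoCluster combines the two is specified in the paper itself, not in this sketch.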

Published in Sensors

ISSN: 1424-8220 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Chemical technology
Website: http://www.mdpi.com/journal/sensors
