Virtual Reality & Intelligent Hardware (Feb 2021)
Multi-scale discrepancy adversarial network for cross-corpus speech emotion recognition
Abstract
Background: Recognizing human emotions from speech is one of the most critical issues in human-computer interaction applications. In recent years, the challenging problem of cross-corpus speech emotion recognition (SER) has generated extensive research. Nevertheless, the domain discrepancy between training and testing data remains a major obstacle to improved system performance.

Methods: This paper introduces a novel multi-scale discrepancy adversarial (MSDA) network that performs multi-timescale domain adaptation for cross-corpus SER, i.e., it integrates domain discriminators at hierarchical levels into the emotion recognition framework to mitigate the gap between the source and target domains. Specifically, we extract two kinds of speech features, handcrafted features and deep features, at three timescales: global, local, and hybrid. At each timescale, the domain discriminator and the emotion classifier compete against each other, learning features that minimize the discrepancy between the two domains by fooling the discriminator.

Results: Extensive cross-corpus and cross-language SER experiments were conducted on a combined dataset comprising one Chinese dataset and two English datasets commonly used in SER. The MSDA benefits from the strong discriminative power of the adversarial process, in which three discriminators work in tandem with an emotion classifier. Accordingly, the MSDA outperforms all baseline methods.

Conclusions: The proposed architecture was evaluated on a combination of one Chinese and two English datasets. The experimental results demonstrate the superiority of our discriminative model for cross-corpus SER.
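The adversarial mechanism the abstract describes, a feature extractor trained to fool a domain discriminator, is commonly realized with a gradient reversal trick. The toy sketch below (a one-dimensional illustration with hypothetical names, not the paper's actual multi-scale networks) shows the core idea: the discriminator's parameter descends its domain-classification loss, while the shared feature parameter receives the negated gradient, pushing the features toward domain confusion.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def domain_grads(x, y_dom, w, a, lam=1.0):
    """Gradients of a binary cross-entropy domain loss for a toy model.

    Shared feature extractor: f(x) = w * x
    Domain discriminator:     d(f) = sigmoid(a * f)

    Gradient reversal: the discriminator parameter `a` gets the ordinary
    descent gradient, while the shared feature weight `w` gets the
    *negated* gradient (scaled by lam), so updating both with gradient
    descent trains the features to confuse the discriminator.
    """
    f = w * x
    p = sigmoid(a * f)
    dL_dz = p - y_dom           # d(BCE)/d(logit) for a sigmoid output
    g_a = dL_dz * f             # normal gradient for the discriminator
    g_w = -lam * dL_dz * a * x  # reversed gradient for the feature extractor
    return g_a, g_w

# One sample from the "target" domain (label 1): the two gradients
# point in opposite directions with respect to the shared computation.
g_a, g_w = domain_grads(x=2.0, y_dom=1.0, w=1.0, a=1.0)
print(np.sign(g_a), np.sign(g_w))  # opposite signs
```

In the paper's architecture this opposition is applied at each of the three timescales, with one discriminator per level competing against the shared emotion classifier.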