Jisuanji kexue yu tansuo (Dec 2024)
Speech Emotion Recognition Using Two-Stage Multiple Instance Learning Networks
Abstract
In the task of speech emotion recognition (SER), speech signals of unequal length are typically handled by dividing each utterance into several equal-length segments, and the emotion class is obtained by averaging the predictions of all segments. However, this implicitly assumes that emotional expression is evenly distributed throughout the speech signal, which is rarely the case in practice. To address this issue, this paper proposes an SER method using two-stage multiple instance learning networks. In the first stage, each utterance is regarded as a “bag” and divided into equal-length segments; a variety of acoustic features are extracted from these segments and serve as “instances”. These features are fed into the corresponding local acoustic feature encoders to learn deep feature representations, and a consistency-attention mechanism performs feature interaction and enhancement across the different representations. In the second stage, a hybrid aggregator based on multiple instance learning fuses instance predictions and instance features at the global scale to compute bag-level prediction scores. First, an instance distillation module filters out redundant instances carrying weak emotional information, and the retained instances are grouped into a pseudo bag. The pseudo-bag features are merged through an adaptive feature aggregation scheme and passed to a classifier to obtain bag-level predictions. Finally, instance-level and bag-level predictions are combined with an adaptive decision aggregation scheme to produce the final emotion result. The proposed method achieves recognition accuracies of 73.02% and 44.92% on the IEMOCAP and MELD public datasets, respectively. Experimental results demonstrate the effectiveness of the proposed method.
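To make the general multiple instance learning pipeline described above more concrete, the following is a minimal PyTorch sketch: an utterance is split into equal-length segments (instances), instance features are pooled with attention into a bag representation, and instance-level and bag-level predictions are blended by a learned weight. All module names, dimensions, and the simple feature encoder are illustrative assumptions, not the authors' implementation (in particular, the consistency-attention mechanism and instance distillation module are omitted).

```python
# Minimal MIL sketch for utterance-level emotion classification.
# Assumed/illustrative: encoder architecture, dimensions, and the blending scheme.
import torch
import torch.nn as nn
import torch.nn.functional as F


def split_into_instances(waveform: torch.Tensor, segment_len: int) -> torch.Tensor:
    """Split a 1-D waveform into equal-length, non-overlapping segments,
    zero-padding the tail so every segment has the same length."""
    pad = (-waveform.numel()) % segment_len
    padded = F.pad(waveform, (0, pad))
    return padded.view(-1, segment_len)            # (num_instances, segment_len)


class MILEmotionClassifier(nn.Module):
    def __init__(self, segment_len: int, hidden_dim: int = 128, num_classes: int = 4):
        super().__init__()
        # Instance encoder: stands in for the local acoustic feature encoders.
        self.encoder = nn.Sequential(
            nn.Linear(segment_len, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Attention weights used to pool instances into a bag-level feature.
        self.attention = nn.Sequential(nn.Linear(hidden_dim, 64), nn.Tanh(), nn.Linear(64, 1))
        self.instance_head = nn.Linear(hidden_dim, num_classes)   # instance-level scores
        self.bag_head = nn.Linear(hidden_dim, num_classes)        # bag-level scores
        # Learnable coefficient blending instance- and bag-level predictions
        # (a simple stand-in for adaptive decision aggregation).
        self.mix = nn.Parameter(torch.tensor(0.5))

    def forward(self, instances: torch.Tensor) -> torch.Tensor:
        h = self.encoder(instances)                                # (N, hidden_dim)
        attn = torch.softmax(self.attention(h), dim=0)             # (N, 1)
        bag_feat = (attn * h).sum(dim=0)                           # (hidden_dim,)
        inst_logits = self.instance_head(h)                        # (N, num_classes)
        inst_pred = (attn * inst_logits).sum(dim=0)                # attention-weighted instance vote
        bag_pred = self.bag_head(bag_feat)
        w = torch.sigmoid(self.mix)
        return w * bag_pred + (1.0 - w) * inst_pred                # fused utterance-level logits


if __name__ == "__main__":
    waveform = torch.randn(48_000)                 # ~3 s of 16 kHz audio (toy input)
    instances = split_into_instances(waveform, segment_len=8_000)
    model = MILEmotionClassifier(segment_len=8_000)
    logits = model(instances)
    print(logits.shape)                            # torch.Size([4])
```

In this sketch the same attention weights drive both the bag feature and the instance-level vote; the paper instead distills informative instances into a pseudo bag before aggregation, but the bag/instance decomposition and the fusion of the two prediction levels follow the same pattern.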
Keywords