Intelligent Computing (Jan 2024)

Temporal Shift Module with Pretrained Representations for Speech Emotion Recognition

  • Siyuan Shen,
  • Feng Liu,
  • Hanyang Wang,
  • Yunlong Wang,
  • Aimin Zhou

DOI
https://doi.org/10.34133/icomputing.0073
Journal volume & issue
Vol. 3

Abstract


Recent advances in self-supervised models have led to effective pretrained speech representations for downstream speech emotion recognition tasks. However, previous research has primarily focused on exploiting pretrained representations by simply adding a linear head on top of the pretrained model, while overlooking the design of the downstream network. In this paper, we propose a temporal shift module with pretrained representations to integrate channel-wise information without introducing additional parameters or floating-point operations (FLOPs). By incorporating the temporal shift module, we develop corresponding shift variants of 3 baseline building blocks: ShiftCNN, ShiftLSTM, and Shiftformer. Furthermore, we propose 2 technical strategies, the placement and proportion of shift, to balance the trade-off between mingling and misalignment. Our family of temporal shift models outperforms state-of-the-art methods on the benchmark Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset in both fine-tuning and feature-extraction scenarios. In addition, through comprehensive experiments on wav2vec 2.0 and Hidden-Unit Bidirectional Encoder Representations from Transformers (HuBERT) representations, we characterize the behavior of the temporal shift module in downstream models, which may serve as an empirical guideline for future exploration of channel-wise shift and downstream network design.
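
The core operation described in the abstract, shifting a fraction of the feature channels along the time axis before a downstream building block, can be sketched as follows. This is a minimal illustration, assuming a (batch, time, channel) layout of frame-level wav2vec 2.0 or HuBERT features and a hypothetical `shift_ratio` argument; the paper's actual placement and proportion of shift are the design choices it studies and are not fixed by this sketch.

```python
import torch


def temporal_shift(x: torch.Tensor, shift_ratio: float = 0.25) -> torch.Tensor:
    """Shift a fraction of channels along the time axis (illustrative sketch).

    x: tensor of shape (batch, time, channels), e.g. frame-level
    pretrained speech features. A `shift_ratio` fraction of the channels
    is split in half: one half is shifted forward by one frame, the other
    backward, and the remaining channels stay in place. Shifted-out
    positions are zero-padded, so the operation adds no parameters and no
    extra FLOPs.
    """
    b, t, c = x.shape
    n = int(c * shift_ratio) // 2  # channels shifted in each direction
    out = torch.zeros_like(x)
    out[:, 1:, :n] = x[:, :-1, :n]              # shift forward in time
    out[:, :-1, n:2 * n] = x[:, 1:, n:2 * n]    # shift backward in time
    out[:, :, 2 * n:] = x[:, :, 2 * n:]         # untouched channels
    return out


if __name__ == "__main__":
    feats = torch.randn(2, 100, 768)  # e.g. 100 frames of 768-dim features
    shifted = temporal_shift(feats)
    print(shifted.shape)  # torch.Size([2, 100, 768])
```

In a downstream model, such a shift would be applied to the feature sequence immediately before (or inside) each CNN, LSTM, or transformer building block, which is how the shift variants named in the abstract differ from their baselines.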