Applied Sciences (Aug 2024)
FFA-BiGRU: Attention-Based Spatial-Temporal Feature Extraction Model for Music Emotion Classification
Abstract
Music emotion recognition has become an important research direction because of its significance for music information retrieval, music recommendation, and related applications. Accurate recognition depends on fully extracting affect-salient features. In this paper, we propose an end-to-end spatial-temporal feature extraction method, FFA-BiGRU, for music emotion classification. Taking the log Mel-spectrogram of the music audio as input, the method employs an attention-based convolutional residual module named FFA as a spatial feature learning module to obtain multi-scale spatial features. Within the FFA module, three group architecture blocks extract multi-level spatial features; each block is a stack of channel-spatial attention-based residual blocks. The output features of FFA are then fed into a bidirectional gated recurrent unit (BiGRU) module to further capture the temporal features of the music. To make full use of both the spatial and the temporal features, the output feature maps of FFA and of the BiGRU are concatenated along the channel dimension. Finally, the concatenated features are passed through fully connected layers to predict the emotion class. Experimental results on the EMOPIA dataset show that the proposed model achieves higher classification accuracy than existing baselines, and ablation experiments demonstrate the effectiveness of each part of the proposed method.
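The data flow described above can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation: the layer widths, kernel sizes, number of residual blocks per group, the 1x1 convolution used to fuse the three groups, the frequency-axis averaging before the BiGRU, and the time-axis averaging before the classifier are all assumptions made for illustration. The class names `AttnResBlock` and `FFABiGRUSketch` are hypothetical, and the four output classes are assumed from the four-quadrant labels of EMOPIA.

```python
import torch
import torch.nn as nn


class AttnResBlock(nn.Module):
    """Residual block with channel attention followed by spatial attention (assumed form)."""

    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
        # Channel attention: one weight per channel, from global average pooling.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid()
        )
        # Spatial attention: one weight per time-frequency position.
        self.spatial_att = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())

    def forward(self, x):
        y = self.body(x)
        y = y * self.channel_att(y)
        y = y * self.spatial_att(y)
        return x + y  # residual connection


class FFABiGRUSketch(nn.Module):
    """Spatial module (three groups) -> BiGRU -> concatenation -> fully connected head."""

    def __init__(self, ch=16, hidden=32, blocks_per_group=2, num_classes=4):
        super().__init__()
        self.stem = nn.Conv2d(1, ch, 3, padding=1)
        # Three groups, each a stack of attention-based residual blocks.
        self.groups = nn.ModuleList([
            nn.Sequential(*[AttnResBlock(ch) for _ in range(blocks_per_group)])
            for _ in range(3)
        ])
        # Assumption: multi-level outputs are fused by concatenation and a 1x1 conv.
        self.fuse = nn.Conv2d(3 * ch, ch, 1)
        self.gru = nn.GRU(ch, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(ch + 2 * hidden, 64), nn.ReLU(), nn.Linear(64, num_classes)
        )

    def forward(self, mel):
        # mel: (batch, 1, n_mels, frames) log Mel-spectrogram
        x = self.stem(mel)
        levels = []
        for group in self.groups:
            x = group(x)
            levels.append(x)
        spatial = self.fuse(torch.cat(levels, dim=1))  # (batch, ch, n_mels, frames)
        # Assumption: average over the frequency axis to get one vector per frame.
        seq = spatial.mean(dim=2).transpose(1, 2)      # (batch, frames, ch)
        temporal, _ = self.gru(seq)                    # (batch, frames, 2 * hidden)
        # Concatenate spatial and temporal features along the feature (channel) axis.
        joint = torch.cat([seq, temporal], dim=2)      # (batch, frames, ch + 2 * hidden)
        # Assumption: average over time before the fully connected layers.
        return self.head(joint.mean(dim=1))            # (batch, num_classes)


if __name__ == "__main__":
    model = FFABiGRUSketch()
    logits = model(torch.randn(2, 1, 64, 100))  # two random 64-band, 100-frame inputs
    print(logits.shape)
```

Feeding the BiGRU output and the spatial features jointly to the classifier, rather than the BiGRU output alone, is the step the abstract highlights; in this sketch it corresponds to the single `torch.cat([seq, temporal], dim=2)` call.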
Keywords