Cognitive Computation and Systems (Jun 2022)

A dataset for learning stylistic and cultural correlations between music and videos

  • Xinyi Chen,
  • Hui Zhang,
  • Songruoyao Wu,
  • Jun Zheng,
  • Lingyun Sun,
  • Kejun Zhang

DOI
https://doi.org/10.1049/ccs2.12043
Journal volume & issue
Vol. 4, no. 2
pp. 177–187

Abstract

Abstract Music–visual retrieval is of broad interest in the field of Music Information Retrieval (MIR). Most research relies on emotional tags or is based on content but does not consider stylistic and cultural differences between music and videos. As a result, only one‐sided dimensions are considered for automatic music retrieval for videos, while the stylistic correlation between audio‐visual is ignored. At the same time, the needs of different cultural regions cannot be well met. Therefore, the first labelled extensive Music Video (MV) dataset, Next‐MV, is constructed in this paper consisting of 6000 pieces of 30‐s MV fragments, including five music style labels and four cultural labels. The proposed Next‐Net framework is built to study the correlation between music style and visual style. The optimal audiovisual feature set and model structure are obtained in the experiments. The accuracy reached 71.1%, higher than the baseline model (66.9%). Furthermore, in the cross‐cultural experiment, it is found that the accuracy of the general fusion model (71.1%) is between the model trained by within‐dataset (76%) and the model trained by cross‐dataset (60%), indicating that culture has a significant influence on the correlation between music and visual. The experiments of pair classification on cultures are further carried out. It is found that Rock and Dance are more culturally influenced than R&B and Hip‐hop. Among all the cultures discussed, Chinese and Japanese music and videos show great differences among most of the styles, while Korean music videos styles are more similar to western styles than other eastern cultures.

Keywords