Cognitive Computation and Systems (Jun 2022)

A dataset for learning stylistic and cultural correlations between music and videos

  • Xinyi Chen,
  • Hui Zhang,
  • Songruoyao Wu,
  • Jun Zheng,
  • Lingyun Sun,
  • Kejun Zhang

DOI
https://doi.org/10.1049/ccs2.12043
Journal volume & issue
Vol. 4, no. 2
pp. 177–187

Abstract

Abstract Music–visual retrieval is of broad interest in the field of Music Information Retrieval (MIR). Most research relies on emotional tags or is based on content but does not consider stylistic and cultural differences between music and videos. As a result, only one‐sided dimensions are considered for automatic music retrieval for videos, while the stylistic correlation between audio‐visual is ignored. At the same time, the needs of different cultural regions cannot be well met. Therefore, the first labelled extensive Music Video (MV) dataset, Next‐MV, is constructed in this paper consisting of 6000 pieces of 30‐s MV fragments, including five music style labels and four cultural labels. The proposed Next‐Net framework is built to study the correlation between music style and visual style. The optimal audiovisual feature set and model structure are obtained in the experiments. The accuracy reached 71.1%, higher than the baseline model (66.9%). Furthermore, in the cross‐cultural experiment, it is found that the accuracy of the general fusion model (71.1%) is between the model trained by within‐dataset (76%) and the model trained by cross‐dataset (60%), indicating that culture has a significant influence on the correlation between music and visual. The experiments of pair classification on cultures are further carried out. It is found that Rock and Dance are more culturally influenced than R&B and Hip‐hop. Among all the cultures discussed, Chinese and Japanese music and videos show great differences among most of the styles, while Korean music videos styles are more similar to western styles than other eastern cultures.

Keywords