EURASIP Journal on Audio, Speech, and Music Processing (Oct 2023)
YuYin: a multi-task learning model of multi-modal e-commerce background music recommendation
Abstract
Abstract Appropriate background music in e-commerce advertisements can help stimulate consumption and build product image. However, many factors like emotion and product category should be taken into account, which makes manually selecting music time-consuming and require professional knowledge and it becomes crucial to automatically recommend music for video. For there is no e-commerce advertisements dataset, we first establish a large-scale e-commerce advertisements dataset Commercial-98K, which covers major e-commerce categories. Then, we proposed a video-music retrieval model YuYin to learn the correlation between video and music. We introduce a weighted fusion module (WFM) to fuse emotion features and audio features from music to get a more fine-grained music representation. Considering the similarity of music in the same product category, YuYin is trained by multi-task learning to explore the correlation between video and music by cross-matching video, music, and tag as well as a category prediction task. We conduct extensive experiments to prove YuYin achieves a remarkable improvement in video-music retrieval on Commercial-98K.
Keywords