Vocal-Accompaniment Compatibility Estimation Using Self-Supervised and Joint-Embedding Techniques

Takayuki Nakatsuka; Kento Watanabe; Yuki Koyama; Masahiro Hamasaki; Masataka Goto; Shigeo Morishima

doi:10.1109/ACCESS.2021.3096819

IEEE Access (Jan 2021)

Affiliations

Takayuki Nakatsuka: National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Ibaraki, Japan
Kento Watanabe: National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Ibaraki, Japan
Yuki Koyama: National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Ibaraki, Japan
Masahiro Hamasaki: National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Ibaraki, Japan
Masataka Goto: National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Ibaraki, Japan
Shigeo Morishima: Waseda University, Shinjuku, Tokyo, Japan

Journal volume & issue: Vol. 9, pp. 101994–102003

Abstract

We propose a learning-based method for estimating the compatibility between vocal and accompaniment audio tracks, i.e., how well they go with each other when played simultaneously. This task is challenging because it is difficult to formulate hand-crafted rules or to construct a large labeled dataset for supervised learning. Our method uses self-supervised and joint-embedding techniques to estimate vocal-accompaniment compatibility. We train vocal and accompaniment encoders to learn a joint-embedding space in which the embedded feature vectors of a compatible pair of vocal and accompaniment tracks lie close to each other and those of an incompatible pair lie far from each other. To address the lack of large labeled datasets of compatible and incompatible pairs, we propose generating such a dataset from songs using singing voice separation techniques: each song is separated into a vocal track and an accompaniment track, the original pairs are assumed to be compatible, and other randomly formed pairs are assumed to be incompatible. We carried out this training on a large dataset that we constructed, containing 910,803 songs, and evaluated the effectiveness of our method using ranking-based evaluation methods.
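The pairing scheme and the joint-embedding objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract names neither the loss function nor the ranking metric, so a triplet margin loss and the rank of the original accompaniment are assumed here purely for illustration. The hand-written toy vectors stand in for encoder outputs, and the function names (make_triplets, triplet_loss, rank_of_original) are hypothetical.

```python
import math
import random


def distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def make_triplets(num_songs, rng):
    """Self-supervised labels: for song i, the separated (vocal i, accompaniment i)
    pair is treated as compatible, and a randomly drawn accompaniment j != i
    is treated as incompatible."""
    triplets = []
    for i in range(num_songs):
        j = rng.choice([k for k in range(num_songs) if k != i])
        triplets.append((i, i, j))  # (vocal, compatible acc., incompatible acc.)
    return triplets


def triplet_loss(vocal_emb, accomp_emb, triplets, margin=1.0):
    """Assumed objective (not stated in the abstract): pull compatible pairs
    together and push incompatible pairs at least `margin` further apart."""
    total = 0.0
    for v, pos, neg in triplets:
        d_pos = distance(vocal_emb[v], accomp_emb[pos])
        d_neg = distance(vocal_emb[v], accomp_emb[neg])
        total += max(0.0, d_pos - d_neg + margin)
    return total / len(triplets)


def rank_of_original(vocal_emb, accomp_emb, i):
    """1-based rank of the original accompaniment when all candidates are
    sorted by distance to vocal i (an assumed ranking-based measure)."""
    dists = [distance(vocal_emb[i], a) for a in accomp_emb]
    return 1 + sum(1 for d in dists if d < dists[i])


# Toy 2-D embeddings standing in for the outputs of the two encoders.
vocal_emb = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]
accomp_emb = [[0.1, 0.0], [5.0, 5.2], [10.0, 0.3]]

rng = random.Random(0)
triplets = make_triplets(len(vocal_emb), rng)
loss = triplet_loss(vocal_emb, accomp_emb, triplets)
ranks = [rank_of_original(vocal_emb, accomp_emb, i) for i in range(len(vocal_emb))]
print(loss, ranks)
```

In this toy setting each vocal already lies much closer to its own accompaniment than to any other, so the assumed loss is zero and every original accompaniment is ranked first. In actual training, the encoders would be updated until held-out pairs behave this way.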

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
