Training audio transformers for cover song identification

Te Zeng; Francis C. M. Lau

doi:10.1186/s13636-023-00297-4

EURASIP Journal on Audio, Speech, and Music Processing (Aug 2023)

Affiliations

Te Zeng: Department of Computer Science, The University of Hong Kong
Francis C. M. Lau: Department of Computer Science, The University of Hong Kong

DOI: https://doi.org/10.1186/s13636-023-00297-4
Journal volume & issue: Vol. 2023, no. 1
pp. 1 – 11

Abstract

Over the past decades, convolutional neural networks (CNNs) have been widely adopted for audio perception tasks, where the goal is to learn latent representations. For audio analysis, however, CNNs can be limited in how effectively they model temporal context. Following the success of transformer architectures in computer vision and audio classification, and to better capture long-range global context, we extend this line of work and propose the Audio Similarity Transformer (ASimT), a convolution-free, purely transformer-based architecture for learning effective representations of audio signals. We also introduce a novel loss, MAPLoss, used in tandem with a classification loss, to directly improve mean average precision. In experiments, ASimT achieves state-of-the-art performance in cover song identification on public datasets.
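For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of how mean average precision is commonly computed for a retrieval task such as cover song identification. It is not the paper's MAPLoss, which is a training objective whose form is not given here. The function names and the boolean-list input format are assumptions made for illustration, and the sketch assumes every relevant item for a query appears somewhere in its ranked list.

```python
from typing import Sequence


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Average precision for one query.

    ranked_relevance[i] is True when the item at rank i + 1 is a true
    cover of the query (hypothetical input format for this sketch).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant item appears.
            precision_sum += hits / rank
    # Assumes all relevant items are present in the ranked list.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-query average precision values."""
    if not queries:
        return 0.0
    return sum(average_precision(q) for q in queries) / len(queries)
```

As a usage example, a query whose ranked results are relevant at ranks 1 and 3 has an average precision of (1/1 + 2/3) / 2, and averaging such values over all queries gives the mean average precision.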

Published in EURASIP Journal on Audio, Speech, and Music Processing

ISSN: 1687-4722 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Science: Physics: Acoustics. Sound; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://asmp-eurasipjournals.springeropen.com
