Masked multi-center angular margin loss for language recognition

Minghang Ju; Yanyan Xu; Dengfeng Ke; Kaile Su

doi:10.1186/s13636-022-00249-4

EURASIP Journal on Audio, Speech, and Music Processing (Jul 2022)

Affiliations

Minghang Ju: School of Information Science and Technology, Beijing Forestry University
Yanyan Xu: School of Information Science and Technology, Beijing Forestry University
Dengfeng Ke: School of Information Science, Beijing Language and Culture University
Kaile Su: Institute for Integrated and Intelligent Systems, Griffith University

DOI: https://doi.org/10.1186/s13636-022-00249-4
Journal volume & issue: Vol. 2022, no. 1
pp. 1 – 22

Abstract

Embedding-based language recognition aims to maximize inter-class variance and minimize intra-class variance. Previous research has been limited to the training constraint of a single centroid per class, which cannot accurately describe the overall geometric characteristics of the embedding space. In this paper, we propose a novel masked multi-center angular margin (MMAM) loss that works from the perspective of multiple centroids and yields better overall performance. Specifically, numerous global centers are used to jointly approximate the entities of each class. To capture the local neighbor relationship effectively, a small number of centers are adapted to construct the similarity relationship between these centers and each entity. Furthermore, we use a new reverse label propagation algorithm to adjust neighbor relations according to the ground truth labels, so that a discriminative metric space is learned in the classification process. Finally, an additive angular margin is added, which learns more discriminative language embeddings by simultaneously enhancing intra-class compactness and inter-class discrepancy. Experiments are conducted on the APSIPA 2017 Oriental Language Recognition (AP17-OLR) corpus. We compare the proposed MMAM method with seven state-of-the-art baselines and verify that our method achieves 26.2% and 31.3% relative improvements in the equal error rate (EER) and Cavg, respectively, in the full-length test (“full-length” means the average duration of the utterances is longer than 5 s). There are also 31.2% and 29.3% relative improvements in the 3-s test and 14% and 14.8% relative improvements in the 1-s test.
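The combination of several centers per class with an additive angular margin can be illustrated with a minimal sketch. This is not the authors' MMAM loss: the abstract does not give its formulation, so the masking step and the reverse label propagation are omitted, and pooling the per-class centers with a maximum over cosine similarities is an assumption borrowed from sub-center style margin losses. The function names, the `margin` and `scale` defaults, and the toy centers are all hypothetical.

```python
import math


def _unit(v):
    """Return v scaled to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def multi_center_aam_logits(embedding, centers, label, margin=0.2, scale=30.0):
    """Scaled cosine logits with an additive angular margin on the true class.

    embedding: list of floats (one utterance embedding)
    centers:   centers[c][k] is the k-th center (list of floats) of class c
    label:     index of the ground-truth class
    """
    e = _unit(embedding)
    logits = []
    for c, class_centers in enumerate(centers):
        # ASSUMPTION: a class is scored by its most similar center (max pooling).
        cos = max(sum(a * b for a, b in zip(e, _unit(w))) for w in class_centers)
        cos = max(-1.0, min(1.0, cos))  # guard acos against rounding error
        theta = math.acos(cos)
        if c == label:
            # Additive angular margin: enlarge the angle to the true class,
            # capped at pi so the cosine stays monotonic in the angle.
            theta = min(theta + margin, math.pi)
        logits.append(scale * math.cos(theta))
    return logits


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


# Toy example: 2 classes, 2 centers each, 2-dimensional embeddings.
toy_centers = [[[1.0, 0.0], [0.8, 0.6]], [[0.0, 1.0], [-0.6, 0.8]]]
toy_embedding = [0.9, 0.5]
toy_loss = cross_entropy(multi_center_aam_logits(toy_embedding, toy_centers, 0), 0)
```

In this sketch the margin lowers only the true-class logit, so the loss for a given embedding is larger than without the margin. During training, that pushes embeddings closer in angle to a center of their own class.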

Published in EURASIP Journal on Audio, Speech, and Music Processing

ISSN: 1687-4722 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Science: Physics: Acoustics. Sound; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://asmp-eurasipjournals.springeropen.com
