IEEE Access (Jan 2023)

Application of Split Residual Multilevel Attention Network in Speaker Recognition

  • Jiji Wang,
  • Fei Deng,
  • Lihong Deng,
  • Ping Gao,
  • Yuanxiang Huang

DOI: https://doi.org/10.1109/ACCESS.2023.3306026
Journal volume & issue: Vol. 11, pp. 89359–89368

Abstract


Current speaker recognition systems mainly combine network architectures with attention mechanisms; however, lightweight networks cannot extract frame-level features of speaker speech well, while deeper and wider networks suffer from slower inference and an excessive number of parameters. To this end, we propose Split-ResNet, a split-residual network structure that obtains combinations of multiple receptive fields at a finer-grained level, yielding feature representations with different scale combinations and producing more informative and comprehensive multi-scale features. In addition, we propose a dual time-frequency attention (DTFA) mechanism that enhances key features and suppresses unimportant ones by attending to features in the time and frequency domains and learning weights along the time and frequency axes, respectively. Finally, we evaluated the speaker recognition system combining Split-ResNet and DTFA against other speaker recognition systems on the VoxCeleb1-O test set. The results show that the proposed system reduces the EER by 0.98%, 0.39%, 0.69%, and 0.47% compared with SpeechNAS, RawNet2, Y-vector, and CNN+Transformer, respectively, demonstrating that DTFA+Split-ResNet is a speaker recognition system with strong speaker audio feature extraction and discrimination capability.
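The abstract describes the two components only at a high level. As a rough, non-authoritative illustration of the dual time-frequency attention idea (weights learned separately along the time and frequency axes and used to rescale the feature map), the PyTorch-style sketch below may help. The module name DualTimeFreqAttention, the pooling choices, and the bottleneck/convolution sizes are assumptions for illustration, not details taken from the paper.

import torch
import torch.nn as nn

class DualTimeFreqAttention(nn.Module):
    """Illustrative dual time-frequency attention (not the paper's exact design).

    Learns one weight per frequency bin and one per time frame from pooled
    descriptors, then rescales the input feature map with both.
    """
    def __init__(self, freq_bins: int, reduction: int = 4, time_kernel: int = 7):
        super().__init__()
        # Frequency branch: bottleneck MLP over the (fixed-size) frequency axis.
        self.freq_mlp = nn.Sequential(
            nn.Linear(freq_bins, freq_bins // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(freq_bins // reduction, freq_bins),
            nn.Sigmoid(),
        )
        # Time branch: 1-D convolution so variable-length utterances are supported.
        self.time_conv = nn.Sequential(
            nn.Conv1d(1, 1, kernel_size=time_kernel, padding=time_kernel // 2),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq_bins, time_frames)
        f_desc = x.mean(dim=(1, 3))                    # (B, F) pooled over channel and time
        t_desc = x.mean(dim=(1, 2)).unsqueeze(1)       # (B, 1, T) pooled over channel and frequency
        f_w = self.freq_mlp(f_desc)[:, None, :, None]  # (B, 1, F, 1) frequency weights
        t_w = self.time_conv(t_desc).unsqueeze(1)      # (B, 1, 1, T) time weights
        # Emphasise informative frequency bands and time frames, damp the rest.
        return x * f_w * t_w

For a feature map x of shape (batch, channels, freq_bins, time_frames), e.g. torch.randn(2, 32, 80, 200), DualTimeFreqAttention(freq_bins=80)(x) returns a tensor of the same shape with frequency bands and time frames reweighted.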

Keywords