IEEE Access (Jan 2023)

Analysis-Based Optimization of Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification

  • Seong-Hu Kim
  • Hyeonuk Nam
  • Yong-Hwa Park

DOI
https://doi.org/10.1109/ACCESS.2023.3286034
Journal volume & issue
Vol. 11
pp. 60646–60659

Abstract
Temporal dynamic convolutional neural networks (TDY-CNNs) extract speaker embeddings that account for the time-varying characteristics of speech and improve text-independent speaker verification performance. In this paper, we optimize TDY-CNNs based on a detailed analysis of the network architecture. The optimized temporal dynamic convolution generates the attention weights of its basis kernels from features formed by concatenating channel-averaged and frequency-averaged data, reducing network parameters by 26%. In addition, temporal dynamic convolutions replace vanilla convolutions in the earlier layers, while the optimized temporal dynamic convolutions in the later layers use a steady kernel regardless of the time-bin data. As a result, Opt-TDY-ResNet-34(×0.50) shows the best speaker verification performance, with an EER of 1.07%, among speaker verification models without data augmentation, including ResNet-based baseline networks and other state-of-the-art networks. Moreover, we validate through various methods that Opt-TDY-CNNs adapt to time-bin data. By comparing the inter- and intra-phoneme distances of attention weights, we confirmed that the temporal dynamic convolution uses different kernels depending on the phoneme group, which is directly related to the time-bin data. In addition, by applying gradient-weighted class activation mapping (Grad-CAM) to speaker verification to obtain speaker activation maps (SAMs), we showed that temporal dynamic convolution extracts speaker information from the frequency characteristics of individual time bins, such as phonemes' formant frequencies, whereas vanilla convolution extracts only a vague outline of the Mel-spectrogram.
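
To make the mechanism described in the abstract concrete, the sketch below shows one way a temporal dynamic convolution could be written in PyTorch: per-time-bin attention weights over K basis kernels, computed from the concatenation of channel-averaged and frequency-averaged features. This is a hypothetical illustration, not the authors' implementation; the class name TemporalDynamicConv2d, the kernel count, and the softmax temperature are assumptions for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalDynamicConv2d(nn.Module):
    """Per-time-bin attention over K basis kernels (illustrative sketch)."""

    def __init__(self, in_ch, out_ch, freq_bins,
                 kernel_size=3, num_kernels=6, temperature=30.0):
        super().__init__()
        self.num_kernels = num_kernels
        self.temperature = temperature  # assumed value; softens the softmax
        # K basis kernels shared across time; the effective kernel at each
        # time bin is their attention-weighted combination.
        self.weight = nn.Parameter(
            torch.randn(num_kernels, out_ch, in_ch, kernel_size, kernel_size) * 0.01)
        # Attention branch: per time bin, takes the concatenation of the
        # channel-averaged (freq_bins values) and frequency-averaged
        # (in_ch values) features and produces K attention logits.
        self.attention = nn.Conv1d(freq_bins + in_ch, num_kernels, kernel_size=1)

    def forward(self, x):
        # x: (batch, in_ch, freq_bins, time), e.g. Mel-spectrogram features
        b, c, f, t = x.shape
        avg_ch = x.mean(dim=1)                     # (b, f, t) channel average
        avg_fr = x.mean(dim=2)                     # (b, c, t) frequency average
        feat = torch.cat([avg_ch, avg_fr], dim=1)  # (b, f + c, t)
        pi = F.softmax(self.attention(feat) / self.temperature, dim=1)  # (b, K, t)

        # Convolution is linear in its weights, so mixing the K basis kernels
        # per time bin is equivalent (up to kernels straddling adjacent bins)
        # to mixing the K convolution outputs per time bin.
        pad = self.weight.shape[-1] // 2
        outs = torch.stack(
            [F.conv2d(x, self.weight[k], padding=pad)
             for k in range(self.num_kernels)], dim=1)  # (b, K, out_ch, f, t)
        pi = pi.view(b, self.num_kernels, 1, 1, t)
        return (outs * pi).sum(dim=1)                   # (b, out_ch, f, t)

# Usage: 32-channel feature map over 40 Mel bins and 100 time frames.
layer = TemporalDynamicConv2d(in_ch=32, out_ch=64, freq_bins=40)
y = layer(torch.randn(8, 32, 40, 100))  # -> (8, 64, 40, 100)
```

In the optimized architecture described above, layers of this kind would replace vanilla convolutions only in the earlier stages of the ResNet, while later stages keep a single steady kernel.
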

Keywords