IEEE Access (Jan 2023)
Analysis-Based Optimization of Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification
Abstract
Temporal dynamic convolutional neural networks (TDY-CNNs) extract speaker embeddings by considering the time-varying characteristics of speech, and they improve text-independent speaker verification performance. In this paper, we optimize TDY-CNNs based on a detailed analysis of the network architecture. The optimized temporal dynamic convolution generates the attention weights of its basis kernels from features formed by concatenating channel-averaged and frequency-averaged data, reducing the number of network parameters by 26%. In addition, temporal dynamic convolutions replace vanilla convolutions only in earlier layers, since the temporal dynamic convolutions in later layers use a nearly steady kernel regardless of time-bin data. As a result, Opt-TDY-ResNet-34($\times 0.50$) shows the best speaker verification performance, with an EER of 1.07%, among models trained without data augmentation, including ResNet-based baseline networks and other state-of-the-art networks. Moreover, we validate through various analyses that Opt-TDY-CNNs adapt to time-bin data. By comparing the inter- and intra-phoneme distances of the attention weights, we confirm that temporal dynamic convolution uses different kernels depending on phoneme groups, which are directly related to time-bin data. In addition, by applying gradient-weighted class activation mapping (Grad-CAM) to speaker verification to obtain speaker activation maps (SAMs), we show that temporal dynamic convolution extracts speaker information from the frequency characteristics of individual time bins, such as phonemes' formant frequencies, whereas vanilla convolution extracts only a vague outline of the Mel-spectrogram.
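To make the mechanism concrete, the following is a minimal PyTorch sketch of a temporal dynamic convolution layer as summarized in the abstract: K basis kernels are combined per time bin with softmax attention weights predicted from concatenated channel-averaged and frequency-averaged features. The class name, the attention MLP, and all hyperparameters here are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalDynamicConv2d(nn.Module):
    """Sketch of a temporal dynamic convolution: K basis kernels are
    aggregated per time bin using attention weights predicted from the
    input (layer names and MLP details are illustrative assumptions)."""

    def __init__(self, in_ch, out_ch, freq_bins, kernel_size=3, num_kernels=6):
        super().__init__()
        # K basis kernels shared across time, combined per time bin.
        self.weight = nn.Parameter(
            0.01 * torch.randn(num_kernels, out_ch, in_ch, kernel_size, kernel_size))
        # Attention input: concatenation of frequency-averaged (in_ch values)
        # and channel-averaged (freq_bins values) data per time bin, following
        # the optimized attention described in the abstract.
        attn_in = in_ch + freq_bins
        self.attn = nn.Sequential(
            nn.Linear(attn_in, attn_in // 4),
            nn.ReLU(),
            nn.Linear(attn_in // 4, num_kernels),
        )
        self.pad = kernel_size // 2

    def forward(self, x):
        # x: (batch, channels, freq, time), e.g. Mel-spectrogram features
        freq_avg = x.mean(dim=2)                       # (B, C, T)
        chan_avg = x.mean(dim=1)                       # (B, F, T)
        desc = torch.cat([freq_avg, chan_avg], dim=1)  # (B, C+F, T)
        pi = F.softmax(self.attn(desc.transpose(1, 2)), dim=-1)  # (B, T, K)
        # Apply each basis kernel, then weight the outputs per time bin.
        outs = torch.stack(
            [F.conv2d(x, w, padding=self.pad) for w in self.weight], dim=-1
        )                                              # (B, out_ch, F, T, K)
        return (outs * pi[:, None, None, :, :]).sum(dim=-1)


# Example: 6 basis kernels over 40-bin Mel features with 32 channels.
layer = TemporalDynamicConv2d(in_ch=32, out_ch=32, freq_bins=40)
y = layer(torch.randn(2, 32, 40, 200))  # -> (2, 32, 40, 200)
```

Because the attention depends only on per-time-bin pooled statistics, the layer can select different kernel mixtures for different phonemes while keeping the basis kernels themselves shared across all time bins.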
Keywords