IEEE Access (Jan 2023)

Time Delay Estimation for Sound Source Localization Using CNN-Based Multi-GCC Feature Fusion

  • Haitao Liu,
  • Xiuliang Zhang,
  • Penggao Li,
  • Yu Yao,
  • Sheng Zhang,
  • Qian Xiao

DOI
https://doi.org/10.1109/ACCESS.2023.3340108
Journal volume & issue
Vol. 11
pp. 140789 – 140800

Abstract

Accurate time delay estimation is critical for sound source localization methods that rely on time difference of arrival, yet background noise and reverberation often introduce estimation errors. Generalized cross-correlation (GCC) functions, paired with different weighting functions, can adapt time delay estimation to different sound field environments. To obtain a highly accurate time delay estimation method that works across sound field conditions, this paper proposes training a convolutional neural network on multi-class weighted GCC features. Several weighted GCC functions are applied to the same microphone pair to extract time delay features, and these features are fused into a feature matrix. The feature matrix is fed into a convolutional neural network composed of convolutional layers and fully connected layers for training and prediction. The network estimates the time delay in two ways: by regression, with mean squared error as the loss function, and by classification, with cross-entropy as the loss function. The proposed method is tested and validated in simulated scenarios covering various signal-to-noise ratios and reverberation conditions. Its time delay estimates are compared with those of recent state-of-the-art (SOTA) methods in terms of accuracy, root mean square error, and mean absolute error. Compared with existing SOTA methods, the proposed method improves overall delay estimation accuracy (within 10 cm) by 3.36%, reduces the mean absolute error by 11.53%, and reduces the root mean square error by 16.07%. The proposed model is also more compact and computationally more efficient than existing methods. These findings indicate strong overall performance of the proposed model in sound source localization applications.
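
The feature-fusion step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not name the weighting functions, so the three used here (unweighted cross-correlation, PHAT, and Roth) are assumptions, as are the function names `multi_gcc_features` and `peak_delays`, the `max_lag` parameter, and the per-row scaling. The sketch stacks one GCC curve per weighting into a matrix for a single microphone pair, which is the kind of input a CNN could be trained on.

```python
import numpy as np


def multi_gcc_features(x1, x2, max_lag, eps=1e-12):
    """Stack several weighted GCC curves for one microphone pair.

    Returns an array of shape (3, 2 * max_lag + 1); column j holds
    lag j - max_lag (in samples). Requires max_lag >= 1.
    """
    n = len(x1) + len(x2)               # zero-pad: linear, not circular, correlation
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = np.conj(X1) * X2            # positive lag means x2 lags x1

    # Illustrative weighting functions (an assumed set, not taken from the paper).
    weightings = [
        np.ones(cross.shape),           # plain cross-correlation
        1.0 / (np.abs(cross) + eps),    # PHAT: keep phase, discard magnitude
        1.0 / (np.abs(X1) ** 2 + eps),  # Roth: divide by auto-spectrum of x1
    ]

    rows = []
    for w in weightings:
        r = np.fft.irfft(w * cross, n=n)
        # Reorder so the row runs over lags -max_lag .. +max_lag.
        r = np.concatenate((r[-max_lag:], r[: max_lag + 1]))
        # Per-row scaling so the weightings are comparable (assumption).
        rows.append(r / (np.max(np.abs(r)) + eps))
    return np.stack(rows)


def peak_delays(features, max_lag):
    """Classical per-row estimate: lag of the largest GCC value, in samples."""
    return np.argmax(features, axis=1) - max_lag


# Usage: x2 is an exact 5-sample delayed copy of x1 (noise-free toy signal).
rng = np.random.default_rng(0)
s = rng.standard_normal(1000)
x1 = np.concatenate((s, np.zeros(24)))
x2 = np.concatenate((np.zeros(5), s, np.zeros(19)))
features = multi_gcc_features(x1, x2, max_lag=20)
print(features.shape, peak_delays(features, max_lag=20))
```

In the proposed method the matrix would be passed to the network rather than to a peak picker; `peak_delays` is included only as a sanity check on the features. For the classification variant, one plausible labeling (again an assumption) is the column index of the true lag, while the regression variant would predict the delay value directly.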

Keywords