工程科学与技术 (Advanced Engineering Sciences) (Mar 2025)

Weak Speech Signal Detection in High Interference Environment Based on Distributed Acoustic Sensing

  • Chensi ZHANG,
  • Maoning WANG,
  • Yuzhong ZHONG,
  • Jianwei ZHANG,
  • Yancai LIU,
  • Haiwei YAN,
  • Wei WANG,
  • Shiwei YAN

Journal volume & issue
Vol. 57
pp. 29–39

Abstract


Objective The distributed acoustic sensing (DAS) system can be applied to personnel search and rescue and voice signal localization in tunnel collapse accidents. However, as the front end of the speech signal processing system, existing voice activity detection (VAD) algorithms do not yield satisfactory results in detecting human voices from DAS speech data. Conducting experiments in a real tunnel environment presents several challenges: (1) the inability to manually annotate extensive DAS speech data makes it difficult to obtain labeled data for supervised training, and (2) owing to the noisy on-site environment and limited signal acquisition methods, DAS-collected speech signals are accompanied by substantial and complex high-energy noise, causing some VAD algorithms to lose robustness. Therefore, this study proposes a robust VAD algorithm (ST–ACF) based on short-term autocorrelation features.

Methods The algorithm investigates the acoustic characteristics of DAS speech by combining pitch information with the autocorrelation function to detect the harmonic features of speech frames. This enables the VAD algorithm to extract all actual human voices, even in DAS environments with an extremely low signal-to-noise ratio (SNR) below −10 dB. Because strong noise severely interferes with DAS speech, the ST–ACF algorithm consists of a denoising channel and a speech detection channel. DAS noise primarily consists of continuous high-frequency noise and sudden high-energy noise. In the denoising channel, based on the periodicity of pitch information in speech, a dual-channel time window is designed to suppress these two typical types of noise. Feature analysis of DAS speech data revealed that pitch periodicity and non-stationarity follow distinct patterns in speech and in these two types of noise. Continuous high-frequency noise lacks harmonic structure, presenting stable non-pitched characteristics. Sudden high-energy noise exhibits periodicity, with a traceable pitch trend at the moment of the burst. Speech resembles the latter, but because human speech is continuous, its pitch period changes over shorter intervals. ST–ACF therefore uses spectral flatness (SFT) as an indicator of whether a speech frame contains pitch, and the dual-channel time window captures short-term pitch changes. Within a continuous time window, the SFT curve of DAS speech is fitted to a cosine function, revealing multiple “valleys” in speech segments, indicating the presence of multiple pitched frames, which is not a characteristic of noise. The incorporation of the time window enables ST–ACF to detect different noise types more accurately, eliminating strong noise interference in VAD. In the speech detection channel, ST–ACF improves the spectral local harmonicity (SLH) feature. Although SLH is stable under low SNR conditions, its overall performance is suboptimal. Given that SLH values in speech exhibit larger magnitudes and variability than in noise, ST–ACF optimizes SLH along two dimensions of frame variation, the SLH feature value and its variability, multiplying the two to maximize the distinction between speech and noise. Because variability scales differ significantly across sound types, all variability values are normalized before this computation.
The improved ST–ACF thus considers both the magnitude of the feature values and their changing trends, enhancing the algorithm’s ability to handle critical frames and improving its accuracy in distinguishing noise from speech onset.

Results and Discussions VAD performance was evaluated using the frame error rate (FER). The test dataset comprises two parts: (1) authentic speech signals collected by the DAS system in the Ya’an Shuikou Tunnel, with an average SNR of −10.3 dB, and (2) simulated data generated by mixing noise from the NOISEX–92 dataset with human speech from the TIMIT database. The results indicate that ST–ACF exhibits minimal susceptibility to high-energy noise environments, remaining robust even at −10 dB with an FER of only 19.74%; compared with the −5 dB environment, the FER fluctuates by only about 2%. After optimization, ST–ACF achieves a 5.91% performance improvement over SLH. This enhancement is also evident on the DAS dataset, where ST–ACF attains its best performance, a 21.11% improvement. ST–ACF maintains robust performance across different noise sets, proving its capability to handle complex environments. The comparison across noise sets shows that the time window strategy, which assesses the invariance of audio features over a period, effectively eliminates stationary noise; LTSV, which follows a similar idea, likewise performs well in high-frequency stationary noise. Owing to the design of the proposed time window, ST–ACF can also handle non-stationary noise: even under the most challenging gunshot noise, its FER remains below 25%. The time window is the main contributor to ST–ACF’s improvement, accounting for a 5.07% gain on the DAS dataset, whereas optimizing the autocorrelation function yields a more modest 1.89% improvement. This is primarily because the time window removes a large portion of the noise, preventing complex noise from interfering with VAD, while the optimized autocorrelation function preserves the integrity of speech extraction.

Conclusions The maximum correlation between DAS noise and speech can be identified across multiple dimensions by integrating speech-algorithm principles with analysis of DAS speech data, enabling targeted detection of each. This research approach addresses certain shortcomings of existing VAD algorithms. ST–ACF achieves its intended objectives, fully extracting effective speech from DAS data while preserving the integrity of the speech signal. It performs remarkably well in low SNR environments, highlighting its potential for application in diverse and complex scenarios and paving the way for future research in DAS-based speech signal processing.
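As an illustration of the feature computations described in the Methods section, the following Python sketch shows how frame-level spectral flatness, a short-term autocorrelation pitch check, and the SLH value-times-variability combination might be implemented. It is a minimal, hypothetical reconstruction from the abstract alone: frame segmentation, the dual-channel time window, the cosine fitting of the SFT curve, and all thresholds are simplified assumptions, not the authors’ implementation.

```python
import numpy as np

def spectral_flatness(frame, eps=1e-12):
    """Spectral flatness (SFT): geometric mean over arithmetic mean of the
    magnitude spectrum. Near 1 for noise-like frames, small for pitched
    (harmonic) frames."""
    mag = np.abs(np.fft.rfft(frame)) + eps
    return np.exp(np.mean(np.log(mag))) / np.mean(mag)

def has_pitch(frame, fs, fmin=60.0, fmax=400.0, threshold=0.3):
    """Short-term autocorrelation pitch check: a strong autocorrelation peak
    inside a plausible pitch-lag range suggests a pitched (speech) frame.
    The lag range and threshold are illustrative values."""
    frame = frame - np.mean(frame)
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    acf /= acf[0] + 1e-12                      # normalize so acf[0] == 1
    lo, hi = int(fs / fmax), int(fs / fmin)    # candidate pitch lags
    return np.max(acf[lo:hi]) > threshold

def count_sft_valleys(sft_curve, threshold=0.3):
    """Count local minima ("valleys") of the SFT curve inside a time window.
    Speech contains several pitched frames per window, hence several valleys;
    continuous high-frequency noise and an isolated burst do not."""
    c = np.asarray(sft_curve, dtype=float)
    is_valley = (c[1:-1] < c[:-2]) & (c[1:-1] < c[2:]) & (c[1:-1] < threshold)
    return int(np.sum(is_valley))

def combine_slh(slh, eps=1e-12):
    """Optimized SLH: multiply the absolute SLH value of each frame by its
    normalized frame-to-frame variability, widening the speech/noise gap."""
    slh = np.asarray(slh, dtype=float)
    variability = np.abs(np.diff(slh, prepend=slh[0]))
    variability = (variability - variability.min()) / \
                  (variability.max() - variability.min() + eps)
    return np.abs(slh) * variability
```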
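The evaluation metric, frame error rate (FER), is the fraction of frames whose speech/non-speech decision disagrees with the reference labels. A minimal sketch under that common definition follows; the abstract does not specify the exact scoring protocol used in the paper.

```python
import numpy as np

def frame_error_rate(predicted, reference):
    """FER: proportion of frames whose predicted speech/non-speech label
    differs from the reference annotation."""
    predicted = np.asarray(predicted, dtype=bool)
    reference = np.asarray(reference, dtype=bool)
    return float(np.mean(predicted != reference))

# Example: an FER of 19.74% means roughly 1 in 5 frames is misclassified.
```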

Keywords