IEEE Access (Jan 2024)

Robust Feature Extraction Using Temporal Context Averaging for Speaker Identification in Diverse Acoustic Environments

  • Yassin Terraf
  • Youssef Iraqi

DOI: https://doi.org/10.1109/ACCESS.2024.3356730
Journal volume & issue: Vol. 12, pp. 14094–14115

Abstract

Speaker identification in challenging acoustic environments, influenced by noise, reverberation, and emotional fluctuations, requires improved feature extraction techniques. Although existing methods effectively extract distinct acoustic features, they show limitations in these adverse settings. To overcome these limitations, we propose the Temporal Context-Enhanced Features (TCEF) approach, which provides a consistent audio representation for better performance under various acoustic conditions. TCEF leverages a context window to average features across adjacent frames, effectively reducing short-term variations caused by noise, reverberation, and fluctuations in both emotional and neutral speech. This approach enhances the distinctive characteristics of a speaker's voice, improving speaker identification in both challenging and neutral acoustic environments. To evaluate the performance of TCEF against conventional features, a One-Dimensional Convolutional Neural Network (1D-CNN) was used for detailed frame-level analysis and a Long Short-Term Memory (LSTM) network for comprehensive sequence-level analysis. We used four datasets to assess the effectiveness of the TCEF approach. The GRID and RAVDESS datasets represent neutral and emotional speech, respectively. To test the robustness of our system under adverse acoustic conditions, we created two additional datasets, GRID-NR and RAVDESS-NR: modified versions of the original GRID and RAVDESS that incorporate added noise and reverberation. Performance evaluation results showed that TCEF significantly outperformed existing feature extraction methods in identifying speakers across diverse acoustic environments.
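To make the averaging step concrete, the sketch below is a minimal illustration of the context-window idea described in the abstract, not the authors' reference implementation. It smooths a matrix of frame-level features (e.g., MFCCs) by averaging each frame with its neighbors; the function name, the `context` parameter, and the use of NumPy are all assumptions for this example.

```python
import numpy as np

def temporal_context_average(features: np.ndarray, context: int = 5) -> np.ndarray:
    """Average each frame's feature vector with its neighboring frames.

    features: array of shape (num_frames, feature_dim) holding frame-level
              features such as MFCCs.
    context:  number of frames taken on each side of the current frame
              (a hypothetical parameter; the paper would tune the window size).
    """
    num_frames, _ = features.shape
    smoothed = np.empty_like(features)
    for t in range(num_frames):
        # Clip the window at the utterance boundaries.
        lo = max(0, t - context)
        hi = min(num_frames, t + context + 1)
        smoothed[t] = features[lo:hi].mean(axis=0)
    return smoothed

# Example: smooth 200 frames of 13-dimensional features.
mfcc = np.random.randn(200, 13)
tcef = temporal_context_average(mfcc, context=5)
```

The trade-off implied by the abstract is that a wider window suppresses more short-term variation from noise, reverberation, or emotional fluctuation, but presumably risks smearing the speaker-specific dynamics the classifier relies on, so the window size would need tuning per acoustic condition.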

Keywords