Whispered Speech Detection Using Glottal Flow-Based Features

Khomdet Phapatanaburi; Wongsathon Pathonsuwan; Longbiao Wang; Patikorn Anchuen; Talit Jumphoo; Prawit Buayai; Monthippa Uthansakul; Peerapong Uthansakul

doi:10.3390/sym14040777

Symmetry (Apr 2022)

Affiliations

Khomdet Phapatanaburi: Department of Telecommunication Engineering, Faculty of Engineering and Technology, Rajamangala University of Technology Isan (RMUTI), Nakhon Ratchasima 30000, Thailand
Wongsathon Pathonsuwan: School of Telecommunication Engineering, Suranaree University of Technology, Nakhon Ratchasima 30000, Thailand
Longbiao Wang: Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
Patikorn Anchuen: Navaminda Kasatriyadhiraj Royal Air Force Academy, Bangkok 10220, Thailand
Talit Jumphoo: School of Telecommunication Engineering, Suranaree University of Technology, Nakhon Ratchasima 30000, Thailand
Prawit Buayai: Graduate Faculty of Interdisciplinary Research, University of Yamanashi, Kofu 400-8511, Japan
Monthippa Uthansakul: School of Telecommunication Engineering, Suranaree University of Technology, Nakhon Ratchasima 30000, Thailand
Peerapong Uthansakul: School of Telecommunication Engineering, Suranaree University of Technology, Nakhon Ratchasima 30000, Thailand

DOI: https://doi.org/10.3390/sym14040777
Journal volume & issue: Vol. 14, no. 4
p. 777

Abstract

Recent studies have reported that the performance of Automatic Speech Recognition (ASR) systems designed for normal speech deteriorates notably when they are evaluated on whispered speech. Detecting whispered speech is therefore useful for reducing the mismatch between training and testing conditions. This paper proposes two new Glottal Flow (GF)-based features for whispered speech detection: the GF-based Mel-Frequency Cepstral Coefficient (GF-MFCC), a magnitude-based feature, and the GF-based relative phase (GF-RP), a phase-based feature. The main contribution of the proposed features is that they extract magnitude and phase information from the GF signal. In GF-MFCC, Mel-frequency cepstral coefficient (MFCC) feature extraction is modified so that the GF signal estimated by iterative adaptive inverse filtering replaces the raw speech signal as the input. Similarly, GF-RP modifies relative phase (RP) feature extraction by using the GF signal instead of the raw speech signal. Whispered speech production yields a lower-amplitude glottal source than normal speech production; thus, the Discrete Fourier Transform (DFT) of whispered speech provides lower magnitude and phase information, which distinguishes it from normal speech. It is therefore hypothesized that both proposed features are useful for whispered speech detection. In addition, feature-level and score-level combinations that build on the individual GF-MFCC and GF-RP features are proposed to further improve detection performance. The proposed features and combinations are evaluated on the CHAIN corpus. GF-MFCC outperforms MFCC, and GF-RP outperforms RP. The feature-level combination of MFCC and GF-MFCC (MFCC&GF-MFCC) and that of RP and GF-RP (RP&GF-RP) each give further improvements over either constituent feature alone. Finally, combining the scores of MFCC&GF-MFCC and RP&GF-RP gives the best results: a frame-level accuracy of 95.01% and an utterance-level accuracy of 100%.
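
The two combination strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the equal fusion weight, the zero decision threshold, the convention that a positive score means "whispered", and the use of a mean frame score for the utterance-level decision are all assumptions made here for illustration, since the abstract does not specify them.

```python
from typing import List, Sequence


def combine_features(
    feats_a: Sequence[Sequence[float]],
    feats_b: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Feature-level combination: concatenate two per-frame feature vectors.

    Example use (hypothetical): feats_a holds MFCC frames and feats_b holds
    GF-MFCC frames computed over the same frame grid.
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("both feature streams must have the same number of frames")
    return [list(a) + list(b) for a, b in zip(feats_a, feats_b)]


def combine_scores(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    weight: float = 0.5,  # assumed fusion weight, not taken from the paper
) -> List[float]:
    """Score-level combination: weighted sum of two per-frame classifier scores."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(scores_a) != len(scores_b):
        raise ValueError("both score streams must have the same number of frames")
    return [weight * a + (1.0 - weight) * b for a, b in zip(scores_a, scores_b)]


def frame_decisions(scores: Sequence[float], threshold: float = 0.0) -> List[bool]:
    """Frame-level decision: True means 'whispered' (assumed sign convention)."""
    return [s > threshold for s in scores]


def utterance_decision(scores: Sequence[float], threshold: float = 0.0) -> bool:
    """Utterance-level decision from the mean frame score (an assumed rule)."""
    if not scores:
        raise ValueError("at least one frame score is required")
    return sum(scores) / len(scores) > threshold
```

In this sketch, the scores passed to `combine_scores` would come from two separately trained detectors, for example one trained on MFCC&GF-MFCC features and one on RP&GF-RP features. A single misclassified frame can be outvoted when scores are averaged over an utterance, which is consistent with utterance-level accuracy being higher than frame-level accuracy.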

Published in Symmetry

ISSN: 2073-8994 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science: Mathematics
Website: http://www.mdpi.com/journal/symmetry/
