Proceedings on Engineering Sciences (Sep 2024)
A CONTACTLESS SPEAKER IDENTIFICATION APPROACH USING FEATURE-LEVEL FUSION OF SPEECH AND FACE CUES WITH DCNN
Abstract
This paper evaluates the effectiveness of feature-level fusion, via concatenation, of two independent and emerging modalities: speech and face. The major benefit of the face modality (physiological) is that data acquisition requires little user cooperation or awareness, as in airports or other public places where people move en masse. Speech-based recognition (physiological and behavioural) is among the most convenient and reliable identification techniques for disabled and illiterate users, owing to the ease of access to a contactless speech-receiving device. Furthermore, the adverse conditions affecting each modality, such as low illumination for face recognition and a noisy environment for speech recognition, arise independently of one another. Consequently, fusing the acoustic and distinctive facial features is paramount to achieving higher user identification accuracy. This paper surveys state-of-the-art techniques for data fusion, dimensionality reduction, feature extraction (speech and face) and classification. Based on these findings, we propose an efficient feature-level fusion of speech and face cues with a deep convolutional neural network (DCNN) as the classifier, evaluated on the VidTIMIT database. We tested the effectiveness of the proposed approach in terms of identification accuracy across different training sample sizes and numbers of users. The proposed approach achieves an accuracy of 97.31% and an EER of 3.62%, outperforming the unimodal speech and face biometric systems by 3.83% and 1.59%, respectively, as well as several existing methodologies. Thus, even under adverse conditions, such an approach can improve user identification.
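The fusion step described above can be illustrated with a minimal sketch: precomputed speech and face feature vectors are concatenated into one fused vector, which a small convolutional network maps to per-user scores. The feature dimensions (39 for speech, 128 for face), the 1-D convolutional head, and all layer sizes below are illustrative assumptions, not the paper's exact architecture; only the concatenation-then-DCNN structure follows the described approach.

import torch
import torch.nn as nn

class FusionDCNN(nn.Module):
    def __init__(self, speech_dim=39, face_dim=128, num_users=43):
        super().__init__()
        fused_dim = speech_dim + face_dim          # length after concatenation
        self.conv = nn.Sequential(                 # 1-D convolutions over the fused vector
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),               # pool to one value per channel
        )
        self.fc = nn.Linear(32, num_users)         # one logit per enrolled user

    def forward(self, speech_feat, face_feat):
        fused = torch.cat([speech_feat, face_feat], dim=1)   # feature-level fusion
        x = fused.unsqueeze(1)                               # (batch, 1, fused_dim)
        x = self.conv(x).squeeze(-1)                         # (batch, 32)
        return self.fc(x)

# Usage: identify one sample among 43 users (VidTIMIT contains 43 subjects).
model = FusionDCNN()
speech = torch.randn(1, 39)    # e.g., an MFCC-based vector (assumed dimensionality)
face = torch.randn(1, 128)     # e.g., a face embedding (assumed dimensionality)
predicted_user = model(speech, face).argmax(dim=1)

Concatenation is the simplest feature-level fusion rule: it keeps both feature sets intact and lets the classifier learn their joint weighting, which is why a degraded cue from one modality (noise, low light) can be compensated by the other.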