A CONTACTLESS SPEAKER IDENTIFICATION APPROACH USING FEATURE-LEVEL FUSION OF SPEECH AND FACE CUES WITH DCNN

Khushboo Jha; Aruna Jain; Sumit Srivastava

doi:10.24874/PES06.03.018

Proceedings on Engineering Sciences (Sep 2024)

Affiliations

Khushboo Jha: Birla Institute of Technology Mesra, Ranchi-835215, India
Aruna Jain: Birla Institute of Technology Mesra, Ranchi-835215, India
Sumit Srivastava: Birla Institute of Technology Mesra, Ranchi-835215, India

DOI: https://doi.org/10.24874/PES06.03.018
Journal volume & issue: Vol. 6, pp. 1047 – 1056

Abstract

This paper evaluates the effectiveness of feature-level fusion, by concatenation, of two independent and emerging modalities: speech and face. The major benefit of the face modality (physiological) is that data acquisition requires little user cooperation or awareness, as seen with crowds in airports and other public places. For disabled and illiterate people, speech-based recognition (physiological and behavioural) is the most convenient and reliable user identification technique, because a contactless speech-receiving device is easy to access. Furthermore, the adverse conditions that affect data acquisition for each modality, such as low illumination for facial recognition and a noisy environment for speech recognition, are independent of one another. Consequently, fusing acoustic features with distinctive facial features is key to achieving higher user identification accuracy. This paper explores state-of-the-art techniques for data fusion, dimensionality reduction, feature extraction (speech and face) and classification. Based on these findings, we propose an efficient feature-level fusion of speech and face cues, with a deep convolutional neural network (DCNN) as the classifier, evaluated on the VidTIMIT database. We test the effectiveness of the proposed approach in terms of identification accuracy with different training sample sizes and numbers of users. The proposed user identification approach achieves an accuracy of 97.31% and an equal error rate (EER) of 3.62%, and outperforms the unimodal biometric systems for speech and face by 3.83% and 1.59% respectively. The proposed approach also outperforms several existing methodologies. We therefore infer that, even in the presence of adverse conditions, such an approach can improve user identification solutions.
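
As an illustration of feature-level fusion by concatenation, the following minimal Python sketch joins a speech feature vector and a face feature vector into a single fused vector. It is not the authors' implementation: the function names, the toy feature values and the per-modality z-score normalization step are assumptions made for the example, since the abstract does not specify the feature extractors or any normalization.

```python
from statistics import mean, pstdev


def z_normalize(vec):
    """Scale one modality's features to zero mean and unit variance.

    Assumption: normalization is applied so that neither modality
    dominates the fused vector; the abstract does not state this step.
    """
    mu = mean(vec)
    sigma = pstdev(vec)
    if sigma == 0:
        # A constant vector carries no variation; map it to zeros.
        return [0.0 for _ in vec]
    return [(x - mu) / sigma for x in vec]


def fuse_features(speech_vec, face_vec):
    """Feature-level fusion by concatenation: [speech | face]."""
    return z_normalize(speech_vec) + z_normalize(face_vec)


# Hypothetical feature vectors, for illustration only.
speech = [12.0, 14.0, 16.0]       # stand-in for acoustic features
face = [0.2, 0.4, 0.6, 0.8]       # stand-in for facial features

fused = fuse_features(speech, face)
print(len(fused))  # fused length is the sum of both input lengths
```

In a pipeline like the one the abstract describes, the fused vector would then be passed to the classifier (here, a DCNN), possibly after dimensionality reduction.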

Published in Proceedings on Engineering Sciences

ISSN: 2620-2832 (Print); 2683-4111 (Online)
Publisher: University of Kragujevac
Country of publisher: Serbia
LCC subjects: Technology: Engineering (General). Civil engineering (General)
Website: http://pesjournal.net
