Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki (Mar 2018)
AUDIO-VISUAL SPEECH PROCESSING AND ANALYSIS BASED ON SUBSPACE PROJECTIONS
Abstract
Subject of Research. The paper addresses the mutual reconstruction (transformation) of the acoustic and visual components (modalities) of speech. The audio recording of the voice is the acoustic component, and the parallel video recording of the speaker’s face is the visual component. Because the two modalities differ in physical nature, their joint analysis is difficult; reconstruction methods can be used to overcome these difficulties. Method. The proposed approach is based on principal component analysis (PCA), multiple linear regression (MLR), partial least squares (PLS) regression, and the K-means clustering algorithm. Attention is also paid to data preprocessing. Mel-frequency cepstral coefficients (MFCCs) serve as acoustic features, and twenty key points representing the mouth contour serve as visual features. Main Results. Experiments on reconstructing the mouth contour from MFCCs are presented. They were carried out on the VidTIMIT dataset of audio-visual recordings of English phrases. Four variants of the proposed approach were tested and evaluated: one based on PCA and one based on PLS regression, each with and without clustering. Quantitative (objective) and qualitative (subjective) assessment confirmed the efficiency of the proposed approach. The implementation based on PLS regression with preliminary clustering gave the best results. Practical Relevance. The proposed approach can be used to develop bimodal biometric systems, voice-driven virtual “avatars”, mobile access control systems, and other human-computer interaction solutions. Moreover, it is shown that, given a proper implementation, PCA and PLS significantly reduce the computational complexity of the reconstruction operation. In addition, the clustering step can be omitted to further increase processing speed at the cost of slightly lower reconstruction quality.
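As an illustration of the kind of pipeline the abstract describes, the following is a minimal sketch of the PCA plus MLR variant, with and without preliminary K-means clustering. It is not the authors' implementation. Several things are assumptions made for the example: synthetic data stands in for VidTIMIT, each frame is taken to have 13 MFCCs, the twenty key points are stored as 40 values (x and y per point), and the function names, component counts, and cluster count are all hypothetical. The PLS regression variant is not shown.

```python
import numpy as np

def fit_pca(X, k):
    """Return the data mean and the top-k principal directions (as rows)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def with_bias(S):
    """Append a column of ones so the regression has an intercept."""
    return np.hstack([S, np.ones((len(S), 1))])

def fit_pca_mlr(A, V, ka, kv):
    """PCA on each modality, then least-squares regression between score spaces."""
    a_mean, a_comp = fit_pca(A, ka)
    v_mean, v_comp = fit_pca(V, kv)
    Sa = (A - a_mean) @ a_comp.T              # acoustic scores, shape (n, ka)
    Sv = (V - v_mean) @ v_comp.T              # visual scores, shape (n, kv)
    W, *_ = np.linalg.lstsq(with_bias(Sa), Sv, rcond=None)
    return {"a_mean": a_mean, "a_comp": a_comp,
            "v_mean": v_mean, "v_comp": v_comp, "W": W}

def predict(model, A):
    """Project acoustic frames, regress to visual scores, map back to key points."""
    Sa = (A - model["a_mean"]) @ model["a_comp"].T
    Sv = with_bias(Sa) @ model["W"]
    return Sv @ model["v_comp"] + model["v_mean"]

def assign(X, centers):
    """Index of the nearest center (squared Euclidean distance) for each row."""
    d2 = ((X[:, None, :] - centers[None]) ** 2).sum(axis=-1)
    return np.argmin(d2, axis=1)

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd iterations; centers are initialised from random data rows."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = assign(X, centers)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def fit_clustered(A, V, n_clusters, ka, kv):
    """Cluster acoustic frames, then fit one PCA+MLR model per cluster."""
    centers = kmeans(A, n_clusters)
    labels = assign(A, centers)
    fallback = fit_pca_mlr(A, V, ka, kv)      # used for clusters that are too small
    models = []
    for j in range(n_clusters):
        mask = labels == j
        models.append(fit_pca_mlr(A[mask], V[mask], ka, kv)
                      if mask.sum() > ka + 1 else fallback)
    return centers, models

def predict_clustered(centers, models, A):
    """Route each frame to the model of its nearest acoustic cluster."""
    labels = assign(A, centers)
    out = np.empty((len(A), models[0]["v_mean"].size))
    for j, model in enumerate(models):
        mask = labels == j
        if mask.any():
            out[mask] = predict(model, A[mask])
    return out

def rel_rmse(pred, target):
    """Root-mean-square error divided by the standard deviation of the target."""
    return np.sqrt(np.mean((pred - target) ** 2)) / target.std()

def make_synthetic(n=600, n_mfcc=13, n_points=20, latent=5, seed=0):
    """Synthetic stand-in: both modalities are noisy linear views of one latent signal."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, latent))
    A = Z @ rng.normal(size=(latent, n_mfcc)) + 0.01 * rng.normal(size=(n, n_mfcc))
    V = Z @ rng.normal(size=(latent, 2 * n_points)) + 0.01 * rng.normal(size=(n, 2 * n_points))
    return A, V

if __name__ == "__main__":
    A, V = make_synthetic()
    model = fit_pca_mlr(A[:500], V[:500], ka=5, kv=5)
    centers, models = fit_clustered(A[:500], V[:500], n_clusters=3, ka=5, kv=5)
    print("no clustering:", rel_rmse(predict(model, A[500:]), V[500:]))
    print("with clustering:", rel_rmse(predict_clustered(centers, models, A[500:]), V[500:]))
```

In this sketch the regression is solved between two low-dimensional score spaces, so the fitted matrix has only (ka + 1) × kv entries. That is one plausible reading of how subspace projection can lower the cost of the reconstruction step. Because the synthetic relation is globally linear, clustering is not expected to help here; any benefit on real speech data would have to be checked on the actual recordings.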
Keywords