IEEE Access (Jan 2023)
Audio-to-Visual Cross-Modal Generation of Birds
Abstract
Audio and visual data are both essential for precise investigation in many fields. Visual data is sometimes difficult to obtain even when auditory data is readily available, and in such cases generating visual data from audio data is very helpful. This paper proposes a novel audio-to-visual cross-modal generation approach. The proposed sound encoder extracts features from the auditory data, and a generative model generates images from those audio features. The model is expected to learn (i) a valid feature representation and (ii) the associations between generated images and audio inputs, so that it generates realistic and well-classified images. A new dataset, the Audio-Visual Corresponding Bird (AVC-B) dataset, was collected for this research; it contains the sounds and corresponding images of 10 different bird species. The experimental results show that the proposed method can generate class-appropriate images and achieves better classification results than state-of-the-art methods.
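The two-stage data flow described above (a sound encoder followed by a generator conditioned on the audio features) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `encode_sound` and `generate_image`, the toy dimensions, the mean-pooling encoder, the single linear layers, and the random weights that stand in for trained parameters. It shows only how tensors would pass from audio to image, not the architecture or training procedure used in this paper.

```python
import math
import random

random.seed(0)

# All sizes are illustrative assumptions, not values from the paper.
N_FRAMES, N_BINS = 8, 16   # toy spectrogram: time frames x frequency bins
EMBED_DIM = 4              # length of the audio feature vector
NOISE_DIM = 3              # length of the generator's noise input
IMG_SIZE = 4               # generated image is IMG_SIZE x IMG_SIZE x 3


def random_matrix(rows, cols):
    """Random weights standing in for trained parameters."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear_tanh(vector, weights):
    """Return tanh(vector @ weights) for a list vector and a row-major matrix."""
    cols = len(weights[0])
    return [
        math.tanh(sum(v * row[j] for v, row in zip(vector, weights)))
        for j in range(cols)
    ]


W_ENC = random_matrix(N_BINS, EMBED_DIM)
W_GEN = random_matrix(NOISE_DIM + EMBED_DIM, IMG_SIZE * IMG_SIZE * 3)


def encode_sound(spectrogram):
    """Hypothetical sound encoder: mean-pool over time, then one linear layer."""
    pooled = [
        sum(frame[b] for frame in spectrogram) / len(spectrogram)
        for b in range(N_BINS)
    ]
    return linear_tanh(pooled, W_ENC)


def generate_image(audio_feature, noise):
    """Hypothetical conditional generator: noise concatenated with audio features."""
    pixels = linear_tanh(noise + audio_feature, W_GEN)  # values lie in [-1, 1]
    # Reshape the flat pixel list into rows x columns x RGB channels.
    return [
        [pixels[(r * IMG_SIZE + c) * 3:(r * IMG_SIZE + c) * 3 + 3]
         for c in range(IMG_SIZE)]
        for r in range(IMG_SIZE)
    ]


# Placeholder input standing in for the spectrogram of a bird call.
spectrogram = [[random.random() for _ in range(N_BINS)] for _ in range(N_FRAMES)]
feature = encode_sound(spectrogram)
noise = [random.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
image = generate_image(feature, noise)
print(len(feature), len(image), len(image[0]), len(image[0][0]))
```

In a trained system the two weight matrices would be replaced by learned networks, and the two objectives named in the abstract (a valid feature representation and a consistent audio-image association) would be enforced through the training losses, which this sketch omits.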
Keywords