IEEE Access (Jan 2025)
VT2Music: A Multimodal Framework for Text-Visual Guided Music Generation and Comprehensive Performance Analysis
Abstract
Recent years have witnessed significant advances in text-to-music generation through deep learning, particularly latent diffusion models (LDMs), yet artificial intelligence (AI) music composition systems capable of generating music from other modalities remain scarce. Given the intricate relationship between visual perception and auditory experience in human cognition, music generation from multimodal data holds considerable promise for creating more diverse and enriched musical experiences. To address this gap, we propose VT2Music (various things to music), a multimodal music generation model based on diffusion transformers (DiT) that generates semantically aligned music from textual and visual inputs. The framework supports music generation from a single modality (text, image, or video) as well as from combined multimodal inputs (such as text+image or text+video). Objective and subjective evaluations demonstrate that VT2Music generates music that reasonably aligns with the semantic and emotional content of the input, achieving performance comparable to current mainstream music generation models across multiple assessment metrics. This study is an initial exploration of multimodal music generation; future work will focus on enhancing the model’s visual feature comprehension and musical naturalness.
Keywords