IEEE Access (Jan 2025)
VT2Music: A Multimodal Framework for Text-Visual Guided Music Generation and Comprehensive Performance Analysis
Abstract
Recent years have witnessed significant advances in text-to-music generation through deep learning, particularly latent diffusion models (LDMs), yet artificial intelligence (AI) music composition systems capable of generating music from other modalities remain scarce. Given the intricate relationship between visual perception and auditory experience in human cognition, music generation from multimodal data holds considerable promise for creating more diverse and enriched musical experiences. To address this gap, we propose VT2Music (various things to music), a multimodal music generation model based on diffusion transformers (DiT) that generates semantically aligned music from textual and visual inputs. The framework supports music generation from a single modality (text, image, or video) as well as from combined multimodal inputs (such as text+image or text+video). Objective and subjective evaluations demonstrate that VT2Music generates music that reasonably aligns with the semantic and emotional content of the input, achieving performance comparable to current mainstream music generation models across multiple assessment metrics. This study is an initial exploration of multimodal music generation; future work will focus on enhancing the model’s visual feature comprehension and musical naturalness.
Keywords