IEEE Access (Jan 2024)
UniMotion-DM: Uniform Text-Motion Generation and Editing via Diffusion Model
Abstract
Diffusion models have demonstrated substantial success in controllable generation for continuous modalities, positioning them as highly suitable for tasks such as human motion generation. However, existing approaches are typically limited to single-task applications, such as text-to-motion generation, and often lack versatility and editing capabilities. To overcome these limitations, we propose UniMotion-DM, a unified framework for both text-motion generation and editing based on diffusion models. UniMotion-DM integrates three core components: 1) a Contrastive Text-Motion Variational Autoencoder (CTMV), which aligns text and motion in a shared latent space using contrastive learning; 2) a controllable diffusion model tailored to the CTMV representation for generating and editing multimodal content; and 3) a Multimodal Conditional Representation and Editing (MCRE) module that leverages CLIP embeddings to enable precise and flexible control across various tasks. By seamlessly handling text-to-motion generation, motion captioning, motion completion, and multimodal editing, UniMotion-DM achieves significant improvements in both quantitative and qualitative evaluations. Beyond conventional domains such as gaming and virtual reality, we emphasize UniMotion-DM’s potential in underexplored fields such as healthcare and the creative industries. For example, UniMotion-DM could be used to generate personalized physical therapy routines or assist designers in rapidly prototyping motion-based narratives. By addressing these emerging applications, UniMotion-DM paves the way for applying multimodal generative models in interdisciplinary and socially impactful areas.
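To make the roles of the components concrete, the following is a minimal, illustrative PyTorch sketch of how a latent text-motion diffusion pipeline of this kind can be wired together: a VAE-style encoder produces a shared latent, and a conditional denoiser is trained in that latent space with a text embedding as the condition. All class names, dimensions, and the noise schedule here are assumptions made for illustration and are not taken from the paper.

import torch
import torch.nn as nn

class CTMVEncoder(nn.Module):
    """Toy stand-in for the contrastive text-motion VAE encoder:
    maps a motion sequence to mean/log-variance of a shared latent."""
    def __init__(self, motion_dim=263, latent_dim=256):
        super().__init__()
        self.backbone = nn.GRU(motion_dim, latent_dim, batch_first=True)
        self.to_mu = nn.Linear(latent_dim, latent_dim)
        self.to_logvar = nn.Linear(latent_dim, latent_dim)

    def forward(self, motion):                      # motion: (B, T, motion_dim)
        _, h = self.backbone(motion)                # h: (1, B, latent_dim)
        h = h.squeeze(0)
        return self.to_mu(h), self.to_logvar(h)

class LatentDenoiser(nn.Module):
    """Toy conditional denoiser operating in the latent space,
    conditioned on a text embedding (e.g. from CLIP)."""
    def __init__(self, latent_dim=256, cond_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, 512), nn.SiLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, z_t, t, cond):
        t = t.float().unsqueeze(-1) / 1000.0        # scalar timestep feature
        return self.net(torch.cat([z_t, cond, t], dim=-1))

def training_step(encoder, denoiser, motion, text_emb, num_steps=1000):
    """One DDPM-style epsilon-prediction step on a sampled latent."""
    mu, logvar = encoder(motion)
    z0 = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)      # sample latent
    t = torch.randint(0, num_steps, (z0.shape[0],))
    alpha_bar = torch.cos(t.float() / num_steps * torch.pi / 2) ** 2  # toy schedule
    noise = torch.randn_like(z0)
    z_t = alpha_bar.sqrt().unsqueeze(-1) * z0 + (1 - alpha_bar).sqrt().unsqueeze(-1) * noise
    pred = denoiser(z_t, t, text_emb)
    return nn.functional.mse_loss(pred, noise)

if __name__ == "__main__":
    enc, den = CTMVEncoder(), LatentDenoiser()
    motion = torch.randn(4, 60, 263)       # batch of 4 motions, 60 frames each
    text_emb = torch.randn(4, 512)         # placeholder for CLIP text embeddings
    print(training_step(enc, den, motion, text_emb).item())

A full system would replace the GRU encoder and MLP denoiser with the paper's actual architectures and obtain the conditioning vector from a frozen CLIP text encoder, but the overall structure of diffusing in a shared text-motion latent space is the same.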
Keywords