IEEE Access (Jan 2025)
AMT-Net: Adversarial Motion Transfer Network With Disentangled Shape and Pose for Realistic Image Animation
Abstract
Advances in computer vision enable motion transfer for animating static objects in images. However, current methods rely on manually collected motion labels and struggle to represent shape and pose accurately, particularly for human bodies, due to occlusions and background variations. To address these issues, we propose an Adversarial Motion Transfer Network with a disentangled Shape and Pose representation for realistic image Animation (AMT-Net), built on an encoder-decoder adversarial structure. Specifically, we design a pose and shape learning module that captures independent shape and pose information by training a discriminator with an adversarial loss, which improves the coherence of the generated animated frames. Furthermore, a motion estimation module is introduced to generate masks for objects in consecutive frames and to identify occluded parts by constructing occlusion maps from these masks and dense motion vectors. To evaluate the effectiveness of our approach, we conducted extensive experiments on four publicly available datasets: VoxCeleb, TaiChiHD, TED-Talks, and MGif. The results highlight the importance of landmark detection for video annotation and smooth frame transitions, and show that the disentangled shape and pose module yields more precise object representations.
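To make the encoder-decoder adversarial structure described above concrete, the following is a minimal PyTorch-style sketch of a generator with separate shape and pose codes and an image-level discriminator. All module names, channel sizes, image resolution, and loss weights are illustrative assumptions, not the authors' implementation, and the motion estimation and occlusion-map module is omitted.

```python
# Hypothetical sketch of an encoder-decoder adversarial setup with
# disentangled shape and pose codes; shapes and weights are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class Encoder(nn.Module):
    """Shared backbone with separate heads for shape (appearance) and pose."""
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = nn.Sequential(conv_block(3, 32), conv_block(32, 64), conv_block(64, 128))
        self.shape_head = nn.Linear(128, dim)
        self.pose_head = nn.Linear(128, dim)

    def forward(self, x):
        h = F.adaptive_avg_pool2d(self.backbone(x), 1).flatten(1)
        return self.shape_head(h), self.pose_head(h)

class Decoder(nn.Module):
    """Synthesizes a frame from a (shape, pose) code pair."""
    def __init__(self, dim=128):
        super().__init__()
        self.fc = nn.Linear(2 * dim, 128 * 8 * 8)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, shape_code, pose_code):
        h = self.fc(torch.cat([shape_code, pose_code], 1)).view(-1, 128, 8, 8)
        return self.up(h)

class Discriminator(nn.Module):
    """Image-level discriminator for the adversarial loss."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(conv_block(3, 32), conv_block(32, 64), conv_block(64, 128))
        self.head = nn.Linear(128, 1)

    def forward(self, x):
        return self.head(F.adaptive_avg_pool2d(self.net(x), 1).flatten(1))

# One illustrative generator step: animate the source frame with the pose of
# the driving frame, i.e. decode (shape of source, pose of driving).
enc, dec, disc = Encoder(), Decoder(), Discriminator()
source, driving = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)

shape_src, _ = enc(source)
_, pose_drv = enc(driving)
generated = dec(shape_src, pose_drv)

rec_loss = F.l1_loss(generated, driving)                # reconstruction term
adv_loss = F.binary_cross_entropy_with_logits(          # generator adversarial term
    disc(generated), torch.ones(generated.size(0), 1))
(rec_loss + 0.1 * adv_loss).backward()
```

In this toy setup the reconstruction term ties the generated frame to the driving frame, while the adversarial term pushes outputs toward realistic imagery; the disentanglement arises from always recombining the shape code of one frame with the pose code of another, in the spirit of the module described in the abstract.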
Keywords