IEEE Access (Jan 2024)

Diff-KT: Text-Driven Image Editing by Knowledge Enhancement and Mask Transformer

  • Hong Zhao,
  • Wengai Li,
  • Zhaobin Chang,
  • Ce Yang

DOI: https://doi.org/10.1109/ACCESS.2024.3442296
Journal volume & issue: Vol. 12, pp. 112948–112965

Abstract


Recent advances in text-to-image generation have demonstrated significant progress, especially with diffusion-based models conditioned on textual prompts, which excel in image quality and diversity. However, these methods often encounter a semantic gap between the image and text modalities and suffer from imprecise localization during text-based image editing. To address these challenges, we propose the Diffusion-based Knowledge-enhanced Mask Transformer (Diff-KT) text-to-image model. Diff-KT leverages knowledge enhancement strategies to incorporate fine-grained textual and visual knowledge of key scene elements, thereby improving the fidelity and textual consistency of generated images. Furthermore, it strengthens the controllability of textual influence on image generation by using masks to precisely target the regions of the image to be edited. To enable deeper fusion of visual and textual information, we introduce the multimodal pre-trained model CoCa to extract joint representations of images and text, enriching the detail of generated images. Diff-KT thus improves the correlation between text and generated images and sharpens localization within the diffusion model, yielding high-quality results. Experimental results validate these advantages, demonstrating stronger correlation between generated images and text prompts, as well as more accurate localization during text-guided image editing, underscoring the model's practical value.
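To make the mask-targeting idea concrete, the sketch below illustrates one common way masks restrict text-driven edits in a diffusion loop: at each denoising step, the text-conditioned update is kept inside the mask while the background is re-anchored to a noised copy of the source image. This is only an illustration of the general technique, not the paper's implementation; the names `masked_edit`, `denoise_step`, and `add_noise` are hypothetical stand-ins for a real diffusion model and noise scheduler.

```python
import torch

def masked_edit(latents_orig, mask, text_emb, denoise_step, add_noise, num_steps=50):
    """Illustrative mask-targeted editing loop (not the Diff-KT implementation).

    latents_orig: (B, C, H, W) latents of the source image
    mask:         (B, 1, H, W) 1.0 inside the region to edit, 0.0 elsewhere
    text_emb:     embedding of the edit prompt, used as the denoiser's condition
    denoise_step: callable(x_t, t, cond) -> x_{t-1}   (hypothetical denoiser)
    add_noise:    callable(x_0, t) -> x_t             (hypothetical scheduler)
    """
    x = torch.randn_like(latents_orig)           # start the edit from pure noise
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t, text_emb)         # text-conditioned denoising update
        bg = add_noise(latents_orig, t)          # source image at noise level t
        x = mask * x + (1.0 - mask) * bg         # edit inside mask, preserve background
    return x
```

Blending against the noised source at every step is what keeps the prompt's influence localized: only the masked region ever diverges from the original image's diffusion trajectory.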

Keywords