IEEE Access (Jan 2024)
Diff-KT: Text-Driven Image Editing by Knowledge Enhancement and Mask Transformer
Abstract
Recent advances in text-to-image generation have demonstrated significant progress, particularly with diffusion-based models conditioned on textual prompts, which excel in image quality and diversity. However, these methods often suffer from a semantic gap between the image and text modalities and from imprecise localization during text-based image editing. To address these challenges, we propose the Diffusion-based Knowledge-enhanced Mask Transformer (Diff-KT), a text-to-image model. Diff-KT leverages knowledge-enhancement strategies to incorporate fine-grained textual and visual knowledge of key scene elements, thereby improving the fidelity and textual consistency of generated images. It further strengthens the controllability of textual influences on image generation by using masks to precisely target the image regions to be edited. To enable a deeper fusion of visual and textual information, we introduce a multimodal pre-trained model, CoCa, to extract joint representations of images and text, enriching the detail of generated images. Diff-KT improves the correlation between text and generated images and increases localization precision within the diffusion model, yielding high-quality results. Experimental results validate the advantages of Diff-KT, demonstrating a higher correlation between generated images and text prompts as well as more accurate localization during text-guided image editing, underscoring its practical value.
Keywords