IEEE Access (Jan 2024)
Generating Diverse Image Variations With Diffusion Models by Combining Intra-Image Self-Attention and Decoupled Cross-Attention
Abstract
In this paper, we present a novel integration of Decoupled Cross-Attention and Intra-Image Self-Attention within a diffusion model framework to generate diverse and coherent image variations. Our approach leverages the Decoupled Cross-Attention mechanism from IP-Adapter to align the generated output more closely with the input image and its textual description, while Intra-Image Self-Attention operates on latent representations extracted through Denoising Diffusion Implicit Model (DDIM) inversion to capture fine-grained dependencies within the image. By interpolating noise in the diffusion process, we blend the influences of the two attention mechanisms, allowing precise control over both global and local features. The method proceeds in three steps: latent extraction, attention refinement, and noise interpolation. This integration improves both the semantic fidelity and the visual diversity of the generated images, making it well suited to applications that require detailed, contextually rich image synthesis. Our experiments show that the proposed method consistently outperforms traditional models at generating nuanced image variations that are visually appealing and aligned with the input prompts, demonstrating its effectiveness and its potential for creative industries and personalized media production.
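The abstract only outlines the three-step pipeline, so the following is a minimal sketch of how such a pipeline might be wired together, not the authors' implementation. The function `ddim_invert`, the tensor shapes, and the interpolation weight are hypothetical placeholders; the attention-refinement step is elided, and spherical interpolation is shown as one common way to blend Gaussian noise latents.

```python
# Illustrative sketch only: all names below are hypothetical stand-ins,
# not the paper's actual architecture or API.
import torch

def ddim_invert(image_latent: torch.Tensor, steps: int = 50) -> torch.Tensor:
    """Hypothetical stand-in for DDIM inversion, which maps an image latent
    back to a noise latent by running the deterministic DDIM process in
    reverse. Here it simply returns a like-shaped placeholder noise tensor."""
    return torch.randn_like(image_latent)

def slerp(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical interpolation between two noise latents, a common way to
    blend Gaussian noise without leaving the typical-noise manifold."""
    a_flat, b_flat = a.flatten(), b.flatten()
    omega = torch.acos(torch.clamp(
        torch.dot(a_flat / a_flat.norm(), b_flat / b_flat.norm()), -1.0, 1.0))
    if omega.abs() < 1e-6:  # nearly parallel: fall back to linear interpolation
        return (1.0 - t) * a + t * b
    so = torch.sin(omega)
    return (torch.sin((1.0 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b

# Step 1: latent extraction via (stand-in) DDIM inversion.
image_latent = torch.randn(1, 4, 64, 64)   # placeholder VAE latent
inverted_noise = ddim_invert(image_latent)

# Step 2: attention refinement (elided). Per the abstract, Decoupled
# Cross-Attention injects image features alongside text features, while
# Intra-Image Self-Attention refines dependencies within the inverted latent.

# Step 3: noise interpolation. Blend the inverted noise with fresh noise to
# trade identity preservation (t -> 0) against diversity (t -> 1).
fresh_noise = torch.randn_like(inverted_noise)
blended = slerp(inverted_noise, fresh_noise, t=0.4)
print(blended.shape)  # the blended latent would seed the denoising loop
```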
Keywords