IEEE Access (Jan 2019)

Dualattn-GAN: Text to Image Synthesis With Dual Attentional Generative Adversarial Network

  • Yali Cai,
  • Xiaoru Wang,
  • Zhihong Yu,
  • Fu Li,
  • Peirong Xu,
  • Yueli Li,
  • Lixian Li

DOI
https://doi.org/10.1109/ACCESS.2019.2958864
Journal volume & issue
Vol. 7
pp. 183706 – 183716

Abstract


Recent generative adversarial network (GAN) based methods have shown promising results for the appealing but challenging task of synthesizing images from text descriptions. These approaches can generate images with roughly correct shape and color, but they often produce distorted global structures and unnatural local semantic details. This is due to the ineffectiveness of convolutional neural networks in capturing high-level semantic information for pixel-level image synthesis. In this paper, we propose a Dual Attentional Generative Adversarial Network (DualAttn-GAN), in which dual attention modules are introduced to enhance local details and global structures by attending to related features from relevant words and different visual regions. As one of the dual modules, the textual attention module is designed to explore the fine-grained interaction between vision and language. On the other hand, the visual attention module models internal representations of vision along the channel and spatial axes, which better captures global structures. Meanwhile, we apply an attention embedding module to merge multi-path features. Furthermore, we present an inverted residual structure to boost the representation power of CNNs and apply spectral normalization to stabilize GAN training. With extensive experimental validation on two benchmark datasets, our method significantly improves over state-of-the-art models on the evaluation metrics of inception score and Fréchet inception distance.
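
The visual attention over channel and spatial axes and the spectral normalization mentioned in the abstract can be illustrated with a minimal PyTorch-style sketch. The module names, layer sizes, and pooling choices below are assumptions for illustration, not the authors' implementation.

```python
# Minimal, illustrative sketch (not the paper's code) of a visual attention
# block that attends over channel and spatial axes, plus a spectrally
# normalized convolution. Layer sizes and names are assumptions.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Re-weights feature channels using globally pooled statistics."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))        # (B, C) channel weights
        return x * w.view(b, c, 1, 1)


class SpatialAttention(nn.Module):
    """Re-weights spatial positions using pooled channel statistics."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)       # (B, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)      # (B, 1, H, W)
        mask = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * mask


class VisualAttention(nn.Module):
    """Channel attention followed by spatial attention over image features."""
    def __init__(self, channels: int):
        super().__init__()
        self.channel_attn = ChannelAttention(channels)
        self.spatial_attn = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.spatial_attn(self.channel_attn(x))


# Spectral normalization can be applied to a convolution with PyTorch's
# built-in utility to stabilize GAN training:
sn_conv = nn.utils.spectral_norm(nn.Conv2d(64, 128, 4, stride=2, padding=1))

if __name__ == "__main__":
    feats = torch.randn(2, 64, 32, 32)          # dummy image features
    out = VisualAttention(64)(feats)
    print(out.shape)                            # torch.Size([2, 64, 32, 32])
```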

Keywords