Jisuanji kexue (Feb 2022)

Text-to-Image Generation Technology Based on Transformer Cross Attention

  • TAN Xin-yue, HE Xiao-hai, WANG Zheng-yong, LUO Xiao-dong, QING Lin-bo

DOI
https://doi.org/10.11896/jsjkx.210600085
Journal volume & issue
Vol. 49, no. 2
pp. 107–115

Abstract

In recent years, research on text-to-image generation methods based on generative adversarial networks (GANs) has continued to grow in popularity and has made notable progress. The key to text-to-image generation is to build a bridge between textual information and visual information, prompting the model to generate realistic images that are consistent with the corresponding text description. At present, the mainstream approach is to encode the input text descriptions with a pre-trained text encoder, but such methods do not consider semantic alignment with the corresponding image inside the text encoder; the input text is encoded independently, ignoring the semantic gap between the language space and the image space. To address this problem, this paper proposes a generative adversarial network based on a cross-attention encoder (CAE-GAN). The network uses the cross-attention encoder to translate and align textual information with visual information and to capture the cross-modal mapping between text and image, thereby improving both the fidelity of the generated images and their consistency with the input text description. Experimental results show that, compared with the DM-GAN model, the inception score (IS) of CAE-GAN increases by 2.53% and 1.54% on the CUB and COCO datasets, respectively, while the Fréchet inception distance (FID) decreases by 15.10% and 5.54%, respectively, indicating that the images generated by CAE-GAN are more detailed and of higher quality.
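The cross-attention idea described in the abstract can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch implementation, not the paper's actual code: word features act as attention queries over projected image region features, so each word representation is grounded in the visual content it describes. All names (CrossAttentionEncoder, text_dim, image_dim) and all dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class CrossAttentionEncoder(nn.Module):
    """Sketch of a cross-attention text encoder: text queries attend
    over image region features (assumed design, not the authors' code)."""
    def __init__(self, text_dim=256, image_dim=256, num_heads=4):
        super().__init__()
        # Project image region features into the text feature space.
        self.img_proj = nn.Linear(image_dim, text_dim)
        self.attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, n_words, text_dim)    word embeddings (queries)
        # image_feats: (batch, n_regions, image_dim) region features (keys/values)
        img = self.img_proj(image_feats)
        attended, _ = self.attn(query=text_feats, key=img, value=img)
        # Residual connection keeps the original linguistic semantics
        # while adding visual alignment.
        return self.norm(text_feats + attended)

# Toy usage: 16 word tokens attending over 36 image regions.
enc = CrossAttentionEncoder()
text = torch.randn(2, 16, 256)
regions = torch.randn(2, 36, 256)
aligned_text = enc(text, regions)  # (2, 16, 256) visually grounded word features

In a CAE-GAN-style pipeline, such visually grounded text features would then condition the GAN generator; the key design point conveyed by the abstract is that text encoding is no longer independent of the image modality.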

Keywords