IEEE Access (Jan 2019)
Fully Convolutional CaptionNet: Siamese Difference Captioning Attention Model
Abstract
Generating textual descriptions of the differences between images is a relatively new task that requires fusing computer vision and natural language processing techniques. In this paper, we present a novel Fully Convolutional CaptionNet (FCC) that employs an encoder-decoder framework to extract visual features, compute feature distances, and generate sentences describing the measured differences. After the image features are extracted, a contrastive function computes their weighted L1 distance, which is learned and selectively attended to identify salient feature regions at every time step. The attended feature region is iteratively matched to corresponding words until a sentence is completed. We further propose applying an upsampling network to enlarge the features' field of view, which provides a robust pixel-level discrepancy computation. Extensive experiments indicate that the FCC model outperforms other learning models on the benchmark Spot-the-Diff dataset, generating succinct and meaningful textual descriptions of image differences.
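To make the core operation concrete, the sketch below shows one plausible way to compute a weighted L1 difference between two image feature maps and attend over its spatial regions at a single decoding step. This is an illustrative reading of the abstract, not the authors' released code; the module name, tensor dimensions, learnable channel weighting `w`, and the additive attention scorer are all assumptions.

```python
# Minimal sketch (assumptions noted above): weighted L1 difference of two
# Siamese feature maps, followed by spatial attention conditioned on the
# decoder hidden state at one time step.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferenceAttention(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512):
        super().__init__()
        # Learnable per-channel weighting of the L1 distance (assumed form).
        self.w = nn.Parameter(torch.ones(feat_dim))
        self.att = nn.Linear(feat_dim + hidden_dim, 1)

    def forward(self, feat_a, feat_b, decoder_state):
        # feat_a, feat_b: (B, C, H, W) feature maps of the "before"/"after" images.
        diff = self.w.view(1, -1, 1, 1) * torch.abs(feat_a - feat_b)  # weighted L1
        B, C, H, W = diff.shape
        regions = diff.view(B, C, H * W).transpose(1, 2)              # (B, HW, C)
        # Score each spatial region against the current decoder hidden state.
        state = decoder_state.unsqueeze(1).expand(-1, H * W, -1)      # (B, HW, D)
        scores = self.att(torch.cat([regions, state], dim=-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)                             # attention weights
        context = torch.bmm(alpha.unsqueeze(1), regions).squeeze(1)   # (B, C) attended feature
        return context, alpha
```

In a full captioning loop, `context` would be fed to the language decoder at each time step to predict the next word of the difference description.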
Keywords