Intelligent Systems with Applications (Nov 2022)
Fusion of vectored text descriptors with auto extracted deep CNN features for improved image classification
Abstract
In today's age, massive volumes of image data are generated at a rapid pace. This influx has made labeling images tedious and, in turn, made it harder to retrieve images through search algorithms that rely only on labels, keywords, or other metadata attached to the images. Modern Content-Based Image Retrieval (CBIR) techniques instead rely on the visual features within an image to return relevant results for a search query. Deep Convolutional Neural Network (DCNN) models have made great strides over the last decade. This paper relies on such complex pre-trained models, namely VGG16, MobileNet, InceptionV3, and Xception, to extract visual features from images. Several studies in the CBIR space also suggest increased accuracy when both visual and textual features are considered. This paper therefore proposes a novel three-step process for obtaining textual features. First, the proposed model obtains keywords for each image using the Google Cloud Vision API. Second, it replaces each keyword with a 300-dimensional embedding vector obtained from word2vec trained on the Google News dataset. Third, it trains a combination of a Deep Semantic Similarity Model (DSSM) and a Long Short-Term Memory (LSTM) network to reduce each 300-dimensional vector to a 64-dimensional vector. Using these shortened word vectors, the proposed model computes cosine similarities and replaces each keyword of an image with its five closest synonyms. These additional steps increased accuracy compared with using a word-embedding technique alone. Finally, the proposed model combines the visual and textual feature vectors; this fused feature set achieves a maximum classification accuracy of 98.33%, which compares favorably with the results of other similar models.
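To make the visual-feature side of the pipeline concrete, the sketch below shows how deep features can be pulled from one of the pre-trained backbones named above using the Keras applications API. It is a minimal illustration, not the paper's exact setup: the choice of VGG16, the 224x224 input size, global average pooling, and the helper name visual_features are assumptions, since the abstract does not specify which layer or pooling the authors use.

```python
# Sketch: deep visual features from a pre-trained CNN (one of the backbones
# named in the abstract). Assumes TensorFlow/Keras; the feature layer used in
# the paper is unspecified, so global-average-pooled conv outputs stand in.
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.preprocessing import image

# ImageNet weights, no classification head, pooled to a single 512-d vector.
backbone = VGG16(weights="imagenet", include_top=False, pooling="avg")

def visual_features(img_path: str) -> np.ndarray:
    """Load an image and return its deep feature vector."""
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)[np.newaxis, ...]   # shape (1, 224, 224, 3)
    x = preprocess_input(x)
    return backbone.predict(x, verbose=0)[0]       # shape (512,)
```

Swapping VGG16 for MobileNet, InceptionV3, or Xception only requires changing the imported model class and its matching preprocess_input function.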
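The textual side can be sketched in the same spirit. The snippet below assumes gensim and the publicly distributed GoogleNews-vectors-negative300.bin word2vec binary; it expands each keyword with its five most cosine-similar words and fuses the result with the visual descriptor by concatenation. It deliberately omits the paper's DSSM + LSTM reduction to 64 dimensions (the raw 300-d vectors are used instead), and the averaging in textual_features, the helper names, and the fusion-by-concatenation step are all illustrative assumptions rather than the authors' exact method.

```python
# Sketch: keyword embeddings, cosine-similarity synonym expansion, and
# feature fusion. Assumes gensim and the GoogleNews word2vec binary; the
# DSSM + LSTM reduction to 64-d described in the paper is not reproduced.
import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def textual_features(keywords: list[str], top_n: int = 5) -> np.ndarray:
    """Expand each keyword with its top_n cosine-similar words and
    average the resulting 300-d embeddings into one descriptor."""
    vectors = []
    for word in keywords:
        if word not in w2v:
            continue
        expanded = [word] + [w for w, _ in w2v.most_similar(word, topn=top_n)]
        vectors.extend(w2v[w] for w in expanded if w in w2v)
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v.vector_size)

def fused_features(img_vec: np.ndarray, txt_vec: np.ndarray) -> np.ndarray:
    """Concatenate visual and textual descriptors for a downstream classifier."""
    return np.concatenate([img_vec, txt_vec])
```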