Multimodal few-shot classification without attribute embedding

Jun Qing Chang; Deepu Rajan; Nicholas Vun

doi:10.1186/s13640-024-00620-9

EURASIP Journal on Image and Video Processing (Jan 2024)

Affiliations

Jun Qing Chang: Nanyang Technological University
Deepu Rajan: Nanyang Technological University
Nicholas Vun: Nanyang Technological University

DOI: https://doi.org/10.1186/s13640-024-00620-9
Journal volume & issue: Vol. 2024, no. 1
pp. 1 – 15

Abstract

Multimodal few-shot learning aims to exploit the complementary information inherent in multiple modalities for vision tasks in low-data scenarios. Most current research focuses on finding a suitable embedding space for the various modalities. While embedding-based solutions provide state-of-the-art results, they reduce the interpretability of the model. Separate visualization approaches can make such models more transparent. This paper presents a multimodal few-shot learning framework that is inherently interpretable. This is achieved by using the textual modality in the form of attributes without embedding them, which enables the model to directly explain which attributes caused it to classify an image into a particular class. The model consists of a variational autoencoder that learns the visual latent representation, combined with a semantic latent representation learnt by a standard autoencoder, which calculates a semantic loss between the latent representation and a binary attribute vector. A decoder reconstructs the original image from the concatenated latent vectors. The proposed model outperforms other multimodal methods when all test classes are used, e.g., 50 classes in a 50-way 1-shot setting, and is comparable for a smaller number of ways. Since raw text attributes are used, the datasets for evaluation are CUB, SUN and AWA2. The effectiveness of the interpretability provided by the model is evaluated by analyzing how well it has learnt to identify the attributes.
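
The loss structure described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not state the form of the semantic loss, so binary cross-entropy is assumed here, and all function names, the 0.5 threshold, and the example attribute names are hypothetical. The sketch shows the standard VAE regulariser for the visual latent, a semantic loss comparing a semantic latent with a binary attribute vector, the concatenation that would be passed to a decoder, and how an unembedded semantic latent can be read back as attribute names.

```python
import math


def sigmoid(x):
    """Map a real-valued activation to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)) used by a standard VAE."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def semantic_loss(semantic_logits, attributes):
    """Mean binary cross-entropy between the semantic latent and a 0/1 attribute vector.

    Assumption: the abstract only says a semantic loss is computed; BCE is one
    plausible choice, not necessarily the one used in the paper.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, a in zip(semantic_logits, attributes):
        p = sigmoid(z)
        total -= a * math.log(p + eps) + (1 - a) * math.log(1.0 - p + eps)
    return total / len(attributes)


def concat_latents(visual_z, semantic_logits):
    """Concatenate the visual latent with the semantic latent (decoder input)."""
    return list(visual_z) + [sigmoid(z) for z in semantic_logits]


def explain(semantic_logits, attribute_names, threshold=0.5):
    """List the attributes the semantic latent marks as present (hypothetical threshold)."""
    return [name for z, name in zip(semantic_logits, attribute_names)
            if sigmoid(z) > threshold]


if __name__ == "__main__":
    names = ["striped", "has_wings", "furry"]  # illustrative attribute names
    logits = [3.0, -3.0, 1.0]
    print(explain(logits, names))
    print(semantic_loss(logits, [1, 0, 1]))
    print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
```

Because each dimension of the semantic latent corresponds to one named attribute rather than to a coordinate in a learnt embedding space, the explanation step is a direct lookup, which is the interpretability property the abstract emphasises.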

Published in EURASIP Journal on Image and Video Processing

ISSN: 1687-5176 (Print); 1687-5281 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics
Website: https://jivp-eurasipjournals.springeropen.com
