IEEE Access (Jan 2024)

Toward Unsupervised Visual Reasoning: Do Off-the-Shelf Features Know How to Reason?

  • Monika Wysoczanska,
  • Tom Monnier,
  • Tomasz Trzcinski,
  • David Picard

DOI
https://doi.org/10.1109/ACCESS.2024.3406261
Journal volume & issue
Vol. 12
pp. 76367 – 76378

Abstract

Recent advances in visual representation learning have produced a plethora of powerful features ready to use for numerous downstream tasks. In contrast to existing representation evaluations, which are typically based on image- or pixel-wise classification tasks, the goal of this work is to assess how well these features preserve meaningful information about the objects contained in a given image, such as their spatial locations, their visual properties, and their relative relationships. We propose to do so by evaluating them in the context of visual reasoning, where multiple objects with complex relationships and different attributes are at play. Our underlying assumption is that reasoning performance is strongly correlated with the quality of visual representations. More specifically, we introduce a protocol to evaluate visual representations for the task of Visual Question Answering. To decouple visual feature extraction from reasoning, we design a specific attention-based reasoning module of limited capacity, trained on the frozen visual representations to be evaluated, in a spirit similar to standard feature evaluations relying on shallow networks. This involves constraining both the complexity of the reasoning module and the size of its input. Using the proposed evaluation framework, we compare two types of visual representations, namely dense local features and object-centric ones, against the performance of a perfect image representation using the ground truth. We make three key findings: 1) all considered visual representations are far from extracting perfect visual information from a reasoning standpoint; 2) object-centric features better preserve the critical information necessary to perform basic reasoning; and 3) neither type of visual representation prevents learning spurious correlations when confronted with a smaller training set.
These findings stand in opposition to the excellent performance obtained by such off-the-shelf representations in typical evaluation protocols.
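To make the evaluation setup concrete, the following is a minimal sketch of the kind of limited-capacity, attention-based reasoning step the abstract describes: frozen visual features are attended with a question embedding and pooled through a small linear head. All names, shapes, and the single-head attention form are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def reasoning_step(features, question, w_out):
    """One attention step over frozen visual features (illustrative only).

    features: (N, D) frozen per-region or per-object embeddings (not updated)
    question: (D,)   question embedding used as the attention query
    w_out:    (D, A) small linear head mapping the pooled feature to answer logits
    """
    d = features.shape[1]
    scores = features @ question / np.sqrt(d)  # (N,) scaled dot-product scores
    attn = softmax(scores)                     # attention weights over regions
    pooled = attn @ features                   # (D,) attended visual summary
    return pooled @ w_out                      # (A,) answer logits

# Toy sizes: 10 image regions, 16-dim features, 5 candidate answers.
features = rng.standard_normal((10, 16))
question = rng.standard_normal(16)
w_out = rng.standard_normal((16, 5))
logits = reasoning_step(features, question, w_out)
print(logits.shape)  # (5,)
```

In the paper's protocol, only such a shallow head would be trained while the evaluated visual backbone stays frozen, so VQA accuracy reflects the quality of the representation rather than the capacity of the reasoning module.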

Keywords