IEEE Access (Jan 2024)

Toward Unsupervised Visual Reasoning: Do Off-the-Shelf Features Know How to Reason?

  • Monika Wysoczanska,
  • Tom Monnier,
  • Tomasz Trzcinski,
  • David Picard

DOI
https://doi.org/10.1109/ACCESS.2024.3406261
Journal volume & issue
Vol. 12
pp. 76367 – 76378

Abstract

Recent advances in visual representation learning have produced a plethora of powerful features ready to use for numerous downstream tasks. In contrast to existing representation evaluations, which are typically based on image- or pixel-wise classification tasks, the goal of this work is to assess how well these features preserve meaningful information about the objects contained in a given image, such as their spatial locations, their visual properties, and their relative relationships. We propose to do so by evaluating them in the context of visual reasoning, where multiple objects with complex relationships and different attributes are at play. Our underlying assumption is that reasoning performance is strongly correlated with the quality of visual representations. More specifically, we introduce a protocol to evaluate visual representations for the task of Visual Question Answering. To decouple visual feature extraction from reasoning, we design a specific attention-based reasoning module of limited capacity, trained on the frozen visual representations to be evaluated, in a spirit similar to standard feature evaluations relying on shallow networks. This involves constraining both the complexity of the reasoning module and the size of its input. Using the proposed evaluation framework, we compare two types of visual representations, namely dense local features and object-centric ones, against the performance of a perfect image representation using the ground truth. We make three key findings: 1) all considered visual representations are far from extracting perfect visual information from a reasoning standpoint; 2) object-centric features better preserve the critical information necessary to perform basic reasoning; and 3) neither type of visual representation prevents learning spurious correlations when confronted with a smaller training set.
These findings stand in opposition to the excellent performance obtained by such off-the-shelf representations in typical evaluation protocols.
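To make the evaluation setup concrete, the following is a minimal sketch of the kind of limited-capacity, attention-based reasoning step the abstract describes: frozen visual features are attended with a question embedding and pooled through a small linear head. All names, shapes, and the single-head attention form are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def reasoning_step(features, question, w_out):
    """One attention step over frozen visual features (illustrative only).

    features: (N, D) frozen per-region or per-object embeddings (not updated)
    question: (D,)   question embedding used as the attention query
    w_out:    (D, A) small linear head mapping the pooled feature to answer logits
    """
    d = features.shape[1]
    scores = features @ question / np.sqrt(d)  # (N,) scaled dot-product scores
    attn = softmax(scores)                     # attention weights over regions
    pooled = attn @ features                   # (D,) attended visual summary
    return pooled @ w_out                      # (A,) answer logits

# Toy sizes: 10 image regions, 16-dim features, 5 candidate answers.
features = rng.standard_normal((10, 16))
question = rng.standard_normal(16)
w_out = rng.standard_normal((16, 5))
logits = reasoning_step(features, question, w_out)
print(logits.shape)  # (5,)
```

In the paper's protocol, only such a shallow head would be trained while the evaluated visual backbone stays frozen, so VQA accuracy reflects the quality of the representation rather than the capacity of the reasoning module.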

Keywords