iScience (Jul 2024)

Fine-grained knowledge about manipulable objects is well-predicted by contrastive language image pre-training

  • Jon Walbrin,
  • Nikita Sossounov,
  • Morteza Mahdiani,
  • Igor Vaz,
  • Jorge Almeida

Journal volume & issue: Vol. 27, no. 7, p. 110297

Abstract

Summary: Object recognition is an important ability that relies on distinguishing between similar objects (e.g., deciding which utensil(s) to use at different stages of meal preparation). Recent work describes the fine-grained organization of knowledge about manipulable objects via the study of the constituent dimensions that are most relevant to human behavior, for example, vision-, manipulation-, and function-based properties. A logical extension of this work concerns whether these dimensions are uniquely human or can be approximated by deep learning. Here, we show that behavioral dimensions are generally well-predicted by CLIP-ViT, a multimodal network trained on a large and diverse set of image-text pairs. Moreover, this model outperforms comparison networks pre-trained on smaller, image-only datasets. These results demonstrate the impressive capacity of CLIP-ViT to approximate fine-grained object knowledge. We discuss the possible sources of this benefit relative to other models (e.g., multimodal vs. image-only pre-training, dataset size, architecture).
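To make the prediction setup concrete, the sketch below illustrates one plausible way to relate CLIP-ViT image embeddings to behavioral dimension ratings via cross-validated ridge regression. The checkpoint name, stimulus paths, rating values, and regression choices are illustrative assumptions for exposition, not the authors' exact pipeline.

```python
# Illustrative sketch only: predicting one behavioral dimension of manipulable
# objects from CLIP-ViT image embeddings with cross-validated ridge regression.
# The checkpoint name, stimulus paths, and ratings are placeholders, not the
# study's actual stimuli, data, or analysis pipeline.
import numpy as np
from PIL import Image
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_image_embeddings(image_paths):
    """Return L2-normalized CLIP-ViT image embeddings (n_images x dim)."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs).detach().numpy()
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

def predict_dimension(image_paths, dim_scores, cv=10):
    """Correlate observed ratings with cross-validated ridge predictions."""
    X = clip_image_embeddings(image_paths)
    y = np.asarray(dim_scores, dtype=float)
    ridge = RidgeCV(alphas=np.logspace(-3, 3, 13))
    y_pred = cross_val_predict(ridge, X, y, cv=cv)
    return np.corrcoef(y, y_pred)[0, 1]  # Pearson r, predicted vs. observed

# Hypothetical usage: one image per object and one rating per object for a
# given behavioral dimension (e.g., a manipulation-related property).
# r = predict_dimension(["hammer.jpg", "whisk.jpg", ...], [3.2, 4.1, ...])
```

In this framing, the analysis would be repeated per behavioral dimension, and prediction accuracy compared between CLIP-ViT and the image-only comparison networks mentioned above.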

Keywords