Sensors (Oct 2022)

Visual Relationship Detection with Multimodal Fusion and Reasoning

  • Shouguan Xiao,
  • Weiping Fu

DOI
https://doi.org/10.3390/s22207918
Journal volume & issue
Vol. 22, no. 20
p. 7918

Abstract


Visual relationship detection aims at a complete understanding of visual scenes and has recently received increasing attention. However, current methods train the semantic network on the visual features of images alone, which does not match how humans reason: we perceive the obvious features of a scene and infer its covert states using common sense. Consequently, these methods fail to predict some hidden relationships between object pairs in complex scenes. To address this problem, we propose unifying vision–language fusion with knowledge graph reasoning, combining visual feature embeddings with external common-sense knowledge to determine the visual relationships of objects. In addition, before training the relationship detection network, we devise an object–pair proposal module to avoid the combinatorial explosion of candidate pairs. Extensive experiments show that our proposed method outperforms state-of-the-art methods on the Visual Genome and Visual Relationship Detection datasets.
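The core fusion idea in the abstract, combining a visual feature embedding with an external knowledge embedding before predicting a relationship, can be sketched minimally as below. All names, dimensions, and the simple concatenate-then-linear-softmax scorer are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_and_score(visual_feat, kg_embed, w, b):
    """Concatenate a visual feature with a knowledge-graph embedding
    and score relationship classes with a linear layer plus softmax.
    This is a hypothetical stand-in for the paper's fusion module."""
    fused = np.concatenate([visual_feat, kg_embed])
    logits = w @ fused + b
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# Assumed dimensions: 512-d visual feature, 128-d knowledge embedding,
# 50 predicate classes (as in common scene-graph benchmarks).
visual_feat = rng.standard_normal(512)
kg_embed = rng.standard_normal(128)
w = rng.standard_normal((50, 640)) * 0.01
b = np.zeros(50)

probs = fuse_and_score(visual_feat, kg_embed, w, b)
print(probs.shape)  # one probability per candidate predicate
```

In practice the fusion would sit inside a trained network and the knowledge embedding would come from a knowledge graph over object categories, but the sketch shows where the two information sources meet.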

Keywords