IEEE Access (Jan 2024)

Modeling Multimodal Uncertainties via Probability Distribution Encoders Included Vision-Language Models

Junjie Wang, Yatai Ji, Yuxiang Zhang, Yanru Zhu, Tetsuya Sakai

DOI: https://doi.org/10.1109/ACCESS.2023.3347192
Journal volume & issue: Vol. 12, pp. 420–434

Abstract


In the field of multimodal understanding and generation, tackling inherent uncertainties is essential for mitigating ambiguous interpretations across multiple targets. We introduce the Probability Distribution Encoder (PDE), a versatile, plug-and-play module that uses sequence-level and feature-level interactions to model these uncertainties as probability distributions. We further demonstrate its adaptability by seamlessly integrating PDE into established frameworks. Compared with previous methods, our probabilistic approach substantially enriches multimodal semantic understanding. Beyond task-specific labeled data, unlabeled data also contains rich prior knowledge, especially about multimodal uncertainties. However, current pre-training methods are built on point representations, which prevents our distribution representations from being used effectively. We therefore incorporate this uncertainty modeling into three new pre-training strategies: Distribution-based Vision-Language Contrastive Learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM). Empirical experiments show that our models achieve state-of-the-art (SOTA) results on a range of downstream tasks, including image-text retrieval, visual question answering, visual reasoning, visual entailment, and video captioning. Qualitative results further reveal several superior properties conferred by our methods, such as greater semantic expressiveness than point representations and the ability to generate diverse yet accurate predictions.
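To make the idea of distribution representations and distribution-based contrastive pre-training (D-VLC) more concrete, the sketch below shows one way such components could look in PyTorch. The abstract does not specify the parameterization or the similarity measure, so the diagonal-Gaussian head, the 2-Wasserstein distance, and all module and function names here are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianPDEHead(nn.Module):
    """Hypothetical PDE-style head: maps point features to a diagonal Gaussian.

    The real PDE also uses sequence-level and feature-level interactions;
    only the mean/variance parameterization idea is sketched here.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.mu = nn.Linear(dim, dim)       # predicts the distribution mean
        self.log_var = nn.Linear(dim, dim)  # predicts log-variance for numerical stability

    def forward(self, x: torch.Tensor):
        return self.mu(x), self.log_var(x)

def wasserstein2_sq(mu1, log_var1, mu2, log_var2):
    """Squared 2-Wasserstein distance between diagonal Gaussians
    (an assumed similarity measure for the distribution-based contrastive loss)."""
    sigma1, sigma2 = torch.exp(0.5 * log_var1), torch.exp(0.5 * log_var2)
    return ((mu1 - mu2) ** 2).sum(-1) + ((sigma1 - sigma2) ** 2).sum(-1)

def d_vlc_loss(img_feats, txt_feats, head_i, head_t, temperature=0.07):
    """Illustrative distribution-based contrastive loss over a batch of matched
    image-text pairs: negative distances serve as InfoNCE-style logits."""
    mu_i, lv_i = head_i(img_feats)  # (B, D) image Gaussians
    mu_t, lv_t = head_t(txt_feats)  # (B, D) text Gaussians
    # pairwise squared W2 distances between every image and every text in the batch
    dist = wasserstein2_sq(mu_i.unsqueeze(1), lv_i.unsqueeze(1),
                           mu_t.unsqueeze(0), lv_t.unsqueeze(0))  # (B, B)
    logits = -dist / temperature
    targets = torch.arange(img_feats.size(0), device=img_feats.device)
    # symmetric image-to-text and text-to-image cross-entropy
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Under these assumptions, the same Gaussian parameters could also feed the D-MLM and D-ITM objectives by scoring masked-token or matching predictions against distributions rather than single points; the paper itself should be consulted for the exact formulations.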

Keywords