TAMC: Textual Alignment and Masked Consistency for Open-Vocabulary 3D Scene Understanding

Juan Wang; Zhijie Wang; Tomo Miyazaki; Yaohou Fan; Shinichiro Omachi

doi:10.3390/s24196166

Sensors (Sep 2024)

Affiliations

Juan Wang: Department of Communications Engineering, Graduate School of Engineering, Tohoku University, Sendai 9808579, Japan
Zhijie Wang: RIKEN AIP, Tokyo 1030027, Japan
Tomo Miyazaki: Department of Communications Engineering, Graduate School of Engineering, Tohoku University, Sendai 9808579, Japan
Yaohou Fan: Department of Communications Engineering, Graduate School of Engineering, Tohoku University, Sendai 9808579, Japan
Shinichiro Omachi: Department of Communications Engineering, Graduate School of Engineering, Tohoku University, Sendai 9808579, Japan

DOI: https://doi.org/10.3390/s24196166
Journal volume & issue: Vol. 24, no. 19, p. 6166

Abstract

Three-dimensional (3D) scene understanding achieves environmental perception by extracting and analyzing point cloud data, with wide applications including virtual reality and robotics. Previous methods obtain open-vocabulary scene understanding by aligning 3D point cloud features with 2D image features from a pre-trained CLIP model. We believe that existing methods have two deficiencies: (1) the 3D feature extraction process ignores the challenges of real scenarios, i.e., point cloud data are very sparse and even incomplete; (2) the training stage lacks direct text supervision, leading to inconsistency with the inference stage. To address the first issue, we employ a Masked Consistency training policy: during the alignment of 3D and 2D features, we mask some 3D features to force the model to understand the entire scene using only partial 3D features. To address the second issue, we generate pseudo-text labels and align them with the 3D features during training. In particular, we first generate a description for each 2D image belonging to the same 3D scene and then use a summarization model to fuse these descriptions into a single description of the scene. We then align 2D-3D features and 3D-text features simultaneously during training. Extensive experiments demonstrate the effectiveness of our method, which outperforms state-of-the-art approaches.
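
The training objective described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the masking scheme, the distance function, or how the two alignment terms are combined. The sketch assumes that masking zeroes out a random subset of per-point 3D features, that alignment is measured by cosine distance, that the scene-level 3D embedding is a mean over points, and that the two terms are summed with a weight. The names `masked_alignment_loss`, `head`, `mask_ratio`, and `text_weight` are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale vectors along the last axis to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cosine_distance(a, b):
    # Mean of (1 - cosine similarity) over matching vectors in a and b.
    sim = np.sum(l2_normalize(a) * l2_normalize(b), axis=-1)
    return float(np.mean(1.0 - sim))

def masked_alignment_loss(feat_3d, feat_2d, scene_text_emb, head,
                          mask_ratio=0.5, text_weight=1.0, rng=None):
    """Toy combined loss (all design choices here are assumptions).

    feat_3d:        (N, D) per-point 3D features
    feat_2d:        (N, D) 2D image features projected onto the same points
    scene_text_emb: (D,) embedding of the summarized scene description
    head:           callable mapping masked (N, D) features to (N, D) outputs,
                    a stand-in for the trainable 3D network
    """
    rng = np.random.default_rng() if rng is None else rng
    # Assumed masking scheme: zero out a random subset of point features.
    keep = rng.random(feat_3d.shape[0]) >= mask_ratio
    out = head(feat_3d * keep[:, None])
    # 2D-3D term: outputs for ALL points, including masked ones, should
    # match the 2D features, so the model must infer the hidden parts.
    loss_2d = cosine_distance(out, feat_2d)
    # 3D-text term: mean-pooled scene embedding vs. the pseudo-text label.
    loss_text = cosine_distance(out.mean(axis=0), scene_text_emb)
    return loss_2d + text_weight * loss_text, loss_2d, loss_text

# Usage with random stand-in data and a fixed random linear map as `head`.
rng = np.random.default_rng(0)
points, dim = 64, 16
f3d = rng.normal(size=(points, dim))
f2d = rng.normal(size=(points, dim))
text = rng.normal(size=dim)
weights = rng.normal(size=(dim, dim))
total, l2d, ltext = masked_alignment_loss(
    f3d, f2d, text, head=lambda x: x @ weights, rng=rng)
```

In a real training loop `head` would be a learnable 3D backbone and the loss would be minimized by gradient descent; the sketch only shows how the masked 2D-3D term and the 3D-text term could be computed together in a single step.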

Published in Sensors

ISSN: 1424-8220 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Chemical technology
Website: http://www.mdpi.com/journal/sensors
