Sensors (Jan 2025)
Cross-Modal Collaboration and Robust Feature Classifier for Open-Vocabulary 3D Object Detection
Abstract
Multi-sensor fusion, such as LiDAR-camera 3D object detection, is a key technology in autonomous driving and robotics. However, traditional 3D detection models are limited to recognizing predefined categories and struggle with unknown or novel objects. Given the complexity of real-world environments, research into open-vocabulary 3D object detection is essential. This paper therefore addresses two key issues in this area: how to localize novel objects and how to classify them. We propose Cross-modal Collaboration and a Robust Feature Classifier to improve localization accuracy and classification robustness for novel objects. Cross-modal Collaboration performs collaborative localization between the LiDAR and camera modalities: 2D images provide preliminary regions of interest for novel objects in the 3D point cloud, while the 3D point cloud feeds more precise positional information back to the 2D images. Through iterative updates between the two modalities, the preliminary regions and positional information are progressively refined, yielding accurate localization of novel objects. The Robust Feature Classifier aims to classify novel objects accurately. To prevent them from being misidentified as background or as other incorrect categories, it maps the semantic vector of each new category into multiple sets of visual features that are distinguishable from the background, and clusters these visual features around their corresponding semantic vectors to maintain inter-class separability. Our method achieves state-of-the-art performance across a variety of scenarios and datasets.
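To make the collaboration loop concrete, the following is a minimal toy sketch of the kind of iterative 2D-3D exchange the abstract describes. Everything in it is an illustrative stand-in: the pinhole intrinsics, the synthetic scene, the median-depth consistency rule, and the names (`project`, `refine`, `depth_band`) are assumptions for exposition, not the paper's actual networks or update rules.

```python
import numpy as np

# Hypothetical pinhole intrinsics for a 640x480 image (illustrative only).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

def project(pts):
    """Project Nx3 camera-frame points to Nx2 pixel coordinates."""
    uvw = pts @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def refine(points, roi, n_iters=3, depth_band=0.8):
    """Alternate between modalities: the 2D ROI gates the point cloud
    (image -> LiDAR), a depth-consistency check prunes 3D clutter, and
    the survivors are reprojected to tighten the ROI (LiDAR -> image)."""
    x0, y0, x1, y1 = roi
    kept = points
    for _ in range(n_iters):
        uv = project(kept)
        inside = ((uv[:, 0] >= x0) & (uv[:, 0] <= x1)
                  & (uv[:, 1] >= y0) & (uv[:, 1] <= y1))
        kept = kept[inside]
        if len(kept) == 0:
            break
        # 3D cue: keep points whose depth agrees with the median depth.
        z = kept[:, 2]
        kept = kept[np.abs(z - np.median(z)) < depth_band]
        if len(kept) == 0:
            break
        # Reproject the surviving points' extent as the refined ROI.
        uv = project(kept)
        (x0, y0), (x1, y1) = uv.min(axis=0), uv.max(axis=0)
    return (x0, y0, x1, y1), kept

# Synthetic scene: a compact object plus uniform clutter.
rng = np.random.default_rng(0)
obj = rng.normal(loc=[0.5, 0.2, 8.0], scale=0.3, size=(200, 3))
clutter = rng.uniform(low=[-4, -2, 4], high=[4, 2, 20], size=(400, 3))
cloud = np.vstack([obj, clutter])

# A deliberately loose initial ROI, as a 2D proposal might provide.
roi, survivors = refine(cloud, roi=(280.0, 180.0, 430.0, 320.0))
print("refined ROI:", np.round(roi, 1), "| points kept:", len(survivors))
```

Each pass lets the image gate the point cloud and the point cloud, in turn, tighten the image region, mirroring the iterative refinement described above. Similarly, the classifier idea can be sketched with nearest-prototype assignment standing in for the clustering step: each class's semantic vector is mapped to several visual prototypes, and an explicit background anchor keeps novel objects from collapsing into the background. The fixed random projection `W`, the zero-vector background, and the dimensions are all assumed placeholders for the learned components.

```python
import numpy as np

# Toy stand-in for the Robust Feature Classifier idea. W is an assumed
# placeholder for the learned semantic-to-visual mapping; nearest-prototype
# assignment stands in for the per-class clustering.
rng = np.random.default_rng(1)
d_sem, d_vis, n_proto = 16, 32, 4
classes = ["novel_a", "novel_b"]
sem = {c: rng.normal(size=d_sem) for c in classes}       # semantic vectors

W = rng.normal(size=(n_proto, d_vis, d_sem)) / np.sqrt(d_sem)
protos = {c: W @ sem[c] for c in classes}  # n_proto visual prototypes/class
background = np.zeros(d_vis)               # explicit background anchor

def classify(feat):
    """Label a visual feature by its nearest class prototype, unless the
    background anchor is closer, so novel objects are not absorbed into
    the background."""
    best, label = np.linalg.norm(feat - background), "background"
    for c, proto_set in protos.items():
        d = np.linalg.norm(proto_set - feat, axis=1).min()
        if d < best:
            best, label = d, c
    return label

# A feature drawn near one of novel_a's prototypes is assigned to novel_a.
query = protos["novel_a"][0] + 0.1 * rng.normal(size=d_vis)
print(classify(query))  # prints: novel_a
```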
Keywords