PeerJ Computer Science (Jan 2025)

Applying auxiliary supervised depth-assisted transformer and cross modal attention fusion in monocular 3D object detection

  • Zhijian Wang,
  • Jie Liu,
  • Yixiao Sun,
  • Xiang Zhou,
  • Boyan Sun,
  • Dehong Kong,
  • Jay Xu,
  • Xiaoping Yue,
  • Wenyu Zhang

DOI
https://doi.org/10.7717/peerj-cs.2656
Journal volume & issue
Vol. 11
p. e2656

Abstract

Monocular 3D object detection is one of the most widely applied solutions for autonomous driving, yet it remains challenging because 2D images lack 3D information. Existing methods are limited by inaccurate depth estimates caused by inequivalent supervision targets, and the joint use of depth and visual features raises the problem of fusing heterogeneous modalities. In this article, we propose the Depth Detection Transformer (Depth-DETR), which applies an auxiliary supervised depth-assisted transformer and cross modal attention fusion to monocular 3D object detection. Depth-DETR introduces two depth encoders alongside the visual encoder. The two depth encoders are supervised by ground-truth depth and bounding boxes respectively; they work independently, complementing each other’s limitations and predicting more accurate target distances. Furthermore, Depth-DETR employs cross modal attention mechanisms to effectively fuse the three different features. A parallel structure of two cross modal transformers fuses the two depth features with the visual features; avoiding early fusion between the two depth features strengthens the final fused feature and yields better feature representations. Across multiple experimental validations, Depth-DETR achieves highly competitive results on the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset, with an AP score of 17.49, demonstrating strong performance in 3D object detection.
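To make the parallel fusion structure concrete, here is a minimal PyTorch-style sketch of how two cross modal attention branches could fuse the two depth features with the visual features. It is an illustration only, not the paper’s implementation: the embedding dimension, head count, module names, and the final residual summation are all assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch of parallel cross modal attention fusion.

    Two attention branches attend from visual tokens to each depth
    feature independently; the branches are combined only at the end,
    avoiding early fusion between the two depth features.
    All sizes and names here are illustrative assumptions.
    """
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # One cross modal attention branch per depth encoder output.
        self.attn_depth = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_box = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, depth_feat, box_depth_feat):
        # visual:         (B, N, dim) tokens from the visual encoder
        # depth_feat:     tokens from the depth encoder supervised by ground-truth depth
        # box_depth_feat: tokens from the depth encoder supervised by bounding boxes
        fused_d, _ = self.attn_depth(visual, depth_feat, depth_feat)
        fused_b, _ = self.attn_box(visual, box_depth_feat, box_depth_feat)
        # Late combination of the two parallel branches with a residual path.
        return self.norm(visual + fused_d + fused_b)

# Example usage with dummy token sequences:
# fusion = CrossModalFusion()
# v = torch.randn(2, 100, 256)
# d = torch.randn(2, 100, 256)
# b = torch.randn(2, 100, 256)
# out = fusion(v, d, b)  # (2, 100, 256)
```

Keeping the two branches separate until the final summation mirrors the abstract’s point that early fusion between the two depth features is avoided, so each supervision signal contributes its own complementary cue to the fused representation.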

Keywords