MonoIS3DLoc: Simulation to Reality Learning Based Monocular Instance Segmentation to 3D Objects Localization From Aerial View

Dinh Tuan Tran; Dung Duc Tran; Minh Anh Nguyen; Quyen Van Pham; Nobutaka Shimada; Joo-Ho Lee; Anh Quang Nguyen

doi:10.1109/ACCESS.2023.3288027

IEEE Access (Jan 2023)

Affiliations

Dinh Tuan Tran: College of Information Science and Engineering, Ritsumeikan University, Shiga, Kusatsu, Japan
Dung Duc Tran: School of Electrical and Electronic Engineering, Hanoi University of Science and Technology, Hanoi, Vietnam
Minh Anh Nguyen: School of Electrical and Electronic Engineering, Hanoi University of Science and Technology, Hanoi, Vietnam
Quyen Van Pham: School of Electrical and Electronic Engineering, Hanoi University of Science and Technology, Hanoi, Vietnam
Nobutaka Shimada: College of Information Science and Engineering, Ritsumeikan University, Shiga, Kusatsu, Japan
Joo-Ho Lee: College of Information Science and Engineering, Ritsumeikan University, Shiga, Kusatsu, Japan
Anh Quang Nguyen: School of Electrical and Electronic Engineering, Hanoi University of Science and Technology, Hanoi, Vietnam

DOI: https://doi.org/10.1109/ACCESS.2023.3288027
Journal volume & issue: Vol. 11
pp. 64170 – 64184

Abstract

3D object detection and localization using only a monocular camera faces the fundamental ill-posed problem of estimating 3D information. Recent research combining this task with deep neural networks has shown encouraging results. However, most of these methods are applied only to street-view cameras and rely on a few small available datasets, and their 3D prediction accuracy remains low compared with traditional estimation methods that use stereo cameras. With the development of drone delivery applications in city spaces, a similar method is needed to detect objects and estimate their 3D positions from an aerial view. We propose a novel Simulation to Reality approach to predict an object’s 3D position from an aerial view. An instance segmentation of the object is used as an intermediate representation, not only to create a very large training dataset by simulation but also to minimize the gap between simulation and reality. We design a feed-forward neural network that predicts the 3D position from the instance segmentation and integrate it with a range-attention classification to improve accuracy, especially for 3D object detection at far distances. To evaluate our method, we created two simulation datasets: one for cross-validation against other state-of-the-art methods and the other for practical experiments on a real drone with a monocular camera. The experimental results demonstrate that our method not only achieves better accuracy than state-of-the-art monocular methods when tested on the same KITTI-3D dataset but also comes close to the accuracy of a stereo-based technique. Since our model is lightweight, we successfully deployed it on a companion computer on the real drone, and the results of the practical experiments are promising.

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
