IEEE Access (Jan 2023)
MonoIS3DLoc: Simulation to Reality Learning Based Monocular Instance Segmentation to 3D Objects Localization From Aerial View
Abstract
3D object detection and localization based only on a monocular camera faces a fundamental ill-posed problem in estimating 3D information. In combination with deep neural networks, recent studies have shown encouraging results in tackling this issue. However, most of them target street-view cameras, rely on a few small available datasets, and their 3D prediction accuracy is still low compared with traditional estimation methods using stereo cameras. With the growth of drone delivery applications in urban spaces, a similar method is needed to detect objects and estimate their 3D positions from an aerial view. We propose a novel Simulation-to-Reality approach to predict an object's 3D position from an aerial view. The instance segmentation of an object is used as an intermediate representation, both to generate a very large training dataset in simulation and to minimize the gap between simulation and reality. We design a feed-forward neural network that predicts the 3D position from the instance segmentation and integrate it with a range-attention classification head to improve accuracy, especially for 3D object detection at far distances. To evaluate our method, we created two simulation datasets: one for cross-validation against other state-of-the-art methods and the other for practical experiments on a real drone with a monocular camera. The experimental results demonstrate that our method not only achieves better accuracy than state-of-the-art monocular methods on the same KITTI-3D dataset but also approaches the accuracy of a stereo-based technique. Since our model is lightweight, we successfully deployed it on the drone's companion computer, and the results of the practical experiments are promising.
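For intuition, the sketch below shows one plausible form of the pipeline the abstract describes: a feed-forward network that maps an object's instance-segmentation mask to a 3D position, with an auxiliary classification head over coarse distance ranges. This is a minimal illustration under stated assumptions, not the authors' released model; the class name MaskTo3DLoc, the layer sizes, the 64x64 mask resolution, and the number of range bins are all illustrative choices.

    # Minimal sketch (assumptions, not the paper's code): a feed-forward
    # network regressing a 3D position from a binary instance-segmentation
    # mask, plus a range-classification head over coarse distance bins.
    import torch
    import torch.nn as nn

    class MaskTo3DLoc(nn.Module):
        def __init__(self, mask_size=(64, 64), num_range_bins=4):
            super().__init__()
            in_features = mask_size[0] * mask_size[1]
            self.backbone = nn.Sequential(
                nn.Flatten(),
                nn.Linear(in_features, 512), nn.ReLU(),
                nn.Linear(512, 256), nn.ReLU(),
            )
            # Regression head: (x, y, z) position of the object.
            self.pos_head = nn.Linear(256, 3)
            # Classification head over coarse distance ranges.
            self.range_head = nn.Linear(256, num_range_bins)

        def forward(self, mask):
            feat = self.backbone(mask)
            return self.pos_head(feat), self.range_head(feat)

    # Usage: one 64x64 binary mask in; 3D position and range logits out.
    model = MaskTo3DLoc()
    mask = (torch.rand(1, 1, 64, 64) > 0.5).float()
    xyz, range_logits = model(mask)
    print(xyz.shape, range_logits.shape)  # torch.Size([1, 3]) torch.Size([1, 4])

In such a design, the range logits can weight or condition the position estimate, which is one way the "range-attention" idea could improve accuracy for distant objects.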
Keywords