This work is focused on the preliminary stage of the 3D drone tracking challenge, namely the precise detection of drones on images obtained from a synchronized multi-camera system. The YOLOv5 deep network with different input resolutions is trained and tested on the basis of real, multimodal data containing synchronized video sequences and precise motion capture data as a ground truth reference. The bounding boxes are determined based on the 3D position and orientation of an asymmetric cross attached to the top of the tracked object with known translation to the object’s center. The arms of the cross are identified by the markers registered by motion capture acquisition. Besides the classical mean average precision (mAP), a measure more adequate in the evaluation of detection performance in 3D tracking is proposed, namely the average distance between the centroids of matched references and detected drones, including false positive and false negative ratios. Moreover, the videos generated in the AirSim simulation platform were taken into account in both the training and testing stages.