Journal of King Saud University: Computer and Information Sciences (Feb 2024)
Self-attention and long-range relationship capture network for underwater object detection
Abstract
Underwater object detection has shown significant potential for exploring underwater environments. However, underwater images often suffer from degradation caused by uneven light distribution, complex environments, and crowded, dynamic backgrounds, which in turn degrades object detection performance. In this paper, a large kernel convolutional object detection network based on self-attention and long-range relationship capture is proposed. First, a hybrid dilated large kernel attention mechanism is proposed, which adopts the idea of hybrid dilated convolution and combines the advantages of the large kernel attention mechanism and self-attention. This attention mechanism avoids the drawbacks of self-attention while retaining its adaptivity and long-range relevance. Second, a feature enhancement block called the residual reconstructed module is proposed, which captures long-range dependencies in the network and extracts more critical contextual information, thereby mitigating network degradation and the accompanying loss of accuracy. Third, an adaptive spatial feature fusion detection head is constructed, which directly learns to spatially filter features at each feature level: useless information is filtered out and only useful information is retained for fusion, further enhancing the detection capability of the network. Finally, an underwater object detection network is built on these three techniques. Extensive experiments were conducted on the well-known RUOD, Aquarium, URPC, and MS COCO datasets. The results demonstrate that the proposed approach achieves the highest mAP of 88.7%, 86.5%, 98.9%, and 71.4%, respectively, improving on prior state-of-the-art methods by 1.2, 1.5, 8.5, and 0.2 percentage points, in that order. The proposed model demonstrates the capacity to apply self-attention to local details, capture global long-range relationships, prioritize essential information, and spatially filter out irrelevant information.
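The abstract does not give implementation details, so the following is a minimal PyTorch sketch of how a hybrid dilated large kernel attention block of this kind might be composed: chained depthwise convolutions with co-prime dilation rates (the hybrid dilated convolution idea, which avoids gridding artifacts) approximate a large kernel cheaply, and a pointwise convolution then forms an attention map that is multiplied elementwise with the input, as in large kernel attention. The class name, kernel size, and dilation rates (1, 2, 3) are illustrative assumptions, not the authors' exact design.

import torch
import torch.nn as nn


class HybridDilatedLKA(nn.Module):
    """Sketch of a hybrid dilated large kernel attention block (assumed design).

    Three chained depthwise 3x3 convolutions with dilations 1, 2, 3 give an
    effective 13x13 receptive field at a fraction of the cost of a true
    13x13 kernel; a 1x1 convolution mixes channels into an attention map.
    """

    def __init__(self, dim: int, dilations=(1, 2, 3)):
        super().__init__()
        # Depthwise 3x3 convolutions; padding = dilation preserves spatial size.
        self.dw_convs = nn.Sequential(*[
            nn.Conv2d(dim, dim, kernel_size=3, padding=d, dilation=d, groups=dim)
            for d in dilations
        ])
        # Pointwise convolution produces the channel-mixed attention map.
        self.pw_conv = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.pw_conv(self.dw_convs(x))
        # Elementwise product: an adaptive, input-dependent reweighting,
        # giving self-attention-like behaviour without its quadratic cost.
        return x * attn


if __name__ == "__main__":
    block = HybridDilatedLKA(dim=64)
    feat = torch.randn(1, 64, 32, 32)
    assert block(feat).shape == feat.shape  # shape-preserving attention

Using consecutive co-prime dilation rates is what distinguishes the hybrid scheme from a single large-dilation convolution: each successive layer fills in the pixels the previous dilated kernel skipped, so the enlarged receptive field has no holes.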