IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)
Popeye: A Unified Visual-Language Model for Multisource Ship Detection From Remote Sensing Imagery
Abstract
Ship detection aims to identify ship locations in remote sensing (RS) scenes. Owing to differing imaging payloads, the varied appearances of ships, and complicated background interference in the bird's-eye view, it is difficult to establish a unified paradigm for multisource ship detection. To address this challenge, this article proposes Popeye, a unified visual-language model that leverages the powerful generalization ability of large language models for multisource ship detection from RS imagery. Specifically, to bridge the interpretation gap across multisource images, a novel unified labeling paradigm is designed to integrate different visual modalities and the two common ship detection formats, i.e., the horizontal bounding box (HBB) and the oriented bounding box (OBB). Subsequently, a hybrid-experts encoder is designed to refine multiscale visual features, thereby enhancing visual perception. A visual-language alignment method is then developed to strengthen Popeye's interactive comprehension between visual and language content. Furthermore, an instruction adaptation mechanism is proposed to transfer pretrained visual-language knowledge from the natural-scene domain to the RS domain for multisource ship detection. In addition, the segment anything model (SAM) is seamlessly integrated into Popeye to achieve pixel-level ship segmentation without additional training cost. Finally, extensive experiments are conducted on the newly constructed ship instruction dataset, MMShip, and the results indicate that Popeye outperforms current specialist, open-vocabulary, and other visual-language models on zero-shot multisource ship detection tasks.
Keywords