Scientific Reports (Nov 2024)
ReMamba: a hybrid CNN-Mamba aggregation network for visible-infrared person re-identification
Abstract
Visible-Infrared Person Re-identification (VI-ReID) is consistently challenged by significant intra-class variations and cross-modality discrepancies between cameras; the key therefore lies in extracting discriminative modality-shared features. Existing VI-ReID methods based on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) fall short in capturing global features and in controlling computational complexity, respectively. To tackle these challenges, we propose a hybrid network framework called ReMamba. Specifically, we first use a CNN as the backbone network to extract multi-level features. We then introduce the Visual State Space (VSS) model, which integrates the local features output by the CNN from lower to higher levels; these local features complement the global information and thereby sharpen the local details of the global features. Considering the potential redundancy and semantic differences between local and global features, we design an adaptive feature aggregation module that automatically filters and effectively aggregates both types of features, together with an auxiliary aggregation loss that optimizes the aggregation process. Furthermore, to better constrain cross-modality and intra-modality features, we design a modality-consistency identity constraint loss that alleviates cross-modality differences and extracts modality-shared information. Extensive experiments on the SYSU-MM01, RegDB, and LLCM datasets demonstrate that the proposed ReMamba outperforms state-of-the-art VI-ReID methods.
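To make the pipeline described above concrete, the following is a minimal, hypothetical sketch of the forward pass only: a CNN backbone producing multi-level local features, a VSS-style module integrating them from lower to higher levels into a global descriptor, and a gated adaptive aggregation of the two. `VSSBlockStub`, `AdaptiveAggregation`, `ReMambaSketch`, and all layer sizes are illustrative stand-ins chosen here for self-containment, not the paper's actual architecture or losses.

```python
import torch
import torch.nn as nn


class VSSBlockStub(nn.Module):
    """Placeholder for a Visual State Space (Mamba-style) block; a real VSS
    block would replace this simple token mixer."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mix = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, N, C) token sequence
        return x + self.mix(self.norm(x))


class AdaptiveAggregation(nn.Module):
    """Gated fusion that filters and merges local (CNN) and global (VSS) descriptors."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, local_feat, global_feat):  # both (B, C)
        g = self.gate(torch.cat([local_feat, global_feat], dim=-1))
        return g * local_feat + (1 - g) * global_feat


class ReMambaSketch(nn.Module):
    def __init__(self, dim=512, num_ids=395):
        super().__init__()
        # Stand-in CNN backbone stages (e.g. ResNet-style stages in practice).
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU())
            for c_in, c_out in [(3, 64), (64, 128), (128, 256), (256, dim)]
        ])
        # 1x1 projections so every level yields tokens of the same width.
        self.proj = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in [64, 128, 256, dim]])
        self.vss = nn.ModuleList([VSSBlockStub(dim) for _ in range(4)])
        self.aggregate = AdaptiveAggregation(dim)
        self.classifier = nn.Linear(dim, num_ids)  # identity logits

    def forward(self, x):  # x: (B, 3, H, W), a visible or infrared image batch
        tokens, feat = None, x
        for stage, proj, vss in zip(self.stages, self.proj, self.vss):
            feat = stage(feat)                            # local features at this level
            t = proj(feat).flatten(2).transpose(1, 2)     # (B, N, dim) tokens
            tokens = t if tokens is None else torch.cat([tokens, t], dim=1)
            tokens = vss(tokens)                          # integrate low-to-high levels
        global_feat = tokens.mean(dim=1)                  # global descriptor from VSS
        local_feat = feat.mean(dim=(2, 3))                # local descriptor from CNN
        fused = self.aggregate(local_feat, global_feat)   # adaptive aggregation
        return fused, self.classifier(fused)


# Visible and infrared batches pass through the same shared weights.
model = ReMambaSketch()
feat, logits = model(torch.randn(2, 3, 256, 128))
print(feat.shape, logits.shape)  # torch.Size([2, 512]) torch.Size([2, 395])
```

In this sketch the auxiliary aggregation loss and the modality-consistency identity constraint loss are omitted; only the feature-extraction and fusion path is shown.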