Scientific Reports (Apr 2025)

Channel-shuffled transformers for cross-modality person re-identification in video

  • Rangwan Kasantikul,
  • Worapan Kusakunniran,
  • Qiang Wu,
  • Zhiyong Wang

DOI
https://doi.org/10.1038/s41598-025-00063-w
Journal volume & issue
Vol. 15, no. 1
pp. 1 – 13

Abstract


Effective person re-identification (Re-ID) across different modalities (such as daylight vs. night-vision) is crucial for surveillance applications. Information from multiple frames is essential for effective re-identification when visual components from individual frames become less reliable. While transformers can enhance temporal information extraction, the large number of channels required for effective feature encoding introduces scaling challenges, which can lead to overfitting and instability during training. We therefore propose a novel Channel-Shuffled Temporal Transformer (CSTT) that processes multi-frame sequences in conjunction with a ResNet backbone, forming the Hybrid Channel-Shuffled Transformer Net (HCSTNET). Replacing the fully connected layers in standard multi-head attention with ShuffleNet-like structures is key to integrating transformer attention with a ResNet backbone: channel grouping reduces overfitting through parameter reduction, and channel shuffling further improves the learned attention. In our experiments on the SYSU-MM01 dataset, compared against simple averaging of multiple frames, only the temporal transformer with channel shuffling achieved a measurable improvement over the baseline. We also investigate the optimal partitioning of the feature maps.
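The channel-shuffle operation that the abstract borrows from ShuffleNet can be sketched in a few lines: channels are split into groups, and the group/within-group axes are transposed so that each group's output mixes channels from every input group. The following minimal pure-Python sketch (function name and list-based representation are illustrative, not the paper's implementation) shows the index permutation involved.

```python
def channel_shuffle(features, groups):
    """ShuffleNet-style channel shuffle.

    Conceptually: reshape the channel list into (groups, c_per_group),
    transpose to (c_per_group, groups), then flatten. After grouped
    (block-diagonal) projections, this lets information flow between
    groups in the next layer.
    """
    c = len(features)
    assert c % groups == 0, "channel count must be divisible by group count"
    c_per_group = c // groups
    # Output position i takes the channel at (i % groups) * c_per_group + i // groups,
    # which is exactly the transpose-and-flatten permutation.
    return [features[(i % groups) * c_per_group + i // groups] for i in range(c)]


# With 6 channels in 2 groups ([0,1,2] and [3,4,5]),
# shuffling interleaves the groups:
print(channel_shuffle(list(range(6)), 2))  # → [0, 3, 1, 4, 2, 5]
```

In the proposed CSTT, this kind of shuffle would sit between grouped replacements of the attention projection layers, so that the parameter savings of grouping do not come at the cost of isolating channel groups from one another.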

Keywords