IEEE Access (Jan 2019)
Spatial-Transformed Regional Quality Estimation Network for Large-Variance Person Re-Identification
Abstract
Video-based person re-identification aims to retrieve a query video from a large video gallery. The main obstacles to this task are image misalignment and partial noise caused by detection errors, occlusion, blur, and illumination changes. Misalignment between the frames of a video, caused by excessive background or missing body parts, can severely degrade pedestrian matching, and partial noise is similarly harmful. Different spatial regions of a single frame vary in quality, and the quality of the same region also varies across the frames of a tracklet. A natural remedy is to aggregate complementary information from all frames in a sequence, using high-quality regions in some frames to compensate for poor-quality regions in others. To achieve this, we propose a novel Spatial-transformed Regional Quality Estimation Network (SRQEN), in which a spatial-transform unit automatically learns alignment from the identification task itself, and a dedicated training mechanism drives the network to extract complementary region-based information across frames. Visual examples show that pedestrians are better aligned with SRQEN and that the proposed method learns complementary information. Extensive experiments show that, compared with other feature extraction methods, we achieve competitive results of 93.5%, 79.8%, and 74.85% on PRID 2011, iLIDS-VID, and MARS, respectively.
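To make the two mechanisms described above concrete, the following is a minimal PyTorch sketch of a spatial-transformer alignment unit and quality-weighted region fusion. The module names, layer sizes, and the three-stripe region layout are illustrative assumptions for exposition, not the paper's actual SRQEN architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STNAlign(nn.Module):
    """Spatial-transform unit: predicts a per-frame affine warp that re-crops
    the frame so the pedestrian is aligned before feature extraction.
    (Illustrative layer sizes, not the paper's configuration.)"""
    def __init__(self, in_ch=3):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(in_ch, 8, kernel_size=7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(8 * 4 * 4, 6),
        )
        # Start from the identity transform so early training applies no warp.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):                      # x: (N, C, H, W)
        theta = self.loc(x).view(-1, 2, 3)     # affine parameters per frame
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

class RegionQualityFusion(nn.Module):
    """Scores each spatial region of each frame, then fuses region features
    across frames with softmax-normalized quality weights, so high-quality
    regions compensate for occluded or blurred ones."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.quality = nn.Linear(feat_dim, 1)  # scalar quality per region feature

    def forward(self, feats):                  # feats: (T, R, D) for one tracklet
        weights = torch.softmax(self.quality(feats), dim=0)  # normalize over frames
        return (weights * feats).sum(dim=0)    # (R, D) sequence-level descriptor

# Toy usage: align an 8-frame tracklet, then fuse stand-in region features.
frames = torch.randn(8, 3, 128, 64)
aligned = STNAlign()(frames)                   # (8, 3, 128, 64), warped frames
region_feats = torch.randn(8, 3, 128)          # e.g. 3 horizontal stripes per frame
video_descriptor = RegionQualityFusion()(region_feats)  # (3, 128)
```

The key design point in this sketch is that the quality weights are normalized over the frame dimension per region, so each region of the final descriptor draws most heavily on whichever frames imaged that region best.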
Keywords