IET Image Processing (Dec 2023)
Pose‐guided adversarial video prediction for image‐to‐video person re‐identification
Abstract
Abstract The image‐to‐video (I2V) person re‐identification (Re‐ID) is a cross‐modality pedestrian retrieval task, whose crux is to reduce the large modality discrepancy between images and videos. To this end, this paper proposes to predict the following video frames from a single image. Thus, the I2V person Re‐ID can be transformed to video‐to‐video (V2V) Re‐ID. Considering that predicting video frames from a single image is an ill‐posed problem, this paper proposes two strategies to improve the quality of the predicted videos. First, a pose‐guided video prediction pipeline is proposed. The given single image and pedestrian pose are encoded via image encoder and pose encoder, respectively; then, the image feature and pose feature are concatenated as the input of the video decoder. The authors minimize the difference between the predicted video and true video, and simultaneously minimize the difference between the true pose and predicted pose. Second, the conditional adversarial training strategy is employed to generate high‐quality video frames. Specifically, the discriminator takes the source image as condition and distinguishes whether the input frames are fake or true following frames of the source image. Experimental results demonstrate that the pose‐guided adversarial video prediction can effectively improve accuracy of I2V Re‐ID.
Keywords