IEEE Access (Jan 2024)
Pose Calibrated Feature Aggregation for Video Face Set Recognition in Unconstrained Environments
Abstract
This paper presents the Pose Calibrated Feature Aggregation Network (PCFAN), an architecture for set and video face recognition. Using stacked attention blocks and a multi-modal architecture, it assigns an adaptive weight to every instance in the set, based on both the recognition embeddings and the associated face metadata, and uses these weights to produce a single, compact feature vector for the set. The model learns to give more weight to features from images with more favourable quality and pose, which inherently hold more information. The block can be inserted on top of any standard recognition model to enable set prediction and improve performance, particularly in unconstrained scenarios where subject pose and image quality vary considerably between frames. We test our approach on three challenging video face-recognition datasets, IJB-A, IJB-B, and YTF, and report state-of-the-art results. Moreover, a comparison against leading aggregation methods as baselines shows that PCFAN outperforms them.
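The weighting-and-pooling idea described above can be illustrated with a minimal sketch. It is not the PCFAN architecture itself: it assumes a single linear scoring layer in place of the stacked attention blocks, and the names `aggregate_set`, `w_embed`, and `w_meta`, the metadata layout, and the random inputs are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def aggregate_set(embeddings, metadata, w_embed, w_meta):
    """Pool N per-frame embeddings into one set-level feature vector.

    embeddings: (N, D) recognition features, one row per face image.
    metadata:   (N, M) per-image metadata (assumed here: pose/quality cues).
    w_embed:    (D,) hypothetical scoring weights for the embeddings.
    w_meta:     (M,) hypothetical scoring weights for the metadata.
    """
    # One scalar score per image, driven by both modalities.
    scores = embeddings @ w_embed + metadata @ w_meta  # shape (N,)
    # Normalise scores so the weights are positive and sum to 1.
    weights = softmax(scores)
    # Weighted sum of the embeddings gives a single (D,) vector.
    pooled = weights @ embeddings
    # L2-normalise, as is common before cosine-similarity matching.
    return pooled / np.linalg.norm(pooled), weights


# Toy inputs standing in for real features; in practice the scoring
# parameters would be learned, not drawn at random.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 8))   # 5 frames, 8-D embeddings
metadata = rng.uniform(size=(5, 3))    # 3 assumed metadata values per frame
w_embed = rng.normal(size=8)
w_meta = rng.normal(size=3)

set_feature, weights = aggregate_set(embeddings, metadata, w_embed, w_meta)
```

Here `set_feature` has the same dimensionality as a single-image embedding, so it can be compared against other templates with whatever similarity measure the underlying recognition model already uses.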
Keywords