IEEE Access (Jan 2024)

Improving Visual Pedestrian Attributes Discernment With Textual Reconstruction

  • Yejun Lee
  • Jinah Kim
  • Jungchan Cho
  • Jhonghyun An

DOI: https://doi.org/10.1109/ACCESS.2024.3491830
Journal volume & issue: Vol. 12, pp. 164178–164189

Abstract

Recently, multi-modal research combining visual and textual information has emerged in Pedestrian Attribute Recognition (PAR). In this line of work, textual information has primarily been handled by text modeling with tokenizers and text encoders. However, when visual and text encoders are trained separately, the correlations they capture between visual and textual features can fall short of the human cognitive level. To address this issue, we drew inspiration from the way people describe pedestrian attributes and developed a method that mimics this cognitive process: it strengthens the visual encoder's discriminative ability by generating sentences from images, masking the important words, and then reconstructing them. Our method, which improves visual pedestrian attribute recognition with textual information, achieves significant performance gains on the RAP and PA100k datasets, as well as on zero-shot datasets such as RAP2zs and PETAzs, in which pedestrian identities do not overlap between the training and test sets, making the improvements particularly meaningful.
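The caption-mask-reconstruct objective described in the abstract can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical illustration, not the authors' implementation: all names (visual_encoder, text_decoder, attr_word_mask, mask_id) are assumptions, and the actual architecture, masking strategy, and loss in the paper may differ.

```python
import torch
import torch.nn as nn

class MaskedCaptionReconstruction(nn.Module):
    """Sketch of an image-conditioned masked-word reconstruction objective:
    given a caption generated from an image, mask the attribute words and
    train a decoder (conditioned on visual features) to recover them, so the
    visual encoder is pushed to encode attribute-relevant information."""

    def __init__(self, visual_encoder, text_decoder, mask_id):
        super().__init__()
        self.visual_encoder = visual_encoder  # e.g., a ViT-style backbone (assumed)
        self.text_decoder = text_decoder      # token decoder attending to visual features (assumed)
        self.mask_id = mask_id                # tokenizer's [MASK] token id
        # Score only the masked positions; everything else is ignored.
        self.loss = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, images, token_ids, attr_word_mask):
        # token_ids: (B, T) caption tokens; attr_word_mask: (B, T) bool,
        # True at attribute words selected for masking (assumed given).
        inputs = token_ids.masked_fill(attr_word_mask, self.mask_id)
        targets = token_ids.masked_fill(~attr_word_mask, -100)
        vis_feats = self.visual_encoder(images)        # (B, N, D) patch features
        logits = self.text_decoder(inputs, vis_feats)  # (B, T, vocab_size)
        return self.loss(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```

Under these assumptions, only the masked attribute words contribute to the loss, so the gradient flowing back through the decoder pressures the visual features to carry exactly the information needed to recover those words.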

Keywords