IET Image Processing (Aug 2023)

CSIT: Channel Spatial Integrated Transformer for human pose estimation

  • Shaohua Li
  • Haixiang Zhang
  • Hanjie Ma
  • Jie Feng
  • Mingfeng Jiang

DOI
https://doi.org/10.1049/ipr2.12850
Journal volume & issue
Vol. 17, no. 10
pp. 3002 – 3011

Abstract

Human keypoint detection differs from general detection tasks: it requires networks that can learn both visual information and anatomical constraints. Because CNNs excel at extracting texture features from images and transformers learn the correlations among keypoints well, many CTPNets (CNN+transformer human pose estimation networks) have emerged. However, these networks pay little attention to how the features extracted by the CNN are processed and expand them only along the channel dimension, ignoring the spatial features in the visual information that are essential for complex detection tasks such as pose estimation. To address this, the Channel Spatial Integrated Transformer for human pose estimation, termed CSIT, is proposed. Visual information is summarized as texture and spatial information, and a parallel network expands the feature maps along the channel and spatial dimensions to learn texture features and spatial features, respectively. In addition, anatomical constraint information is learned through keypoint embeddings. At the end of the network, a 1D vector representation, which offers more advanced performance and is more compatible with the transformer's characteristics, is used to predict keypoints. Experiments show that CSIT outperforms mainstream CTPNets on the COCO test-dev dataset and also achieves satisfactory results on the MPII dataset.
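
The abstract does not spell out how keypoints are read off the 1D vector representation. As a rough illustration only, the sketch below assumes a common reading of such representations: each keypoint gets one score vector over horizontal bins and one over vertical bins, and the coordinate is the highest-scoring bin on each axis. The function name `decode_1d_keypoints` and the `split_ratio` parameter (bins per pixel) are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def decode_1d_keypoints(
    x_scores: List[List[float]],
    y_scores: List[List[float]],
    split_ratio: float = 1.0,
) -> List[Tuple[float, float]]:
    """Decode one (x, y) coordinate per keypoint from two 1D score vectors.

    x_scores[k][i] is the score that keypoint k falls in horizontal bin i;
    y_scores[k][j] is the score that it falls in vertical bin j.
    split_ratio is an assumed number of bins per pixel (hypothetical).
    """
    if len(x_scores) != len(y_scores):
        raise ValueError("need one x vector and one y vector per keypoint")

    keypoints = []
    for xs, ys in zip(x_scores, y_scores):
        # Index of the highest-scoring bin on each axis (first one on ties).
        x_bin = max(range(len(xs)), key=xs.__getitem__)
        y_bin = max(range(len(ys)), key=ys.__getitem__)
        # Convert bin indices back to pixel coordinates.
        keypoints.append((x_bin / split_ratio, y_bin / split_ratio))
    return keypoints


if __name__ == "__main__":
    # Two keypoints on a toy 4-bin by 3-bin grid.
    x_scores = [[0.1, 0.7, 0.2, 0.0], [0.0, 0.1, 0.1, 0.8]]
    y_scores = [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
    print(decode_1d_keypoints(x_scores, y_scores))  # [(1.0, 0.0), (3.0, 2.0)]
```

Under this assumption, each keypoint is described by two short vectors rather than a full 2D map, so a transformer's per-keypoint output token can be projected directly onto them.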

Keywords