Sensors (Sep 2022)

Cofopose: Conditional 2D Pose Estimation with Transformers

  • Evans Aidoo,
  • Xun Wang,
  • Zhenguang Liu,
  • Edwin Kwadwo Tenagyei,
  • Kwabena Owusu-Agyemang,
  • Seth Larweh Kodjiku,
  • Victor Nonso Ejianya,
  • Esther Stacy E. B. Aggrey

DOI
https://doi.org/10.3390/s22186821
Journal volume & issue
Vol. 22, no. 18
p. 6821

Abstract


Human pose estimation has long been a fundamental problem in computer vision and artificial intelligence. Prominent among 2D human pose estimation (HPE) methods are the regression-based approaches, which have been proven to achieve excellent results. However, ground-truth labels are usually inherently ambiguous in challenging cases such as motion blur, occlusion, and truncation, degrading supervision and lowering accuracy. In this paper, we propose Cofopose, a two-stage approach consisting of person- and keypoint-detection transformers for 2D human pose estimation. Cofopose combines conditional cross-attention, a conditional DEtection TRansformer (conditional DETR), and an encoder-decoder transformer framework to achieve both person and keypoint detection. In a significant departure from other approaches, we use conditional cross-attention and a fine-tuned conditional DETR for person detection, and transformer encoder-decoders for keypoint detection. Cofopose was extensively evaluated on two benchmark datasets, MS COCO and MPII, achieving improved performance, by significant margins, over existing state-of-the-art frameworks.
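The conditional cross-attention named in the abstract follows the conditional DETR idea of splitting each query and key into a content part and a spatial part, so that appearance and location each contribute their own attention logits. The sketch below is a minimal, hypothetical illustration of that mechanism in NumPy; the function name, shapes, and single-head formulation are assumptions for clarity, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conditional_cross_attention(content_q, spatial_q, content_k, spatial_k, v):
    """Single-head conditional cross-attention (illustrative sketch).

    Content and spatial parts are concatenated along the feature axis,
    so the dot product decomposes into a content-content term plus a
    spatial-spatial term -- the key idea of conditional DETR's decoder.
    content_q, spatial_q: (num_queries, d); content_k, spatial_k: (num_keys, d)
    v: (num_keys, d_v). Returns (num_queries, d_v).
    """
    q = np.concatenate([content_q, spatial_q], axis=-1)   # (nq, 2d)
    k = np.concatenate([content_k, spatial_k], axis=-1)   # (nk, 2d)
    logits = q @ k.T / np.sqrt(q.shape[-1])               # (nq, nk)
    attn = softmax(logits, axis=-1)                       # rows sum to 1
    return attn @ v
```

Because the spatial query is derived from a reference point rather than learned from scratch, attention can localize box extremities (or, in the keypoint stage, joints) without the content features having to encode position as well.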

Keywords