Surgical Gesture Recognition in Laparoscopic Tasks Based on the Transformer Network and Self-Supervised Learning

Athanasios Gazis; Pantelis Karaiskos; Constantinos Loukas

doi:10.3390/bioengineering9120737

Bioengineering (Nov 2022)

Affiliations

Athanasios Gazis: Laboratory of Medical Physics, Medical School, National and Kapodistrian University of Athens, 115 27 Athens, Greece
Pantelis Karaiskos: Laboratory of Medical Physics, Medical School, National and Kapodistrian University of Athens, 115 27 Athens, Greece
Constantinos Loukas: Laboratory of Medical Physics, Medical School, National and Kapodistrian University of Athens, 115 27 Athens, Greece

DOI: https://doi.org/10.3390/bioengineering9120737
Journal volume & issue: Vol. 9, no. 12, p. 737

Abstract

In this study, we propose a deep learning framework and a self-supervision scheme for video-based surgical gesture recognition. The proposed framework is modular. First, a 3D convolutional network extracts feature vectors from video clips, encoding spatial and short-term temporal features. Second, the feature vectors are fed into a transformer network that captures long-term temporal dependencies. Two main models are proposed based on this backbone framework: C3DTrans (supervised) and SSC3DTrans (self-supervised). The dataset consisted of 80 videos from two basic laparoscopic tasks: peg transfer (PT) and knot tying (KT). To examine the potential of self-supervision, the models were trained on 60% and 100% of the annotated dataset. In addition, the best-performing model was evaluated on the JIGSAWS robotic surgery dataset. The best model (C3DTrans) achieved accuracies of 88.0% and 95.2% at the clip level, and 97.5% and 97.9% at the gesture level, for PT and KT, respectively. SSC3DTrans performed similarly to C3DTrans when trained on 60% of the annotated dataset (about 84% and 93% clip-level accuracy for PT and KT, respectively). The accuracy of C3DTrans on JIGSAWS was close to 76%, which is similar to or higher than that of prior techniques based on a single video stream, no additional video training, and online processing.
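
The abstract reports accuracy at both the clip level and the gesture level without defining how the two are computed. The minimal sketch below illustrates one common convention and is an assumption, not the authors' evaluation code: clip-level accuracy is taken as the fraction of correctly labeled clips, and gesture-level accuracy as the fraction of annotated gesture segments whose majority-voted clip prediction matches the annotation. The function names and the labels G1 to G3 are hypothetical.

```python
from collections import Counter


def _check(predicted, annotated):
    """Validate that both label sequences are non-empty and aligned."""
    if len(predicted) != len(annotated):
        raise ValueError("predicted and annotated must have the same length")
    if not annotated:
        raise ValueError("label sequences must not be empty")


def clip_accuracy(predicted, annotated):
    """Fraction of clips whose predicted label equals the annotated label."""
    _check(predicted, annotated)
    hits = sum(p == a for p, a in zip(predicted, annotated))
    return hits / len(annotated)


def gesture_segments(annotated):
    """Split the annotation into runs of identical consecutive labels.

    Returns a list of (start, end, label) with end exclusive.
    """
    segments = []
    start = 0
    for i in range(1, len(annotated) + 1):
        if i == len(annotated) or annotated[i] != annotated[start]:
            segments.append((start, i, annotated[start]))
            start = i
    return segments


def gesture_accuracy(predicted, annotated):
    """Fraction of annotated segments whose majority-voted prediction is correct.

    ASSUMPTION: majority voting over the clips of each segment; ties go to
    the label seen first in the segment (Counter preserves insertion order).
    """
    _check(predicted, annotated)
    segments = gesture_segments(annotated)
    correct = 0
    for start, end, label in segments:
        voted = Counter(predicted[start:end]).most_common(1)[0][0]
        correct += voted == label
    return correct / len(segments)


# Hypothetical per-clip labels for one short video.
annotated = ["G1", "G1", "G1", "G2", "G2", "G3", "G3", "G3"]
predicted = ["G1", "G2", "G1", "G2", "G2", "G3", "G1", "G3"]

print(clip_accuracy(predicted, annotated))     # 0.75
print(gesture_accuracy(predicted, annotated))  # 1.0
```

Under this convention, isolated clip errors inside a segment are outvoted, which is one plausible reason a gesture-level figure can exceed the clip-level figure for the same model.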

Published in Bioengineering

ISSN: 2306-5354 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology; Science: Biology (General)
Website: https://www.mdpi.com/journal/bioengineering
