Frontiers in Oncology (Aug 2022)

Multi-center verification of the influence of data ratio of training sets on test results of an AI system for detecting early gastric cancer based on the YOLO-v4 algorithm

  • Tao Jin,
  • Yancai Jiang,
  • Boneng Mao,
  • Xing Wang,
  • Bo Lu,
  • Ji Qian,
  • Hutao Zhou,
  • Tieliang Ma,
  • Yefei Zhang,
  • Sisi Li,
  • Yun Shi,
  • Yun Shi,
  • Zhendong Yao

DOI
https://doi.org/10.3389/fonc.2022.953090
Journal volume & issue
Vol. 12

Abstract

Read online

ObjectiveConvolutional Neural Network(CNN) is increasingly being applied in the diagnosis of gastric cancer. However, the impact of proportion of internal data in the training set on test results has not been sufficiently studied. Here, we constructed an artificial intelligence (AI) system called EGC-YOLOV4 using the YOLO-v4 algorithm to explore the optimal ratio of training set with the power to diagnose early gastric cancer.DesignA total of 22,0918 gastroscopic images from Yixing People’s Hospital were collected. 7 training set models were established to identify 4 test sets. Respective sensitivity, specificity, Youden index, accuracy, and corresponding thresholds were tested, and ROC curves were plotted.Results1. The EGC-YOLOV4 system completes all tests at an average reading speed of about 15 ms/sheet; 2. The AUC values in training set 1 model were 0.8325, 0.8307, 0.8706, and 0.8279, in training set 2 model were 0.8674, 0.8635, 0.9056, and 0.9249, in training set 3 model were 0.8544, 0.8881, 0.9072, and 0.9237, in training set 4 model were 0.8271, 0.9020, 0.9102, and 0.9316, in training set 5 model were 0.8249, 0.8484, 0.8796, and 0.8931, in training set 6 model were 0.8235, 0.8539, 0.9002, and 0.9051, in training set 7 model were 0.7581, 0.8082, 0.8803, and 0.8763.ConclusionEGC-YOLOV4 can quickly and accurately identify the early gastric cancer lesions in gastroscopic images, and has good generalization.The proportion of positive and negative samples in the training set will affect the overall diagnostic performance of AI.In this study, the optimal ratio of positive samples to negative samples in the training set is 1:1~ 1:2.

Keywords