Digital Diagnostics (Jun 2023)
Learning radiologists’ annotation styles with multi-annotator labeling for improved neural network performance
Abstract
BACKGROUND: One of the common problems in labeling medical images is inter-observer variability: the same image can be labeled differently by different doctors. The main reasons are the human factor, differences in experience and qualification, different radiology schools, poor image quality, and unclear instructions. The influence of some of these factors can be reduced by proper organization of the annotation process; nevertheless, the opinions of doctors frequently differ.

AIM: The study aimed to test whether a neural network with an additional module can learn the style and labeling features of different radiologists, and whether such modeling can improve the final object-detection metrics on radiological images.

METHODS: For training artificial intelligence systems in radiology, cross-labeling, i.e., annotation of the same image by several doctors, is frequently used. The simplest approach is to treat each doctor's labeling as an independent training example. Other methods combine the annotations with various rules or algorithms before training. Finally, Guan et al. used separate classification heads to model the labeling styles of different doctors; unfortunately, that method is not suitable for more complex tasks such as object detection on images.

For this analysis, a machine learning model designed to detect objects of different classes on mammographic scans was used. The model is a neural network based on the Deformable DETR architecture. A dataset of 7,756 mammographic breast scans with 12,543 unique annotations from 19 doctors was used to train the neural network. For validation and testing, datasets of 700 and 300 BI-RADS-labeled scans, respectively, were used. In all datasets, the proportion of images with pathology was in the 15–20% range.

A unique index was assigned to each of the 19 doctors, and at each training iteration a dedicated module looked up the vector corresponding to that index. The vector was expanded to the spatial size of the feature map at each level of the feature pyramid and concatenated to the maps as additional channels, so the encoder and the decoder of the detector had access to the information about which doctor labeled the scan. The vectors were updated by back-propagation (a schematic code sketch is given after the abstract).

Three methods were compared:
1. Basic model: labels from different doctors were combined by voting.
2. New stylistic module, single index: predictions on the test dataset were made with the single doctor's index that showed the best metrics on the validation dataset.
3. New stylistic module, ensemble of indexes: predictions on the test dataset were made with the indexes of the five doctors with the best validation metrics, and Weighted Boxes Fusion was used to combine the predictions (see the sketch after the abstract).

The area under the receiver operating characteristic curve (ROC-AUC) was used as the primary metric on the test dataset (BI-RADS 3, 4, and 5 categories were considered pathology). For each method, the probability of malignancy of a study was taken as the sum, over the craniocaudal and mediolateral oblique projections, of the maximum probabilities of the detected malignant objects (malignant masses and calcifications).

RESULTS: The ROC-AUC values for the three methods were 0.82, 0.87, and 0.89, respectively.

CONCLUSIONS: Information about the labeling doctor allows the neural network to learn and model the labeling styles of different doctors more effectively. In addition, this method may provide an estimate of the uncertainty of the network's prediction.
If the embeddings of different doctors lead to different predictions for the same scan, this may indicate that the data in question are difficult for the artificial intelligence system to process.
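Below is a minimal PyTorch sketch of the annotator-style module described in METHODS: a learnable embedding is looked up by the doctor's index, broadcast to the spatial size of each feature-pyramid level, and concatenated to that level as extra channels. The class name, embedding dimension, and the exact way the module plugs into Deformable DETR are illustrative assumptions; the abstract does not specify them.

```python
import torch
import torch.nn as nn


class AnnotatorStyleModule(nn.Module):
    """Learnable per-doctor style vectors attached to the feature pyramid.

    Each of the 19 annotating doctors gets a learnable embedding. The vector
    of the doctor who labeled the current scan is broadcast to the spatial
    size of every feature-pyramid level and concatenated to it as extra
    channels, so the detector's encoder and decoder "see" whose labeling
    style is being modeled. The embeddings are trained by ordinary
    back-propagation together with the rest of the network.
    """

    def __init__(self, num_doctors: int = 19, embed_dim: int = 8):
        super().__init__()
        self.embedding = nn.Embedding(num_doctors, embed_dim)

    def forward(self, feature_maps, doctor_idx):
        # feature_maps: list of (B, C, H, W) tensors, one per pyramid level
        # doctor_idx:   (B,) long tensor with the annotator index per image
        vec = self.embedding(doctor_idx)                      # (B, embed_dim)
        styled = []
        for fmap in feature_maps:
            b, _, h, w = fmap.shape
            style = vec.view(b, -1, 1, 1).expand(-1, -1, h, w)
            styled.append(torch.cat([fmap, style], dim=1))    # extra channels
        return styled
```

At inference time, doctor_idx can be set to any of the 19 indexes to emulate that doctor's labeling style; downstream layers must, of course, be configured to accept C + embed_dim input channels.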
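For the third compared method, predictions obtained with the five best-performing doctor indexes are combined with Weighted Boxes Fusion. A hedged sketch using the open-source ensemble_boxes package is shown below; the helper name, input layout, and thresholds (the library defaults) are assumptions, not values reported in the paper.

```python
# pip install ensemble-boxes  (assumed open-source WBF implementation)
from ensemble_boxes import weighted_boxes_fusion


def fuse_doctor_predictions(per_doctor_preds, iou_thr=0.55, skip_box_thr=0.0):
    """Fuse detections produced with several doctor embeddings into one set.

    per_doctor_preds: one entry per doctor index, each a dict with
        "boxes"  -- [[x1, y1, x2, y2], ...] normalized to [0, 1]
        "scores" -- confidence per box
        "labels" -- class id per box
    Returns fused boxes, scores, and labels.
    """
    boxes_list = [p["boxes"] for p in per_doctor_preds]
    scores_list = [p["scores"] for p in per_doctor_preds]
    labels_list = [p["labels"] for p in per_doctor_preds]
    return weighted_boxes_fusion(
        boxes_list, scores_list, labels_list,
        iou_thr=iou_thr, skip_box_thr=skip_box_thr,
    )
```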
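The study-level malignancy score used for the ROC-AUC comparison is described as the sum, over the craniocaudal and mediolateral oblique projections, of the maximum probability among detected malignant objects. A small sketch of that scoring rule, with illustrative function and variable names, could look as follows; the handling of projections with no detections is an assumption.

```python
from sklearn.metrics import roc_auc_score


def study_malignancy_score(cc_scores, mlo_scores):
    """Sum of per-projection maxima of malignant-object probabilities.

    cc_scores / mlo_scores: probabilities of malignant detections (masses
    and calcifications) on the craniocaudal and mediolateral oblique
    projections. A projection with no detections is assumed to contribute 0.
    """
    cc = max(cc_scores) if cc_scores else 0.0
    mlo = max(mlo_scores) if mlo_scores else 0.0
    return cc + mlo


# y_true:  1 if the study is BI-RADS 3/4/5 (pathology), else 0
# y_score: study_malignancy_score(cc, mlo) computed per study
# auc = roc_auc_score(y_true, y_score)
```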
Keywords