Informatics in Medicine Unlocked (Jan 2023)
Comparison of BERT implementations for natural language processing of narrative medical documents
Abstract
Background and objectives: Bidirectional Encoder Representations from Transformers (BERT) word embedding models have been used successfully for many natural language processing (NLP) tasks, including medical named entity recognition. However, healthcare documentation contains many more linguistically complicated concepts, often reflecting medical decision-making processes or complex patient characteristics, for which the performance of transformer-based models has not been as well investigated. Furthermore, the dataset on which a BERT model has been pre-trained could affect performance.
Methods: We compared the accuracy of identification of three linguistically complex medical concepts: a) discussion of bariatric surgery between patients and their healthcare providers; b) non-acceptance of statin treatment recommendation by patients; and c) tobacco use status documentation. Three BERT implementations were compared: regular BERT, BioBERT and ClinicalBERT. For each of the three NLP tasks, all three BERT implementations were trained on a manually annotated training dataset of outpatient provider notes and then evaluated on a held-out manually annotated test dataset. All datasets were obtained from the electronic health record system of Mass General Brigham. Filtering by keywords was used to improve class balance by undersampling the null class.
Results: Prevalence of study labels (concepts) ranged from 1.3% to 11.8% and was similar between training and held-out validation datasets within each task-model combination. Recall was between 0.4 and 0.9 in over 80% of NLP tasks, and precision was in the same range in 75% of tasks. Among different study evaluation categories, F1 score ranged from 0.0 to 0.860. Macro-averaged F1 score ranged from 0.466 to 0.854. Overall, ClinicalBERT achieved the best performance (by F1-macro score) in the Bariatric Surgery task, BioBERT in the Tobacco Use task and regular BERT in the Statin Non-Acceptance task. The mean macro-F1 score across all task-model pairs was 0.761 for ClinicalBERT, 0.735 for BioBERT and 0.699 for regular BERT.
Conclusions: BERT implementations pre-trained on documents from the biomedical domain, both BioBERT and ClinicalBERT, achieve superior NLP performance for identifying a range of complex medical concepts compared to regular BERT. Neither of the two biomedical BERT implementations we tested attained clearly greater accuracy than the other.
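Because the results above are reported as per-category and macro-averaged F1 scores on datasets where the concept of interest is rare, a short worked example may help show how the macro average weights every category equally regardless of prevalence. The following is a minimal sketch, not the study's evaluation code: the function names, the label strings, and the ten-note toy data are all hypothetical and chosen only for illustration.

```python
from typing import Hashable, Sequence


def per_class_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable],
                 label: Hashable) -> float:
    """F1 for one category, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    # With no true positives the score is 0.0 (also used when denom is 0).
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-category F1 over all categories seen."""
    labels = set(y_true) | set(y_pred)
    return sum(per_class_f1(y_true, y_pred, c) for c in labels) / len(labels)


# Hypothetical toy data: 10 notes, 2 of which contain the rare concept.
y_true = ["none"] * 8 + ["discussed"] * 2
y_pred = ["none"] * 7 + ["discussed", "discussed", "none"]

f1_none = per_class_f1(y_true, y_pred, "none")            # 0.875
f1_discussed = per_class_f1(y_true, y_pred, "discussed")  # 0.5
score = macro_f1(y_true, y_pred)                          # 0.6875
```

In this toy case 8 of 10 notes are classified correctly, yet the macro-averaged F1 is 0.6875 because the rare category, with an F1 of 0.5, counts as much as the common one. A category with no correct predictions receives an F1 of 0.0 under this definition.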