PeerJ Computer Science (Feb 2024)
exKidneyBERT: a language model for kidney transplant pathology reports and the crucial role of extended vocabularies
Abstract
Background. Pathology reports contain key information about the patient’s diagnosis as well as important gross and microscopic findings. These information-rich clinical reports offer an invaluable resource for clinical studies, but data extraction and analysis from such unstructured texts are often manual and tedious. While neural information retrieval systems (typically implemented as deep learning methods for natural language processing) are automatic and flexible, they usually require a large domain-specific text corpus for training, making them infeasible for many medical subdomains. Thus, an automated data extraction method for pathology reports that does not require a large training corpus would be of significant value and utility.

Objective. To develop a language model-based neural information retrieval system that can be trained on small datasets, and to validate it by training it on renal transplant pathology reports to extract relevant information for two predefined questions: (1) “What kind of rejection does the patient show?”; (2) “What is the grade of interstitial fibrosis and tubular atrophy (IFTA)?”

Methods. Kidney BERT was developed by pre-training Clinical BERT on 3.4K renal transplant pathology reports comprising 1.5M words. Then, exKidneyBERT was developed by extending Clinical BERT’s tokenizer with six technical keywords, thereby extending the model’s vocabulary, and repeating the pre-training procedure. All three models were fine-tuned with information retrieval heads.

Results. The model with the extended vocabulary, exKidneyBERT, outperformed Clinical BERT and Kidney BERT on both questions. For rejection, exKidneyBERT achieved an 83.3% overlap ratio for antibody-mediated rejection (ABMR) and 79.2% for T-cell-mediated rejection (TCMR). For IFTA, exKidneyBERT had a 95.8% exact match rate.

Conclusion. ExKidneyBERT is a high-performing model for extracting information from renal pathology reports. Additional pre-training of BERT language models on small specialized domains does not necessarily improve performance. Extending the BERT tokenizer’s vocabulary is essential for specialized domains to improve performance, especially when pre-training on small corpora.
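The vocabulary-extension step described in the Methods can be illustrated with a minimal sketch using the Hugging Face transformers library. The abstract does not list the six technical keywords or the exact Clinical BERT checkpoint, so the token list and checkpoint name below are assumptions for illustration only (the tokens are drawn from terms mentioned in the abstract).

```python
# Sketch: extend a BERT WordPiece tokenizer with domain-specific whole-word
# tokens before continued masked-language-model pre-training.
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed Clinical BERT checkpoint; the paper's exact starting model may differ.
base = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Illustrative tokens only; the actual six keywords are not given in the abstract.
new_tokens = ["abmr", "tcmr", "ifta"]
num_added = tokenizer.add_tokens(new_tokens)

# Resize the embedding matrix so each new token receives a trainable embedding row.
model.resize_token_embeddings(len(tokenizer))

# The extended model and tokenizer would then be pre-trained on the renal
# transplant pathology report corpus and fine-tuned with an information
# retrieval (question-answering) head.
```

Without this step, rare domain terms would be split into sub-word pieces by the original tokenizer; adding them as whole tokens lets the model learn dedicated representations during pre-training on the small corpus.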
Keywords