Identifying Alcohol-Related Information From Unstructured Bilingual Clinical Notes With Multilingual Transformers

Han Kyul Kim; Yujin Park; Yeju Park; Eunji Choi; Sodam Kim; Hahyun You; Ye Seul Bae

doi:10.1109/ACCESS.2023.3245523

IEEE Access (Jan 2023)

Affiliations

Han Kyul Kim: Daniel J. Epstein Department of Industrial and Systems Engineering, University of Southern California, Los Angeles, CA, USA
Yujin Park: Department of Biomedical Engineering, Seoul National University College of Medicine, Seoul, South Korea
Yeju Park: Office of Hospital Information, Seoul National University Hospital, Seoul, South Korea
Eunji Choi: Office of Hospital Information, Seoul National University Hospital, Seoul, South Korea
Sodam Kim: Office of Hospital Information, Seoul National University Hospital, Seoul, South Korea
Hahyun You: Department of Biomedical Engineering, Seoul National University College of Medicine, Seoul, South Korea
Ye Seul Bae: Office of Hospital Information, Seoul National University Hospital, Seoul, South Korea

DOI: https://doi.org/10.1109/ACCESS.2023.3245523
Journal volume & issue: Vol. 11
pp. 16066 – 16075

Abstract

As a key modifiable risk factor, alcohol consumption is clinically crucial information that allows medical professionals to further understand their patients’ medical conditions and suggest appropriate lifestyle-modifying interventions. However, identifying alcohol-related information in unstructured free-text clinical notes is often challenging. Not only are the formats of the notes inconsistent, but the notes also include a massive amount of non-alcohol-related information. Furthermore, at medical institutions outside English-speaking countries, these clinical notes contain a mixture of English and local languages, which makes extraction more difficult. Thanks to the increasing availability of electronic medical records (EMRs), several previous works have explored using natural language processing (NLP) to train machine learning models that automatically identify alcohol-related information in unstructured clinical notes. However, all of these previous works are limited to English clinical notes and are therefore able to leverage various large-scale external ontologies during text preprocessing. Furthermore, they rely on simple NLP techniques such as bag-of-words models, which suffer from high dimensionality and out-of-vocabulary issues. To address these issues, we fine-tune multilingual transformers. By leveraging the linguistically rich contextual information learned during pre-training, we are able to extract alcohol-related information from unstructured clinical notes without preprocessing the notes with any external ontologies. Furthermore, our work is the first to explore the use of transformers on bilingual clinical notes to extract alcohol-related information. Even with minimal text preprocessing, we achieve a macro F1 score of 84.70%.
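
The abstract reports its result as a macro F1 score, which is the unweighted mean of the per-class F1 scores, so that rare classes count as much as common ones. A minimal sketch of that computation follows. The function name `macro_f1` and the class labels in the example are hypothetical and are not taken from the paper, whose label set is not given here.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("at least one label is required")

    # Count true positives, false positives, and false negatives per class.
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    # Per-class F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the
    # denominator is zero.
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    y_true = ["drinker", "drinker", "non-drinker", "unknown"]
    y_pred = ["drinker", "non-drinker", "non-drinker", "unknown"]
    print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")
```

Because every class contributes equally to the average, a model that ignores an infrequent class is penalized more heavily than it would be under plain accuracy.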

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
