MARIE: A Context-Aware Term Mapping with String Matching and Embedding Vectors

Han Kyul Kim; Sae Won Choi; Ye Seul Bae; Jiin Choi; Hyein Kwon; Christine P. Lee; Hae-Young Lee; Taehoon Ko

doi:10.3390/app10217831

Applied Sciences (Nov 2020)

Affiliations

Han Kyul Kim: Office of Hospital Information, Seoul National University Hospital, Seoul 03080, Korea
Sae Won Choi: Office of Hospital Information, Seoul National University Hospital, Seoul 03080, Korea
Ye Seul Bae: Office of Hospital Information, Seoul National University Hospital, Seoul 03080, Korea
Jiin Choi: Office of Hospital Information, Seoul National University Hospital, Seoul 03080, Korea
Hyein Kwon: Office of Hospital Information, Seoul National University Hospital, Seoul 03080, Korea
Christine P. Lee: Office of Hospital Information, Seoul National University Hospital, Seoul 03080, Korea
Hae-Young Lee: Department of Internal Medicine, Seoul National University Hospital, Seoul 03080, Korea
Taehoon Ko: Department of Medical Informatics, The Catholic University of Korea, Seoul 03080, Korea

DOI: https://doi.org/10.3390/app10217831
Journal volume & issue: Vol. 10, no. 21, p. 7831

Abstract

With growing interest in machine learning, text standardization is becoming an increasingly important part of data pre-processing in biomedical communities. Because the performance of machine learning algorithms depends on both the amount and the quality of their training data, effective data standardization is needed to ensure consistent data integrity. Furthermore, biomedical organizations, depending on their geographical locations or affiliations, rely on different text standardization schemes in practice. To facilitate machine learning-related collaborations between these organizations, an effective yet practical text data standardization method is needed. In this paper, we introduce MARIE (a context-aware term mapping method with string matching and embedding vectors), an unsupervised learning-based tool that finds standardized clinical terminologies for queries such as a hospital’s own codes. By incorporating both string matching methods and term embedding vectors generated by BioBERT (bidirectional encoder representations from transformers for biomedical text mining), it uses both structural and contextual information to calculate similarity measures between source and target terms. Compared to previous term mapping methods, MARIE shows improved mapping accuracy. Furthermore, it can be easily extended to incorporate any string matching or term embedding method. Because it requires no additional model training, it is not only an effective but also a practical term mapping method for text data standardization and pre-processing.
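
The idea of scoring a source term against candidate target terms with both a string-matching measure and an embedding-based measure can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give MARIE's scoring formula, so the linear weighting `alpha`, the function names, the use of `difflib` as the string matcher, and the hand-made three-dimensional vectors (stand-ins for BioBERT term embeddings) are all assumptions made for illustration only.

```python
import math
from difflib import SequenceMatcher

# Hypothetical toy vectors standing in for BioBERT term embeddings.
# The numbers are invented so that the example runs without any model.
TOY_VECTORS = {
    "heart attack": [0.9, 0.1, 0.0],
    "myocardial infarction": [0.8, 0.2, 0.1],
    "heart rate": [0.1, 0.9, 0.2],
}


def embed(term):
    """Look up the toy vector for a term (stand-in for a real encoder)."""
    return TOY_VECTORS[term]


def string_similarity(a, b):
    """Structural similarity in [0, 1] from character-level matching."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def combined_similarity(source, target, embed_fn, alpha=0.5):
    """Weighted mix of string and embedding similarity (assumed form).

    alpha = 1.0 uses string matching only; alpha = 0.0 uses embeddings only.
    """
    structural = string_similarity(source, target)
    contextual = cosine_similarity(embed_fn(source), embed_fn(target))
    return alpha * structural + (1.0 - alpha) * contextual


def map_term(source, targets, embed_fn, alpha=0.5):
    """Return the target term with the highest combined similarity."""
    return max(
        targets,
        key=lambda t: combined_similarity(source, t, embed_fn, alpha),
    )


targets = ["myocardial infarction", "heart rate"]
string_only = map_term("heart attack", targets, embed, alpha=1.0)
embedding_only = map_term("heart attack", targets, embed, alpha=0.0)
print(string_only, "|", embedding_only)
```

With these toy inputs, string matching alone prefers the lexically similar "heart rate", while the invented embedding vectors prefer "myocardial infarction"; this is the kind of disagreement that motivates combining structural and contextual information. No model training is involved in the sketch, since the embeddings are taken as given.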

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
