On building a diabetes centric knowledge base via mining the web

Fan Gong; Yilei Chen; Haofen Wang; Hao Lu

doi:10.1186/s12911-019-0771-6

BMC Medical Informatics and Decision Making (Apr 2019)

Affiliations

Fan Gong: Shanghai Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine
Yilei Chen: Shanghai Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine
Haofen Wang: Shanghai Leyan Technologies Co. Ltd
Hao Lu: Shanghai Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine

Journal volume & issue: Vol. 19, no. S2
pp. 185 – 197

Abstract

Background: Diabetes has become one of the hot topics in life science research. To support analytical procedures, researchers and analysts expend considerable labor collecting experimental data, a process that is also error-prone. To reduce this cost and ensure data quality, there is a growing trend of extracting clinical events, in the form of knowledge, from electronic medical records (EMRs). Doing so first requires a high-coverage knowledge base (KB) for the specific disease to support these extraction tasks, an approach called KB-based extraction.

Methods: We propose an approach to building a diabetes-centric knowledge base (DKB) by mining the Web. In particular, we first extract knowledge from the semi-structured content of vertical portals, fuse the individual knowledge from each site, and then map it to a unified KB. The target DKB is then extracted from the overall KB using a distance-based Expectation-Maximization (EM) algorithm.

Results: In the experiments, we selected eight popular vertical portals in China as data sources for constructing the DKB. The final diabetes KB contains 7,703 instances and 96,041 edges covering diseases, symptoms, western medicines, traditional Chinese medicines, examinations, departments, and body structures. The accuracy of the DKB is 95.91%. Besides assessing the quality of the knowledge extracted from the vertical portals, we also carried out detailed experiments evaluating the knowledge fusion performance and the convergence of the distance-based EM algorithm, both with positive results.

Conclusions: In this paper, we introduced an approach to constructing a DKB. A knowledge extraction and fusion pipeline was first used to extract semi-structured data from vertical portals, and the individual KBs were then fused into a unified knowledge base. After that, we developed a distance-based Expectation-Maximization algorithm to extract a subset of the overall knowledge base, forming the target DKB. Experiments showed that the data in the DKB are rich and of high quality.

Published in BMC Medical Informatics and Decision Making

ISSN: 1472-6947 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: http://bmcmedinformdecismak.biomedcentral.com
