On building a diabetes centric knowledge base via mining the web

BMC Medical Informatics and Decision Making. 2019;19(S2):185-197 DOI 10.1186/s12911-019-0771-6


Journal Title: BMC Medical Informatics and Decision Making

ISSN: 1472-6947 (Online)

Publisher: BMC

LCC Subject Category: Medicine: Medicine (General): Computer applications to medicine. Medical informatics

Country of publisher: United Kingdom

Language of fulltext: English

Full-text formats available: PDF, HTML



Fan Gong (Shanghai Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine)
Yilei Chen (Shanghai Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine)
Haofen Wang (Shanghai Leyan Technologies Co. Ltd)
Hao Lu (Shanghai Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine)


Time From Submission to Publication: 23 weeks


Abstract

Background: Diabetes has become a major topic in life science research. To support their analyses, researchers and analysts spend considerable manual effort collecting experimental data, a process that is also error-prone. To reduce this cost and ensure data quality, there is a growing trend toward extracting clinical events, in the form of knowledge, from electronic medical records (EMRs). Doing so first requires a high-coverage knowledge base (KB) for the specific disease to support these extraction tasks, an approach called KB-based extraction.

Methods: We propose an approach to building a diabetes-centric knowledge base (DKB) by mining the Web. Specifically, we first extract knowledge from the semi-structured content of vertical portals, fuse the knowledge extracted from each site, and then map it to a unified KB. The target DKB is then extracted from this overall KB using a distance-based Expectation-Maximization (EM) algorithm.

Results: In our experiments, we selected eight popular vertical portals in China as data sources for constructing the DKB. The final DKB contains 7,703 instances and 96,041 edges covering diseases, symptoms, western medicines, traditional Chinese medicines, examinations, departments, and body structures. The accuracy of the DKB is 95.91%. In addition to assessing the quality of the knowledge extracted from the vertical portals, we carried out detailed experiments evaluating knowledge fusion performance and the convergence of the distance-based EM algorithm, with positive results.

Conclusions: We introduced an approach to constructing a DKB. A knowledge extraction and fusion pipeline first extracts semi-structured data from vertical portals, and the individual KBs are then fused into a unified knowledge base. We then developed a distance-based Expectation-Maximization algorithm to extract a subset of the overall knowledge base, forming the target DKB. Experiments showed that the data in the DKB are rich and of high quality.
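
The abstract does not specify how the distance-based EM algorithm works, so the following is only a rough illustration of the simpler underlying idea of selecting a disease-centric subset of a larger KB by graph distance. It is not the authors' method: it replaces EM with a fixed hop cutoff from a seed entity. The toy graph, the function names `hop_distances` and `extract_subset`, and the `max_hops` parameter are all assumptions made for this sketch.

```python
from collections import deque


def hop_distances(graph, seed):
    """Breadth-first search: number of edges from seed to each reachable node."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def extract_subset(graph, seed, max_hops):
    """Keep entities within max_hops edges of the seed (a fixed cutoff, not EM)."""
    dist = hop_distances(graph, seed)
    return {node for node, d in dist.items() if d <= max_hops}


# Hypothetical toy KB as an undirected adjacency list; not data from the paper.
toy_kb = {
    "diabetes": ["polyuria", "metformin", "endocrinology"],
    "polyuria": ["diabetes"],
    "metformin": ["diabetes", "lactic acidosis"],
    "endocrinology": ["diabetes"],
    "lactic acidosis": ["metformin"],
    "influenza": ["fever"],
    "fever": ["influenza"],
}

subset = extract_subset(toy_kb, "diabetes", max_hops=1)
print(sorted(subset))
```

With `max_hops=1` the subset contains the seed and its direct neighbours only. Entities that are not connected to the seed, such as "influenza" in this toy graph, are never included regardless of the cutoff.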