Deep Denoising of Raw Biomedical Knowledge Graph From COVID-19 Literature, LitCovid, and Pubtator: Framework Development and Validation

Chao Jiang; Victoria Ngo; Richard Chapman; Yue Yu; Hongfang Liu; Guoqian Jiang; Nansu Zong

doi:10.2196/38584

Journal of Medical Internet Research (Jul 2022)

Journal volume & issue: Vol. 24, no. 7, p. e38584

Abstract

Background: Multiple types of biomedical associations of knowledge graphs, including COVID-19–related ones, are constructed based on co-occurring biomedical entities retrieved from recent literature. However, the applications derived from these raw graphs (eg, association predictions among genes, drugs, and diseases) have a high probability of false-positive predictions as co-occurrences in the literature do not always mean there is a true biomedical association between two entities.

Objective: Data quality plays an important role in training deep neural network models; however, most of the current work in this area has been focused on improving a model’s performance with the assumption that the preprocessed data are clean. Here, we studied how to remove noise from raw knowledge graphs with limited labeled information.

Methods: The proposed framework used generative-based deep neural networks to generate a graph that can distinguish the unknown associations in the raw training graph. Two generative adversarial network models, NetGAN and Cross-Entropy Low-rank Logits (CELL), were adopted for the edge classification (ie, link prediction), leveraging unlabeled link information based on a real knowledge graph built from LitCovid and Pubtator.

Results: The performance of link prediction, especially in the extreme case of training data versus test data at a ratio of 1:9, demonstrated that the proposed method still achieved favorable results (area under the receiver operating characteristic curve >0.8 for the synthetic data set and 0.7 for the real data set), despite the limited amount of testing data available.

Conclusions: Our preliminary findings showed the proposed framework achieved promising results for removing noise during data preprocessing of the biomedical knowledge graph, potentially improving the performance of downstream applications by providing cleaner data.
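The evaluation setup mentioned in the abstract (a 1:9 split of training to test data, scored by area under the receiver operating characteristic curve) can be illustrated with a minimal Python sketch. This is not the authors' NetGAN or CELL pipeline: the helper names `split_edges` and `auroc`, the toy edge list, and the toy scores are all hypothetical and serve only to show how such a split and metric could be computed for scored candidate edges.

```python
import random


def split_edges(edges, train_fraction=0.1, seed=0):
    """Shuffle labeled edges and split them; 0.1 gives a 1:9 train:test ratio.

    Hypothetical helper for illustration, not taken from the paper.
    """
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    cut = max(1, round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def auroc(scores, labels):
    """AUROC as the probability that a true edge outscores a false one.

    Uses the pairwise (Mann-Whitney) formulation; ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (assumption): 100 candidate edges, every other one a true association.
edges = [(("entity_a%d" % i, "entity_b%d" % i), i % 2) for i in range(100)]
train, test = split_edges(edges, train_fraction=0.1)

# Stand-in scores: a real model trained on `train` would produce these.
test_labels = [label for _, label in test]
test_scores = [0.8 if label == 1 else 0.2 for label in test_labels]
test_auc = auroc(test_scores, test_labels)
```

In this toy setup the stand-in scores separate the two classes perfectly, so the sketch only demonstrates the mechanics of the split and the metric, not the performance reported in the abstract.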

Published in Journal of Medical Internet Research

ISSN: 1438-8871 (Online)
Publisher: JMIR Publications
Country of publisher: Canada
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Medicine: Public aspects of medicine
Website: https://www.jmir.org
