Interpretation knowledge extraction for genetic testing via question-answer model

Wenjun Wang; Huanxin Chen; Hui Wang; Lin Fang; Huan Wang; Yi Ding; Yao Lu; Qingyao Wu

doi:10.1186/s12864-024-10978-9

BMC Genomics (Nov 2024)

Affiliations

Wenjun Wang: School of Software Engineering, South China University of Technology
Huanxin Chen: School of Software Engineering, South China University of Technology
Hui Wang: Shenzhen Cladogram Technology Co., Ltd
Lin Fang: Shenzhen Cladogram Technology Co., Ltd
Huan Wang: Industrial Technology Research Center, Guangdong Institute of Scientific & Technical Information
Yi Ding: Hunan University of Arts and Science
Yao Lu: Shenzhen Cladogram Technology Co., Ltd
Qingyao Wu: School of Software Engineering, South China University of Technology

Journal volume & issue: Vol. 25, no. 1, pp. 1–14

Abstract

Background: Sequencing-based genetic testing is widely used in biomedical research, including the detection of pathogenic microorganisms with metagenomic next-generation sequencing (mNGS). Applying sequencing results to clinical diagnosis and treatment relies on various interpretation knowledge bases. Existing knowledge bases are built primarily through manual knowledge extraction, which requires professionals to read extensive literature and extract the relevant knowledge from it; this is time-consuming and costly, and it unavoidably introduces subjective bias. In this study, we aimed to automatically extract knowledge for interpreting mNGS results.

Method: We propose a novel approach to automatically extract pathogenic microorganism knowledge based on a question-answer (QA) model. First, because no pathogenic microorganism QA dataset is available for training the model, we construct the MicrobeDB dataset, which contains 3,161 samples from 618 published papers covering 224 pathogenic microorganisms. Then, we fine-tune the selected baseline model on MicrobeDB. Finally, we use ChatGPT to enhance the diversity of the training data and employ data expansion to increase its volume.

Results: Our method achieves an Exact Match (EM) score of 88.39% and an F1 score of 93.18% on the MicrobeDB test set. We also conduct ablation studies on the proposed data augmentation method. In addition, we perform comparative experiments with the ChatPDF tool, which is based on the ChatGPT API, to demonstrate the effectiveness of the proposed method.

Conclusions: Our method is effective and valuable for extracting pathogenic microorganism knowledge.
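For readers unfamiliar with the two metrics reported in the Results, the following is a minimal sketch of how Exact Match and token-level F1 are commonly computed for extractive QA. It assumes SQuAD-style conventions (lowercasing, stripping punctuation and English articles, whitespace tokenization); the abstract does not state which normalization the authors used, so this may differ from their evaluation. The function names and the example strings are hypothetical and are not taken from MicrobeDB.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the paper may normalize differently.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical prediction/reference pair, for illustration only.
    pred = "Klebsiella pneumoniae infection"
    ref = "Klebsiella pneumoniae"
    print(exact_match(pred, ref), f1_score(pred, ref))
```

Under these conventions, a dataset-level score is the mean of the per-sample values, usually expressed as a percentage. EM rewards only answers that match the reference span exactly after normalization, while F1 gives partial credit for overlapping tokens, which is why F1 is typically the higher of the two.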

Published in BMC Genomics

ISSN: 1471-2164 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Technology: Chemical technology: Biotechnology; Science: Biology (General): Genetics
Website: http://bmcgenomics.biomedcentral.com
