Scientific Data (Apr 2024)
An open source knowledge graph ecosystem for the life sciences
- Tiffany J. Callahan,
- Ignacio J. Tripodi,
- Adrianne L. Stefanski,
- Luca Cappelletti,
- Sanya B. Taneja,
- Jordan M. Wyrwa,
- Elena Casiraghi,
- Nicolas A. Matentzoglu,
- Justin Reese,
- Jonathan C. Silverstein,
- Charles Tapley Hoyt,
- Richard D. Boyce,
- Scott A. Malec,
- Deepak R. Unni,
- Marcin P. Joachimiak,
- Peter N. Robinson,
- Christopher J. Mungall,
- Emanuele Cavalleri,
- Tommaso Fontana,
- Giorgio Valentini,
- Marco Mesiti,
- Lucas A. Gillenwater,
- Brook Santangelo,
- Nicole A. Vasilevsky,
- Robert Hoehndorf,
- Tellen D. Bennett,
- Patrick B. Ryan,
- George Hripcsak,
- Michael G. Kahn,
- Michael Bada,
- William A. Baumgartner,
- Lawrence E. Hunter
Affiliations
- Tiffany J. Callahan
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus
- Ignacio J. Tripodi
- Computer Science Department, Interdisciplinary Quantitative Biology, University of Colorado Boulder
- Adrianne L. Stefanski
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus
- Luca Cappelletti
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano
- Sanya B. Taneja
- Intelligent Systems Program, University of Pittsburgh
- Jordan M. Wyrwa
- Department of Physical Medicine and Rehabilitation, School of Medicine, University of Colorado Anschutz Medical Campus
- Elena Casiraghi
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano
- Nicolas A. Matentzoglu
- Semanticly
- Justin Reese
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory
- Jonathan C. Silverstein
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine
- Charles Tapley Hoyt
- Laboratory of Systems Pharmacology, Harvard Medical School
- Richard D. Boyce
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine
- Scott A. Malec
- Division of Translational Informatics, University of New Mexico School of Medicine
- Deepak R. Unni
- SIB Swiss Institute of Bioinformatics
- Marcin P. Joachimiak
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory
- Peter N. Robinson
- Berlin Institute of Health at Charité-Universitätsmedizin
- Christopher J. Mungall
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory
- Emanuele Cavalleri
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano
- Tommaso Fontana
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano
- Giorgio Valentini
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano
- Marco Mesiti
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano
- Lucas A. Gillenwater
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus
- Brook Santangelo
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus
- Nicole A. Vasilevsky
- Data Collaboration Center, Critical Path Institute
- Robert Hoehndorf
- Computer, Electrical and Mathematical Sciences & Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology
- Tellen D. Bennett
- Department of Biomedical Informatics, University of Colorado School of Medicine
- Patrick B. Ryan
- Janssen Research and Development
- George Hripcsak
- Department of Biomedical Informatics, Columbia University Irving Medical Center
- Michael G. Kahn
- Department of Biomedical Informatics, University of Colorado School of Medicine
- Michael Bada
- Division of General Internal Medicine, University of Colorado School of Medicine
- William A. Baumgartner
- Division of General Internal Medicine, University of Colorado School of Medicine
- Lawrence E. Hunter
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus
- DOI
- https://doi.org/10.1038/s41597-024-03171-w
- Journal volume & issue
- Vol. 11, no. 1, pp. 1–22
Abstract
Translational research requires data at multiple scales of biological organization. Advancements in sequencing and multi-omics technologies have increased the availability of these data, but researchers face significant integration challenges. Knowledge graphs (KGs) are used to model complex phenomena, and methods exist to construct them automatically. However, tackling complex biomedical integration problems requires flexibility in the way knowledge is modeled. Moreover, existing KG construction methods provide robust tooling at the cost of fixed or limited choices among knowledge representation models. PheKnowLator (Phenotype Knowledge Translator) is a semantic ecosystem for automating the FAIR (Findable, Accessible, Interoperable, and Reusable) construction of ontologically grounded KGs with fully customizable knowledge representation. The ecosystem includes KG construction resources (e.g., data preparation APIs), analysis tools (e.g., SPARQL endpoint resources and abstraction algorithms), and benchmarks (e.g., prebuilt KGs). We evaluated the ecosystem by systematically comparing it to existing open-source KG construction methods and by analyzing its computational performance when used to construct 12 different large-scale KGs. With flexible knowledge representation, PheKnowLator enables fully customizable KGs without compromising performance or usability.