Harvestman: a framework for hierarchical feature learning and selection from whole genome sequencing data

Trevor S. Frisby; Shawn J. Baker; Guillaume Marçais; Quang Minh Hoang; Carl Kingsford; Christopher J. Langmead

doi:10.1186/s12859-021-04096-6

BMC Bioinformatics (Apr 2021)

Affiliations

Trevor S. Frisby: Computational Biology Department, Carnegie Mellon University
Shawn J. Baker: Computational Biology Department, Carnegie Mellon University
Guillaume Marçais: Computational Biology Department, Carnegie Mellon University
Quang Minh Hoang: Computer Science Department, Carnegie Mellon University
Carl Kingsford: Computational Biology Department, Carnegie Mellon University
Christopher J. Langmead: Computational Biology Department, Carnegie Mellon University

DOI: https://doi.org/10.1186/s12859-021-04096-6
Journal volume & issue: Vol. 22, no. 1, pp. 1–19

Abstract

Background: Supervised learning from high-throughput sequencing data presents many challenges. For one, the curse of dimensionality often leads to overfitting as well as issues with scalability. This can produce inaccurate models, or models that require extensive compute time and resources. Additionally, variant calls may not be the optimal encoding for a given learning task, which also contributes to poor predictive capabilities. To address these issues, we present Harvestman, a method that takes advantage of hierarchical relationships among the possible biological interpretations and representations of genomic variants to perform automatic feature learning, feature selection, and model building.

Results: We demonstrate that Harvestman scales to thousands of genomes comprising more than 84 million variants by processing phase 3 data from the 1000 Genomes Project, one of the largest publicly available collections of whole genome sequences. Using breast cancer data from The Cancer Genome Atlas, we show that Harvestman selects a rich combination of representations that are adapted to the learning task, and performs better than a binary representation of SNPs alone. We compare Harvestman to existing feature selection methods and demonstrate that our method is more parsimonious: it selects smaller and less redundant feature subsets while maintaining the accuracy of the resulting classifier.

Conclusion: Harvestman is a hierarchical feature selection approach for supervised model building from variant call data. By building a knowledge graph over genomic variants and solving an integer linear program, Harvestman automatically and optimally finds the right encoding for genomic variants. Compared to other hierarchical feature selection methods, Harvestman is faster and selects features more parsimoniously.

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
