Variable selection for disease progression models: methods for oncogenetic trees and application to cancer and HIV

Katrin Hainke; Sebastian Szugat; Roland Fried; Jörg Rahnenführer

doi:10.1186/s12859-017-1762-1

BMC Bioinformatics (Aug 2017)

Affiliations

Katrin Hainke: Department of Statistics, TU Dortmund University
Sebastian Szugat: Department of Statistics, TU Dortmund University
Roland Fried: Department of Statistics, TU Dortmund University
Jörg Rahnenführer: Department of Statistics, TU Dortmund University

Journal volume & issue: Vol. 18, no. 1
pp. 1 – 16

Abstract

Background: Disease progression models are important for understanding the critical steps in the development of diseases. The models are embedded in a statistical framework to deal with random variation due to biology and to the sampling process when only a finite population is observed. Conditional probabilities describe dependencies between the events that characterise the critical steps in the disease process. Many model classes have been proposed in the literature, from simple path models to complex Bayesian networks. Oncogenetic trees are a popular model class that is easy to understand yet flexible. They have been applied to describe the accumulation of genetic aberrations in cancer and HIV data. However, the number of potentially relevant aberrations is often far larger than the maximal number of events that can be used to estimate the progression models reliably. Still, only a few approaches to variable selection exist, and these have not yet been investigated in detail.

Results: We fill this gap and propose ten variable selection methods specifically for oncogenetic trees, some of them completely new. We compare them in an extensive simulation study and on real data from cancer and HIV. Preselecting events with clique identification algorithms performs best. Here, events are selected if they belong to the largest or the maximum-weight subgraph in which all pairs of vertices are connected.

Conclusions: The variable selection method of identifying cliques finds both the important frequent events and those related to disease pathways.
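
The clique-based preselection mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the event graph is built or how weights are assigned, so the edge rule used here (two events are connected if they co-occur in at least `min_cooccurrence` samples), the weights, and all function names are assumptions for illustration only, not the authors' implementation. The sketch enumerates maximal cliques with the Bron-Kerbosch algorithm and returns either the largest clique or the clique with the highest total weight.

```python
from itertools import combinations


def build_graph(samples, events, min_cooccurrence):
    """Connect two events if they co-occur in enough samples (assumed edge rule)."""
    adj = {e: set() for e in events}
    for a, b in combinations(events, 2):
        count = sum(1 for s in samples if a in s and b in s)
        if count >= min_cooccurrence:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques using Bron-Kerbosch with pivoting."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def select_events(adj, weights=None):
    """Return the largest clique, or the maximum-weight clique if weights are given.

    With non-negative weights, a maximum-weight clique is always among the
    maximal cliques, so scanning them is sufficient.
    """
    cliques = maximal_cliques(adj)
    if weights is None:
        return set(max(cliques, key=len))
    return set(max(cliques, key=lambda c: sum(weights[e] for e in c)))


# Hypothetical toy data: each sample is the set of events observed in it.
samples = [
    {"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"},
    {"C", "D"}, {"D", "E"}, {"D", "E"}, {"E"},
]
events = ["A", "B", "C", "D", "E"]
adj = build_graph(samples, events, min_cooccurrence=2)

largest = select_events(adj)
heaviest = select_events(adj, weights={"A": 1, "B": 1, "C": 1, "D": 5, "E": 5})
print(sorted(largest), sorted(heaviest))
```

Exhaustive enumeration of maximal cliques is exponential in the worst case, which is acceptable only for the small event sets typical of toy examples; the selected events would then be passed on to whatever tree-fitting procedure is used.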

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
