Evolutionary methods for variable selection in the epidemiological modeling of cardiovascular diseases

Christina Brester; Jussi Kauhanen; Tomi-Pekka Tuomainen; Sari Voutilainen; Mauno Rönkkö; Kimmo Ronkainen; Eugene Semenkin; Mikko Kolehmainen

doi:10.1186/s13040-018-0180-x

BioData Mining (Aug 2018)

Affiliations

Christina Brester: Department of Environmental and Biological Sciences, University of Eastern Finland
Jussi Kauhanen: Institute of Public Health and Clinical Nutrition, University of Eastern Finland
Tomi-Pekka Tuomainen: Institute of Public Health and Clinical Nutrition, University of Eastern Finland
Sari Voutilainen: Institute of Public Health and Clinical Nutrition, University of Eastern Finland
Mauno Rönkkö: Department of Environmental and Biological Sciences, University of Eastern Finland
Kimmo Ronkainen: Institute of Public Health and Clinical Nutrition, University of Eastern Finland
Eugene Semenkin: Institute of Computer Science and Telecommunications, Reshetnev Siberian State University of Science and Technology
Mikko Kolehmainen: Department of Environmental and Biological Sciences, University of Eastern Finland

Journal volume & issue: Vol. 11, no. 1, pp. 1–14

Abstract

Background: Redundant information is becoming a critical issue for epidemiologists. High-dimensional datasets call for new, effective variable selection methods. This study implements an advanced evolutionary variable selection method and applies it to cardiovascular predictive modeling. Data from the epidemiological follow-up study KIHD (Kuopio Ischemic Heart Disease Risk Factor Study) were used to compare the designed method, which is based on an evolutionary search, with conventional stepwise selection. The sample contains 433 predictor variables in total and a response variable indicating incidents of cardiovascular disease for 1465 study subjects.

Results: The effectiveness of the variable selection methods was investigated in combination with two models: Generalized Linear Logistic Regression and Support Vector Machine. The number of variables was reduced from 433 to 38 while preserving the predictive ability of the models. Model performance was evaluated with the F-score metric. The best F-scores obtained were 65.6% before variable selection and 67.4% after it. All results were averaged over the 5 folds of a cross-validation procedure.

Conclusions: The presented evolutionary variable selection method yields a reduced set of variables that are relevant to predicting cardiovascular diseases. A reference list of the most meaningful variables is introduced as a basis for new epidemiological studies. More generally, because many variables are multicollinear, different combinations of predictors can be used to attain the same model performance.
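To make the idea of an evolutionary search over variable subsets concrete, the following is a minimal sketch of a generic genetic algorithm over binary inclusion masks. It is not the authors' method: the abstract does not specify the algorithm's operators, so tournament selection, uniform crossover, bit-flip mutation, and elitism are assumptions, as are all names (`evolve_mask`, `toy_fitness`, `RELEVANT`). The fitness used here is a toy stand-in; in a study like this one, the fitness of a mask would plausibly be the F-score of a model trained on the selected variables, averaged over cross-validation folds. The helper `f_score` shows the standard F1 formula for reference.

```python
import random


def f_score(tp, fp, fn):
    """Standard F1 score (harmonic mean of precision and recall) from counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evolve_mask(n_vars, fitness, pop_size=30, generations=40, p_mut=None, seed=0):
    """Search for a binary mask (1 = variable kept) that maximizes `fitness`.

    Assumed operators: binary tournament selection, uniform crossover,
    bit-flip mutation, and elitism (the best mask always survives).
    """
    rng = random.Random(seed)
    if p_mut is None:
        p_mut = 1.0 / n_vars  # on average one flipped bit per child

    pop = [[rng.randint(0, 1) for _ in range(n_vars)] for _ in range(pop_size)]
    best = max(pop, key=fitness)

    def tournament():
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: carry the current best forward
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            # Uniform crossover: each gene comes from either parent.
            child = [g1 if rng.random() < 0.5 else g2 for g1, g2 in zip(p1, p2)]
            # Bit-flip mutation.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best


# Hypothetical toy problem: 12 variables, of which three are "relevant".
RELEVANT = {0, 3, 7}


def toy_fitness(mask):
    """Stand-in fitness: reward relevant variables, penalize subset size."""
    hits = sum(mask[i] for i in RELEVANT)
    return hits - 0.1 * sum(mask)


best = evolve_mask(12, toy_fitness)
selected = [i for i, bit in enumerate(best) if bit]
print("selected variable indices:", selected)
```

Replacing `toy_fitness` with a function that trains a classifier on the masked columns and returns its mean cross-validated F-score would turn this sketch into a wrapper-style selector; that step depends on the modeling library and is left out here.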

Published in BioData Mining

ISSN: 1756-0381 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Mathematics: Analysis
Website: https://biodatamining.biomedcentral.com/
