Iterative Variable Selection for High-Dimensional Data: Prediction of Pathological Response in Triple-Negative Breast Cancer
Juan C. Laria,
M. Carmen Aguilera-Morillo,
Enrique Álvarez,
Rosa E. Lillo,
Sara López-Taruella,
María del Monte-Millán,
Antonio C. Picornell,
Miguel Martín,
Juan Romo
Affiliations
Juan C. Laria
UC3M-BS Santander Big Data Institute, 28903 Getafe, Spain
M. Carmen Aguilera-Morillo
UC3M-BS Santander Big Data Institute, 28903 Getafe, Spain
Enrique Álvarez
Department of Medical Oncology, Hospital General Universitario Gregorio Marañón, Instituto de Investigación Sanitaria Gregorio Marañón, 28007 Madrid, Spain
Rosa E. Lillo
UC3M-BS Santander Big Data Institute, 28903 Getafe, Spain
Sara López-Taruella
Department of Medical Oncology, Hospital General Universitario Gregorio Marañón, Instituto de Investigación Sanitaria Gregorio Marañón, 28007 Madrid, Spain
María del Monte-Millán
Department of Medical Oncology, Hospital General Universitario Gregorio Marañón, Instituto de Investigación Sanitaria Gregorio Marañón, 28007 Madrid, Spain
Antonio C. Picornell
Department of Medical Oncology, Hospital General Universitario Gregorio Marañón, Instituto de Investigación Sanitaria Gregorio Marañón, 28007 Madrid, Spain
Miguel Martín
Department of Medical Oncology, Hospital General Universitario Gregorio Marañón, Instituto de Investigación Sanitaria Gregorio Marañón, 28007 Madrid, Spain
Juan Romo
UC3M-BS Santander Big Data Institute, 28903 Getafe, Spain
Over the last decade, regularized regression methods have offered alternatives for performing multi-marker analysis and feature selection in a whole-genome context. However, the process of defining the list of genes that characterizes an expression profile remains unclear. It currently relies on advanced statistical methods, which can take an agnostic point of view or incorporate a priori knowledge, but overfitting remains a problem. This paper introduces a methodology to address variable selection and model estimation in the high-dimensional setting, which can be particularly useful in the whole-genome context. The methodology is validated on simulated data and on a real dataset from a triple-negative breast cancer study.