Handling Overlapping Asymmetric Data Sets—A Twice Penalized P-Spline Approach

Matthew McTeer; Robin Henderson; Quentin M. Anstee; Paolo Missier

doi:10.3390/math12050777

Mathematics (Mar 2024)

Affiliations

Matthew McTeer: School of Computing, Faculty of Science, Agriculture & Engineering, Newcastle University, Newcastle upon Tyne NE1 7RU, UK
Robin Henderson: School of Mathematics, Statistics and Physics, Faculty of Science, Agriculture & Engineering, Newcastle University, Newcastle upon Tyne NE1 7RU, UK
Quentin M. Anstee: Translational & Clinical Research Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne NE1 7RU, UK
Paolo Missier: School of Computer Science, University of Birmingham, Birmingham B15 2TT, UK

DOI: https://doi.org/10.3390/math12050777
Journal volume & issue: Vol. 12, no. 5, p. 777

Abstract

Aims: Overlapping asymmetric data sets arise when a large cohort of observations has only a small amount of information recorded, and within it a smaller cohort has extensive further information available. Imputing the missing data is unwise if the cohort sizes differ substantially; we therefore aim to develop a way of modelling the smaller cohort whilst taking the larger into account. Methods: Starting from traditional once penalized P-Spline approximations, we create a second penalty term based on discrepancies in the marginal value of covariates that exist in both cohorts. The resulting twice penalized P-Spline is designed firstly to prevent over- or under-fitting of the smaller cohort and secondly to take the larger cohort into account. Results: Across a series of data simulations, penalty parameter tunings, and model adaptations, our twice penalized model offers up to a 58% and 46% improvement in model fit for a continuous and a binary response, respectively, against existing B-Spline and once penalized P-Spline methods. Applying our model to an individual’s risk of developing steatohepatitis, we report an improvement of over 65% over existing methods. Conclusions: We propose a twice penalized P-Spline method which can vastly improve the model fit of overlapping asymmetric data sets for a common predictive endpoint, without the need for missing data imputation.
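
The abstract does not give the exact form of the second penalty, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes an additive model in one shared covariate x and one extra covariate z, a quadratic discrepancy measured on a grid of x values, and a marginal in x obtained by averaging the z component over the smaller cohort. All function names, the simulated data, and the penalty weights are hypothetical.

```python
import numpy as np

def bspline_basis(x, lo, hi, n_seg=10, deg=3):
    """B-spline design matrix on equally spaced knots (Cox-de Boor recursion)."""
    x = np.asarray(x, dtype=float)
    h = (hi - lo) / n_seg
    knots = lo + h * np.arange(-deg, n_seg + deg + 1)
    # Degree-0 basis: indicator of each knot interval.
    B = ((x[:, None] >= knots[:-1]) & (x[:, None] < knots[1:])).astype(float)
    for d in range(1, deg + 1):
        left = (x[:, None] - knots[:-(d + 1)]) / (d * h) * B[:, :-1]
        right = (knots[d + 1:] - x[:, None]) / (d * h) * B[:, 1:]
        B = left + right
    return B

def diff_penalty(k, order=2):
    """Standard P-spline penalty matrix D'D built from finite differences."""
    D = np.diff(np.eye(k), n=order, axis=0)
    return D.T @ D

def fit_once(B, y, lam):
    """Once penalized P-spline: minimise ||y - B a||^2 + lam * ||D a||^2."""
    return np.linalg.solve(B.T @ B + lam * diff_penalty(B.shape[1]), B.T @ y)

def fit_twice(Bx, Bz, y, Gx, g, lam1, lam2, ridge=1e-8):
    """Assumed twice penalized fit on the smaller cohort.

    Minimises ||y - C t||^2 + lam1 * roughness + lam2 * ||M t - g||^2, where
    g is the larger cohort's marginal curve on a grid and M t is the smaller
    cohort model's marginal in x (z component averaged over the cohort).
    """
    C = np.hstack([Bx, Bz])
    kx, kz = Bx.shape[1], Bz.shape[1]
    P = np.zeros((kx + kz, kx + kz))
    P[:kx, :kx] = diff_penalty(kx)
    P[kx:, kx:] = diff_penalty(kz)
    M = np.hstack([Gx, np.tile(Bz.mean(axis=0), (Gx.shape[0], 1))])
    # Tiny ridge: the two additive components share a constant direction.
    A = C.T @ C + lam1 * P + lam2 * (M.T @ M) + ridge * np.eye(kx + kz)
    theta = np.linalg.solve(A, C.T @ y + lam2 * (M.T @ g))
    return theta, M

# Hypothetical simulated cohorts: z is unrecorded in the larger one.
rng = np.random.default_rng(0)
def simulate(n):
    x, z = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + z ** 2 + rng.normal(0, 0.3, n)
    return x, z, y

x_big, _, y_big = simulate(2000)
x_small, z_small, y_small = simulate(60)

grid = np.linspace(0, 1, 50)
Gx = bspline_basis(grid, 0, 1)
g = Gx @ fit_once(bspline_basis(x_big, 0, 1), y_big, lam=1.0)

Bx, Bz = bspline_basis(x_small, 0, 1), bspline_basis(z_small, 0, 1)
theta_once, M = fit_twice(Bx, Bz, y_small, Gx, g, lam1=1.0, lam2=0.0)
theta_twice, _ = fit_twice(Bx, Bz, y_small, Gx, g, lam1=1.0, lam2=10.0)

disc_once = float(np.sum((M @ theta_once - g) ** 2))
disc_twice = float(np.sum((M @ theta_twice - g) ** 2))
```

Setting `lam2=0` recovers an ordinary once penalized additive fit, so comparing `disc_once` with `disc_twice` shows how far the second penalty pulls the smaller cohort's marginal curve towards the larger cohort's. In practice both penalty weights would need tuning, which the abstract mentions but does not detail.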

Published in Mathematics

ISSN: 2227-7390 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science: Mathematics
Website: http://www.mdpi.com/journal/mathematics
