Biomolecules (May 2025)
Robust Method for Confidence Interval Estimation in Outlier-Prone Datasets: Application to Molecular and Biophysical Data
Abstract
Estimating confidence intervals in small or noisy datasets is a recurring challenge in biomolecular research, particularly when data contain outliers or exhibit high variability. This study introduces a robust statistical method that combines a hybrid bootstrap procedure with Steiner’s most frequent value (MFV) approach to estimate confidence intervals without removing outliers or altering the original dataset. The MFV technique identifies the most representative value while minimizing information loss, making it well suited for datasets with limited sample sizes or non-Gaussian distributions. To demonstrate the method’s robustness, we intentionally selected a dataset from outside the biomolecular domain: a fast-neutron activation cross-section of the 109Ag(n, 2n)108mAg reaction from nuclear physics. This dataset presents large uncertainties, inconsistencies, and known evaluation difficulties. Confidence intervals for the cross-section were determined using a method called the MFV–hybrid parametric bootstrapping (MFV-HPB) framework. In this approach, the original data points were repeatedly resampled, and new values were simulated based on their uncertainties before the MFV was calculated. Despite the dataset’s complexity, the method yielded a stable MFV estimate of 709 mb with a 68.27% confidence interval of [691, 744] mb, illustrating the method’s ability to provide interpretable results in challenging scenarios. Although the example is from nuclear science, the same statistical issues commonly arise in biomolecular fields, such as enzymatic kinetics, molecular assays, and diagnostic biomarker studies. The MFV-HPB framework provides a reliable and generalizable approach for extracting central estimates and confidence intervals in situations where data are difficult to collect, replicate, or interpret. Its resilience to outliers, independence from distributional assumptions, and compatibility with small-sample scenarios make it particularly valuable in molecular medicine, bioengineering, and biophysics.
Keywords