Scientific Reports (Aug 2025)
Advanced machine learning framework for thyroid cancer epidemiology in Iran through integration of environmental socioeconomic and health system predictors
Abstract
The global escalation of thyroid cancer (TC) incidence, coupled with pronounced provincial and gender-based disparities in Iran, underscores an urgent public health challenge that remains underexplored through integrative analyses of environmental, socioeconomic, and healthcare factors. This study addresses this gap by employing a multi-model machine learning (ML) framework to elucidate the spatiotemporal determinants of TC incidence across Iran’s 31 provinces, offering insights to inform evidence-based public health strategies. Leveraging data from the Iranian National Population-based Cancer Registry (INPCR) spanning 2014–2017, we synthesized a dataset comprising 55 variables sourced from diverse public repositories. Age-standardized incidence rates (ASRs) were computed and stratified by sex and province, followed by the application of nine ML models for feature selection, including Random Forest, XGBoost, CatBoost, and various regression techniques. The significance of identified predictors was validated using SHAP (SHapley Additive exPlanations) analysis across the Random Forest, XGBoost, and CatBoost models. The analysis revealed considerable variation in TC incidence across the population. The overall four-year ASR was 11.13 per 100,000, with females exhibiting a markedly higher rate of 35.1 per 100,000, significantly exceeding that of males at 9.6 per 100,000. Prominent predictors included Sunshine Duration (SHAP values: −0.046 overall, 0.015 in females, −0.005 in males), Provincial-Education-rates, Elevation in Meter, Laboratory-availability, and community Marriage-rates. Significant provincial disparities were observed in the mean ASR of TC across the entire population, notably exemplified by Yazd’s elevated mean ASR of 9.2 per 100,000 in contrast to Semnan’s markedly lower rate of 1.2 per 100,000 over the period 2014–2017.
The ML models demonstrated moderate to robust predictive accuracy (R²: 0.21–0.86), underscoring distinct sex-specific risk profiles. This pioneering study illuminates the pivotal roles of climatic, socioeconomic, healthcare access, and environmental factors in shaping TC incidence in Iran, revealing significant regional and gender-specific variations. These findings advocate for the development of targeted public health interventions aimed at mitigating environmental exposures and rectifying healthcare disparities, thereby enhancing the precision and efficacy of TC prevention strategies.
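The abstract reports age-standardized incidence rates but does not spell out how they were computed. As a minimal sketch, assuming the common direct standardization method (a weighted mean of age-specific rates, with weights taken from a standard population), the calculation might look like the following. The function name, the three age groups, and all numbers are hypothetical toy values chosen for illustration; they are not INPCR data and not the authors' code.

```python
def age_standardized_rate(cases, population, std_weights, per=100_000):
    """Direct standardization: weighted mean of age-specific rates.

    cases, population, and std_weights each hold one entry per age group.
    std_weights are the standard population's sizes (or shares) per group.
    """
    if not (len(cases) == len(population) == len(std_weights)):
        raise ValueError("all inputs must have one entry per age group")
    total_weight = sum(std_weights)
    # Age-specific rate (cases / person-count) weighted by the standard population.
    weighted = sum(
        w * (c / p) for c, p, w in zip(cases, population, std_weights)
    )
    return per * weighted / total_weight


# Hypothetical toy inputs for three age groups (illustration only).
cases = [2, 10, 8]
population = [100_000, 200_000, 100_000]
std_weights = [50, 30, 20]  # assumed standard-population shares

asr = age_standardized_rate(cases, population, std_weights)
crude = 100_000 * sum(cases) / sum(population)
print(f"ASR: {asr:.2f} per 100,000; crude rate: {crude:.2f} per 100,000")
```

In this toy case the standardized rate differs from the crude rate because the standard population gives more weight to the youngest group, which is why standardized rates are the usual choice when comparing provinces with different age structures.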
Keywords