Scientific Data (Jun 2024)
Machine Learning-Enhanced Extraction of Biomarkers for High-Grade Serous Ovarian Cancer from Proteomics Data
Abstract
Comprehensive biomedical proteomic datasets are accumulating exponentially, calling for robust analytics that can deconvolute them to yield novel biological insights. Here, we report a machine learning (ML)-based feature extraction workflow that we applied to publicly available ovarian cancer tissue and serum proteomics datasets to identify high-performing protein markers for high-grade serous ovarian carcinoma (HGSOC). Diagnosis of HGSOC, an aggressive form of ovarian cancer, currently relies on tissue biopsy and/or non-specific biomarkers such as cancer antigen 125 (CA125) and human epididymis protein 4 (HE4). Our newly developed ML-based approach identified new serum proteomic biomarkers for HGSOC. Verification in two independent cohorts showed that combinations of these markers outperformed known ovarian cancer biomarkers, including clinically used serum markers, achieving an area under the curve (AUC) of >97%. Our analysis also yielded novel biological insights, such as enriched cancer-related processes associated with HGSOC.
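For readers less familiar with the AUC metric quoted above, the following is a minimal sketch of how it can be computed, assuming a binary case/control label and one score per sample. It is not the authors' pipeline: the function name `auc` and all numeric values are hypothetical and invented for illustration only. The AUC equals the fraction of (case, control) pairs in which the case receives the higher score, with ties counted as half.

```python
def auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    labels: 1 for case, 0 for control. Tied scores count as half a win.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for n in controls:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy values, invented for illustration (NOT data from the study).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
single_marker = [0.9, 0.4, 0.7, 0.3, 0.5, 0.2, 0.6, 0.1]  # hypothetical single-marker score
marker_panel = [0.9, 0.8, 0.7, 0.6, 0.5, 0.2, 0.4, 0.1]   # hypothetical combined-panel score

print("single marker AUC:", auc(labels, single_marker))
print("marker panel AUC: ", auc(labels, marker_panel))
```

In this toy setting the combined score separates cases from controls better than the single marker, which is the kind of comparison an AUC-based verification makes. In practice an established implementation, such as a library ROC routine, would normally be preferred over this pairwise loop.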