BMJ Health & Care Informatics (May 2025)
Evaluating machine learning algorithms for predicting HIV status among young Thai men who have sex with men
Abstract
Objective This study aimed to develop machine learning (ML) models to predict HIV status and to assess the factors associated with HIV infection among young men who have sex with men (MSM) under the Universal Health Coverage (UHC) programme in Thailand.
Methods Young MSM aged 15–24 years who underwent HIV testing through the UHC programme from 2015 to 2022 were included. Data were divided into training (70%) and testing (30%) sets, and the Synthetic Minority Oversampling Technique (SMOTE) was applied to address class imbalance in the data set. Five ML models were used to predict HIV infection: logistic regression, k-nearest neighbour (KNN), random forest, extreme gradient boosting (XGB) and AdaBoost.
Results Among 146 813 young MSM, 11% were diagnosed with HIV. Although KNN initially outperformed the other ML models, the sensitivity of all models trained on the original data set was low because of class imbalance. After SMOTE was applied, the XGB model showed the best performance, with an accuracy of 0.72, a sensitivity of 0.73, a specificity of 0.72 and an area under the curve of 0.72. The top predictors of HIV infection were the year of HIV testing (68%), age (55%) and targeted HIV testing (54%).
Discussion This study demonstrates the potential of ML models, particularly XGB, in predicting HIV infection among young MSM in Thailand under the UHC programme. The application of SMOTE improved model sensitivity, addressing data imbalance and enhancing predictive accuracy.
Conclusions ML models have the potential to enhance HIV risk assessment and inform targeted prevention strategies for high-risk populations.
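
The role of class imbalance and SMOTE described above can be illustrated with a minimal sketch. This is not the study's pipeline: the toy labels (11 positives in 100, mirroring only the reported 11% prevalence), the two-dimensional feature points, and the helper names `smote_oversample` and `binary_metrics` are all hypothetical, and the oversampler is a simplified version of SMOTE that picks base points at random. The sketch shows why accuracy alone can look acceptable while sensitivity is zero, and how SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours.

```python
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Simplified SMOTE: build n_new synthetic minority points.

    Each synthetic point lies on the line segment between a randomly
    chosen minority point and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` (squared Euclidean distance)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity from binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Toy data (assumption): 11 positives and 89 negatives.
y_true = [1] * 11 + [0] * 89

# A trivial classifier that always predicts "negative" scores 0.89 accuracy
# and 1.0 specificity, yet detects no positive case (sensitivity 0.0).
baseline = binary_metrics(y_true, [0] * 100)

# Hypothetical 2-D feature points for the 11 minority (positive) cases.
minority = [(float(i), float(i % 3)) for i in range(11)]

# Generate enough synthetic positives to match the 89 negatives.
synthetic = smote_oversample(minority, n_new=89 - 11)
balanced_positive_count = len(minority) + len(synthetic)
```

In practice, oversampling of this kind is applied to the training set only, so that the test set keeps the original class distribution and the reported metrics reflect real-world prevalence.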