Intelligent Medicine (Aug 2024)
Impact of data balancing a multiclass dataset before the creation of association rules to study bacterial vaginosis
Abstract
Background: Bacterial vaginosis is a polymicrobial syndrome in which the homeostasis maintained by the Lactobacillus species that protect the vaginal mucosa has been lost. This study explored data balancing as a way to improve the quality of association rules; specifically, it aimed to balance an imbalanced multiclass dataset before rule creation.
Methods: A preconstructed dataset with 201 observations and 58 variables was analyzed. The authors collected the data between August 2016 and October 2018 in Tabasco, Mexico. The study population comprised sexually active women aged 18 to 50 who underwent gynecological examination at the infectious and metabolic diseases research laboratory at the Universidad Juarez Autonoma de Tabasco. The dataset was balanced with the synthetic minority over-sampling technique (SMOTE), random over-sampling examples (ROSE), and adaptive synthetic sampling approach for imbalanced learning (ADASYN) algorithms, and the random forest algorithm was used to determine the best k-value. The Apriori algorithm created the rules. To select statistically significant rules, the is.redundant(), is.significant(), and is.maximal() functions and the Fisher's exact test quality metric were used. Biological validation was carried out by an expert (bacteriologist).
Results: With the ADASYN algorithm at k = 9, the out-of-bag (OOB) error was zero, making this the best k-value. ADASYN showed the best performance in the balancing process. From the dataset balanced with ADASYN, the Apriori algorithm created the association rules; selection with the Fisher's exact test quality metric followed by biological validation yielded 13 rules. The Gram-negative bacteria Atopobium vaginae, Gardnerella vaginalis, Megasphaera phylotype 1, Mycoplasma hominis, and Ureaplasma parvum were detected by the Apriori algorithm from the balanced dataset.
Conclusion: Balancing the dataset may improve the creation of association rules that efficiently model the bacteria that cause bacterial vaginosis.
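
To illustrate the rule-selection step described in the Methods, the following is a minimal sketch of how a single association rule can be scored with support, confidence, lift, and a one-sided Fisher's exact test. It is not the authors' pipeline: the study names the is.redundant(), is.significant(), and is.maximal() functions, whereas this sketch uses only the Python standard library. The function names (rule_metrics, fisher_exact_greater), the item labels, and the eight toy transactions are hypothetical and chosen only for illustration, and the one-sided alternative is an assumption.

```python
from math import comb

def rule_metrics(transactions, antecedent, consequent):
    """Score the rule antecedent -> consequent on a list of item sets.

    Returns support, confidence, lift, and the 2x2 contingency table
    (both, antecedent_only, consequent_only, neither).
    """
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    both = sum(1 for t in transactions if a <= t and c <= t)
    ante_only = sum(1 for t in transactions if a <= t and not c <= t)
    cons_only = sum(1 for t in transactions if not a <= t and c <= t)
    neither = n - both - ante_only - cons_only

    support = both / n
    n_ante = both + ante_only
    n_cons = both + cons_only
    confidence = both / n_ante if n_ante else 0.0
    lift = confidence / (n_cons / n) if n_cons else 0.0
    return support, confidence, lift, (both, ante_only, cons_only, neither)

def fisher_exact_greater(both, ante_only, cons_only, neither):
    """One-sided Fisher's exact test p-value (assumed alternative: the
    antecedent and consequent co-occur more often than under independence).

    This is the upper tail of the hypergeometric distribution.
    """
    n = both + ante_only + cons_only + neither
    n_ante = both + ante_only
    n_cons = both + cons_only
    upper = min(n_ante, n_cons)
    tail = sum(
        comb(n_ante, k) * comb(n - n_ante, n_cons - k)
        for k in range(both, upper + 1)
    )
    return tail / comb(n, n_cons)

# Hypothetical toy transactions: one set of items per observation.
toy = [
    {"G_vaginalis", "A_vaginae", "BV"},
    {"G_vaginalis", "A_vaginae", "BV"},
    {"G_vaginalis", "BV"},
    {"G_vaginalis", "BV"},
    {"Lactobacillus", "Normal"},
    {"Lactobacillus", "Normal"},
    {"Lactobacillus", "Normal"},
    {"G_vaginalis", "Lactobacillus", "Normal"},
]

support, confidence, lift, table = rule_metrics(toy, {"G_vaginalis"}, {"BV"})
p_value = fisher_exact_greater(*table)
print(support, confidence, lift, p_value)
```

On this toy data the rule {G_vaginalis} -> {BV} has support 0.5 and confidence 0.8, and the one-sided p-value is 5/70 (about 0.071). A rule would be kept only if its p-value fell below a chosen significance level; the threshold used in the study is not stated in the abstract.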