Science and Technology Indonesia (Apr 2024)

LSTM-CNN Hybrid Model Performance Improvement with BioWordVec for Biomedical Report Big Data Classification

  • Dian Kurniasari,
  • Warsono,
  • Mustofa Usman,
  • Favorisen Rosyking Lumbanraja,
  • Wamiliana

DOI
https://doi.org/10.26554/sti.2024.9.2.273-283
Journal volume & issue
Vol. 9, no. 2
pp. 273 – 283

Abstract

Rising mortality from leukemia has driven rapid growth in publications on the disease. This growth has greatly expanded the biomedical literature, making manual extraction of relevant material on leukemia increasingly difficult. Text classification is one approach to retrieving relevant, high-quality information from the biomedical literature. This research proposes an LSTM-CNN hybrid model for imbalanced data classification on a dataset of PubMed abstracts centred on leukemia. Random Undersampling and Random Oversampling are combined to address the data imbalance. The classification model's performance is improved by using BioWordVec, a pre-trained word embedding created specifically for the biomedical domain. Model evaluation indicates that hybrid resampling combined with domain-specific pre-trained word embeddings can enhance classification performance, achieving accuracy, precision, recall, and F1-score of 99.55%, 99%, 100%, and 99%, respectively. The results suggest that the proposed approach could serve as an alternative technique for retrieving information about leukemia.
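
The combination of Random Undersampling and Random Oversampling mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `hybrid_resample`, the `target_per_class` parameter, the seed, and the toy data are all assumptions made for illustration, and the abstract does not state the class sizes or sampling ratios actually used.

```python
import random
from collections import defaultdict


def hybrid_resample(samples, labels, target_per_class, seed=0):
    """Bring every class to target_per_class items (hypothetical helper).

    Classes larger than the target are randomly undersampled;
    classes smaller than the target are randomly oversampled.
    """
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        if len(items) > target_per_class:
            # Random undersampling: keep a random subset, no replacement.
            chosen = rng.sample(items, target_per_class)
        else:
            # Random oversampling: keep every item, then add random
            # duplicates (drawn with replacement) up to the target.
            missing = target_per_class - len(items)
            chosen = items + [rng.choice(items) for _ in range(missing)]
        out_samples.extend(chosen)
        out_labels.extend([label] * target_per_class)
    return out_samples, out_labels


# Toy data (assumed): 10 majority-class and 2 minority-class abstracts.
texts = [f"abstract {i}" for i in range(12)]
labels = ["majority"] * 10 + ["minority"] * 2

balanced_texts, balanced_labels = hybrid_resample(texts, labels, target_per_class=6)
```

Choosing a target between the minority and majority class sizes means the majority class loses fewer samples than pure undersampling would discard, and the minority class gains fewer duplicates than pure oversampling would create.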

Keywords