IEEE Access (Jan 2023)

Addressing Imbalance Problem for Multi Label Classification of Scholarly Articles

  • Aiman Hafeez,
  • Tariq Ali,
  • Asif Nawaz,
  • Saif Ur Rehman,
  • Azhar Imran Mudasir,
  • Abdulaziz A. Alsulami,
  • Ali Alqahtani

DOI
https://doi.org/10.1109/ACCESS.2023.3293852
Journal volume & issue
Vol. 11
pp. 74500 – 74516

Abstract


Scientific document classification is an important area of machine learning. At present, the categories of scientific documents are identified manually. Predefined taxonomies for categorizing scientific documents are available, such as the Association for Computing Machinery Computing Classification System (ACM CCS) and Bibsonomy. These taxonomies help authors select the categories of their manuscripts. Because a research work can draw on a variety of domains, this assignment takes the form of Multi-Label Classification (MLC), in which more than one class can be assigned to a single document. Two distinct approaches are used to address MLC: Problem Transformation and Algorithm Adaptation. Problem transformation converts the multi-label dataset into one or more single-label datasets, whereas algorithm adaptation modifies a single classifier so that it can predict multiple labels. Various document classification techniques have been proposed in the literature, but none of them pays much attention to the problem of imbalance in Multi-Label Datasets (MLDs), even though many effective techniques for dealing with imbalance are available. The goal of this study is to find an effective technique for balancing datasets before multi-label classification, so as to obtain better predictions for the classes with fewer instances. Six MLDs, nine transformation techniques and seven classifiers are evaluated in this research work. The proposed research will result in a more accurate recommendation of a research topic for a document. According to statistical tests, LPROS is the best resampling technique for imbalanced MLDs. Among the classifiers compared, BRkNN performs best for MLC. This research will facilitate the classification of documents into their respective classes, which can be used by various citation indexes.
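
The problem transformation and resampling ideas named in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the function names (binary_relevance, label_powerset, lp_ros), the toy label sets and the round-robin cloning rule are assumptions made for illustration. Here lp_ros is a simplified form of label-powerset random oversampling, which treats each distinct label combination as one class and clones instances of the combinations that occur less often than average.

```python
import random
from collections import defaultdict

def binary_relevance(labelsets, labels):
    """Problem transformation: one binary target vector per label."""
    return {lab: [int(lab in ls) for ls in labelsets] for lab in labels}

def label_powerset(labelsets):
    """Problem transformation: each distinct label combination becomes one class id."""
    class_ids = {}
    targets = []
    for ls in labelsets:
        key = frozenset(ls)
        if key not in class_ids:
            class_ids[key] = len(class_ids)
        targets.append(class_ids[key])
    return targets, class_ids

def lp_ros(labelsets, percent=25, seed=0):
    """Simplified label-powerset random oversampling (illustrative assumption).

    Returns instance indices of the resampled dataset: all original
    instances followed by clones drawn from minority label combinations.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for i, ls in enumerate(labelsets):
        groups[frozenset(ls)].append(i)
    # A combination is a minority if it has fewer instances than the average.
    mean_size = len(labelsets) / len(groups)
    minority = [idx for idx in groups.values() if len(idx) < mean_size]
    n_clones = int(len(labelsets) * percent / 100)
    indices = list(range(len(labelsets)))
    if not minority:
        return indices
    for k in range(n_clones):
        bag = minority[k % len(minority)]  # visit minority combinations in turn
        indices.append(rng.choice(bag))    # clone one of their original instances
    return indices

# Hypothetical toy data: the label set of each of six documents.
docs = [{"ML"}, {"ML"}, {"ML"}, {"ML"}, {"ML", "IR"}, {"DB"}]
lp_targets, lp_classes = label_powerset(docs)
br_targets = binary_relevance(docs, ["ML", "IR", "DB"])
resampled = lp_ros(docs, percent=50)
```

In this toy dataset the combinations {ML, IR} and {DB} each occur once, below the average of two instances per combination, so only they are cloned. A resampled index list of this kind would then be used to build the training set passed to a multi-label classifier.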

Keywords