IEEE Access (Jan 2024)

Accurate Information Type Classification for Software Issue Discussions With Random Oversampling

  • Boburmirzo Muhibullaev,
  • Jindae Kim

DOI
https://doi.org/10.1109/ACCESS.2024.3398732
Journal volume & issue
Vol. 12
pp. 65373 – 65385

Abstract

An Issue Tracking System (ITS) plays a crucial role in software development and provides valuable information for understanding issue management. In an ITS, software developers often discuss issues that are reported during software development. Recent studies have analyzed such issue discussions and identified the information types of the issue comments that appear in them. Automatic classification of these information types can help developers understand and locate required information more easily, but existing techniques cannot classify them accurately. In this study, we propose a more accurate technique for classifying the information types of issue comments. The key to increasing classification performance is employing random oversampling to deal with imbalances among the training instances of different information types. With random oversampling, we trained a logistic regression classifier with hyperparameter tuning and achieved an average F1-score of 0.95, much higher than the 0.53 achieved by the existing technique we compared against. We also considered two other key aspects of the technique to fully investigate the potential performance improvement. First, we expanded an existing issue comment dataset by adding 4,098 more instances, almost doubling its size. Second, we analyzed the influence of hyperparameters on classification performance and found that using values within an appropriate range is important for achieving high performance.
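
The random oversampling step mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name random_oversample, the toy comments, and the labels type_a, type_b, and type_c are hypothetical, and the sketch assumes the simplest variant, in which minority-class instances are duplicated at random until every class is as large as the largest one. Feature extraction, logistic regression, and hyperparameter tuning are not shown.

```python
import random
from collections import Counter, defaultdict


def random_oversample(texts, labels, seed=0):
    """Balance classes by duplicating randomly chosen minority instances.

    Assumed variant: every class is grown to the size of the largest class.
    """
    rng = random.Random(seed)

    # Group the instances by their label.
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    # The largest class sets the target size for all classes.
    target = max(len(items) for items in by_label.values())

    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Sample with replacement to fill the gap up to the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Hypothetical, imbalanced toy training set (4 / 2 / 1 instances).
comments = [
    "a1", "a2", "a3", "a4",  # type_a
    "b1", "b2",              # type_b
    "c1",                    # type_c
]
labels = ["type_a"] * 4 + ["type_b"] * 2 + ["type_c"]

balanced_texts, balanced_labels = random_oversample(comments, labels)
print(Counter(balanced_labels))
```

Oversampling of this kind is normally applied to the training split only, so that duplicated instances cannot leak into the evaluation data. Libraries such as imbalanced-learn offer a ready-made RandomOverSampler that could replace the hand-written function.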

Keywords