Accurate Information Type Classification for Software Issue Discussions With Random Oversampling

Boburmirzo Muhibullaev; Jindae Kim

doi:10.1109/ACCESS.2024.3398732

IEEE Access (Jan 2024)

Affiliations

Boburmirzo Muhibullaev: Department of Computer Science and Engineering, Seoul National University of Science and Technology, Nowon-gu, Seoul, Republic of Korea
Jindae Kim: Department of Computer Science and Engineering, Seoul National University of Science and Technology, Nowon-gu, Seoul, Republic of Korea

DOI: https://doi.org/10.1109/ACCESS.2024.3398732
Journal volume & issue: Vol. 12
pp. 65373 – 65385

Abstract

An Issue Tracking System (ITS) plays a crucial role in software development and provides valuable information for understanding issue management. In an ITS, software developers often discuss issues reported during software development. Recent studies analyzed these issue discussions and identified the information types of the comments that appear in them. Automatic classification of information types can help developers understand and locate the information they need more easily, but existing techniques do not classify accurately enough. In this study, we propose a more accurate technique for classifying the information types of issue comments. The key to increasing classification performance is random oversampling, which addresses the imbalance among training instances of different information types. With random oversampling, we trained a logistic regression classifier with hyperparameter tuning and achieved an average F1-score of 0.95, much higher than the 0.53 of the existing technique used for comparison. We also considered two other key aspects of the technique to fully investigate the potential performance improvement. First, we expanded an existing issue comment dataset by adding 4,098 instances, almost doubling its size. Second, we analyzed the influence of hyperparameters on classification performance and found that using values within an appropriate range is important for achieving high performance.
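
To make the central idea concrete, the following is a minimal sketch of random oversampling in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function name `random_oversample`, the toy comments, and the labels `type_a`, `type_b`, and `type_c` are all hypothetical, and the logistic regression training and hyperparameter tuning steps are not shown.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    class has as many instances as the largest class."""
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_label.values())

    out_samples, out_labels = [], []
    for label, group in by_label.items():
        # Draw the missing instances with replacement from the class itself.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical, imbalanced toy data (not taken from the paper's dataset).
comments = [
    "comment a1", "comment a2", "comment a3", "comment a4",
    "comment b1", "comment b2",
    "comment c1",
]
labels = ["type_a"] * 4 + ["type_b"] * 2 + ["type_c"] * 1

balanced_comments, balanced_labels = random_oversample(comments, labels)
print(Counter(balanced_labels))
```

In a typical setup, oversampling of this kind would be applied only to the training split, so that duplicated comments cannot also appear in the evaluation data.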

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
