Natural Language Processing Application on Commit Messages: A Case Study on HEP Software

Yue Yang; Elisabetta Ronchieri; Marco Canaparo

doi:10.3390/app122110773

Applied Sciences (Oct 2022)

Affiliations

Yue Yang: Department of Statistical Sciences, University of Bologna, 40126 Bologna, Italy
Elisabetta Ronchieri: Department of Statistical Sciences, University of Bologna, 40126 Bologna, Italy
Marco Canaparo: INFN CNAF, 40126 Bologna, Italy

DOI: https://doi.org/10.3390/app122110773
Journal volume & issue: Vol. 12, no. 21, p. 10773

Abstract

Version control and source code management systems, such as GitHub, contain a large amount of unstructured historical information about software projects. Recent studies have introduced Natural Language Processing (NLP) to help software engineers retrieve information from very large collections of unstructured data. In this study, we extend our previous work by enlarging our datasets and broadening the set of machine learning and clustering techniques applied. Our methodology consists of several steps. Starting from the raw commit messages, we employed NLP techniques to build a structured database. We then extracted the main features of the messages and used them as input to different clustering algorithms. Once each entry was labelled, we applied supervised machine learning techniques to build a prediction and classification model that automatically classifies the commit messages of a software project. The model relies on a ground-truth dataset of commit messages obtained from various GitHub projects in the High Energy Physics (HEP) context. The contribution of this paper is two-fold: it proposes a ground-truth database, and it provides a machine learning prediction model that automatically identifies the more change-prone areas of code. Our model obtained a very high average accuracy (0.9590), precision (0.9448), recall (0.9382), and F1-score (0.9360).
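
To make the four reported scores concrete, the following Python sketch computes accuracy, precision, recall, and F1-score from true and predicted commit labels. It is an illustration only and not the authors' evaluation code. It assumes macro-averaging over classes, since the abstract says only "average". The function name `macro_metrics` and the category names `fix`, `feature`, and `docs` are hypothetical and do not come from the paper.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall and F1.

    Macro-averaging (an unweighted mean over classes) is an assumption
    made for this sketch; the abstract does not name the scheme used.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n_classes = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1": sum(f1s) / n_classes,
    }


# Hypothetical labels for six commit messages (not data from the paper).
y_true = ["fix", "fix", "fix", "feature", "feature", "docs"]
y_pred = ["fix", "fix", "feature", "feature", "feature", "docs"]
scores = macro_metrics(y_true, y_pred)
```

With per-class averaging, the F1-score is the mean of the per-class F1 values, so it does not have to lie between the averaged precision and the averaged recall.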

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
