An analytical code quality methodology using Latent Dirichlet Allocation and Convolutional Neural Networks

Shaymaa E. Sorour; Hanan E. Abdelkader; Karam M. Sallam; Ripon K. Chakrabortty; Michael J. Ryan; Amr Abohany

Journal of King Saud University: Computer and Information Sciences (Sep 2022)

Affiliations

Shaymaa E. Sorour: Faculty of Specific Education, Kafrelsheikh University, Egypt
Hanan E. Abdelkader: Faculty of Specific Education, Mansoura University, Egypt
Karam M. Sallam: Faculty of Science and Information Technology, University of Canberra, Australia; Faculty of Computers and Informatics, Zagazig University, Egypt; Corresponding author at: Faculty of Computers and Informatics, Zagazig University, Egypt.
Ripon K. Chakrabortty: School of Engineering and IT, University of New South Wales, Canberra, Australia
Michael J. Ryan: School of Engineering and IT, University of New South Wales, Canberra, Australia
Amr Abohany: Faculty of Computing and Information, Kafrelsheikh University, Egypt

Journal volume & issue: Vol. 34, no. 8, pp. 5979–5997

Abstract

Code Quality (CQ) has recently become critical across a wide range of organizations, from academia to industry. CQ, in terms of readability, security, and testability, is a major goal throughout the software development process because it affects overall Software Quality (SQ) in terms of subsequent releases, maintenance, and updates. It is particularly important for the development of safety-critical systems. Existing studies on CQ have several shortcomings: they are based on incomplete information about the source code and tend to focus on only one feature, which is then likely to determine the performance of the model. Moreover, these limitations often prevent high accuracy because there is no strong relationship between the input data and the output data. Thus, it is necessary to design an effective and efficient SQ measurement system that measures multiple quality factors. To that end, we propose a deep learning framework that combines Latent Dirichlet Allocation (LDA) with Convolutional Neural Networks (CNN), called CNN-LDA, to classify input data into topics that are related to CQ features and to identify hidden patterns and correlations in programming data. Three SQ metrics (i.e., readability, security, and testability) and machine learning techniques (e.g., random forest (RF) and support vector machine (SVM)) are taken into account to validate the proposed model. The proposed CNN-LDA outperformed its peers across the vast majority of datasets examined. The average overall F-measures for readability, security, and testability are 94%, 94%, and 93%, respectively. The average overall accuracies for readability, security, and testability are 93%, 93%, and 92%, respectively. The superiority of CNN-LDA over the other classifiers was clear based on Wilcoxon's non-parametric statistical test (α = 0.05).
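
The abstract reports results as F-measure and accuracy. As a reminder of what those two figures summarize, the following is a minimal sketch of how they are commonly computed for a binary classifier. It is not the authors' evaluation code: the function names and the label vectors are hypothetical, and the sketch assumes the standard F1 definition (harmonic mean of precision and recall), which the abstract does not spell out.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels, e.g. 1 = "readable" and 0 = "not readable".
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f} accuracy={acc:.2f}")
```

An "average overall" figure such as those quoted above would then presumably be a mean of such per-dataset scores, though the abstract does not state the averaging scheme.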

Published in Journal of King Saud University: Computer and Information Sciences

ISSN: 1319-1578 (Print)
Publisher: Elsevier
Country of publisher: Saudi Arabia
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://www.journals.elsevier.com/journal-of-king-saud-university-computer-and-information-sciences/
