A systematic mapping study of source code representation for deep learning in software engineering

Hazem Peter Samoaa; Firas Bayram; Pasquale Salza; Philipp Leitner

doi:10.1049/sfw2.12064

IET Software (Aug 2022)

Affiliations

Hazem Peter Samoaa: Software Engineering and Interaction Design Division Chalmers | University of Gothenburg Gothenburg Sweden
Firas Bayram: Department of Mathematics and Computer Science Karlstad University Karlstad Sweden
Pasquale Salza: Software Evolution & Architecture Lab University of Zurich Zurich Switzerland
Philipp Leitner: Software Engineering and Interaction Design Division Chalmers | University of Gothenburg Gothenburg Sweden

DOI: https://doi.org/10.1049/sfw2.12064
Journal volume & issue: Vol. 16, no. 4
pp. 351 – 385

Abstract

The use of deep learning (DL) approaches for software engineering has attracted much attention, particularly in source code modelling and analysis. However, in order to use DL, source code needs to be formatted to fit the expected input form of DL models. This problem is known as source code representation. Source code can be represented via different approaches, most importantly the tree‐based, token‐based, and graph‐based approaches. We use a systematic mapping study to investigate in detail the representation approaches adopted in 103 studies that use DL in the context of software engineering. These studies were collected from 14 different journals and 27 conferences and span the years 2014 to 2021. We show that each way of representing source code can provide a different and orthogonal view of the same source code. Thus, different software engineering tasks might require different (combinations of) code representation approaches, depending on the nature and complexity of the task. In particular, we show that it is crucial to define whether the DL approach requires lexical, syntactical, or semantic code information. Our analysis shows that a wide range of different representations and combinations of representations (hybrid representations) are used to solve a wide range of common software engineering problems. However, we also observe that current research does not generally attempt to transfer existing representations or models to other studies, even though there are other contexts in which these representations and models may also be useful. We believe that there is potential for more reuse and for the application of transfer learning when applying DL to software engineering tasks.
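
As an illustration of the three representation families named in the abstract, the following minimal sketch derives a token-based, a tree-based, and a graph-style view of the same snippet using only the Python standard library. It is not taken from the article: the function names `token_view`, `tree_view`, and `graph_view` and the sample `SOURCE` are hypothetical, and the "graph" here is simply the set of parent-to-child AST edges, a strong simplification of the control-flow or data-flow graphs that graph-based approaches typically use.

```python
from __future__ import annotations

import ast
import io
import tokenize

# Hypothetical sample input; any valid Python source would do.
SOURCE = "def add(a, b):\n    return a + b\n"

# Token types that carry layout only and are dropped from the lexical view.
_LAYOUT = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}


def token_view(source: str) -> list[str]:
    """Token-based (lexical) view: the flat sequence of token strings."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type not in _LAYOUT]


def tree_view(source: str) -> list[str]:
    """Tree-based (syntactic) view: AST node type names, breadth-first."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def graph_view(source: str) -> list[tuple[str, str]]:
    """Graph-style view (assumption: AST parent->child edges only)."""
    edges = []
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges


if __name__ == "__main__":
    print(token_view(SOURCE))
    print(tree_view(SOURCE))
    print(graph_view(SOURCE))
```

The three functions consume the same input but expose different information: the token view keeps only lexical order, the tree view exposes syntactic structure, and the edge list is a starting point onto which semantic edges (for example, data flow) could be added.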

Published in IET Software

ISSN: 1751-8806 (Print); 1751-8814 (Online)
Publisher: Hindawi-IET
Country of publisher: United Kingdom
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science: Computer software
Website: https://ietresearch.onlinelibrary.wiley.com/journal/ietsfw
