Complex Cases of Source Code Authorship Identification Using a Hybrid Deep Neural Network

Anna Kurtukova; Aleksandr Romanov; Alexander Shelupanov; Anastasia Fedotova

doi:10.3390/fi14100287

Future Internet (Sep 2022)

Affiliations

Anna Kurtukova: Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia
Aleksandr Romanov: Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia
Alexander Shelupanov: Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia
Anastasia Fedotova: Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia

DOI: https://doi.org/10.3390/fi14100287
Journal volume & issue: Vol. 14, no. 10, p. 287

Abstract

This paper continues our previous work on source code authorship identification. Analyzing heterogeneous source code is relevant to copyright protection in commercial software development, where the specifics of the development process and the use of collaborative development tools (version control systems) produce source code written to different programming standards by teams of programmers with different skill levels. Another application field is information security, in particular identifying the authors of computer viruses. We apply our technique, based on a hybrid of the Inception-v1 and Bidirectional Gated Recurrent Unit architectures, to heterogeneous source code and consider the complex cases most common in commercial development that negatively affect authorship identification. The paper examines the possibilities and limitations of the authors' technique in these complex cases. When a programmer was proficient in two programming languages, the average accuracy was 87%; for proficiency in three or more, it was 76%. For artificially generated source code, the average accuracy was 81.5%. Finally, the average accuracy for source code generated from commits was 84%. A comparison with state-of-the-art approaches showed that the proposed method has no full-functionality analogs covering actual practical cases.

Published in Future Internet

ISSN: 1999-5903 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: http://www.mdpi.com/journal/futureinternet/
