Code stylometry vs formatting and minification

Stefano Balla; Maurizio Gabbrielli; Stefano Zacchiroli

doi:10.7717/peerj-cs.2142

PeerJ Computer Science (Sep 2024)

Affiliations

Stefano Balla: DISI, University of Bologna, Bologna, Italy
Maurizio Gabbrielli: DISI, University of Bologna, Bologna, Italy
Stefano Zacchiroli: LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France

DOI: https://doi.org/10.7717/peerj-cs.2142
Journal volume & issue: Vol. 10, p. e2142

Abstract

The automatic identification of code authors based on their programming styles, known as authorship attribution or code stylometry, has become possible in recent years thanks to improvements in machine learning-based techniques for author recognition. Once feasible at scale, code stylometry can be used for well-intended or malevolent activities, including: identifying the most expert coworker on a piece of code (if authorship information goes missing); fingerprinting open source developers to pitch them unsolicited job offers; de-anonymizing developers of illegal software to pursue them. Depending on their respective goals, stakeholders have an interest in making code stylometry either more or less effective. To inform these decisions we investigate how the accuracy of code stylometry is impacted by two common software development activities: code formatting and code minification. We perform code stylometry on Python code from the Google Code Jam dataset (59 authors) using a code2vec-based author classifier on concrete syntax tree (CST) representations of input source files. We conduct the experiment using both CSTs and ASTs (abstract syntax trees). We compare the respective classification accuracies on: (1) the original dataset, (2) the dataset formatted with Black, and (3) the dataset minified with Python Minifier. Our results show that: (1) CST-based stylometry performs better than AST-based stylometry (51%→68%); (2) code formatting makes a significant dent of 15 percentage points in code stylometry accuracy (68%→53%), with minification subtracting a further 3 percentage points (68%→50%). While the accuracy reduction is significant for both code formatting and minification, neither is enough to make developers non-recognizable via code stylometry.
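The difference between the two tree representations mentioned in the abstract can be illustrated with a minimal sketch using only the Python standard library. It is not the authors' pipeline: the two snippets are hypothetical, and the token stream from `tokenize` is used here only as a rough stand-in for the layout detail that a concrete syntax tree retains and an abstract syntax tree discards.

```python
import ast
import io
import tokenize

# Two hypothetical snippets: same logic, different personal layout habits.
STYLE_A = "def add(a,b):\n    return a+b\n"
STYLE_B = "def add( a, b ):\n\n    return (a + b)\n"


def ast_fingerprint(source):
    """Dump the abstract syntax tree; spacing, blank lines and
    redundant parentheses are not part of it."""
    return ast.dump(ast.parse(source))


def token_fingerprint(source):
    """List the concrete tokens in order (assumed proxy for CST detail):
    redundant parentheses and blank lines still show up here."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens]


# The abstract view cannot tell the two styles apart...
same_ast = ast_fingerprint(STYLE_A) == ast_fingerprint(STYLE_B)
# ...while the concrete view still can.
same_tokens = token_fingerprint(STYLE_A) == token_fingerprint(STYLE_B)

print("identical ASTs:  ", same_ast)
print("identical tokens:", same_tokens)
```

Under these assumptions, the stylistic signal that separates the two snippets lives entirely in the concrete layer, which is also the layer that a formatter or a minifier rewrites.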

Published in PeerJ Computer Science

ISSN: 2376-5992 (Online)
Publisher: PeerJ Inc.
Country of publisher: United States
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://peerj.com/computer-science/
