Studia Universitatis Babes-Bolyai: Series Informatica (Dec 2023)

Deobfuscating JavaScript Code Using Character-Based Tokenization

  • Alexandru-Gabriel SÎRBU

DOI
https://doi.org/10.24193/subbi.2023.2.01
Journal volume & issue
Vol. 68, no. 2

Abstract

Deployed JavaScript code typically goes through minification, in which variables are renamed to single-character names and whitespace is removed so that files are smaller and load faster. As a result, the code becomes hard to read and to analyze manually. Since JavaScript experts can still understand minified code, machine learning approaches to deobfuscating it are plausible. We therefore propose a technique, based on a Sequence-to-Sequence model, that finds a fitting name for each obfuscated variable, one that is intuitive and meaningful given how the variable is used. The model generates the name character by character so that it can cover all possible variable names. The proposed approach achieves an average exact name generation accuracy of 70.53%, outperforming the state of the art by 12%.
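As an illustration of what character-based tokenization of variable names might look like, the following is a minimal Python sketch, not the authors' implementation. The special tokens, the ASCII-only character set, the maximum length of 16, and the helper names `encode_name` and `decode_name` are all assumptions made for this example; the abstract does not specify the actual vocabulary or model inputs.

```python
import string

# Hypothetical special tokens; the vocabulary used in the paper is not given in the abstract.
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"

# Assumption: restrict to characters common in ASCII JavaScript identifiers.
CHARS = string.ascii_letters + string.digits + "_$"
VOCAB = [PAD, SOS, EOS] + list(CHARS)
CHAR_TO_ID = {token: index for index, token in enumerate(VOCAB)}


def encode_name(name, max_len=16):
    """Turn a variable name into a fixed-length list of character ids."""
    ids = [CHAR_TO_ID[SOS]] + [CHAR_TO_ID[ch] for ch in name] + [CHAR_TO_ID[EOS]]
    if len(ids) > max_len:
        raise ValueError(f"name too long for max_len={max_len}: {name!r}")
    # Right-pad so every encoded name has the same length.
    return ids + [CHAR_TO_ID[PAD]] * (max_len - len(ids))


def decode_name(ids):
    """Rebuild a name from character ids, stopping at the end-of-sequence token."""
    chars = []
    for index in ids:
        token = VOCAB[index]
        if token == EOS:
            break
        if token in (PAD, SOS):
            continue
        chars.append(token)
    return "".join(chars)


if __name__ == "__main__":
    encoded = encode_name("userCount")
    print(encoded)
    print(decode_name(encoded))
```

Because the output vocabulary consists of individual characters rather than whole names, a decoder that emits one id per step can in principle produce any identifier over that character set, including names never seen during training.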

Keywords