网络与信息安全学报 (Jun 2024)

Automated deobfuscation and family classification system for Excel 4.0 macros

  • Chenguang LI, Xiuzhang YANG, Guojun PENG

DOI
https://doi.org/10.11959/issn.2096-109x.2024040
Journal volume & issue
Vol. 10, no. 3
pp. 66 – 80

Abstract

In recent years, cyber-attacks that leverage malicious Excel 4.0 macros (XLM) embedded in documents have surged. Malicious XLM code is often heavily obfuscated, which makes it difficult for conventional analysis methods and detection systems to determine the actual functionality of large numbers of samples. To counter the diverse obfuscation strategies used in malicious samples, an automated system named XLMRevealer was developed to deobfuscate XLM code and extract key Indicators of Compromise (IOCs). XLMRevealer was built on abstract syntax trees and execution simulation, and it includes 138 macro function handlers. On this basis, Word and Token features tailored to the characteristics of XLM code were extracted and fused to capture multi-level, fine-grained features. XLMRevealer then used a CNN-BiLSTM model to capture family-level correlations across dimensions and perform family classification. Finally, a dataset of 2,346 samples from five distinct sources was constructed for the deobfuscation and family classification experiments. The results showed that XLMRevealer achieved a deobfuscation success rate of 71.3%, outperforming XLMMacroDeobfuscator and SYMBEXCEL by 20.8% and 15.8%, respectively. Its efficiency was stable, with an average processing time of only 0.512 seconds. The family classification accuracy on deobfuscated XLM code was 94.88%, surpassing all baseline models and demonstrating the effectiveness of fusing Word and Token features. Furthermore, to assess the impact of deobfuscation on family classification while accounting for differences in obfuscation techniques across families, experiments were also conducted on the original XLM code and on uniformly obfuscated XLM code. The accuracies were 89.58% and 53.61%, respectively, which shows that the model can learn obfuscation features and confirms that deobfuscation significantly improves family classification.
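The combination of abstract syntax trees and execution simulation described in the abstract can be illustrated with a minimal Python sketch. It is not XLMRevealer's implementation: the tokenizer, the `Parser` class, the `HANDLERS` table, and the `deobfuscate` function are hypothetical names introduced here, and only three handlers (`CHAR`, `LEN`, `MID`) are emulated rather than the 138 the abstract reports. The sketch assumes a formula made of string and number literals, `+`, `-`, `&`, and function calls.

```python
import re

# Hypothetical token pattern: string literal, number, identifier, or single-char operator.
TOKEN_RE = re.compile(
    r'\s*(?:("(?:[^"]|"")*")|(\d+(?:\.\d+)?)|([A-Za-z_][A-Za-z0-9_.]*)|(.))'
)

def tokenize(formula):
    tokens = []
    for string, number, name, op in TOKEN_RE.findall(formula.lstrip("=")):
        if string:
            tokens.append(("STR", string[1:-1].replace('""', '"')))
        elif number:
            tokens.append(("NUM", int(number) if number.isdigit() else float(number)))
        elif name:
            tokens.append(("NAME", name.upper()))
        elif op.strip():  # skip stray whitespace
            tokens.append(("OP", op))
    return tokens

class Parser:
    """Recursive-descent parser producing a tuple-based AST."""

    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("EOF", None)

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def parse_concat(self):  # '&' binds more loosely than '+' and '-'
        node = self.parse_sum()
        while self.peek() == ("OP", "&"):
            self.take()
            node = ("binop", "&", node, self.parse_sum())
        return node

    def parse_sum(self):
        node = self.parse_atom()
        while self.peek() in (("OP", "+"), ("OP", "-")):
            op = self.take()[1]
            node = ("binop", op, node, self.parse_atom())
        return node

    def parse_atom(self):
        kind, value = self.take()
        if kind in ("STR", "NUM"):
            return ("lit", value)
        if kind == "NAME" and self.peek() == ("OP", "("):
            self.take()
            args = []
            while self.peek() != ("OP", ")"):
                args.append(self.parse_concat())
                if self.peek() == ("OP", ","):
                    self.take()
            self.take()  # consume ')'
            return ("call", value, args)
        if (kind, value) == ("OP", "("):
            node = self.parse_concat()
            self.take()  # consume ')'
            return node
        raise ValueError(f"unexpected token: {value!r}")

# Illustrative handler table; a real system would need far more handlers.
HANDLERS = {
    "CHAR": lambda code: chr(int(code)),
    "LEN": lambda text: len(str(text)),
    "MID": lambda text, start, count: str(text)[int(start) - 1:int(start) - 1 + int(count)],
}

def evaluate(node):
    """Simulate execution by walking the AST bottom-up."""
    kind = node[0]
    if kind == "lit":
        return node[1]
    if kind == "binop":
        _, op, left, right = node
        left, right = evaluate(left), evaluate(right)
        if op == "&":
            return str(left) + str(right)
        return left + right if op == "+" else left - right
    _, name, args = node  # kind == "call"
    if name not in HANDLERS:
        raise NotImplementedError(f"no handler for {name}")
    return HANDLERS[name](*[evaluate(arg) for arg in args])

def deobfuscate(formula):
    return evaluate(Parser(tokenize(formula)).parse_concat())

print(deobfuscate('=CHAR(99+5)&"ttp"&MID("xx://yy",3,3)'))
```

In this toy formula, a string that was split across `CHAR`, a literal, and `MID` is folded back into a single value, which is the kind of recovered string from which an IOC such as a URL could then be extracted. Unknown functions raise `NotImplementedError`, one simple way a sketch like this could mark a sample as a deobfuscation failure.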

Keywords