Jisuanji kexue (Apr 2023)

Binary Code Similarity Detection Method Based on Pre-training Assembly Instruction Representation

  • WANG Taiyan, PAN Zulie, YU Lu, SONG Jingbin

DOI
https://doi.org/10.11896/jsjkx.220300271
Journal volume & issue
Vol. 50, no. 4
pp. 288 – 297

Abstract

Binary code similarity detection has been widely used in recent years in vulnerability search, malware detection, advanced program analysis, and other fields. Because program code resembles natural language to some degree, researchers have begun to apply pre-training and other natural language processing techniques to improve accuracy. A binary code similarity detection method based on pre-trained assembly instruction representation is proposed to address the accuracy bottleneck caused by insufficient consideration of instruction probability features. The method comprises a tokenization scheme for multi-architecture assembly instructions and pre-training tasks that take into account control flow, data flow, instruction logic, and probability of occurrence, yielding better vectorized representations of instructions. The downstream binary code similarity detection task is then improved by incorporating the pre-training method to boost accuracy. Experiments show that, compared with existing methods, the proposed method improves instruction representation performance by up to 23.7%, and improves block search ability and similarity detection performance by up to 33.97% and 400%, respectively.

Keywords