Unleashing the power of pseudo-code for binary code similarity analysis

Weiwei Zhang; Zhengzi Xu; Yang Xiao; Yinxing Xue

doi:10.1186/s42400-022-00121-0

Cybersecurity (Dec 2022)

Affiliations

Weiwei Zhang: School of Computer Science and Engineering, University of Science and Technology of China
Zhengzi Xu: School of Computer Science and Engineering, Nanyang Technological University
Yang Xiao: Institute of Information Engineering, Chinese Academy of Sciences
Yinxing Xue: School of Computer Science and Engineering, University of Science and Technology of China

DOI: https://doi.org/10.1186/s42400-022-00121-0
Journal volume & issue: Vol. 5, no. 1, pp. 1–18

Abstract

Code similarity analysis has become more popular due to its significant applications, including vulnerability detection, malware detection, and patch analysis. Since the source code of software is difficult to obtain under most circumstances, binary-level code similarity analysis (BCSA) has received considerable attention. In recent years, many BCSA studies incorporating AI techniques have focused on deriving semantic information from binary functions, using code representations such as assembly code, intermediate representations, and control flow graphs to measure similarity. However, because of differences in compilers, architectures, and obfuscations, binaries compiled from the same source code may vary considerably, which is a major obstacle to obtaining robust features in these works. In this paper, we propose a solution named UPPC (Unleashing the Power of Pseudo-code), which takes the pseudo-code of a binary function as input to address the binary code similarity analysis challenge, since pseudo-code has a higher level of abstraction than binary instructions and is platform-independent. UPPC selectively inlines functions to capture the full function semantics across different compiler optimization levels and uses a deep pyramidal convolutional neural network to obtain the semantic embedding of the function. We evaluated UPPC on a data set containing vulnerabilities and on a data set covering different architectures (X86, ARM), different optimization options (O0-O3), different compilers (GCC, Clang), and four obfuscation strategies. The experimental results show that the accuracy of UPPC in function search is 33.2% higher than that of existing methods.
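To make the function search setting concrete, the sketch below ranks candidate functions by the similarity of their embeddings to a query embedding. It is only an illustration under stated assumptions: the abstract does not specify the similarity measure, so cosine similarity is assumed here, and the vectors, function names, and helpers (`cosine_similarity`, `rank_candidates`) are hypothetical placeholders rather than outputs or components of UPPC.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumed metric, not from the paper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings; a real system would obtain them from a trained model.
query_embedding = [0.9, 0.1, 0.3]
candidate_embeddings = {
    "func_a_arm_O2": [0.8, 0.2, 0.35],
    "func_b_x86_O0": [-0.5, 0.9, 0.1],
    "func_c_x86_O3": [0.1, 0.1, 0.95],
}

ranking = rank_candidates(query_embedding, candidate_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this setting, a search counts as successful when the candidate compiled from the same source as the query appears at or near the top of the ranking, even though it was built for a different architecture or optimization level.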

Published in Cybersecurity

ISSN: 2523-3246 (Online)
Publisher: SpringerOpen
Country of publisher: Singapore
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://cybersecurity.springeropen.com/
