Feature Extraction Methods for Binary Code Similarity Detection Using Neural Machine Translation Models

Norimitsu Ito; Masaki Hashimoto; Akira Otsuka

doi:10.1109/ACCESS.2023.3316215

IEEE Access (Jan 2023)

Affiliations

Norimitsu Ito: Police Info-Communications Research Center, National Police Academy, Fuchu, Tokyo, Japan
Masaki Hashimoto: Institute of Information Security, Yokohama, Kanagawa, Japan
Akira Otsuka: Institute of Information Security, Yokohama, Kanagawa, Japan

DOI: https://doi.org/10.1109/ACCESS.2023.3316215
Journal volume & issue: Vol. 11, pp. 102796–102805

Abstract

Binary code similarity detection is an effective analysis technique for detecting vulnerabilities, bugs, and plagiarism in software whose source code is unavailable. The recent proliferation of IoT devices has also increased the demand for similarity detection across different architectures. However, few studies have applied feature extraction based on neural machine translation (NMT) models to similarity detection at the basic block level across different architectures. In this research, we propose new methods that extract features faster and detect cross-architecture similarities more accurately than existing NMT-based methods for basic block feature extraction. We assume that the intermediate representation of an NMT model trained to translate basic blocks between architectures captures the semantics of the instructions in each basic block, and we therefore adopt this intermediate representation as the basic block features. We then apply the linear transformation used in bilingual word embedding to align the embedding spaces of basic blocks from different architectures. This enables similarity detection at the basic block level across architectures with higher accuracy than the distance learning method that existing research uses to align the embedding spaces. In the evaluation experiments, we compare Precision at k (P@k) against existing methods on the same dataset, and our method achieves the highest accuracy, 92%. We also compare the time required for feature extraction on GPUs and find that our method is up to 16 times faster.
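
The alignment and evaluation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say which linear transformation is used, so the sketch assumes an orthogonal map fitted by the Procrustes solution, one common choice in bilingual word embedding. The embeddings are synthetic stand-ins for NMT intermediate representations, and the names fit_orthogonal_map and precision_at_k are hypothetical. P@k is assumed here to mean the fraction of query blocks whose true counterpart appears among the k most similar candidates.

```python
import numpy as np


def fit_orthogonal_map(src, tgt):
    """Return the orthogonal W minimizing ||src @ W - tgt||_F (Procrustes)."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def precision_at_k(queries, candidates, k):
    """Fraction of queries whose true match (same row index) is in the top k.

    Candidates are ranked by cosine similarity to each query.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = q @ c.T
    top_k = np.argsort(-sims, axis=1)[:, :k]
    truth = np.arange(len(queries))[:, None]
    return float(np.mean(np.any(top_k == truth, axis=1)))


# Synthetic data (assumption): architecture B embeddings are a rotated,
# slightly noisy copy of architecture A embeddings.
rng = np.random.default_rng(0)
n, d = 200, 16
arch_a = rng.normal(size=(n, d))
rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))
arch_b = arch_a @ rotation + 0.01 * rng.normal(size=(n, d))

# Fit the map on the first 100 paired blocks, evaluate on the remaining 100.
W = fit_orthogonal_map(arch_a[:100], arch_b[:100])
p_before = precision_at_k(arch_a[100:], arch_b[100:], k=1)
p_after = precision_at_k(arch_a[100:] @ W, arch_b[100:], k=1)
print(f"P@1 without alignment: {p_before:.2f}, with alignment: {p_after:.2f}")
```

On this toy data the aligned embeddings should retrieve the correct counterpart far more often than the unaligned ones. These numbers only demonstrate the mechanics and say nothing about the results reported in the paper.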

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
