Mathematical Biosciences and Engineering (May 2021)

Cross-platform binary code similarity detection based on NMT and graph embedding

  • Xiaodong Zhu,
  • Liehui Jiang,
  • Zeng Chen

DOI
https://doi.org/10.3934/mbe.2021230
Journal volume & issue
Vol. 18, no. 4
pp. 4528 – 4551

Abstract

Cross-platform binary code similarity detection determines whether a pair of binary functions from different platforms are similar, and it plays an important role in many areas. Traditional methods compute similarity by intersecting platform-independent characteristic strands or by matching control flow graphs (CFGs), and they fall short in efficiency and scalability. Existing deep-learning-based methods improve efficiency but have low accuracy and still rely on manually constructed features. To address these problems, this manuscript proposes a cross-platform binary code similarity detection method based on neural machine translation (NMT) and graph embedding. We train an NMT model and a graph embedding model to automatically extract two parts of the semantics of the binary code and represent the code as a high-dimensional vector, called an embedding. The similarity of two binary functions can then be measured by the distance between their corresponding embeddings. We implement a prototype named SimInspector. Our comparative experiments show that SimInspector outperforms the state-of-the-art approach, Gemini, by about 6% in similarity detection accuracy while maintaining good efficiency.
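
The final step, comparing two functions by the distance between their embeddings, can be illustrated with a minimal sketch. The abstract does not state which distance measure SimInspector uses, so cosine similarity is an assumption here. The function names, the 0.8 threshold, and the toy three-dimensional vectors are likewise hypothetical, chosen only for illustration and not taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors.

    Assumption: cosine similarity stands in for the unspecified
    "distance between embeddings" mentioned in the abstract.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def is_similar(a: Sequence[float], b: Sequence[float],
               threshold: float = 0.8) -> bool:
    """Treat two functions as similar when their embeddings are close.

    The threshold is a hypothetical value, not one reported by the authors.
    """
    return cosine_similarity(a, b) >= threshold


# Toy embeddings (made-up numbers, far smaller than real embeddings).
emb_x86 = [0.9, 0.1, 0.4]     # hypothetical embedding of a function built for x86
emb_arm = [0.8, 0.2, 0.5]     # hypothetical embedding of the same function built for ARM
emb_other = [-0.3, 0.9, -0.1]  # hypothetical embedding of an unrelated function

print(is_similar(emb_x86, emb_arm))
print(is_similar(emb_x86, emb_other))
```

In a real pipeline the vectors would come from the trained NMT and graph embedding models rather than being written by hand, and the threshold would be tuned on labeled function pairs.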

Keywords