IEEE Access (Jan 2019)
Protocol Specification Extraction Based on Contiguous Sequential Pattern Algorithm
Abstract
As newly emerging applications and their malicious behaviors drive up Internet traffic, the volume of traffic that must be analyzed is growing rapidly. Many of the protocols observed in this traffic are unknown and undocumented. Efficient network management and security require a deep understanding of these protocols. Although many protocol reverse engineering methods have been introduced in the literature, there is still no single standardized method that completely extracts a protocol specification, and each existing method has limitations. In this paper, we propose a novel protocol reverse engineering method that extracts an intuitive and clear protocol specification. The proposed method extracts field formats, message formats, and flow formats as protocol syntax by applying a contiguous sequential pattern algorithm hierarchically three times and by defining four types of field formats. Moreover, the proposed method can extract protocol semantics and a protocol finite state machine. The proposed method compresses the input messages into a small number of message formats, so that the structure of an unknown protocol can be identified easily. We implemented the method in a prototype system and evaluated how well it infers the message formats of HTTP (a text protocol) and DNS (a binary protocol). The experimental results show that the proposed method infers HTTP with 100% correctness and 99% coverage. For DNS, it achieves 100% correctness and 100% coverage.
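To illustrate the kind of mining the abstract refers to, the following is a minimal sketch of a level-wise (Apriori-style) contiguous sequential pattern search over byte strings. It is an illustration under stated assumptions, not the authors' implementation: the function names `frequent_contiguous_patterns` and `maximal_patterns`, the `min_support` parameter (a fraction of messages), and the final maximal-pattern filter are all hypothetical choices made for this example.

```python
from collections import defaultdict


def frequent_contiguous_patterns(messages, min_support):
    """Return {pattern: message_count} for every contiguous byte substring
    that occurs in at least `min_support` (a fraction) of the messages.

    Hypothetical helper for illustration; not taken from the paper.
    """
    threshold = min_support * len(messages)
    frequent = {}
    previous = None  # frequent patterns of length k - 1
    k = 1
    while True:
        counts = defaultdict(int)
        for msg in messages:
            seen = set()  # count each pattern at most once per message
            for i in range(len(msg) - k + 1):
                cand = msg[i:i + k]
                # Apriori pruning: both (k-1)-length sub-patterns must be frequent.
                if previous is not None and (
                    cand[:-1] not in previous or cand[1:] not in previous
                ):
                    continue
                seen.add(cand)
            for cand in seen:
                counts[cand] += 1
        current = {p for p, c in counts.items() if c >= threshold}
        if not current:
            break
        for p in current:
            frequent[p] = counts[p]
        previous = current
        k += 1
    return frequent


def maximal_patterns(frequent):
    """Keep only patterns that are not contained in a longer frequent pattern."""
    return {
        p: c
        for p, c in frequent.items()
        if not any(p != q and p in q for q in frequent)
    }
```

Applied to the three toy messages `b"GET /a HTTP/1.1"`, `b"GET /b HTTP/1.1"`, and `b"GET /c HTTP/1.1"` with `min_support=1.0`, the maximal patterns are `b"GET /"` and `b" HTTP/1.1"`, which loosely corresponds to separating static fields from a variable one. The hierarchical three-stage use of the algorithm and the four field format types described above are not modeled here.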
Keywords