Training a language model to learn the syntax of commands

Zafar Hussain; Jukka K. Nurminen; Perttu Ranta-aho

Array (Sep 2024)

Training a language model to learn the syntax of commands

Zafar Hussain,
Jukka K. Nurminen,
Perttu Ranta-aho

Affiliations

Zafar Hussain: University of Helsinki, Pietari Kalmin katu 5, Helsinki, 00560, Finland; Corresponding author.
Jukka K. Nurminen: University of Helsinki, Pietari Kalmin katu 5, Helsinki, 00560, Finland
Perttu Ranta-aho: WithSecure, Tammasaarenkatu 7, Helsinki, 00180, Finland

Journal volume & issue: Vol. 23
p. 100355

Abstract

Read online

To protect systems from malicious activities, it is important to differentiate between valid and harmful commands. One way to achieve this is by learning the syntax of the commands, which is a complex task because of the expansive and evolving nature of command syntax. To address this, we harnessed the power of a language model. Our methodology involved constructing a specialized vocabulary from our commands dataset, and training a custom tokenizer with a Masked Language Model head, resulting in the development of a BERT-like language model. This model exhibits proficiency in learning command syntax by predicting masked tokens. In comparative analyses, our language model outperformed the Markov Model in categorizing commands using clustering algorithms (DBSCAN, HDBSCAN, OPTICS). The language model achieved higher Silhouette scores (0.72, 0.88, 0.85) compared to the Markov Model (0.53, 0.25, 0.06) and demonstrated significantly lower noise levels (2.63%, 5.39%, 8.49%) versus the Markov Model’s higher noise rates (9.31%, 29.85%, 50.35%). Further validation with manually crafted syntax and BERTScore assessments consistently produced metrics above 0.90 for precision, recall, and F1-score. Our language model excels at learning command syntax, enhancing protective measures against malicious activities.

Published in Array

ISSN: 2590-0056 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.journals.elsevier.com/array

About the journal

Abstract

Keywords