Feature Engineering and Machine Learning Model Comparison for Malicious Activity Detection in the DNS-Over-HTTPS Protocol

Matthew Behnke; Nathan Briner; Drake Cullen; Katelynn Schwerdtfeger; Jackson Warren; Ram Basnet; Tenzin Doleck

doi:10.1109/access.2021.3113294

IEEE Access (Jan 2021)

Affiliations

Matthew Behnke: Department of Computer Science and Engineering, Colorado Mesa University (CMU), Grand Junction, CO, USA
Nathan Briner: Department of Computer Science and Engineering, Colorado Mesa University (CMU), Grand Junction, CO, USA
Drake Cullen: Department of Computer Science and Engineering, Colorado Mesa University (CMU), Grand Junction, CO, USA
Katelynn Schwerdtfeger: Department of Computer Science and Engineering, Colorado Mesa University (CMU), Grand Junction, CO, USA
Jackson Warren: Department of Computer Science and Engineering, Colorado Mesa University (CMU), Grand Junction, CO, USA
Ram Basnet: Department of Computer Science and Engineering, Colorado Mesa University (CMU), Grand Junction, CO, USA
Tenzin Doleck: Simon Fraser University, Burnaby, BC, Canada

DOI: https://doi.org/10.1109/access.2021.3113294
Journal volume & issue: Vol. 9, pp. 129902–129916

Abstract

The Domain Name System (DNS) is among the most ubiquitous and important protocols for network communication; however, security concerns regarding DNS have been on the rise, and demand for encrypted traffic has followed suit. Using a publicly available dataset, this work compares 10 different machine learning classifiers using stratified 10-fold cross-validation. The classifiers are used to determine the most effective and efficient way of detecting malicious DNS over Hypertext Transfer Protocol Secure (HTTPS) traffic, dubbed DoH traffic. Model performance is evaluated on Non-DoH vs. DoH traffic, then tested on benign vs. malicious DoH traffic. Additionally, this paper seeks to build upon existing research by removing noise and introducing feature selection methods and feature explainability to produce a better model for real-world deployment. After eliminating five overfitting features, our findings indicate that the light gradient boosting machine (LGBM) yielded the highest accuracy-to-training-time ratio while approaching 0% error using the top 20 features.
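
As an illustration of the evaluation protocol named in the abstract, the following is a minimal sketch of stratified 10-fold cross-validation that records mean accuracy and total training time per classifier. It is not the authors' code. Everything in it is an assumption made for illustration: the data are synthetic two-feature points rather than the DoH dataset, the two toy classifiers (MajorityClass and NearestCentroid) stand in for the ten models compared in the paper, and all function and class names are hypothetical. Only the Python standard library is used, so the fold construction is written out by hand.

```python
import random
import time
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with similar class ratios in each."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's shuffled indices round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


class MajorityClass:
    """Toy baseline (assumption): always predicts the most common label."""

    def fit(self, X, y):
        counts = defaultdict(int)
        for label in y:
            counts[label] += 1
        self.label_ = max(counts, key=counts.get)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class NearestCentroid:
    """Toy classifier (assumption): predicts the class with the closest mean."""

    def fit(self, X, y):
        sums, counts = {}, defaultdict(int)
        for row, label in zip(X, y):
            previous = sums.get(label, [0.0] * len(row))
            sums[label] = [s + v for s, v in zip(previous, row)]
            counts[label] += 1
        self.centroids_ = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict(self, X):
        def dist2(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [min(self.centroids_, key=lambda c: dist2(row, self.centroids_[c]))
                for row in X]


def cross_validate(make_model, X, y, k=10, seed=0):
    """Return (mean accuracy over k folds, total seconds spent in fit)."""
    accuracies, train_seconds = [], 0.0
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        start = time.perf_counter()
        model = make_model().fit(X_train, y_train)
        train_seconds += time.perf_counter() - start
        predictions = model.predict([X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(predictions, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k, train_seconds


def make_synthetic(seed=0):
    """Synthetic, imbalanced two-class data (not the dataset used in the paper)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center, n in (("benign", 0.0, 300), ("malicious", 3.0, 100)):
        for _ in range(n):
            X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
            y.append(label)
    return X, y


X, y = make_synthetic()
results = {name: cross_validate(cls, X, y)
           for name, cls in (("majority", MajorityClass),
                             ("nearest_centroid", NearestCentroid))}
for name, (accuracy, seconds) in results.items():
    print(f"{name}: mean accuracy={accuracy:.3f}, training time={seconds:.4f}s")
```

Stratifying the folds keeps the benign-to-malicious ratio roughly constant in every split, which matters when one class is much rarer than the other. Timing only the fit call is one plausible way to obtain a training-time figure to set against accuracy; how the paper itself measured training time is not stated in the abstract.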

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
