DNS dataset for malicious domains detection

Cláudio Marques; Silvestre Malta; João Paulo Magalhães

Data in Brief (Oct 2021)

Affiliations

Cláudio Marques (corresponding author): Escola Superior de Tecnologia e Gestão, Politécnico de Viana do Castelo, Viana do Castelo 4900-348, Portugal
Silvestre Malta: ADiT-Lab, Escola Superior de Tecnologia e Gestão, Politécnico de Viana do Castelo, Viana do Castelo 4900-348, Portugal
João Paulo Magalhães: CIICESI, Escola Superior de Tecnologia e Gestão, Politécnico do Porto, Felgueiras, Portugal

Journal volume & issue: Vol. 38
p. 107342

Abstract

The Domain Name System (DNS) is central to the functioning of the internet. Just as organizations use domain names to provide access to their computational services, malicious actors use domain names to point to services under their control. Distinguishing between non-malicious and malicious domain names is extremely important, because it makes it possible to grant or block access to external services, improving the security of the organization and its users. Many DNS firewall solutions exist today, and most rely on lists of known malicious domains that are constantly updated. This approach can block only known malicious communications and misses others that are malicious but not yet listed. Using machine learning to classify domains helps detect domains that are not yet on a block list. The dataset described in this manuscript is intended for supervised machine learning-based analysis of malicious and non-malicious domain names. It was created from scratch using publicly available DNS logs of both malicious and non-malicious domain names. Using the domain name as input, 34 features were obtained. Features such as domain name entropy, number of strange characters, and domain name length were derived directly from the domain name. Other features, such as the domain name creation date, Internet Protocol (IP) address, open ports, and geolocation, were obtained through data enrichment processes (e.g., Open Source Intelligence (OSINT)). The class was determined from the data source (malicious DNS log files or non-malicious DNS log files). The dataset contains data from approximately 90,000 domain names and is balanced: 50% non-malicious and 50% malicious.
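
As an illustration of the features derived directly from the domain name, the following Python sketch computes a length, an entropy value, and a count of strange characters. It is not the authors' extraction code. The abstract does not define these features precisely, so the sketch assumes Shannon entropy over the characters of the name and treats any character other than an ASCII letter, digit, dot, or hyphen as "strange". The function names and dictionary keys are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy, in bits per character, of the characters in text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def lexical_features(domain: str) -> dict:
    """Compute simple lexical features from a domain name string.

    Assumption: a "strange" character is anything other than an ASCII
    letter, digit, dot, or hyphen. The dataset may use another definition.
    """
    # Normalize: trim whitespace, lowercase, drop a trailing root dot.
    name = domain.strip().lower().rstrip(".")
    strange = sum(
        1
        for ch in name
        if not (ch.isascii() and (ch.isalnum() or ch in ".-"))
    )
    return {
        "length": len(name),
        "entropy": shannon_entropy(name),
        "strange_characters": strange,
    }


if __name__ == "__main__":
    for candidate in ("example.com", "x9q-7zk2v_p!.net"):
        print(candidate, lexical_features(candidate))
```

Features such as creation date, open ports, and geolocation cannot be computed this way, because they depend on external lookups rather than on the name itself.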

Published in Data in Brief

ISSN: 2352-3409 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Science (General)
Website: http://www.journals.elsevier.com/data-in-brief/
