Dark Side of the Web: Dark Web Classification Based on TextCNN and Topic Modeling Weight

Gun-Yoon Shin; Younghoan Jang; Dong-Wook Kim; Sungjin Park; A-Ran Park; Younghwan Kim; Myung-Mook Han

doi:10.1109/ACCESS.2023.3347737

IEEE Access (Jan 2024)

Affiliations

Gun-Yoon Shin: Department of AI Software, Gachon University, Seongnam-si, Republic of Korea
Younghoan Jang: Department of AI Software, Gachon University, Seongnam-si, Republic of Korea
Dong-Wook Kim: Department of AI Software, Gachon University, Seongnam-si, Republic of Korea
Sungjin Park: Cyber Warfare, LIG Nex1, Seongnam-si, Republic of Korea
A-Ran Park: Cyber Warfare, LIG Nex1, Seongnam-si, Republic of Korea
Younghwan Kim: Cyber Warfare, LIG Nex1, Seongnam-si, Republic of Korea
Myung-Mook Han: Department of AI Software, Gachon University, Seongnam-si, Republic of Korea

DOI: https://doi.org/10.1109/ACCESS.2023.3347737
Journal volume & issue: Vol. 12, pp. 36361–36371

Abstract

The Dark Web is a part of the internet that ensures user anonymity. Because its users are difficult to track, it has increasingly become a focal point for illegal activities and a repository for information on cyberattacks. This study examined the classification of Dark Web content in relation to these cyber threats. We processed Dark Web texts to extract vectors suitable for machine learning classification. Traditional methods generate features from the entirety of the Dark Web texts, which yields vectors that include every word found in them. This approach incorporates extraneous information in the vectors, which diminishes learning effectiveness and extends processing time. We aimed to optimize the classification process by selectively focusing on the keywords within each class, thereby reducing the word vector dimensions. This optimization was achieved by leveraging the anonymity characteristic of the Dark Web and employing topic-modeling-based weight generation. These methods enabled the creation of word vectors with a constrained feature set, enhancing the distinction between Dark Web classes. To further improve classification performance, we integrated TextCNN with the topic modeling weights. For validation, we used two datasets and compared the model with other text classification algorithms; the proposed model demonstrated superior effectiveness in Dark Web classification.
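
The idea of restricting word vectors to per-class keywords can be illustrated with a small sketch. This is not the authors' implementation: it does not run a topic model or a TextCNN, and it substitutes a simple class-frequency weight (a term's share of its class, scaled by how few classes contain it) for the topic-modeling weights described in the abstract. The function names `class_keyword_weights`, `build_vocab`, and `vectorize`, the `top_k` parameter, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def class_keyword_weights(docs_by_class, top_k=2):
    """Keep the top_k highest-weighted words for each class.

    Assumption: the weight is a stand-in for a topic-model weight. It is
    the word's relative frequency inside the class multiplied by a log
    factor that shrinks when the word appears in many classes.
    """
    counts = {
        label: Counter(word for doc in docs for word in doc.lower().split())
        for label, docs in docs_by_class.items()
    }
    n_classes = len(counts)
    weights = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        scored = {}
        for word, n in counter.items():
            # Number of classes whose documents contain this word.
            class_freq = sum(1 for other in counts.values() if word in other)
            scored[word] = (n / total) * math.log((1 + n_classes) / class_freq)
        # Sort by weight (descending), then alphabetically to break ties.
        ranked = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))
        weights[label] = dict(ranked[:top_k])
    return weights


def build_vocab(weights):
    """Reduced vocabulary: the union of every class's selected keywords."""
    return sorted({word for kw in weights.values() for word in kw})


def vectorize(text, vocab, weights):
    """Weighted count vector over the reduced vocabulary only."""
    best = {}
    for kw in weights.values():
        for word, w in kw.items():
            best[word] = max(best.get(word, 0.0), w)
    tf = Counter(text.lower().split())
    return [tf[word] * best[word] for word in vocab]


# Toy, made-up documents grouped by a hypothetical class label.
sample_docs = {
    "drugs": ["buy weed online cheap", "weed and pills shop"],
    "hacking": ["exploit kit for sale", "buy exploit and malware"],
}
weights = class_keyword_weights(sample_docs, top_k=2)
vocab = build_vocab(weights)
vector = vectorize("buy weed weed", vocab, weights)
```

With `top_k=2` the vocabulary here has four entries, whereas a bag-of-words vector over the same toy documents would need one dimension per distinct word. In a pipeline like the one the abstract describes, a reduced and weighted representation of this kind would be the input to the classifier.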

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
