IEEE Access (Jan 2024)

Dark Side of the Web: Dark Web Classification Based on TextCNN and Topic Modeling Weight

  • Gun-Yoon Shin,
  • Younghoan Jang,
  • Dong-Wook Kim,
  • Sungjin Park,
  • A-Ran Park,
  • Younghwan Kim,
  • Myung-Mook Han

DOI
https://doi.org/10.1109/ACCESS.2023.3347737
Journal volume & issue
Vol. 12
pp. 36361 – 36371

Abstract

The Dark Web is a part of the internet that ensures user anonymity. Because its users are difficult to track, it has increasingly become a focal point for illegal activities and a repository for information on cyberattacks. This study examined the classification of Dark Web content in relation to these cyber threats. We processed Dark Web texts to extract vectors suitable for machine learning classification. Traditional methods generate features from the entirety of the Dark Web texts, producing vectors that include every word found in them. This approach incorporates extraneous information into the vectors, which diminishes learning effectiveness and lengthens processing time. We aimed to optimize the classification process by focusing selectively on the keywords within each class, thereby reducing the dimensionality of the word vectors. This optimization was achieved by leveraging the anonymity characteristic of the Dark Web and employing topic-modeling-based weight generation. These methods enabled the creation of word vectors with a constrained feature set, enhancing the distinction between Dark Web classes. To further improve classification performance, we integrated TextCNN with the topic modeling weights. For validation, we used two datasets and compared the model's performance with that of other text classification algorithms; the proposed model demonstrated superior effectiveness in Dark Web classification.
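The idea of restricting word vectors to per-class keywords can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the weight formula, so the example substitutes a simple class-relative frequency score for the topic-modeling weights, and it omits the TextCNN stage entirely. The function names (`class_keyword_weights`, `build_vocabulary`, `vectorize`), the `top_k` parameter, and the toy documents are all hypothetical.

```python
from collections import Counter


def class_keyword_weights(docs, labels, top_k=3):
    """Keep the top_k highest-scoring words per class.

    Assumption: the score is a stand-in for a topic-modeling weight.
    It multiplies a word's in-class count by the share of its total
    occurrences that fall in that class.
    """
    per_class = {}
    for text, label in zip(docs, labels):
        per_class.setdefault(label, Counter()).update(text.lower().split())

    total = Counter()
    for counts in per_class.values():
        total.update(counts)

    weights = {}
    for label, counts in per_class.items():
        scored = {w: c * (c / total[w]) for w, c in counts.items()}
        # Sort by descending score, then alphabetically for determinism.
        top = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        weights[label] = dict(top)
    return weights


def build_vocabulary(weights):
    """Union of the per-class keywords: the reduced feature set."""
    return sorted({w for kw in weights.values() for w in kw})


def vectorize(text, vocab, weights):
    """Weighted count vector over the reduced vocabulary only."""
    counts = Counter(text.lower().split())
    best = {w: max(kw.get(w, 0.0) for kw in weights.values()) for w in vocab}
    return [counts[w] * best[w] for w in vocab]


# Toy corpus (invented for illustration only).
docs = [
    "buy stolen credit card dumps",
    "credit card fullz for sale",
    "exploit kit for sale zero day",
    "zero day exploit leak",
]
labels = ["fraud", "fraud", "hacking", "hacking"]

weights = class_keyword_weights(docs, labels, top_k=3)
vocab = build_vocabulary(weights)
full_vocab = {w for d in docs for w in d.lower().split()}
vec = vectorize("credit card for sale", vocab, weights)

print(len(full_vocab), "words in the full vocabulary")
print(len(vocab), "words in the reduced vocabulary:", vocab)
print(vec)
```

Words shared evenly across classes, such as "for" and "sale" in the toy corpus, score low and drop out of the vocabulary, which is the dimensionality reduction the abstract describes. In a fuller pipeline, weights of this kind would presumably come from a topic model and feed a TextCNN rather than being used as a plain vector.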

Keywords