IEEE Access (Jan 2021)

High-Quality Train Data Generation for Deep Learning-Based Web Page Classification Models

  • Jeong-Jae Kim,
  • Byung-Won On,
  • Ingyu Lee

DOI
https://doi.org/10.1109/ACCESS.2021.3086586
Journal volume & issue
Vol. 9
pp. 85240 – 85254

Abstract

Current deep learning models for detecting relevant web pages show low accuracy because of the poor quality of their training data. In this paper, we propose a novel algorithm that automatically generates high-quality training data based on the frequency of documents that include the entity of interest. Our experimental results on movie and cellphone data sets show that deep learning models (FNN, CNN, Bi-LSTM, and SeqGAN) trained on data generated by the proposed algorithm achieve an average $F_{1}$-score of up to 0.9992.

Keywords