IEEE Access (Jan 2021)

Efficient Sentiment-Aware Web Crawling Methods for Constructing Sentiment Dictionary

  • Byung-Won On,
  • Jun-Young Jo,
  • Hyunkwang Shin,
  • Jangwon Gim,
  • Gyu Sang Choi,
  • Soo-Mok Jung

DOI
https://doi.org/10.1109/ACCESS.2021.3129187
Journal volume & issue
Vol. 9
pp. 161208 – 161223

Abstract

In traditional web crawling, every crawled web page is first stored in a database. This approach can store unnecessary web pages, and it requires additional running time to construct a sentiment dictionary for a particular domain, because sentiment words must be identified by scanning all web pages in the database. To address these problems, we first define the sentiment-aware web crawling problem and then propose two hash-based methods to implement it: one based on hash join and the other on bucket-sorted hash join. In particular, we propose a novel bucket-sorted hash join for efficient sentiment-aware web crawling. Our experimental results show that the proposed web crawling method using bucket-sorted hash join outperforms existing web crawling methods, significantly reducing both running time and storage space. With the proposed method, the sentiment-aware task takes 0.016 seconds per web page, and database space is reduced by 59% compared to the existing web crawling methods.
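
The idea of filtering pages at crawl time with a hash join can be illustrated with a minimal sketch. This is not the authors' implementation: the seed lexicon, the tokenizer, the example URLs, and the function names `sentiment_hash_join` and `crawl_filter` are all assumptions made for illustration, and the bucket-sorted variant is not shown because the abstract does not specify it. The sketch only shows the general technique: build a hash table from known sentiment words, probe it with each page's tokens, and store a page only if the probe finds a match.

```python
import re

# Hypothetical seed lexicon (assumption): the paper's actual word list is not given here.
SEED_SENTIMENT_WORDS = {"good", "great", "excellent", "bad", "poor", "terrible"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_hash_join(page_text, lexicon):
    """Hash join between page tokens (probe side) and the lexicon (build side).

    Returns a dict mapping each matched sentiment word to its count on the page.
    """
    build_side = set(lexicon)  # hash table over the smaller relation
    matches = {}
    for token in tokenize(page_text):  # probe phase: one hash lookup per token
        if token in build_side:
            matches[token] = matches.get(token, 0) + 1
    return matches


def crawl_filter(pages, lexicon):
    """Keep only pages that contain at least one sentiment word.

    `pages` is an iterable of (url, text) pairs; pages without matches are
    discarded instead of being written to storage.
    """
    stored = {}
    for url, text in pages:
        matches = sentiment_hash_join(text, lexicon)
        if matches:
            stored[url] = matches
    return stored


if __name__ == "__main__":
    # Hypothetical pages standing in for crawler output.
    pages = [
        ("https://example.com/a", "The battery life is great, but the screen is poor."),
        ("https://example.com/b", "Opening hours are listed below."),
    ]
    print(crawl_filter(pages, SEED_SENTIMENT_WORDS))
```

In this sketch the second page is never stored, which mirrors the storage saving the abstract describes: pages without sentiment words are filtered out during crawling rather than scanned later in the database.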

Keywords