IEEE Access (Jan 2019)
SInFo – Structure-Driven Incremental Forum Crawler That Optimizes User-Generated Content Retrieval
Abstract
In this paper, we present SInFo, a Structure-driven Incremental Forum crawler that targets the latest content in each crawling cycle. On a Web forum, user-generated content is almost never changed or deleted, but it is constantly added. Forum technologies vary widely in how they represent content and in the navigational paths that lead the user to the latest content. Targeting the latest content is not a trivial task, since adding new content to a forum often shifts older content between pages. An incremental crawler that ignores how forum content is distributed and sorted can repeatedly visit pages holding the same data as in previous crawls. The main goal of SInFo is to avoid transferring duplicate content during incremental forum crawling, using a generic approach that is independent of the forum technology. The problem is reduced to discovering and utilizing the following features of a forum technology: (1) how forum index and thread pages represent and sort their content, and (2) the navigational options available between pages. With the proposed methods and techniques, we show how to locate the target page by observing the URL signature format and how to minimize the number of downloads required to fetch the page containing the latest content. The experiments were conducted on custom technologies and on a wide range of pre-built forum packages, covering more than 80% of representative, widely used software packages. SInFo showed high accuracy and a low level of duplicate transmission, reaching an average of 92.6% new content in each recrawl cycle.
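As a minimal illustration of the page-targeting idea, and not the method SInFo itself implements, consider the simplest case: a thread sorted oldest-first with a fixed number of posts per page and a page number exposed in the URL. The function names, the URL template, and the `thread_id` and `page` placeholders below are hypothetical and chosen only for this sketch.

```python
def first_page_with_new_posts(posts_seen: int, posts_per_page: int) -> int:
    """Return the 1-based index of the first page that can hold unseen posts.

    Assumes (for illustration only) that posts are sorted oldest-first,
    are never deleted, and that every page holds exactly posts_per_page posts.
    """
    if posts_per_page <= 0:
        raise ValueError("posts_per_page must be positive")
    if posts_seen < 0:
        raise ValueError("posts_seen must be non-negative")
    # Pages 1..(posts_seen // posts_per_page) are completely filled with
    # posts fetched in earlier cycles, so the crawl can resume on the next one.
    return posts_seen // posts_per_page + 1


def build_page_url(template: str, thread_id: int, page: int) -> str:
    """Fill a hypothetical URL signature template with a thread id and page."""
    return template.format(thread_id=thread_id, page=page)


# Hypothetical URL signature; real forum packages use their own formats.
TEMPLATE = "https://forum.example/showthread.php?t={thread_id}&page={page}"

# 45 posts were stored in earlier cycles and each page shows 20 posts.
target_page = first_page_with_new_posts(posts_seen=45, posts_per_page=20)
target_url = build_page_url(TEMPLATE, thread_id=1234, page=target_page)
```

Under these assumptions, the crawler requests a single page directly instead of walking the thread from its first page. Forums that sort newest-first or lack page-number URLs would need a different rule.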
Keywords