IEEE Access (Jan 2020)
VB-PTC: Visual Block Multi-Record Text Extraction Based on Sensor Network Page Type Conversion
Abstract
Usually, in addition to the main content, web pages contain additional information in the form of noise, such as navigation elements, sidebars and advertisements. This kind of noise has nothing to do with the main content, it will affect the tasks of data mining and information retrieval so that the sensor will be damaged by the wrong data and interference noise. Because of the diversity of web page structure, it is a challenge to detect relevant information and noise in order to improve the true reliability of sensor networks. In this paper, we propose a visual block construction method based on page type conversion (VB-PTC). This method uses a combination of site-level noise reduction based on hashtree and page-level noise reduction based on linked clusters to eliminate noise in web articles, and it successfully converts multi-record complex pages to multi-record simple pages, effectively simplifying the rules of visual block construction. In the aspect of multi-record content extraction, according to the characteristics of different fields, we use different extraction methods, combined with regular expression, natural language processing and symbol density detection methods which greatly improves the accuracy of multi-record content extraction. VB-PTC can be effectively used for information retrieval, content extraction and page rendering tasks.
Keywords