Extracting the Main Content of Web Pages Using the First Impression Area

Geunseong Jung; Sungjae Han; Hansung Kim; Kwanguk Kim; Jaehyuk Cha

doi:10.1109/ACCESS.2022.3229080

IEEE Access (Jan 2022)

Affiliations

Geunseong Jung: Department of Computer Science, Hanyang University, Seoul, South Korea
Sungjae Han: JEI Group, Seoul, South Korea
Hansung Kim: Department of Sociology, Hanyang University, Seoul, South Korea
Kwanguk Kim: Department of Computer Science, Hanyang University, Seoul, South Korea
Jaehyuk Cha: Department of Computer Science, Hanyang University, Seoul, South Korea

DOI: https://doi.org/10.1109/ACCESS.2022.3229080
Journal volume & issue: Vol. 10, pp. 129958–129969

Abstract

Extracting the main content from a web page is essential in applications such as web crawlers and browser reader modes. Existing extraction methods that rely on text-based algorithms and features designed for English text can be ineffective on non-English web pages. This study proposes a main content extraction method that obtains visual and structural features from the rendered web page. Our method uses the first impression area (FIA), the part of a web page that users view first. Within this area, websites apply many techniques that help users find the main content easily. Using the non-textual properties in the FIA, our method selects three points with high content area density and expands the area from each point until it meets several structural and visual conditions. We evaluated our method, the reader modes of two browsers (Mozilla Firefox and Google Chrome), and existing main content extraction methods on multilingual datasets using two measures: longest common subsequence (LCS) and matched text blocks. The results showed that our method performed better than the other methods on both English (up to 46%, matched text blocks $F_{0.5}$) and non-English (up to 42%, matched text blocks $F_{0.5}$) web pages.
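
The abstract names an LCS-based measure and an $F_{0.5}$ score but does not spell out how they are computed. The following Python sketch shows one common way to combine them, and is an illustration rather than the authors' evaluation code. The function names (`lcs_length`, `f_beta`, `lcs_scores`) and the whitespace tokenization are assumptions made here; whitespace splitting in particular would not suit languages written without spaces between words.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 weights precision above recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def lcs_scores(extracted, truth, beta=0.5):
    """Precision, recall, and F-beta from the token-level LCS.

    Assumption: tokens are whitespace-separated words.
    """
    ext = extracted.split()
    ref = truth.split()
    if not ext or not ref:
        return 0.0, 0.0, 0.0
    common = lcs_length(ext, ref)
    precision = common / len(ext)
    recall = common / len(ref)
    return precision, recall, f_beta(precision, recall, beta)


if __name__ == "__main__":
    # An extractor that kept two navigation words in front of the real content.
    p, r, f = lcs_scores("home news the main story", "the main story")
    print(f"precision={p:.2f} recall={r:.2f} F0.5={f:.3f}")
```

With $\beta = 0.5$, the score penalizes extra boilerplate in the extracted text (low precision) more heavily than missing content (low recall), which is the general property of $F_{0.5}$ regardless of how the underlying matches are counted.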

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
