Harvesting Patterns from Textual Web Sources with Tolerance Rough Sets

Hoora Rezaei Moghaddam; Sheela Ramanna

doi:10.1016/j.patter.2020.100053

Patterns (Jul 2020)

Affiliations

Hoora Rezaei Moghaddam: Sightline Innovation Inc., 136 Market Avenue, Unit 300, Winnipeg, MB, R3B 0P4, Canada; Corresponding author
Sheela Ramanna: Department of Applied Computer Science, University of Winnipeg, Winnipeg, Manitoba R3B 2E9, Canada; Corresponding author

DOI: https://doi.org/10.1016/j.patter.2020.100053
Journal volume & issue: Vol. 1, no. 4, p. 100053

Abstract

Summary: Constructing knowledge repositories from web corpora by harvesting linguistic patterns benefits many natural language processing applications that rely on question-answering schemes. These methods require minimal or no human intervention and can recursively learn new relational facts (instances) in a fully automated and scalable manner. This paper explores the performance of a tolerance rough set-based learner with respect to two important issues, scalability and its effect on concept drift, by (1) designing a new version of the semi-supervised tolerance rough set-based pattern learner (TPL 2.0), (2) adapting a tolerance form of rough set methodology to categorize linguistic patterns, and (3) extracting categorical information from a large, noisy dataset of crawled web pages. This work demonstrates that the TPL 2.0 learner is promising in terms of the precision@30 metric when compared with three benchmark algorithms: Tolerant Pattern Learner 1.0, Fuzzy-Rough Set Pattern Learner, and Coupled Bayesian Sets-based learner.

The Bigger Picture: The methods used to construct knowledge repositories from a web corpus require minimal human intervention and can recursively learn new relational facts in a fully automated and scalable manner. A key issue when mining such a corpus is the labeling problem: data are abundant on the web but are unlabeled. Even though semi-supervised approaches are promising, they might exhibit low accuracy, because the initial labeled examples of relational facts are limited in number and tend to be insufficient to properly constrain the learning process. This phenomenon is called semantic (concept) drift. We extend a recently established theoretical model for learning linguistic patterns based on tolerance rough sets to address the problem of concept drift. The choice of a tolerance rough set-based learner was motivated by the fact that, compared with the three benchmark algorithms, the learner did not require any external constraints to guide the learning process.
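
For readers unfamiliar with the precision@30 metric mentioned in the summary, the following is a minimal Python sketch of the usual precision@k definition: the fraction of the top k ranked items that are judged correct. The function name, the example category, and the example instances are hypothetical illustrations and are not taken from the paper, which may compute or average the metric differently.

```python
def precision_at_k(ranked_items, relevant, k=30):
    """Fraction of the top-k ranked items that appear in the relevant set.

    Assumption: the denominator is always k, even when fewer than k
    items were ranked (a common convention, not confirmed by the paper).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Hypothetical example: instances promoted for a "City" category,
# ordered from highest to lowest confidence.
ranked = ["Toronto", "Winnipeg", "banana", "Paris"]
correct = {"Toronto", "Winnipeg", "Paris"}

score = precision_at_k(ranked, correct, k=4)
print(score)
```

With k set to 30, the same function would score the 30 highest-ranked instances that a learner promotes for a category.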

Published in Patterns

ISSN: 2666-3899 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science: Computer software
Website: https://www.cell.com/patterns
