Patterns (Jul 2020)

Harvesting Patterns from Textual Web Sources with Tolerance Rough Sets

  • Hoora Rezaei Moghaddam,
  • Sheela Ramanna

Journal volume & issue
Vol. 1, no. 4
p. 100053

Abstract


Summary: Constructing knowledge repositories from web corpora by harvesting linguistic patterns benefits many natural language processing applications that rely on question-answering schemes. These methods require minimal or no human intervention and can recursively learn new relational fact instances in a fully automated and scalable manner. This paper explores the performance of a tolerance rough set-based learner with respect to two important issues, scalability and its effect on concept drift, by (1) designing a new version of the semi-supervised tolerance rough set-based pattern learner (TPL 2.0), (2) adapting a tolerance form of rough set methodology to categorize linguistic patterns, and (3) extracting categorical information from a large, noisy dataset of crawled web pages. This work demonstrates that the TPL 2.0 learner is promising in terms of the precision@30 metric when compared with three benchmark algorithms: Tolerant Pattern Learner 1.0, Fuzzy-Rough Set Pattern Learner, and Coupled Bayesian Sets-based learner.

The Bigger Picture: The methods used to construct knowledge repositories from a web corpus require minimal human intervention and can recursively learn new relational facts in a fully automated and scalable manner. A key issue when mining such a corpus is the labeling problem: data are abundant on the web but are unlabeled. Even though semi-supervised approaches are promising, they might exhibit low accuracy because the initial labeled examples of relational facts are limited in number and tend to be insufficient to properly constrain the learning process. This phenomenon is called semantic (concept) drift. We extend a recently established theoretical model for learning linguistic patterns based on tolerance rough sets to address the problem of concept drift.
The choice of a tolerance rough set-based learner was motivated by the fact that, in comparison with the three benchmark algorithms, it does not require any external constraints to guide the learning process.
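
For readers unfamiliar with the precision@30 metric named in the summary, the following is a minimal sketch of how precision at a cutoff k is commonly computed: the fraction of the top-k ranked items that are judged correct. The function name, the example items, and the choice to divide by k even when fewer than k items are returned are illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], correct: Set[str], k: int = 30) -> float:
    """Fraction of the top-k ranked items that appear in the set of correct items.

    Assumption: the denominator is always k, so a list shorter than k
    is penalized for the missing positions.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked[:k]
    hits = sum(1 for item in top_k if item in correct)
    return hits / k


# Hypothetical ranked output for a "city" category, with human judgments.
ranked_instances = ["toronto", "paris", "banana", "berlin"]
judged_correct = {"toronto", "paris", "berlin"}
score = precision_at_k(ranked_instances, judged_correct, k=4)
```

With k set to 30, the same function would score the 30 highest-ranked instances that a learner promotes for a category.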

Keywords