Key Concept Identification: A Sentence Parse Tree-Based Technique for Candidate Feature Extraction From Unstructured Texts

Muhammad Aman; Abas bin Md Said; Said Jadid Abdul Kadir; Israr Ullah

doi:10.1109/ACCESS.2018.2875135

IEEE Access (Jan 2018)

Affiliations

Muhammad Aman: Department of Computer and Information Sciences, Universiti Teknologi Petronas, Seri Iskandar, Malaysia
Abas bin Md Said: Department of Computer and Information Sciences, Universiti Teknologi Petronas, Seri Iskandar, Malaysia
Said Jadid Abdul Kadir: Department of Computer and Information Sciences, Universiti Teknologi Petronas, Seri Iskandar, Malaysia
Israr Ullah: Computer Engineering Department, Jeju National University, Jeju, South Korea

DOI: https://doi.org/10.1109/ACCESS.2018.2875135
Journal volume & issue: Vol. 6
pp. 60403 – 60413

Abstract

The effectiveness of automatic key concept (keyphrase) identification from unstructured text documents depends mainly on a comprehensive and meaningful list of candidate features extracted from the documents. However, conventional techniques for candidate feature extraction limit the performance of keyphrase identification algorithms and need improvement. This paper proposes a novel parse tree-based approach for candidate feature extraction to overcome the shortcomings of the existing techniques. The proposed technique generates a parse tree for each sentence in the input text. Sentence parse trees are then cut into sub-trees to extract branches for candidate phrases (e.g., noun and verb phrases). The sub-trees are combined using parts-of-speech tagging to generate a flat list of candidate phrases. Finally, filtering is performed using heuristic rules, and redundant phrases are eliminated to generate the final list of candidate features. The proposed scheme is validated experimentally using three manually annotated and publicly available data sets from different domains: Inspec, 500N-KPCrowd, and SemEval-2010. The proposed technique is first tuned to determine the optimal value of the context window size parameter and is then compared with the conventional n-gram and noun-phrase-based techniques. The results show that the proposed technique outperforms the existing approaches, with significant improvements of 13.51% and 30.67% in precision, 12.86% and 5.48% in recall, and 13.16% and 31.46% in F-measure when compared with the noun-phrase-based scheme and the n-gram-based scheme, respectively. These results give us confidence to further validate the proposed technique by developing a keyphrase extraction algorithm in the future.
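
The pipeline described in the abstract (parse tree, sub-tree cutting, flattening, heuristic filtering, de-duplication, then scoring against annotated keyphrases) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the hand-written parse tree, the phrase labels in PHRASE_LABELS, the tags in STOP_TAGS, the max_words limit, and all function names are assumptions chosen for illustration, and a real system would obtain the tree from a syntactic parser.

```python
# Illustrative sketch only; every name and heuristic below is an assumption,
# not taken from the paper.
# A non-leaf node is (label, child, child, ...); a leaf is (pos_tag, word).

PHRASE_LABELS = {"NP", "VP"}        # assumed phrase types to keep as candidates
STOP_TAGS = {"DT", "PRP", "PRP$"}   # assumed heuristic: drop determiners and pronouns


def is_leaf(node):
    """A leaf pairs a part-of-speech tag with a word string."""
    return len(node) == 2 and isinstance(node[1], str)


def leaves(node):
    """Return the (tag, word) leaves under a node, left to right."""
    if is_leaf(node):
        return [node]
    found = []
    for child in node[1:]:
        found.extend(leaves(child))
    return found


def subtrees(node):
    """Yield every non-leaf sub-tree, parents before children."""
    if is_leaf(node):
        return
    yield node
    for child in node[1:]:
        yield from subtrees(child)


def candidate_phrases(tree, max_words=4):
    """Cut the tree into phrase sub-trees, flatten, filter, and de-duplicate."""
    seen = set()
    result = []
    for sub in subtrees(tree):
        if sub[0] not in PHRASE_LABELS:
            continue
        words = [word.lower() for tag, word in leaves(sub) if tag not in STOP_TAGS]
        if not words or len(words) > max_words:
            continue  # heuristic length filter (hypothetical parameter)
        phrase = " ".join(words)
        if phrase not in seen:  # eliminate redundant phrases
            seen.add(phrase)
            result.append(phrase)
    return result


def precision_recall_f1(candidates, gold):
    """Exact-match precision, recall, and F-measure against gold keyphrases."""
    cand_set, gold_set = set(candidates), set(gold)
    hits = len(cand_set & gold_set)
    precision = hits / len(cand_set) if cand_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hand-built parse of "The parser generates candidate phrases".
tree = ("S",
        ("NP", ("DT", "The"), ("NN", "parser")),
        ("VP", ("VBZ", "generates"),
               ("NP", ("JJ", "candidate"), ("NNS", "phrases"))))

phrases = candidate_phrases(tree)
# phrases == ["parser", "generates candidate phrases", "candidate phrases"]
precision, recall, f1 = precision_recall_f1(phrases, ["parser", "candidate phrases"])
```

In this sketch the nested noun phrase survives alongside the enclosing verb phrase, which is the kind of candidate a flat n-gram window or a noun-phrase-only chunker would treat differently. Exact string matching is used for scoring here only for simplicity.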

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
