Classifying protein-protein interaction articles from biomedical literature using many relevant features and context-free grammar

Sabenabanu Abdulkadhar; Gurusamy Murugesan; Jeyakumar Natarajan

Journal of King Saud University: Computer and Information Sciences (Jun 2020)

Affiliations

Sabenabanu Abdulkadhar: Data Mining and Text Mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore, Tamilnadu 641 046, India
Gurusamy Murugesan: Data Mining and Text Mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore, Tamilnadu 641 046, India
Jeyakumar Natarajan (corresponding author): Data Mining and Text Mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore, Tamilnadu 641 046, India

Journal volume & issue: Vol. 32, no. 5
pp. 553 – 560

Abstract

Detecting articles that describe protein–protein interactions (PPI) is a significant step in biological information extraction. In this paper, we present a hybrid text classification (TC) method to identify PPI-relevant articles. Our methodology comprises four modules: i) feature extraction, ii) semantic-similarity-based feature selection, iii) ensemble learning, and iv) context-free grammar (CFG) based post-processing. In the first module, we extract many linguistic and domain-specific features, such as protein names and interaction cues, to classify the documents. The second module uses similarity-based feature selection to retain the most relevant features. In the third module, we employ AdaBoost-based ensemble learning to improve the performance of weak classifiers. The final module applies CFG-based pattern matching to correct classifier errors. Our hybrid TC method was trained and tested on the BioCreative III corpus, on which it attained a precision of 0.5813 and a recall of 0.6582. The overall F-score of the system was 0.6228, and our hybrid approach, which combines the ensemble classifier with CFG post-processing, outperforms most state-of-the-art systems.
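
The CFG-based post-processing step could look roughly like the following minimal sketch. The abstract does not give the grammar, the lexicons, or the correction rule, so everything here is an assumption made for illustration: the toy word lists (PROTEINS, CUES, PREPS, CONJS), the two-rule GRAMMAR, the helper names, and the choice to promote a negative classifier label to positive whenever a sentence matches an interaction pattern.

```python
# Illustrative sketch only: lexicons, grammar, and correction rule are
# hypothetical and are NOT taken from the paper.

PROTEINS = {"brca1", "rad51", "p53", "mdm2"}
CUES = {"binds", "interacts", "interaction", "associates", "phosphorylates"}
PREPS = {"with", "to", "between"}
CONJS = {"and"}

# Toy context-free grammar over token tags (no left recursion).
# S covers "PROT CUE [PREP] PROT" and "CUE PREP PROT CONJ PROT".
GRAMMAR = {
    "S": [["PROT", "VERBAL"], ["CUE", "PREP", "PROT", "CONJ", "PROT"]],
    "VERBAL": [["CUE", "PROT"], ["CUE", "PREP", "PROT"]],
}


def tag(token):
    """Map a raw token to a terminal tag of the toy grammar."""
    word = token.lower().strip(".,;:()")
    if word in PROTEINS:
        return "PROT"
    if word in CUES:
        return "CUE"
    if word in PREPS:
        return "PREP"
    if word in CONJS:
        return "CONJ"
    return "OTHER"


def derives(symbol, tags, start):
    """Return the set of end positions at which `symbol` can finish
    when derived from tags[start:] (simple top-down recognition)."""
    if symbol not in GRAMMAR:  # terminal symbol: must match one tag
        if start < len(tags) and tags[start] == symbol:
            return {start + 1}
        return set()
    ends = set()
    for production in GRAMMAR[symbol]:
        positions = {start}
        for part in production:
            positions = {e for p in positions for e in derives(part, tags, p)}
            if not positions:
                break
        ends |= positions
    return ends


def has_interaction_pattern(sentence):
    """True if any contiguous tag span of the sentence is derivable from S."""
    tags = [tag(tok) for tok in sentence.split()]
    return any(len(derives("S", tags, i)) > 0 for i in range(len(tags)))


def post_process(label, sentences):
    """Assumed correction rule: keep positives, and promote a negative
    label (0) to positive (1) when any sentence matches the grammar."""
    if label == 1:
        return 1
    return 1 if any(has_interaction_pattern(s) for s in sentences) else 0
```

For example, `post_process(0, ["BRCA1 interacts with RAD51 in vivo."])` promotes the article to the positive class, while an article whose sentences contain no derivable pattern keeps its original label. A real system would rely on a named-entity recognizer and a much richer grammar instead of fixed word lists.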

Published in Journal of King Saud University: Computer and Information Sciences

ISSN: 1319-1578 (Print)
Publisher: Elsevier
Country of publisher: Saudi Arabia
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://www.journals.elsevier.com/journal-of-king-saud-university-computer-and-information-sciences/
