Journal of King Saud University: Computer and Information Sciences (Jun 2020)

Classifying protein-protein interaction articles from biomedical literature using many relevant features and context-free grammar

  • Sabenabanu Abdulkadhar,
  • Gurusamy Murugesan,
  • Jeyakumar Natarajan

Journal volume & issue
Vol. 32, no. 5
pp. 553 – 560

Abstract

Detecting articles that describe protein–protein interactions (PPI) is a significant step in biological information extraction. In this paper, we present a hybrid text classification (TC) method to identify protein–protein interaction articles. Our methodology comprises four modules: i) feature extraction, ii) semantic similarity-based feature selection, iii) ensemble learning, and iv) context-free grammar (CFG)-based post-processing to classify PPI-relevant articles. In the first module, we extracted many linguistic and domain-specific features, such as protein names and interaction cues, to classify the documents. The second module used similarity-based feature selection to retain the most relevant features. In the third module, we employed AdaBoost-based ensemble learning to improve the performance of weak classifiers. The final module incorporated CFG-based pattern matching to correct errors made by the classifiers. Our hybrid TC method was trained and tested on the BioCreative III corpus, on which it attained a precision of 0.5813 and a recall of 0.6582. The overall F-score of the system was 0.6228, and our hybrid approach, which combines the ensemble classifier with CFG post-processing, outperforms most state-of-the-art systems.
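
To make the CFG-based pattern matching idea concrete, the following is a minimal Python sketch, not the grammar or lexicon used in the paper. The protein names, interaction cues, grammar rules, and the function names `tag`, `derive`, and `has_ppi_pattern` are all hypothetical and chosen only for illustration. It tags tokens from a toy lexicon and checks whether any tagged subsequence can be derived from a small context-free grammar of the form "protein, interaction cue, (preposition), protein".

```python
# Hypothetical toy lexicon (illustrative only, not taken from the paper).
PROTEINS = {"brca1", "rad51", "p53", "mdm2"}
CUES = {"binds", "interacts", "phosphorylates", "activates"}
PREPS = {"with", "to"}

# Hypothetical toy grammar: nonterminal -> list of productions.
# Symbols that are not keys of GRAMMAR are treated as terminals (tags).
GRAMMAR = {
    "S": [["PROT", "VP"]],
    "VP": [["CUE", "PROT"], ["CUE", "PREP", "PROT"]],
}


def tag(token):
    """Map a token to a terminal tag, or None if it is not in the lexicon."""
    word = token.lower().strip(".,;:()")
    if word in PROTEINS:
        return "PROT"
    if word in CUES:
        return "CUE"
    if word in PREPS:
        return "PREP"
    return None


def derive(symbol, tags, pos):
    """Return every end position reachable by deriving `symbol` from tags[pos:]."""
    if symbol not in GRAMMAR:  # terminal: must match the tag at `pos`
        if pos < len(tags) and tags[pos] == symbol:
            return {pos + 1}
        return set()
    ends = set()
    for production in GRAMMAR[symbol]:
        positions = {pos}
        for part in production:
            # Advance every partial match through the next symbol.
            positions = {e for p in positions for e in derive(part, tags, p)}
        ends |= positions
    return ends


def has_ppi_pattern(sentence):
    """True if some run of tagged tokens in the sentence derives from S."""
    tags = [tag(tok) for tok in sentence.split()]
    tags = [t for t in tags if t is not None]  # simplification: drop untagged words
    return any(derive("S", tags, start) for start in range(len(tags)))


print(has_ppi_pattern("BRCA1 interacts with RAD51 in vitro."))
print(has_ppi_pattern("The p53 protein was studied."))
```

In a pipeline like the one described, such a matcher would run after the classifier, and its result could be used to revisit borderline predictions. How the match is combined with the classifier output is not specified in the abstract, so that step is left out of the sketch.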

Keywords