IEEE Access (Jan 2021)

Active Learning for Biomedical Text Classification Based on Automatically Generated Regular Expressions

  • Christopher A. Flores,
  • Rosa L. Figueroa,
  • Jorge E. Pezoa

DOI
https://doi.org/10.1109/ACCESS.2021.3064000
Journal volume & issue
Vol. 9
pp. 38767 – 38777

Abstract

Biomedical text classification algorithms, which currently support clinical decision-making processes, require training texts that are expensive to obtain because labeled corpora are scarce and manual annotation by specialized professionals is costly. The active learning (AL) approach to classification substantially lowers this cost by reducing the number of labeled documents required to achieve a specified performance. This article introduces a query strategy and a stopping criterion that transform CREGEX, a regular-expression-based text classification algorithm, into an AL biomedical text classifier. The query strategy samples the training dataset, trading off the greedy learning driven by the classification precision of the regular expressions against the conservative learning induced by classification based on text sequence alignment. A sustained reduction in the variance of the query strategy scores is used as the stopping criterion. The AL classifier was compared with Support Vector Machine (SVM), Naïve Bayes (NB), and a classifier based on Bidirectional Encoder Representations from Transformers (BERT), using three datasets with biomedical information in Spanish on smoking habits, obesity, and obesity types. The learning curve results indicate that AL in CREGEX needed fewer training examples than the other classifiers to reach equal performance, obtaining areas under the learning curve greater than 85% in all cases. With the stopping criterion, the AL process used, on average, approximately 32% to 50% of the total training examples, while performance stayed within 2% of the maximum value of the learning curve. These results demonstrate the effectiveness of using AL in a biomedical text classifier based on regular expressions, which is attributable to the ability of such expressions to represent intricate sequential patterns in the training texts considered most informative.
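
The abstract does not spell out how a "sustained reduction in the variance" is detected, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each AL iteration yields a list of query strategy scores and that "sustained" means the variance falls strictly for several consecutive iterations. The function name `should_stop` and the `patience` parameter are hypothetical.

```python
from statistics import pvariance


def should_stop(score_history, patience=3):
    """Return True once the variance of the query scores has fallen
    for `patience` consecutive AL iterations.

    score_history: one non-empty list of query strategy scores per iteration.
    patience: assumed number of consecutive decreases (not from the paper).
    """
    # Population variance of the scores observed at each iteration.
    variances = [pvariance(scores) for scores in score_history]

    # `patience` decreases need `patience + 1` variance values.
    if len(variances) < patience + 1:
        return False

    # Check that every step in the most recent window is a strict decrease.
    recent = variances[-(patience + 1):]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))


# Toy scores whose spread narrows at every iteration.
history = [
    [0.1, 0.9, 0.5],
    [0.2, 0.8, 0.5],
    [0.3, 0.7, 0.5],
    [0.4, 0.6, 0.5],
]
print(should_stop(history))      # True: three consecutive decreases
print(should_stop(history[:3]))  # False: only two decreases so far
```

A larger `patience` makes this rule more conservative, spending more labeled examples before the process halts. The abstract reports the resulting trade-off for the actual method as roughly 32% to 50% of the training examples used.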

Keywords