IEEE Access (Jan 2020)

CREGEX: A Biomedical Text Classifier Based on Automatically Generated Regular Expressions

  • Christopher A. Flores,
  • Rosa L. Figueroa,
  • Jorge E. Pezoa,
  • Qing Zeng-Treitler

DOI
https://doi.org/10.1109/ACCESS.2020.2972205
Journal volume & issue
Vol. 8
pp. 29270 – 29280

Abstract

High-accuracy text classifiers are now used to organize large amounts of biomedical information and to support clinical decision-making. In medical informatics, regular expression-based classifiers have emerged as an alternative to traditional discriminative classification algorithms because of their ability to model sequential patterns. This article presents CREGEX (Classifier Regular Expression), a biomedical text classifier based on an automatically generated feature space of regular expressions. We conceived an algorithm that automatically constructs an informative and discriminative regular expression-based feature space, suitable for binary and multiclass discrimination problems. Regular expressions are generated automatically from training texts using a coarse-to-fine text alignment method, which trades off coverage of the lexical variants of words, in terms of gender and grammatical number, against the generation of a feature space containing a large number of noisy features. CREGEX carries out feature selection by filtering keywords and also computes a confidence metric to classify test texts. Three de-identified datasets in Spanish, with information on smoking habits, obesity, and obesity types, were used to assess the performance of CREGEX. For comparison, Support Vector Machine (SVM) and Naïve Bayes (NB) supervised classifiers were also trained with consecutive sequences of tokens (n-grams) as features. Results show that, on all the datasets used for evaluation, CREGEX not only outperformed both the SVM and NB classifiers in terms of accuracy and F-measure (p-value < 0.05) but also required fewer training examples to achieve the same performance. This superior performance is attributed to the ability of regular expressions to represent complex text patterns.
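
To make the contrast between the two feature spaces concrete, the following minimal Python sketch encodes a text once as a binary vector of regular-expression matches and once as word n-grams. It is an illustration under stated assumptions, not the CREGEX algorithm: the patterns are hand-written and hypothetical (CREGEX generates its expressions automatically by text alignment), and the names `PATTERNS`, `regex_features`, and `word_ngrams` are invented for this example.

```python
import re

# Hypothetical, hand-written patterns for illustration only.
# CREGEX derives its regular expressions automatically from training texts.
PATTERNS = [
    r"\bfumador(a|es|as)?\b",  # one pattern covers gender and number variants
    r"\bno\s+fuma\b",          # negated smoking statement
    r"\bobes[oa]s?\b",         # obeso / obesa / obesos / obesas
]
COMPILED = [re.compile(p, re.IGNORECASE) for p in PATTERNS]


def regex_features(text):
    """Return a binary vector: 1 where the pattern matches anywhere in text."""
    return [1 if rx.search(text) else 0 for rx in COMPILED]


def word_ngrams(text, n):
    """Return consecutive sequences of n whitespace-separated tokens."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    print(regex_features("El paciente no fuma y es obeso"))
    print(word_ngrams("Paciente no fuma", 2))
```

A single expression such as the first pattern matches several inflected forms at once, whereas an n-gram feature space would need a separate feature for each surface form. This is one plausible reading of why fewer training examples might suffice, offered here as an assumption rather than a finding of the paper.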

Keywords