PLoS ONE (Jan 2015)

Assessing the impact of case sensitivity and term information gain on biomedical concept recognition.

  • Tudor Groza,
  • Karin Verspoor

DOI
https://doi.org/10.1371/journal.pone.0119091
Journal volume & issue
Vol. 10, no. 3
p. e0119091

Abstract

Read online

Concept recognition (CR) is a foundational task in the biomedical domain. It supports the important process of transforming unstructured resources into structured knowledge. To date, several CR approaches have been proposed, most of which focus on a particular set of biomedical ontologies. Their underlying mechanisms vary from shallow natural language processing and dictionary lookup to specialized machine learning modules. However, no prior approach considers the case sensitivity characteristics and the term distribution of the underlying ontology on the CR process. This article proposes a framework that models the CR process as an information retrieval task in which both case sensitivity and the information gain associated with tokens in lexical representations (e.g., term labels, synonyms) are central components of a strategy for generating term variants. The case sensitivity of a given ontology is assessed based on the distribution of so-called case sensitive tokens in its terms, while information gain is modelled using a combination of divergence from randomness and mutual information. An extensive evaluation has been carried out using the CRAFT corpus. Experimental results show that case sensitivity awareness leads to an increase of up to 0.07 F1 against a non-case sensitive baseline on the Protein Ontology and GO Cellular Component. Similarly, the use of information gain leads to an increase of up to 0.06 F1 against a standard baseline in the case of GO Biological Process and Molecular Function and GO Cellular Component. Overall, subject to the underlying token distribution, these methods lead to valid complementary strategies for augmenting term label sets to improve concept recognition.