
BioLemmatizer: a lemmatization tool for morphological processing of biomedical text

Journal of Biomedical Semantics. 2012;3(1):3 DOI 10.1186/2041-1480-3-3



Journal Title: Journal of Biomedical Semantics

ISSN: 2041-1480 (Online)

Publisher: BMC

LCC Subject Category: Medicine: Medicine (General): Computer applications to medicine. Medical informatics

Country of publisher: United Kingdom

Language of fulltext: English

Full-text formats available: PDF, HTML



Liu Haibin

Christiansen Tom

Baumgartner William A

Verspoor Karin


Blind peer review


Time From Submission to Publication: 30 weeks



Abstract

Background

The wide variety of morphological variants of domain-specific technical terms contributes to the complexity of performing natural language processing of the scientific literature related to molecular biology. For morphological analysis of these texts, lemmatization has been actively applied in recent biomedical research.

Results

In this work, we developed a domain-specific lemmatization tool, BioLemmatizer, for the morphological analysis of biomedical literature. The tool focuses on the inflectional morphology of English and is based on the general English lemmatization tool MorphAdorner. The BioLemmatizer is further tailored to the biological domain through the incorporation of several published lexical resources. It retrieves lemmas from a word lexicon and defines a set of rules that transform a word into a lemma if the word is not found in the lexicon. An innovative aspect of the BioLemmatizer is its hierarchical strategy for searching the lexicon, which enables discovery of the correct lemma even if the input Part-of-Speech information is inaccurate. The BioLemmatizer achieves an accuracy of 97.5% in lemmatizing an evaluation set prepared from the CRAFT corpus, a collection of full-text biomedical articles, and an accuracy of 97.6% on the LLL05 corpus. Its contribution to improving the accuracy of a practical information extraction task is further demonstrated when it is used as a component in a biomedical text mining system.

Conclusions

The BioLemmatizer outperforms other tools when compared with eight existing lemmatizers. The BioLemmatizer is released as open source software and is available for download.
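The lookup strategy described in the abstract (lexicon first, a hierarchical search that tolerates a wrong Part-of-Speech tag, then rules as a fallback) can be illustrated with a minimal Python sketch. This is not the BioLemmatizer's actual code, lexicon, or API: the toy `LEXICON`, the `COARSE` tag grouping, the suffix `RULES`, and all function names are assumptions made up for illustration, and the three-level search order is only one plausible reading of "hierarchical".

```python
from typing import Optional

# Toy lexicon mapping (word, POS tag) -> lemma. Entries are illustrative only.
LEXICON = {
    ("phosphorylated", "VBN"): "phosphorylate",
    ("binds", "VBZ"): "bind",
    ("nuclei", "NNS"): "nucleus",
    ("analyses", "NNS"): "analysis",
}

# Hypothetical grouping of fine-grained tags into coarse word classes.
COARSE = {
    "VBN": "verb", "VBZ": "verb", "VBD": "verb",
    "NNS": "noun", "NN": "noun",
    "JJ": "adjective",
}

# Hypothetical suffix rules (suffix, replacement), tried in order.
RULES = [("ies", "y"), ("sses", "ss"), ("es", "e"), ("s", "")]


def lexicon_lookup(word: str, pos: str) -> Optional[str]:
    """Search the lexicon from the most to the least specific match."""
    word = word.lower()
    # Level 1: exact (word, POS) match.
    if (word, pos) in LEXICON:
        return LEXICON[(word, pos)]
    # Level 2: same word under any tag of the same coarse class.
    coarse = COARSE.get(pos)
    if coarse is not None:
        for (w, p), lemma in LEXICON.items():
            if w == word and COARSE.get(p) == coarse:
                return lemma
    # Level 3: same word under any tag at all (input POS may be wrong).
    for (w, _), lemma in LEXICON.items():
        if w == word:
            return lemma
    return None


def apply_rules(word: str) -> str:
    """Fallback for out-of-lexicon words: apply the first matching suffix rule."""
    word = word.lower()
    for suffix, replacement in RULES:
        # Require a stem of at least three characters before stripping.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word


def lemmatize(word: str, pos: str) -> str:
    """Lexicon lookup first, rule-based transformation second."""
    lemma = lexicon_lookup(word, pos)
    return lemma if lemma is not None else apply_rules(word)
```

In this sketch, `lemmatize("phosphorylated", "JJ")` still reaches the verb entry through the third search level even though the supplied tag is wrong, while an out-of-lexicon word such as "kinases" falls through to the suffix rules.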