BioLemmatizer: a lemmatization tool for morphological processing of biomedical text

Journal of Biomedical Semantics. 2012;3(1):3 DOI 10.1186/2041-1480-3-3

 

Journal Title: Journal of Biomedical Semantics

ISSN: 2041-1480 (Online)

Publisher: BMC

LCC Subject Category: Medicine: Medicine (General): Computer applications to medicine. Medical informatics

Country of publisher: United Kingdom

Language of fulltext: English

Full-text formats available: PDF, HTML

 

AUTHORS


Liu Haibin

Christiansen Tom

Baumgartner William A

Verspoor Karin

EDITORIAL INFORMATION

Blind peer review

Time From Submission to Publication: 30 weeks

 

ABSTRACT

Background

The wide variety of morphological variants of domain-specific technical terms complicates natural language processing of the scientific literature on molecular biology. Lemmatization has been actively applied to the morphological analysis of these texts in recent biomedical research.

Results

In this work, we developed a domain-specific lemmatization tool, BioLemmatizer, for the morphological analysis of biomedical literature. The tool focuses on the inflectional morphology of English and is based on the general English lemmatization tool MorphAdorner. The BioLemmatizer is further tailored to the biological domain through the incorporation of several published lexical resources. It retrieves lemmas from a word lexicon and defines a set of rules that transform a word into a lemma when the word is not found in the lexicon. An innovative aspect of the BioLemmatizer is its hierarchical strategy for searching the lexicon, which enables the discovery of the correct lemma even if the input part-of-speech information is inaccurate. The BioLemmatizer achieves an accuracy of 97.5% in lemmatizing an evaluation set prepared from the CRAFT corpus, a collection of full-text biomedical articles, and an accuracy of 97.6% on the LLL05 corpus. Its contribution to improving the accuracy of a practical information extraction task is further demonstrated when it is used as a component in a biomedical text mining system.

Conclusions

The BioLemmatizer outperforms other tools when compared with eight existing lemmatizers. The BioLemmatizer is released as open source software and can be downloaded from http://biolemmatizer.sourceforge.net.
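The lexicon lookup with rule fallback described in the Results can be illustrated with a minimal Python sketch. This is not the BioLemmatizer's actual code, data, or API: the toy lexicon, the tag-to-class mapping, the three lookup levels, and the suffix rules are all assumptions chosen only to show how a hierarchical search might still return the right lemma when the supplied part-of-speech tag is wrong.

```python
# Illustrative sketch only; every name and entry below is hypothetical.

# Toy lexicon: (word form, fine-grained POS tag) -> lemma.
LEXICON = {
    ("phosphorylated", "VBD"): "phosphorylate",
    ("phosphorylated", "VBN"): "phosphorylate",
    ("nuclei", "NNS"): "nucleus",
    ("binding", "VBG"): "bind",
    ("binding", "NN"): "binding",
}

# Assumed mapping from fine-grained tags to coarse word classes.
COARSE = {
    "NN": "noun", "NNS": "noun",
    "VB": "verb", "VBD": "verb", "VBG": "verb", "VBN": "verb", "VBZ": "verb",
}

# Assumed fallback suffix rules, tried in order: (suffix, replacement).
SUFFIX_RULES = [("ies", "y"), ("sses", "ss"), ("s", "")]


def lemmatize(word, pos=None):
    """Return a lemma for `word`, relaxing the POS constraint step by step."""
    word = word.lower()

    # Level 1: exact match on word form and fine-grained tag.
    if (word, pos) in LEXICON:
        return LEXICON[(word, pos)]

    # Level 2: same word form, any tag in the same coarse word class.
    coarse = COARSE.get(pos)
    if coarse is not None:
        for (form, tag), lemma in LEXICON.items():
            if form == word and COARSE.get(tag) == coarse:
                return lemma

    # Level 3: same word form under any tag (the input tag may be wrong).
    for (form, _tag), lemma in LEXICON.items():
        if form == word:
            return lemma

    # Fallback: the word is not in the lexicon, so apply suffix rules.
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement

    # Nothing applied: return the word unchanged.
    return word
```

In this sketch, `lemmatize("nuclei", "VBZ")` still reaches the lexicon entry for "nuclei" at the third level even though the tag is a verb tag, while an out-of-lexicon word such as "kinases" falls through to the suffix rules.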