MBA: a literature mining system for extracting biomedical abbreviations

Lei YiMing; Wang ZhiHao; Xu Yun; Zhao YuZhong; Xue Yu

doi:10.1186/1471-2105-10-14

BMC Bioinformatics (Jan 2009)

DOI: https://doi.org/10.1186/1471-2105-10-14
Journal volume & issue: Vol. 10, no. 1, p. 14

Abstract

Background: The rapid growth of the biomedical literature presents many challenges for biological researchers. One such challenge is the heavy use of abbreviations. Extracting abbreviations and their definitions accurately helps biologists and facilitates biomedical text analysis. Existing approaches fall into four broad categories: rule-based, machine-learning-based, text-alignment-based and statistical. State-of-the-art methods either focus exclusively on acronym-type abbreviations or cannot recognize rare abbreviations. We propose a systematic method to extract abbreviations effectively. First, a scoring method classifies abbreviations as acronym-type or non-acronym-type; their definitions are then identified by two different methods: a text alignment algorithm for the former and a statistical method for the latter.

Results: A literature mining system, MBA, was constructed to extract both acronym-type and non-acronym-type abbreviations. An abbreviation-tagged literature corpus, the Medstract gold standard corpus, was used to evaluate the system. MBA achieved a recall of 88% at a precision of 91% on the Medstract gold standard evaluation corpus.

Conclusion: We present a new literature mining system, MBA, for extracting biomedical abbreviations. Our evaluation demonstrates that MBA performs better than the other methods. It can identify the definitions not only of acronym-type abbreviations, including slightly irregular ones, but also of non-acronym-type abbreviations.
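The two-stage idea in the abstract (score a candidate, then route it to an acronym-specific or a non-acronym-specific method) can be illustrated with a minimal sketch. This is not the MBA implementation: the function names, the initial-letter scoring rule, the 0.8 threshold and the "long form (short form)" pattern are all assumptions made for illustration. The statistical stage for non-acronym-type abbreviations is left out, since the abstract gives no details of it.

```python
import re

def acronym_score(short_form: str, candidate: str) -> float:
    """Hypothetical score: fraction of short-form characters that match,
    in order, the initial letters of the words in the candidate phrase."""
    initials = [w[0].lower() for w in re.findall(r"[A-Za-z0-9]+", candidate)]
    letters = [c.lower() for c in short_form if c.isalnum()]
    if not letters:
        return 0.0
    pos, matched = 0, 0
    for c in letters:
        try:
            idx = initials.index(c, pos)  # search only to the right of the last match
        except ValueError:
            continue  # this character has no matching word initial
        matched += 1
        pos = idx + 1
    return matched / len(letters)

def find_definition(short_form: str, preceding_text: str, threshold: float = 0.8):
    """Try growing windows of words that end just before the parenthesis.
    The threshold is an assumed value, not one taken from the paper."""
    words = preceding_text.split()
    max_size = min(len(words), 2 * len(short_form))
    for size in range(1, max_size + 1):
        candidate = " ".join(words[-size:])
        if acronym_score(short_form, candidate) >= threshold:
            return "acronym-type", candidate
    # A real system would hand this case to a statistical method.
    return "non-acronym-type", None

def extract_pairs(sentence: str):
    """Find parenthesised short forms and classify each one."""
    pairs = []
    for m in re.finditer(r"\(([A-Za-z0-9-]{2,10})\)", sentence):
        short_form = m.group(1)
        kind, definition = find_definition(short_form, sentence[:m.start()])
        pairs.append((short_form, kind, definition))
    return pairs
```

For a sentence such as "We trained a hidden Markov model (HMM) on the data.", the sketch pairs HMM with "hidden Markov model". A short form whose letters do not line up with word initials is only flagged as non-acronym-type and left undefined, which is where a corpus-based statistical method would take over.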

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
