Large-scale online semantic indexing of biomedical articles via an ensemble of multi-label classification models

Yannis Papanikolaou; Grigorios Tsoumakas; Manos Laliotis; Nikos Markantonatos; Ioannis Vlahavas

doi:10.1186/s13326-017-0150-0

Journal of Biomedical Semantics (Sep 2017)

Affiliations

Yannis Papanikolaou: Department of Computer Science, Aristotle University
Grigorios Tsoumakas: Department of Computer Science, Aristotle University
Manos Laliotis: Atypon
Nikos Markantonatos: Atypon Hellas
Ioannis Vlahavas: Department of Computer Science, Aristotle University

DOI: https://doi.org/10.1186/s13326-017-0150-0
Journal volume & issue: Vol. 8, no. 1, pp. 1–13

Abstract

Background: In this paper we present the approach that we employed for large-scale multi-label semantic indexing of biomedical papers. This work was mainly carried out within the context of the BioASQ challenge (2013–2017), a challenge concerned with biomedical semantic indexing and question answering.

Methods: Our main contribution is a MUlti-Label Ensemble method (MULE) that incorporates a McNemar statistical significance test to validate the combination of the constituent machine learning algorithms. Secondary contributions include a study of the temporal aspects of the BioASQ corpus (the observations also apply to its superset, the PubMed article collection) and the proper parametrization of the algorithms used for this challenging classification task.

Results: The ensemble method that we developed is compared to other approaches in experimental scenarios on subsets of the BioASQ corpus, with positive results. In our participation in the BioASQ challenge we obtained first place in 2013 and second place in each of the four following years, steadily outperforming MTI, the indexing system of the National Library of Medicine (NLM).

Conclusions: The results of our experimental comparisons suggest that employing a statistical significance test to validate the ensemble method's choices is the optimal approach for ensembling multi-label classifiers, especially in contexts with many rare labels.
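The abstract does not spell out how the McNemar test validates the ensemble's choices, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that, for a single label, a default model is replaced by a candidate model only when the candidate is significantly better on validation data. The function names, the 0.05 significance level, and the use of the continuity-corrected statistic are all assumptions made for illustration.

```python
# Illustrative sketch only: names, threshold, and decision rule are assumptions,
# not taken from the paper.

# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
CHI2_CRITICAL_95 = 3.841


def mcnemar_statistic(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar statistic for two classifiers on one label.

    b counts examples that A gets right and B gets wrong;
    c counts examples that A gets wrong and B gets right.
    """
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    if b + c == 0:
        # The two classifiers never disagree in correctness.
        return 0.0, b, c
    return (abs(b - c) - 1) ** 2 / (b + c), b, c


def choose_model_for_label(y_true, pred_default, pred_candidate):
    """Keep the default model unless the candidate is significantly better."""
    stat, b, c = mcnemar_statistic(y_true, pred_default, pred_candidate)
    if c > b and stat > CHI2_CRITICAL_95:
        return "candidate"
    return "default"


if __name__ == "__main__":
    # Hypothetical validation data for one label.
    y_true = [1] * 20
    pred_default = [0] * 12 + [1] * 8
    pred_candidate = [1] * 20
    print(choose_model_for_label(y_true, pred_default, pred_candidate))
```

The point of a rule like this is that, for rare labels, a candidate that looks better on a handful of validation examples is not selected unless the difference is unlikely to be due to chance.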

Published in Journal of Biomedical Semantics

ISSN: 2041-1480 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: https://jbiomedsem.biomedcentral.com
