Combining word embeddings to extract chemical and drug entities in biomedical literature

Pilar López-Úbeda; Manuel Carlos Díaz-Galiano; L. Alfonso Ureña-López; M. Teresa Martín-Valdivia

doi:10.1186/s12859-021-04188-3

BMC Bioinformatics (Dec 2021)

Affiliations

Pilar López-Úbeda: Department of Computer Science, Advanced Studies Center in Information and Communication Technologies (CEATIC), Universidad de Jaén
Manuel Carlos Díaz-Galiano: Department of Computer Science, Advanced Studies Center in Information and Communication Technologies (CEATIC), Universidad de Jaén
L. Alfonso Ureña-López: Department of Computer Science, Advanced Studies Center in Information and Communication Technologies (CEATIC), Universidad de Jaén
M. Teresa Martín-Valdivia: Department of Computer Science, Advanced Studies Center in Information and Communication Technologies (CEATIC), Universidad de Jaén

DOI: https://doi.org/10.1186/s12859-021-04188-3
Journal volume & issue: Vol. 22, no. S1
pp. 1 – 17

Abstract

Background: Natural language processing (NLP) and text mining technologies for the extraction and indexing of chemical and drug entities are key to improving access to, and integration of, information from unstructured data such as biomedical literature.

Methods: In this paper we evaluate two important NLP tasks: named entity recognition (NER) and entity indexing using the SNOMED-CT terminology. For this purpose, we propose a combination of word embeddings in order to improve the results obtained in the PharmaCoNER challenge.

Results: For the NER task we present a neural network composed of a BiLSTM with a CRF sequential layer, where different word embeddings are combined as input to the architecture. A hybrid method combining supervised and unsupervised models is used for the concept indexing task. The supervised model uses the training set to look up concepts already seen during training, and the unsupervised model is based on a 6-step architecture. This architecture uses a dictionary of synonyms and the Levenshtein distance to assign the correct SNOMED-CT code.

Conclusion: On the one hand, the combination of word embeddings helps to improve the recognition of chemicals and drugs in the biomedical literature. We achieved 91.41% precision, 90.14% recall, and 90.77% F1-score using micro-averaging. On the other hand, our indexing system achieves 92.91% precision, 92.44% recall, and 92.67% F1-score. These results would place us first in the final ranking.
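
The dictionary-plus-Levenshtein idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' 6-step architecture, whose details the abstract does not give; it only shows one plausible way a mention could be matched to the closest dictionary term by edit distance. The dictionary entries, the placeholder codes (CODE-001, CODE-002), the function names, and the max_distance threshold are all assumptions made for illustration and are not real SNOMED-CT identifiers.

```python
from typing import Dict, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical toy dictionary: synonyms share a placeholder code.
# These are NOT real SNOMED-CT identifiers.
TOY_DICTIONARY: Dict[str, str] = {
    "paracetamol": "CODE-001",
    "acetaminophen": "CODE-001",
    "ibuprofen": "CODE-002",
}


def assign_code(mention: str,
                dictionary: Dict[str, str] = TOY_DICTIONARY,
                max_distance: int = 2) -> Optional[str]:
    """Return the code of the exact or closest dictionary term, else None."""
    key = mention.strip().lower()
    if key in dictionary:  # exact match (after normalisation)
        return dictionary[key]
    # Fall back to the nearest term by edit distance (assumed threshold).
    best_term = min(dictionary, key=lambda term: levenshtein(key, term))
    if levenshtein(key, best_term) <= max_distance:
        return dictionary[best_term]
    return None
```

In this sketch, a spelling variant such as "ibuprofeno" is one edit away from "ibuprofen" and so receives that term's code, while a mention that is far from every dictionary entry is left unassigned rather than forced onto a code.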

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
