PubRunner: A light-weight framework for updating text mining results [version 2; referees: 1 approved, 2 approved with reservations]

Kishore R. Anekalla; J.P. Courneya; Nicolas Fiorini; Jake Lever; Michael Muchow; Ben Busby

doi:10.12688/f1000research.11389.2

F1000Research (Oct 2017)

Affiliations

Kishore R. Anekalla: Northwestern University, Chicago, IL, 60611, USA
J.P. Courneya: Health Sciences and Human Services Library, University of Maryland, Baltimore, MD, 21201, USA
Nicolas Fiorini: National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD, 20894, USA
Jake Lever: Canada's Michael Smith Genome Sciences Centre, University of British Columbia, Vancouver, BC, V5Z 4S6, Canada
Michael Muchow: National Institute of Standards and Technology, Gaithersburg, MD, 20899, USA
Ben Busby: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA

Journal volume & issue: Vol. 6

Abstract

Biomedical text mining promises to help biologists quickly navigate the combined knowledge in their domain, enabling a better understanding of the complex interactions within biological systems and faster hypothesis generation. However, new biomedical research articles are published daily, and text mining tools are only as good as the corpus from which they work. Many text mining tools are underused because their results are static and do not reflect the constantly expanding knowledge in the field. For biomedical text mining to become an indispensable tool for researchers, this problem must be addressed. To this end, we present PubRunner, a framework for regularly running text mining tools on the latest publications. PubRunner is lightweight, simple to use, and can be integrated with an existing text mining tool. The workflow involves downloading the latest abstracts from PubMed, executing a user-defined tool, pushing the resulting data to a public FTP server or Zenodo dataset, and publicizing the location of these results on the public PubRunner website. We illustrate the use of this tool by re-running the commonly used word2vec tool on the latest PubMed abstracts to generate up-to-date word vector representations for the biomedical domain. This serves as a proof of concept that we hope will encourage text mining developers to build tools that will truly aid biologists in exploring the latest publications.
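
The four-step workflow described in the abstract (download, run the tool, push results, publicize) might be sketched roughly as below. This is a minimal illustration under stated assumptions, not PubRunner's actual code or interface: the names `run_update`, `UpdateResult`, `fetch_corpus`, `run_tool`, `upload`, and `announce` are all hypothetical, and each step is passed in as a plain callable so that no particular PubMed, FTP, or Zenodo client is implied.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class UpdateResult:
    """Summary of one update cycle (hypothetical structure)."""
    corpus_files: List[Path]   # files returned by the download step
    output_dir: Path           # directory the tool wrote its results to
    location: str              # public location returned by the upload step


def run_update(
    workdir: Path,
    fetch_corpus: Callable[[Path], List[Path]],
    run_tool: Callable[[Path, Path], None],
    upload: Callable[[Path], str],
    announce: Callable[[str], None],
) -> UpdateResult:
    """Run one update cycle: download, execute tool, push, publicize.

    All four callables are assumptions made for this sketch; a real
    deployment would supply its own implementations.
    """
    corpus_dir = workdir / "corpus"
    output_dir = workdir / "output"
    corpus_dir.mkdir(parents=True, exist_ok=True)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: download the latest abstracts into corpus_dir.
    corpus_files = fetch_corpus(corpus_dir)

    # Step 2: execute the user-defined text mining tool on the corpus.
    run_tool(corpus_dir, output_dir)

    # Step 3: push the results to public storage; get back their location.
    location = upload(output_dir)

    # Step 4: publicize where the fresh results can be found.
    announce(location)

    return UpdateResult(corpus_files, output_dir, location)
```

Keeping each step behind a callable means the same loop could be scheduled periodically with different tools or storage targets, which matches the abstract's point that the framework wraps an existing tool rather than replacing it.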

Bioinformatics

Published in F1000Research

ISSN: 2046-1402 (Online)
Publisher: F1000 Research Ltd
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://f1000research.com
