COVID-19 information retrieval with deep-learning based semantic search, question answering, and abstractive summarization

Andre Esteva; Anuprit Kale; Romain Paulus; Kazuma Hashimoto; Wenpeng Yin; Dragomir Radev; Richard Socher

doi:10.1038/s41746-021-00437-0

npj Digital Medicine (Apr 2021)

Affiliations

Andre Esteva: Salesforce Research
Anuprit Kale: Salesforce Research
Romain Paulus: Salesforce Research
Kazuma Hashimoto: Salesforce Research
Wenpeng Yin: Salesforce Research
Dragomir Radev: Salesforce Research
Richard Socher: Salesforce Research

DOI: https://doi.org/10.1038/s41746-021-00437-0
Journal volume & issue: Vol. 4, no. 1, pp. 1–9

Abstract

The COVID-19 global pandemic has resulted in international efforts to understand, track, and mitigate the disease, yielding a significant corpus of COVID-19 and SARS-CoV-2-related publications across scientific disciplines. Throughout 2020, over 400,000 coronavirus-related publications were collected through the COVID-19 Open Research Dataset. Here, we present CO-Search, a semantic, multi-stage search engine designed to handle complex queries over the COVID-19 literature, potentially aiding overburdened health workers in finding scientific answers and avoiding misinformation during a time of crisis. CO-Search is built from two sequential parts: a hybrid semantic-keyword retriever, which takes an input query and returns a sorted list of the 1000 most relevant documents, and a re-ranker, which further orders them by relevance. The retriever is composed of a deep learning model (Siamese-BERT) that encodes query-level meaning, along with two keyword-based models (BM25, TF-IDF) that emphasize the most important words of a query. The re-ranker assigns a relevance score to each document, computed from the outputs of (1) a question-answering module, which gauges how well each document answers the query, and (2) an abstractive summarization module, which determines how well a query matches a generated summary of the document. To account for the relatively limited dataset, we develop a text augmentation technique that splits the documents into pairs of paragraphs and the citations contained in them, creating millions of (citation title, paragraph) tuples for training the retriever. We evaluate our system (http://einstein.ai/covid) on the data of the TREC-COVID information retrieval challenge, obtaining strong performance across multiple key information retrieval metrics.
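The two-stage design described in the abstract (retrieve a candidate list, then re-rank it) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say how the individual model scores are combined, so the min-max normalization and plain averaging used here are assumptions, and every name (`min_max`, `fuse`, `rank`, `co_search_sketch`, `keyword_overlap`) is hypothetical. In a real system the retriever scorers would be Siamese-BERT, BM25, and TF-IDF, and the re-ranker scorers would come from the question-answering and summarization modules; here they are arbitrary callables.

```python
from typing import Callable, List, Sequence

# A scorer maps (query, document text) to a relevance score.
# Stand-in for the models named in the abstract; hypothetical type alias.
Scorer = Callable[[str, str], float]


def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1] so that different scorers are comparable."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(query: str, docs: Sequence[str], scorers: Sequence[Scorer]) -> List[float]:
    """Average the normalized scores of all scorers (assumed fusion rule)."""
    per_scorer = [min_max([score(query, d) for d in docs]) for score in scorers]
    return [sum(column) / len(scorers) for column in zip(*per_scorer)]


def rank(query: str, docs: Sequence[str], scorers: Sequence[Scorer], top_k: int) -> List[str]:
    """Return the top_k documents ordered by fused score, best first."""
    fused = fuse(query, docs, scorers)
    order = sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)
    return [docs[i] for i in order[:top_k]]


def co_search_sketch(
    query: str,
    docs: Sequence[str],
    retriever_scorers: Sequence[Scorer],
    reranker_scorers: Sequence[Scorer],
    top_k: int = 1000,
) -> List[str]:
    """Stage 1: the hybrid retriever keeps top_k candidates.
    Stage 2: the re-ranker reorders only those candidates."""
    candidates = rank(query, docs, retriever_scorers, top_k)
    return rank(query, candidates, reranker_scorers, len(candidates))


def keyword_overlap(query: str, doc: str) -> float:
    """Toy keyword scorer (shared-word count); a placeholder, not BM25 or TF-IDF."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))
```

The point of the split is cost: the cheaper retriever scorers run over the whole corpus, while the more expensive re-ranker scorers only see the candidates that survive the first stage.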

Published in npj Digital Medicine

ISSN: 2398-6352 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: https://www.nature.com/npjdigitalmed/
