IEEE Access (Jan 2020)

Generating Biomedical Question Answering Corpora From Q&A Forums

  • Andre Lamurias,
  • Diana Sousa,
  • Francisco M. Couto

DOI
https://doi.org/10.1109/ACCESS.2020.3020868
Journal volume & issue
Vol. 8
pp. 161042 – 161051

Abstract

Read online

Question Answering (QA) is a natural language processing task that aims at obtaining relevant answers to user questions. While some progress has been made in this area, biomedical questions are still a challenge to most QA approaches, due to the complexity of the domain and limited availability of training sets. We present a method to automatically extract question-article pairs from Q&A web forums, which can be used for document retrieval, a crucial step of most QA systems. The proposed framework extracts from selected forums the questions and the respective answers that contain citations. This way, QA systems based on document retrieval can be developed and evaluated using the question-article pairs annotated by users of these forums. We generated the BiQA corpus by applying our framework to three forums, obtaining 7,453 questions and 14,239 question-article pairs. We evaluated how the number of articles associated with each question and the number of votes on each answer affects the performance of baseline document retrieval approaches. Also, we demonstrated that the articles given as answers are significantly similar to the questions and trained a state-of-the-art deep learning model that obtained similar performance to using a dataset manually annotated by experts. The proposed framework can be used to update the BiQA corpus from the same forums as new posts are made, and from other forums that support their answers with documents. The BiQA corpus and the framework used to generate it are available at https://github.com/lasigeBioTM/BiQA.

Keywords