HQA-Data: A historical question answer generation dataset from previous multi perspective conversation

Sabbir Hosen; Jannatul Ferdous Eva; Ayman Hasib; Aloke Kumar Saha; M.F. Mridha; Anwar Hussen Wadud

Data in Brief (Jun 2023)

Affiliations

Sabbir Hosen: Department of Computer Science and Engineering, University of Asia Pacific, Dhaka, Bangladesh
Jannatul Ferdous Eva: Department of Computer Science and Engineering, University of Asia Pacific, Dhaka, Bangladesh
Ayman Hasib: Department of Computer Science and Engineering, University of Asia Pacific, Dhaka, Bangladesh
Aloke Kumar Saha: Department of Computer Science and Engineering, University of Asia Pacific, Dhaka, Bangladesh
M.F. Mridha: Department of Computer Science, American International University-Bangladesh, Dhaka, Bangladesh; Corresponding author.
Anwar Hussen Wadud: Department of Computer Science and Engineering, Bangladesh University of Business and Technology, Dhaka, Bangladesh

Journal volume & issue: Vol. 48
p. 109245

Abstract

This data article presents a question-answering (QA) dataset for training chatbot and chat-analysis models. The dataset targets NLP tasks in which a model must deliver a satisfactory response to a user's query. We built it from the well-known Ubuntu Dialogue Corpus, which consists of about one million multi-turn conversations containing around seven million utterances and one hundred million words. From these lengthy conversations, we derived a context for each dialogueID and then generated a number of questions and answers based on these contexts. All of these questions and answers are contained within their contexts. The dataset includes 9,364 contexts and 36,438 question-answer pairs. In addition to academic research, the dataset may be used for activities such as constructing a comparable QA dataset for another language, deep learning, language interpretation, reading comprehension, and open-domain question answering. We present the data in raw format; it has been open-sourced and is publicly available at https://data.mendeley.com/datasets/p85z3v45xk.
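The structure the abstract describes (one context per dialogueID, several question-answer pairs per context, every answer found inside its context) can be sketched as follows. This is only an illustration under stated assumptions: the class names, field names, and the sample record are hypothetical and are not taken from the published files, whose actual layout should be checked in the Mendeley repository.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QAPair:
    """One generated question with its answer (field names are assumed)."""
    question: str
    answer: str


@dataclass
class ContextRecord:
    """One context derived from a dialogue, with its QA pairs (assumed layout)."""
    dialogue_id: str
    context: str
    qa_pairs: List[QAPair] = field(default_factory=list)


def answers_outside_context(record: ContextRecord) -> List[QAPair]:
    """Return the QA pairs whose answer text does not occur in the context.

    The comparison is case-insensitive; an empty result means the record
    satisfies the "answers are contained within the context" property.
    """
    context = record.context.lower()
    return [qa for qa in record.qa_pairs if qa.answer.lower() not in context]


def mean_pairs_per_context(records: List[ContextRecord]) -> float:
    """Average number of QA pairs per context across a list of records."""
    if not records:
        return 0.0
    return sum(len(r.qa_pairs) for r in records) / len(records)


# Invented sample record, for illustration only (not from the dataset).
sample = ContextRecord(
    dialogue_id="demo-1",
    context="Run sudo apt-get update before installing the package.",
    qa_pairs=[
        QAPair("What should be run first?", "sudo apt-get update"),
        QAPair("What should be done afterwards?", "reboot the machine"),
    ],
)

# The second pair is flagged because its answer is not a span of the context.
flagged = answers_outside_context(sample)
```

With the counts reported in the abstract, 36,438 pairs over 9,364 contexts works out to roughly 3.9 question-answer pairs per context.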

Published in Data in Brief

ISSN: 2352-3409 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Science (General)
Website: http://www.journals.elsevier.com/data-in-brief/