Data Augmentation and Preparation Process of PerInfEx: A Persian Chatbot With the Ability of Information Extraction

Pegah Safari; Mehrnoush Shamsfard

doi:10.1109/ACCESS.2024.3360863

IEEE Access (Jan 2024)

Affiliations

Pegah Safari: Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
Mehrnoush Shamsfard: Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran

DOI: https://doi.org/10.1109/ACCESS.2024.3360863
Journal volume & issue: Vol. 12, pp. 19158–19180

Abstract

In this paper, we describe the data preparation for our proposed chatbot, PerInfEx (Persian Information Extraction chatbot). It aims to chit-chat interactively with users in Persian and, while asking as few direct questions as possible, to extract as much personal information as possible, such as the user’s age or occupation. Collecting data of considerable size that is aligned with our system’s specifics is a crucial step in training the data-hungry Natural Language Understanding (NLU) and Natural Language Generation (NLG) modules. Initially, for the NLU module, we collect 99 free-discussion dialogues and crawl 74 English training conversations as more general datasets, and we also manually translate 72 dialogues from the ConvAI2 corpus. Moreover, we gamify collection by implementing a chatting website, which results in 94 dialogues. The website detects direct questions and assigns random profiles to participants, who must guess their opponent’s profile. We also propose two augmentation methods, a semi-automatic one and a novel fully automatic one, which are comprehensively evaluated on NLU benchmarks and applied to our datasets. In addition, by prompting OpenAI’s GPT-3.5 model, we automatically generate 304 dialogues. The first part of these datasets is manually annotated, while we use an active learning method to annotate the rest. Next, to evaluate data quality, we assess the datasets extrinsically using an NLU baseline, which yields intent accuracy = 88.64, slot F1 = 83.68, and exact match = 78.22. For the NLG module, we automatically translate almost all of the remaining ConvAI2 corpus (16,217 dialogues) and paraphrase the previously collected sets for its fine-tuning using the GPT-3.5 model. Their assessment using our NLG baseline results in a perplexity of 15.74 on the training set and 52.17 on the test set.
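
For readers unfamiliar with the evaluation metrics named in the abstract, the following is a minimal sketch of how intent accuracy, slot F1, exact match, and perplexity are commonly computed. It is not the authors' evaluation code: the function names, the BIO tagging scheme, and the span-level definition of slot F1 are assumptions made for illustration, and the functions return fractions rather than the percentage-style figures reported above.

```python
import math

def intent_accuracy(gold_intents, pred_intents):
    """Fraction of utterances whose predicted intent equals the gold intent."""
    correct = sum(g == p for g, p in zip(gold_intents, pred_intents))
    return correct / len(gold_intents)

def bio_spans(tags):
    """Collect (start, end, label) spans from a BIO tag sequence (assumed scheme)."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans

def slot_f1(gold_tag_seqs, pred_tag_seqs):
    """Micro-averaged span-level F1 over all utterances."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_tag_seqs, pred_tag_seqs):
        g, p = bio_spans(gold), bio_spans(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)

def exact_match(gold_intents, pred_intents, gold_tag_seqs, pred_tag_seqs):
    """Fraction of utterances with both the intent and every slot tag correct."""
    rows = zip(gold_intents, pred_intents, gold_tag_seqs, pred_tag_seqs)
    hits = sum(gi == pi and list(gs) == list(ps) for gi, pi, gs, ps in rows)
    return hits / len(gold_intents)

def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical toy data; the intent and slot labels are invented for illustration.
gold_intents = ["inform_age", "inform_job", "greet"]
pred_intents = ["inform_age", "inform_job", "inform_job"]
gold_slots = [["O", "B-age", "I-age"], ["B-job", "O"], ["O", "O"]]
pred_slots = [["O", "B-age", "I-age"], ["O", "O"], ["O", "O"]]
```

On this toy data only the first utterance has both its intent and all slot tags correct, so exact match is lower than either intent accuracy or slot F1. The same ordering appears in the reported figures, which is expected because exact match requires both predictions to be right at once.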

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
