IEEE Access (Jan 2023)

A Transformer-Based Educational Virtual Assistant Using Diacriticized Latin Script

  • Khang Nhut Lam,
  • Loc Huu Nguy,
  • Van Lam Le,
  • Jugal Kalita

DOI
https://doi.org/10.1109/ACCESS.2023.3307635
Journal volume & issue
Vol. 11
pp. 90094 – 90104

Abstract

A virtual assistant or smart chatbot should be able to understand user questions and respond correctly and usefully, even if the questions are ungrammatical or contain misspellings and other errors. This paper describes the design and construction of a text-to-text virtual assistant in Vietnamese, a language written in the Latin script with heavy use of diacritics, to support students at a university with over forty thousand students. The virtual assistant consists of two integrated chatbots, both using Transformers: a) a closed-domain chatbot, trained on over thirty-five thousand factual question-answer pairs to engage in university-related conversation, and b) an open-domain chatbot, trained on a large movie dialog dataset to engage in general conversation. The integrated virtual assistant classifies each question as either factual or general and engages the corresponding chatbot to respond in an appropriate and natural manner. Although Vietnamese uses diacritics copiously, even educated users tend to omit them when typing. To facilitate smooth text-based communication, our design therefore includes extensive pre-processing that uses trained Transformer models to restore missing diacritics and correct misspellings. Our Transformer models outperform existing approaches for diacritic restoration and are better than several other methods for spelling correction in Vietnamese. In addition, the closed-domain chatbot performs better than other generative chatbots that have been developed to assist students in a university environment, irrespective of language and location.
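
To make the diacritic problem concrete, the following is a minimal sketch of how undiacriticized Vietnamese text can be produced from correct text, for example to build (input, target) pairs for a restoration model. It is an illustration under stated assumptions, not the authors' pipeline: the helper names `strip_diacritics` and `make_restoration_pair` are hypothetical, and only Python's standard `unicodedata` module is used.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese diacritics from text (hypothetical helper)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn").
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # The letters đ/Đ have no canonical decomposition, so map them explicitly.
    return stripped.replace("đ", "d").replace("Đ", "D")


def make_restoration_pair(sentence: str) -> tuple[str, str]:
    """Return (undiacriticized input, diacriticized target) for one sentence."""
    # Assumption: a restoration model would be trained to map the first
    # element back to the second; the paper's actual data format is not
    # stated in the abstract.
    return strip_diacritics(sentence), sentence


if __name__ == "__main__":
    source, target = make_restoration_pair("Trường đại học")
    print(source)
    print(target)
```

The explicit mapping for `đ` matters because Unicode normalization alone leaves that letter unchanged, while users who skip diacritics typically type a plain `d`.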

Keywords