Applied Sciences (Feb 2023)

DIR: A Large-Scale Dialogue Rewrite Dataset for Cross-Domain Conversational Text-to-SQL

  • Jieyu Li,
  • Zhi Chen,
  • Lu Chen,
  • Zichen Zhu,
  • Hanqi Li,
  • Ruisheng Cao,
  • Kai Yu

DOI
https://doi.org/10.3390/app13042262
Journal volume & issue
Vol. 13, no. 4
p. 2262

Abstract

Semantic co-reference and ellipsis often cause information deficiency when natural language utterances in a multi-turn dialogue are parsed into SQL (i.e., the conversational text-to-SQL task). Dividing the dialogue understanding task into dialogue utterance rewriting and language understanding is a feasible way to tackle this problem. To this end, we present a two-stage framework for conversational text-to-SQL tasks. To construct an effective rewriting model for the first stage, we provide a large-scale dialogue rewrite dataset (DIR), which is extended from two cross-domain conversational text-to-SQL datasets, SParC and CoSQL. The dataset contains 5908 dialogues involving 160 domains. It therefore not only serves conversational text-to-SQL tasks but is also a valuable corpus for dialogue rewrite research. In experiments, we validate the effectiveness of our annotations with a popular text-to-SQL parser, RAT-SQL. The results show QEM accuracy improvements of 11.81 and 27.17 on SParC and CoSQL, respectively, when the problem of semantically incomplete representations is eliminated by directly parsing the golden rewritten utterances. Evaluating the two-stage framework with different rewrite models shows that the quality of the rewrite model is important and still needs improvement. Additionally, because DIR is a new benchmark for the dialogue rewrite task, we report the performance of different baselines for related studies. Our dataset will be publicly available once this paper is accepted.
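
The two-stage framework described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `two_stage_parse`, `gold_rewriter`, and `lookup_parser`, the example utterances, and the SQL strings are all hypothetical, and the dictionary lookups merely stand in for a trained rewrite model and for a single-turn parser such as RAT-SQL. The sketch only shows the control flow: each utterance is first rewritten into a self-contained question using the dialogue history, and the rewritten question is then parsed on its own.

```python
from typing import Callable, Dict, List

# Stage 1 signature: (dialogue history, current utterance) -> self-contained utterance.
Rewriter = Callable[[List[str], str], str]
# Stage 2 signature: self-contained utterance -> SQL string.
Parser = Callable[[str], str]


def two_stage_parse(dialogue: List[str], rewriter: Rewriter, parser: Parser) -> List[str]:
    """Rewrite each turn using its history, then parse the rewrite in isolation."""
    history: List[str] = []
    sql_queries: List[str] = []
    for utterance in dialogue:
        standalone = rewriter(history, utterance)  # resolve co-reference and ellipsis
        sql_queries.append(parser(standalone))     # single-turn text-to-SQL
        history.append(utterance)
    return sql_queries


def gold_rewriter(gold: Dict[str, str]) -> Rewriter:
    """Hypothetical stand-in for a rewrite model: looks up annotated rewrites."""
    def rewrite(history: List[str], utterance: str) -> str:
        # Utterances without an annotation are assumed to be complete already.
        return gold.get(utterance, utterance)
    return rewrite


def lookup_parser(table: Dict[str, str]) -> Parser:
    """Hypothetical stand-in for a single-turn parser: a fixed question-to-SQL map."""
    def parse(question: str) -> str:
        return table[question]
    return parse


# Invented example data, not taken from DIR, SParC, or CoSQL.
dialogue = ["Show all singers.", "Only those older than 30."]
rewrites = {"Only those older than 30.": "Show all singers older than 30."}
sql_table = {
    "Show all singers.": "SELECT * FROM singer",
    "Show all singers older than 30.": "SELECT * FROM singer WHERE age > 30",
}

queries = two_stage_parse(dialogue, gold_rewriter(rewrites), lookup_parser(sql_table))
for query in queries:
    print(query)
```

Using annotated rewrites as the first stage corresponds to the golden-rewrite setting mentioned in the abstract. Replacing `gold_rewriter` with a learned model gives the setting in which rewrite quality limits end-to-end accuracy.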

Keywords