Array (Dec 2024)
Combining computational linguistics with sentence embedding to create a zero-shot NLIDB
Abstract
Accessing relational databases through natural language is a challenging task, and existing methods often suffer from poor domain generalization and high computational costs. In this study, we propose a novel approach that eliminates the training phase while offering high adaptability across domains. Our method combines structured linguistic rules, a curated vocabulary, and pre-trained embedding models to accurately translate natural language queries into SQL. Experimental results on the SPIDER benchmark demonstrate the effectiveness of our approach, which achieves execution accuracy rates of 72.03% on the training set and 70.83% on the development set while maintaining domain flexibility. Furthermore, the proposed system outperformed two extensively trained models by up to 28.33% on the development set, demonstrating its efficiency. This research presents a significant advancement in zero-shot Natural Language Interfaces for Databases (NLIDBs), providing a resource-efficient alternative for generating accurate SQL queries from plain-language inputs.
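To make the high-level idea concrete, the following is a minimal, illustrative sketch of the general pattern the abstract describes: using a pre-trained sentence-embedding model to match a question against schema elements and then applying a simple rule to assemble SQL. It is not the authors' implementation; the toy schema, the model choice (a generic sentence-transformers checkpoint), and the helper names `pick_column` and `to_sql` are all hypothetical assumptions for demonstration.

```python
# Illustrative sketch only: embedding-based schema linking plus a toy rule for SQL assembly.
from sentence_transformers import SentenceTransformer, util

# Hypothetical toy schema (table -> columns); not taken from the paper or the SPIDER setup.
SCHEMA = {
    "singer": ["name", "age", "country"],
    "concert": ["title", "year", "stadium_id"],
}

# Any pre-trained sentence-embedding model; this checkpoint is just an assumed example.
model = SentenceTransformer("all-MiniLM-L6-v2")


def pick_column(question: str) -> str:
    """Rank all table.column labels by embedding similarity to the question."""
    labels = [f"{table}.{col}" for table, cols in SCHEMA.items() for col in cols]
    q_emb = model.encode(question, convert_to_tensor=True)
    c_emb = model.encode(labels, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, c_emb)[0]          # cosine similarity to each label
    return labels[int(scores.argmax())]             # best-matching schema element


def to_sql(question: str) -> str:
    """Toy linguistic rule: 'how many ...' maps to COUNT, otherwise project the column."""
    table, column = pick_column(question).split(".")
    if question.lower().startswith("how many"):
        return f"SELECT COUNT(*) FROM {table}"
    return f"SELECT {column} FROM {table}"


print(to_sql("How many singers are there?"))    # e.g. SELECT COUNT(*) FROM singer
print(to_sql("List the names of all singers"))  # e.g. SELECT name FROM singer
```

Because the embedding model is used as-is and the rules are hand-written, nothing in this sketch requires task-specific training, which is the zero-shot property the abstract emphasizes.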