MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic Parsing

Authors

  • Longxu Dou Harbin Institute of Technology
  • Yan Gao Microsoft Research Asia
  • Mingyang Pan Harbin Institute of Technology
  • Dingzirui Wang Harbin Institute of Technology
  • Wanxiang Che Harbin Institute of Technology
  • Dechen Zhan Harbin Institute of Technology
  • Jian-Guang Lou Microsoft Research Asia

DOI:

https://doi.org/10.1609/aaai.v37i11.26499

Keywords:

SNLP: Lexical & Frame Semantics, Semantic Parsing, SNLP: Question Answering, SNLP: Sentence-Level Semantics and Textual Inference, SNLP: Syntax -- Tagging, Chunking & Parsing

Abstract

Text-to-SQL semantic parsing is an important NLP task that facilitates interaction between users and databases. Much recent progress in text-to-SQL has been driven by large-scale datasets, but most of them are centered on English. In this work, we present MultiSpider, the largest multilingual text-to-SQL semantic parsing dataset, which covers seven languages (English, German, French, Spanish, Japanese, Chinese, and Vietnamese). Building on MultiSpider, we further identify the lexical and structural challenges of text-to-SQL (caused by specific language properties and dialect sayings) and their intensity across different languages. Experimental results under various settings (zero-shot, monolingual, and multilingual) reveal a 6.1% absolute drop in accuracy in non-English languages. We conduct qualitative and quantitative analyses to understand the reasons for the performance drop in each language. Beyond the dataset, we also propose a simple schema augmentation framework, SAVe (Schema-Augmentation-with-Verification), which significantly boosts the overall performance by about 1.8% and closes the 29.5% performance gap across languages.

Published

2023-06-26

How to Cite

Dou, L., Gao, Y., Pan, M., Wang, D., Che, W., Zhan, D., & Lou, J.-G. (2023). MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic Parsing. Proceedings of the AAAI Conference on Artificial Intelligence, 37(11), 12745-12753. https://doi.org/10.1609/aaai.v37i11.26499

Section

AAAI Technical Track on Speech & Natural Language Processing