├── .gitignore ├── README.md ├── assets ├── An Example of the Intermediate Representation.svg ├── An Overview of NL2SQL Method in the LLM Era.svg ├── An Overview of NL2SQL360.svg ├── An Overview of the Decoding Strategies.svg ├── An Overview of the Encoding Strategies.svg ├── Dataset_timeline.svg ├── Examples of the nlsql Tasks and Its Challenges.svg ├── Model_Module_Overview.png ├── NL2SQL.jpg ├── NL2SQL_Guidance.svg ├── Statistics of Din-SQL Errors by Our Taxonomy.svg ├── The Differences Between PLM and LLM in NL2SQL task.svg ├── The Evolution of NL2SQL Solutions from the Perspective of Language Models.svg ├── nl2sql_lifecycle.svg ├── readme.md └── river.svg ├── chapter ├── Benchmark.md ├── Error_Analysis.md ├── Evaluation.md ├── Post_Processing.md ├── Pre_Processing.md └── Translation_method.md ├── report ├── ATIS │ └── report.json ├── Academic │ └── report.json ├── Advising │ └── report.json ├── AmbiQT │ └── report.json ├── Archer │ └── report.json ├── BIRD │ └── report.json ├── BULL │ └── report.json ├── BookSQL │ └── report.json ├── CHASE │ └── report.json ├── CSpider │ └── report.json ├── CoSpider │ └── report.json ├── DrSpider │ └── report.json ├── DuSQL │ └── report.json ├── FIBEN │ └── report.json ├── GeoQuery │ └── report.json ├── IMDB │ └── report.json ├── KaggleDBQA │ └── report.json ├── MIMICSQL │ └── report.json ├── MTTEQL │ └── report.json ├── PAUQ │ └── report.json ├── PortugueseSpider │ └── report.json ├── Restaurants │ └── report.json ├── SEDE │ └── report.json ├── SParC │ └── report.json ├── SQUALL │ └── report.json ├── Scholar │ └── report.json ├── ScienceBenchmark │ └── report.json ├── Spider │ └── report.json ├── SpiderDK │ └── report.json ├── SpiderRealistic │ └── report.json ├── SpiderSyn │ └── report.json ├── ViText2SQL │ └── report.json ├── WikiSQL │ └── report.json └── Yelp │ └── report.json ├── slides └── NL2SQL_handbook.pdf └── src └── dataset_analyze ├── __pycache__ ├── dataset.cpython-310.pyc ├── dataset.cpython-39.pyc ├── sql_parser.cpython-310.pyc ├── sql_parser.cpython-311.pyc ├── sql_parser.cpython-39.pyc ├── utils.cpython-310.pyc ├── utils.cpython-311.pyc └── utils.cpython-39.pyc ├── analyze.py ├── dataset.py ├── sql_parser.py └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | **/data 2 | **/__pycache__ -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |

Text-to-SQL Handbook

2 | 3 | From this repository, you can view the [latest advancements](#-nl2sql-survey--tutorial) in NL2SQL (Text-to-SQL). This handbook corresponds to our survey paper: [A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?](https://arxiv.org/abs/2408.05109). We also provide [tutorial slides](./slides/NL2SQL_handbook.pdf) to summarize the key points of this survey. Based on the trends in the development of language models, we have created a river diagram of NL2SQL methods to trace the evolution of the NL2SQL field. 4 | 5 | If you are a novice, don't worry: we have prepared a practical guide for you, covering a wide range of foundational materials [here](#-practical-guide-for-novice). We also summarize NL2SQL-related [applications](#-nl2sql-related-applications). 6 | 7 |

8 | 9 |

10 | 11 | ```bibtex 12 | @article{liu2024survey, 13 | title={A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?}, 14 | author={Liu, Xinyu and Shen, Shuyu and Li, Boyan and Ma, Peixian and Jiang, Runzhi and Zhang, Yuxin and Fan, Ju and Li, Guoliang and Tang, Nan and Luo, Yuyu}, 15 | journal={arXiv preprint arXiv:2408.05109}, 16 | year={2024} 17 | } 18 | ``` 19 | 20 | ## 🧭 NL2SQL Introduction 21 | Translating users' natural language queries (NL) into SQL queries can significantly reduce barriers to accessing relational databases and support various commercial applications. The performance of NL2SQL has been greatly improved with the emergence of language models (LMs). In this context, it is crucial to assess our current position, determine the NL2SQL solutions that should be adopted for specific scenarios by practitioners, and identify the research topics that researchers should explore next. 22 | 23 |

24 | 25 |

26 | 27 | ## 📈 NL2SQL Lifecycle 28 | 29 |

30 | 31 |

32 | 33 | + Model: NL2SQL translation techniques that tackle not only NL ambiguity and under-specification, but also properly map the NL query to the database schema and instances; 34 | 35 | + Data: From the collection of training data, data synthesis due to training data scarcity, to NL2SQL benchmarks; 36 | 37 | + Evaluation: Evaluating NL2SQL methods from multiple angles using different metrics and granularities; 38 | 39 | + Error Analysis: Analyzing NL2SQL errors to find the root cause and guiding NL2SQL models to evolve. 40 | 41 | ## 🤔 Where Are We? 42 | We categorize the challenges of NL2SQL into five levels, each addressing specific hurdles. The first three levels cover challenges that have been or are currently being addressed, reflecting the progressive development of NL2SQL. The fourth level represents the challenges we aim to tackle in the LLM stage, while the fifth level outlines our vision for NL2SQL systems in the next five years. 43 | 44 | We describe the evolution of NL2SQL solutions from the perspective of language models, categorizing it into four stages. 45 | For each stage of NL2SQL, we analyze the changes in target users and the extent to which challenges are addressed. 46 |

47 | 48 |

49 | 50 | 51 | ## 🧩 Module-based NL2SQL Methods 52 | We summarize the key modules of NL2SQL solutions 53 | utilizing the language model. 54 | + **Pre-processing** serves as an enhancement to the model’s inputs in the NL2SQL parsing process. You can get more details from this chapter: [Pre-Processing](chapter/Pre_Processing.md) 55 | + **NL2SQL translation methods** constitute the core of the NL2SQL solution, responsible for converting input natural language queries into SQL queries. You can get more details from this chapter: [NL2SQL Translation Methods](chapter/Translation_method.md) 56 | + **Post-processing** is a crucial step to refine the generated SQL queries, ensuring they meet user expectations more accurately. You can get more details from this chapter: [Post-Processing](chapter/Post_Processing.md) 57 |

58 | 59 |

60 | 61 | ## 📚 NL2SQL Survey & Tutorial 62 | 63 | 1. A Survey of NL2SQL with Large Language Models: Where are we, and where are we going? 64 | [](https://arxiv.org/abs/2408.05109) [](https://github.com/HKUSTDial/NL2SQL_Handbook) 65 | 1. Next-generation databas interfaces: A survey of llm-based text-to-sql. [](https://arxiv.org/abs/2406.08426) 66 | 1. Large Language Model Enhanced Text-to-SQL Generation: A Survey. 67 | [](https://arxiv.org/abs/2410.06011) 68 | 1. From Natural Language to SQL: Review of LLM-based Text-to-SQL Systems. 69 | [](https://arxiv.org/pdf/2410.01066) 70 | 1. A Survey on Employing Large Language Models for Text-to-SQL Tasks. 71 | [](https://arxiv.org/pdf/2407.15186) 72 | 1. Natural language interfaces for tabular data querying and visualization: A survey. 73 | [](https://arxiv.org/abs/2310.17894) 74 | 1. Natural Language Interfaces for Databases with Deep Learning. [](https://dl.acm.org/doi/10.1007/s00778-022-00776-8) 75 | 1. A survey on deep learning approaches for text-to-SQL. 76 | [](https://dl.acm.org/doi/10.1007/s00778-022-00776-8) 77 | 1. Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect. 78 | [](https://aclanthology.org/2022.coling-1.190/) 79 | 1. A Deep Dive into Deep Learning Approaches for Text-to-SQL Systems. 80 | [](https://dl.acm.org/doi/10.1145/3448016.3457543) 81 | 1. State of the Art and Open Challenges in Natural Language Interfaces to Data. 82 | [](https://dl.acm.org/doi/10.1145/3318464.3383128) 83 | 1. Natural language to SQL: Where are we today? [](https://www.vldb.org/pvldb/vol13/p1737-kim.pdf) 84 | 85 | ## 📰 NL2SQL Paper List 86 | 1. Alpha-SQL: Zero-Shot Text-to-SQL using Monte Carlo Tree Search 87 | [](https://arxiv.org/abs/2502.17248) [](https://alpha-sql-hkust.github.io/) 88 | 1. NL2SQL-BUGs: A Benchmark for Detecting Semantic Errors in NL2SQL Translation. [](https://arxiv.org/pdf/2503.11984) [](https://nl2sql-bugs.github.io/) 89 | 1. Sphinteract: Resolving Ambiguities in NL2SQL Through User Interaction. 90 | [](https://www.vldb.org/pvldb/vol18/p1145-zhao.pdf) [](https://github.com/ZhaoFuheng/Sphinteract) 91 | 1. OpenSearch-SQL: Enhancing Text-to-SQL with Dynamic Few-shot and Consistency Alignment. [](https://arxiv.org/pdf/2502.14913) [](https://github.com/OpenSearch-AI/OpenSearch-SQL) 92 | 1. Reliable Text-to-SQL with Adaptive Abstention. [](https://arxiv.org/abs/2501.10858) 93 | 1. SNAILS: Schema Naming Assessments for Improved LLM-Based SQL Inference. [](https://dl.acm.org/doi/10.1145/3709727) 94 | 1. Automated Validating and Fixing of Text-to-SQL Translation with Execution Consistency. [](https://ipads.se.sjtu.edu.cn/zh/publications/SQLDriller.pdf) 95 | 1. Grounding Natural Language to SQL Translation with Data-Based Self-Explanations. [](https://arxiv.org/abs/2411.02948) [](https://github.com/Kaimary/CycleSQL) 96 | 1. AID-SQL: Adaptive In-Context Learning of Text-to-SQL with Difficulty-Aware Instruction and Retrieval-Augmented Generation. [](https://www.computer.org/csdl/proceedings-article/icde/2025/360300d945/26FZCc99mg0) 97 | 1. CLEAR: A Parser-Independent Disambiguation Framework for NL2SQL. 98 | [](https://www.computer.org/csdl/proceedings-article/icde/2025/360300d302/26FZBD2hBJe) []() 99 | 1. CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL. 100 | [](https://arxiv.org/pdf/2410.01943v1) 101 | 1. Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows. 
102 | [](https://arxiv.org/abs/2411.07763) [](https://github.com/xlang-ai/Spider2) 103 | 1. ROUTE: ROBUST MULTITASK TUNING AND COLLAB- 104 | ORATION FOR TEXT-TO-SQL. [](https://arxiv.org/pdf/2412.10138) 105 | 1. Confidence Estimation for Error Detection in Text-to-SQL Systems. [](https://arxiv.org/abs/2501.09527) 106 | 1. SQLord: A Robust Enterprise Text-to-SQL Solution via Reverse Data Generation and Workflow Decomposition. [](https://dl.acm.org/doi/pdf/10.1145/3701716.3715541) 107 | 1. DBCopilot: Scaling Natural Language Querying to Massive Databases. [](https://arxiv.org/abs/2312.03463) [](https://github.com/tshu-w/DBCopilot) 108 | 1. Boosting Text-to-SOL through Multi- grained Error Identification. 109 | [](https://aclanthology.org/2025.coling-main.289.pdf) 110 | 1. Gen-SQL: Efficient Text-to-SQL By Bridging Natural Language Question And Database Schema With Pseudo-Schema. 111 | [](https://aclanthology.org/2025.coling-main.256/) 112 | 1. Utilising Large Language Models for Adversarial Attacks in Text-to-SQL: A Perpetrator and Victim Approach 113 | [](https://arxiv.org/pdf/2502.20657) 114 | [](https://github.com/XGenerationLab/XiYan-DBDescGen) 115 | 1. You Only Read Once (YORO): Learning to Internalize Database Knowledge for Text-to-SQL. 116 | [](https://arxiv.org/abs/2409.12172) [](https://sig4kg.github.io/archer-bench/) 117 | 1. EllieSQL: Cost-Efficient Text-to-SQL with Complexity-Aware Routing. [](https://arxiv.org/abs/2503.22402) [](https://elliesql.github.io/) 118 | 1. Reward-SQL: Boosting Text-to-SQL via Stepwise Reasoning and Process-Supervised Rewards. [](https://arxiv.org/pdf/2505.04671) [](https://github.com/ruc-datalab/RewardSQL) 119 | 1. SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning. [](https://arxiv.org/pdf/2504.08600) 120 | 1. Is Long Context AIl You Need? Leveraging LLM's ExtendedContext for NL2SQL. 121 | [](https://arxiv.org/abs/2501.12372) 122 | 1. SQLForge: Synthesizing Reliable and Diverse Data to Enhance 123 | Text-to-SQL Reasoning in LLMs. 124 | [](https://arxiv.org/pdf/2505.13725) 125 | 1. Think2SQL: Reinforce LLM Reasoning Capabilities for Text2SQL. [](https://arxiv.org/pdf/2504.15077) 126 | 1. Distill-C: Enhanced NL2SQL via Distilled Customization with LLMs. [](https://arxiv.org/pdf/2504.00048) 127 | 1. Reasoning-SQL: Reinforcement Learning with SQL Tailored Partial Rewards for Reasoning-Enhanced Text-to-SQL. [](https://arxiv.org/pdf/2503.23157) 128 | 1. OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale. [](https://arxiv.org/pdf/2503.02240) [](https://github.com/RUCKBReasoning/OmniSQL) 129 | 1. SQL-Factory: A Multi-Agent Framework for High-Quality and Large-Scale SQL Generation. [](https://arxiv.org/pdf/2504.14837) 130 | 1. Text2SQL is Not Enough: Unifying AI and Databases with TAG. [](https://arxiv.org/pdf/2408.14717) [](https://github.com/TAG-Research/TAG-Bench) 131 | 1. Automatic database description generation for Text-to-SQL. 132 | [](https://arxiv.org/pdf/2502.20657) 133 | [](https://github.com/XGenerationLab/XiYan-DBDescGen) 134 | 1. MCTS-SQL: An Effective Framework for Text-to-SQL with Monte Carlo Tree Search. 135 | [](https://arxiv.org/abs/2501.16607) 136 | 1. SQL-o1: A Self-Reward Heuristic Dynamic Search Method for Text-to-SQL. [](https://arxiv.org/pdf/2502.11741) 137 | 1. FEATHER-SQL: A LIGHTWEIGHT NL2SQL FRAME-WORK WITH DUAL-MODEL COLLABORATION PARADIGM FOR SMALL LANGUAGE MODELS. 138 | [](https://arxiv.org/pdf/2503.17811) 139 | 1. 
FI-NL2PY2SQL: Financial Industry NL2SQL Innovation Model Based on Python and Large Language Model. 140 | [](https://www.mdpi.com/1999-5903/17/1/12) 141 | 1. FGCSQL: A Three-Stage Pipeline for Large Language Model-Driven Chinese Text-to-SQL 142 | [](https://www.mdpi.com/2079-9292/14/6/1214) 143 | 1. Transforming Medical Data Access: The Role and Challenges of Recent Language Models in SQL Query Automation. [](https://www.mdpi.com/1999-4893/18/3/124) 144 | 1. The Dawn of Natural Language to SQL: Are We Fully Ready? 145 | [](https://arxiv.org/abs/2406.01265) [](https://github.com/HKUSTDial/NL2SQL360) 146 | 1. Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation. 147 | [](https://arxiv.org/abs/2308.15363) [](https://github.com/BeachWang/DAIL-SQL) 148 | 1. Interleaving Pre-Trained Language Models and Large Language Models for Zero-Shot NL2SQL Generation. 149 | [](https://arxiv.org/abs/2306.08891) [](https://github.com/ruc-datalab/ZeroNL2SQL) 150 | 1. Generating Succinct Descriptions of Database Schemata for Cost-Efficient Prompting of Large Language Models. 151 | [](https://dl.acm.org/doi/abs/10.14778/3681954.3682017) [](https://github.com/itrummer/schemacompression) 152 | 1. ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems. [](https://arxiv.org/abs/2306.04743) [](https://sciencebenchmark.cloudlab.zhaw.ch/) 153 | 1. CodeS: Towards Building Open-source Language Models for Text-to-SQL. 154 | [](https://arxiv.org/abs/2402.16347) [](https://github.com/RUCKBReasoning/codes) 155 | 1. FinSQL: Model-Agnostic LLMs-based Text-to-SQL Framework for Financial Analysis. 156 | [](https://arxiv.org/abs/2401.10506) [](https://github.com/bigbigwatermalon/FinSQL) 157 | 1. PURPLE: Making a Large Language Model a Better SQL Writer. 158 | [](https://arxiv.org/abs/2403.20014) [](https://github.com/httdty/purple) 159 | 1. METASQL: A Generate-then-Rank Framework for Natural Language to SQL Translation. 160 | [](https://arxiv.org/abs/2402.17144) [](https://github.com/Kaimary/MetaSQL) 161 | 1. Archer: A Human-Labeled Text-to-SQL Dataset with Arithmetic, Commonsense and Hypothetical Reasoning. 162 | [](https://aclanthology.org/2024.eacl-long.6/) [](https://sig4kg.github.io/archer-bench/) 163 | 1. Synthesizing Text-to-SQL Data from Weak and Strong LLMs. 164 | [](https://arxiv.org/pdf/2408.03256) [](https://github.com/Yangjiaxi/Sense) 165 | 1. Understanding the Effects of Noise in Text-to-SQL: An Examination of the BIRD-Bench Benchmark. 166 | [](https://arxiv.org/pdf/2402.12243) [](https://github.com/niklaswretblad/the-effects-of-noise-in-text-to-SQL) 167 | 1. I Need Help! Evaluating LLM’s Ability to Ask for Users’ Support: A Case Study on Text-to-SQL Generation. 168 | [](https://arxiv.org/pdf/2407.14767) [](https://github.com/appier-research/i-need-help) 169 | 1. PTD-SQL: Partitioning and Targeted Drilling with LLMs in Text-to-SQL. 170 | [](https://arxiv.org/pdf/2409.14082) [](https://github.com/lrlbbzl/PTD-SQL) 171 | 1. Improving Retrieval-augmented Text-to-SQL with AST-based Ranking and Schema Pruning. 172 | [](https://arxiv.org/pdf/2407.03227) 173 | 1. Data-Centric Text-to-SQL with Large Language Models. 174 | [](https://openreview.net/pdf?id=gDKIjZcg93) 175 | 1. Research and Practice on Database Interaction Based on Natural Language Processing 176 | [](https://arxiv.org/abs/2310.17894) 177 | 1. XiYan-SQL: A Multi-Generator Ensemble Framework for Text-to-SQL. 178 | [](https://arxiv.org/pdf/2411.08599) 179 | 1. 
Structure Guided Large Language Model for SQL Generation. 180 | [](https://arxiv.org/pdf/2402.13284) 181 | 1. A Plug-and-Play Natural Language Rewriter for Natural Language to SQL. 182 | [](https://arxiv.org/pdf/2412.17068) 183 | 1. RSL-SQL: Robust Schema Linking in Text-to-SQL Generation. 184 | [](https://arxiv.org/abs/2403.15879) [](https://github.com/glee4810/TrustSQL) 185 | 1. In-Context Reinforcement Learning based Retrieval-Augmented Generation for Text-to-SQL. 186 | [](https://assets.amazon.science/09/f4/493c574346f895bbb0303282a501/in-context-reinforcement-learning-based-retrieval-augmented-generation-for-text-to-sql.pdf) 187 | 1. TrustSQL: Benchmarking Text-to-SQL Reliability with Penalty-Based Scoring. 188 | [](https://arxiv.org/pdf/2411.00073) [](https://github.com/Laqcce-cao/RSL-SQL) 189 | 1. LAIA-SQL: Enhancing Natural Language to SQL Generation in Multi-Table QA via Task Decomposition and Keyword Extraction 190 | [](https://openreview.net/pdf?id=WYdpjwKQma) 191 | 1. Research on Large Model Text-to-SQL Optimization Method for Intelligent Interaction in the Field of Construction Safety. 192 | [](https://ieeexplore.ieee.org/abstract/document/10810146) 193 | 1. SQLh-GEN: Bridging the Dialect Gap for Text-to-SQL Via Synthetic Data And Model Merging. [](https://arxiv.org/pdf/2408.12733v2) 194 | 1. Grounding Natural Language to SQL Translation with Data-Based Self-Explanations. 195 | [](https://arxiv.org/pdf/2411.02948) [](https://github.com/Kaimary/CycleSQL) 196 | 1. Towards Optimizing SQL Generation via LLM Routing. 197 | [](https://arxiv.org/abs/2411.04319) 198 | 1. E-SQL: Direct Schema Linking via Question Enrichment in Text-to-SQL. 199 | [](https://arxiv.org/abs/2409.16751) [](https://github.com/HasanAlpCaferoglu/E-SQL) 200 | 1. DB-GPT: Empowering Database Interactions with Private Large Language Models. 201 | [](https://arxiv.org/abs/2312.17449) [](https://github.com/eosphoros-ai/DB-GPT) 202 | 1. The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models. 203 | [](https://arxiv.org/pdf/2408.07702) 204 | 1. CHESS: Contextual Harnessing for Efficient SQL Synthesis. 205 | [](https://arxiv.org/abs/2405.16755) [](https://github.com/ShayanTalaei/CHESS) 206 | 1. PET-SQL: A Prompt-Enhanced Two-Round Refinement of Text-to-SQL with Cross-consistency. 207 | [](https://arxiv.org/abs/2403.09732) [](https://github.com/ruc-datalab/ZeroNL2SQL) 208 | 1. CoE-SQL: In-Context Learning for Multi-Turn Text-to-SQL with Chain-of-Editions. 209 | [](https://arxiv.org/abs/2405.02712) [](https://github.com/X-LANCE/text2sql-multiturn-GPT) 210 | 1. AMBROSIA: A Benchmark for Parsing Ambiguous Questions into Database Queries. 211 | [](https://arxiv.org/abs/2406.19073) [](https://ambrosia-benchmark.github.io/) 212 | 1. Text-to-SQL Calibration: No Need to Ask—Just Rescale Model Probabilities. 213 | [](https://arxiv.org/pdf/2411.16742) 214 | 1. Few-shot Text-to-SQL Translation using Structure and Content Prompt Learning. 215 | [](https://dl.acm.org/doi/abs/10.1145/3589292) [](https://github.com/ruc-datalab/SC-prompt) 216 | 1. CatSQL: Towards Real World Natural Language to SQL Applications. 217 | [](https://www.vldb.org/pvldb/vol16/p1534-fu.pdf) [](https://github.com/asfuhan/CatSQL) 218 | 1. DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction. 219 | [](https://arxiv.org/abs/2304.11015) [](https://github.com/MohammadrezaPourreza/Few-shot-NL2SQL-with-prompting/tree/main) 220 | 1. Data Ambiguity Strikes Back: How Documentation Improves GPT's Text-to-SQL. 
221 | [](https://openreview.net/pdf?id=FflKTuIRTD) 222 | 1. ACT-SQL: In-Context Learning for Text-to-SQL with Automatically-Generated Chain-of-Thought. 223 | [](https://arxiv.org/abs/2310.17342) [](https://github.com/X-LANCE/text2sql-GPT) 224 | 1. Selective Demonstrations for Cross-domain Text-to-SQL. 225 | [](https://arxiv.org/abs/2310.06302) [](https://github.com/shuaichenchang/ODIS-Text-to-SQL) 226 | 1. RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL. 227 | [](https://arxiv.org/abs/2302.05965) [](https://github.com/RUCKBReasoning/RESDSQL) 228 | 1. Graphix-T5: Mixing Pre-trained Transformers with Graph-Aware Layers for Text-to-SQL Parsing. 229 | [](https://arxiv.org/abs/2301.07507) [](https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/graphix) 230 | 1. Improving Generalization in Language Model-based Text-to-SQL Semantic Parsing: Two Simple Semantic Boundary-based Techniques. 231 | [](https://virtual2023.aclweb.org/paper_P4350.html) [](https://github.com/Dakingrai/ood-generalization-semantic-boundary-techniques) 232 | 1. G3R: A Graph-Guided Generate-and-Rerank Framework for Complex and Cross-domain Text-to-SQL Generation. 233 | [](https://aclanthology.org/2023.findings-acl.23/) 234 | 1. Importance of Synthesizing High-quality Data for Text-to-SQL Parsing. 235 | [](https://aclanthology.org/2023.findings-acl.86.pdf) 236 | 1. Know What I don’t Know: Handling Ambiguous and Unknown Questions for Text-to-SQL. 237 | [](https://aclanthology.org/2023.findings-acl.352/) [](https://github.com/wbbeyourself/DTE) 238 | 1. C3: Zero-shot Text-to-SQL with ChatGPT 239 | [](https://arxiv.org/abs/2307.07306) [](https://github.com/bigbigwatermalon/C3SQL) 240 | 1. MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL. 241 | [](https://arxiv.org/abs/2312.11242) [](https://github.com/wbbeyourself/MAC-SQL) 242 | 1. SQLformer: Deep Auto-Regressive Query Graph Generation for Text-to-SQL Translation. 243 | [](https://arxiv.org/abs/2310.18376) [](https://github.com/AdrianBZG/SQLformer) 244 | 245 | ## 📊 NL2SQL Benchmark 246 | We create a timeline of the benchmark's development and mark relevant milestones. You can get more details from this chapter: [📊 Benchmark](chapter/Benchmark.md) 247 |

248 | 249 |

250 | 251 | ## 🎯 Where Are We Going? 252 | 253 | * 🎯Solve Open NL2SQL Problem 254 | * 🎯Develop Cost-effective NL2SQL Methods 255 | * 🎯Make NL2SQL Solutions Trustworthy 256 | * 🎯NL2SQL with Ambiguous and Unspecified NL Queries 257 | * 🎯Adaptive Training Data Synthesis 258 | 259 | ## 📖 Catalog for Our Survey 260 | You can get more information from our subsection. We introduce representative papers on related concepts: 261 | * [Pre-Processing](chapter/Pre_Processing.md) 262 | * [NL2SQL Translation Methods](chapter/Translation_method.md) 263 | * [Post-Processing](chapter/Post_Processing.md) 264 | * [Benchmark](chapter/Benchmark.md) 265 | * [Evaluation](chapter/Evaluation.md) 266 | * [Error Analysis](chapter/Error_Analysis.md) 267 | 268 | ## 💾 Practical Guide for Novice 269 | 270 | ### 📊 How to get data: 271 | * We collect NL2SQL benchmark features and download links for you. You can get more details from this chapter: [Benchmark](chapter/Benchmark.md) 272 | * The analysis code for benchmarks is available in the `src/dataset_analysis` directory. Benchmark analysis reports can be found in the `report/` directory. 273 | 274 | ### 🛠️ How to build an LLM-based NL2SQL model: 275 | 276 | * Litgpt [Repository Link](https://github.com/Lightning-AI/litgpt) 277 | 278 | This repository offers access to over 20 high-performance large language models (LLMs) with comprehensive guides for pretraining, fine-tuning, and deploying at scale. It is designed to be beginner-friendly with from-scratch implementations and no complex abstractions. 279 | 280 | * LLaMA-Factory [Repository Link](https://github.com/hiyouga/LLaMA-Factory) 281 | Unified Efficient Fine-Tuning of 100+ LLMs. Integrating various models with scalable training resources, advanced algorithms, practical tricks, and comprehensive experiment monitoring tools, this setup enables efficient and faster inference through optimized APIs and UIs. 282 | 283 | * Fine-tuning and In-Context learning for BIRD-SQL benchmark [Repository Link](https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/bird#fine-tuning-ft) 284 | 285 | A tutorial for both Fine-tuning and In-Context Learning is provided by the BIRD-SQL benchmark. 286 | 287 | ### 🔎How to evaluate your model: 288 | 289 | We collect NL2SQL evaluation metrics for you. You can get more details from this chapter: [Evaluation](chapter/Evaluation.md) 290 | 291 | * NLSQL360 [Repository Link](https://github.com/HKUSTDial/NL2SQL360) 292 | 293 | NL2SQL360 is a testbed for fine-grained evaluation of NL2SQL solutions. Our testbed integrates existing NL2SQL benchmarks, a repository of NL2SQL models, and various evaluation metrics, which aims to provide an intuitive and user-friendly platform to enable both standard and customized performance evaluations. 294 | 295 | * Test-suite-sql-eval [Repository Link](https://github.com/taoyds/test-suite-sql-eval) 296 | 297 | This repo contains a test suite evaluation metric for 11 text-to-SQL tasks. It is now the official metric of [Spider](https://yale-lily.github.io/spider), [SParC](https://yale-lily.github.io/sparc), and [CoSQL](https://yale-lily.github.io/cosql), and is also now available for Academic, ATIS, Advising, Geography, IMDB, Restaurants, Scholar, and Yelp (building on the amazing work by [Catherine and Jonathan](https://github.com/jkkummerfeld/text2sql-data)). 298 | 299 | * BIRD-SQL-Official [Repository Link](https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/bird#evaluation) 300 | 301 | It is now the official tool of [BIRD-SQL](https://bird-bench.github.io/). 
It is the first tool to propose the Valid Efficiency Score (VES) metric, and it provides the official test suite for the BIRD benchmark. 302 | 303 | 304 | ### 🗺️ Roadmap and Decision Flow 305 | 306 | The roadmap and decision flow below can give you some inspiration when choosing an NL2SQL solution for your scenario. 307 |

308 | 309 |
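As a quick complement to the evaluation toolkits listed under "How to evaluate your model", the snippet below shows the core idea behind Execution Accuracy (EX): run the gold and the predicted SQL against the same database and compare their result sets. It is a simplified sketch over a made-up SQLite table (it ignores row order and duplicates) and is not the official Spider or BIRD evaluation code.

```python
import sqlite3

def execution_match(conn, gold_sql, pred_sql):
    """Rough execution-accuracy check: do both queries return the same result set?"""
    try:
        gold_rows = set(conn.execute(gold_sql).fetchall())
        pred_rows = set(conn.execute(pred_sql).fetchall())
    except sqlite3.Error:
        return False  # an unexecutable prediction counts as incorrect
    return gold_rows == pred_rows

# Toy database purely for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, gpa REAL)")
conn.executemany("INSERT INTO students VALUES (?, ?)", [("Ann", 3.9), ("Bob", 2.7)])

gold = "SELECT name FROM students WHERE gpa > 3.5"
pred = "SELECT name FROM students WHERE gpa >= 3.6"
print(execution_match(conn, gold, pred))  # True here, even though the SQL strings differ
```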

310 | 311 | ## 📱 NL2SQL Related Applications: 312 | 313 | * Chat2DB: AI-driven database tool and SQL client, The hottest GUI client, supporting MySQL, Oracle, PostgreSQL, DB2, SQL Server, DB2, SQLite, H2, ClickHouse, and more. [](https://github.com/codePhiliaX/Chat2DB) [](https://chat2db-ai.com/zh-CN) 314 | * DB-GPT: AI Native Data App Development framework with AWEL(Agentic Workflow Expression Language) and Agents. [](https://github.com/eosphoros-ai/DB-GPT) 315 | * Postgres.new: In-browser Postgres sandbox with AI assistance. [](https://github.com/supabase-community/postgres-new/tree/main) [](https://postgres.new/) 316 | -------------------------------------------------------------------------------- /assets/Model_Module_Overview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/HKUSTDial/NL2SQL_Handbook/a9b8acbf1fa98d5a31b3f50e18d6b2ec988b6943/assets/Model_Module_Overview.png -------------------------------------------------------------------------------- /assets/NL2SQL.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/HKUSTDial/NL2SQL_Handbook/a9b8acbf1fa98d5a31b3f50e18d6b2ec988b6943/assets/NL2SQL.jpg -------------------------------------------------------------------------------- /assets/The Differences Between PLM and LLM in NL2SQL task.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | -------------------------------------------------------------------------------- /assets/readme.md: -------------------------------------------------------------------------------- 1 | Figures or other source 2 | -------------------------------------------------------------------------------- /chapter/Benchmark.md: -------------------------------------------------------------------------------- 1 | ## Benchmarks 2 | 3 | | Benchmark | Year | Language | Domain Type | Turn Type | Collection | Paper Link | Download Link | 4 | | ---------------- | ---- | --------------- | ------------- | --------- | ------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | 5 | | ATIS | 1994 | English | Single-domain | Single | Hand-crafted | [Paper](http://aclweb.org/anthology/P18-1033) | [Download](https://github.com/jkkummerfeld/text2sql-data/tree/master) | 6 | | GeoQuery | 1996 | English | Single-domain | Single | Hand-crafted | [Paper](http://dl.acm.org/citation.cfm?id=1864519.1864543) | [Download](https://github.com/jkkummerfeld/text2sql-data/tree/master) | 7 | | Restaurants | 2003 | English | Single-domain | Single | Hand-crafted | [Paper](http://doi.acm.org/10.1145/604045.604070) | [Download](https://github.com/jkkummerfeld/text2sql-data/tree/master) | 8 | | Academic | 2014 | English | Single-domain | Single | Hand-crafted | [Paper](http://dx.doi.org/10.14778/2735461.2735468) | [Download](https://github.com/jkkummerfeld/text2sql-data/tree/master) | 9 | | IMDb | 2017 | English | Single-domain | Single | Hand-crafted | [Paper](http://doi.org/10.1145/3133887) | [Download](https://github.com/jkkummerfeld/text2sql-data/tree/master) | 10 | | Yelp | 2017 | English | Single-domain | Single | Hand-crafted | [Paper](http://doi.org/10.1145/3133887) | [Download](https://github.com/jkkummerfeld/text2sql-data/tree/master) | 11 | | Scholar | 2017 | English | 
Single-domain | Single | Hand-crafted | [Paper](http://www.aclweb.org/anthology/P17-1089) | [Download](https://github.com/jkkummerfeld/text2sql-data/tree/master) | 12 | | WikiSQL | 2017 | English | Cross-domain | Single | Hand-crafted | [Paper](https://arxiv.org/abs/1709.00103) | [Download](https://github.com/salesforce/WikiSQL) | 13 | | Advising | 2018 | English | Single-domain | Single | Hand-crafted | [Paper](https://arxiv.org/abs/1709.00103) | [Download](https://github.com/jkkummerfeld/text2sql-data/tree/master) | 14 | | Spider | 2018 | English | Cross-domain | Single | Hand-crafted | [Paper](http://aclweb.org/anthology/D18-1425) | [Download](https://yale-lily.github.io/spider) | 15 | | SParC | 2019 | English | Cross-domain | Multiple | Hand-crafted | [Paper](https://arxiv.org/abs/1906.02285) | [Download](https://yale-lily.github.io/sparc) | 16 | | CoSQL | 2019 | English | Cross-domain | Multiple | Hand-crafted | [Paper](https://arxiv.org/abs/1909.05378) | [Download](https://yale-lily.github.io/cosql) | 17 | | CSpider | 2019 | Chinese | Cross-domain | Single | Hand-crafted | [Paper](https://arxiv.org/abs/1909.13293) | [Download](https://taolusi.github.io/CSpider-explorer/) | 18 | | MIMICSQL | 2020 | English | Single-domain | Single | Auto-generated + Mannual | [Paper](https://dmkd.cs.vt.edu/papers/WWW20.pdf) | [Download](https://github.com/wangpinggl/TREQS) | 19 | | SQUALL | 2020 | English | Cross-domain | Single | Hand-crafted | [Paper](https://arxiv.org/abs/2010.11246) | [Download](https://github.com/tzshi/squall) | 20 | | FIBEN | 2020 | English | Single-domain | Single | Hand-crafted | [Paper](https://www.vldb.org/pvldb/vol13/p2747-sen.pdf) | [Download](https://github.com/IBM/fiben-benchmark/tree/master) | 21 | | ViText2SQL | 2020 | Vietnamese | Cross-domain | Single | Hand-crafted | [Paper](https://aclanthology.org/2020.findings-emnlp.364/) | [Download](https://github.com/VinAIResearch/ViText2SQL/tree/master) | 22 | | DuSQL | 2020 | Chinese | Cross-domain | Single | Auto-generated + Mannual | [Paper](https://aclanthology.org/2020.emnlp-main.562/) | [Download](https://github.com/DejianYang/DuSQL-1) | 23 | | PortugueseSpider | 2021 | Portuguese | Cross-domain | Single | Auto-generated + Mannual | [Paper](https://arxiv.org/abs/2110.03546) | - | 24 | | CHASE | 2021 | Chinese | Cross-domain | Multiple | Hand-crafted | [Paper](https://aclanthology.org/2021.acl-long.180/) | [Download](https://github.com/xjtu-intsoft/chase) | 25 | | Spider-Syn | 2021 | English | Cross-domain | Single | Hand-crafted | [Paper](https://arxiv.org/abs/2106.01065) | [Download](https://github.com/ygan/Spider-Syn) | 26 | | Spider-DK | 2021 | English | Cross-domain | Single | Hand-crafted | [Paper](https://arxiv.org/abs/2109.05157) | [Download](https://github.com/ygan/spider-dk) | 27 | | Spider-Realistic | 2021 | English | Cross-domain | Single | Hand-crafted | [Paper](https://arxiv.org/pdf/2010.12773v3) | [Download](https://zenodo.org/records/5205322) | 28 | | KaggleDBQA | 2021 | English | Cross-domain | Single | Hand-crafted | [Paper](https://aclanthology.org/2021.acl-long.176/) | [Download](https://github.com/Chia-Hsuan-Lee/KaggleDBQA) | 29 | | SEDE | 2021 | English | Single-domain | Single | Hand-crafted | [Paper](https://arxiv.org/abs/2106.05006) | [Download](https://github.com/hirupert/sede) | 30 | | MT-TEQL | 2021 | English | Cross-domain | Single | Auto-generated | [Paper](https://dl.acm.org/doi/abs/10.14778/3494124.3494139) | [Download](https://github.com/MTTeql/MT-Teql) | 31 | | PAUQ | 2022 | Russian | 
Cross-domain | Single | Hand-crafted | [Paper](https://aclanthology.org/2022.findings-emnlp.175.pdf) | [Download](https://github.com/ai-spiderweb/pauq/tree/main?tab=readme-ov-file) | 32 | | knowSQL | 2022 | Chinese | Cross-domain | Single | Hand-crafted | [Paper](https://arxiv.org/abs/2301.01067) | - | 33 | | Dr.Spider | 2023 | English | Cross-domain | Single | Auto-generated + Mannual | [Paper](https://arxiv.org/pdf/2301.08881.pdf) | [Download](https://github.com/awslabs/diagnostic-robustness-text-to-sql/tree/main) | 34 | | BIRD | 2023 | English | Cross-domain | Single | Hand-crafted | [Paper](https://arxiv.org/pdf/2305.03111) | [Download](https://bird-bench.github.io/) | 35 | | AmbiQT | 2023 | English | Cross-domain | Single | ChatGPT-aided + Mannual | [Paper](https://arxiv.org/abs/2310.13659) | [Download](https://github.com/testzer0/ambiqt) | 36 | | ScienceBenchmark | 2024 | English | Single-domain | Single | Auto-generated + Mannual | [Paper](https://arxiv.org/abs/2306.04743) | [Download](https://github.com/ckosten/sciencebenchmark_dataset) | 37 | | BULL | 2024 | English/Chinese | Single-domain | Single | Hand-crafted | [Paper](https://arxiv.org/abs/2401.10506v1) | [Download](https://github.com/bigbigwatermalon/FinSQL) | 38 | | BookSQL | 2024 | English | Single-domain | Single | Hand-crafted | [Paper](https://arxiv.org/abs/2406.07860) | [Download](https://github.com/Exploration-Lab/BookSQL) | 39 | | Archer | 2024 | English/Chinese | Cross-domain | Single | Hand-crafted | [Paper](https://aclanthology.org/2024.eacl-long.6/) | [Download](https://sig4kg.github.io/archer-bench/) | 40 | 41 | -------------------------------------------------------------------------------- /chapter/Error_Analysis.md: -------------------------------------------------------------------------------- 1 | ## Error Analysis 2 | 3 | ### Our Taxonomy for NL2SQL Errors Analysis. 4 | We propose the following principles to guide the development of this taxonomy: 5 | 6 | * **Comprehensiveness**: The taxonomy should encompass all potential errors that could occur during the NL2SQL conversion process. 7 | * **Mutual Exclusivity**: Each error type should be clearly distinct with no overlap, to avoid ambiguity in error classification. 8 | * **Extensibility**: The taxonomy should be adaptable to incorporate new error types as NL2SQL technologies and methodologies evolve. 9 | * **Practicality**: The taxonomy should be practical and applicable in real-world settings, aiding developers in diagnosing and correcting errors effectively. 10 | 11 | Following these principles, we attempted to design a taxonomy containing two levels: 12 | 13 | * **Error Localization**: This level focuses on identifying the specific parts of the SQL where errors occur, such as in the `SELECT` clause. It is vital for precisely locating where misunderstandings or misinterpretations arise, thereby facilitating targeted corrections. 14 | * **Cause of Error**: This level focuses on understanding why the model is wrong when generating SQL. For example, value errors in the `WHERE` clause may indicate the model's insufficient ability to understand and retrieve database content. On the other hand, conditional errors in the `WHERE` clause typically reveal flaws in semantic understanding, where the model fails to grasp the logical requirements of the query. 
15 | 16 | ### A Case Study of Error Analysis 17 | 18 | We collected the errors generated by [DIN-SQL](https://arxiv.org/abs/2304.11015) on the [Spider](https://yale-lily.github.io/spider) dataset and manually classified them according to the taxonomy we designed. 19 | 20 |

21 | 22 |
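To make the two-level taxonomy more concrete, the snippet below sketches one hypothetical way an annotated error could be recorded, together with a naive clause-level comparison for the error-localization level. The clause splitting is deliberately crude (keyword-based regular expressions) and the cause label is illustrative; this is not the annotation tooling used in the case study above.

```python
import re
from dataclasses import dataclass

CLAUSE_KEYWORDS = ["SELECT", "FROM", "WHERE", "GROUP BY", "HAVING", "ORDER BY", "LIMIT"]

def split_clauses(sql):
    """Naive clause splitter: maps each SQL keyword to the text that follows it."""
    pattern = "(" + "|".join(CLAUSE_KEYWORDS) + ")"
    parts = re.split(pattern, sql, flags=re.IGNORECASE)
    return {parts[i].upper(): parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}

def localize_error(gold_sql, pred_sql):
    """Level 1 (error localization): which clauses differ between gold and predicted SQL."""
    gold, pred = split_clauses(gold_sql), split_clauses(pred_sql)
    return sorted(kw for kw in set(gold) | set(pred) if gold.get(kw) != pred.get(kw))

@dataclass
class AnnotatedError:
    location: list  # level 1: where the error occurs (SQL clauses)
    cause: str      # level 2: why the model got it wrong (illustrative label)

gold = "SELECT name FROM singer WHERE age > 30"
pred = "SELECT name FROM singer WHERE age > '30'"
error = AnnotatedError(localize_error(gold, pred), "value error in WHERE (wrong literal type)")
print(error)  # AnnotatedError(location=['WHERE'], cause='value error in WHERE (wrong literal type)')
```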

23 | 24 | -------------------------------------------------------------------------------- /chapter/Evaluation.md: -------------------------------------------------------------------------------- 1 | # Evaluation 2 | 3 | ## Evaluation Metric 4 | 5 | - **Execution Accuracy (EX)** 6 | - **Description:** Execution Accuracy (EX) evaluates the performance of the NL2SQL system by comparing whether the execution result sets of the ground-truth SQL queries and the predicted SQL queries are identical. 7 | - [Paper Link](https://arxiv.org/abs/1809.08887) 8 | - **String-Match Accuracy (SM)** 9 | - **Description:** String-Match Accuracy (SM) (also called Logical Form Accuracy) simply compares whether the ground-truth SQL query and the predicted SQL query are identical as strings. It may penalize SQL queries that produce the correct execution result sets but do not have the exact string match with the ground-truth SQL queries. 10 | - [Paper Link](https://arxiv.org/abs/1709.00103) 11 | - **Component-Match Accuracy (CM)** 12 | - **Description:** Component-Match Accuracy (CM) evaluates the detailed performance of the NL2SQL system by measuring the exact matching of different SQL components such as `SELECT`, `WHERE` and others between the ground-truth SQL query and the predicted SQL query. 13 | - [Paper Link](https://arxiv.org/abs/1809.08887) 14 | - **Exact-Match Accuracy (EM)** 15 | - **Description:** Exact-Match Accuracy is based on the Component-Match Accuracy (CM) and measures whether all SQL components of the predicted SQL query match the ground-truth SQL query. 16 | - [Paper Link](https://arxiv.org/abs/1809.08887) 17 | - **Valid Efficiency Score (VES)** 18 | - **Description:** Valid Efficiency Score (VES) measures the execution efficiency of valid SQL queries. It considers both the accuracy and efficiency of SQL execution. 19 | - [Paper Link](https://arxiv.org/pdf/2305.03111) 20 | - **Query Variance Testing (QVT)** 21 | - **Description:** Query Variance Testing (QVT) measures the robustness and flexibility of the NL2SQL system in handling variations in NL questions. 22 | - [Paper Link](https://arxiv.org/abs/2406.01265) 23 | 24 | ## Evaluation Toolkit 25 | 26 | - **NL2SQL360** 27 | - **Description:** **NL2SQL360** is a testbed for fine-grained evaluation of NL2SQL solutions. The testbed integrates existing NL2SQL benchmarks, a repository of NL2SQL models, and various evaluation metrics, which aims to provide an intuitive and user-friendly platform to enable both standard and customized performance evaluations. Users can utilize **NL2SQL360** to assess different NL2SQL methods against established benchmarks or tailor their evaluations based on specific criteria. This flexibility allows for testing solutions in specific data domains or analyzing performance on different characteristics of SQL queries. 28 | - [Paper Link](https://arxiv.org/abs/2406.01265) 29 | - [Repository Link](https://github.com/HKUSTDial/NL2SQL360) 30 |

31 | 32 |

33 | - **MT-TEQL** 34 | - **Description:** **MT-TEQL** is unified framework for evaluating the performance of NL2SQL systems in handling real-world variations in NL questions and database schemas. It is based on a meta-morphic testing approach, implementing semantic-preserving transformations of NL questions and database schemas to automatically generate their variants without manual efforts. 35 | - [Paper Link](https://www.vldb.org/pvldb/vol15/p569-ma.pdf) 36 | - [Repository Link](https://github.com/MTTeql/MT-Teql) -------------------------------------------------------------------------------- /chapter/Post_Processing.md: -------------------------------------------------------------------------------- 1 | ## Post-Processing 2 | Post-processing is a crucial step to refine the generated SQL queries, ensuring they meet user expectations more accurately. This involves enhancing the initial SQL output using various strategies. 3 | ### SQL Correction Strategies: 4 | #### 🎓Basic concept: 5 | SQL correction strategies are designed to prompt language models to identify and fix syntax errors in SQL queries generated by NLP models. These strategies involve guiding models to refine their outputs by addressing issues such as missing or redundant keywords and incorrect predicate values. As the capabilities of language models improve, these strategies are evolving to support more general error correction, enhancing the accuracy and robustness of SQL query generation. 6 | #### 📚Representative papers: 7 | + `Paper` [DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction](https://proceedings.neurips.cc/paper_files/paper/2023/hash/72223cc66f63ca1aa59edaec1b3670e6-Abstract-Conference.html) 8 | + `Describe` This paper proposes a self-correction module that guides the model to correct SQL errors. This module is implemented in the zero-shot setting, where the model is only provided with the buggy SQL and asked to fix the errors. The study suggests two different prompts for different models: a general prompt for the CodeX model and a mild prompt for the GPT-4 model. 9 | + `Paper` [Interleaving Pre-Trained Language Models and Large Language Models for Zero-Shot NL2SQL Generation](https://arxiv.org/abs/2306.08891) 10 | + `Describe` This paper adopts a multi-level matching approach that incrementally expands the matching scope across three levels (columns, tables, and databases) to sequentially match predicate values. The matched predicate values are then returned to the LLMs, helping it generate SQL queries consistent with the database content. 11 | + `Paper` [MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL](https://arxiv.org/abs/2312.11242) 12 | + `Describe` This paper designs a Refiner agent, whose primary function is to detect and correct SQL errors. After receiving an SQL query, the Refiner Agent will diagnose the SQL statement from three aspects: syntactic correctness, execution feasibility, and whether it retrieves non-empty results from the database. If the check fails, it will reason based on the original SQL and error feedback information or modification guidance signals to correct the erroneous SQL statement. The core function is to enable the model to perform self-diagnosis and self-correction, thereby enhancing the robustness and accuracy of the overall framework. 
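The correction strategies above share a common skeleton: execute (or at least syntax-check) the candidate SQL, and if it fails, feed the error message back to the model and ask for a fix. The sketch below shows that loop with SQLite; `ask_model_to_fix` is a stand-in for the LLM call, and the retry budget and hard-coded repair are arbitrary choices made so the example runs on its own.

```python
import sqlite3

def ask_model_to_fix(sql, error_message):
    """Placeholder for an LLM call that receives the buggy SQL plus the execution error."""
    # Hard-coded repair so the sketch runs end-to-end; a real system would prompt the model.
    return sql.replace("nme", "name")

def execute_with_repair(conn, sql, max_retries=2):
    for _ in range(max_retries + 1):
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as err:
            sql = ask_model_to_fix(sql, str(err))
    raise RuntimeError("could not produce an executable query")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.execute("INSERT INTO singer VALUES ('Ann', 35)")

buggy_sql = "SELECT nme FROM singer WHERE age > 30"  # misspelled column name
print(execute_with_repair(conn, buggy_sql))           # [('Ann',)]
```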
13 | --- 14 | ### Output Consistency: 15 | #### 🎓Basic concept: 16 | The purpose of output consistency strategies is to enhance the reliability of SQL queries generated by models, ensuring they consistently express the same meaning despite inherent randomness and uncertainty. Techniques like self-consistency sample multiple reasoning paths and use voting mechanisms to select the most consistent result, while cross-consistency involves multiple models generating SQL at low temperatures to maintain performance and diversify outputs. These methods improve the accuracy and reliability of SQL generation but can significantly increase inference cost and time. 17 | #### 📚Representative papers: 18 | + `Paper` [C3: Zero-shot Text-to-SQL with ChatGPT](https://arxiv.org/abs/2307.07306) 19 | + `Describe` This paper incorporates the Consistency Output (CO) component, which aims to maintain the consistency of the generated SQL queries by overcoming the inherent randomness and uncertainty in the outputs of large language models, thereby improving zero-shot NL2SQL performance. Specifically, CO first samples multiple reasoning paths to generate different SQL answers. Then, these SQL queries are executed on the database, and the execution results are collected. After removing errors from all results, a voting mechanism is applied to these execution results to determine the most consistent SQL as the final SQL. This method enables models to leverage the collective knowledge derived from these multiple paths, resulting in more reliable outcomes in generating SQL queries. 20 | + `Paper` [PET-SQL: A Prompt-Enhanced Two-Round Refinement of Text-to-SQL with Cross-consistency](https://arxiv.org/abs/2403.09732) 21 | 22 | + `Describe` This paper proposes the cross-consistency strategy, which instructs multiple LLMs to generate SQL at lower temperatures and then votes on the execution results of the SQL. This cross-consistency strategy not only diversifies SQL queries but also maintains the performance of LLMs at low-temperature settings. 23 | --- 24 | ### Execution-Guided Strategies: 25 | #### 🎓Basic concept: 26 | Execution-guided strategies use the results of SQL query executions to refine and ensure the accuracy of generated queries. By incorporating execution feedback, models can iteratively correct errors and optimize SQL queries to retrieve valid data. However, this approach can increase the time required for SQL generation, especially when dealing with large databases. 27 | #### 📚Representative papers: 28 | + `Paper` [Interleaving Pre-Trained Language Models and Large Language Models for Zero-Shot NL2SQL Generation](https://arxiv.org/abs/2306.08891) 29 | + `Describe` This paper continuously generates SQL queries through an executable check process after obtaining multiple candidate SQL sketches. It feeds back error messages to the LLMs to achieve an executable query. 30 | + `Paper` [CHESS: Contextual Harnessing for Efficient SQL Synthesis](https://arxiv.org/pdf/2405.16755) 31 | + `Describe` To reflect human behavior when writing complex SQL queries, this paper returns not only the database schema, question, and candidate SQL queries but also the execution results of the SQL queries to LLMs. Specifically, CHESS starts with a draft query and refines it based on its execution results, making necessary adjustments to the SQL query in case of syntax errors.
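Both the output-consistency and the execution-guided strategies above ultimately rely on executing candidate queries. The sketch below shows the execution-based voting step used by self-consistency and cross-consistency: group candidates by their result sets and keep a query from the largest group. How the candidates are sampled (one model at a higher temperature, or several different models) is out of scope here, so they are hard-coded for illustration.

```python
import sqlite3

def vote_by_execution(conn, candidate_sqls):
    """Execution-based voting: return a candidate whose result set occurs most often."""
    groups = {}
    for sql in candidate_sqls:
        try:
            key = frozenset(conn.execute(sql).fetchall())
        except sqlite3.Error:
            continue  # unexecutable candidates are dropped before voting
        groups.setdefault(key, []).append(sql)
    if not groups:
        return None
    winning_result = max(groups, key=lambda k: len(groups[k]))
    return groups[winning_result][0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 35), ("Bob", 28)])

candidates = [
    "SELECT name FROM singer WHERE age > 30",
    "SELECT name FROM singer WHERE age >= 31",
    "SELECT name FROM singer WHERE age > 40",
    "SELECT nam FROM singer",  # unexecutable, ignored by the vote
]
print(vote_by_execution(conn, candidates))  # one of the two agreeing queries
```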
32 | --- 33 | ### N-best Rerankers Strategies: 34 | #### 🎓Basic concept: 35 | N-best reranker strategies aim to reorder the top n generated SQL queries to enhance accuracy in cross-domain NL2SQL tasks. These methods often utilize larger models or additional knowledge sources to refine the rankings, thereby improving the semantic match between the user’s query and the generated SQL. By employing techniques like fine-tuning pre-trained language models and leveraging contrastive learning, these strategies address the variability and correctness of the generated queries, leading to more reliable outcomes. 36 | #### 📚Representative papers: 37 | + `Paper` [Bertrand-DR: Improving Text-to-SQL using a Discriminative Re-ranker](https://arxiv.org/abs/2002.00557) 38 | + `Describe` This paper fine-tunes a BERT model as a reranker on the Spider dataset, and this work has successfully improved multiple NL2SQL models. 39 | + `Paper` [G3R: A Graph-Guided Generate-and-Rerank Framework for Complex and Cross-domain Text-to-SQL Generation](https://aclanthology.org/2023.findings-acl.23/) 40 | + `Describe` This paper proposes a feature-enhanced reranker based on a Pre-trained Language Model (PLM) to address the shortcomings of instability and high dependence on threshold settings. The SQL reranker leverages a PLM with hybrid prompt tuning to integrate the PLM's knowledge, effectively bridging gaps between various domains without adding extra parameters. Contrastive learning is then used to push away the representation distance of candidate queries, making them more distinguishable. 41 | + `Paper` [N-Best Hypotheses Reranking for Text-to-SQL Systems](https://ieeexplore.ieee.org/abstract/document/10023434) 42 | + `Describe` This paper proposes two rerankers from the perspectives of consistency and correctness. To improve consistency, query plans generated by independent models can be used to explore N-best reranking. Then, to enhance correctness, they introduce a heuristic algorithm that applies schema linking on the N-best list to impose constraints missing in PICARD. The combined reranking method produces improvements on T5 models. 43 | + `Paper` [ReFSQL: A Retrieval-Augmentation Framework for Text-to-SQL Generation](https://aclanthology.org/2023.findings-emnlp.48/) 44 | + `Describe` This paper employs a ranking algorithm to retrieve the most closely related generated results from the retriever and generator module. 45 | --- 46 | 47 | 48 | -------------------------------------------------------------------------------- /chapter/Pre_Processing.md: -------------------------------------------------------------------------------- 1 | ## Pre-Processing 2 | Pre-processing serves as an enhancement to the model’s inputs in the NL2SQL parsing process. Although not strictly necessary, pre-processing significantly contributes to the refinement of 3 | NL2SQL parsing. 4 | ### Schema linking: 5 | #### 🎓Basic concept: 6 | The purpose of schema linking is to identify the tables and columns related to the given NL query. 7 | It ensures the accurate mapping and processing of key information within the limited input, thereby improving the performance of the NL2SQL task. 8 | In the LLM era, schema linking has become increasingly crucial due to the input length limit of LLMs. 9 | #### 📚Representative papers: 10 | + `Paper` Data-anonymous encoding for text-to-sql generation.
[](https://aclanthology.org/D19-1543/) 11 | + `Describe` This paper formulates schema linking as a sequential tagging problem and propose a two-stage anonymization model to learn the semantic relationship between schema and NL. 12 | + `Paper` Re-examining the role of schema linking in text-to-sql. [](https://aclanthology.org/2020.emnlp-main.564/) 13 | [](https://github.com/WING-NUS/slsql) 14 | + `Describe` This paper annotates the schema linking information for each instance in the training and development sets of Spider to support a data-driven and systematic study. 15 | + `Paper` RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL. [](https://arxiv.org/abs/2302.05965) [](https://github.com/RUCKBReasoning/RESDSQL) 16 | + `Describe` This paper proposes a ranking-enhanced encoding framework for schema linking. An additional cross-encoder is trained to classify tables and columns based on the input query. This framework ranks and filters them according to classification probabilities, resulting in a ranked sequence of schema items. 17 | + `Paper` C3: Zero-shot Text-to-SQL with ChatGPT [](https://arxiv.org/abs/2307.07306) [](https://github.com/bigbigwatermalon/C3SQL) 18 | + `Describe` This paper designs different zero-shot prompts to instruct GPT-3.5 for table and column linking, employing the self-consistency method. For the table linking, the prompt guides the process in three steps: ranking tables by relevance, ensuring all relevant tables are included, and outputting in list format. For the column linking, another prompt guides the ranking of columns within candidate tables and outputting in dictionary format, prioritizing those matching question terms or foreign keys. 19 | --- 20 | ### DB content Retrival 21 | #### 🎓Basic concept: 22 | The purpose of database content retrieval is to efficiently retrieve cell values through textual searching algorithms and database indexing. Given the large scale of databases, retrieving cell values from them is resource-intensive. Additionally, addressing the requirements of the WHERE and JOIN clauses can significantly optimize NL2SQL performance. Therefore, it is crucial to implement appropriate strategies for the scenario requirement. 23 | #### 📚Representative papers: 24 | + `Paper` [Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing](https://arxiv.org/pdf/2012.12627) 25 | + `Describe` BRIDGE designs an anchor text matching to extract cell values mentioned in the NL automatically. It uses a heuristic method to calculate the maximum sequence match between the problem and the cell values to determine the matching boundary. When the cell values are substrings of words in the query, the heuristic can exclude those string matches. The matching threshold is then adjusted by making coarse accuracy measurements. 26 | + `Paper` [ValueNet: A Natural Language-to-SQL System that Learns from Database Information](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9458778&casa_token=UWDqhoU2Wb0AAAAA:QetXS1rDu1qXExZJa6cKotIE5YXzHG-YwWyRNuhdaqwaRnB-Wj_S8MuypI--RIcF9oHb5a7pz1IR8h0&tag=1) 27 | + `Describe` ValueNet implements three methods for generating candidate cell values based on n-grams method, string similarity and heuristic selection. 28 | + `Paper` [TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data](https://arxiv.org/pdf/2005.08314) 29 | + `Describe` TABERT utilizes a method called database content snapshots to encode the relevant subset of database content corresponding to the NL query. 
It uses an attention mechanism to manage information between cell value representations across different rows. 30 | + `Paper` [Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation](https://arxiv.org/pdf/1905.08205) 31 | + `Describe` IRNet employs the knowledge graph ConceptNet to recognize cell value links and search cell value candidates in the knowledge graph. When a result exactly or partially matches a cell value, the column is assigned a type of value exact match or partial match, respectively. 32 | + `Paper` [RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers](https://arxiv.org/pdf/1911.04942) 33 | + `Describe` RAT-SQL improves structural reasoning capabilities by modeling the relationship between cell values and the NL query. Specifically, it identifies the column-value relationship, meaning that the value in the question is part of the candidate cell value of the column. 34 | + `Paper` [CHESS: Contextual Harnessing for Efficient SQL Synthesis](https://arxiv.org/pdf/2405.16755) 35 | + `Describe` CHESS utilizes a Locality-sensitive Hashing algorithm for approximate nearest neighbor searches. It indexes unique cell values to quickly identify the top similar values related to the NL query. This approach significantly speeds up the process of computing the edit distance and semantic embedding between the NL query and cell values. 36 | + `Paper` [CodeS: Towards Building Open-source Language Models for Text-to-SQL](https://dl.acm.org/doi/abs/10.1145/3654930) 37 | + `Describe` CodeS introduces a coarse-to-fine cell value matching approach. It leverages indexes for a coarse-grained initial search, followed by a fine-grained matching process. First, it builds the index for all values using BM25. The index identifies candidate values relevant to NL. The Longest Common Substring algorithm is then used to calculate the matching degree between NL and the candidate values to find the most relevant cell values. 38 | --- 39 | ### Additional Information Acquisition 40 | #### 🎓Basic concept: 41 | Additional information (e.g. *domain knowledge*) plays an essential role in improving the comprehension capabilities of NL2SQL models for understanding the NL query, performing the schema linking, and benefiting the NL2SQL translation. This information can provide demonstration examples, domain knowledge, formulaic evidence, and format information for the NL2SQL backbone model or specific modules, thereby enhancing the quality of the generated results. 42 | #### 📚Representative papers: 43 | + `Paper` [DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction](https://arxiv.org/pdf/2304.11015v3.pdf) 44 | + `Describe` DIN-SQL inserts additional information through few-shot learning across multiple stages of the workflow, such as schema linking, query classification, task decomposition, and self-correction. These stages allow DIN-SQL to effectively tackle various challenges, including the complexity of schema links, identification of multiple table joins, and handling of nested queries. 45 | + `Paper` [CodeS: Towards Building Open-source Language Models for Text-to-SQL](https://dl.acm.org/doi/abs/10.1145/3654930) 46 | + `Describe` CodeS utilizes metadata examples of cross-domain databases as the main additional information, including data types and annotation text, which help the model resolve potential ambiguity issues and understand entity relationships. 
This extracted information is transformed into coherent text and concatenated with the NL query to form the final input context. 47 | + `Paper` [PET-SQL: A Prompt-enhanced Two-stage Text-to-SQL Framework with Cross-consistency](https://arxiv.org/pdf/2403.09732) 48 | + `Describe` PET-SQL constructs a pool of examples from the training set, which contains question frames and question-SQL pairs. Then, it selects the $k$ examples that are most similar to the target question. These selected examples are combined with customized prompts as the final input. 49 | + `Paper` [Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation](https://arxiv.org/pdf/2308.15363v4.pdf) 50 | + `Describe` DAIL-SQL intricately designs a two-stage representation algorithm for additional information. It begins by presenting the question and database as SQL statement hints, thereby providing comprehensive database information. Following this, it employs a masking mechanism and similarity calculation to select appropriate examples and systematically organizes tags to enhance the efficiency of the algorithm. 51 | + `Paper` [The Dawn of Natural Language to SQL: Are We Fully Ready?](https://arxiv.org/abs/2406.01265) 52 | + `Describe` SuperSQL extends the representation algorithm of DAIL-SQL by integrating similarity-based sample selection with schema linking and database content information, which filters out irrelevant schemas, thereby enhancing the quality of SQL generation. 53 | + `Paper` [Towards Knowledge-Intensive Text-to-SQL Semantic Parsing with Formulaic Knowledge](https://arxiv.org/pdf/2301.01067) 54 | + `Describe` REGROUP constructs a formulaic knowledge base encompassing various domains, such as finance, real estate, and transportation. It leverages a Dense Passage Retriever (DPR) to compute similarity scores for the retrieval results from the formulaic knowledge base. Subsequently, an Erasing-Then-Awakening (ETA) model is used to integrate the entities in these formulaic knowledge items with the entities in the NL query and schema. This model filters irrelevant entities below a confidence threshold and maps the remainder to schema elements, thereby grounding knowledge for accurate SQL query generation. 55 | + `Paper` [Reboost Large Language Model-based Text-to-SQL, Text-to-Python, and Text-to-Function - with Real Applications in Traffic Domain](https://arxiv.org/pdf/2310.18752) 56 | + `Describe` ReBoost interacts with the LLM using the Explain-Squeeze Schema Linking mechanism, a two-phase strategy. Initially, it presents a generalized schema to the LLM to establish a foundational understanding. Subsequently, it employs targeted prompting to elicit detailed associations between query phrases and specific database entities, thereby enhancing the accuracy of mapping queries to database structures without incurring excessive token costs. 57 | + `Paper` [Selective Demonstrations for Cross-domain Text-to-SQL](https://arxiv.org/pdf/2310.06302) 58 | + `Describe` ODIS proposes the SimSQL method to retrieve additional knowledge from cross-domain databases. This method utilizes the BM25 algorithm to measure the resemblance in SQL keywords and schema tokens. The top examples from each database are selected as the demonstrations that most closely align with the target SQL.
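Several of the methods above (e.g., PET-SQL, DAIL-SQL, ODIS) acquire demonstrations by retrieving training examples that are lexically or structurally similar to the target question and prepending them to the prompt. The snippet below is a minimal, self-contained sketch of this idea, not the code of any cited system: the toy demonstration pool, the BM25-flavoured overlap score, and the prompt template are all illustrative assumptions.

```python
from collections import Counter
import math

# Toy demonstration pool of (question, SQL) pairs; in practice this would be a
# full training split such as Spider. Pool contents here are illustrative only.
DEMO_POOL = [
    ("How many singers do we have?", "SELECT count(*) FROM singer"),
    ("List the name of singers in ascending order of age.",
     "SELECT name FROM singer ORDER BY age ASC"),
    ("What is the average age of all singers?", "SELECT avg(age) FROM singer"),
    ("Show the stadium names and their capacities.", "SELECT name, capacity FROM stadium"),
]

def tokenize(text: str) -> list:
    return [t for t in text.lower().replace("?", " ").replace(",", " ").split() if t]

def overlap_score(query_tokens: list, doc_tokens: list) -> float:
    """A BM25-flavoured lexical score: log term-frequency overlap with length normalization."""
    doc_tf = Counter(doc_tokens)
    score = sum(math.log(1.0 + doc_tf[t]) for t in set(query_tokens))
    return score / math.sqrt(len(doc_tokens) + 1)

def select_demonstrations(question: str, k: int = 2) -> list:
    """Rank pool questions by similarity to the target question and keep the top-k."""
    q_tokens = tokenize(question)
    ranked = sorted(DEMO_POOL,
                    key=lambda pair: overlap_score(q_tokens, tokenize(pair[0])),
                    reverse=True)
    return ranked[:k]

def build_prompt(question: str, schema: str) -> str:
    """Concatenate schema, retrieved demonstrations, and the target question into one prompt."""
    shots = "\n\n".join(f"Question: {q}\nSQL: {sql}" for q, sql in select_demonstrations(question))
    return f"{schema}\n\n{shots}\n\nQuestion: {question}\nSQL:"

if __name__ == "__main__":
    schema = "Table singer(singer_id, name, age); Table stadium(stadium_id, name, capacity)"
    print(build_prompt("What is the maximum age among all singers?", schema))
```

In a real system the pool would be the entire training split, the scorer would be BM25 proper or an embedding model, and DAIL-SQL-style masking would be applied to questions before scoring.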
59 | --- 60 | -------------------------------------------------------------------------------- /chapter/Translation_method.md: -------------------------------------------------------------------------------- 1 | ## NL2SQL Translation Methods 2 | 3 | ### Encoding Strategy 4 | #### 🎓Basic concept: 5 | Encoding in the NL2SQL task refers to the process of transforming NL and database schema into a structured format that can be effectively utilized by a language model. This transformation is crucial as it converts unstructured and semi-structured data into a form that can be processed for generating SQL queries. The encoding process involves capturing the semantic meaning of the NL input and the structural information of the database schema, enabling the model to understand and map the user’s intent to the corresponding SQL query. There are three primary encoding strategies in NL2SQL models, each with its unique approach to transforming NL and database schemas: 1) sequential encoding, 2) graph-based encoding, and 3) separate encoding. 6 | 7 |
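As a concrete illustration of the first strategy, the sketch below shows how sequential encoding typically serializes the NL query and the database schema into a single input sequence. The separator markers and the schema layout are illustrative assumptions rather than the exact input format of any particular model.

```python
# Sequential encoding: flatten the schema and join it with the question into one
# token sequence. The separator markers below are illustrative assumptions.

def serialize_schema(schema: dict) -> str:
    """Linearize {table: [columns]} as 'table : col1 , col2 | table2 : ...'."""
    return " | ".join(f"{table} : " + " , ".join(columns) for table, columns in schema.items())

def build_encoder_input(question: str, schema: dict) -> str:
    """Concatenate the question and the flattened schema into a single input sequence."""
    return f"<question> {question} <schema> {serialize_schema(schema)}"

if __name__ == "__main__":
    schema = {
        "singer": ["singer_id", "name", "age", "country"],
        "concert": ["concert_id", "singer_id", "year"],
    }
    print(build_encoder_input("How many singers are from France?", schema))
    # The resulting string would then be tokenized and fed to a PLM encoder,
    # or placed into an LLM prompt, to drive SQL generation.
```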

8 | 9 |

10 | 11 | #### 📚Representative papers: 12 | + `Paper` [RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL](https://ojs.aaai.org/index.php/AAAI/article/view/26535) 13 | + `Describe` RESDSQL uses a ranking-enhanced encoder to sort and filter schema items, thereby reducing the complexity of schema linking during encoding. This method ensures that the most relevant schema items are prioritized, improving the overall efficiency of the encoding process. 14 | + `Paper` [*CatSQL*: Towards Real World Natural Language to SQL Applications](https://dl.acm.org/doi/abs/10.14778/3583140.3583165) 15 | + `Describe` CatSQL utilizes the pre-trained GraPPa encoding network to concatenate the NL, database schema, and additional information into a single input sequence, generating hidden state sequences. This approach integrates multiple sources of information, enhancing the model’s ability to capture complex relationships. 16 | + `Paper` [RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers](https://arxiv.org/abs/1911.04942) 17 | + `Describe` RAT-SQL introduces a relation-aware self-attention mechanism, allowing the model to explicitly consider and utilize predefined relational information when jointly encoding the question and the database schema. These relationships are represented as a graph structure, and through this graph-based encoding, RAT-SQL can more effectively capture the structural information in the schema and its alignment with the NL query. 18 | + `Paper` [Towards Generalizable and Robust Text-to-SQL Parsing](https://arxiv.org/abs/2210.12674) 19 | + `Describe` Based on the pre-trained T5 model, TKK employs task decomposition and multi-task learning strategies in encoding by breaking down the complex NL2SQL task into multiple subtasks and progressively acquiring and combining knowledge. 20 | --- 21 | ### Decoding Strategy 22 | #### 🎓Basic concept: 23 | Decoding plays a crucial role in NL2SQL translation, as it is responsible for converting the representations generated by the encoder into the target SQL queries. The choice of decoding strategy directly affects the quality and performance of the generated SQL queries. An excellent decoding strategy not only produces syntactically correct SQL queries but also ensures that the semantics of the SQL queries align with the NL and can even optimize the execution efficiency of the queries. We will introduce several key decoding strategies employed by existing NL2SQL models, namely: 1) greedy search-based decoding strategy, 2) beam search-based decoding strategy, and 3) constraint-aware incremental decoding strategy. 24 | 25 |
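The sketch below illustrates the constraint-aware flavour of decoding in a simplified form: a beam search that discards any partial hypothesis whose SQL prefix fails a lightweight validity check. The `next_candidates` function and the prefix checker are stand-ins (assumptions) for a real sequence model and an incremental SQL parser of the kind PICARD relies on; they are not the actual implementation of any cited system.

```python
SQL_KEYWORDS = {"SELECT", "FROM", "WHERE", "GROUP", "ORDER", "BY", "LIMIT"}

def is_plausible_sql_prefix(prefix: str) -> bool:
    """Stand-in validity check; a real system would query an incremental SQL parser."""
    tokens = prefix.strip().split()
    if not tokens:
        return True
    if tokens[0].upper() != "SELECT":
        return False
    # Reject obviously malformed prefixes such as "FROM WHERE" or "WHERE LIMIT".
    for a, b in zip(tokens, tokens[1:]):
        if a.upper() in {"FROM", "WHERE"} and b.upper() in SQL_KEYWORDS:
            return False
    return True

def constrained_beam_search(next_candidates, beam_width: int = 3, max_steps: int = 8):
    """next_candidates(prefix) -> [(token, logprob)]; assumed to wrap a seq2seq model."""
    beams = [("", 0.0)]
    for _ in range(max_steps):
        expanded = []
        for prefix, score in beams:
            for token, logprob in next_candidates(prefix):
                candidate = (prefix + " " + token).strip()
                if is_plausible_sql_prefix(candidate):  # prune invalid prefixes early
                    expanded.append((candidate, score + logprob))
        if not expanded:
            break
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams

if __name__ == "__main__":
    def toy_candidates(prefix):
        # Uniform toy distribution over a tiny vocabulary, standing in for model scores.
        return [(t, -1.0) for t in ["SELECT", "name", "FROM", "singer", "WHERE", "age", ">", "30"]]
    for sql, score in constrained_beam_search(toy_candidates, beam_width=2, max_steps=6):
        print(round(score, 1), sql)
```

Setting `beam_width=1` reduces this to greedy search; removing the prefix check recovers plain beam search, which is how the three strategies above relate to one another.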

26 | 27 |

28 | 29 | #### 📚Representative papers: 30 | + `Paper` [RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers](https://arxiv.org/abs/1911.04942) 31 | + `Describe` RAT-SQL combines relation-aware graph structure encoding and generation techniques. During the decoding process, RAT-SQL uses beam search to generate multiple candidate SQL queries, which are then reranked, and the optimal query is selected based on graph structure information. 32 | + `Paper` [Editing-Based SQL Query Generation for Cross-Domain Context-Dependent Questions](https://arxiv.org/abs/1909.00786) 33 | + `Describe` EditSQL employs a context encoding strategy, incorporating dialogue history information into the model. During the decoding process, it uses the beam search-based decoding strategy to generate candidate SQL queries and utilizes dialogue context information to select and optimize the queries. 34 | + `Paper` [PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models](https://arxiv.org/abs/2109.05093) 35 | + `Describe` The constraint-aware incremental decoding strategy, introduced by PICARD (Parsing Incrementally for Constrained Auto-Regressive Decoding), is specifically designed for NL2SQL tasks. This strategy aims to ensure the generation of syntactically correct SQL queries by incorporating constraints during the decoding process. 36 | + `Paper` [Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing](https://arxiv.org/abs/2012.12627) 37 | + `Describe` BRIDGE introduces simple heuristic rules to prune the search space of the sequence decoder, proposing Schema-Consistency Guided Decoding to ensure that the generated SQL queries are consistent with the database schema. This strategy continuously checks whether the generated SQL queries match the database schema during the decoding process and adjusts the decoding path based on the matching results. 38 | --- 39 | ### Task-specific Prompt Strategy 40 | #### 🎓Basic concept: 41 | In the era of LLMs, prompt engineering can harness the capabilities of LLMs and has been widely adopted in natural language processing, with various frameworks developed for specific tasks. In the NL2SQL field, a task-specific prompt strategy refers to the tailored prompt engineering techniques used in the NL2SQL translation process. These strategies instruct the LLMs to optimize the SQL query generation process according to task-specific rules, improving the accuracy of translating semantically complex NL queries into the corresponding SQL queries. 42 | #### 📚Representative papers: 43 | + `Paper` [CHESS: Contextual Harnessing for Efficient SQL Synthesis](https://arxiv.org/pdf/2405.16755) 44 | + `Describe` CHESS transforms NL into SQL statements using a streamlined pipeline that relies on LLMs and chain-of-thought (CoT) prompting. This process comprises entity and context retrieval, schema selection, SQL generation, and revision. 45 | + `Paper` [DTS-SQL: Decomposed Text-to-SQL with Small Large Language Models](https://arxiv.org/pdf/2402.01117) 46 | + `Describe` DTS-SQL splits the task into two subtasks, schema linking and SQL generation, to close the performance gap between open-source LLMs and closed-source LLMs. 47 | + `Paper` [Towards Generalizable and Robust Text-to-SQL Parsing](https://arxiv.org/pdf/2210.12674) 48 | + `Describe` The TKK framework divides the initial NL2SQL parsing task into smaller individual subtasks, each corresponding to the mapping of the NL query to one or more clauses of the SQL query.
49 | + `Paper` [MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL](https://arxiv.org/abs/2312.11242) 50 | + `Describe` MAC-SQL incorporates a Decomposer agent designed to break down the user's original problem into several subproblems. This decomposition process aims to lessen the complexity of the original question, enabling the generation of simpler SQL queries to solve each individual subproblem. 51 | + `Paper` [DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction](https://arxiv.org/pdf/2304.11015v3.pdf) 52 | + `Describe` DIN-SQL employs a sophisticated categorization module for decomposition. It classifies queries into distinct complexity groups: EASY, NON-NESTED, and NESTED, with reference to the NL query and database schema. This module is fundamental for the subsequent decomposition process, which meticulously dissects complex queries into simpler sub-problems. By strategically identifying and separating schema linking, join conditions, and nested structures, the module facilitates the structured generation of SQL queries and improves the accuracy of translating complex NL queries into executable SQL. 53 | --- 54 | ### Intermediate Representation for NL2SQL Translation 55 | #### 🎓Basic concept: 56 | As mentioned before, the NL2SQL task is challenging due to the complexity and ambiguity of NL queries, as well as the formal and structured nature of SQL. Thus, researchers try to simplify this process by designing a *grammar-free* intermediate representation (compared to SQL) as the bridge between the "free-form" NL query and the "constrained and formal" SQL query. Roughly speaking, an intermediate representation (IR) is a structured yet flexible grammar that captures the essential components and relationships of an NL query without the strict syntax rules of SQL. 57 | 58 |
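To make the idea concrete, the sketch below shows a deliberately simplified, NatSQL-flavoured IR for the question "Which singers performed in concerts after 2014?" and a toy compiler that expands it into executable SQL by re-inserting the FROM clause and foreign-key joins. The IR fields, the foreign-key table, and the expansion rules are illustrative assumptions, not the grammar of any cited paper.

```python
# A deliberately simplified, NatSQL-flavoured IR: the FROM/JOIN clauses are dropped
# and only the columns and conditions the user actually mentions are kept.
# The IR fields, foreign-key table, and expansion rules are illustrative assumptions.

FOREIGN_KEYS = {("concert", "singer"): "singer.singer_id = concert.singer_id"}

def ir_to_sql(ir: dict) -> str:
    """Expand the IR into full SQL by inferring FROM tables and foreign-key joins."""
    tables = sorted({col.split(".")[0] for col in ir["select"]}
                    | {col.split(".")[0] for col, _, _ in ir["where"]})
    joins = [cond for pair, cond in FOREIGN_KEYS.items() if set(pair) <= set(tables)]
    predicates = joins + [f"{col} {op} {val}" for col, op, val in ir["where"]]
    sql = "SELECT " + ", ".join(ir["select"]) + " FROM " + ", ".join(tables)
    if predicates:
        sql += " WHERE " + " AND ".join(predicates)
    if ir.get("order_by"):
        sql += " ORDER BY " + ir["order_by"]
    return sql

if __name__ == "__main__":
    # "Which singers performed in concerts after 2014?" expressed in the toy IR:
    ir = {"select": ["singer.name"], "where": [("concert.year", ">", 2014)], "order_by": None}
    print(ir_to_sql(ir))
    # SELECT singer.name FROM concert, singer
    # WHERE singer.singer_id = concert.singer_id AND concert.year > 2014
```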

59 | 60 |

61 | 62 | #### 📚Representative papers: 63 | + `Paper` [Schema-free SQL](https://dl.acm.org/doi/pdf/10.1145/2588555.2588571) 64 | + `Describe` In the research on Schema-free SQL, the original question can be transformed into an intermediate representation even when the user has no knowledge of the schema. 65 | + `Paper` [SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task](https://arxiv.org/pdf/1810.05237) 66 | + `Describe` SyntaxSQLNet removes portions of the FROM and JOIN clauses in its syntax language. 67 | + `Paper` [SemQL: a semantic query language for multidatabase systems](https://dl.acm.org/doi/pdf/10.1145/319950.320011) 68 | + `Describe` SemQL removes the FROM, JOIN, ON and GROUP BY clauses and combines WHERE and HAVING conditions. 69 | + `Paper` [Editing-Based SQL Query Generation for Cross-Domain Context-Dependent Questions](https://arxiv.org/pdf/1909.00786) 70 | + `Describe` EditSQL adds WHERE and HAVING conditions but retains the GROUP BY clause. 71 | + `Paper` [Natural SQL: Making SQL Easier to Infer from Natural Language Specifications](https://arxiv.org/pdf/2109.05153.pdf) 72 | + `Describe` Natural SQL (NatSQL) is a widely recognized SQL-like syntax language that eliminates SQL statement operators, keywords, set operators, and other elements seldom found in user problem descriptions. It enhances schema linking by minimizing the necessary number of schema items. 73 | + `Paper` [Semantic Decomposition of Question and SQL for Text-to-SQL Parsing](https://arxiv.org/pdf/2310.13575) 74 | + `Describe` The Query Plan Language (QPL) leverages the problem decomposition strategy to improve the parsing of intricate SQL queries. By breaking down a SQL query into modularized sub-queries, the complexity of the original query is reduced. This approach mitigates parsing difficulties associated with complex problems and cross-domain complex queries. 75 | + `Paper` [Weakly Supervised Text-to-SQL Parsing through Question Decomposition](https://arxiv.org/pdf/2112.06311) 76 | + `Describe` Question Decomposition Meaning Representation (QDMR) decomposes the original question into a number of atomic questions. Each atomic question serves as an intermediate representation of the original question and can be translated into a set of small-scale formal operations involving tasks such as selecting entities, retrieving attributes, or aggregating information. 77 | + `Paper` [Few-shot Text-to-SQL Translation using Structure and Content Prompt Learning](https://dl.acm.org/doi/pdf/10.1145/3589292) 78 | + `Describe` SC-Prompt utilizes a two-stage divide-and-conquer method for NL2SQL parsing. During the initial phase, it instructs the PLM to generate specific SQL structures, such as query commands and operators, while also supplying placeholders for any missing identifiers. In the subsequent phase, it directs the PLM to generate SQL structures containing actual values to fill the previously provided placeholders. 79 | + `Paper` [CatSQL: Towards Real World Natural Language to SQL Applications](https://dl.acm.org/doi/pdf/10.14778/3583140.3583165) 80 | + `Describe` CatSQL constructs a template sketch with slots serving as initial placeholders. Different from the former, this sketch is much more general. Its base model can focus on parsing the user query to fill these placeholders, consequently decreasing the computational resource cost.
Furthermore, it implements a novel semantic correction algorithm to assess the semantic accuracy of the resulting SQL queries and rectify any semantic issues detected in the generated queries. 81 | + `Paper` [Interleaving Pre-Trained Language Models and Large Language Models for Zero-Shot NL2SQL Generation](https://arxiv.org/pdf/2306.08891) 82 | + `Describe` ZeroNL2SQL integrates the schema alignment capabilities of PLMs with the complex reasoning capabilities of LLMs. Initially, it utilizes a PLM to produce SQL sketches for achieving schema alignment and subsequently employs LLMs to execute complex content reasoning for populating missing information. It also proposes a predicate calibration method for guiding the design of language models for SQL sketches based on database instances and selecting the optimal SQL query. 83 | + `Paper` [Before Generation, Align it! A Novel and Effective Strategy for Mitigating Hallucinations in Text-to-SQL Generation](https://arxiv.org/pdf/2405.15307) 84 | + `Describe` TA-SQL combines pandas code and symbolic representation to generate an abstract sketch of SQL and uses this sketch to align with schema information in subsequent modules to generate complete SQL. 85 | + `Paper` [RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL](https://arxiv.org/pdf/2302.05965v3.pdf) 86 | + `Describe` RESDSQL introduces a ranking-enhanced encoding and skeleton-aware decoding framework, which separates schema linking from skeleton parsing. During the decoding phase, its decoder initially produces the SQL skeleton and then generates the actual SQL query. This approach implicitly constrains the SQL parsing and governs the quality of generation. When combined with NatSQL, RESDSQL demonstrates the ability to further enhance the quality of SQL query generation.
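As a small illustration of the skeleton-then-fill style used by SC-Prompt, CatSQL, and RESDSQL's skeleton-aware decoder, the sketch below first strips a SQL query down to a structural skeleton with placeholders and then fills the placeholders back in. The placeholder syntax, the regular expressions, and the two-step API are illustrative assumptions, not the implementation of any of these systems.

```python
import re

SQL_KEYWORDS = r"SELECT|FROM|WHERE|AND|OR|ORDER|GROUP|BY|LIMIT|JOIN|ON|AS|DESC|ASC|COUNT|AVG|MAX|MIN|SUM|VALUE"

def extract_skeleton(sql: str) -> str:
    """Stage 1: keep only SQL structure, masking literals and schema identifiers."""
    skeleton = re.sub(r"'[^']*'|\b\d+(\.\d+)?\b", "[VALUE]", sql)            # mask literals
    return re.sub(rf"\b(?!(?:{SQL_KEYWORDS})\b)\w+(\.\w+)?\b", "[SLOT]",
                  skeleton, flags=re.IGNORECASE)                             # mask identifiers

def fill_skeleton(skeleton: str, slots: list, values: list) -> str:
    """Stage 2: a content model would predict slot/value fillers; here they are given."""
    for s in slots:
        skeleton = skeleton.replace("[SLOT]", s, 1)
    for v in values:
        skeleton = skeleton.replace("[VALUE]", v, 1)
    return skeleton

if __name__ == "__main__":
    gold = "SELECT name FROM singer WHERE age > 30"
    skel = extract_skeleton(gold)
    print(skel)   # SELECT [SLOT] FROM [SLOT] WHERE [SLOT] > [VALUE]
    print(fill_skeleton(skel, ["name", "singer", "age"], ["30"]))
```

Generating the skeleton first constrains the search space for the second stage, which is the intuition behind both SC-Prompt's structure/content split and RESDSQL's skeleton-aware decoder.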
87 | --- 88 | -------------------------------------------------------------------------------- /report/ATIS/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 5280, 3 | "Number of Queries": 947, 4 | "Number of Questions / Number of Queries": 5.6, 5 | "Total Databases": 1, 6 | "Total Tables": 25, 7 | "Average Tables per Database": 25.0, 8 | "Average Columns per Table": 5.24, 9 | "Average Records per Database": 162243.0, 10 | "Average Tables per Query": 8.39, 11 | "Average Selects per Query": 1.79, 12 | "Average Aggs per Query": 0.22, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.0 15 | } -------------------------------------------------------------------------------- /report/Academic/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 196, 3 | "Number of Queries": 185, 4 | "Number of Questions / Number of Queries": 1.1, 5 | "Total Databases": 1, 6 | "Total Tables": 17, 7 | "Average Tables per Database": 17.0, 8 | "Average Columns per Table": 3.12, 9 | "Average Records per Database": 58249674.0, 10 | "Average Tables per Query": 3.48, 11 | "Average Selects per Query": 1.04, 12 | "Average Aggs per Query": 0.54, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.0 15 | } -------------------------------------------------------------------------------- /report/Advising/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 4387, 3 | "Number of Queries": 205, 4 | "Number of Questions / Number of Queries": 21.4, 5 | "Total Databases": 1, 6 | "Total Tables": 15, 7 | "Average Tables per Database": 15.0, 8 | "Average Columns per Table": 7.4, 9 | "Average Records per Database": 332596.0, 10 | "Average Tables per Query": 3.41, 11 | "Average Selects per Query": 1.21, 12 | "Average Aggs per Query": 0.4, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.11 15 | } -------------------------------------------------------------------------------- /report/AmbiQT/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 23295, 3 | "Number of Queries": 25550, 4 | "Number of Questions / Number of Queries": 0.9, 5 | "Total Databases": 166, 6 | "Total Tables": 876, 7 | "Average Tables per Database": 5.28, 8 | "Average Columns per Table": 5.14, 9 | "Average Records per Database": 9664.73, 10 | "Average Tables per Query": 1.92, 11 | "Average Selects per Query": 1.16, 12 | "Average Aggs per Query": 0.47, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.01 15 | } -------------------------------------------------------------------------------- /report/Archer/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 518, 3 | "Number of Queries": 260, 4 | "Number of Questions / Number of Queries": 2.0, 5 | "Total Databases": 10, 6 | "Total Tables": 68, 7 | "Average Tables per Database": 6.8, 8 | "Average Columns per Table": 6.81, 9 | "Average Records per Database": 31365.3, 10 | "Average Tables per Query": 3.89, 11 | "Average Selects per Query": 3.07, 12 | "Average Aggs per Query": 1.77, 13 | "Average Scalar Functions per Query": 0.1, 14 | "Average Math Computations per Query": 3.55 15 | } 
-------------------------------------------------------------------------------- /report/BIRD/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 10962, 3 | "Number of Queries": 10840, 4 | "Number of Questions / Number of Queries": 1.0, 5 | "Total Databases": 80, 6 | "Total Tables": 611, 7 | "Average Tables per Database": 7.64, 8 | "Average Columns per Table": 7.14, 9 | "Average Records per Database": 4585335.21, 10 | "Average Tables per Query": 2.07, 11 | "Average Selects per Query": 1.09, 12 | "Average Aggs per Query": 0.61, 13 | "Average Scalar Functions per Query": 0.2, 14 | "Average Math Computations per Query": 0.27 15 | } -------------------------------------------------------------------------------- /report/BULL/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 7932, 3 | "Number of Queries": 5864, 4 | "Number of Questions / Number of Queries": 1.4, 5 | "Total Databases": 3, 6 | "Total Tables": 78, 7 | "Average Tables per Database": 26.0, 8 | "Average Columns per Table": 14.96, 9 | "Average Records per Database": 85631.0, 10 | "Average Tables per Query": 1.22, 11 | "Average Selects per Query": 1.0, 12 | "Average Aggs per Query": 0.18, 13 | "Average Scalar Functions per Query": 0.42, 14 | "Average Math Computations per Query": 0.05 15 | } -------------------------------------------------------------------------------- /report/BookSQL/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 78433, 3 | "Number of Queries": 39530, 4 | "Number of Questions / Number of Queries": 2.0, 5 | "Total Databases": 1, 6 | "Total Tables": 7, 7 | "Average Tables per Database": 7.0, 8 | "Average Columns per Table": 8.86, 9 | "Average Records per Database": 1012948.0, 10 | "Average Tables per Query": 1.25, 11 | "Average Selects per Query": 1.12, 12 | "Average Aggs per Query": 0.78, 13 | "Average Scalar Functions per Query": 0.39, 14 | "Average Math Computations per Query": 0.22 15 | } -------------------------------------------------------------------------------- /report/CHASE/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 15408, 3 | "Number of Queries": 13900, 4 | "Number of Questions / Number of Queries": 1.1, 5 | "Total Databases": 350, 6 | "Total Tables": 1609, 7 | "Average Tables per Database": 4.6, 8 | "Average Columns per Table": 5.19, 9 | "Average Records per Database": 4594.33, 10 | "Average Tables per Query": 1.81, 11 | "Average Selects per Query": 1.16, 12 | "Average Aggs per Query": 0.31, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.0 15 | } -------------------------------------------------------------------------------- /report/CSpider/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 11840, 3 | "Number of Queries": 6408, 4 | "Number of Questions / Number of Queries": 1.8, 5 | "Total Databases": 206, 6 | "Total Tables": 1056, 7 | "Average Tables per Database": 5.13, 8 | "Average Columns per Table": 5.01, 9 | "Average Records per Database": 8980.19, 10 | "Average Tables per Query": 1.83, 11 | "Average Selects per Query": 1.17, 12 | "Average Aggs per Query": 0.54, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.0 15 | } 
-------------------------------------------------------------------------------- /report/CoSpider/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 8350, 3 | "Number of Queries": 8007, 4 | "Number of Questions / Number of Queries": 1.0, 5 | "Total Databases": 166, 6 | "Total Tables": 876, 7 | "Average Tables per Database": 5.28, 8 | "Average Columns per Table": 5.14, 9 | "Average Records per Database": 9664.73, 10 | "Average Tables per Query": 1.54, 11 | "Average Selects per Query": 1.11, 12 | "Average Aggs per Query": 0.42, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.0 15 | } -------------------------------------------------------------------------------- /report/DrSpider/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 15269, 3 | "Number of Queries": 3847, 4 | "Number of Questions / Number of Queries": 4.0, 5 | "Total Databases": 549, 6 | "Total Tables": 2197, 7 | "Average Tables per Database": 4.0, 8 | "Average Columns per Table": 5.54, 9 | "Average Records per Database": 28460.35, 10 | "Average Tables per Query": 1.81, 11 | "Average Selects per Query": 1.19, 12 | "Average Aggs per Query": 0.52, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.0 15 | } -------------------------------------------------------------------------------- /report/DuSQL/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 25003, 3 | "Number of Queries": 20308, 4 | "Number of Questions / Number of Queries": 1.2, 5 | "Total Databases": 208, 6 | "Total Tables": 840, 7 | "Average Tables per Database": 4.038461538461538, 8 | "Average Columns per Table": 5.294047619047619, 9 | "Average Records per Database": 20.192307692307693, 10 | "Average Tables per Query": 1.49, 11 | "Average Selects per Query": 1.25, 12 | "Average Aggs per Query": 0.73, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.3 15 | } -------------------------------------------------------------------------------- /report/FIBEN/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 300, 3 | "Number of Queries": 233, 4 | "Number of Questions / Number of Queries": 1.3, 5 | "Total Databases": 1, 6 | "Total Tables": 152, 7 | "Average Tables per Database": 152.0, 8 | "Average Columns per Table": 1.0, 9 | "Average Records per Database": 11668125.0, 10 | "Average Tables per Query": 5.59, 11 | "Average Selects per Query": 1.56, 12 | "Average Aggs per Query": 0.97, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.04 15 | } -------------------------------------------------------------------------------- /report/GeoQuery/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 877, 3 | "Number of Queries": 246, 4 | "Number of Questions / Number of Queries": 3.6, 5 | "Total Databases": 1, 6 | "Total Tables": 7, 7 | "Average Tables per Database": 7.0, 8 | "Average Columns per Table": 4.14, 9 | "Average Records per Database": 937.0, 10 | "Average Tables per Query": 2.22, 11 | "Average Selects per Query": 2.19, 12 | "Average Aggs per Query": 0.92, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math 
Computations per Query": 0.01 15 | } -------------------------------------------------------------------------------- /report/IMDB/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 131, 3 | "Number of Queries": 89, 4 | "Number of Questions / Number of Queries": 1.5, 5 | "Total Databases": 1, 6 | "Total Tables": 17, 7 | "Average Tables per Database": 17.0, 8 | "Average Columns per Table": 3.94, 9 | "Average Records per Database": 40147386.0, 10 | "Average Tables per Query": 2.91, 11 | "Average Selects per Query": 1.01, 12 | "Average Aggs per Query": 0.3, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.0 15 | } -------------------------------------------------------------------------------- /report/KaggleDBQA/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 272, 3 | "Number of Queries": 249, 4 | "Number of Questions / Number of Queries": 1.1, 5 | "Total Databases": 8, 6 | "Total Tables": 17, 7 | "Average Tables per Database": 2.12, 8 | "Average Columns per Table": 10.53, 9 | "Average Records per Database": 595075.12, 10 | "Average Tables per Query": 1.25, 11 | "Average Selects per Query": 1.05, 12 | "Average Aggs per Query": 0.69, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.04 15 | } -------------------------------------------------------------------------------- /report/MIMICSQL/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 20000, 3 | "Number of Queries": 10000, 4 | "Number of Questions / Number of Queries": 2.0, 5 | "Total Databases": -1, 6 | "Total Tables": -1, 7 | "Average Tables per Database": 1.0, 8 | "Average Columns per Table": -0.0, 9 | "Average Records per Database": -0.0, 10 | "Average Tables per Query": 1.74, 11 | "Average Selects per Query": 1.0, 12 | "Average Aggs per Query": 0.84, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.0 15 | } -------------------------------------------------------------------------------- /report/MTTEQL/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 489076, 3 | "Number of Queries": 4525, 4 | "Number of Questions / Number of Queries": 108.1, 5 | "Total Databases": 489076, 6 | "Total Tables": 3279004, 7 | "Average Tables per Database": 6.704487646091814, 8 | "Average Columns per Table": 5.506181755191515, 9 | "Average Records per Database": 0.0, 10 | "Average Tables per Query": 1.69, 11 | "Average Selects per Query": 1.15, 12 | "Average Aggs per Query": 0.53, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.0 15 | } -------------------------------------------------------------------------------- /report/PAUQ/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 9876, 3 | "Number of Queries": 5497, 4 | "Number of Questions / Number of Queries": 1.8, 5 | "Total Databases": 166, 6 | "Total Tables": 876, 7 | "Average Tables per Database": 5.28, 8 | "Average Columns per Table": 5.14, 9 | "Average Records per Database": 9692.61, 10 | "Average Tables per Query": 1.82, 11 | "Average Selects per Query": 1.17, 12 | "Average Aggs per Query": 0.53, 13 | "Average Scalar Functions per Query": 0.0, 14 
| "Average Math Computations per Query": 0.0 15 | } -------------------------------------------------------------------------------- /report/PortugueseSpider/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 9693, 3 | "Number of Queries": 5275, 4 | "Number of Questions / Number of Queries": 1.8, 5 | "Total Databases": 166, 6 | "Total Tables": 876, 7 | "Average Tables per Database": 5.28, 8 | "Average Columns per Table": 5.14, 9 | "Average Records per Database": 9664.73, 10 | "Average Tables per Query": 1.85, 11 | "Average Selects per Query": 1.17, 12 | "Average Aggs per Query": 0.54, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.0 15 | } -------------------------------------------------------------------------------- /report/Restaurants/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 378, 3 | "Number of Queries": 23, 4 | "Number of Questions / Number of Queries": 16.4, 5 | "Total Databases": 1, 6 | "Total Tables": 3, 7 | "Average Tables per Database": 3.0, 8 | "Average Columns per Table": 4.0, 9 | "Average Records per Database": 19295.0, 10 | "Average Tables per Query": 2.43, 11 | "Average Selects per Query": 1.17, 12 | "Average Aggs per Query": 0.35, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.0 15 | } -------------------------------------------------------------------------------- /report/SEDE/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 12023, 3 | "Number of Queries": 11421, 4 | "Number of Questions / Number of Queries": 1.1, 5 | "Total Databases": 1, 6 | "Total Tables": 29, 7 | "Average Tables per Database": 29.0, 8 | "Average Columns per Table": 7.275862068965517, 9 | "Average Records per Database": 0.0, 10 | "Average Tables per Query": 1.9, 11 | "Average Selects per Query": 1.29, 12 | "Average Aggs per Query": 0.94, 13 | "Average Scalar Functions per Query": 0.49, 14 | "Average Math Computations per Query": 0.49 15 | } -------------------------------------------------------------------------------- /report/SParC/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 10228, 3 | "Number of Queries": 8981, 4 | "Number of Questions / Number of Queries": 1.1, 5 | "Total Databases": 166, 6 | "Total Tables": 876, 7 | "Average Tables per Database": 5.28, 8 | "Average Columns per Table": 5.14, 9 | "Average Records per Database": 9664.73, 10 | "Average Tables per Query": 1.58, 11 | "Average Selects per Query": 1.1, 12 | "Average Aggs per Query": 0.44, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.0 15 | } -------------------------------------------------------------------------------- /report/SQUALL/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 11276, 3 | "Number of Queries": 8296, 4 | "Number of Questions / Number of Queries": 1.4, 5 | "Total Databases": 2108, 6 | "Total Tables": 4028, 7 | "Average Tables per Database": 1.91, 8 | "Average Columns per Table": 9.18, 9 | "Average Records per Database": 70.81, 10 | "Average Tables per Query": 1.22, 11 | "Average Selects per Query": 1.29, 12 | "Average Aggs per Query": 0.4, 13 | "Average Scalar Functions per Query": 
0.03, 14 | "Average Math Computations per Query": 0.16 15 | } -------------------------------------------------------------------------------- /report/Scholar/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 817, 3 | "Number of Queries": 193, 4 | "Number of Questions / Number of Queries": 4.2, 5 | "Total Databases": 1, 6 | "Total Tables": 10, 7 | "Average Tables per Database": 10.0, 8 | "Average Columns per Table": 2.5, 9 | "Average Records per Database": 147416275.0, 10 | "Average Tables per Query": 3.38, 11 | "Average Selects per Query": 1.02, 12 | "Average Aggs per Query": 0.68, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.02 15 | } -------------------------------------------------------------------------------- /report/ScienceBenchmark/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 5031, 3 | "Number of Queries": 3654, 4 | "Number of Questions / Number of Queries": 1.4, 5 | "Total Databases": -1, 6 | "Total Tables": -1, 7 | "Average Tables per Database": 1.0, 8 | "Average Columns per Table": -0.0, 9 | "Average Records per Database": -0.0, 10 | "Average Tables per Query": 1.45, 11 | "Average Selects per Query": 1.0, 12 | "Average Aggs per Query": 0.24, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.07 15 | } -------------------------------------------------------------------------------- /report/Spider/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 11840, 3 | "Number of Queries": 6448, 4 | "Number of Questions / Number of Queries": 1.8, 5 | "Total Databases": 206, 6 | "Total Tables": 1056, 7 | "Average Tables per Database": 5.13, 8 | "Average Columns per Table": 5.01, 9 | "Average Records per Database": 8980.19, 10 | "Average Tables per Query": 1.83, 11 | "Average Selects per Query": 1.17, 12 | "Average Aggs per Query": 0.54, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.0 15 | } -------------------------------------------------------------------------------- /report/SpiderDK/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 535, 3 | "Number of Queries": 283, 4 | "Number of Questions / Number of Queries": 1.9, 5 | "Total Databases": 169, 6 | "Total Tables": 887, 7 | "Average Tables per Database": 5.25, 8 | "Average Columns per Table": 5.14, 9 | "Average Records per Database": 9493.83, 10 | "Average Tables per Query": 1.71, 11 | "Average Selects per Query": 1.16, 12 | "Average Aggs per Query": 0.54, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.0 15 | } -------------------------------------------------------------------------------- /report/SpiderRealistic/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 508, 3 | "Number of Queries": 290, 4 | "Number of Questions / Number of Queries": 1.8, 5 | "Total Databases": 166, 6 | "Total Tables": 876, 7 | "Average Tables per Database": 5.28, 8 | "Average Columns per Table": 5.14, 9 | "Average Records per Database": 9664.73, 10 | "Average Tables per Query": 1.79, 11 | "Average Selects per Query": 1.21, 12 | "Average Aggs per Query": 0.5, 13 | "Average Scalar Functions per 
Query": 0.0, 14 | "Average Math Computations per Query": 0.0 15 | } -------------------------------------------------------------------------------- /report/SpiderSyn/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 1034, 3 | "Number of Queries": 550, 4 | "Number of Questions / Number of Queries": 1.9, 5 | "Total Databases": 166, 6 | "Total Tables": 876, 7 | "Average Tables per Database": 5.28, 8 | "Average Columns per Table": 5.14, 9 | "Average Records per Database": 9664.73, 10 | "Average Tables per Query": 1.68, 11 | "Average Selects per Query": 1.17, 12 | "Average Aggs per Query": 0.59, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.0 15 | } -------------------------------------------------------------------------------- /report/ViText2SQL/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 9693, 3 | "Number of Queries": 5223, 4 | "Number of Questions / Number of Queries": 1.9, 5 | "Total Databases": 166, 6 | "Total Tables": 876, 7 | "Average Tables per Database": 5.28, 8 | "Average Columns per Table": 5.14, 9 | "Average Records per Database": 9664.73, 10 | "Average Tables per Query": 1.17, 11 | "Average Selects per Query": 1.12, 12 | "Average Aggs per Query": 0.54, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.0 15 | } -------------------------------------------------------------------------------- /report/WikiSQL/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 80654, 3 | "Number of Queries": 80257, 4 | "Number of Questions / Number of Queries": 1.0, 5 | "Total Databases": 26531, 6 | "Total Tables": 26531, 7 | "Average Tables per Database": 1.0, 8 | "Average Columns per Table": 6.34, 9 | "Average Records per Database": 17.29, 10 | "Average Tables per Query": 1.0, 11 | "Average Selects per Query": 1.0, 12 | "Average Aggs per Query": 0.28, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.0 15 | } -------------------------------------------------------------------------------- /report/Yelp/report.json: -------------------------------------------------------------------------------- 1 | { 2 | "Number of Questions": 128, 3 | "Number of Queries": 110, 4 | "Number of Questions / Number of Queries": 1.2, 5 | "Total Databases": 1, 6 | "Total Tables": 8, 7 | "Average Tables per Database": 8.0, 8 | "Average Columns per Table": 5.0, 9 | "Average Records per Database": 4823945.0, 10 | "Average Tables per Query": 2.41, 11 | "Average Selects per Query": 1.0, 12 | "Average Aggs per Query": 0.45, 13 | "Average Scalar Functions per Query": 0.0, 14 | "Average Math Computations per Query": 0.0 15 | } -------------------------------------------------------------------------------- /slides/NL2SQL_handbook.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/HKUSTDial/NL2SQL_Handbook/a9b8acbf1fa98d5a31b3f50e18d6b2ec988b6943/slides/NL2SQL_handbook.pdf -------------------------------------------------------------------------------- /src/dataset_analyze/__pycache__/dataset.cpython-310.pyc: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/HKUSTDial/NL2SQL_Handbook/a9b8acbf1fa98d5a31b3f50e18d6b2ec988b6943/src/dataset_analyze/__pycache__/dataset.cpython-310.pyc -------------------------------------------------------------------------------- /src/dataset_analyze/__pycache__/dataset.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/HKUSTDial/NL2SQL_Handbook/a9b8acbf1fa98d5a31b3f50e18d6b2ec988b6943/src/dataset_analyze/__pycache__/dataset.cpython-39.pyc -------------------------------------------------------------------------------- /src/dataset_analyze/__pycache__/sql_parser.cpython-310.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/HKUSTDial/NL2SQL_Handbook/a9b8acbf1fa98d5a31b3f50e18d6b2ec988b6943/src/dataset_analyze/__pycache__/sql_parser.cpython-310.pyc -------------------------------------------------------------------------------- /src/dataset_analyze/__pycache__/sql_parser.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/HKUSTDial/NL2SQL_Handbook/a9b8acbf1fa98d5a31b3f50e18d6b2ec988b6943/src/dataset_analyze/__pycache__/sql_parser.cpython-311.pyc -------------------------------------------------------------------------------- /src/dataset_analyze/__pycache__/sql_parser.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/HKUSTDial/NL2SQL_Handbook/a9b8acbf1fa98d5a31b3f50e18d6b2ec988b6943/src/dataset_analyze/__pycache__/sql_parser.cpython-39.pyc -------------------------------------------------------------------------------- /src/dataset_analyze/__pycache__/utils.cpython-310.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/HKUSTDial/NL2SQL_Handbook/a9b8acbf1fa98d5a31b3f50e18d6b2ec988b6943/src/dataset_analyze/__pycache__/utils.cpython-310.pyc -------------------------------------------------------------------------------- /src/dataset_analyze/__pycache__/utils.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/HKUSTDial/NL2SQL_Handbook/a9b8acbf1fa98d5a31b3f50e18d6b2ec988b6943/src/dataset_analyze/__pycache__/utils.cpython-311.pyc -------------------------------------------------------------------------------- /src/dataset_analyze/__pycache__/utils.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/HKUSTDial/NL2SQL_Handbook/a9b8acbf1fa98d5a31b3f50e18d6b2ec988b6943/src/dataset_analyze/__pycache__/utils.cpython-39.pyc -------------------------------------------------------------------------------- /src/dataset_analyze/analyze.py: -------------------------------------------------------------------------------- 1 | from utils import * 2 | from dataset import * 3 | import pandas as pd 4 | import os 5 | 6 | 7 | ALL_DATASETS = [ 8 | # ATIS(), 9 | # GeoQuery(), 10 | # Restaurants(), 11 | # Academic(), 12 | # IMDB(), 13 | # Yelp(), 14 | # Scholar(), 15 | # WikiSQL(), 16 | # Advising(), 17 | # Spider(), 18 | # BIRD(), 19 | # CSpider(), 20 | # SParC(), 21 | # CoSpider(), 22 | # SpiderSyn(), 23 | # SpiderRealistic(), 24 | # SpiderDK(), 25 | # DrSpider(), 26 | # SQUALL(), 27 | # FIBEN(), 28 | # KaggleDBQA(), 29 | # SEDE(), 30 | # MTTEQL(), 31 | # AmbiQT(), 32 | # 
ScienceBenchmark(), 33 | # BULL(), 34 | # BookSQL(), 35 | # PAUQ(), 36 | # CHASE(), 37 | # DuSQL(), 38 | # ViText2SQL(), 39 | # MIMICSQL(), 40 | # PortugueseSpider(), 41 | Archer() 42 | ] 43 | 44 | def report_dataset(dataset: Dataset): 45 | if dataset.get_all_db_paths(): 46 | report_database_complexity = generate_report_database_complexity(dataset.get_all_db_paths(), is_wikisql=isinstance(dataset, WikiSQL)) 47 | else: 48 | report_database_complexity = { 49 | "Total Databases": dataset._total_databases, 50 | "Total Tables": dataset._total_tables, 51 | "Average Tables per Database": dataset._avg_tables_per_db, 52 | "Average Columns per Table": dataset._avg_columns_per_table, 53 | "Average Records per Database": dataset._avg_records_per_db 54 | } 55 | if isinstance(dataset, WikiSQL): 56 | _queries = [query.split("WHERE")[0].strip() for query in dataset.get_all_queries()] 57 | report_query_complexity = generate_report_query_complexity(_queries) 58 | else: 59 | report_query_complexity = generate_report_query_complexity(dataset.get_all_queries()) 60 | num_questions = len(dataset.get_all_questions()) 61 | num_quries = len(dataset.get_all_queries()) 62 | dataset_report = { 63 | "Number of Questions": num_questions, 64 | "Number of Queries": num_quries, 65 | "Number of Questions / Number of Queries": round(num_questions / num_quries, 1) 66 | } 67 | dataset_report.update(report_database_complexity) 68 | dataset_report.update(report_query_complexity) 69 | return dataset_report 70 | 71 | 72 | if __name__ == "__main__": 73 | for dataset in ALL_DATASETS: 74 | print(f"analyze [{dataset.__class__.__name__}] dataset...") 75 | report = report_dataset(dataset) 76 | dir = os.path.join("report", dataset.__class__.__name__) 77 | os.makedirs(dir, exist_ok=True) 78 | json.dump(report, open(os.path.join(dir, "report.json"), "w", encoding="utf-8"), indent=4) -------------------------------------------------------------------------------- /src/dataset_analyze/dataset.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import pandas as pd 4 | 5 | 6 | class Dataset: 7 | 8 | def get_all_questions(self): 9 | raise NotImplementedError() 10 | 11 | def get_all_queries(self): 12 | raise NotImplementedError() 13 | 14 | def get_all_db_paths(self): 15 | raise NotImplementedError() 16 | 17 | 18 | class ATIS(Dataset): 19 | 20 | ROOT_PATH = "data/atis" 21 | 22 | def get_all_questions(self): 23 | with open(os.path.join(self.ROOT_PATH, "atis.json"), "r", encoding="utf-8") as f: 24 | data = json.load(f) 25 | all_questions = [] 26 | for item in data: 27 | for sentence in item["sentences"]: 28 | text = sentence["text"] 29 | for key, value in sentence["variables"].items(): 30 | text = text.replace(key, value) 31 | all_questions.append(text) 32 | return all_questions 33 | 34 | def get_all_queries(self): 35 | with open(os.path.join(self.ROOT_PATH, "atis.json"), "r", encoding="utf-8") as f: 36 | data = json.load(f) 37 | all_queries = [] 38 | for item in data: 39 | all_queries.append(item["sql"][0]) 40 | return all_queries 41 | 42 | def get_all_db_paths(self): 43 | return [os.path.join(self.ROOT_PATH, "atis-db.added-in-2020.sqlite")] 44 | 45 | 46 | class GeoQuery(Dataset): 47 | 48 | ROOT_PATH = "data/geoquery" 49 | 50 | def get_all_questions(self): 51 | with open(os.path.join(self.ROOT_PATH, "geography.json"), "r", encoding="utf-8") as f: 52 | data = json.load(f) 53 | all_questions = [] 54 | for item in data: 55 | for sentence in item["sentences"]: 56 | text = 
sentence["text"] 57 | for key, value in sentence["variables"].items(): 58 | text = text.replace(key, value) 59 | all_questions.append(text) 60 | return all_questions 61 | 62 | def get_all_queries(self): 63 | with open(os.path.join(self.ROOT_PATH, "geography.json"), "r", encoding="utf-8") as f: 64 | data = json.load(f) 65 | 66 | all_queries = [] 67 | for item in data: 68 | all_queries.append(item["sql"][0]) 69 | 70 | return all_queries 71 | 72 | def get_all_db_paths(self): 73 | return [os.path.join(self.ROOT_PATH, "geography-db.added-in-2020.sqlite")] 74 | 75 | 76 | class Restaurants(Dataset): 77 | 78 | ROOT_PATH = "data/restaurants" 79 | 80 | def get_all_questions(self): 81 | with open(os.path.join(self.ROOT_PATH, "restaurants.json"), "r", encoding="utf-8") as f: 82 | data = json.load(f) 83 | all_questions = [] 84 | for item in data: 85 | for sentence in item["sentences"]: 86 | text = sentence["text"] 87 | for key, value in sentence["variables"].items(): 88 | text = text.replace(key, value) 89 | all_questions.append(text) 90 | return all_questions 91 | 92 | def get_all_queries(self): 93 | with open(os.path.join(self.ROOT_PATH, "restaurants.json"), "r", encoding="utf-8") as f: 94 | data = json.load(f) 95 | 96 | all_queries = [] 97 | for item in data: 98 | all_queries.append(item["sql"][0]) 99 | 100 | return all_queries 101 | 102 | def get_all_db_paths(self): 103 | return [os.path.join(self.ROOT_PATH, "restaurants-db.added-in-2020.sqlite")] 104 | 105 | 106 | class Scholar(Dataset): 107 | 108 | ROOT_PATH = "data/scholar" 109 | 110 | def get_all_questions(self): 111 | with open(os.path.join(self.ROOT_PATH, "scholar.json"), "r", encoding="utf-8") as f: 112 | data = json.load(f) 113 | all_questions = [] 114 | for item in data: 115 | for sentence in item["sentences"]: 116 | text = sentence["text"] 117 | for key, value in sentence["variables"].items(): 118 | text = text.replace(key, value) 119 | all_questions.append(text) 120 | return all_questions 121 | 122 | def get_all_queries(self): 123 | with open(os.path.join(self.ROOT_PATH, "scholar.json"), "r", encoding="utf-8") as f: 124 | data = json.load(f) 125 | 126 | all_queries = [] 127 | for item in data: 128 | all_queries.append(item["sql"][0]) 129 | 130 | return all_queries 131 | 132 | def get_all_db_paths(self): 133 | return [os.path.join(self.ROOT_PATH, "scholar.db")] 134 | 135 | 136 | class Academic(Dataset): 137 | 138 | ROOT_PATH = "data/academic" 139 | 140 | def get_all_questions(self): 141 | with open(os.path.join(self.ROOT_PATH, "academic.json"), "r", encoding="utf-8") as f: 142 | data = json.load(f) 143 | all_questions = [] 144 | for item in data: 145 | for sentence in item["sentences"]: 146 | text = sentence["text"] 147 | for key, value in sentence["variables"].items(): 148 | text = text.replace(key, value) 149 | all_questions.append(text) 150 | return all_questions 151 | 152 | def get_all_queries(self): 153 | with open(os.path.join(self.ROOT_PATH, "academic.json"), "r", encoding="utf-8") as f: 154 | data = json.load(f) 155 | 156 | all_queries = [] 157 | for item in data: 158 | all_queries.append(item["sql"][0]) 159 | 160 | return all_queries 161 | 162 | def get_all_db_paths(self): 163 | return [os.path.join(self.ROOT_PATH, "MAS.db")] 164 | 165 | 166 | class IMDB(Dataset): 167 | 168 | ROOT_PATH = "data/imdb" 169 | 170 | def get_all_questions(self): 171 | with open(os.path.join(self.ROOT_PATH, "imdb.json"), "r", encoding="utf-8") as f: 172 | data = json.load(f) 173 | all_questions = [] 174 | for item in data: 175 | for sentence in 
item["sentences"]: 176 | text = sentence["text"] 177 | for key, value in sentence["variables"].items(): 178 | text = text.replace(key, value) 179 | all_questions.append(text) 180 | return all_questions 181 | 182 | def get_all_queries(self): 183 | with open(os.path.join(self.ROOT_PATH, "imdb.json"), "r", encoding="utf-8") as f: 184 | data = json.load(f) 185 | 186 | all_queries = [] 187 | for item in data: 188 | all_queries.append(item["sql"][0]) 189 | 190 | return all_queries 191 | 192 | def get_all_db_paths(self): 193 | return [os.path.join(self.ROOT_PATH, "IMDB.db")] 194 | 195 | 196 | class Yelp(Dataset): 197 | 198 | ROOT_PATH = "data/yelp" 199 | 200 | def get_all_questions(self): 201 | with open(os.path.join(self.ROOT_PATH, "yelp.json"), "r", encoding="utf-8") as f: 202 | data = json.load(f) 203 | all_questions = [] 204 | for item in data: 205 | for sentence in item["sentences"]: 206 | text = sentence["text"] 207 | for key, value in sentence["variables"].items(): 208 | text = text.replace(key, value) 209 | all_questions.append(text) 210 | return all_questions 211 | 212 | def get_all_queries(self): 213 | with open(os.path.join(self.ROOT_PATH, "yelp.json"), "r", encoding="utf-8") as f: 214 | data = json.load(f) 215 | 216 | all_queries = [] 217 | for item in data: 218 | all_queries.append(item["sql"][0]) 219 | 220 | return all_queries 221 | 222 | def get_all_db_paths(self): 223 | return [os.path.join(self.ROOT_PATH, "YELP.db")] 224 | 225 | 226 | class Advising(Dataset): 227 | 228 | ROOT_PATH = "data/advising" 229 | 230 | def get_all_questions(self): 231 | with open(os.path.join(self.ROOT_PATH, "advising.json"), "r", encoding="utf-8") as f: 232 | data = json.load(f) 233 | all_questions = [] 234 | for item in data: 235 | for sentence in item["sentences"]: 236 | text = sentence["text"] 237 | for key, value in sentence["variables"].items(): 238 | text = text.replace(key, value) 239 | all_questions.append(text) 240 | return all_questions 241 | 242 | def get_all_queries(self): 243 | with open(os.path.join(self.ROOT_PATH, "advising.json"), "r", encoding="utf-8") as f: 244 | data = json.load(f) 245 | 246 | all_queries = [] 247 | for item in data: 248 | all_queries.append(item["sql"][0]) 249 | 250 | return all_queries 251 | 252 | def get_all_db_paths(self): 253 | return [os.path.join(self.ROOT_PATH, "advising-db.added-in-2020.sqlite")] 254 | 255 | 256 | class Spider(Dataset): 257 | 258 | ROOT_PATH = "data/spider" 259 | 260 | def get_all_questions(self): 261 | data_json = [] 262 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "train_spider.json"), "r", encoding="utf-8"))) 263 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "train_others.json"), "r", encoding="utf-8"))) 264 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "dev.json"), "r", encoding="utf-8"))) 265 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "test_data", "dev.json"), "r", encoding="utf-8"))) 266 | all_questions = [item["question"] for item in data_json] 267 | return all_questions 268 | 269 | def get_all_queries(self): 270 | data_json = [] 271 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "train_spider.json"), "r", encoding="utf-8"))) 272 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "train_others.json"), "r", encoding="utf-8"))) 273 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "dev.json"), "r", encoding="utf-8"))) 274 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "test_data", "dev.json"), "r", 
encoding="utf-8"))) 275 | all_queries = list(set([item["query"].strip() for item in data_json])) 276 | return all_queries 277 | 278 | def get_all_db_paths(self): 279 | # Note that, all databases have been in "test_database" directory 280 | db_paths = [os.path.join(self.ROOT_PATH, "test_database", db_id, f"{db_id}.sqlite") for db_id in os.listdir(os.path.join(self.ROOT_PATH, "test_database"))] 281 | # db_paths.extend( 282 | # [os.path.join(self.ROOT_PATH, "test_database", db_id) for db_id in os.listdir(os.path.join(self.ROOT_PATH, "test_database"))] 283 | # ) 284 | return db_paths 285 | 286 | 287 | class WikiSQL(Dataset): 288 | 289 | ROOT_PATH = "data/wikisql" 290 | 291 | agg_ops = ['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG'] 292 | cond_ops = ['=', '>', '<', 'OP'] 293 | 294 | def get_all_questions(self): 295 | all_data_json = [] 296 | for data_file in ["train.jsonl", "dev.jsonl", "test.jsonl"]: 297 | with open(os.path.join(self.ROOT_PATH, data_file), "r", encoding="utf-8") as f: 298 | lines = f.readlines() 299 | all_data_json.extend([json.loads(line) for line in lines]) 300 | return [item["question"] for item in all_data_json] 301 | 302 | def get_all_queries(self): 303 | all_queries = [] 304 | 305 | tables = dict() 306 | with open(os.path.join(self.ROOT_PATH, "train.tables.jsonl"), "r", encoding="utf-8") as f: 307 | for line in f.readlines(): 308 | item = json.loads(line) 309 | tables[item["id"]] = item 310 | with open(os.path.join(self.ROOT_PATH, "train.jsonl"), "r", encoding="utf-8") as f: 311 | for line in f.readlines(): 312 | item = json.loads(line) 313 | sel_col_name = tables[item["table_id"]]["header"][item["sql"]["sel"]] 314 | agg_name = self.agg_ops[item["sql"]["agg"]] 315 | table_name = tables[item["table_id"]].get("name", item["table_id"]) 316 | if agg_name: 317 | rep = 'SELECT {agg}(`{sel}`) FROM `{table}`'.format( 318 | agg=agg_name, 319 | sel=sel_col_name, 320 | table=table_name 321 | ) 322 | else: 323 | rep = 'SELECT `{sel}` FROM `{table}`'.format( 324 | sel=sel_col_name, 325 | table=table_name 326 | ) 327 | if item["sql"]["conds"]: 328 | rep += ' WHERE ' + ' AND '.join(["`{}` {} {}".format(tables[item["table_id"]]["header"][i], self.cond_ops[o], v) for i, o, v in item["sql"]["conds"]]) 329 | all_queries.append(rep) 330 | 331 | 332 | tables = dict() 333 | with open(os.path.join(self.ROOT_PATH, "dev.tables.jsonl"), "r", encoding="utf-8") as f: 334 | for line in f.readlines(): 335 | item = json.loads(line) 336 | tables[item["id"]] = item 337 | with open(os.path.join(self.ROOT_PATH, "dev.jsonl"), "r", encoding="utf-8") as f: 338 | for line in f.readlines(): 339 | item = json.loads(line) 340 | sel_col_name = tables[item["table_id"]]["header"][item["sql"]["sel"]] 341 | agg_name = self.agg_ops[item["sql"]["agg"]] 342 | table_name = tables[item["table_id"]].get("name", item["table_id"]) 343 | if agg_name: 344 | rep = 'SELECT {agg}(`{sel}`) FROM `{table}`'.format( 345 | agg=agg_name, 346 | sel=sel_col_name, 347 | table=table_name 348 | ) 349 | else: 350 | rep = 'SELECT `{sel}` FROM `{table}`'.format( 351 | sel=sel_col_name, 352 | table=table_name 353 | ) 354 | if item["sql"]["conds"]: 355 | rep += ' WHERE ' + ' AND '.join(["`{}` {} {}".format(tables[item["table_id"]]["header"][i], self.cond_ops[o], v) for i, o, v in item["sql"]["conds"]]) 356 | all_queries.append(rep) 357 | 358 | tables = dict() 359 | with open(os.path.join(self.ROOT_PATH, "test.tables.jsonl"), "r", encoding="utf-8") as f: 360 | for line in f.readlines(): 361 | item = json.loads(line) 362 | tables[item["id"]] = item 363 
| with open(os.path.join(self.ROOT_PATH, "test.jsonl"), "r", encoding="utf-8") as f: 364 | for line in f.readlines(): 365 | item = json.loads(line) 366 | sel_col_name = tables[item["table_id"]]["header"][item["sql"]["sel"]] 367 | agg_name = self.agg_ops[item["sql"]["agg"]] 368 | table_name = tables[item["table_id"]].get("name", item["table_id"]) 369 | if agg_name: 370 | rep = 'SELECT {agg}(`{sel}`) FROM `{table}`'.format( 371 | agg=agg_name, 372 | sel=sel_col_name, 373 | table=table_name 374 | ) 375 | else: 376 | rep = 'SELECT `{sel}` FROM `{table}`'.format( 377 | sel=sel_col_name, 378 | table=table_name 379 | ) 380 | if item["sql"]["conds"]: 381 | rep += ' WHERE ' + ' AND '.join(["`{}` {} {}".format(tables[item["table_id"]]["header"][i], self.cond_ops[o], v) for i, o, v in item["sql"]["conds"]]) 382 | all_queries.append(rep) 383 | 384 | all_queries = list(set(all_queries)) 385 | return all_queries 386 | 387 | def get_all_db_paths(self): 388 | return [os.path.join(self.ROOT_PATH, "train.db"), 389 | os.path.join(self.ROOT_PATH, "dev.db"), 390 | os.path.join(self.ROOT_PATH, "test.db")] 391 | 392 | 393 | class BIRD(Dataset): 394 | 395 | ROOT_PATH = "data/bird" 396 | 397 | def get_all_questions(self): 398 | data_json = [] 399 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "train", "train.json"), "r", encoding="utf-8"))) 400 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "dev", "dev.json"), "r", encoding="utf-8"))) 401 | all_questions = [item["question"] for item in data_json] 402 | return all_questions 403 | 404 | def get_all_queries(self): 405 | data_json = [] 406 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "train", "train.json"), "r", encoding="utf-8"))) 407 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "dev", "dev.json"), "r", encoding="utf-8"))) 408 | all_queries = list(set([item["SQL"].strip() for item in data_json])) 409 | return all_queries 410 | 411 | def get_all_db_paths(self): 412 | db_paths = [os.path.join(self.ROOT_PATH, "dev", "dev_databases", db_id, f"{db_id}.sqlite") for db_id in os.listdir(os.path.join(self.ROOT_PATH, "dev", "dev_databases"))] 413 | db_paths.extend( 414 | [os.path.join(self.ROOT_PATH, "train", "train_databases", db_id, f"{db_id}.sqlite") for db_id in os.listdir(os.path.join(self.ROOT_PATH, "train", "train_databases"))] 415 | ) 416 | return db_paths 417 | 418 | 419 | class CSpider(Dataset): 420 | 421 | ROOT_PATH = "data/cspider" 422 | 423 | def get_all_questions(self): 424 | data_json = [] 425 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "train.json"), "r", encoding="utf-8"))) 426 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "dev.json"), "r", encoding="utf-8"))) 427 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "test_data", "test.json"), "r", encoding="utf-8"))) 428 | all_questions = [item["question"] for item in data_json] 429 | return all_questions 430 | 431 | def get_all_queries(self): 432 | data_json = [] 433 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "train.json"), "r", encoding="utf-8"))) 434 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "dev.json"), "r", encoding="utf-8"))) 435 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "test_data", "test.json"), "r", encoding="utf-8"))) 436 | all_queries = list(set([item["query"].strip() for item in data_json])) 437 | return all_queries 438 | 439 | def get_all_db_paths(self): 440 | db_paths = [os.path.join(self.ROOT_PATH, "test_database", db_id, 
f"{db_id}.sqlite") for db_id in os.listdir(os.path.join(self.ROOT_PATH, "test_database"))] 441 | db_paths.extend( 442 | [os.path.join(self.ROOT_PATH, "database", db_id, f"{db_id}.sqlite") for db_id in os.listdir(os.path.join(self.ROOT_PATH, "database"))] 443 | ) 444 | return db_paths 445 | 446 | 447 | class SParC(Dataset): 448 | 449 | ROOT_PATH = "data/sparc" 450 | 451 | def get_all_questions(self): 452 | data_json = [] 453 | for item in json.load(open(os.path.join(self.ROOT_PATH, "train.json"), "r", encoding="utf-8")): 454 | interaction = item["interaction"] 455 | for turn in interaction: 456 | data_json.append({ 457 | "question": turn["utterance"], 458 | "query": turn["query"] 459 | }) 460 | for item in json.load(open(os.path.join(self.ROOT_PATH, "dev.json"), "r", encoding="utf-8")): 461 | interaction = item["interaction"] 462 | for turn in interaction: 463 | data_json.append({ 464 | "question": turn["utterance"], 465 | "query": turn["query"] 466 | }) 467 | all_questions = [item["question"] for item in data_json] 468 | return all_questions 469 | 470 | def get_all_queries(self): 471 | data_json = [] 472 | for item in json.load(open(os.path.join(self.ROOT_PATH, "train.json"), "r", encoding="utf-8")): 473 | interaction = item["interaction"] 474 | for turn in interaction: 475 | data_json.append({ 476 | "question": turn["utterance"], 477 | "query": turn["query"] 478 | }) 479 | for item in json.load(open(os.path.join(self.ROOT_PATH, "dev.json"), "r", encoding="utf-8")): 480 | interaction = item["interaction"] 481 | for turn in interaction: 482 | data_json.append({ 483 | "question": turn["utterance"], 484 | "query": turn["query"] 485 | }) 486 | all_queries = list(set([item["query"].strip() for item in data_json])) 487 | return all_queries 488 | 489 | def get_all_db_paths(self): 490 | db_paths = [os.path.join(self.ROOT_PATH, "database", db_id, f"{db_id}.sqlite") for db_id in os.listdir(os.path.join(self.ROOT_PATH, "database"))] 491 | return db_paths 492 | 493 | 494 | class CoSpider(Dataset): 495 | 496 | ROOT_PATH = "data/cospider" 497 | 498 | def get_all_questions(self): 499 | data_json = [] 500 | for item in json.load(open(os.path.join(self.ROOT_PATH, "sql_state_tracking", "cosql_train.json"), "r", encoding="utf-8")): 501 | interaction = item["interaction"] 502 | for turn in interaction: 503 | data_json.append({ 504 | "question": turn["utterance"], 505 | "query": turn["query"] 506 | }) 507 | for item in json.load(open(os.path.join(self.ROOT_PATH, "sql_state_tracking", "cosql_dev.json"), "r", encoding="utf-8")): 508 | interaction = item["interaction"] 509 | for turn in interaction: 510 | data_json.append({ 511 | "question": turn["utterance"], 512 | "query": turn["query"] 513 | }) 514 | all_questions = [item["question"] for item in data_json] 515 | return all_questions 516 | 517 | def get_all_queries(self): 518 | data_json = [] 519 | for item in json.load(open(os.path.join(self.ROOT_PATH, "sql_state_tracking", "cosql_train.json"), "r", encoding="utf-8")): 520 | interaction = item["interaction"] 521 | for turn in interaction: 522 | data_json.append({ 523 | "question": turn["utterance"], 524 | "query": turn["query"] 525 | }) 526 | for item in json.load(open(os.path.join(self.ROOT_PATH, "sql_state_tracking", "cosql_dev.json"), "r", encoding="utf-8")): 527 | interaction = item["interaction"] 528 | for turn in interaction: 529 | data_json.append({ 530 | "question": turn["utterance"], 531 | "query": turn["query"] 532 | }) 533 | # need to fix "> =" and "< =" issues 534 | all_queries = 
list(set([item["query"].strip().replace("> =", ">=").replace("< =", "<=") for item in data_json])) 535 | return all_queries 536 | 537 | def get_all_db_paths(self): 538 | db_paths = [os.path.join(self.ROOT_PATH, "database", db_id, f"{db_id}.sqlite") for db_id in os.listdir(os.path.join(self.ROOT_PATH, "database"))] 539 | return db_paths 540 | 541 | 542 | class SpiderSyn(Dataset): 543 | 544 | ROOT_PATH = "data/spider_syn" 545 | 546 | def get_all_questions(self): 547 | data_json = json.load(open(os.path.join(self.ROOT_PATH, "dev.json"), "r", encoding="utf-8")) 548 | all_questions = [item["SpiderSynQuestion"] for item in data_json] 549 | return all_questions 550 | 551 | def get_all_queries(self): 552 | data_json = json.load(open(os.path.join(self.ROOT_PATH, "dev.json"), "r", encoding="utf-8")) 553 | all_queries = list(set([item["query"].strip() for item in data_json])) 554 | return all_queries 555 | 556 | def get_all_db_paths(self): 557 | db_paths = [os.path.join(self.ROOT_PATH, "database", db_id, f"{db_id}.sqlite") for db_id in os.listdir(os.path.join(self.ROOT_PATH, "database"))] 558 | return db_paths 559 | 560 | 561 | class SpiderRealistic(Dataset): 562 | 563 | ROOT_PATH = "data/spider_realistic" 564 | 565 | def get_all_questions(self): 566 | data_json = json.load(open(os.path.join(self.ROOT_PATH, "spider_realistic.json"), "r", encoding="utf-8")) 567 | all_questions = [item["question"] for item in data_json] 568 | return all_questions 569 | 570 | def get_all_queries(self): 571 | data_json = json.load(open(os.path.join(self.ROOT_PATH, "spider_realistic.json"), "r", encoding="utf-8")) 572 | all_queries = list(set([item["query"].strip() for item in data_json])) 573 | return all_queries 574 | 575 | def get_all_db_paths(self): 576 | db_paths = [os.path.join(self.ROOT_PATH, "database", db_id, f"{db_id}.sqlite") for db_id in os.listdir(os.path.join(self.ROOT_PATH, "database"))] 577 | return db_paths 578 | 579 | 580 | class SpiderDK(Dataset): 581 | 582 | ROOT_PATH = "data/spider_dk" 583 | 584 | def get_all_questions(self): 585 | data_json = json.load(open(os.path.join(self.ROOT_PATH, "spider_dk.json"), "r", encoding="utf-8")) 586 | all_questions = [item["question"] for item in data_json] 587 | return all_questions 588 | 589 | def get_all_queries(self): 590 | data_json = json.load(open(os.path.join(self.ROOT_PATH, "spider_dk.json"), "r", encoding="utf-8")) 591 | all_queries = list(set([item["query"].strip() for item in data_json])) 592 | return all_queries 593 | 594 | def get_all_db_paths(self): 595 | db_paths = [os.path.join(self.ROOT_PATH, "database", db_id, f"{db_id}.sqlite") for db_id in os.listdir(os.path.join(self.ROOT_PATH, "database"))] 596 | return db_paths 597 | 598 | 599 | class DrSpider(Dataset): 600 | 601 | ROOT_PATH = "data/dr_spider" 602 | 603 | def get_all_questions(self): 604 | all_perturbations = os.listdir(os.path.join(self.ROOT_PATH)) 605 | data_json = [] 606 | for perturbation in all_perturbations: 607 | if perturbation.startswith("DB_"): 608 | question_file_name = "questions_post_perturbation.json" 609 | else: 610 | question_file_name = "questions_post_perturbation.json" 611 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, perturbation, question_file_name), "r", encoding="utf-8"))) 612 | 613 | all_questions = [item["question"] for item in data_json] 614 | return all_questions 615 | 616 | def get_all_queries(self): 617 | all_perturbations = os.listdir(os.path.join(self.ROOT_PATH)) 618 | data_json = [] 619 | for perturbation in all_perturbations: 620 | if 
621 | # questions file, so no branch on the directory name is needed here 622 | # (only the database directory differs; see get_all_db_paths below) 623 | question_file_name = "questions_post_perturbation.json" 624 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, perturbation, question_file_name), "r", encoding="utf-8"))) 625 | 626 | all_queries = list(set([item["query"].strip() for item in data_json])) 627 | return all_queries 628 | 629 | def get_all_db_paths(self): 630 | all_perturbations = os.listdir(os.path.join(self.ROOT_PATH)) 631 | db_paths = [] 632 | for perturbation in all_perturbations: 633 | if perturbation.startswith("DB_"): 634 | db_dir_name = "database_post_perturbation" 635 | else: 636 | db_dir_name = "databases" 637 | for db_id in os.listdir(os.path.join(self.ROOT_PATH, perturbation, db_dir_name)): 638 | if os.path.isdir(os.path.join(self.ROOT_PATH, perturbation, db_dir_name, db_id)): 639 | db_paths.append(os.path.join(self.ROOT_PATH, perturbation, db_dir_name, db_id, f"{db_id}.sqlite")) 640 | return db_paths 641 | 642 | 643 | class SQUALL(Dataset): 644 | 645 | ROOT_PATH = "data/squall" 646 | 647 | def get_all_questions(self): 648 | all_data_json = json.load(open(os.path.join(self.ROOT_PATH, "squall.json"), "r", encoding="utf-8")) 649 | return [" ".join(item["nl"]) for item in all_data_json] 650 | 651 | def get_all_queries(self): 652 | all_data_json = json.load(open(os.path.join(self.ROOT_PATH, "squall.json"), "r", encoding="utf-8")) 653 | all_queries = [] 654 | for item in all_data_json: 655 | sql = " ".join([tok[1] for tok in item["sql"]]) 656 | all_queries.append(sql.strip()) 657 | return list(set(all_queries)) 658 | 659 | def get_all_db_paths(self): 660 | db_paths = [os.path.join(self.ROOT_PATH, "db", db_file) for db_file in os.listdir(os.path.join(self.ROOT_PATH, "db"))] 661 | return db_paths 662 | 663 | 664 | class FIBEN(Dataset): 665 | 666 | ROOT_PATH = "data/fiben" 667 | 668 | def __init__(self): 669 | all_table_csv = os.listdir(os.path.join(self.ROOT_PATH, "data")) 670 | self._total_databases = 1 671 | self._total_tables = len(all_table_csv) 672 | all_table_dataframe = [] 673 | for table_csv in all_table_csv: 674 | try: 675 | df = pd.read_csv(os.path.join(self.ROOT_PATH, "data", table_csv), header=None, low_memory=False) 676 | all_table_dataframe.append(df) 677 | except pd.errors.EmptyDataError: 678 | all_table_dataframe.append(pd.DataFrame()) 679 | total_columns, total_records = 0, 0 680 | for table_dataframe in all_table_dataframe: 681 | total_columns += len(table_dataframe.columns) 682 | total_records += len(table_dataframe) 683 | self._avg_tables_per_db = self._total_tables / self._total_databases 684 | self._avg_columns_per_table = total_columns / self._total_tables 685 | self._avg_records_per_db = total_records / self._total_databases 686 | 687 | def get_all_questions(self): 688 | all_data_json = json.load(open(os.path.join(self.ROOT_PATH, "FIBEN_Queries.json"), "r", encoding="utf-8")) 689 | return [item["question"] for item in all_data_json] 690 | 691 | def get_all_queries(self): 692 | all_data_json = json.load(open(os.path.join(self.ROOT_PATH, "FIBEN_Queries.json"), "r", encoding="utf-8")) 693 | all_queries = list(set([item["SQL"].strip() for item in all_data_json])) 694 | return all_queries 695 | 696 | def get_all_db_paths(self): 697 | return [] 698 | 699 | 700 | class KaggleDBQA(Dataset): 701 | 702 | ROOT_PATH = "data/kaggledbqa" 703 | 704 | def get_all_questions(self): 705 | all_data_json = [] 706 | for filename in os.listdir(os.path.join(self.ROOT_PATH, "examples")): 707 | if
"_fewshot" in filename or "_test" in filename: 708 | continue 709 | all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "examples", filename), "r", encoding="utf-8"))) 710 | return [item["question"] for item in all_data_json] 711 | 712 | def get_all_queries(self): 713 | all_data_json = [] 714 | for filename in os.listdir(os.path.join(self.ROOT_PATH, "examples")): 715 | if "_fewshot" in filename or "_test" in filename: 716 | continue 717 | all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "examples", filename), "r", encoding="utf-8"))) 718 | all_queries = list(set([item["query"].strip() for item in all_data_json])) 719 | return all_queries 720 | 721 | def get_all_db_paths(self): 722 | db_paths = [os.path.join(self.ROOT_PATH, "databases", db_id, f"{db_id}.sqlite") for db_id in os.listdir(os.path.join(self.ROOT_PATH, "databases"))] 723 | return db_paths 724 | 725 | 726 | class SEDE(Dataset): 727 | 728 | ROOT_PATH = "data/sede" 729 | 730 | def __init__(self): 731 | schemas = json.load(open(os.path.join(self.ROOT_PATH, "tables_so.json"), "r", encoding="utf-8")) 732 | self._total_databases = len(schemas) 733 | self._total_tables = 0 734 | total_columns, total_records = 0, 0 735 | for db in schemas: 736 | self._total_tables += len(db["table_names_original"]) 737 | total_columns += (len(db["column_names_original"]) - 1) # ignore star col 738 | self._avg_tables_per_db = self._total_tables / self._total_databases 739 | self._avg_columns_per_table = total_columns / self._total_tables 740 | self._avg_records_per_db = total_records / self._total_databases 741 | 742 | def get_all_questions(self): 743 | all_data_json = [] 744 | for filename in ["train.jsonl", "val.jsonl", "test.jsonl"]: 745 | with open(os.path.join(self.ROOT_PATH, filename), "r", encoding="utf-8") as f: 746 | for line in f.readlines(): 747 | sample = json.loads(line) 748 | all_data_json.append(sample) 749 | return [item["Title"] for item in all_data_json] 750 | 751 | def get_all_queries(self): 752 | all_data_json = [] 753 | for filename in ["train.jsonl", "val.jsonl", "test.jsonl"]: 754 | with open(os.path.join(self.ROOT_PATH, filename), "r", encoding="utf-8") as f: 755 | for line in f.readlines(): 756 | sample = json.loads(line) 757 | all_data_json.append(sample) 758 | all_queries = list(set([item["QueryBody"].split("\n\n")[-1] for item in all_data_json])) 759 | return all_queries 760 | 761 | def get_all_db_paths(self): 762 | return [] 763 | 764 | 765 | class MTTEQL(Dataset): 766 | 767 | ROOT_PATH = "data/mt_teql" 768 | 769 | def __init__(self): 770 | schemas = [] 771 | for filename in ["dev-tables.json", "train-tables.json"]: 772 | schemas.extend(json.load(open(os.path.join(self.ROOT_PATH, filename), "r", encoding="utf-8"))) 773 | self._total_databases = len(schemas) 774 | self._total_tables = 0 775 | total_columns, total_records = 0, 0 776 | for db in schemas: 777 | self._total_tables += len(db["table_names_original"]) 778 | total_columns += (len(db["column_names_original"]) - 1) # ignore star col 779 | self._avg_tables_per_db = self._total_tables / self._total_databases 780 | self._avg_columns_per_table = total_columns / self._total_tables 781 | self._avg_records_per_db = total_records / self._total_databases 782 | 783 | def get_all_questions(self): 784 | all_data_json = [] 785 | for filename in ["train.json", "dev.json"]: 786 | all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, filename), "r", encoding="utf-8"))) 787 | return [item["question"] for item in all_data_json] 788 | 789 | def 
get_all_queries(self): 790 | all_data_json = [] 791 | for filename in ["train.json", "dev.json"]: 792 | all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, filename), "r", encoding="utf-8"))) 793 | all_queries = [] 794 | for item in all_data_json: 795 | if "query" in item: 796 | all_queries.append(item["query"].strip()) 797 | all_queries = list(set(all_queries)) 798 | return all_queries 799 | 800 | def get_all_db_paths(self): 801 | return [] 802 | 803 | 804 | class AmbiQT(Dataset): 805 | 806 | ROOT_PATH = "data/ambiqt" 807 | 808 | def get_all_questions(self): 809 | all_data_json = [] 810 | for benchmark_type in os.listdir(os.path.join(self.ROOT_PATH, "benchmark")): 811 | all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "benchmark", benchmark_type, "train.json"), "r", encoding="utf-8"))) 812 | all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "benchmark", benchmark_type, "validation.json"), "r", encoding="utf-8"))) 813 | return [item["question"] for item in all_data_json] 814 | 815 | def get_all_queries(self): 816 | all_data_json = [] 817 | for benchmark_type in os.listdir(os.path.join(self.ROOT_PATH, "benchmark")): 818 | all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "benchmark", benchmark_type, "train.json"), "r", encoding="utf-8"))) 819 | all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "benchmark", benchmark_type, "validation.json"), "r", encoding="utf-8"))) 820 | all_queries = [] 821 | all_queries.extend([item["query1"].strip() for item in all_data_json]) 822 | all_queries.extend([item["query2"].strip() for item in all_data_json]) 823 | all_queries = list(set(all_queries)) 824 | return all_queries 825 | 826 | def get_all_db_paths(self): 827 | db_paths = [os.path.join(self.ROOT_PATH, "database", db_id, f"{db_id}.sqlite") for db_id in os.listdir(os.path.join(self.ROOT_PATH, "database"))] 828 | return db_paths 829 | 830 | 831 | class ScienceBenchmark(Dataset): 832 | 833 | ROOT_PATH = "data/sciencebenchmark" 834 | 835 | def __init__(self): 836 | self._total_databases = -1 837 | self._total_tables = -1 838 | total_columns, total_records = 0, 0 839 | self._avg_tables_per_db = self._total_tables / self._total_databases 840 | self._avg_columns_per_table = total_columns / self._total_tables 841 | self._avg_records_per_db = total_records / self._total_databases 842 | 843 | def get_all_questions(self): 844 | all_data_json = [] 845 | for domain in os.listdir(self.ROOT_PATH): 846 | all_data_json.extend( 847 | json.load(open(os.path.join(self.ROOT_PATH, domain, "seed.json"), "r", encoding="utf-8")) 848 | ) 849 | all_data_json.extend( 850 | json.load(open(os.path.join(self.ROOT_PATH, domain, "dev.json"), "r", encoding="utf-8")) 851 | ) 852 | all_data_json.extend( 853 | json.load(open(os.path.join(self.ROOT_PATH, domain, "synth.json"), "r", encoding="utf-8")) 854 | ) 855 | return [item["question"] for item in all_data_json] 856 | 857 | def get_all_queries(self): 858 | all_data_json = [] 859 | for domain in os.listdir(self.ROOT_PATH): 860 | all_data_json.extend( 861 | json.load(open(os.path.join(self.ROOT_PATH, domain, "seed.json"), "r", encoding="utf-8")) 862 | ) 863 | all_data_json.extend( 864 | json.load(open(os.path.join(self.ROOT_PATH, domain, "dev.json"), "r", encoding="utf-8")) 865 | ) 866 | all_data_json.extend( 867 | json.load(open(os.path.join(self.ROOT_PATH, domain, "synth.json"), "r", encoding="utf-8")) 868 | ) 869 | all_queries = [item["query"].strip() for item in all_data_json] 870 | all_queries = 
list(set(all_queries)) 871 | return all_queries 872 | 873 | def get_all_db_paths(self): 874 | db_paths = [] 875 | return db_paths 876 | 877 | 878 | class BULL(Dataset): 879 | 880 | """BULL dataset. 881 | 882 | Note that we only compute statistics for the "train" split, because the dev data is not public. 883 | """ 884 | 885 | ROOT_PATH = "data/bull" 886 | 887 | def get_all_questions(self): 888 | all_data_json = [] 889 | all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "BULL-cn", "train.json"), "r", encoding="utf-8"))) 890 | # all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "BULL-cn", "dev_cn.json"), "r", encoding="utf-8"))) 891 | all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "BULL-en", "train.json"), "r", encoding="utf-8"))) 892 | # all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "BULL-en", "dev_en.json"), "r", encoding="utf-8"))) 893 | return [item["question"] for item in all_data_json] 894 | 895 | def get_all_queries(self): 896 | all_data_json = [] 897 | all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "BULL-cn", "train.json"), "r", encoding="utf-8"))) 898 | # all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "BULL-cn", "dev_cn.json"), "r", encoding="utf-8"))) 899 | all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "BULL-en", "train.json"), "r", encoding="utf-8"))) 900 | # all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "BULL-en", "dev_en.json"), "r", encoding="utf-8"))) 901 | all_queries = [item["sql_query"].strip() for item in all_data_json] 902 | all_queries = list(set(all_queries)) 903 | return all_queries 904 | 905 | def get_all_db_paths(self): 906 | db_paths = [ 907 | # os.path.join(self.ROOT_PATH, "database_cn", "ccks_fund", "ccks_fund.sqlite"), 908 | # os.path.join(self.ROOT_PATH, "database_cn", "ccks_macro", "ccks_macro.sqlite"), 909 | # os.path.join(self.ROOT_PATH, "database_cn", "ccks_stock", "ccks_stock.sqlite"), 910 | os.path.join(self.ROOT_PATH, "database_en", "ccks_fund", "ccks_fund.sqlite"), 911 | os.path.join(self.ROOT_PATH, "database_en", "ccks_macro", "ccks_macro.sqlite"), 912 | os.path.join(self.ROOT_PATH, "database_en", "ccks_stock", "ccks_stock.sqlite") 913 | ] 914 | return db_paths 915 | 916 | 917 | class BookSQL(Dataset): 918 | 919 | ROOT_PATH = "data/booksql" 920 | 921 | def get_all_questions(self): 922 | all_data_json = [] 923 | all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "train.json"), "r", encoding="utf-8"))) 924 | all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "BookSQL_val.json"), "r", encoding="utf-8"))) 925 | return [item["Query"] for item in all_data_json] 926 | 927 | def get_all_queries(self): 928 | all_data_json = [] 929 | all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "train.json"), "r", encoding="utf-8"))) 930 | all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "BookSQL_val.json"), "r", encoding="utf-8"))) 931 | all_queries = [item["SQL"].strip().replace("\"\"", "'") for item in all_data_json] 932 | all_queries = list(set(all_queries)) 933 | return all_queries 934 | 935 | def get_all_db_paths(self): 936 | db_paths = [ 937 | os.path.join(self.ROOT_PATH, "accounting.sqlite"), 938 | ] 939 | return db_paths 940 | 941 | 942 | class PAUQ(Dataset): 943 | 944 | ROOT_PATH = "data/pauq" 945 | 946 | def get_all_questions(self): 947 | all_data_json = [] 948 | all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "pauq_dev.json"), "r", encoding="utf-8"))) 949 |
all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "pauq_train.json"), "r", encoding="utf-8"))) 950 | return [item["question"]["ru"] for item in all_data_json] 951 | 952 | def get_all_queries(self): 953 | all_data_json = [] 954 | all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "pauq_dev.json"), "r", encoding="utf-8"))) 955 | all_data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "pauq_train.json"), "r", encoding="utf-8"))) 956 | all_queries = [item["query"]["ru"].strip() for item in all_data_json] 957 | all_queries = list(set(all_queries)) 958 | return all_queries 959 | 960 | def get_all_db_paths(self): 961 | db_paths = [os.path.join(self.ROOT_PATH, "merged_database", db_id, f"{db_id}.sqlite") for db_id in os.listdir(os.path.join(self.ROOT_PATH, "merged_database"))] 962 | return db_paths 963 | 964 | 965 | class CHASE(Dataset): 966 | 967 | ROOT_PATH = "data/chase" 968 | 969 | def get_all_questions(self): 970 | data_json = [] 971 | for item in json.load(open(os.path.join(self.ROOT_PATH, "chase_train.json"), "r", encoding="utf-8")): 972 | interaction = item["interaction"] 973 | for turn in interaction: 974 | data_json.append({ 975 | "question": turn["utterance"], 976 | "query": turn["query"] 977 | }) 978 | for item in json.load(open(os.path.join(self.ROOT_PATH, "chase_dev.json"), "r", encoding="utf-8")): 979 | interaction = item["interaction"] 980 | for turn in interaction: 981 | data_json.append({ 982 | "question": turn["utterance"], 983 | "query": turn["query"] 984 | }) 985 | all_questions = [item["question"] for item in data_json] 986 | return all_questions 987 | 988 | def get_all_queries(self): 989 | data_json = [] 990 | for item in json.load(open(os.path.join(self.ROOT_PATH, "chase_train.json"), "r", encoding="utf-8")): 991 | interaction = item["interaction"] 992 | for turn in interaction: 993 | data_json.append({ 994 | "question": turn["utterance"], 995 | "query": turn["query"] 996 | }) 997 | for item in json.load(open(os.path.join(self.ROOT_PATH, "chase_dev.json"), "r", encoding="utf-8")): 998 | interaction = item["interaction"] 999 | for turn in interaction: 1000 | data_json.append({ 1001 | "question": turn["utterance"], 1002 | "query": turn["query"] 1003 | }) 1004 | all_queries = list(set([item["query"].strip() for item in data_json])) 1005 | return all_queries 1006 | 1007 | def get_all_db_paths(self): 1008 | db_paths = [os.path.join(self.ROOT_PATH, "database", db_file) for db_file in os.listdir(os.path.join(self.ROOT_PATH, "database"))] 1009 | return db_paths 1010 | 1011 | 1012 | class DuSQL(Dataset): 1013 | 1014 | ROOT_PATH = "data/dusql" 1015 | 1016 | def __init__(self): 1017 | schemas = json.load(open(os.path.join(self.ROOT_PATH, "db_schema.json"), "r", encoding="utf-8")) 1018 | self._total_databases = len(schemas) 1019 | self._total_tables = 0 1020 | total_columns, total_records = 0, 0 1021 | for db in schemas: 1022 | self._total_tables += len(db["table_names"]) 1023 | total_columns += (len(db["column_names"]) - 1) # ignore star col 1024 | self._avg_tables_per_db = self._total_tables / self._total_databases 1025 | self._avg_columns_per_table = total_columns / self._total_tables 1026 | 1027 | db_content = json.load(open(os.path.join(self.ROOT_PATH, "db_content.json"), "r", encoding="utf-8")) 1028 | 1029 | for db in db_content: 1030 | for k, v in db["tables"].items(): 1031 | total_records += len(v["cell"]) 1032 | 1033 | self._avg_records_per_db = total_records / self._total_databases 1034 | 1035 | def get_all_questions(self): 1036 | 
all_data_json = [] 1037 | all_data_json.extend( 1038 | json.load(open(os.path.join(self.ROOT_PATH, "train.json"), "r", encoding="utf-8")) 1039 | ) 1040 | all_data_json.extend( 1041 | json.load(open(os.path.join(self.ROOT_PATH, "dev.json"), "r", encoding="utf-8")) 1042 | ) 1043 | return [item["question"] for item in all_data_json] 1044 | 1045 | def get_all_queries(self): 1046 | all_data_json = [] 1047 | all_data_json.extend( 1048 | json.load(open(os.path.join(self.ROOT_PATH, "train.json"), "r", encoding="utf-8")) 1049 | ) 1050 | all_data_json.extend( 1051 | json.load(open(os.path.join(self.ROOT_PATH, "dev.json"), "r", encoding="utf-8")) 1052 | ) 1053 | all_queries = [item["query"].strip() for item in all_data_json] 1054 | all_queries = list(set(all_queries)) 1055 | return all_queries 1056 | 1057 | def get_all_db_paths(self): 1058 | db_paths = [] 1059 | return db_paths 1060 | 1061 | 1062 | class ViText2SQL(Dataset): 1063 | 1064 | ROOT_PATH = "data/vitext2sql" 1065 | 1066 | def get_all_questions(self): 1067 | data_json = [] 1068 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "word-level", "train.json"), "r", encoding="utf-8"))) 1069 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "word-level", "dev.json"), "r", encoding="utf-8"))) 1070 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "word-level", "test.json"), "r", encoding="utf-8"))) 1071 | all_questions = [item["question"] for item in data_json] 1072 | return all_questions 1073 | 1074 | def get_all_queries(self): 1075 | data_json = [] 1076 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "word-level", "train.json"), "r", encoding="utf-8"))) 1077 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "word-level", "dev.json"), "r", encoding="utf-8"))) 1078 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "word-level", "test.json"), "r", encoding="utf-8"))) 1079 | all_queries = list(set([item["query"].strip() for item in data_json])) 1080 | return all_queries 1081 | 1082 | def get_all_db_paths(self): 1083 | db_paths = [os.path.join("data/spider", "database", db_id, f"{db_id}.sqlite") for db_id in os.listdir(os.path.join("data/spider", "database"))] 1084 | return db_paths 1085 | 1086 | 1087 | class MIMICSQL(Dataset): 1088 | 1089 | ROOT_PATH = "data/mimicsql" 1090 | 1091 | def __init__(self): 1092 | self._total_databases = -1 1093 | self._total_tables = -1 1094 | total_columns, total_records = 0, 0 1095 | self._avg_tables_per_db = self._total_tables / self._total_databases 1096 | self._avg_columns_per_table = total_columns / self._total_tables 1097 | self._avg_records_per_db = total_records / self._total_databases 1098 | 1099 | def get_all_questions(self): 1100 | all_data_json = [] 1101 | with open(os.path.join(self.ROOT_PATH, "mimicsql_template", "train.json"), "r", encoding="utf-8") as f: 1102 | for line in f.readlines(): 1103 | all_data_json.append(json.loads(line)) 1104 | with open(os.path.join(self.ROOT_PATH, "mimicsql_template", "test.json"), "r", encoding="utf-8") as f: 1105 | for line in f.readlines(): 1106 | all_data_json.append(json.loads(line)) 1107 | with open(os.path.join(self.ROOT_PATH, "mimicsql_template", "dev.json"), "r", encoding="utf-8") as f: 1108 | for line in f.readlines(): 1109 | all_data_json.append(json.loads(line)) 1110 | with open(os.path.join(self.ROOT_PATH, "mimicsql_natural_v2", "train.json"), "r", encoding="utf-8") as f: 1111 | for line in f.readlines(): 1112 | all_data_json.append(json.loads(line)) 1113 | with 
open(os.path.join(self.ROOT_PATH, "mimicsql_natural_v2", "test.json"), "r", encoding="utf-8") as f: 1114 | for line in f.readlines(): 1115 | all_data_json.append(json.loads(line)) 1116 | with open(os.path.join(self.ROOT_PATH, "mimicsql_natural_v2", "dev.json"), "r", encoding="utf-8") as f: 1117 | for line in f.readlines(): 1118 | all_data_json.append(json.loads(line)) 1119 | return [item["question_refine"] for item in all_data_json] 1120 | 1121 | def get_all_queries(self): 1122 | all_data_json = [] 1123 | with open(os.path.join(self.ROOT_PATH, "mimicsql_template", "train.json"), "r", encoding="utf-8") as f: 1124 | for line in f.readlines(): 1125 | all_data_json.append(json.loads(line)) 1126 | with open(os.path.join(self.ROOT_PATH, "mimicsql_template", "test.json"), "r", encoding="utf-8") as f: 1127 | for line in f.readlines(): 1128 | all_data_json.append(json.loads(line)) 1129 | with open(os.path.join(self.ROOT_PATH, "mimicsql_template", "dev.json"), "r", encoding="utf-8") as f: 1130 | for line in f.readlines(): 1131 | all_data_json.append(json.loads(line)) 1132 | with open(os.path.join(self.ROOT_PATH, "mimicsql_natural_v2", "train.json"), "r", encoding="utf-8") as f: 1133 | for line in f.readlines(): 1134 | all_data_json.append(json.loads(line)) 1135 | with open(os.path.join(self.ROOT_PATH, "mimicsql_natural_v2", "test.json"), "r", encoding="utf-8") as f: 1136 | for line in f.readlines(): 1137 | all_data_json.append(json.loads(line)) 1138 | with open(os.path.join(self.ROOT_PATH, "mimicsql_natural_v2", "dev.json"), "r", encoding="utf-8") as f: 1139 | for line in f.readlines(): 1140 | all_data_json.append(json.loads(line)) 1141 | all_queries = [item["sql"].strip() for item in all_data_json] 1142 | all_queries = list(set(all_queries)) 1143 | return all_queries 1144 | 1145 | def get_all_db_paths(self): 1146 | db_paths = [] 1147 | return db_paths 1148 | 1149 | 1150 | class PortugueseSpider(Dataset): 1151 | ROOT_PATH = "data/spider" 1152 | 1153 | def get_all_questions(self): 1154 | data_json = [] 1155 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "train_spider.json"), "r", encoding="utf-8"))) 1156 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "train_others.json"), "r", encoding="utf-8"))) 1157 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "dev.json"), "r", encoding="utf-8"))) 1158 | all_questions = [item["question"] for item in data_json] 1159 | return all_questions 1160 | 1161 | def get_all_queries(self): 1162 | data_json = [] 1163 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "train_spider.json"), "r", encoding="utf-8"))) 1164 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "train_others.json"), "r", encoding="utf-8"))) 1165 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "dev.json"), "r", encoding="utf-8"))) 1166 | all_queries = list(set([item["query"].strip() for item in data_json])) 1167 | return all_queries 1168 | 1169 | def get_all_db_paths(self): 1170 | db_paths = [os.path.join(self.ROOT_PATH, "database", db_id, f"{db_id}.sqlite") for db_id in os.listdir(os.path.join(self.ROOT_PATH, "database"))] 1171 | return db_paths 1172 | 1173 | 1174 | class Archer(Dataset): 1175 | 1176 | ROOT_PATH = "data/archer" 1177 | 1178 | def get_all_questions(self): 1179 | data_json = [] 1180 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "train.json"), "r", encoding="utf-8"))) 1181 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "dev.json"), "r", encoding="utf-8"))) 1182 | 
all_questions = [item["question"] for item in data_json] 1183 | return all_questions 1184 | 1185 | def get_all_queries(self): 1186 | data_json = [] 1187 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "train.json"), "r", encoding="utf-8"))) 1188 | data_json.extend(json.load(open(os.path.join(self.ROOT_PATH, "dev.json"), "r", encoding="utf-8"))) 1189 | all_queries = list(set([item["query"].strip() for item in data_json])) 1190 | return all_queries 1191 | 1192 | def get_all_db_paths(self): 1193 | db_paths = [os.path.join(self.ROOT_PATH, "database", db_id, f"{db_id}.sqlite") for db_id in os.listdir(os.path.join(self.ROOT_PATH, "database"))] 1194 | return db_paths 1195 | 1196 | 1197 | if __name__ == "__main__": 1198 | dataset = Archer() 1199 | print(len(dataset.get_all_questions())) 1200 | print(len(dataset.get_all_queries())) 1201 | print(len(dataset.get_all_db_paths())) 1202 | -------------------------------------------------------------------------------- /src/dataset_analyze/sql_parser.py: -------------------------------------------------------------------------------- 1 | from sqlglot import parse_one, exp 2 | 3 | 4 | class SQLParser: 5 | 6 | _SCALAR_KEYWORDS = (exp.Abs, exp.Length, exp.Cast, exp.Round, exp.Upper, exp.Lower, exp.Rand) 7 | _SCALAR_KEYWORDS_ANONYMOUS_STR = ("STRFTIME", "JULIANDAY", "NOW", "INSTR", "SUBSTR")  # function names counted when they appear as exp.Anonymous nodes 8 | 9 | _MATH_COMPUTE_KEYWORDS = (exp.Add, exp.Sub, exp.Mul, exp.Div, exp.Mod) 10 | 11 | def __init__(self, sql, dialect="sqlite"): 12 | self.ast = parse_one(sql, dialect=dialect) 13 | 14 | @property 15 | def count_table(self): 16 | return len(list(self.ast.find_all(exp.Table))) 17 | 18 | @property 19 | def count_select(self): 20 | return len(list(self.ast.find_all(exp.Select))) 21 | 22 | @property 23 | def count_aggregation(self): 24 | return len(list(self.ast.find_all(exp.AggFunc))) 25 | 26 | @property 27 | def count_scalar_function(self): 28 | scalar_nodes = list(self.ast.find_all(self._SCALAR_KEYWORDS)) 29 | scalar_nodes.extend([node for node in self.ast.find_all(exp.Anonymous) if node.this.upper() in self._SCALAR_KEYWORDS_ANONYMOUS_STR]) 30 | return len(scalar_nodes) 31 | 32 | @property 33 | def count_math_compute(self): 34 | return len(list(self.ast.find_all(self._MATH_COMPUTE_KEYWORDS))) 35 | -------------------------------------------------------------------------------- /src/dataset_analyze/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sqlite3 3 | from sql_parser import SQLParser 4 | 5 | ROUND_NUM = 2 6 | 7 | 8 | def generate_report_database_complexity(db_paths: list, is_wikisql=False): 9 | total_databases = len(db_paths) 10 | total_tables = 0 11 | total_columns = 0 12 | total_records = 0 13 | 14 | for db_path in db_paths: 15 | conn = sqlite3.connect(db_path) 16 | cursor = conn.cursor() 17 | 18 | cursor.execute("SELECT name FROM sqlite_master WHERE type='table';") 19 | tables = cursor.fetchall() 20 | num_tables = len(tables) 21 | total_tables += num_tables 22 | for table in tables: 23 | table_name = table[0] 24 | 25 | cursor.execute(f"PRAGMA table_info(`{table_name}`);") 26 | columns = cursor.fetchall() 27 | num_columns = len(columns) 28 | total_columns += num_columns 29 | 30 | cursor.execute(f"SELECT COUNT(*) FROM `{table_name}`;") 31 | num_records = cursor.fetchone()[0] 32 | total_records += num_records 33 | 34 | conn.close() 35 | 36 | if is_wikisql: 37 | total_databases = total_tables  # WikiSQL stores many single-table schemas in shared .db files, so treat each table as its own database when averaging 38 | 39 | avg_tables_per_db = round(total_tables / total_databases, ROUND_NUM) 40 |
avg_columns_per_table = round(total_columns / total_tables, ROUND_NUM) 41 | avg_records_per_db = round(total_records / total_databases, ROUND_NUM) 42 | 43 | report = { 44 | "Total Databases": total_databases, 45 | "Total Tables": total_tables, 46 | "Average Tables per Database": avg_tables_per_db, 47 | "Average Columns per Table": avg_columns_per_table, 48 | "Average Records per Database": avg_records_per_db 49 | } 50 | 51 | return report 52 | 53 | 54 | def generate_report_query_complexity(queries: list[str]): 55 | tables_per_query = [] 56 | selects_per_query = [] 57 | aggs_per_query = [] 58 | scalar_funcs_per_query = [] 59 | math_computes_per_query = [] 60 | 61 | for query in queries: 62 | try: 63 | sql_parser = SQLParser(query) 64 | except Exception as e: 65 | print(query) 66 | print(e) 67 | continue  # skip queries that sqlglot cannot parse 68 | tables_per_query.append(sql_parser.count_table) 69 | selects_per_query.append(sql_parser.count_select) 70 | aggs_per_query.append(sql_parser.count_aggregation) 71 | scalar_funcs_per_query.append(sql_parser.count_scalar_function) 72 | math_computes_per_query.append(sql_parser.count_math_compute) 73 | 74 | avg_tables_per_query = round(sum(tables_per_query) / len(tables_per_query), ROUND_NUM) 75 | avg_selects_per_query = round(sum(selects_per_query) / len(selects_per_query), ROUND_NUM) 76 | avg_aggs_per_query = round(sum(aggs_per_query) / len(aggs_per_query), ROUND_NUM) 77 | avg_scalar_funcs_per_query = round(sum(scalar_funcs_per_query) / len(scalar_funcs_per_query), ROUND_NUM) 78 | avg_math_computes_per_query = round(sum(math_computes_per_query) / len(math_computes_per_query), ROUND_NUM) 79 | 80 | report = { 81 | "Average Tables per Query": avg_tables_per_query, 82 | "Average Selects per Query": avg_selects_per_query, 83 | "Average Aggs per Query": avg_aggs_per_query, 84 | "Average Scalar Functions per Query": avg_scalar_funcs_per_query, 85 | "Average Math Computations per Query": avg_math_computes_per_query 86 | } 87 | 88 | return report --------------------------------------------------------------------------------
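
For readers who want to reproduce a per-dataset report, the sketch below shows one way the pieces above could be wired together. It is a hypothetical example, not the repository's own `analyze.py`: the `Archer` class and the two `generate_report_*` helpers are taken from the files shown above, while the flat import layout (running the script from `src/dataset_analyze/`) and the output file name are assumptions.

```python
# Hypothetical driver script (assumed to run from src/dataset_analyze/ so that
# dataset.py, sql_parser.py, and utils.py import as top-level modules).
import json

from dataset import Archer  # any Dataset subclass from dataset.py would work
from utils import (
    generate_report_database_complexity,
    generate_report_query_complexity,
)

if __name__ == "__main__":
    dataset = Archer()

    report = {}
    # Database-side statistics: tables, columns, and records per database.
    report.update(generate_report_database_complexity(dataset.get_all_db_paths()))
    # Query-side statistics: tables, SELECTs, aggregations, scalar functions,
    # and math computations per query, parsed with sqlglot via SQLParser.
    report.update(generate_report_query_complexity(dataset.get_all_queries()))

    # The output file name is illustrative only.
    with open("report.json", "w", encoding="utf-8") as f:
        json.dump(report, f, indent=4, ensure_ascii=False)
```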