├── imgs
│   ├── demo.gif
│   ├── logo.png
│   ├── header.png
│   ├── intro.png
│   ├── logo2.png
│   ├── framework.png
│   ├── sft_algo.png
│   ├── overall_results.png
│   ├── execution_accuracy.png
│   ├── test_score_high_res.png
│   ├── critic_score_mean_pairs.jpg
│   ├── critic_score_mean_pairs.pdf
│   ├── response_length_mean_pairs.png
│   ├── execution_accuracy_by_difficulty.png
│   ├── model_performance_comparison_17B.png
│   └── model_performance_comparison_4B.png
└── README.md

/README.md:

demo

📄 Arxiv | 🤗 Hugging Face

| Resource | Link |
|----------|------|
| 🤗 MTSQL-R1 (4B) | MTSQL-R1 (4B) (will release after internal review) |
| 🤗 MTSQL-R1 (1.7B) | MTSQL-R1 (1.7B) (will release after internal review) |
| 🤗 Dataset | CoSQL-Long-Horizon-SFT-RL-Data (will release after internal review) |
| 🤗 Dataset | SParC-Long-Horizon-SFT-RL-Data (will release after internal review) |
| Code for SFT | Will release after internal review |
| Code for RL | Will release after internal review |

[![Python](https://img.shields.io/badge/Python-3.10-green.svg)](https://www.python.org/)
![CUDA 12.4](https://img.shields.io/badge/CUDA-12.4-76B900?logo=nvidia&logoColor=white)

# 🚀 MTSQL-R1: Towards Long-Horizon Multi-Turn Text-to-SQL via Agentic Training

# 📋 Table of Contents

- [🌟 Highlights](#highlights)
- [📖 Introduction](#introduction)
- [⚙️ Configuration](#configuration)
- [🔄 Training Framework](#training-framework)
  - [Stage1: Self-Taught Warm-Start SFT](#stage1-self-taught-warm-start-sft)
  - [Stage2: End-to-End Long-Horizon Reinforcement Learning](#stage2-end-to-end-long-horizon-reinforcement-learning)
- [📈 Training Dynamics](#training-dynamics)
- [📊 Experiment Results](#experiment-results)
  - [Overall Experiment Results](#-overall-experiment-results)
  - [Performance over different difficulties and turns](#performance-over-different-difficulties-and-turns)
  - [The evolution of different Long-Horizon Abilities](#the-evolution-of-different-long-horizon-abilities-and-related-execution-match-performance-for-4b-and-17b-model)
- [🙏 Acknowledgements](#acknowledgements)
- [📫 Contact](#contact)

# 🌟 Highlights

61 | 62 |
| Category | Feature | Description |
|---------|---------|------------|
| Text-to-SQL | 🎯 Excels at solving **long-turn and extra-hard** SQL questions | |
| Text-to-SQL | 🔄 Long-horizon formulation with environment feedback | Leverages environment feedback through database execution and explicit memory verification to guide SQL generation and error correction |
| LLM Training | 🎓 **Two-stage** training framework | 1) **Tool-integrated, high-quality SFT dataset construction** via self-taught generation, followed by warm-start SFT; 2) **curriculum RL training** with a **multi-level reward** design (outcome and dense process rewards) |
| LLM Training | 🔁 **Multi-turn** end-to-end RL training | Enables end-to-end training across multiple turns with database and memory to enhance coherence |

# 📖 Introduction

Short-horizon Text-to-SQL translates each question directly to SQL, resulting in execution errors and coherence-related errors.

Our approach enables:

- Environment-based verification: The model interacts dynamically with two components: (i) a database for execution feedback and (ii) a long-term dialogue memory for explicit coherence checking to verify intermediate SQL outputs.

- Self-correction: Based on verification feedback, the model iteratively refines its generated SQL queries to achieve consistent, executable outputs across multiple turns.

- Autonomous, end-to-end learning of actions (Propose, Execute, Verify, and Self-Correct) to generate better SQL.
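A minimal sketch of this verify-and-correct loop, assuming a hypothetical `propose` callback standing in for the model and `sqlite3` standing in for the database environment (the real agent also consults dialogue memory, which is omitted here):

```python
import sqlite3

def execute_sql(conn, sql):
    """Run a candidate query; return (rows, error) as execution feedback."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as e:
        return None, str(e)

def propose_verify_correct(conn, propose, max_rounds=3):
    """Hypothetical agent loop: propose SQL, execute it, and self-correct
    from environment feedback until the query verifies or rounds run out."""
    feedback = None
    sql, rows = None, None
    for _ in range(max_rounds):
        sql = propose(feedback)           # model proposes (or revises) a query
        rows, error = execute_sql(conn, sql)
        if error is None and rows:        # verified: executable and non-empty
            return sql, rows
        feedback = error or "query returned no rows"  # null-return case
    return sql, rows                       # best effort after max_rounds
```

For example, a proposer whose first attempt references a missing column gets the execution error back as feedback and can emit a corrected query on the next round.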



# ⚙️ Configuration

- verl == 0.4.1
- LLaMA-Factory == 0.9.3
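Assuming these map to the PyPI package names `verl` and `llamafactory` (installing both from source is also common, and these exact versions may only be available that way), a pinned environment sketch:

```shell
# install the pinned framework versions; verl additionally expects a
# compatible CUDA/vLLM or SGLang stack -- see each project's install docs
pip install verl==0.4.1
pip install llamafactory==0.9.3
```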

# 🔄 Training Framework



## Stage1: Self-Taught Warm-Start SFT

- Step 1: Random sampling with high temperature to generate natural reasoning trajectories
- Step 2: Difficulty-aware rejection sampling
- Step 3: SFT on tool-integrated multi-turn trajectories with loss masking
- Step 4: Update the dataset and model, then repeat
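Steps 1 and 2 can be sketched roughly as follows; `generate` and `is_correct` are hypothetical stand-ins for trajectory sampling and execution-based checking, and the difficulty-aware cap is one plausible reading of the step (keep fewer samples for easy questions so harder ones are not drowned out), not the paper's exact recipe:

```python
import random

def reject_sample(questions, generate, is_correct, k=8, temperature=1.0):
    """Hypothetical self-taught data construction: sample k high-temperature
    trajectories per question and keep only those whose final SQL passes the
    execution check, capped inversely to how easy the question proved to be."""
    kept = []
    for q in questions:
        trajs = [generate(q, temperature) for _ in range(k)]
        good = [t for t in trajs if is_correct(q, t)]
        if good:
            # difficulty-aware cap: many correct samples => easy question,
            # so retain fewer of them in the SFT dataset
            n_keep = max(1, k - len(good))
            kept.extend(random.sample(good, min(n_keep, len(good))))
    return kept
```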

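The loss masking in Step 3 is typically implemented by setting the labels of non-model tokens to an ignore index, so only model-generated tokens contribute to the SFT loss; a minimal sketch (the segmentation into model vs. environment tokens is assumed to come from the trajectory format):

```python
IGNORE_INDEX = -100  # labels with this value are skipped by cross-entropy loss

def mask_labels(token_ids, is_model_token):
    """Keep labels only for tokens the model generated; mask everything else
    (user prompt, database results, memory feedback) out of the loss."""
    return [t if keep else IGNORE_INDEX
            for t, keep in zip(token_ids, is_model_token)]
```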

## Stage2: End-to-End Long-Horizon Reinforcement Learning

- Step 1: Curriculum data partition by difficulty
- Step 2: Outcome and process reward design
- Step 3: Multi-turn RL with loss masking
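One plausible shape for Step 2's multi-level reward, sketched here with assumed weights and an assumed per-turn verification signal (not the paper's exact design): a sparse outcome term for final execution match plus a dense process term for verified intermediate turns.

```python
def multi_level_reward(pred_rows, gold_rows, n_verified_turns, n_turns,
                       outcome_weight=1.0, process_weight=0.5):
    """Hypothetical multi-level reward: a sparse outcome reward (final
    execution match against the gold result) plus a dense process reward
    (fraction of turns whose intermediate SQL was verified)."""
    outcome = 1.0 if pred_rows == gold_rows else 0.0
    process = n_verified_turns / max(n_turns, 1)
    return outcome_weight * outcome + process_weight * process
```

The dense process term is what lets the model earn partial credit on hard examples where the sparse outcome reward alone would almost always be zero.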

# 📈 Training Dynamics

The dynamics of reward score and response length during training:


The dynamics of the test score across different training checkpoints:



# 📊 Experiment Results

## Overall Experiment Results

Key findings and takeaways:

- Warm-start SFT and RL each provide performance gains.
- Small LLMs (1.7B/4B) struggle to follow long-horizon function-calling instructions.
- Conventional SFT attains good Exact Match but exhibits weaker logical consistency (Execution Match), while long-horizon training achieves better Execution Match.
- Long-horizon reasoning yields larger gains on multi-turn dialogues and complex questions.
- Long-horizon RL substantially improves out-of-domain performance.
- Dense process rewards help the model learn from harder examples, further boosting performance compared with sparse outcome-only rewards.
- Stronger function calling, verification, and self-correction correlate with better SQL performance.
- With long-horizon actions and training, the agent learns to resolve execution failures (even null-return cases, which we call the **aha moment** in Text-to-SQL) and coherence errors.


## Performance over different difficulties and turns


## The evolution of different Long-Horizon Abilities and related Execution Match performance for 4B and 1.7B model



# 🙏 Acknowledgements

We would like to express our gratitude to the open-source community for their valuable contributions:

- verl: https://github.com/volcengine/verl
- LLaMA-Factory: https://github.com/hiyouga/LLaMA-Factory
- SGLang: https://github.com/sgl-project/sglang
- vLLM: https://github.com/vllm-project/vllm
- DB-GPT-Hub: https://github.com/eosphoros-ai/DB-GPT-Hub
- CoSQL: https://github.com/taoyds/cosql
- SParC: https://github.com/taoyds/sparc
- Search-R1: https://github.com/PeterGriffinJin/Search-R1
- ...and many others

# 📫 Contact

For any issues or discussion, please contact tguo2@nd.edu. Thank you!