├── imgs
│   ├── demo.gif
│   ├── logo.png
│   ├── header.png
│   ├── intro.png
│   ├── logo2.png
│   ├── framework.png
│   ├── sft_algo.png
│   ├── overall_results.png
│   ├── execution_accuracy.png
│   ├── test_score_high_res.png
│   ├── critic_score_mean_pairs.jpg
│   ├── critic_score_mean_pairs.pdf
│   ├── response_length_mean_pairs.png
│   ├── execution_accuracy_by_difficulty.png
│   ├── model_performance_comparison_17B.png
│   └── model_performance_comparison_4B.png
└── README.md

/README.md:

demo

📄 Arxiv | 🤗 Hugging Face

| Resource | Link |
|----------|------|
| 🤗 MTSQL-R1 (4B) | MTSQL-R1 (4B) (will release after internal review) |
| 🤗 MTSQL-R1 (1.7B) | MTSQL-R1 (1.7B) (will release after internal review) |
| 🤗 Dataset | CoSQL-Long-Horizon-SFT-RL-Data (will release after internal review) |
| 🤗 Dataset | SParC-Long-Horizon-SFT-RL-Data (will release after internal review) |
| Code for SFT | Will release after internal review |
| Code for RL | Will release after internal review |

[![Python](https://img.shields.io/badge/Python-3.10-green.svg)](https://www.python.org/)
![CUDA 12.4](https://img.shields.io/badge/CUDA-12.4-76B900?logo=nvidia&logoColor=white)

# 🚀 MTSQL-R1: Towards Long-Horizon Multi-Turn Text-to-SQL via Agentic Training

# 📋 Table of Contents

- [🌟 Highlights](#highlights)
- [📖 Introduction](#introduction)
- [⚙️ Configuration](#configuration)
- [🔄 Training Framework](#training-framework)
  - [Stage1: Self-Taught Warm-Start SFT](#stage1-self-taught-warm-start-sft)
  - [Stage2: End-to-End Long-Horizon Reinforcement Learning](#stage2-end-to-end-long-horizon-reinforcement-learning)
- [📈 Training Dynamics](#training-dynamics)
- [📊 Experiment Results](#experiment-results)
  - [Overall Experiment Results](#-overall-experiment-results)
  - [Performance over different difficulties and turns](#performance-over-different-difficulties-and-turns)
  - [The evolution of different Long-Horizon Abilities](#the-evolution-of-different-long-horizon-abilities-and-related-execution-match-performance-for-4b-and-17b-model)
- [🙏 Acknowledgements](#acknowledgements)
- [📫 Contact](#contact)

# 🌟 Highlights

61 | 62 |
| Category | Feature | Description |
|---------|---------|------------|
| Text-to-SQL | 🎯 Excels at solving **long-turn and extra-hard** SQL questions | |
| Text-to-SQL | 🔄 Long-horizon formulation with environment feedback | Leverages environment feedback through database execution and explicit memory verification to guide SQL generation and error correction |
| LLM Training | 🎓 **Two-stage** training framework | 1) **Tool-integrated, high-quality SFT dataset construction** via self-taught generation, followed by warm-start SFT; 2) **curriculum RL training** with a **multi-level reward** design (outcome and dense process rewards) |
| LLM Training | 🔁 **Multi-turn** end-to-end RL training | Enables end-to-end training across multiple turns with database and memory to enhance coherence |

# 📖 Introduction

Short-horizon Text-to-SQL translates each question directly to SQL, resulting in execution errors and coherence-related errors.

Our approach enables:

- Environment-based verification: The model interacts dynamically with two components: (i) a database for execution feedback and (ii) a long-term dialogue memory for explicit coherence checking to verify intermediate SQL outputs.

- Self-correction: Based on verification feedback, the model iteratively refines its generated SQL queries to achieve consistent, executable outputs across multiple turns.

- Autonomous, end-to-end learning of actions (Propose, Execute, Verify, and Self-Correct) to generate better SQL.
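A minimal sketch of this verify-and-correct loop, assuming a hypothetical `propose` callback standing in for the model and `sqlite3` standing in for the database environment (the real agent also consults dialogue memory, which is omitted here):

```python
import sqlite3

def execute_sql(conn, sql):
    """Run a candidate query; return (rows, error) as execution feedback."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as e:
        return None, str(e)

def propose_verify_correct(conn, propose, max_rounds=3):
    """Hypothetical agent loop: propose SQL, execute it, and self-correct
    from environment feedback until the query verifies or rounds run out."""
    feedback = None
    sql, rows = None, None
    for _ in range(max_rounds):
        sql = propose(feedback)           # model proposes (or revises) a query
        rows, error = execute_sql(conn, sql)
        if error is None and rows:        # verified: executable and non-empty
            return sql, rows
        feedback = error or "query returned no rows"  # null-return case
    return sql, rows                       # best effort after max_rounds
```

For example, a proposer whose first attempt references a missing column gets the execution error back as feedback and can emit a corrected query on the next round.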



# ⚙️ Configuration

- verl == 0.4.1
- LLaMA-Factory == 0.9.3
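Assuming these map to the PyPI package names `verl` and `llamafactory` (installing both from source is also common, and these exact versions may only be available that way), a pinned environment sketch:

```shell
# install the pinned framework versions; verl additionally expects a
# compatible CUDA/vLLM or SGLang stack -- see each project's install docs
pip install verl==0.4.1
pip install llamafactory==0.9.3
```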

# 🔄 Training Framework



## Stage1: Self-Taught Warm-Start SFT

- Step 1: Random sampling with high temperature to generate natural reasoning trajectories
- Step 2: Difficulty-aware rejection sampling
- Step 3: SFT on tool-integrated multi-turn trajectories with loss masking
- Step 4: Update the dataset and model, then repeat
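Steps 1 and 2 can be sketched roughly as follows; `generate` and `is_correct` are hypothetical stand-ins for trajectory sampling and execution-based checking, and the difficulty-aware cap is one plausible reading of the step (keep fewer samples for easy questions so harder ones are not drowned out), not the paper's exact recipe:

```python
import random

def reject_sample(questions, generate, is_correct, k=8, temperature=1.0):
    """Hypothetical self-taught data construction: sample k high-temperature
    trajectories per question and keep only those whose final SQL passes the
    execution check, capped inversely to how easy the question proved to be."""
    kept = []
    for q in questions:
        trajs = [generate(q, temperature) for _ in range(k)]
        good = [t for t in trajs if is_correct(q, t)]
        if good:
            # difficulty-aware cap: many correct samples => easy question,
            # so retain fewer of them in the SFT dataset
            n_keep = max(1, k - len(good))
            kept.extend(random.sample(good, min(n_keep, len(good))))
    return kept
```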

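The loss masking in Step 3 is typically implemented by setting the labels of non-model tokens to an ignore index, so only model-generated tokens contribute to the SFT loss; a minimal sketch (the segmentation into model vs. environment tokens is assumed to come from the trajectory format):

```python
IGNORE_INDEX = -100  # labels with this value are skipped by cross-entropy loss

def mask_labels(token_ids, is_model_token):
    """Keep labels only for tokens the model generated; mask everything else
    (user prompt, database results, memory feedback) out of the loss."""
    return [t if keep else IGNORE_INDEX
            for t, keep in zip(token_ids, is_model_token)]
```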

## Stage2: End-to-End Long-Horizon Reinforcement Learning

- Step 1: Curriculum data partition by difficulty
- Step 2: Outcome and process reward design
- Step 3: Multi-turn RL with loss masking
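One plausible shape for Step 2's multi-level reward, sketched here with assumed weights and an assumed per-turn verification signal (not the paper's exact design): a sparse outcome term for final execution match plus a dense process term for verified intermediate turns.

```python
def multi_level_reward(pred_rows, gold_rows, n_verified_turns, n_turns,
                       outcome_weight=1.0, process_weight=0.5):
    """Hypothetical multi-level reward: a sparse outcome reward (final
    execution match against the gold result) plus a dense process reward
    (fraction of turns whose intermediate SQL was verified)."""
    outcome = 1.0 if pred_rows == gold_rows else 0.0
    process = n_verified_turns / max(n_turns, 1)
    return outcome_weight * outcome + process_weight * process
```

The dense process term is what lets the model earn partial credit on hard examples where the sparse outcome reward alone would almost always be zero.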

# 📈 Training Dynamics

The dynamics of reward score and response length during training:


The dynamics of the test score across different training checkpoints:



# 📊 Experiment Results

## Overall Experiment Results

Key findings and takeaways:

- Warm-start SFT and RL each provide performance gains.
- Small LLMs (1.7B/4B) struggle to follow long-horizon function-calling instructions.
- Conventional SFT attains good Exact Match but exhibits weaker logical consistency (Execution Match), while long-horizon training achieves better Execution Match.
- Long-horizon reasoning yields larger gains on multi-turn dialogues and complex questions.
- Long-horizon RL substantially improves out-of-domain performance.
- Dense process rewards help the model learn from harder examples, further boosting performance compared with sparse outcome-only rewards.
- Stronger function calling, verification, and self-correction correlate with better SQL performance.
- With long-horizon actions and training, the agent learns to resolve execution failures (even null-return cases, which we call the **aha moment** in Text-to-SQL) and coherence errors.


## Performance over different difficulties and turns


## The evolution of different Long-Horizon Abilities and related Execution Match performance for 4B and 1.7B model



# 🙏 Acknowledgements

We would like to express our gratitude to the open-source community for their valuable contributions:

- verl: https://github.com/volcengine/verl
- LLaMA-Factory: https://github.com/hiyouga/LLaMA-Factory
- SGLang: https://github.com/sgl-project/sglang
- vLLM: https://github.com/vllm-project/vllm
- DB-GPT-Hub: https://github.com/eosphoros-ai/DB-GPT-Hub
- CoSQL: https://github.com/taoyds/cosql
- SParC: https://github.com/taoyds/sparc
- Search-R1: https://github.com/PeterGriffinJin/Search-R1
- ...and many others

# 📫 Contact

For any issues or discussion, please contact tguo2@nd.edu. Thank you!