├── .gitignore
├── README.md
├── data
│   ├── columns.json
│   ├── csv_files
│   │   ├── earning.csv
│   │   ├── employee.csv
│   │   ├── employeepayrollrun.csv
│   │   ├── group_final.csv
│   │   ├── paygroup.csv
│   │   └── payrollrun.csv
│   ├── database
│   │   └── final_db.db
│   ├── example_queries
│   │   ├── complete_set
│   │   │   ├── earning.csv
│   │   │   ├── employeepayrollrun.csv
│   │   │   ├── group_final.csv
│   │   │   ├── paygroup.csv
│   │   │   └── payrollrun.csv
│   │   ├── retr_set
│   │   │   ├── final_earning.csv
│   │   │   ├── final_employee.csv
│   │   │   ├── final_employeepayrollrun.csv
│   │   │   ├── final_group_final.csv
│   │   │   ├── final_paygroup.csv
│   │   │   └── final_payrollrun.csv
│   │   ├── test_set
│   │   │   ├── final_earning.csv
│   │   │   ├── final_employee.csv
│   │   │   ├── final_employeepayrollrun.csv
│   │   │   ├── final_group_final.csv
│   │   │   ├── final_paygroup.csv
│   │   │   └── final_payrollrun.csv
│   │   └── trash
│   │       ├── final_employee_old.csv
│   │       └── final_employee_old_test.csv
│   ├── prompts
│   │   ├── synthetic_dataset_prompts.txt
│   │   └── synthetic_dataset_template.txt
│   ├── samples.json
│   └── sql
│       ├── earning.sql
│       ├── employee.sql
│       ├── employeepayrollrun.sql
│       ├── group_final.sql
│       ├── paygroup.sql
│       └── payrollrun.sql
├── images
│   ├── stage1.png
│   ├── stage2.png
│   ├── stage3.png
│   ├── stage4.png
│   ├── stage5a.png
│   └── stage5b.png
├── requirements.txt
└── src
    ├── check_retr_data.py
    ├── create_db.py
    ├── main.py
    ├── text_sim.py
    └── utils.py
/.gitignore:
--------------------------------------------------------------------------------
1 | api-keys/huggingface.json
2 | api-keys/openai.json
3 | src/chatgpt.py
4 | src/old_code.py
5 | src/langchain.py
6 | data/example_queries/retr_set/employee.json
7 | data/example_queries/retr_set/employee_final.json
8 | data/example_queries/test_set/employee.json
9 | data/example_queries/trash
10 | src/__pycache__
11 | images/old/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # text2sql-LLM
2 |
3 | Leveraging In-Context Learning using a Synthetic Dataset for Text-to-SQL Models
4 |
5 | ## Running
6 | ```
7 | pip install -r requirements.txt
8 | python3 src/main.py
9 | ```
10 |
11 | ## Run Through
12 | 1. Enter the Table Name of the table you want to query
13 |
14 | *
*
15 |
16 | You can find the table names in ```/data/csv_files```.
17 |
18 | 2. Choose whether to enter a custom Natural Language Question or to run the evaluation on the test questions in ```data/example_queries/test_set```.
19 |
20 | **
21 |
22 | 3. Enter the Natural Language Question
23 |
24 | **
25 |
26 | 4. Retrieving the top 5 most similar questions from the synthetic dataset
27 |
28 | **
29 |
30 | 5. Generating the SQL query using Zero-Shot Prompting
31 |
32 | **
33 |
34 | 6. Generating the SQL query using In-context Learning
35 |
36 | **
37 |
38 | ## Methodology
39 |
40 | TL;DR
41 | We used synthetic data to craft prompts to leverage In-Context Learning for Text-to-SQL models.
42 |
43 | The methodology consists of the following steps -
44 |
45 | - Synthetic Data Generation - Use ChatGPT to generate synthetic data of Natural Language Questions and Corresponding SQL queries for each table.
46 |
- Cosine Similarity Calculation - For each test question, select the 5 most similar questions from the synthetic dataset.
47 |
- In-context Learning (ICL) Prompting - Form a prompt using the top 5 most similar questions and feed it to the model to generate the SQL query.
48 |
49 |
50 |
51 |
52 | ### Synthetic Data Generation
53 | We used the ChatGPT Web API to generate a synthetic dataset of Natural Language Questions for each table using the following template -
54 |
55 | ```
56 | --- Prompt 1 ---
57 | Give SQL query for the following -
58 |
59 | Question:
60 |
61 |
62 | Table Schema: [*, id, remote_id .... ]
63 | Table Name :
64 |
65 | Some rows in the table looks like this:
66 |
67 |
68 |
69 |
70 | --- Prompt 2 ---
71 | Can you give me 10 different natural language questions and their corresponding SQL questions specific to this dataset in the format of a combined CSV file in the following format
72 |
73 | Index, Question, SQL Query
74 |
75 | The delimiter should be "|" instead of comma as a txt file
76 |
77 | --- Prompt 3 ---
78 | Give me more different natural language questions and the corresponding SQL queries in the same format starting from index 11 in a txt file
79 | ```
80 |
81 | The synthetic dataset generated for each table is present in ```/data/example_queries```. The ```complete_set``` folder contains all the synthetic data.
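
As a rough illustration of how these files can be read, here is a minimal sketch that assumes the `Index|Question|SQL Query` header and the `|` delimiter seen in `complete_set/earning.csv`. The helper name `load_examples` is hypothetical and is not taken from `src/`; a query that itself contains `|` would need extra handling.

```python
import csv
import io


def load_examples(handle):
    """Read (question, sql) pairs from a pipe-delimited file object.

    Hypothetical helper: assumes the header row is Index|Question|SQL Query.
    """
    reader = csv.DictReader(handle, delimiter="|")
    return [(row["Question"], row["SQL Query"]) for row in reader]


# In-memory stand-in for one of the files under data/example_queries.
sample = io.StringIO(
    "Index|Question|SQL Query\n"
    "1|How many distinct types of earnings are there?|"
    "SELECT COUNT(DISTINCT type) AS distinct_earnings_count FROM earning;\n"
    "3|Which employees have received a bonus?|"
    "SELECT remote_id FROM earning WHERE type = 'BONUS';\n"
)
examples = load_examples(sample)
```

With a real file, the same function can be called on `open(path, newline="")` instead of the in-memory sample.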
82 |
83 |
84 | Note - X is the number of synthetic questions to be generated. I used 40 synthetic questions for each table.
85 | For checking the syntax of the SQL queries, I ran all the queries over the table once (see: ```src/check_retr_data.py```).
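
The check described above can be sketched as follows. This is not the contents of `src/check_retr_data.py`; it is a minimal example that assumes a SQLite database and uses an in-memory table with the `earning` columns from `data/columns.json`. The function name `find_failing_queries` and the column types are assumptions.

```python
import sqlite3


def find_failing_queries(conn, queries):
    """Run each query once and collect (query, error message) for failures."""
    failures = []
    for query in queries:
        try:
            conn.execute(query).fetchall()
        except sqlite3.Error as exc:
            failures.append((query, str(exc)))
    return failures


# In-memory stand-in for data/database/final_db.db (column types are assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE earning (id TEXT, remote_id TEXT, employee_payroll_run TEXT, "
    "amount REAL, type TEXT, remote_was_deleted TEXT)"
)

queries = [
    "SELECT COUNT(DISTINCT type) AS distinct_earnings_count FROM earning;",
    "SELECT typ FROM earning;",  # deliberately misspelled column
]
failures = find_failing_queries(conn, queries)
```

Executing a query only shows that it parses and refers to existing tables and columns; it does not show that the query answers the question it is paired with.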
86 |
87 |
88 | ### Cosine Similarity Calculation
89 | We used the ```sentence-transformers``` library to calculate the cosine similarity between the test question and the synthetic questions generated for each table. The top k most similar questions are selected as prompts for the next step.
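
The retrieval step can be sketched as below. To keep the example self-contained, `embed` is a toy word-count embedding and only a stand-in: in the project the vectors come from a `sentence-transformers` model, and this is not the code in `src/text_sim.py`. The function names are hypothetical; the ranking logic (cosine similarity, then keep the top k) is the part being illustrated.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in embedding: lowercase word counts (NOT a sentence-transformers model)."""
    return Counter(text.lower().replace("?", "").split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(question, candidates, k=5):
    """Return the k candidate questions most similar to the test question."""
    query_vec = embed(question)
    scored = [(cosine_similarity(query_vec, embed(c)), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:k]]


synthetic_questions = [
    "What is the average amount of overtime pay?",
    "How many distinct types of earnings are there?",
    "Which employees have received a bonus?",
]
best = top_k_similar("How many distinct earning types exist?", synthetic_questions, k=1)
```

Swapping `embed` for dense sentence embeddings changes the vector representation but not the selection logic.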
90 |
91 |
92 | ### In-context Learning (ICL) Prompting
93 | We crafted a prompt that uses the top 5 most similar questions, together with their SQL queries, as in-context examples. The prompt also contains the table schema and the table name. The prompt is then fed to the model to generate the SQL query for the test question.
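
A minimal sketch of such a prompt builder is shown below. The exact prompt layout used in `src/main.py` is not reproduced here; the field labels and the function name `build_icl_prompt` are assumptions chosen to mirror the template in the Synthetic Data Generation section.

```python
def build_icl_prompt(table_name, columns, examples, question):
    """Assemble an in-context learning prompt (hypothetical layout).

    examples is a list of (question, sql) pairs retrieved from the synthetic dataset.
    """
    lines = [
        f"Table Name: {table_name}",
        f"Table Schema: [{', '.join(columns)}]",
        "",
    ]
    for example_question, example_sql in examples:
        lines.append(f"Question: {example_question}")
        lines.append(f"SQL Query: {example_sql}")
        lines.append("")
    # The test question goes last, with the answer slot left open for the model.
    lines.append(f"Question: {question}")
    lines.append("SQL Query:")
    return "\n".join(lines)


prompt = build_icl_prompt(
    table_name="earning",
    columns=["id", "remote_id", "employee_payroll_run", "amount", "type", "remote_was_deleted"],
    examples=[
        ("What is the total amount of earnings for each type?",
         "SELECT type, SUM(amount) AS total_earnings FROM earning GROUP BY type;"),
        ("Which employees have received a bonus?",
         "SELECT remote_id FROM earning WHERE type = 'BONUS';"),
    ],
    question="How many distinct types of earnings are there?",
)
```

Passing an empty `examples` list yields a schema-only prompt, which corresponds to the zero-shot setting.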
94 |
95 |
96 | ### Models Tested
97 | I tested the framework with the following models:
98 |
99 | ```
100 | juierror/flan-t5-text2sql-with-schema
101 | dawei756/text-to-sql-t5-spider-fine-tuned
102 | gpt2
103 | ```
104 |
105 | ## Observations
106 |
107 | #### ICL Prompting Improves Model Output over Zero-Shot Prompting
108 | Empirically, I have observed that the model produces much better queries with Retrieval-based ICL Prompting than Zero-Shot Prompting.
109 |
110 | For instance, take the question "How many distinct types of earnings are there?" for the table ```earning```.
111 |
112 | For Zero-Shot Prompting, the model outputs the following query which does not work -
113 | **
114 |
115 | For ICL Prompting, the model outputs the following query which works well and answers the question correctly -
116 | **
117 |
118 | #### Fine-tuned Models Outperform Pre-trained Models
119 |
120 | The models fine-tuned for the text-to-SQL task performed much better than the pre-trained models. The fine-tuned models I used are -
121 | ```
122 | juierror/flan-t5-text2sql-with-schema
123 | dawei756/text-to-sql-t5-spider-fine-tuned
124 | ```
125 |
126 | #### Providing Table Schemas Is Not Enough
127 | Some of the initial experiments I conducted involved providing only the table schemas in the prompt. In such a case, the model does not have access to the values of the columns in the table.
128 |
129 | To remedy this, I also tried providing some sample rows from the table in the prompt. While this did improve the results, it was not enough to generate the correct SQL query.
130 |
131 | To further improve the results, I started providing some example queries in the prompt. This helped the model generate the correct SQL query.
132 |
133 | #### Best Results
134 | The open-source model ```juierror/flan-t5-text2sql-with-schema``` produced the best results among the models I used for API calls. However, I believe the model's performance could be further improved by incorporating OpenAI APIs, specifically by leveraging the Chain of Thought Prompts and testing them on the OpenAI Playground. This combination, which includes retrieval-based ICL prompting, schema information, samples, and OpenAI API, yielded the best SQL query generation results.
135 |
136 | ## Future Work
137 |
138 | - Benchmarking the results - Benchmarking and comparing zero-shot and ICL Prompting results on a dataset would offer stronger evidence for the effectiveness of Retrieval-based ICL Prompting, complementing the observed improvements in query formulation through empirical observations.
139 |
- Extending the framework to multiple tables - The framework currently supports single-table queries, but extending it to accommodate multi-table queries would enhance its functionality.
140 |
- Breaking down the query formation process into smaller steps - Inspired by the work of Few-shot-NL2SQL-with-prompting, we can break down the query formation process into smaller steps and use the intermediate results to guide the model to generate the final SQL query. For example, we can prompt the model to estimate the tables that would be used in the query instead of asking the user.
141 |
142 |
143 |
144 |
145 | ## 👪 Contributing
146 | Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. For any detailed clarifications/issues, please email ndiwan2[at]illinois[dot]edu.
147 |
--------------------------------------------------------------------------------
/data/columns.json:
--------------------------------------------------------------------------------
1 | {
2 | "earning" : ["id", "remote_id", "employee_payroll_run", "amount", "type", "remote_was_deleted"],
3 | "employee" : ["id","remote_id","employee_number","company","first_name","last_name","display_full_name","username","groups","work_email","personal_email","mobile_phone_number","employments","home_location","work_location","manager","team","pay_group","ssn","gender","ethnicity","marital_status","date_of_birth","start_date","remote_created_at","employment_status","termination_date","avatar"],
4 | "employeepayrollrun" : ["id", "remote_id", "employee", "payroll_run", "gross_pay", "net_pay", "start_date", "end_date", "check_date", "earnings", "deductions", "taxes", "remote_was_deleted"],
5 | "group_final" : ["id", "remote_id", "parent_group", "name", "type", "remote_was_deleted"],
6 | "paygroup" : ["id", "remote_id", "pay_group_name", "remote_was_deleted"],
7 | "payrollrun" : ["id", "remote_id", "run_state", "run_type", "start_date", "end_date", "check_date", "remote_was_deleted"]
8 | }
--------------------------------------------------------------------------------
/data/csv_files/group_final.csv:
--------------------------------------------------------------------------------
1 | id,remote_id,parent_group,name,type,remote_was_deleted
2 | 43cad245-e8b4-4c53-971e-1e1176ed092b,47091065,3a52ba46-eadd-4747-9526-1b3f215b07c9,Cost Center 9695,COST_CENTER,FALSE
3 | 0dcd4ea7-7eb4-4745-8ac4-c7ddd7fdc11f,19475308,2df0ae47-2570-450b-be98-b376d0573969,Environmental education officer Department,DEPARTMENT,FALSE
4 | bcde1581-c99c-4207-98b1-219d15ff99b4,18307488,e834a380-e1f8-4148-8b06-f4fbcc7b180c,Group 8594,GROUP,FALSE
5 | 27e4de9c-4602-4c58-b032-fdc2a5818ebd,29661672,4772cf10-9a09-4aa8-acf0-b979044c3aad,Cost Center 4492,COST_CENTER,FALSE
6 | 425bceca-8a35-4255-a904-c69e5a432952,74969187,bd44ff2e-56cb-4c84-9738-91b98404ab3d,Astronomer Department,DEPARTMENT,FALSE
7 | 600aa46e-fd55-4c14-8d03-43f2c2fca81b,78724147,856542fb-0faa-4a89-bcd1-7d771806d918,Research scientist (physical sciences) Department,DEPARTMENT,FALSE
8 | 04181780-ac6d-47e3-8126-cbf773966335,10827323,e8ff8f5c-3e9b-4d03-b198-a2810cf3d96f,Meteorologist Department,DEPARTMENT,FALSE
9 | 99434549-f240-4945-a218-f749af3232cc,25386746,10a9fb19-e382-4be9-a103-7a076a314c79,Community development worker Team,TEAM,FALSE
10 | e011e638-77ec-420d-a92f-f4bc42519f60,15831193,0dcd4ea7-7eb4-4745-8ac4-c7ddd7fdc11f,Cost Center 2950,COST_CENTER,FALSE
11 | d0785220-e24e-4a35-ac5e-f7543e6b720f,33248761,3c3bec4e-00b3-41b1-83bf-4f5074ce7665,Cost Center 2362,COST_CENTER,TRUE
12 | 27f2a393-c77d-4b91-8501-434256f52d3b,14996181,4c8922c4-d0c3-4dd7-8206-84dd5b4c3ef9,Art gallery manager Team,TEAM,FALSE
13 | 9ebf104e-963f-4441-af79-440b77ee3ffd,76224061,3a52ba46-eadd-4747-9526-1b3f215b07c9,Business Unit 9648,BUSINESS_UNIT,FALSE
14 | 65ac5e8a-7844-45b1-9c26-e13a785a4339,55193679,27e4de9c-4602-4c58-b032-fdc2a5818ebd,Geoscientist Team,TEAM,TRUE
15 | 78b871a7-77b1-4406-8488-c49d78a1f4a1,86189128,3a52ba46-eadd-4747-9526-1b3f215b07c9,Paramedic Team,TEAM,TRUE
16 | 0800ea55-9a59-46e7-9317-708c5810306a,49310398,1e9ccf03-ac55-4ede-982e-8c54cf454be7,Cost Center 8326,COST_CENTER,FALSE
17 | 0d6d3c34-05e3-4765-b2e4-d4916d3524fa,46087924,89c56a05-effe-4809-b15a-08ff9bb45f59,Agricultural consultant Team,TEAM,TRUE
18 | ed9bee4f-02e6-4003-8062-8b4bd698cd75,32513187,7946a616-917c-4c17-b98a-f75061037648,Cost Center 2799,COST_CENTER,FALSE
19 | d690f763-6741-4e02-87ca-60b8a8088407,58007777,825cf3ef-1947-4959-b2f8-99829e640987,Business Unit 6102,BUSINESS_UNIT,TRUE
20 | db1d4c22-3fe4-40c1-be1b-b8b6e2a7385f,43579094,1e9ccf03-ac55-4ede-982e-8c54cf454be7,Business Unit 570,BUSINESS_UNIT,TRUE
21 | 8f24d95b-9570-482c-8679-9191bf121dd9,44546129,1f264398-6315-4cdc-810e-9e94dea6794f,Cost Center 9768,COST_CENTER,FALSE
22 | d9ab05ce-b74f-43db-bc81-9b0ac9a9f3a3,30371562,dc7faa72-2045-4f8c-8186-ba0f38e6db0f,Cost Center 4602,COST_CENTER,FALSE
23 | 1e9ccf03-ac55-4ede-982e-8c54cf454be7,68325244,3c40a023-7cc5-4ad7-b350-dde04c59109f,Architect Department,DEPARTMENT,TRUE
24 | 3063d441-8375-4313-8e57-26bbafeda1b8,19492908,300b3cea-3ae0-43c3-b07d-4d0b9e91306a,"Radiographer, diagnostic Team",TEAM,TRUE
25 | 923f897a-d74e-4edc-aae3-e824c2677494,93566334,5256fd65-0021-429a-8046-732cd3b77521,Business Unit 1502,BUSINESS_UNIT,TRUE
26 | 157253ec-8d3c-491d-b401-3d1e032b07b0,23394453,c685f3cb-5682-4981-93c3-86265b2c2e57,Business Unit 6935,BUSINESS_UNIT,FALSE
27 | 9ee7c415-8bb3-4509-8673-b2e2fe72f423,14567439,5b876ea5-b369-4d97-b302-6d6a73f30cc8,Production manager Department,DEPARTMENT,TRUE
28 | 319a6653-6eaf-4fa4-9bc2-405241069bc8,70562204,c3cafba4-8ebf-41c0-b106-10ec64bfe876,Business Unit 2457,BUSINESS_UNIT,TRUE
29 | 7fd1b412-90a2-472a-90b6-fa01d10339c6,97699467,39dfcc6b-bae4-421b-91a5-d60e4966e70e,Business Unit 4015,BUSINESS_UNIT,TRUE
30 | 5999ad24-c76a-4571-9b50-a82aec515a0b,17519229,dc7faa72-2045-4f8c-8186-ba0f38e6db0f,Cost Center 8503,COST_CENTER,TRUE
31 | 8175d866-21d5-483c-928a-5ebc33779a36,14833059,dc7faa72-2045-4f8c-8186-ba0f38e6db0f,Group 6491,GROUP,TRUE
32 | 87698d19-8072-4884-afff-a9b455c071e7,59438341,0800ea55-9a59-46e7-9317-708c5810306a,Health service manager Team,TEAM,TRUE
33 | 0bcb6c5a-e2f7-45d5-ab87-49a2f5121515,14344856,b16df9bf-9248-4b0a-a5d2-72fdd1ceb440,Cost Center 5912,COST_CENTER,FALSE
34 | c247c69f-0a6d-47e6-912f-ce404ce39920,75521383,977695b6-d1bd-4ad9-b250-71ea7ebf010e,Cost Center 3830,COST_CENTER,FALSE
35 | 1f264398-6315-4cdc-810e-9e94dea6794f,65333758,12064004-a2ac-4207-bc78-4bafdd056bfc,Business Unit 9796,BUSINESS_UNIT,TRUE
36 | 137e2127-5003-422c-8ae0-7fe5fe10c355,53163195,3c3bec4e-00b3-41b1-83bf-4f5074ce7665,Architect Department,DEPARTMENT,FALSE
37 | f6d0055d-6697-42eb-aecf-e8d6ad68ed95,74014197,8f43edf1-ae6c-4cdb-ad6d-30c0cc662763,Wellsite geologist Department,DEPARTMENT,FALSE
38 | a35da513-294a-4a45-aaae-86006eef6510,48021302,d690f763-6741-4e02-87ca-60b8a8088407,"Radiographer, therapeutic Department",DEPARTMENT,TRUE
39 | 0fd7d20c-bac7-45cb-9008-17808f877597,84694614,d0785220-e24e-4a35-ac5e-f7543e6b720f,Tax adviser Team,TEAM,FALSE
40 | c2fc7bc9-4559-4d09-9e55-2c3e08c59ba4,21379896,4772cf10-9a09-4aa8-acf0-b979044c3aad,Dance movement psychotherapist Team,TEAM,FALSE
41 | 3c40a023-7cc5-4ad7-b350-dde04c59109f,33032111,f66a1688-32f7-4e3c-a8b2-b815a03dad9b,Emergency planning/management officer Department,DEPARTMENT,TRUE
42 | 3d195912-b964-4fdf-b73b-42d7634be2e2,47447527,99434549-f240-4945-a218-f749af3232cc,"Pharmacist, hospital Department",DEPARTMENT,FALSE
43 | 977695b6-d1bd-4ad9-b250-71ea7ebf010e,12559104,bd44ff2e-56cb-4c84-9738-91b98404ab3d,"Accountant, chartered certified Department",DEPARTMENT,FALSE
44 | d503c50a-0f24-4f6b-acd6-29e476805ce0,59214811,a35da513-294a-4a45-aaae-86006eef6510,Cost Center 4131,COST_CENTER,FALSE
45 | 7963b58c-0be9-4594-8bd2-dae1e02b4bb2,71709818,d0785220-e24e-4a35-ac5e-f7543e6b720f,Horticultural consultant Department,DEPARTMENT,TRUE
46 | 406dbcf9-72c8-4bf5-b8b9-07b03f70af8e,15346875,0fd7d20c-bac7-45cb-9008-17808f877597,"Scientist, product/process development Team",TEAM,FALSE
47 | 825cf3ef-1947-4959-b2f8-99829e640987,56096673,aeead089-00c2-4446-a7cf-1d3bb8584895,Paediatric nurse Department,DEPARTMENT,TRUE
48 | c3cafba4-8ebf-41c0-b106-10ec64bfe876,55887103,be475fed-b0d7-4942-b2c3-3405f82f0f3b,Business Unit 252,BUSINESS_UNIT,FALSE
49 | 6d982475-d6a9-40c4-8f59-a0c06dc94af9,54168189,f6d0055d-6697-42eb-aecf-e8d6ad68ed95,English as a second language teacher Team,TEAM,FALSE
50 | ea387ad0-9326-49df-b90a-e063cc99752c,89261236,08e4686f-e0de-42dd-9fac-dcd6558f80bb,Business Unit 4893,BUSINESS_UNIT,TRUE
51 | 4c8922c4-d0c3-4dd7-8206-84dd5b4c3ef9,24812941,1d12d4d5-2310-459d-8873-232386635252,Cost Center 8348,COST_CENTER,TRUE
52 | 43c3bc79-910d-4c47-9e47-cd5251089123,46773267,dc7faa72-2045-4f8c-8186-ba0f38e6db0f,Business Unit 6267,BUSINESS_UNIT,FALSE
53 | 6990d80d-1cac-42b9-bfd2-0610974abfa6,79678206,3c40a023-7cc5-4ad7-b350-dde04c59109f,"Solicitor, Scotland Team",TEAM,FALSE
54 | 72b2cb2d-297e-4890-9c5c-b575da7f03dd,94430287,5a447713-b8dc-4a1c-aab3-3ec3e1460c51,Advertising copywriter Team,TEAM,TRUE
55 | c536f3ab-0912-442c-b53e-496af749d25e,26538427,3c3bec4e-00b3-41b1-83bf-4f5074ce7665,Diplomatic Services operational officer Team,TEAM,TRUE
56 | b150da1e-7a3f-43a4-9204-e657b74af8f7,12035199,27e4de9c-4602-4c58-b032-fdc2a5818ebd,Cost Center 2084,COST_CENTER,FALSE
57 | cbec3de6-077f-463a-8e76-0b094cbc2fb4,98333885,425bceca-8a35-4255-a904-c69e5a432952,Airline pilot Team,TEAM,FALSE
58 | de6248e1-7c8d-43c6-9fc8-2292e46b0f05,99878246,682d41f9-500c-4de4-8eb7-6d9b97c8f35e,Writer Department,DEPARTMENT,FALSE
59 | dc7faa72-2045-4f8c-8186-ba0f38e6db0f,26844391,99434549-f240-4945-a218-f749af3232cc,Group 8260,GROUP,FALSE
60 | 1d12d4d5-2310-459d-8873-232386635252,27966909,5a447713-b8dc-4a1c-aab3-3ec3e1460c51,Group 3676,GROUP,FALSE
61 | 682d41f9-500c-4de4-8eb7-6d9b97c8f35e,49375157,127c4ea1-1a06-4896-aae5-1d1995dc4545,"Producer, television/film/video Team",TEAM,FALSE
62 | 3a52ba46-eadd-4747-9526-1b3f215b07c9,73099570,08e4686f-e0de-42dd-9fac-dcd6558f80bb,Group 5785,GROUP,TRUE
63 | 5256fd65-0021-429a-8046-732cd3b77521,25143577,1f264398-6315-4cdc-810e-9e94dea6794f,Quality manager Department,DEPARTMENT,TRUE
64 | c685f3cb-5682-4981-93c3-86265b2c2e57,65867775,27e4de9c-4602-4c58-b032-fdc2a5818ebd,Business Unit 8270,BUSINESS_UNIT,FALSE
65 | 8d809e31-6bf0-4840-b31d-e445027f2306,40198956,562a1979-da9f-400c-92b1-f72d2a4c3d11,Patent examiner Department,DEPARTMENT,FALSE
66 | 04ed7a5c-88df-42f1-92d0-bf140006dc0e,55481288,426c39ad-295e-4c3e-9042-10805bbcc630,Cost Center 5579,COST_CENTER,FALSE
67 | 4772cf10-9a09-4aa8-acf0-b979044c3aad,68547732,70a081b3-2331-4fd9-bab6-d601331f266d,Cost Center 320,COST_CENTER,TRUE
68 | 83601212-47d7-402e-91ce-953fba897f5b,26911053,682d41f9-500c-4de4-8eb7-6d9b97c8f35e,Cost Center 6786,COST_CENTER,TRUE
69 | e7735b41-1e0c-4539-a3f1-3907a51cb02b,23211719,319a6653-6eaf-4fa4-9bc2-405241069bc8,Cost Center 9923,COST_CENTER,FALSE
70 | e8ff8f5c-3e9b-4d03-b198-a2810cf3d96f,20552980,935e995e-d883-41f7-823a-7ceefbf6b2f3,Animator Department,DEPARTMENT,TRUE
71 | 37c57196-2dcd-4946-ae87-9cb2ea72558c,23203593,c247c69f-0a6d-47e6-912f-ce404ce39920,Business Unit 9624,BUSINESS_UNIT,FALSE
72 | 889c8643-a459-4389-a5be-ee8207542071,58624420,3c40a023-7cc5-4ad7-b350-dde04c59109f,Business Unit 6427,BUSINESS_UNIT,FALSE
73 | 31829cf9-823f-4f53-8005-eab6f5ec2fe0,13100868,c786220f-71c9-466d-8233-f3f818ed5781,Cost Center 418,COST_CENTER,TRUE
74 | cee0583b-30fe-4651-88b7-bdcd440b10a8,11264857,39dfcc6b-bae4-421b-91a5-d60e4966e70e,"Administrator, local government Department",DEPARTMENT,FALSE
75 | 70a081b3-2331-4fd9-bab6-d601331f266d,36705171,426c39ad-295e-4c3e-9042-10805bbcc630,"Engineer, mining Department",DEPARTMENT,FALSE
76 | f66a1688-32f7-4e3c-a8b2-b815a03dad9b,37433417,4c8922c4-d0c3-4dd7-8206-84dd5b4c3ef9,"Horticulturist, commercial Department",DEPARTMENT,TRUE
77 | a1dbafc8-2523-461e-83ba-0338582351e5,62025074,5b876ea5-b369-4d97-b302-6d6a73f30cc8,Group 8799,GROUP,TRUE
78 | bd44ff2e-56cb-4c84-9738-91b98404ab3d,37945977,6a4e5d60-ae91-4bd4-a30d-71e2ffe666c5,Dealer Team,TEAM,FALSE
79 | ccd5f4c5-00cd-47b6-995c-d14cdbd76db2,23762385,c536f3ab-0912-442c-b53e-496af749d25e,Physiological scientist Department,DEPARTMENT,TRUE
80 | 5b876ea5-b369-4d97-b302-6d6a73f30cc8,83771923,d1056987-b78f-49af-8ca5-28f8955f4464,Technical sales engineer Team,TEAM,FALSE
81 | 39dfcc6b-bae4-421b-91a5-d60e4966e70e,64586221,87698d19-8072-4884-afff-a9b455c071e7,Cost Center 3008,COST_CENTER,FALSE
82 | a8e50a03-8a64-4bed-af76-494201dcb4c1,59985811,e906b606-8395-4e7d-ae22-9ac7a3b9f3ef,Business Unit 5573,BUSINESS_UNIT,TRUE
83 | 08e4686f-e0de-42dd-9fac-dcd6558f80bb,65412397,74692f13-59e9-4ee7-b8f4-a53a7eeebac5,Magazine features editor Department,DEPARTMENT,FALSE
84 | 12064004-a2ac-4207-bc78-4bafdd056bfc,51815649,0c1b3501-5bfc-48c2-a309-90e33d58549a,Group 7074,GROUP,FALSE
85 | 6a4e5d60-ae91-4bd4-a30d-71e2ffe666c5,53043545,1a15d48e-f29d-4245-86d0-4f2de1fe99a8,Business Unit 4069,BUSINESS_UNIT,TRUE
86 | c365541a-9ad0-472d-b83d-8e013da8159b,97482688,1d12d4d5-2310-459d-8873-232386635252,Business Unit 5827,BUSINESS_UNIT,TRUE
87 | b16df9bf-9248-4b0a-a5d2-72fdd1ceb440,94355331,0c1b3501-5bfc-48c2-a309-90e33d58549a,Business Unit 6174,BUSINESS_UNIT,TRUE
88 | dc5580d7-0ec4-4830-885c-8a99b9b7a04b,52057019,c734f48d-572e-4d9d-900d-226936605b96,Cost Center 9710,COST_CENTER,TRUE
89 | 562a1979-da9f-400c-92b1-f72d2a4c3d11,22208450,f4fe6d7a-98fa-4e3a-b4a2-83d9caac3e4f,Cost Center 4702,COST_CENTER,TRUE
90 | 74692f13-59e9-4ee7-b8f4-a53a7eeebac5,75544406,5999ad24-c76a-4571-9b50-a82aec515a0b,Business Unit 6124,BUSINESS_UNIT,FALSE
91 | cc07b0af-0cdc-4b69-83cc-5482b04ee7d6,86298291,930d79b1-c130-48f2-b26f-30f9f8677f4a,Land Team,TEAM,FALSE
92 | 73451513-0e2d-4a5d-911f-a79e85635400,44017564,3a52ba46-eadd-4747-9526-1b3f215b07c9,"Engineer, drilling Department",DEPARTMENT,FALSE
93 | d5a21ad4-9afe-4d0f-9838-4b13bb2fe477,98784608,ccd5f4c5-00cd-47b6-995c-d14cdbd76db2,"Editor, magazine features Team",TEAM,TRUE
94 | 127c4ea1-1a06-4896-aae5-1d1995dc4545,60019693,515b8deb-95da-4930-b74a-3fb43b425aa7,Cost Center 5511,COST_CENTER,TRUE
95 | 30f58e59-ace9-491e-9bcb-16ddb783448f,52765100,be475fed-b0d7-4942-b2c3-3405f82f0f3b,Business Unit 5623,BUSINESS_UNIT,TRUE
96 | ccde1d75-60f5-4983-9cce-5f8afd6df5cf,49419992,abea7ee3-788a-4a1f-8826-185b85f4d579,Furniture designer Team,TEAM,FALSE
97 | 5a447713-b8dc-4a1c-aab3-3ec3e1460c51,18596650,65ac5e8a-7844-45b1-9c26-e13a785a4339,Group 3969,GROUP,FALSE
98 | 1a15d48e-f29d-4245-86d0-4f2de1fe99a8,97110412,137e2127-5003-422c-8ae0-7fe5fe10c355,Cost Center 3848,COST_CENTER,FALSE
99 | 24247e71-7f53-4ac1-be5d-6a7cb49d0c21,51674757,bd44ff2e-56cb-4c84-9738-91b98404ab3d,"Radiographer, diagnostic Team",TEAM,TRUE
100 | 1a2ed032-7acd-4c53-bc75-22dc71be5cbd,53568113,6990d80d-1cac-42b9-bfd2-0610974abfa6,Business Unit 4836,BUSINESS_UNIT,FALSE
101 | e906b606-8395-4e7d-ae22-9ac7a3b9f3ef,82656384,c2fc7bc9-4559-4d09-9e55-2c3e08c59ba4,Cost Center 3606,COST_CENTER,TRUE
102 | 75c44d49-1e2b-4324-ba05-526e1c8b8837,48648713,5a447713-b8dc-4a1c-aab3-3ec3e1460c51,Cost Center 406,COST_CENTER,TRUE
103 | b80f9630-36e6-4bb1-9d98-7b5599f6949a,39400067,e906b606-8395-4e7d-ae22-9ac7a3b9f3ef,Group 7497,GROUP,TRUE
104 | 1af65d67-0b9b-4e9e-aa8b-31def4126f24,72768689,f66a1688-32f7-4e3c-a8b2-b815a03dad9b,"Engineer, automotive Team",TEAM,TRUE
105 | b825b37d-437e-4ed0-8a66-3d74621c7af2,68563434,923f897a-d74e-4edc-aae3-e824c2677494,Business Unit 8839,BUSINESS_UNIT,TRUE
106 | 8f43edf1-ae6c-4cdb-ad6d-30c0cc662763,30898269,83601212-47d7-402e-91ce-953fba897f5b,Cost Center 6523,COST_CENTER,FALSE
107 | 10a9fb19-e382-4be9-a103-7a076a314c79,39643760,1a15d48e-f29d-4245-86d0-4f2de1fe99a8,"Administrator, Civil Service Department",DEPARTMENT,FALSE
108 | 426c39ad-295e-4c3e-9042-10805bbcc630,11725552,9ebf104e-963f-4441-af79-440b77ee3ffd,Group 5559,GROUP,FALSE
109 | f91075f8-6c78-4f89-83c3-6a4cd8e91f44,38925454,f002ee36-6481-4150-b240-4b00bf371d8b,Group 6722,GROUP,TRUE
110 | 6df33dbe-e26c-4673-b524-3678dca2b350,91620010,c247c69f-0a6d-47e6-912f-ce404ce39920,"Engineer, maintenance (IT) Team",TEAM,FALSE
111 | cdc70c48-27b9-4f13-aaef-08ba2f4867aa,50875228,27e4de9c-4602-4c58-b032-fdc2a5818ebd,Firefighter Department,DEPARTMENT,FALSE
112 | 7946a616-917c-4c17-b98a-f75061037648,71653905,83601212-47d7-402e-91ce-953fba897f5b,Telecommunications researcher Department,DEPARTMENT,FALSE
113 | 606f3cf3-85f8-4cff-9766-1b78228b1f8c,31614993,6d982475-d6a9-40c4-8f59-a0c06dc94af9,Group 9122,GROUP,FALSE
114 | d95502bb-26d5-46d7-acf8-5ec3a6b266f9,27912310,b150da1e-7a3f-43a4-9204-e657b74af8f7,Glass blower/designer Department,DEPARTMENT,FALSE
115 | 05454df5-8d4a-458b-bafb-1e4fe2a93779,99888610,34008cf4-6c47-44e7-bb9f-88dab099e2f6,Group 8555,GROUP,FALSE
116 | 06de6277-bde1-43c8-b3be-9e94ea92393c,76312038,319a6653-6eaf-4fa4-9bc2-405241069bc8,Technical author Department,DEPARTMENT,FALSE
117 | de016ec9-0530-4a60-994a-6c6c4bb00714,24386152,78b871a7-77b1-4406-8488-c49d78a1f4a1,Group 3115,GROUP,TRUE
118 | 935e995e-d883-41f7-823a-7ceefbf6b2f3,40715065,43c3bc79-910d-4c47-9e47-cd5251089123,Psychotherapist Department,DEPARTMENT,FALSE
119 | 3c3bec4e-00b3-41b1-83bf-4f5074ce7665,61730749,b697cd77-5381-448c-85e7-5fc748f6a71a,Cost Center 8932,COST_CENTER,TRUE
120 | 89c56a05-effe-4809-b15a-08ff9bb45f59,11033733,b91f019a-cda0-4700-8061-b37c265a9a98,"Scientist, audiological Team",TEAM,FALSE
121 | 2b3b2bea-5f85-468d-9e49-01ac6fbb68ff,68779615,426c39ad-295e-4c3e-9042-10805bbcc630,Business Unit 6061,BUSINESS_UNIT,FALSE
122 | 550f743c-4ece-4d8f-9299-bd34cdd45335,92767293,5a447713-b8dc-4a1c-aab3-3ec3e1460c51,Business Unit 8888,BUSINESS_UNIT,FALSE
123 | df198e9d-ce4c-4c13-8d3f-bb138bd661ff,86341642,a35da513-294a-4a45-aaae-86006eef6510,Cost Center 9558,COST_CENTER,FALSE
124 | 300b3cea-3ae0-43c3-b07d-4d0b9e91306a,39913294,0bcb6c5a-e2f7-45d5-ab87-49a2f5121515,Cost Center 9519,COST_CENTER,TRUE
125 | abea7ee3-788a-4a1f-8826-185b85f4d579,53963421,977695b6-d1bd-4ad9-b250-71ea7ebf010e,Business Unit 4104,BUSINESS_UNIT,FALSE
126 | d5cedcb8-eff3-4c6b-82a0-86dc93c57944,99830984,bcde1581-c99c-4207-98b1-219d15ff99b4,Cost Center 4037,COST_CENTER,TRUE
127 | b256af9f-8012-4bdc-aa3e-18c3e3c3763b,30832163,5a447713-b8dc-4a1c-aab3-3ec3e1460c51,Cost Center 2065,COST_CENTER,FALSE
128 | 939a3149-6e18-490c-b4bc-b86822d1ca68,40640578,f6d0055d-6697-42eb-aecf-e8d6ad68ed95,Art therapist Department,DEPARTMENT,TRUE
129 | 5c3d162f-54b5-4dbc-8fa5-e5f5171097b3,77592829,9ebf104e-963f-4441-af79-440b77ee3ffd,Cost Center 8968,COST_CENTER,TRUE
130 | 8b3caf64-5b4d-41d0-b585-479cf88d1ff6,81473320,04181780-ac6d-47e3-8126-cbf773966335,Business Unit 215,BUSINESS_UNIT,FALSE
131 | 2df0ae47-2570-450b-be98-b376d0573969,13618612,b256af9f-8012-4bdc-aa3e-18c3e3c3763b,Group 3416,GROUP,FALSE
132 | f002ee36-6481-4150-b240-4b00bf371d8b,48599683,04181780-ac6d-47e3-8126-cbf773966335,"Therapist, sports Department",DEPARTMENT,FALSE
133 | c113c784-80ec-40b1-ba68-713da5327f99,85617294,aeead089-00c2-4446-a7cf-1d3bb8584895,Animator Department,DEPARTMENT,FALSE
134 | eef4e627-e22b-4ade-865d-643a76f4d8f0,69616207,06de6277-bde1-43c8-b3be-9e94ea92393c,Psychiatric nurse Department,DEPARTMENT,TRUE
135 | c734f48d-572e-4d9d-900d-226936605b96,27003253,606f3cf3-85f8-4cff-9766-1b78228b1f8c,Cost Center 6327,COST_CENTER,TRUE
136 | c786220f-71c9-466d-8233-f3f818ed5781,53343350,3063d441-8375-4313-8e57-26bbafeda1b8,Theme park manager Department,DEPARTMENT,TRUE
137 | 515b8deb-95da-4930-b74a-3fb43b425aa7,45696510,27f2a393-c77d-4b91-8501-434256f52d3b,Group 6765,GROUP,FALSE
138 | f4fe6d7a-98fa-4e3a-b4a2-83d9caac3e4f,55443287,1a2ed032-7acd-4c53-bc75-22dc71be5cbd,Group 5769,GROUP,FALSE
139 | 0c1b3501-5bfc-48c2-a309-90e33d58549a,60061042,0bcb6c5a-e2f7-45d5-ab87-49a2f5121515,"Accountant, chartered certified Team",TEAM,FALSE
140 | a108d4ae-cdbd-4e44-ac57-0bafb2b76df9,83262234,c113c784-80ec-40b1-ba68-713da5327f99,Group 9592,GROUP,TRUE
141 | b90d3726-fff2-4c4a-872e-dd3091bf0c44,47183580,515b8deb-95da-4930-b74a-3fb43b425aa7,Veterinary surgeon Team,TEAM,TRUE
142 | 6429c582-ed9e-4530-86c2-e2b1d31a3b92,31335523,550f743c-4ece-4d8f-9299-bd34cdd45335,Television camera operator Team,TEAM,FALSE
143 | 856542fb-0faa-4a89-bcd1-7d771806d918,97844833,dc7faa72-2045-4f8c-8186-ba0f38e6db0f,Business Unit 1309,BUSINESS_UNIT,TRUE
144 | e834a380-e1f8-4148-8b06-f4fbcc7b180c,91168613,cdc70c48-27b9-4f13-aaef-08ba2f4867aa,Cost Center 1015,COST_CENTER,FALSE
145 | 930d79b1-c130-48f2-b26f-30f9f8677f4a,69807144,d5a21ad4-9afe-4d0f-9838-4b13bb2fe477,"Therapist, drama Department",DEPARTMENT,TRUE
146 | d1056987-b78f-49af-8ca5-28f8955f4464,85259132,562a1979-da9f-400c-92b1-f72d2a4c3d11,Group 1556,GROUP,FALSE
147 | b697cd77-5381-448c-85e7-5fc748f6a71a,70247034,3063d441-8375-4313-8e57-26bbafeda1b8,Business Unit 5949,BUSINESS_UNIT,TRUE
148 | aeead089-00c2-4446-a7cf-1d3bb8584895,84492052,1f264398-6315-4cdc-810e-9e94dea6794f,Group 2758,GROUP,FALSE
149 | be475fed-b0d7-4942-b2c3-3405f82f0f3b,29589018,b80f9630-36e6-4bb1-9d98-7b5599f6949a,Group 2498,GROUP,FALSE
150 | 34008cf4-6c47-44e7-bb9f-88dab099e2f6,34588239,3d195912-b964-4fdf-b73b-42d7634be2e2,Microbiologist Department,DEPARTMENT,TRUE
151 | b91f019a-cda0-4700-8061-b37c265a9a98,90959828,550f743c-4ece-4d8f-9299-bd34cdd45335,Cost Center 5136,COST_CENTER,TRUE
--------------------------------------------------------------------------------
/data/csv_files/paygroup.csv:
--------------------------------------------------------------------------------
1 | id,remote_id,pay_group_name,remote_was_deleted
2 | e3811390-eb58-4df1-b584-45cec0fb782a,94082417,Intern,FALSE
3 | 960fc07d-5498-4434-9ede-260f71953829,14527836,Intern,TRUE
4 | 04790b29-05b1-4940-a6c2-10a531eec967,38075219,Part Time,FALSE
5 | 741688e4-24af-40e9-a8c1-a17858b65182,16969302,Part Time,FALSE
6 | 888d5ee8-f2b8-4d26-af49-4d2db2dbd4cc,60159115,Full Time,TRUE
7 | 32fe1da2-17a9-45c1-a739-750b87ff3208,44695553,Part Time,TRUE
8 | 864829a5-e07d-4b8f-9071-114241cef12d,58005905,Full Time,TRUE
9 | 0d590557-b001-49b3-9bb7-9ced50d9f0f1,19846453,Temp,FALSE
10 | 565a25f6-c3a8-4170-898c-e2c849ed36e4,81443812,Temp,TRUE
11 | 87fdc45a-257f-40a3-96a9-d3ce4b678555,50179773,Contractor,FALSE
12 | b5fad43e-0e69-45ee-ac2f-4781b560d69b,25315543,Intern,FALSE
13 | d85019d4-19c2-4329-8dc2-fdb7e8ec68f9,14396625,Temp,FALSE
14 | 6c47f4a0-f48c-4736-ad24-d5559c374e01,72729901,Intern,TRUE
15 | a2124a15-ee9f-4f95-842d-cc628b981aa7,13270787,Full Time,TRUE
16 | 3cc64f00-0a08-4ba0-86de-9336d27a5231,94726128,Temp,FALSE
17 | 99d0217a-a881-47a8-aaa5-60252b9b1f2a,18376598,Full Time,TRUE
18 | 263b3fa3-fd1a-4958-adef-14074b7bbb89,84449283,Part Time,FALSE
--------------------------------------------------------------------------------
/data/csv_files/payrollrun.csv:
--------------------------------------------------------------------------------
1 | id,remote_id,run_state,run_type,start_date,end_date,check_date,remote_was_deleted
2 | 3888d17b-094f-4fbe-b6e2-72de31e25b88,86137645,CLOSED,SIGN_ON_BONUS,2022-09-20,2022-10-04,2022-10-04,TRUE
3 | 0562335a-25b5-4bfd-bc1e-f3fe8bd70cd0,35245594,CLOSED,TERMINATION,2022-10-04,2022-10-18,2022-10-18,TRUE
4 | d7d4f172-c018-4db3-8cb9-d08f12b25322,39524768,PAID,REGULAR,2022-10-18,2022-11-01,2022-11-01,FALSE
5 | 8790b00f-e440-482a-b7fc-a3fb95c0d760,22630897,FAILED,CORRECTION,2022-11-01,2022-11-15,2022-11-15,FALSE
6 | 7adc002c-5276-4ba7-b3ae-1a018d6abf0b,53756676,PAID,REGULAR,2022-11-15,2022-11-29,2022-11-29,FALSE
7 | 03f33031-d61c-4b55-babc-a46d94003bd0,59174739,DRAFT,CORRECTION,2022-11-29,2022-12-13,2022-12-13,TRUE
8 | 9798a296-e5d8-4480-94c2-d5681fdb0ac7,15247010,APPROVED,SIGN_ON_BONUS,2022-12-13,2022-12-27,2022-12-27,FALSE
9 | 528e3da5-47d7-4f10-a4df-8bbfcd158329,79542196,FAILED,TERMINATION,2022-12-27,2023-01-10,2023-01-10,TRUE
10 | d6d1dfbf-1fef-477a-8943-844feddb183c,97959295,APPROVED,CORRECTION,2023-01-10,2023-01-24,2023-01-24,FALSE
11 | b8a164a2-ff9c-499e-9181-c695ff50f5d1,54741712,PAID,SIGN_ON_BONUS,2023-01-24,2023-02-07,2023-02-07,TRUE
12 | 44cf3101-aea6-4dea-bd56-4ddb53600701,59441461,DRAFT,OFF_CYCLE,2023-02-07,2023-02-21,2023-02-21,TRUE
13 | 9491c51c-aa1d-476a-9e13-ccf0a2928ef0,39072415,DRAFT,TERMINATION,2023-02-21,2023-03-07,2023-03-07,TRUE
14 | 21872d6d-a0bc-4073-bbb0-9d3423e4f5c8,23250898,DRAFT,CORRECTION,2023-03-07,2023-03-21,2023-03-21,FALSE
15 | 8961771d-735d-4cae-83f1-226c75cfb04e,45166759,PAID,SIGN_ON_BONUS,2023-03-21,2023-04-04,2023-04-04,FALSE
16 | 914ec8c0-2bfb-4155-8b50-d0af1eb943ae,87155677,DRAFT,CORRECTION,2023-04-04,2023-04-18,2023-04-18,TRUE
17 | 0886a73a-4efe-4268-9341-048fd9ee56a0,18984129,DRAFT,SIGN_ON_BONUS,2023-04-18,2023-05-02,2023-05-02,TRUE
18 | 2783b97e-fc42-40ae-8d52-d510e883350b,17607453,APPROVED,OFF_CYCLE,2023-05-02,2023-05-16,2023-05-16,FALSE
--------------------------------------------------------------------------------
/data/database/final_db.db:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nirav0999/NL2SQL-LLM/a28b819232ca6ff8d1873f1447dc13a6f977b94a/data/database/final_db.db
--------------------------------------------------------------------------------
/data/example_queries/complete_set/earning.csv:
--------------------------------------------------------------------------------
1 | Index|Question|SQL Query
2 | 1|How many distinct types of earnings are there?|SELECT COUNT(DISTINCT type) AS distinct_earnings_count FROM earning;
3 | 2|What is the total amount of salary earned by each employee?|SELECT remote_id, SUM(amount) AS total_salary FROM earning WHERE type = 'SALARY' GROUP BY remote_id;
4 | 3|Which employees have received a bonus?|SELECT remote_id FROM earning WHERE type = 'BONUS';
5 | 4|What is the average amount of overtime pay?|SELECT AVG(amount) AS average_overtime_pay FROM earning WHERE type = 'OVERTIME';
6 | 5|How many earnings records have been deleted?|SELECT COUNT(*) AS deleted_earnings_count FROM earning WHERE remote_was_deleted = TRUE;
7 | 6|What are the distinct employee payroll runs in the dataset?|SELECT DISTINCT employee_payroll_run FROM earning;
8 | 7|Which employee has the highest earning amount?|SELECT remote_id, MAX(amount) AS highest_earning FROM earning;
9 | 8|What is the total amount of earnings for each type?|SELECT type, SUM(amount) AS total_earnings FROM earning GROUP BY type;
10 | 9|How many earnings records are there in total?|SELECT COUNT(*) AS total_earnings_count FROM earning;
11 | 10|Which employee has received the highest bonus?|SELECT remote_id FROM earning WHERE type = 'BONUS' ORDER BY amount DESC LIMIT 1;
12 | 11|What is the minimum amount of salary earned?|SELECT MIN(amount) AS minimum_salary FROM earning WHERE type = 'SALARY';
13 | 12|How many employees have received a bonus greater than 5000?|SELECT COUNT(*) AS employees_with_bonus_gt_5000 FROM earning WHERE type = 'BONUS' AND amount > 5000;
14 | 13|Which type of earning has the highest amount?|SELECT type, MAX(amount) AS highest_amount FROM earning GROUP BY type;
15 | 14|What is the average amount of salary earned?|SELECT AVG(amount) AS average_salary FROM earning WHERE type = 'SALARY';
16 | 15|What is the total amount of earnings for each employee?|SELECT remote_id, SUM(amount) AS total_earnings FROM earning GROUP BY remote_id;
17 | 16|How many distinct employee IDs are there in the dataset?|SELECT COUNT(DISTINCT remote_id) AS distinct_employee_ids FROM earning;
18 | 17|What are the top 5 highest earning employees?|SELECT remote_id, SUM(amount) AS total_earnings FROM earning GROUP BY remote_id ORDER BY total_earnings DESC LIMIT 5;
19 | 18|Which earnings records have not been deleted?|SELECT * FROM earning WHERE remote_was_deleted = FALSE;
20 | 19|How many different types of earnings does each employee have?|SELECT remote_id, COUNT(DISTINCT type) AS distinct_earning_types FROM earning GROUP BY remote_id;
21 | 20|What is the total amount of earnings for each employee's payroll run?|SELECT employee_payroll_run, SUM(amount) AS total_earnings FROM earning GROUP BY employee_payroll_run;
22 | 21|What is the highest earning amount for each type of earning?|SELECT type, MAX(amount) AS highest_earning FROM earning GROUP BY type;
23 | 22|Which employees have received both salary and overtime pay?|SELECT remote_id FROM earning WHERE type IN ('SALARY', 'OVERTIME') GROUP BY remote_id HAVING COUNT(DISTINCT type) = 2;
24 | 23|What is the average amount of bonus earned by each employee?|SELECT remote_id, AVG(amount) AS average_bonus FROM earning WHERE type = 'BONUS' GROUP BY remote_id;
25 | 24|How many earnings records are associated with each remote ID?|SELECT remote_id, COUNT(*) AS earnings_count FROM earning GROUP BY remote_id;
26 | 25|Which employee has received the highest total earnings?|SELECT remote_id, SUM(amount) AS total_earnings FROM earning GROUP BY remote_id ORDER BY total_earnings DESC LIMIT 1;
27 | 26|What is the total amount of earnings for each payroll run?|SELECT employee_payroll_run, SUM(amount) AS total_earnings FROM earning GROUP BY employee_payroll_run;
28 | 27|Which employee has received earnings in all available types?|SELECT remote_id FROM earning GROUP BY remote_id HAVING COUNT(DISTINCT type) = (SELECT COUNT(DISTINCT type) FROM earning);
29 | 28|What is the average amount of earnings for each employee's payroll run?|SELECT employee_payroll_run, AVG(amount) AS average_earnings FROM earning GROUP BY employee_payroll_run;
30 | 29|How many employees have received more than one type of earning?|SELECT COUNT(DISTINCT remote_id) AS employees_with_multiple_earnings FROM earning WHERE remote_id IN (SELECT remote_id FROM earning GROUP BY remote_id HAVING COUNT(DISTINCT type) > 1);
31 | 30|What is the total amount of earnings for each type and payroll run?|SELECT type, employee_payroll_run, SUM(amount) AS total_earnings FROM earning GROUP BY type, employee_payroll_run;
32 | 31|Which type of earning has the highest average amount?|SELECT type, AVG(amount) AS average_amount FROM earning GROUP BY type ORDER BY average_amount DESC LIMIT 1;
33 | 32|How many distinct payroll runs have earnings greater than 10000?|SELECT COUNT(DISTINCT employee_payroll_run) AS distinct_payroll_runs FROM earning WHERE amount > 10000;
34 | 33|What is the average amount of earnings for each type and employee payroll run?|SELECT type, employee_payroll_run, AVG(amount) AS average_earnings FROM earning GROUP BY type, employee_payroll_run;
35 | 34|Which employee has the highest average earnings?|SELECT remote_id, AVG(amount) AS average_earnings FROM earning GROUP BY remote_id ORDER BY average_earnings DESC LIMIT 1;
36 | 35|How many distinct employee IDs have received earnings in all available payroll runs?|SELECT COUNT(*) AS distinct_employee_ids FROM (SELECT remote_id FROM earning GROUP BY remote_id HAVING COUNT(DISTINCT employee_payroll_run) = (SELECT COUNT(DISTINCT employee_payroll_run) FROM earning)) AS qualifying_employees;
37 | 36|What is the total amount of earnings for each distinct combination of employee ID and payroll run?|SELECT remote_id, employee_payroll_run, SUM(amount) AS total_earnings FROM earning GROUP BY remote_id, employee_payroll_run;
38 | 37|Which employees have received earnings in all available payroll runs?|SELECT remote_id FROM earning GROUP BY remote_id HAVING COUNT(DISTINCT employee_payroll_run) = (SELECT COUNT(DISTINCT employee_payroll_run) FROM earning);
39 | 38|What is the average amount of earnings for each distinct combination of employee ID and type?|SELECT remote_id, type, AVG(amount) AS average_earnings FROM earning GROUP BY remote_id, type;
40 | 39|How many distinct employee IDs have received earnings greater than 5000?|SELECT COUNT(DISTINCT remote_id) AS distinct_employee_ids FROM earning WHERE amount > 5000;
41 | 40|What is the highest earning amount for each employee's payroll run?|SELECT employee_payroll_run, MAX(amount) AS highest_earning FROM earning GROUP BY employee_payroll_run;
--------------------------------------------------------------------------------
/data/example_queries/complete_set/employeepayrollrun.csv:
--------------------------------------------------------------------------------
1 | Index|Question|SQL Query
2 | 1|What are the unique employee IDs in the dataset?|SELECT DISTINCT employee FROM employeepayrollrun
3 | 2|How many payroll runs were conducted?|SELECT COUNT(DISTINCT payroll_run) FROM employeepayrollrun
4 | 3|What is the average gross pay for each payroll run?|SELECT payroll_run, AVG(gross_pay) AS average_gross_pay FROM employeepayrollrun GROUP BY payroll_run
5 | 4|Which employees have the highest net pay?|SELECT employee, MAX(net_pay) AS highest_net_pay FROM employeepayrollrun GROUP BY employee
6 | 5|What is the total earnings for each employee?|SELECT employee, SUM(earnings) AS total_earnings FROM employeepayrollrun GROUP BY employee
7 | 6|What are the start and end dates for each payroll run?|SELECT payroll_run, start_date, end_date FROM employeepayrollrun
8 | 7|How many employees had deductions in their payroll?|SELECT COUNT(DISTINCT employee) FROM employeepayrollrun WHERE deductions IS NOT NULL
9 | 8|What is the average tax amount for each payroll run?|SELECT payroll_run, AVG(taxes) AS average_tax_amount FROM employeepayrollrun GROUP BY payroll_run
10 | 9|Which employees had remote IDs starting with "e095"?|SELECT employee FROM employeepayrollrun WHERE remote_id LIKE 'e095%'
11 | 10|What is the total number of deleted payroll runs?|SELECT COUNT(*) FROM employeepayrollrun WHERE remote_was_deleted = TRUE
12 | 11|What is the average net pay for each employee?|SELECT employee, AVG(net_pay) AS average_net_pay FROM employeepayrollrun GROUP BY employee
13 | 12|Which payroll runs had the highest gross pay?|SELECT payroll_run, MAX(gross_pay) AS highest_gross_pay FROM employeepayrollrun GROUP BY payroll_run
14 | 13|How many employees had multiple payroll runs?|SELECT employee, COUNT(DISTINCT payroll_run) AS payroll_run_count FROM employeepayrollrun GROUP BY employee HAVING COUNT(DISTINCT payroll_run) > 1
15 | 14|What are the check dates for each payroll run?|SELECT payroll_run, check_date FROM employeepayrollrun
16 | 15|Which employees had the highest deductions?|SELECT employee, MAX(deductions) AS highest_deductions FROM employeepayrollrun GROUP BY employee
17 | 16|What is the total tax amount for each payroll run?|SELECT payroll_run, SUM(taxes) AS total_tax_amount FROM employeepayrollrun GROUP BY payroll_run
18 | 17|How many employees had a specific earnings type?|SELECT COUNT(DISTINCT employee) FROM employeepayrollrun WHERE earnings = 'specific_earnings_type'
19 | 18|Which employees had a specific start date for their payroll run?|SELECT employee FROM employeepayrollrun WHERE start_date = 'specific_start_date'
20 | 19|What is the average net pay for each payroll run?|SELECT payroll_run, AVG(net_pay) AS average_net_pay FROM employeepayrollrun GROUP BY payroll_run
21 | 20|How many employees had their payroll runs deleted?|SELECT COUNT(DISTINCT employee) FROM employeepayrollrun WHERE remote_was_deleted = TRUE
--------------------------------------------------------------------------------
/data/example_queries/complete_set/group_final.csv:
--------------------------------------------------------------------------------
1 | Index|Question|SQL Query
2 | 1|How many records are there in the group_final table?|SELECT COUNT(*) AS record_count FROM group_final;
3 | 2|What are the names of all parent groups?|SELECT name FROM group_final;
4 | 3|How many parent groups have been deleted remotely?|SELECT COUNT(*) AS deleted_count FROM group_final WHERE remote_was_deleted = TRUE;
5 | 4|What is the total count of each parent group type?|SELECT type, COUNT(*) AS type_count FROM group_final GROUP BY type;
6 | 5|Which parent groups have a remote ID greater than 50000000?|SELECT name FROM group_final WHERE remote_id > 50000000;
7 | 6|What is the name of the parent group with the ID '1e9ccf03-ac55-4ede-982e-8c54cf454be7'?|SELECT name FROM group_final WHERE id = '1e9ccf03-ac55-4ede-982e-8c54cf454be7';
8 | 7|How many parent groups belong to the remote ID '68325244'?|SELECT COUNT(*) AS group_count FROM group_final WHERE remote_id = 68325244;
9 | 8|Which parent groups are of type 'BUSINESS_UNIT' and not remotely deleted?|SELECT name FROM group_final WHERE type = 'BUSINESS_UNIT' AND remote_was_deleted = FALSE;
10 | 9|What is the count of each parent group name starting with 'Cost Center'?|SELECT name, COUNT(*) AS name_count FROM group_final WHERE name LIKE 'Cost Center%' GROUP BY name;
11 | 10|What is the count of each type of parent group that was remotely deleted?|SELECT type, COUNT(*) AS deleted_type_count FROM group_final WHERE remote_was_deleted = TRUE GROUP BY type;
12 | 11|What is the name of the parent group with the ID '30371562'?|SELECT name FROM group_final WHERE id = '30371562';
13 | 12|How many parent groups have the word 'Department' in their names?|SELECT COUNT(*) AS department_count FROM group_final WHERE name LIKE '%Department%';
14 | 13|Which parent groups belong to the remote ID '43579094' and have not been remotely deleted?|SELECT name FROM group_final WHERE remote_id = 43579094 AND remote_was_deleted = FALSE;
15 | 14|What are the names of the parent groups that are of type 'DEPARTMENT'?|SELECT name FROM group_final WHERE type = 'DEPARTMENT';
16 | 15|How many parent groups have a name starting with 'Business Unit'?|SELECT COUNT(*) AS business_unit_count FROM group_final WHERE name LIKE 'Business Unit%';
17 | 16|What is the ID of the parent group named 'Cost Center 4602'?|SELECT id FROM group_final WHERE name = 'Cost Center 4602';
18 | 17|How many parent groups of each type have been remotely deleted?|SELECT type, COUNT(*) AS deleted_type_count FROM group_final WHERE remote_was_deleted = TRUE GROUP BY type;
19 | 18|Which parent groups have a remote ID less than 40000000 and are of type 'COST_CENTER'?|SELECT name FROM group_final WHERE remote_id < 40000000 AND type = 'COST_CENTER';
20 | 19|What is the count of each parent group type that has not been remotely deleted?|SELECT type, COUNT(*) AS type_count FROM group_final WHERE remote_was_deleted = FALSE GROUP BY type;
21 | 20|What is the total count of parent groups for each unique remote ID?|SELECT remote_id, COUNT(*) AS group_count FROM group_final GROUP BY remote_id;
22 | 21|What is the type of the parent group named 'Business Unit 570'?|SELECT type FROM group_final WHERE name = 'Business Unit 570';
23 | 22|How many parent groups have a name containing the word 'Center'?|SELECT COUNT(*) AS center_count FROM group_final WHERE name LIKE '%Center%';
24 | 23|What are the names of parent groups that have been remotely deleted?|SELECT name FROM group_final WHERE remote_was_deleted = TRUE;
25 | 24|How many parent groups have a remote ID less than 20000000?|SELECT COUNT(*) AS id_less_than_count FROM group_final WHERE remote_id < 20000000;
26 | 25|What is the ID of the parent group with the name 'Architect Department'?|SELECT id FROM group_final WHERE name = 'Architect Department';
27 | 26|How many parent groups belong to the remote ID '1f264398-6315-4cdc-810e-9e94dea6794f'?|SELECT COUNT(*) AS group_count FROM group_final WHERE remote_id = '1f264398-6315-4cdc-810e-9e94dea6794f';
28 | 27|What are the names of parent groups that are of type 'COST_CENTER' and have not been remotely deleted?|SELECT name FROM group_final WHERE type = 'COST_CENTER' AND remote_was_deleted = FALSE;
29 | 28|What is the count of each parent group name ending with '9768'?|SELECT name, COUNT(*) AS name_count FROM group_final WHERE name LIKE '%9768' GROUP BY name;
30 | 29|What is the count of each type of parent group that has not been remotely deleted?|SELECT type, COUNT(*) AS type_count FROM group_final WHERE remote_was_deleted = FALSE GROUP BY type;
31 | 30|What is the total count of parent groups for each unique remote ID?|SELECT remote_id, COUNT(*) AS group_count FROM group_final GROUP BY remote_id;
32 | 31|What are the names of parent groups that are of type 'BUSINESS_UNIT' and have a remote ID greater than 60000000?|SELECT name FROM group_final WHERE type = 'BUSINESS_UNIT' AND remote_id > 60000000;
33 | 32|How many parent groups have a name containing the word 'Architect'?|SELECT COUNT(*) AS architect_count FROM group_final WHERE name LIKE '%Architect%';
34 | 33|What is the type of the parent group with the ID '3c40a023-7cc5-4ad7-b350-dde04c59109f'?|SELECT type FROM group_final WHERE id = '3c40a023-7cc5-4ad7-b350-dde04c59109f';
35 | 34|How many parent groups have been remotely deleted and belong to the remote ID '44546129'?|SELECT COUNT(*) AS deleted_group_count FROM group_final WHERE remote_was_deleted = TRUE AND remote_id = 44546129;
36 | 35|What are the names of parent groups that start with 'Business' and end with 'Unit'?|SELECT name FROM group_final WHERE name LIKE 'Business%Unit';
37 | 36|How many parent groups belong to the remote ID '68325244' and have been remotely deleted?|SELECT COUNT(*) AS deleted_group_count FROM group_final WHERE remote_id = 68325244 AND remote_was_deleted = TRUE;
38 | 37|What is the name of the parent group with the ID '8f24d95b-9570-482c-8679-9191bf121dd9'?|SELECT name FROM group_final WHERE id = '8f24d95b-9570-482c-8679-9191bf121dd9';
39 | 38|How many parent groups are there of each type?|SELECT type, COUNT(*) AS type_count FROM group_final GROUP BY type;
40 | 39|Which parent groups have a remote ID between 30000000 and 40000000, and are of type 'DEPARTMENT'?|SELECT name FROM group_final WHERE remote_id BETWEEN 30000000 AND 40000000 AND type = 'DEPARTMENT';
41 | 40|What is the count of each parent group type that has been remotely deleted?|SELECT type, COUNT(*) AS deleted_type_count FROM group_final WHERE remote_was_deleted = TRUE GROUP BY type;
42 |
--------------------------------------------------------------------------------
/data/example_queries/complete_set/paygroup.csv:
--------------------------------------------------------------------------------
1 | Index|Question|SQL Query
2 | 1|How many unique pay groups are there?|SELECT COUNT(DISTINCT pay_group_name) AS unique_pay_groups FROM paygroup;
3 | 2|What is the total number of rows in the paygroup table?|SELECT COUNT(*) AS total_rows FROM paygroup;
4 | 3|How many pay groups have TRUE in the 'remote_was_deleted' column?|SELECT COUNT(*) AS pay_groups_with_remote_was_deleted FROM paygroup WHERE remote_was_deleted = TRUE;
5 | 4|What are the remote IDs of the pay groups with 'Temp' as the pay group name?|SELECT remote_id FROM paygroup WHERE pay_group_name = 'Temp';
6 | 5|Which pay groups are marked as 'Temp' and have not been deleted remotely?|SELECT * FROM paygroup WHERE pay_group_name = 'Temp' AND remote_was_deleted = FALSE;
7 | 6|What is the average number of remote IDs per pay group?|SELECT AVG(remote_id) AS average_remote_ids_per_paygroup FROM paygroup;
8 | 7|Which pay groups have an ID starting with '87f'?|SELECT * FROM paygroup WHERE id LIKE '87f%';
9 | 8|What is the maximum remote ID in the paygroup table?|SELECT MAX(remote_id) AS max_remote_id FROM paygroup;
10 | 9|What is the minimum remote ID among the pay groups marked as 'Full Time'?|SELECT MIN(remote_id) AS min_remote_id FROM paygroup WHERE pay_group_name = 'Full Time';
11 | 10|How many pay groups have both 'Contractor' as the pay group name and TRUE in the 'remote_was_deleted' column?|SELECT COUNT(*) AS pay_groups_with_contractor_and_remote_was_deleted FROM paygroup WHERE pay_group_name = 'Contractor' AND remote_was_deleted = TRUE;
12 | 11|What are the pay groups with IDs ending in '555'?|SELECT * FROM paygroup WHERE id LIKE '%555';
13 | 12|How many pay groups are marked as 'Full Time' and have been deleted remotely?|SELECT COUNT(*) AS pay_groups_with_full_time_and_remote_deleted FROM paygroup WHERE pay_group_name = 'Full Time' AND remote_was_deleted = TRUE;
14 | 13|What is the average length of the pay group IDs in the table?|SELECT AVG(LENGTH(id)) AS average_id_length FROM paygroup;
15 | 14|Which pay groups have a remote ID greater than 70000000?|SELECT * FROM paygroup WHERE remote_id > 70000000;
16 | 15|What is the sum of the remote IDs for all pay groups?|SELECT SUM(remote_id) AS total_remote_id_sum FROM paygroup;
17 | 16|Which pay groups have 'Temp' as the pay group name and have a remote ID less than 20000000?|SELECT * FROM paygroup WHERE pay_group_name = 'Temp' AND remote_id < 20000000;
18 | 17|How many pay groups have a pay group name that contains the word 'Contract'?|SELECT COUNT(*) AS pay_groups_with_contract_in_name FROM paygroup WHERE pay_group_name LIKE '%Contract%';
19 | 18|What are the pay groups with IDs between '0d590557' and '565a25f6'?|SELECT * FROM paygroup WHERE id BETWEEN '0d590557' AND '565a25f6';
20 | 19|Which pay groups are not marked as 'Full Time'?|SELECT * FROM paygroup WHERE pay_group_name <> 'Full Time';
21 | 20|What is the average length of the pay group names in the table?|SELECT AVG(LENGTH(pay_group_name)) AS average_pay_group_name_length FROM paygroup;
22 | 21|How many pay groups have a remote ID that is an even number?|SELECT COUNT(*) AS even_remote_id_pay_groups FROM paygroup WHERE remote_id % 2 = 0;
23 | 22|What is the maximum length of the pay group names in the table?|SELECT MAX(LENGTH(pay_group_name)) AS max_pay_group_name_length FROM paygroup;
24 | 23|Which pay groups have a remote ID less than 50000000 or a pay group name starting with 'C'?|SELECT * FROM paygroup WHERE remote_id < 50000000 OR pay_group_name LIKE 'C%';
25 | 24|How many pay groups are marked as 'Contractor' and have a remote ID greater than 80000000?|SELECT COUNT(*) AS contractor_with_high_remote_id FROM paygroup WHERE pay_group_name = 'Contractor' AND remote_id > 80000000;
26 | 25|What are the pay groups with IDs that contain the substring '553'?|SELECT * FROM paygroup WHERE id LIKE '%553%';
27 | 26|How many pay groups have 'Temp' in the pay group name and FALSE in the 'remote_was_deleted' column?|SELECT COUNT(*) AS temp_pay_groups_not_deleted FROM paygroup WHERE pay_group_name = 'Temp' AND remote_was_deleted = FALSE;
28 | 27|What is the average remote ID among the pay groups marked as 'Full Time' and have been deleted remotely?|SELECT AVG(remote_id) AS avg_remote_id_full_time_deleted FROM paygroup WHERE pay_group_name = 'Full Time' AND remote_was_deleted = TRUE;
29 | 28|Which pay groups have a remote ID between 60000000 and 70000000, and the pay group name contains the word 'Time'?|SELECT * FROM paygroup WHERE remote_id BETWEEN 60000000 AND 70000000 AND pay_group_name LIKE '%Time%';
30 | 29|What is the minimum length of the pay group names in the table?|SELECT MIN(LENGTH(pay_group_name)) AS min_pay_group_name_length FROM paygroup;
31 | 30|Which pay groups are marked as 'Temp' and have a remote ID not equal to 81443812?|SELECT * FROM paygroup WHERE pay_group_name = 'Temp' AND remote_id <> 81443812;
32 | 31|What are the pay groups with IDs that have the substring 'f12d' and have been deleted remotely?|SELECT * FROM paygroup WHERE id LIKE '%f12d%' AND remote_was_deleted = TRUE;
33 | 32|How many pay groups have a remote ID greater than 90000000 and the pay group name containing the word 'Contractor'?|SELECT COUNT(*) AS high_remote_id_contractor_pay_groups FROM paygroup WHERE remote_id > 90000000 AND pay_group_name LIKE '%Contractor%';
34 | 33|What is the average length of the pay group names for the pay groups marked as 'Temp'?|SELECT AVG(LENGTH(pay_group_name)) AS average_pay_group_name_length_temp FROM paygroup WHERE pay_group_name = 'Temp';
35 | 34|Which pay groups have a remote ID that is a multiple of 1000000?|SELECT * FROM paygroup WHERE remote_id % 1000000 = 0;
36 | 35|How many pay groups are marked as 'Full Time' and have a remote ID less than 20000000 or greater than 90000000?|SELECT COUNT(*) AS full_time_with_extreme_remote_id FROM paygroup WHERE pay_group_name = 'Full Time' AND (remote_id < 20000000 OR remote_id > 90000000);
37 | 36|What is the maximum length of the pay group IDs in the table?|SELECT MAX(LENGTH(id)) AS max_id_length FROM paygroup;
38 | 37|Which pay groups have a remote ID between 40000000 and 60000000 and the pay group name ending with 'r'?|SELECT * FROM paygroup WHERE remote_id BETWEEN 40000000 AND 60000000 AND pay_group_name LIKE '%r';
39 | 38|How many pay groups have a pay group name that starts with 'T' and ends with 'p'?|SELECT COUNT(*) AS pay_groups_starting_with_T_and_ending_with_p FROM paygroup WHERE pay_group_name LIKE 'T%p';
40 | 39|What are the pay groups with IDs that contain the substring '9bb7' and have a pay group name not equal to 'Contractor'?|SELECT * FROM paygroup WHERE id LIKE '%9bb7%' AND pay_group_name <> 'Contractor';
41 | 40|How many pay groups are marked as 'Temp' and have a remote ID that is an odd number?|SELECT COUNT(*) AS temp_pay_groups_with_odd_remote_id FROM paygroup WHERE pay_group_name = 'Temp' AND remote_id % 2 <> 0;
--------------------------------------------------------------------------------
/data/example_queries/complete_set/payrollrun.csv:
--------------------------------------------------------------------------------
1 | Index|Question|SQL Query
2 | 1|How many employees started working in the year 2022?|SELECT COUNT(*) AS num_employees FROM payrollrun WHERE YEAR(start_date) = 2022
3 | 2|What is the total count of remote employees?|SELECT COUNT(*) AS total_remote_employees FROM payrollrun WHERE remote_id IS NOT NULL
4 | 3|How many payroll runs were approved?|SELECT COUNT(*) AS num_approved_runs FROM payrollrun WHERE run_state = 'APPROVED'
5 | 4|What is the average duration of a payroll run in days?|SELECT AVG(DATEDIFF(end_date, start_date)) AS avg_duration FROM payrollrun
6 | 5|How many terminated employees had failed payroll runs?|SELECT COUNT(*) AS num_terminated_failed FROM payrollrun WHERE run_type = 'TERMINATION' AND run_state = 'FAILED'
7 | 6|What is the earliest start date among the payroll runs?|SELECT MIN(start_date) AS earliest_start_date FROM payrollrun
8 | 7|How many payroll runs were corrections?|SELECT COUNT(*) AS num_correction_runs FROM payrollrun WHERE run_type = 'CORRECTION'
9 | 8|What is the percentage of deleted remote employees?|SELECT (COUNT(*) / (SELECT COUNT(*) FROM payrollrun WHERE remote_id IS NOT NULL)) * 100 AS deleted_remote_percentage FROM payrollrun WHERE remote_was_deleted = TRUE
10 | 9|How many payroll runs occurred on the check date?|SELECT COUNT(*) AS num_runs_on_check_date FROM payrollrun WHERE start_date = check_date
11 | 10|What is the maximum duration of a payroll run in days?|SELECT MAX(DATEDIFF(end_date, start_date)) AS max_duration FROM payrollrun
12 | 11|How many employees had payroll runs with a duration of more than 14 days?|SELECT COUNT(*) AS num_employees_long_duration FROM payrollrun WHERE DATEDIFF(end_date, start_date) > 14
13 | 12|What is the total count of payroll runs that were marked as deleted remote employees?|SELECT COUNT(*) AS num_deleted_remote_runs FROM payrollrun WHERE remote_was_deleted = TRUE
14 | 13|How many employees had payroll runs in the year 2023?|SELECT COUNT(DISTINCT id) AS num_employees_2023 FROM payrollrun WHERE YEAR(start_date) = 2023
15 | 14|What is the average number of payroll runs per employee?|SELECT COUNT(*) / COUNT(DISTINCT id) AS avg_runs_per_employee FROM payrollrun
16 | 15|How many payroll runs were of type 'SIGN_ON_BONUS'?|SELECT COUNT(*) AS num_sign_on_bonus_runs FROM payrollrun WHERE run_type = 'SIGN_ON_BONUS'
17 | 16|What is the maximum number of remote employees in a single payroll run?|SELECT MAX(remote_id) AS max_remote_employees FROM payrollrun
18 | 17|How many payroll runs were not approved?|SELECT COUNT(*) AS num_unapproved_runs FROM payrollrun WHERE run_state != 'APPROVED'
19 | 18|What is the earliest check date among the payroll runs?|SELECT MIN(check_date) AS earliest_check_date FROM payrollrun
20 | 19|How many terminated employees had payroll runs in the year 2022?|SELECT COUNT(*) AS num_terminated_runs_2022 FROM payrollrun WHERE run_type = 'TERMINATION' AND YEAR(start_date) = 2022
21 | 20|What is the average duration of payroll runs that occurred on the check date?|SELECT AVG(DATEDIFF(end_date, start_date)) AS avg_duration_on_check_date FROM payrollrun WHERE start_date = check_date
22 | 21|What is the average number of failed payroll runs per employee?|SELECT COUNT(*) / COUNT(DISTINCT id) AS avg_failed_runs_per_employee FROM payrollrun WHERE run_state = 'FAILED'
23 | 22|How many payroll runs occurred in December 2022?|SELECT COUNT(*) AS num_runs_december_2022 FROM payrollrun WHERE YEAR(start_date) = 2022 AND MONTH(start_date) = 12
24 | 23|What is the maximum duration of a payroll run for terminated employees?|SELECT MAX(DATEDIFF(end_date, start_date)) AS max_duration_terminated FROM payrollrun WHERE run_type = 'TERMINATION'
25 | 24|How many payroll runs were marked as deleted remote employees?|SELECT COUNT(*) AS num_deleted_remote_runs FROM payrollrun WHERE remote_was_deleted = TRUE
26 | 25|What is the total count of unique remote employees?|SELECT COUNT(DISTINCT remote_id) AS total_unique_remote_employees FROM payrollrun WHERE remote_id IS NOT NULL
27 | 26|How many payroll runs had the run state of 'FAILED' and the run type of 'CORRECTION'?|SELECT COUNT(*) AS num_failed_correction_runs FROM payrollrun WHERE run_state = 'FAILED' AND run_type = 'CORRECTION'
28 | 27|What is the average duration of payroll runs that occurred in January 2023?|SELECT AVG(DATEDIFF(end_date, start_date)) AS avg_duration_january_2023 FROM payrollrun WHERE YEAR(start_date) = 2023 AND MONTH(start_date) = 1
29 | 28|How many terminated employees had payroll runs of type 'SIGN_ON_BONUS'?|SELECT COUNT(*) AS num_terminated_sign_on_bonus FROM payrollrun WHERE run_type = 'SIGN_ON_BONUS' AND run_state = 'TERMINATION'
30 | 29|What is the total count of payroll runs that occurred on the same start and end dates?|SELECT COUNT(*) AS num_same_start_end_dates FROM payrollrun WHERE start_date = end_date
31 | 30|How many distinct remote employees had payroll runs in the year 2023?|SELECT COUNT(DISTINCT remote_id) AS num_distinct_remote_employees_2023 FROM payrollrun WHERE YEAR(start_date) = 2023
32 | 31|What is the average duration of payroll runs for employees marked as deleted remote employees?|SELECT AVG(DATEDIFF(end_date, start_date)) AS avg_duration_deleted_remote FROM payrollrun WHERE remote_was_deleted = TRUE
33 | 32|How many payroll runs occurred in the month of March?|SELECT COUNT(*) AS num_runs_march FROM payrollrun WHERE MONTH(start_date) = 3
34 | 33|What is the maximum duration of a payroll run for employees marked as deleted remote employees?|SELECT MAX(DATEDIFF(end_date, start_date)) AS max_duration_deleted_remote FROM payrollrun WHERE remote_was_deleted = TRUE
35 | 34|How many payroll runs were corrections for terminated employees?|SELECT COUNT(*) AS num_correction_runs_terminated FROM payrollrun WHERE run_type = 'CORRECTION' AND run_state = 'TERMINATION'
36 | 35|What is the total count of distinct remote employees with failed payroll runs?|SELECT COUNT(DISTINCT remote_id) AS num_distinct_remote_failed FROM payrollrun WHERE run_state = 'FAILED' AND remote_id IS NOT NULL
37 | 36|How many payroll runs occurred in the first quarter of 2023?|SELECT COUNT(*) AS num_runs_first_quarter_2023 FROM payrollrun WHERE YEAR(start_date) = 2023 AND MONTH(start_date) IN (1, 2, 3)
38 | 37|What is the average duration of payroll runs for terminated employees?|SELECT AVG(DATEDIFF(end_date, start_date)) AS avg_duration_terminated FROM payrollrun WHERE run_type = 'TERMINATION'
39 | 38|How many unique remote employees had payroll runs in January 2022?|SELECT COUNT(DISTINCT remote_id) AS num_unique_remote_employees_january_2022 FROM payrollrun WHERE YEAR(start_date) = 2022 AND MONTH(start_date) = 1
40 | 39|What is the total count of distinct remote employees with approved payroll runs?|SELECT COUNT(DISTINCT remote_id) AS num_distinct_remote_approved FROM payrollrun WHERE run_state = 'APPROVED' AND remote_id IS NOT NULL
41 | 40|How many payroll runs occurred on the same start and check dates?|SELECT COUNT(*) AS num_same_start_check_dates FROM payrollrun WHERE start_date = check_date
--------------------------------------------------------------------------------
/data/example_queries/retr_set/final_earning.csv:
--------------------------------------------------------------------------------
1 | Question|SQL Query
2 | What is the total amount of earnings for each distinct combination of employee ID and payroll run?|SELECT remote_id, employee_payroll_run, SUM(amount) AS total_earnings FROM earning GROUP BY remote_id, employee_payroll_run;
3 | How many distinct payroll runs have earnings greater than 10000?|SELECT COUNT(DISTINCT employee_payroll_run) AS distinct_payroll_runs FROM earning WHERE amount > 10000;
4 | Which employees have received earnings in all available payroll runs?|SELECT remote_id FROM earning GROUP BY remote_id HAVING COUNT(DISTINCT employee_payroll_run) = (SELECT COUNT(DISTINCT employee_payroll_run) FROM earning);
5 | How many distinct employee IDs have received earnings in all available payroll runs?|SELECT COUNT(*) AS distinct_employee_ids FROM (SELECT remote_id FROM earning GROUP BY remote_id HAVING COUNT(DISTINCT employee_payroll_run) = (SELECT COUNT(DISTINCT employee_payroll_run) FROM earning)) AS t;
6 | What are the top 5 highest earning employees?|SELECT remote_id, SUM(amount) AS total_earnings FROM earning GROUP BY remote_id ORDER BY total_earnings DESC LIMIT 5;
7 | What is the highest earning amount for each type of earning?|SELECT type, MAX(amount) AS highest_earning FROM earning GROUP BY type;
8 | What is the average amount of earnings for each employee's payroll run?|SELECT employee_payroll_run, AVG(amount) AS average_earnings FROM earning GROUP BY employee_payroll_run;
9 | How many employees have received more than one type of earning?|SELECT COUNT(DISTINCT remote_id) AS employees_with_multiple_earnings FROM earning WHERE remote_id IN (SELECT remote_id FROM earning GROUP BY remote_id HAVING COUNT(DISTINCT type) > 1);
10 | What is the total amount of earnings for each type and payroll run?|SELECT type, employee_payroll_run, SUM(amount) AS total_earnings FROM earning GROUP BY type, employee_payroll_run;
11 | What is the average amount of earnings for each type and employee payroll run?|SELECT type, employee_payroll_run, AVG(amount) AS average_earnings FROM earning GROUP BY type, employee_payroll_run;
12 | Which employees have received a bonus?|SELECT DISTINCT remote_id FROM earning WHERE type = 'BONUS';
13 | Which employees have received both salary and overtime pay?|SELECT remote_id FROM earning WHERE type IN ('SALARY', 'OVERTIME') GROUP BY remote_id HAVING COUNT(DISTINCT type) = 2;
14 | How many distinct employee IDs have received earnings greater than 5000?|SELECT COUNT(DISTINCT remote_id) AS distinct_employee_ids FROM earning WHERE amount > 5000;
15 | How many employees have received a bonus greater than 5000?|SELECT COUNT(DISTINCT remote_id) AS employees_with_bonus_gt_5000 FROM earning WHERE type = 'BONUS' AND amount > 5000;
16 | What is the total amount of salary earned by each employee?|SELECT remote_id, SUM(amount) AS total_salary FROM earning WHERE type = 'SALARY' GROUP BY remote_id;
17 | How many earnings records are associated with each remote ID?|SELECT remote_id, COUNT(*) AS earnings_count FROM earning GROUP BY remote_id;
18 | How many earnings records have been deleted?|SELECT COUNT(*) AS deleted_earnings_count FROM earning WHERE remote_was_deleted = TRUE;
19 | What is the total amount of earnings for each payroll run?|SELECT employee_payroll_run, SUM(amount) AS total_earnings FROM earning GROUP BY employee_payroll_run;
20 | Which employee has received earnings in all available types?|SELECT remote_id FROM earning GROUP BY remote_id HAVING COUNT(DISTINCT type) = (SELECT COUNT(DISTINCT type) FROM earning);
21 | What is the average amount of salary earned?|SELECT AVG(amount) AS average_salary FROM earning WHERE type = 'SALARY';
22 | What is the average amount of overtime pay?|SELECT AVG(amount) AS average_overtime_pay FROM earning WHERE type = 'OVERTIME';
23 | What is the minimum amount of salary earned?|SELECT MIN(amount) AS minimum_salary FROM earning WHERE type = 'SALARY';
24 | Which type of earning has the highest average amount?|SELECT type, AVG(amount) AS average_amount FROM earning GROUP BY type ORDER BY average_amount DESC LIMIT 1;
25 | What is the highest earning amount for each employee's payroll run?|SELECT employee_payroll_run, MAX(amount) AS highest_earning FROM earning GROUP BY employee_payroll_run;
26 | Which employee has the highest earning amount?|SELECT remote_id, amount AS highest_earning FROM earning ORDER BY amount DESC LIMIT 1;
27 | How many distinct types of earnings are there?|SELECT COUNT(DISTINCT type) AS distinct_earnings_count FROM earning;
28 | What is the average amount of earnings for each distinct combination of employee ID and type?|SELECT remote_id, type, AVG(amount) AS average_earnings FROM earning GROUP BY remote_id, type;
29 | How many distinct employee IDs are there in the dataset?|SELECT COUNT(DISTINCT remote_id) AS distinct_employee_ids FROM earning;
30 | What are the distinct employee payroll runs in the dataset?|SELECT DISTINCT employee_payroll_run FROM earning;
31 | Which employee has the highest average earnings?|SELECT remote_id, AVG(amount) AS average_earnings FROM earning GROUP BY remote_id ORDER BY average_earnings DESC LIMIT 1;
32 | What is the total amount of earnings for each employee's payroll run?|SELECT employee_payroll_run, SUM(amount) AS total_earnings FROM earning GROUP BY employee_payroll_run;
33 | Which type of earning has the highest amount?|SELECT type, MAX(amount) AS highest_amount FROM earning GROUP BY type ORDER BY highest_amount DESC LIMIT 1;
34 | What is the total amount of earnings for each type?|SELECT type, SUM(amount) AS total_earnings FROM earning GROUP BY type;
35 | How many earnings records are there in total?|SELECT COUNT(*) AS total_earnings_count FROM earning;
36 | Which employee has received the highest total earnings?|SELECT remote_id, SUM(amount) AS total_earnings FROM earning GROUP BY remote_id ORDER BY total_earnings DESC LIMIT 1;
37 |
--------------------------------------------------------------------------------
/data/example_queries/retr_set/final_employee.csv:
--------------------------------------------------------------------------------
1 | Question|SQL Query
2 | Which team has the highest number of female employees?|SELECT team, COUNT(*) AS female_employee_count FROM employee WHERE gender = 'FEMALE' GROUP BY team ORDER BY female_employee_count DESC LIMIT 1
3 | How many employees have terminated their employment?|SELECT COUNT(*) AS terminated_employees FROM employee WHERE employment_status = 'TERMINATED'
4 | How many employees have a home location specified?|SELECT COUNT(*) AS employees_with_home_location FROM employee WHERE home_location IS NOT NULL
5 | How many employees have a work location specified?|SELECT COUNT(*) AS employees_with_work_location FROM employee WHERE work_location IS NOT NULL
6 | How many employees belong to each pay group?|SELECT pay_group, COUNT(*) AS employee_count FROM employee GROUP BY pay_group
7 | What is the marital status distribution among employees?|SELECT marital_status, COUNT(*) AS count FROM employee GROUP BY marital_status
8 | How many employees have a username starting with 'admin'?|SELECT COUNT(*) AS employees_with_admin_username FROM employee WHERE username LIKE 'admin%'
9 | How many employees have a work location in each city?|SELECT work_location, COUNT(*) AS employee_count FROM employee GROUP BY work_location
10 | How many employees have a work email that ends with '.org'?|SELECT COUNT(*) AS employees_with_org_email FROM employee WHERE work_email LIKE '%.org'
11 | How many employees are there in the company?|SELECT COUNT(*) AS total_employees FROM employee
12 | How many employees have a termination date that falls within a specific range?|SELECT COUNT(*) AS employees_with_specific_termination_date_range FROM employee WHERE termination_date BETWEEN '2022-01-01' AND '2022-12-31'
13 | What is the marital status of the employee with the employee ID '37515509'?|SELECT marital_status FROM employee WHERE id = '37515509'
14 | How many employees have a last name that starts with 'D' and contains the letter 'o'?|SELECT COUNT(*) AS employees_with_last_name_do FROM employee WHERE last_name LIKE 'D%o%'
15 | What is the total count of employees by company?|SELECT company, COUNT(*) AS employee_count FROM employee GROUP BY company
16 | How many employees have a username that contains the word 'admin'?|SELECT COUNT(*) AS employees_with_admin_username FROM employee WHERE username LIKE '%admin%'
17 | How many employees have a mobile phone number starting with '+1'?|SELECT COUNT(*) AS employees_with_us_mobile_number FROM employee WHERE mobile_phone_number LIKE '+1%'
18 | How many employees have a first name that starts with 'A' or 'B'?|SELECT COUNT(*) AS employees_with_initials_ab FROM employee WHERE first_name LIKE 'A%' OR first_name LIKE 'B%'
19 | Who is the oldest employee in the company?|SELECT first_name, last_name, date_of_birth FROM employee ORDER BY date_of_birth ASC LIMIT 1
20 | What is the marital status distribution among female employees?|SELECT marital_status, COUNT(*) AS count FROM employee WHERE gender = 'FEMALE' GROUP BY marital_status
21 | Which team has the highest number of employees with a work location in a specific city?|SELECT team, COUNT(*) AS employee_count FROM employee WHERE work_location = 'Specific City' GROUP BY team ORDER BY employee_count DESC LIMIT 1
22 | How many employees have both a work location and a home location specified?|SELECT COUNT(*) AS employees_with_work_and_home_location FROM employee WHERE work_location IS NOT NULL AND home_location IS NOT NULL
23 | Which manager has the highest number of employees?|SELECT manager, COUNT(*) AS employee_count FROM employee GROUP BY manager ORDER BY employee_count DESC LIMIT 1
24 | What is the marital status of the employee with employee number 23391?|SELECT marital_status FROM employee WHERE employee_number = 23391
25 | How many employees have a termination date after their start date?|SELECT COUNT(*) AS employees_with_valid_termination FROM employee WHERE termination_date > start_date
26 | How many employees have a personal email address?|SELECT COUNT(*) AS employees_with_personal_email FROM employee WHERE personal_email IS NOT NULL
27 |
--------------------------------------------------------------------------------
/data/example_queries/retr_set/final_employeepayrollrun.csv:
--------------------------------------------------------------------------------
1 | Question|SQL Query
2 | What is the total number of deleted payroll runs?|SELECT COUNT(*) FROM employeepayrollrun WHERE remote_was_deleted = TRUE
3 | What is the average gross pay for each payroll run?|SELECT payroll_run, AVG(gross_pay) AS average_gross_pay FROM employeepayrollrun GROUP BY payroll_run
4 | Which employees had a specific start date for their payroll run?|SELECT employee FROM employeepayrollrun WHERE start_date = 'specific_start_date'
5 | What is the total earnings for each employee?|SELECT employee, SUM(earnings) AS total_earnings FROM employeepayrollrun GROUP BY employee
6 | "Which employees had remote IDs starting with ""e095""?"|SELECT employee FROM employeepayrollrun WHERE remote_id LIKE 'e095%'
7 | What is the average net pay for each payroll run?|SELECT payroll_run, AVG(net_pay) AS average_net_pay FROM employeepayrollrun GROUP BY payroll_run
8 | What are the check dates for each payroll run?|SELECT payroll_run, check_date FROM employeepayrollrun
9 | What is the average tax amount for each payroll run?|SELECT payroll_run, AVG(taxes) AS average_tax_amount FROM employeepayrollrun GROUP BY payroll_run
10 | Which employees had the highest deductions?|SELECT employee, MAX(deductions) AS highest_deductions FROM employeepayrollrun GROUP BY employee
11 | Which employees have the highest net pay?|SELECT employee, MAX(net_pay) AS highest_net_pay FROM employeepayrollrun GROUP BY employee
12 | How many payroll runs were conducted?|SELECT COUNT(DISTINCT payroll_run) FROM employeepayrollrun
13 | How many employees had their payroll runs deleted?|SELECT COUNT(DISTINCT employee) FROM employeepayrollrun WHERE remote_was_deleted = TRUE
14 | What is the average net pay for each employee?|SELECT employee, AVG(net_pay) AS average_net_pay FROM employeepayrollrun GROUP BY employee
15 | How many employees had deductions in their payroll?|SELECT COUNT(DISTINCT employee) FROM employeepayrollrun WHERE deductions IS NOT NULL
16 | What are the unique employee IDs in the dataset?|SELECT DISTINCT employee FROM employeepayrollrun
17 |
--------------------------------------------------------------------------------
/data/example_queries/retr_set/final_group_final.csv:
--------------------------------------------------------------------------------
1 | Question|SQL Query
2 | What are the names of parent groups that are of type 'BUSINESS_UNIT' and have a remote ID greater than 60000000?|SELECT name FROM group_final WHERE type = 'BUSINESS_UNIT' AND remote_id > 60000000;
3 | How many parent groups have the word 'Department' in their names?|SELECT COUNT(*) AS department_count FROM group_final WHERE name LIKE '%Department%';
4 | Which parent groups have a remote ID less than 40000000 and are of type 'COST_CENTER'?|SELECT name FROM group_final WHERE remote_id < 40000000 AND type = 'COST_CENTER';
5 | How many records are there in the group_final table?|SELECT COUNT(*) AS record_count FROM group_final;
6 | What is the type of the parent group named 'Business Unit 570'?|SELECT type FROM group_final WHERE name = 'Business Unit 570';
7 | What is the count of each parent group type that has been remotely deleted?|SELECT type, COUNT(*) AS deleted_type_count FROM group_final WHERE remote_was_deleted = TRUE GROUP BY type;
8 | Which parent groups belong to the remote ID '43579094' and have not been remotely deleted?|SELECT name FROM group_final WHERE remote_id = 43579094 AND remote_was_deleted = FALSE;
9 | How many parent groups have a name containing the word 'Center'?|SELECT COUNT(*) AS center_count FROM group_final WHERE name LIKE '%Center%';
10 | What is the count of each parent group name starting with 'Cost Center'?|SELECT name, COUNT(*) AS name_count FROM group_final WHERE name LIKE 'Cost Center%' GROUP BY name;
11 | What is the count of each parent group name ending with '9768'?|SELECT name, COUNT(*) AS name_count FROM group_final WHERE name LIKE '%9768' GROUP BY name;
12 | How many parent groups are there of each type?|SELECT type, COUNT(*) AS type_count FROM group_final GROUP BY type;
13 | What are the names of all parent groups?|SELECT name FROM group_final;
14 | How many parent groups have a name starting with 'Business Unit'?|SELECT COUNT(*) AS business_unit_count FROM group_final WHERE name LIKE 'Business Unit%';
15 | How many parent groups belong to the remote ID '68325244' and have been remotely deleted?|SELECT COUNT(*) AS deleted_group_count FROM group_final WHERE remote_id = 68325244 AND remote_was_deleted = TRUE;
16 | What are the names of parent groups that start with 'Business' and end with 'Unit'?|SELECT name FROM group_final WHERE name LIKE 'Business%Unit';
17 | What is the type of the parent group with the ID '3c40a023-7cc5-4ad7-b350-dde04c59109f'?|SELECT type FROM group_final WHERE id = '3c40a023-7cc5-4ad7-b350-dde04c59109f';
18 | What is the name of the parent group with the ID '8f24d95b-9570-482c-8679-9191bf121dd9'?|SELECT name FROM group_final WHERE id = '8f24d95b-9570-482c-8679-9191bf121dd9';
19 | What is the name of the parent group with the ID '30371562'?|SELECT name FROM group_final WHERE id = '30371562';
20 | What are the names of parent groups that are of type 'COST_CENTER' and have not been remotely deleted?|SELECT name FROM group_final WHERE type = 'COST_CENTER' AND remote_was_deleted = FALSE;
21 | How many parent groups have a remote ID less than 20000000?|SELECT COUNT(*) AS id_less_than_count FROM group_final WHERE remote_id < 20000000;
22 | How many parent groups have a name containing the word 'Architect'?|SELECT COUNT(*) AS architect_count FROM group_final WHERE name LIKE '%Architect%';
23 | What is the ID of the parent group named 'Cost Center 4602'?|SELECT id FROM group_final WHERE name = 'Cost Center 4602';
24 | Which parent groups have a remote ID between 30000000 and 40000000, and are of type 'DEPARTMENT'?|SELECT name FROM group_final WHERE remote_id BETWEEN 30000000 AND 40000000 AND type = 'DEPARTMENT';
25 | What is the total count of each parent group type?|SELECT type, COUNT(*) AS type_count FROM group_final GROUP BY type;
26 | Which parent groups have a remote ID greater than 50000000?|SELECT name FROM group_final WHERE remote_id > 50000000;
27 | How many parent groups of each type have been remotely deleted?|SELECT type, COUNT(*) AS deleted_type_count FROM group_final WHERE remote_was_deleted = TRUE GROUP BY type;
28 | What is the total count of parent groups for each unique remote ID?|SELECT remote_id, COUNT(*) AS group_count FROM group_final GROUP BY remote_id;
29 | What is the count of each type of parent group that has not been remotely deleted?|SELECT type, COUNT(*) AS type_count FROM group_final WHERE remote_was_deleted = FALSE GROUP BY type;
30 | How many parent groups belong to the remote ID '1f264398-6315-4cdc-810e-9e94dea6794f'?|SELECT COUNT(*) AS group_count FROM group_final WHERE remote_id = '1f264398-6315-4cdc-810e-9e94dea6794f';
31 | Which parent groups are of type 'BUSINESS_UNIT' and not remotely deleted?|SELECT name FROM group_final WHERE type = 'BUSINESS_UNIT' AND remote_was_deleted = FALSE;
32 | What is the ID of the parent group with the name 'Architect Department'?|SELECT id FROM group_final WHERE name = 'Architect Department';
33 | What are the names of parent groups that have been remotely deleted?|SELECT name FROM group_final WHERE remote_was_deleted = TRUE;
34 | What is the count of each type of parent group that was remotely deleted?|SELECT type, COUNT(*) AS deleted_type_count FROM group_final WHERE remote_was_deleted = TRUE GROUP BY type;
35 | What are the names of the parent groups that are of type 'DEPARTMENT'?|SELECT name FROM group_final WHERE type = 'DEPARTMENT';
36 |
--------------------------------------------------------------------------------
/data/example_queries/retr_set/final_paygroup.csv:
--------------------------------------------------------------------------------
1 | Question|SQL Query
2 | What are the pay groups with IDs ending in '555'?|SELECT * FROM paygroup WHERE id LIKE '%555';
3 | Which pay groups have a remote ID that is a multiple of 1000000?|SELECT * FROM paygroup WHERE remote_id % 1000000 = 0;
4 | What are the pay groups with IDs that contain the substring '553'?|SELECT * FROM paygroup WHERE id LIKE '%553%';
5 | How many pay groups have both 'Contractor' as the pay group name and TRUE in the 'remote_was_deleted' column?|SELECT COUNT(*) AS pay_groups_with_contractor_and_remote_was_deleted FROM paygroup WHERE pay_group_name = 'Contractor' AND remote_was_deleted = TRUE;
6 | Which pay groups have a remote ID greater than 70000000?|SELECT * FROM paygroup WHERE remote_id > 70000000;
7 | How many pay groups are marked as 'Full Time' and have a remote ID less than 20000000 or greater than 90000000?|SELECT COUNT(*) AS full_time_with_extreme_remote_id FROM paygroup WHERE pay_group_name = 'Full Time' AND (remote_id < 20000000 OR remote_id > 90000000);
8 | What is the total number of rows in the paygroup table?|SELECT COUNT(*) AS total_rows FROM paygroup;
9 | How many pay groups have a remote ID greater than 90000000 and the pay group name containing the word 'Contractor'?|SELECT COUNT(*) AS high_remote_id_contractor_pay_groups FROM paygroup WHERE remote_id > 90000000 AND pay_group_name LIKE '%Contractor%';
10 | What is the minimum remote ID among the pay groups marked as 'Full Time'?|SELECT MIN(remote_id) AS min_remote_id FROM paygroup WHERE pay_group_name = 'Full Time';
11 | How many pay groups are marked as 'Contractor' and have a remote ID greater than 80000000?|SELECT COUNT(*) AS contractor_with_high_remote_id FROM paygroup WHERE pay_group_name = 'Contractor' AND remote_id > 80000000;
12 | What is the maximum length of the pay group names in the table?|SELECT MAX(LENGTH(pay_group_name)) AS max_pay_group_name_length FROM paygroup;
13 | What are the pay groups with IDs between '0d590557' and '565a25f6'?|SELECT * FROM paygroup WHERE id BETWEEN '0d590557' AND '565a25f6';
14 | How many pay groups have 'Temp' in the pay group name and FALSE in the 'remote_was_deleted' column?|SELECT COUNT(*) AS temp_pay_groups_not_deleted FROM paygroup WHERE pay_group_name LIKE '%Temp%' AND remote_was_deleted = FALSE;
15 | What is the average number of remote IDs per pay group?|SELECT AVG(remote_id) AS average_remote_ids_per_paygroup FROM paygroup;
16 | What is the average length of the pay group names for the pay groups marked as 'Temp'?|SELECT AVG(LENGTH(pay_group_name)) AS average_pay_group_name_length_temp FROM paygroup WHERE pay_group_name = 'Temp';
17 | Which pay groups are marked as 'Temp' and have a remote ID not equal to 81443812?|SELECT * FROM paygroup WHERE pay_group_name = 'Temp' AND remote_id <> 81443812;
18 | Which pay groups are marked as 'Temp' and have not been deleted remotely?|SELECT * FROM paygroup WHERE pay_group_name = 'Temp' AND remote_was_deleted = FALSE;
19 | What is the sum of the remote IDs for all pay groups?|SELECT SUM(remote_id) AS total_remote_id_sum FROM paygroup;
20 | What are the pay groups with IDs that have the substring 'f12d' and have been deleted remotely?|SELECT * FROM paygroup WHERE id LIKE '%f12d%' AND remote_was_deleted = TRUE;
21 | How many pay groups have a pay group name that contains the word 'Contract'?|SELECT COUNT(*) AS pay_groups_with_contract_in_name FROM paygroup WHERE pay_group_name LIKE '%Contract%';
22 | How many pay groups are marked as 'Temp' and have a remote ID that is an odd number?|SELECT COUNT(*) AS temp_pay_groups_with_odd_remote_id FROM paygroup WHERE pay_group_name = 'Temp' AND remote_id % 2 <> 0;
23 | What is the minimum length of the pay group names in the table?|SELECT MIN(LENGTH(pay_group_name)) AS min_pay_group_name_length FROM paygroup;
24 | How many pay groups have a pay group name that starts with 'T' and ends with 'p'?|SELECT COUNT(*) AS pay_groups_starting_with_T_and_ending_with_p FROM paygroup WHERE pay_group_name LIKE 'T%p';
25 | What is the maximum length of the pay group IDs in the table?|SELECT MAX(LENGTH(id)) AS max_id_length FROM paygroup;
26 | Which pay groups have an ID starting with '87f'?|SELECT * FROM paygroup WHERE id LIKE '87f%';
27 | Which pay groups have a remote ID less than 50000000 or a pay group name starting with 'C'?|SELECT * FROM paygroup WHERE remote_id < 50000000 OR pay_group_name LIKE 'C%';
28 | What is the average length of the pay group names in the table?|SELECT AVG(LENGTH(pay_group_name)) AS average_pay_group_name_length FROM paygroup;
29 | What are the pay groups with IDs that contain the substring '9bb7' and have a pay group name not equal to 'Contractor'?|SELECT * FROM paygroup WHERE id LIKE '%9bb7%' AND pay_group_name <> 'Contractor';
30 | Which pay groups are not marked as 'Full Time'?|SELECT * FROM paygroup WHERE pay_group_name <> 'Full Time';
31 | What is the average remote ID among the pay groups marked as 'Full Time' and have been deleted remotely?|SELECT AVG(remote_id) AS avg_remote_id_full_time_deleted FROM paygroup WHERE pay_group_name = 'Full Time' AND remote_was_deleted = TRUE;
32 | How many pay groups have TRUE in the 'remote_was_deleted' column?|SELECT COUNT(*) AS pay_groups_with_remote_was_deleted FROM paygroup WHERE remote_was_deleted = TRUE;
33 | What is the average length of the pay group IDs in the table?|SELECT AVG(LENGTH(id)) AS average_id_length FROM paygroup;
34 | How many pay groups have a remote ID that is an even number?|SELECT COUNT(*) AS even_remote_id_pay_groups FROM paygroup WHERE remote_id % 2 = 0;
35 | Which pay groups have a remote ID between 60000000 and 70000000, and the pay group name contains the word 'Time'?|SELECT * FROM paygroup WHERE remote_id BETWEEN 60000000 AND 70000000 AND pay_group_name LIKE '%Time%';
36 | What are the remote IDs of the pay groups with 'Temp' as the pay group name?|SELECT remote_id FROM paygroup WHERE pay_group_name = 'Temp';
37 |
--------------------------------------------------------------------------------
/data/example_queries/retr_set/final_payrollrun.csv:
--------------------------------------------------------------------------------
1 | Question|SQL Query
2 | What is the average number of failed payroll runs per employee?|SELECT COUNT(*) / COUNT(DISTINCT id) AS avg_failed_runs_per_employee FROM payrollrun WHERE run_state = 'FAILED'
3 | How many payroll runs were marked as deleted remote employees?|SELECT COUNT(*) AS num_deleted_remote_runs FROM payrollrun WHERE remote_was_deleted = TRUE
4 | How many payroll runs were of type 'SIGN_ON_BONUS'?|SELECT COUNT(*) AS num_sign_on_bonus_runs FROM payrollrun WHERE run_type = 'SIGN_ON_BONUS'
5 | What is the total count of remote employees?|SELECT COUNT(*) AS total_remote_employees FROM payrollrun WHERE remote_id IS NOT NULL
6 | What is the total count of distinct remote employees with failed payroll runs?|SELECT COUNT(DISTINCT remote_id) AS num_distinct_remote_failed FROM payrollrun WHERE run_state = 'FAILED' AND remote_id IS NOT NULL
7 | What is the total count of payroll runs that occurred on the same start and end dates?|SELECT COUNT(*) AS num_same_start_end_dates FROM payrollrun WHERE start_date = end_date
8 | How many payroll runs were not approved?|SELECT COUNT(*) AS num_unapproved_runs FROM payrollrun WHERE run_state != 'APPROVED'
9 | How many terminated employees had failed payroll runs?|SELECT COUNT(*) AS num_terminated_failed FROM payrollrun WHERE run_type = 'TERMINATION' AND run_state = 'FAILED'
10 | How many payroll runs occurred on the check date?|SELECT COUNT(*) AS num_runs_on_check_date FROM payrollrun WHERE start_date = check_date
11 | What is the total count of payroll runs that were marked as deleted remote employees?|SELECT COUNT(*) AS num_deleted_remote_runs FROM payrollrun WHERE remote_was_deleted = TRUE
12 | What is the earliest start date among the payroll runs?|SELECT MIN(start_date) AS earliest_start_date FROM payrollrun
13 | What is the percentage of deleted remote employees?|SELECT COUNT(*) * 100.0 / (SELECT COUNT(*) FROM payrollrun WHERE remote_id IS NOT NULL) AS deleted_remote_percentage FROM payrollrun WHERE remote_was_deleted = TRUE
14 | How many payroll runs had the run state of 'FAILED' and the run type of 'CORRECTION'?|SELECT COUNT(*) AS num_failed_correction_runs FROM payrollrun WHERE run_state = 'FAILED' AND run_type = 'CORRECTION'
15 | How many payroll runs were approved?|SELECT COUNT(*) AS num_approved_runs FROM payrollrun WHERE run_state = 'APPROVED'
16 | What is the total count of distinct remote employees with approved payroll runs?|SELECT COUNT(DISTINCT remote_id) AS num_distinct_remote_approved FROM payrollrun WHERE run_state = 'APPROVED' AND remote_id IS NOT NULL
17 | How many payroll runs were corrections for terminated employees?|SELECT COUNT(*) AS num_correction_runs_terminated FROM payrollrun WHERE run_type = 'CORRECTION' AND run_state = 'TERMINATION'
18 | What is the maximum number of remote employees in a single payroll run?|SELECT MAX(remote_id) AS max_remote_employees FROM payrollrun
19 | What is the average number of payroll runs per employee?|SELECT COUNT(*) / COUNT(DISTINCT id) AS avg_runs_per_employee FROM payrollrun
20 |
--------------------------------------------------------------------------------
/data/example_queries/test_set/final_earning.csv:
--------------------------------------------------------------------------------
1 | Question|SQL Query
2 | How many different types of earnings does each employee have?|SELECT remote_id, COUNT(DISTINCT type) AS distinct_earning_types FROM earning GROUP BY remote_id;
3 | Which employee has received the highest bonus?|SELECT remote_id FROM earning WHERE type = 'BONUS' ORDER BY amount DESC LIMIT 1;
4 | Which earnings records have not been deleted?|SELECT * FROM earning WHERE remote_was_deleted = FALSE;
5 | What is the total amount of earnings for each employee?|SELECT remote_id, SUM(amount) AS total_earnings FROM earning GROUP BY remote_id;
6 | What is the average amount of bonus earned by each employee?|SELECT remote_id, AVG(amount) AS average_bonus FROM earning WHERE type = 'BONUS' GROUP BY remote_id;
7 |
--------------------------------------------------------------------------------
/data/example_queries/test_set/final_employee.csv:
--------------------------------------------------------------------------------
1 | Question|SQL Query
2 | "How many employees have a work location in ""new york""?"|SELECT COUNT(*) AS employees_in_specific_state FROM employee WHERE work_location LIKE '%new york%'
3 | How many employees have a work email ending with '@acme.com'?|SELECT COUNT(*) AS employees_with_acme_email FROM employee WHERE work_email LIKE '%@acme.com'
4 | How many employees have a first name starting with 'J' and a last name starting with 'S'?|SELECT COUNT(*) AS employees_with_initials_js FROM employee WHERE first_name LIKE 'J%' AND last_name LIKE 'S%'
5 | How many employees have a remote ID that starts with 'ABC'?|SELECT COUNT(*) AS employees_with_remote_id FROM employee WHERE remote_id LIKE 'ABC%'
6 | What is the marital status distribution among male employees?|SELECT marital_status, COUNT(*) AS count FROM employee WHERE gender = 'MALE' GROUP BY marital_status
7 |
--------------------------------------------------------------------------------
/data/example_queries/test_set/final_employeepayrollrun.csv:
--------------------------------------------------------------------------------
1 | Question|SQL Query
2 | Which payroll runs had the highest gross pay?|SELECT payroll_run, MAX(gross_pay) AS highest_gross_pay FROM employeepayrollrun GROUP BY payroll_run
3 | How many employees had multiple payroll runs?|SELECT COUNT(*) AS employees_with_multiple_runs FROM (SELECT employee FROM employeepayrollrun GROUP BY employee HAVING COUNT(DISTINCT payroll_run) > 1) AS t
4 | What is the total tax amount for each payroll run?|SELECT payroll_run, SUM(taxes) AS total_tax_amount FROM employeepayrollrun GROUP BY payroll_run
5 | How many employees had a specific earnings type?|SELECT COUNT(DISTINCT employee) FROM employeepayrollrun WHERE earnings = 'specific_earnings_type'
6 | What are the start and end dates for each payroll run?|SELECT payroll_run, start_date, end_date FROM employeepayrollrun
7 |
--------------------------------------------------------------------------------
/data/example_queries/test_set/final_group_final.csv:
--------------------------------------------------------------------------------
1 | Question|SQL Query
2 | How many parent groups belong to the remote ID '68325244'?|SELECT COUNT(*) AS group_count FROM group_final WHERE remote_id = 68325244;
3 | What is the name of the parent group with the ID '1e9ccf03-ac55-4ede-982e-8c54cf454be7'?|SELECT name FROM group_final WHERE id = '1e9ccf03-ac55-4ede-982e-8c54cf454be7';
4 | What is the count of each parent group type that has not been remotely deleted?|SELECT type, COUNT(*) AS type_count FROM group_final WHERE remote_was_deleted = FALSE GROUP BY type;
5 | How many parent groups have been remotely deleted and belong to the remote ID '44546129'?|SELECT COUNT(*) AS deleted_group_count FROM group_final WHERE remote_was_deleted = TRUE AND remote_id = 44546129;
6 | How many parent groups have been deleted remotely?|SELECT COUNT(*) AS deleted_count FROM group_final WHERE remote_was_deleted = TRUE;
7 |
--------------------------------------------------------------------------------
/data/example_queries/test_set/final_paygroup.csv:
--------------------------------------------------------------------------------
1 | Question|SQL Query
2 | How many pay groups are marked as 'Full Time' and have been deleted remotely?|SELECT COUNT(*) AS pay_groups_with_full_time_and_remote_deleted FROM paygroup WHERE pay_group_name = 'Full Time' AND remote_was_deleted = TRUE;
3 | Which pay groups have 'Temp' as the pay group name and have a remote ID less than 20000000?|SELECT * FROM paygroup WHERE pay_group_name = 'Temp' AND remote_id < 20000000;
4 | Which pay groups have a remote ID between 40000000 and 60000000 and the pay group name ending with 'r'?|SELECT * FROM paygroup WHERE remote_id BETWEEN 40000000 AND 60000000 AND pay_group_name LIKE '%r';
5 | How many unique pay groups are there?|SELECT COUNT(DISTINCT pay_group_name) AS unique_pay_groups FROM paygroup;
6 | What is the maximum remote ID in the paygroup table?|SELECT MAX(remote_id) AS max_remote_id FROM paygroup;
7 |
--------------------------------------------------------------------------------
/data/example_queries/test_set/final_payrollrun.csv:
--------------------------------------------------------------------------------
1 | Question|SQL Query
2 | How many terminated employees had payroll runs of type 'SIGN_ON_BONUS'?|SELECT COUNT(*) AS num_terminated_sign_on_bonus FROM payrollrun WHERE run_type = 'SIGN_ON_BONUS' AND run_state = 'TERMINATION'
3 | What is the total count of unique remote employees?|SELECT COUNT(DISTINCT remote_id) AS total_unique_remote_employees FROM payrollrun WHERE remote_id IS NOT NULL
4 | How many payroll runs occurred on the same start and check dates?|SELECT COUNT(*) AS num_same_start_check_dates FROM payrollrun WHERE start_date = check_date
5 | What is the earliest check date among the payroll runs?|SELECT MIN(check_date) AS earliest_check_date FROM payrollrun
6 | How many payroll runs were corrections?|SELECT COUNT(*) AS num_correction_runs FROM payrollrun WHERE run_type = 'CORRECTION'
7 |
--------------------------------------------------------------------------------
/data/example_queries/trash/final_employee_old.csv:
--------------------------------------------------------------------------------
1 | Question SQL Query
2 | Which team has the highest number of female employees? SELECT team, COUNT(*) AS female_employee_count FROM employee WHERE gender = 'FEMALE' GROUP BY team ORDER BY female_employee_count DESC LIMIT 1
3 | How many employees have terminated their employment? SELECT COUNT(*) AS terminated_employees FROM employee WHERE employment_status = 'TERMINATED'
4 | How many employees have a home location specified? SELECT COUNT(*) AS employees_with_home_location FROM employee WHERE home_location IS NOT NULL
5 | How many employees have a work location specified? SELECT COUNT(*) AS employees_with_work_location FROM employee WHERE work_location IS NOT NULL
6 | How many employees belong to each pay group? SELECT pay_group, COUNT(*) AS employee_count FROM employee GROUP BY pay_group
7 | What is the marital status distribution among employees? SELECT marital_status, COUNT(*) AS count FROM employee GROUP BY marital_status
8 | How many employees have a username starting with 'admin'? SELECT COUNT(*) AS employees_with_admin_username FROM employee WHERE username LIKE 'admin%'
9 | How many employees have a work location in each city? SELECT work_location, COUNT(*) AS employee_count FROM employee GROUP BY work_location
10 | How many employees have a work email that ends with '.org'? SELECT COUNT(*) AS employees_with_org_email FROM employee WHERE work_email LIKE '%.org'
11 | How many employees are there in the company? SELECT COUNT(*) AS total_employees FROM employee
12 | How many employees have a termination date that falls within a specific range? SELECT COUNT(*) AS employees_with_specific_termination_date_range FROM employee WHERE termination_date BETWEEN '2022-01-01' AND '2022-12-31'
13 | What is the marital status of the employee with the employee ID '37515509'? SELECT marital_status FROM employee WHERE id = '37515509'
14 | How many employees have a last name that starts with 'D' and contains the letter 'o'? SELECT COUNT(*) AS employees_with_last_name_do FROM employee WHERE last_name LIKE 'D%o%'
15 | What is the total count of employees by company? SELECT company, COUNT(*) AS employee_count FROM employee GROUP BY company
16 | How many employees have a username that contains the word 'admin'? SELECT COUNT(*) AS employees_with_admin_username FROM employee WHERE username LIKE '%admin%'
17 | How many employees have a mobile phone number starting with '+1'? SELECT COUNT(*) AS employees_with_us_mobile_number FROM employee WHERE mobile_phone_number LIKE '+1%'
18 | How many employees have a first name that starts with 'A' or 'B'? SELECT COUNT(*) AS employees_with_initials_ab FROM employee WHERE first_name LIKE 'A%' OR first_name LIKE 'B%'
19 | Who is the oldest employee in the company? SELECT first_name, last_name, date_of_birth FROM employee ORDER BY date_of_birth ASC LIMIT 1
20 | What is the marital status distribution among female employees? SELECT marital_status, COUNT(*) AS count FROM employee WHERE gender = 'FEMALE' GROUP BY marital_status
21 | Which team has the highest number of employees with a work location in a specific city? SELECT team, COUNT(*) AS employee_count FROM employee WHERE work_location = 'Specific City' GROUP BY team ORDER BY employee_count DESC LIMIT 1
22 | How many employees have both a work location and a home location specified? SELECT COUNT(*) AS employees_with_work_and_home_location FROM employee WHERE work_location IS NOT NULL AND home_location IS NOT NULL
23 | Which manager has the highest number of employees? SELECT manager, COUNT(*) AS employee_count FROM employee GROUP BY manager ORDER BY employee_count DESC LIMIT 1
24 | What is the marital status of the employee with employee number 23391? SELECT marital_status FROM employee WHERE employee_number = 23391
25 | How many employees have a termination date after their start date? SELECT COUNT(*) AS employees_with_valid_termination FROM employee WHERE termination_date > start_date
26 | How many employees have a personal email address? SELECT COUNT(*) AS employees_with_personal_email FROM employee WHERE personal_email IS NOT NULL
27 |
--------------------------------------------------------------------------------
/data/example_queries/trash/final_employee_old_test.csv:
--------------------------------------------------------------------------------
1 | Question SQL Query
2 | How many employees have a work location in "new york"? SELECT COUNT(*) AS employees_in_specific_state FROM employee WHERE work_location LIKE '%new york%'
3 | How many employees have a work email ending with '@acme.com'? SELECT COUNT(*) AS employees_with_acme_email FROM employee WHERE work_email LIKE '%@acme.com'
4 | How many employees have a first name starting with 'J' and a last name starting with 'S'? SELECT COUNT(*) AS employees_with_initials_js FROM employee WHERE first_name LIKE 'J%' AND last_name LIKE 'S%'
5 | How many employees have a remote ID that starts with 'ABC'? SELECT COUNT(*) AS employees_with_remote_id FROM employee WHERE remote_id LIKE 'ABC%'
6 | What is the marital status distribution among male employees? SELECT marital_status, COUNT(*) AS count FROM employee WHERE gender = 'MALE' GROUP BY marital_status
7 |
--------------------------------------------------------------------------------
/data/prompts/synthetic_dataset_prompts.txt:
--------------------------------------------------------------------------------
1 | Prompt 1
2 | Give SQL query for the following -
3 |
4 | Question:
5 | What are the different types of salary?
6 |
7 | Table Schema: id,remote_id,employee,payroll_run,gross_pay,net_pay,start_date,end_date,check_date,earnings,deductions,taxes,remote_was_deleted
8 |
9 | Table Name : employeepayrollrun
10 |
11 | Some rows in the table look like this:
12 | b6eef2cf-065b-43f9-915a-3c3c0345ce0d,20229641,e095842f-9f90-4ca0-88d3-b49e0004eb82,b8a164a2-ff9c-499e-9181-c695ff50f5d1,1726,508,2023-01-24,2023-02-07,2023-02-07,cb993fa6-98e3-49e6-abfb-f367e74a50b4,,"4c23cabf-f3ea-402f-8da4-9a0666a5a054,2b523c94-67ae-42aa-9707-450432782fce",FALSE
13 | 6d778bc4-53c3-4950-81b7-b9546ad5baf2,20465772,e095842f-9f90-4ca0-88d3-b49e0004eb82,44cf3101-aea6-4dea-bd56-4ddb53600701,5413,928,2023-02-07,2023-02-21,2023-02-21,95bbaec8-4bec-47ba-bce5-294c97725650,,"6ac67280-ab9e-4f63-8b7d-7051f737bc36,33b3b4b2-3420-4728-9d16-637476b9cc48",FALSE
14 | 1db71384-a20e-477e-9f4b-6a150773ec51,25316785,5d20ce0a-44fa-4f45-8833-555fae93d34f,3888d17b-094f-4fbe-b6e2-72de31e25b88,1437,359,2022-09-20,2022-10-04,2022-10-04,6594dc4a-086e-4ad2-97e8-3ccaf1fe84d9,0dd722d7-0b8d-4d69-970d-07afa60123b8,"e9c3fa35-5680-4634-987b-5bbc7253843b,490fd403-df58-4870-bf68-44cda95ca8ea",TRUE
15 |
16 | Prompt 2
17 | Can you give me 10 different natural language questions and their corresponding SQL queries specific to this dataset, as a combined CSV file in the following format
18 |
19 | Index, Question, SQL Query
20 |
21 | Use "|" as the delimiter instead of a comma, and return the result as a txt file
22 |
23 |
24 | Prompt 3
25 | Give me 10 more different natural language questions and the corresponding SQL queries in the same format starting from index 11 in a txt file
--------------------------------------------------------------------------------
/data/prompts/synthetic_dataset_template.txt:
--------------------------------------------------------------------------------
1 | --- Prompt 1 ---
2 | Give SQL query for the following -
3 |
4 | Question:
5 |
6 |
7 | Table Schema: [*, id, remote_id]
8 | Table Name :
9 |
10 | Some rows in the table look like this:
11 |
12 |
13 |
14 |
15 | --- Prompt 2 ---
16 | Can you give me 10 different natural language questions and their corresponding SQL queries specific to this dataset, as a combined CSV file in the following format
17 |
18 | Index, Question, SQL Query
19 |
20 | Use "|" as the delimiter instead of a comma, and return the result as a txt file
21 |
22 |
23 | --- Prompt 3 ---
24 | Give me 10 more different natural language questions and the corresponding SQL queries in the same format starting from index 11 in a txt file
25 |
26 | --- Prompt 4 ---
27 | Give me 10 more different natural language questions and the corresponding SQL queries in the same format starting from index 21 in a txt file
28 |
29 | --- Prompt 5 ---
30 | Give me 10 more different natural language questions and the corresponding SQL queries in the same format starting from index 31 in a txt file
--------------------------------------------------------------------------------
/data/samples.json:
--------------------------------------------------------------------------------
1 |
2 | {
3 | "earning" : "fa9724ac-f812-4d99-8c12-a994e15ad570,63606114,3888d17b-094f-4fbe-b6e2-72de31e25b88,3321,SALARY,FALSE",
4 | "employee" : "38f3c8af-3198-4d3d-90c7-bf50d9b3a6f3,37515509,23391,3479aeef-f3fa-44ef-a319-83db557bbc62,Roy,Thornton,Roy Thornton,Roy.Thornton,\"5256fd65-0021-429a-8046-732cd3b77521,c734f48d-572e-4d9d-900d-226936605b96,d0785220-e24e-4a35-ac5e-f7543e6b720f,d5a21ad4-9afe-4d0f-9838-4b13bb2fe477,1af65d67-0b9b-4e9e-aa8b-31def4126f24,9ebf104e-963f-4441-af79-440b77ee3ffd,83601212-47d7-402e-91ce-953fba897f5b,8d809e31-6bf0-4840-b31d-e445027f2306,426c39ad-295e-4c3e-9042-10805bbcc630\",Roy.Thornton@ACME-United.com,diazkimberly@example.org,2409570416,\"24b5fd5f-1373-4f2c-8536-9d08540b12f7,9436abe3-2945-43b3-8189-376dd320410c,355e211c-a946-49e5-b0a9-b5df68006f17,7225a344-fc77-41f0-90a3-e63b383421ad,c47cbeb9-7152-42df-8984-c2b67f6cca8b\",69a60420-05a7-4302-a89a-610fa388dff9,69a60420-05a7-4302-a89a-610fa388dff9,aa7541b1-e3fb-4fed-8d48-7682c0c4ed1e,83601212-47d7-402e-91ce-953fba897f5b,b5fad43e-0e69-45ee-ac2f-4781b560d69b,606-96-8827,FEMALE,HISPANIC,WIDOWED,1962-08-16,2013-08-06,2021-06-23,ACTIVE,2019-03-31,https://dummyimage.com/769x456",
5 | "employeepayrollrun" : "4991bdc5-8059-4ad1-a28f-e76fbe44e3b5,64853460,4c527df3-7cfb-4e5b-b250-8e7575d3a3bd,d7d4f172-c018-4db3-8cb9-d08f12b25322,25948,2965,2022-10-18,2022-11-01,2022-11-01,\"f500f4a4-f338-4def-9c56-3797a5c5d614,c7967e1c-9ad3-4657-988f-6df95c4d8950,2850c0aa-d89a-438d-b3cf-7c09e5970873,accfe151-03be-4e87-95cc-a1f2a8bb7c7e,3099e1b9-7997-4c9c-a2c9-002ca819b592\",c9ab64c4-37cd-40ca-843d-aa7834b6291d,\"84e3ade4-8755-4183-8b84-16b55ac64b61,8d3f31f1-640f-4004-9221-52d8e6a54423,eb78913c-86f7-49ca-ad69-87d6f097cbd8\",TRUE",
6 | "group" : "abcd1234,abcd1235,Magazine features editor Department,DEPARTMENT,FALSE",
7 | "paygroup" : "abcd1234,50179773,Contractor,FALSE",
8 | "payrollrun" : ""
9 | }
--------------------------------------------------------------------------------
/data/sql/group_final.sql:
--------------------------------------------------------------------------------
1 | CREATE TABLE IF NOT EXISTS group_final (
2 | id TEXT,
3 | remote_id INTEGER,
4 | parent_group TEXT,
5 | name TEXT,
6 | type TEXT,
7 | remote_was_deleted TEXT
8 | );
9 | INSERT INTO group_final VALUES ('43cad245-e8b4-4c53-971e-1e1176ed092b',47091065,'3a52ba46-eadd-4747-9526-1b3f215b07c9','Cost Center 9695','COST_CENTER','FALSE'),
10 | ('0dcd4ea7-7eb4-4745-8ac4-c7ddd7fdc11f',19475308,'2df0ae47-2570-450b-be98-b376d0573969','Environmental education officer Department','DEPARTMENT','FALSE'),
11 | ('bcde1581-c99c-4207-98b1-219d15ff99b4',18307488,'e834a380-e1f8-4148-8b06-f4fbcc7b180c','Group 8594','GROUP','FALSE'),
12 | ('27e4de9c-4602-4c58-b032-fdc2a5818ebd',29661672,'4772cf10-9a09-4aa8-acf0-b979044c3aad','Cost Center 4492','COST_CENTER','FALSE'),
13 | ('425bceca-8a35-4255-a904-c69e5a432952',74969187,'bd44ff2e-56cb-4c84-9738-91b98404ab3d','Astronomer Department','DEPARTMENT','FALSE'),
14 | ('600aa46e-fd55-4c14-8d03-43f2c2fca81b',78724147,'856542fb-0faa-4a89-bcd1-7d771806d918','Research scientist (physical sciences) Department','DEPARTMENT','FALSE'),
15 | ('04181780-ac6d-47e3-8126-cbf773966335',10827323,'e8ff8f5c-3e9b-4d03-b198-a2810cf3d96f','Meteorologist Department','DEPARTMENT','FALSE'),
16 | ('99434549-f240-4945-a218-f749af3232cc',25386746,'10a9fb19-e382-4be9-a103-7a076a314c79','Community development worker Team','TEAM','FALSE'),
17 | ('e011e638-77ec-420d-a92f-f4bc42519f60',15831193,'0dcd4ea7-7eb4-4745-8ac4-c7ddd7fdc11f','Cost Center 2950','COST_CENTER','FALSE'),
18 | ('d0785220-e24e-4a35-ac5e-f7543e6b720f',33248761,'3c3bec4e-00b3-41b1-83bf-4f5074ce7665','Cost Center 2362','COST_CENTER','TRUE'),
19 | ('27f2a393-c77d-4b91-8501-434256f52d3b',14996181,'4c8922c4-d0c3-4dd7-8206-84dd5b4c3ef9','Art gallery manager Team','TEAM','FALSE'),
20 | ('9ebf104e-963f-4441-af79-440b77ee3ffd',76224061,'3a52ba46-eadd-4747-9526-1b3f215b07c9','Business Unit 9648','BUSINESS_UNIT','FALSE'),
21 | ('65ac5e8a-7844-45b1-9c26-e13a785a4339',55193679,'27e4de9c-4602-4c58-b032-fdc2a5818ebd','Geoscientist Team','TEAM','TRUE'),
22 | ('78b871a7-77b1-4406-8488-c49d78a1f4a1',86189128,'3a52ba46-eadd-4747-9526-1b3f215b07c9','Paramedic Team','TEAM','TRUE'),
23 | ('0800ea55-9a59-46e7-9317-708c5810306a',49310398,'1e9ccf03-ac55-4ede-982e-8c54cf454be7','Cost Center 8326','COST_CENTER','FALSE'),
24 | ('0d6d3c34-05e3-4765-b2e4-d4916d3524fa',46087924,'89c56a05-effe-4809-b15a-08ff9bb45f59','Agricultural consultant Team','TEAM','TRUE'),
25 | ('ed9bee4f-02e6-4003-8062-8b4bd698cd75',32513187,'7946a616-917c-4c17-b98a-f75061037648','Cost Center 2799','COST_CENTER','FALSE'),
26 | ('d690f763-6741-4e02-87ca-60b8a8088407',58007777,'825cf3ef-1947-4959-b2f8-99829e640987','Business Unit 6102','BUSINESS_UNIT','TRUE'),
27 | ('db1d4c22-3fe4-40c1-be1b-b8b6e2a7385f',43579094,'1e9ccf03-ac55-4ede-982e-8c54cf454be7','Business Unit 570','BUSINESS_UNIT','TRUE'),
28 | ('8f24d95b-9570-482c-8679-9191bf121dd9',44546129,'1f264398-6315-4cdc-810e-9e94dea6794f','Cost Center 9768','COST_CENTER','FALSE'),
29 | ('d9ab05ce-b74f-43db-bc81-9b0ac9a9f3a3',30371562,'dc7faa72-2045-4f8c-8186-ba0f38e6db0f','Cost Center 4602','COST_CENTER','FALSE'),
30 | ('1e9ccf03-ac55-4ede-982e-8c54cf454be7',68325244,'3c40a023-7cc5-4ad7-b350-dde04c59109f','Architect Department','DEPARTMENT','TRUE'),
31 | ('3063d441-8375-4313-8e57-26bbafeda1b8',19492908,'300b3cea-3ae0-43c3-b07d-4d0b9e91306a','Radiographer, diagnostic Team','TEAM','TRUE'),
32 | ('923f897a-d74e-4edc-aae3-e824c2677494',93566334,'5256fd65-0021-429a-8046-732cd3b77521','Business Unit 1502','BUSINESS_UNIT','TRUE'),
33 | ('157253ec-8d3c-491d-b401-3d1e032b07b0',23394453,'c685f3cb-5682-4981-93c3-86265b2c2e57','Business Unit 6935','BUSINESS_UNIT','FALSE'),
34 | ('9ee7c415-8bb3-4509-8673-b2e2fe72f423',14567439,'5b876ea5-b369-4d97-b302-6d6a73f30cc8','Production manager Department','DEPARTMENT','TRUE'),
35 | ('319a6653-6eaf-4fa4-9bc2-405241069bc8',70562204,'c3cafba4-8ebf-41c0-b106-10ec64bfe876','Business Unit 2457','BUSINESS_UNIT','TRUE'),
36 | ('7fd1b412-90a2-472a-90b6-fa01d10339c6',97699467,'39dfcc6b-bae4-421b-91a5-d60e4966e70e','Business Unit 4015','BUSINESS_UNIT','TRUE'),
37 | ('5999ad24-c76a-4571-9b50-a82aec515a0b',17519229,'dc7faa72-2045-4f8c-8186-ba0f38e6db0f','Cost Center 8503','COST_CENTER','TRUE'),
38 | ('8175d866-21d5-483c-928a-5ebc33779a36',14833059,'dc7faa72-2045-4f8c-8186-ba0f38e6db0f','Group 6491','GROUP','TRUE'),
39 | ('87698d19-8072-4884-afff-a9b455c071e7',59438341,'0800ea55-9a59-46e7-9317-708c5810306a','Health service manager Team','TEAM','TRUE'),
40 | ('0bcb6c5a-e2f7-45d5-ab87-49a2f5121515',14344856,'b16df9bf-9248-4b0a-a5d2-72fdd1ceb440','Cost Center 5912','COST_CENTER','FALSE'),
41 | ('c247c69f-0a6d-47e6-912f-ce404ce39920',75521383,'977695b6-d1bd-4ad9-b250-71ea7ebf010e','Cost Center 3830','COST_CENTER','FALSE'),
42 | ('1f264398-6315-4cdc-810e-9e94dea6794f',65333758,'12064004-a2ac-4207-bc78-4bafdd056bfc','Business Unit 9796','BUSINESS_UNIT','TRUE'),
43 | ('137e2127-5003-422c-8ae0-7fe5fe10c355',53163195,'3c3bec4e-00b3-41b1-83bf-4f5074ce7665','Architect Department','DEPARTMENT','FALSE'),
44 | ('f6d0055d-6697-42eb-aecf-e8d6ad68ed95',74014197,'8f43edf1-ae6c-4cdb-ad6d-30c0cc662763','Wellsite geologist Department','DEPARTMENT','FALSE'),
45 | ('a35da513-294a-4a45-aaae-86006eef6510',48021302,'d690f763-6741-4e02-87ca-60b8a8088407','Radiographer, therapeutic Department','DEPARTMENT','TRUE'),
46 | ('0fd7d20c-bac7-45cb-9008-17808f877597',84694614,'d0785220-e24e-4a35-ac5e-f7543e6b720f','Tax adviser Team','TEAM','FALSE'),
47 | ('c2fc7bc9-4559-4d09-9e55-2c3e08c59ba4',21379896,'4772cf10-9a09-4aa8-acf0-b979044c3aad','Dance movement psychotherapist Team','TEAM','FALSE'),
48 | ('3c40a023-7cc5-4ad7-b350-dde04c59109f',33032111,'f66a1688-32f7-4e3c-a8b2-b815a03dad9b','Emergency planning/management officer Department','DEPARTMENT','TRUE'),
49 | ('3d195912-b964-4fdf-b73b-42d7634be2e2',47447527,'99434549-f240-4945-a218-f749af3232cc','Pharmacist, hospital Department','DEPARTMENT','FALSE'),
50 | ('977695b6-d1bd-4ad9-b250-71ea7ebf010e',12559104,'bd44ff2e-56cb-4c84-9738-91b98404ab3d','Accountant, chartered certified Department','DEPARTMENT','FALSE'),
51 | ('d503c50a-0f24-4f6b-acd6-29e476805ce0',59214811,'a35da513-294a-4a45-aaae-86006eef6510','Cost Center 4131','COST_CENTER','FALSE'),
52 | ('7963b58c-0be9-4594-8bd2-dae1e02b4bb2',71709818,'d0785220-e24e-4a35-ac5e-f7543e6b720f','Horticultural consultant Department','DEPARTMENT','TRUE'),
53 | ('406dbcf9-72c8-4bf5-b8b9-07b03f70af8e',15346875,'0fd7d20c-bac7-45cb-9008-17808f877597','Scientist, product/process development Team','TEAM','FALSE'),
54 | ('825cf3ef-1947-4959-b2f8-99829e640987',56096673,'aeead089-00c2-4446-a7cf-1d3bb8584895','Paediatric nurse Department','DEPARTMENT','TRUE'),
55 | ('c3cafba4-8ebf-41c0-b106-10ec64bfe876',55887103,'be475fed-b0d7-4942-b2c3-3405f82f0f3b','Business Unit 252','BUSINESS_UNIT','FALSE'),
56 | ('6d982475-d6a9-40c4-8f59-a0c06dc94af9',54168189,'f6d0055d-6697-42eb-aecf-e8d6ad68ed95','English as a second language teacher Team','TEAM','FALSE'),
57 | ('ea387ad0-9326-49df-b90a-e063cc99752c',89261236,'08e4686f-e0de-42dd-9fac-dcd6558f80bb','Business Unit 4893','BUSINESS_UNIT','TRUE'),
58 | ('4c8922c4-d0c3-4dd7-8206-84dd5b4c3ef9',24812941,'1d12d4d5-2310-459d-8873-232386635252','Cost Center 8348','COST_CENTER','TRUE'),
59 | ('43c3bc79-910d-4c47-9e47-cd5251089123',46773267,'dc7faa72-2045-4f8c-8186-ba0f38e6db0f','Business Unit 6267','BUSINESS_UNIT','FALSE'),
60 | ('6990d80d-1cac-42b9-bfd2-0610974abfa6',79678206,'3c40a023-7cc5-4ad7-b350-dde04c59109f','Solicitor, Scotland Team','TEAM','FALSE'),
61 | ('72b2cb2d-297e-4890-9c5c-b575da7f03dd',94430287,'5a447713-b8dc-4a1c-aab3-3ec3e1460c51','Advertising copywriter Team','TEAM','TRUE'),
62 | ('c536f3ab-0912-442c-b53e-496af749d25e',26538427,'3c3bec4e-00b3-41b1-83bf-4f5074ce7665','Diplomatic Services operational officer Team','TEAM','TRUE'),
63 | ('b150da1e-7a3f-43a4-9204-e657b74af8f7',12035199,'27e4de9c-4602-4c58-b032-fdc2a5818ebd','Cost Center 2084','COST_CENTER','FALSE'),
64 | ('cbec3de6-077f-463a-8e76-0b094cbc2fb4',98333885,'425bceca-8a35-4255-a904-c69e5a432952','Airline pilot Team','TEAM','FALSE'),
65 | ('de6248e1-7c8d-43c6-9fc8-2292e46b0f05',99878246,'682d41f9-500c-4de4-8eb7-6d9b97c8f35e','Writer Department','DEPARTMENT','FALSE'),
66 | ('dc7faa72-2045-4f8c-8186-ba0f38e6db0f',26844391,'99434549-f240-4945-a218-f749af3232cc','Group 8260','GROUP','FALSE'),
67 | ('1d12d4d5-2310-459d-8873-232386635252',27966909,'5a447713-b8dc-4a1c-aab3-3ec3e1460c51','Group 3676','GROUP','FALSE'),
68 | ('682d41f9-500c-4de4-8eb7-6d9b97c8f35e',49375157,'127c4ea1-1a06-4896-aae5-1d1995dc4545','Producer, television/film/video Team','TEAM','FALSE'),
69 | ('3a52ba46-eadd-4747-9526-1b3f215b07c9',73099570,'08e4686f-e0de-42dd-9fac-dcd6558f80bb','Group 5785','GROUP','TRUE'),
70 | ('5256fd65-0021-429a-8046-732cd3b77521',25143577,'1f264398-6315-4cdc-810e-9e94dea6794f','Quality manager Department','DEPARTMENT','TRUE'),
71 | ('c685f3cb-5682-4981-93c3-86265b2c2e57',65867775,'27e4de9c-4602-4c58-b032-fdc2a5818ebd','Business Unit 8270','BUSINESS_UNIT','FALSE'),
72 | ('8d809e31-6bf0-4840-b31d-e445027f2306',40198956,'562a1979-da9f-400c-92b1-f72d2a4c3d11','Patent examiner Department','DEPARTMENT','FALSE'),
73 | ('04ed7a5c-88df-42f1-92d0-bf140006dc0e',55481288,'426c39ad-295e-4c3e-9042-10805bbcc630','Cost Center 5579','COST_CENTER','FALSE'),
74 | ('4772cf10-9a09-4aa8-acf0-b979044c3aad',68547732,'70a081b3-2331-4fd9-bab6-d601331f266d','Cost Center 320','COST_CENTER','TRUE'),
75 | ('83601212-47d7-402e-91ce-953fba897f5b',26911053,'682d41f9-500c-4de4-8eb7-6d9b97c8f35e','Cost Center 6786','COST_CENTER','TRUE'),
76 | ('e7735b41-1e0c-4539-a3f1-3907a51cb02b',23211719,'319a6653-6eaf-4fa4-9bc2-405241069bc8','Cost Center 9923','COST_CENTER','FALSE'),
77 | ('e8ff8f5c-3e9b-4d03-b198-a2810cf3d96f',20552980,'935e995e-d883-41f7-823a-7ceefbf6b2f3','Animator Department','DEPARTMENT','TRUE'),
78 | ('37c57196-2dcd-4946-ae87-9cb2ea72558c',23203593,'c247c69f-0a6d-47e6-912f-ce404ce39920','Business Unit 9624','BUSINESS_UNIT','FALSE'),
79 | ('889c8643-a459-4389-a5be-ee8207542071',58624420,'3c40a023-7cc5-4ad7-b350-dde04c59109f','Business Unit 6427','BUSINESS_UNIT','FALSE'),
80 | ('31829cf9-823f-4f53-8005-eab6f5ec2fe0',13100868,'c786220f-71c9-466d-8233-f3f818ed5781','Cost Center 418','COST_CENTER','TRUE'),
81 | ('cee0583b-30fe-4651-88b7-bdcd440b10a8',11264857,'39dfcc6b-bae4-421b-91a5-d60e4966e70e','Administrator, local government Department','DEPARTMENT','FALSE'),
82 | ('70a081b3-2331-4fd9-bab6-d601331f266d',36705171,'426c39ad-295e-4c3e-9042-10805bbcc630','Engineer, mining Department','DEPARTMENT','FALSE'),
83 | ('f66a1688-32f7-4e3c-a8b2-b815a03dad9b',37433417,'4c8922c4-d0c3-4dd7-8206-84dd5b4c3ef9','Horticulturist, commercial Department','DEPARTMENT','TRUE'),
84 | ('a1dbafc8-2523-461e-83ba-0338582351e5',62025074,'5b876ea5-b369-4d97-b302-6d6a73f30cc8','Group 8799','GROUP','TRUE'),
85 | ('bd44ff2e-56cb-4c84-9738-91b98404ab3d',37945977,'6a4e5d60-ae91-4bd4-a30d-71e2ffe666c5','Dealer Team','TEAM','FALSE'),
86 | ('ccd5f4c5-00cd-47b6-995c-d14cdbd76db2',23762385,'c536f3ab-0912-442c-b53e-496af749d25e','Physiological scientist Department','DEPARTMENT','TRUE'),
87 | ('5b876ea5-b369-4d97-b302-6d6a73f30cc8',83771923,'d1056987-b78f-49af-8ca5-28f8955f4464','Technical sales engineer Team','TEAM','FALSE'),
88 | ('39dfcc6b-bae4-421b-91a5-d60e4966e70e',64586221,'87698d19-8072-4884-afff-a9b455c071e7','Cost Center 3008','COST_CENTER','FALSE'),
89 | ('a8e50a03-8a64-4bed-af76-494201dcb4c1',59985811,'e906b606-8395-4e7d-ae22-9ac7a3b9f3ef','Business Unit 5573','BUSINESS_UNIT','TRUE'),
90 | ('08e4686f-e0de-42dd-9fac-dcd6558f80bb',65412397,'74692f13-59e9-4ee7-b8f4-a53a7eeebac5','Magazine features editor Department','DEPARTMENT','FALSE'),
91 | ('12064004-a2ac-4207-bc78-4bafdd056bfc',51815649,'0c1b3501-5bfc-48c2-a309-90e33d58549a','Group 7074','GROUP','FALSE'),
92 | ('6a4e5d60-ae91-4bd4-a30d-71e2ffe666c5',53043545,'1a15d48e-f29d-4245-86d0-4f2de1fe99a8','Business Unit 4069','BUSINESS_UNIT','TRUE'),
93 | ('c365541a-9ad0-472d-b83d-8e013da8159b',97482688,'1d12d4d5-2310-459d-8873-232386635252','Business Unit 5827','BUSINESS_UNIT','TRUE'),
94 | ('b16df9bf-9248-4b0a-a5d2-72fdd1ceb440',94355331,'0c1b3501-5bfc-48c2-a309-90e33d58549a','Business Unit 6174','BUSINESS_UNIT','TRUE'),
95 | ('dc5580d7-0ec4-4830-885c-8a99b9b7a04b',52057019,'c734f48d-572e-4d9d-900d-226936605b96','Cost Center 9710','COST_CENTER','TRUE'),
96 | ('562a1979-da9f-400c-92b1-f72d2a4c3d11',22208450,'f4fe6d7a-98fa-4e3a-b4a2-83d9caac3e4f','Cost Center 4702','COST_CENTER','TRUE'),
97 | ('74692f13-59e9-4ee7-b8f4-a53a7eeebac5',75544406,'5999ad24-c76a-4571-9b50-a82aec515a0b','Business Unit 6124','BUSINESS_UNIT','FALSE'),
98 | ('cc07b0af-0cdc-4b69-83cc-5482b04ee7d6',86298291,'930d79b1-c130-48f2-b26f-30f9f8677f4a','Land Team','TEAM','FALSE'),
99 | ('73451513-0e2d-4a5d-911f-a79e85635400',44017564,'3a52ba46-eadd-4747-9526-1b3f215b07c9','Engineer, drilling Department','DEPARTMENT','FALSE'),
100 | ('d5a21ad4-9afe-4d0f-9838-4b13bb2fe477',98784608,'ccd5f4c5-00cd-47b6-995c-d14cdbd76db2','Editor, magazine features Team','TEAM','TRUE'),
101 | ('127c4ea1-1a06-4896-aae5-1d1995dc4545',60019693,'515b8deb-95da-4930-b74a-3fb43b425aa7','Cost Center 5511','COST_CENTER','TRUE'),
102 | ('30f58e59-ace9-491e-9bcb-16ddb783448f',52765100,'be475fed-b0d7-4942-b2c3-3405f82f0f3b','Business Unit 5623','BUSINESS_UNIT','TRUE'),
103 | ('ccde1d75-60f5-4983-9cce-5f8afd6df5cf',49419992,'abea7ee3-788a-4a1f-8826-185b85f4d579','Furniture designer Team','TEAM','FALSE'),
104 | ('5a447713-b8dc-4a1c-aab3-3ec3e1460c51',18596650,'65ac5e8a-7844-45b1-9c26-e13a785a4339','Group 3969','GROUP','FALSE'),
105 | ('1a15d48e-f29d-4245-86d0-4f2de1fe99a8',97110412,'137e2127-5003-422c-8ae0-7fe5fe10c355','Cost Center 3848','COST_CENTER','FALSE'),
106 | ('24247e71-7f53-4ac1-be5d-6a7cb49d0c21',51674757,'bd44ff2e-56cb-4c84-9738-91b98404ab3d','Radiographer, diagnostic Team','TEAM','TRUE'),
107 | ('1a2ed032-7acd-4c53-bc75-22dc71be5cbd',53568113,'6990d80d-1cac-42b9-bfd2-0610974abfa6','Business Unit 4836','BUSINESS_UNIT','FALSE'),
108 | ('e906b606-8395-4e7d-ae22-9ac7a3b9f3ef',82656384,'c2fc7bc9-4559-4d09-9e55-2c3e08c59ba4','Cost Center 3606','COST_CENTER','TRUE'),
109 | ('75c44d49-1e2b-4324-ba05-526e1c8b8837',48648713,'5a447713-b8dc-4a1c-aab3-3ec3e1460c51','Cost Center 406','COST_CENTER','TRUE'),
110 | ('b80f9630-36e6-4bb1-9d98-7b5599f6949a',39400067,'e906b606-8395-4e7d-ae22-9ac7a3b9f3ef','Group 7497','GROUP','TRUE'),
111 | ('1af65d67-0b9b-4e9e-aa8b-31def4126f24',72768689,'f66a1688-32f7-4e3c-a8b2-b815a03dad9b','Engineer, automotive Team','TEAM','TRUE'),
112 | ('b825b37d-437e-4ed0-8a66-3d74621c7af2',68563434,'923f897a-d74e-4edc-aae3-e824c2677494','Business Unit 8839','BUSINESS_UNIT','TRUE'),
113 | ('8f43edf1-ae6c-4cdb-ad6d-30c0cc662763',30898269,'83601212-47d7-402e-91ce-953fba897f5b','Cost Center 6523','COST_CENTER','FALSE'),
114 | ('10a9fb19-e382-4be9-a103-7a076a314c79',39643760,'1a15d48e-f29d-4245-86d0-4f2de1fe99a8','Administrator, Civil Service Department','DEPARTMENT','FALSE'),
115 | ('426c39ad-295e-4c3e-9042-10805bbcc630',11725552,'9ebf104e-963f-4441-af79-440b77ee3ffd','Group 5559','GROUP','FALSE'),
116 | ('f91075f8-6c78-4f89-83c3-6a4cd8e91f44',38925454,'f002ee36-6481-4150-b240-4b00bf371d8b','Group 6722','GROUP','TRUE'),
117 | ('6df33dbe-e26c-4673-b524-3678dca2b350',91620010,'c247c69f-0a6d-47e6-912f-ce404ce39920','Engineer, maintenance (IT) Team','TEAM','FALSE'),
118 | ('cdc70c48-27b9-4f13-aaef-08ba2f4867aa',50875228,'27e4de9c-4602-4c58-b032-fdc2a5818ebd','Firefighter Department','DEPARTMENT','FALSE'),
119 | ('7946a616-917c-4c17-b98a-f75061037648',71653905,'83601212-47d7-402e-91ce-953fba897f5b','Telecommunications researcher Department','DEPARTMENT','FALSE'),
120 | ('606f3cf3-85f8-4cff-9766-1b78228b1f8c',31614993,'6d982475-d6a9-40c4-8f59-a0c06dc94af9','Group 9122','GROUP','FALSE'),
121 | ('d95502bb-26d5-46d7-acf8-5ec3a6b266f9',27912310,'b150da1e-7a3f-43a4-9204-e657b74af8f7','Glass blower/designer Department','DEPARTMENT','FALSE'),
122 | ('05454df5-8d4a-458b-bafb-1e4fe2a93779',99888610,'34008cf4-6c47-44e7-bb9f-88dab099e2f6','Group 8555','GROUP','FALSE'),
123 | ('06de6277-bde1-43c8-b3be-9e94ea92393c',76312038,'319a6653-6eaf-4fa4-9bc2-405241069bc8','Technical author Department','DEPARTMENT','FALSE'),
124 | ('de016ec9-0530-4a60-994a-6c6c4bb00714',24386152,'78b871a7-77b1-4406-8488-c49d78a1f4a1','Group 3115','GROUP','TRUE'),
125 | ('935e995e-d883-41f7-823a-7ceefbf6b2f3',40715065,'43c3bc79-910d-4c47-9e47-cd5251089123','Psychotherapist Department','DEPARTMENT','FALSE'),
126 | ('3c3bec4e-00b3-41b1-83bf-4f5074ce7665',61730749,'b697cd77-5381-448c-85e7-5fc748f6a71a','Cost Center 8932','COST_CENTER','TRUE'),
127 | ('89c56a05-effe-4809-b15a-08ff9bb45f59',11033733,'b91f019a-cda0-4700-8061-b37c265a9a98','Scientist, audiological Team','TEAM','FALSE'),
128 | ('2b3b2bea-5f85-468d-9e49-01ac6fbb68ff',68779615,'426c39ad-295e-4c3e-9042-10805bbcc630','Business Unit 6061','BUSINESS_UNIT','FALSE'),
129 | ('550f743c-4ece-4d8f-9299-bd34cdd45335',92767293,'5a447713-b8dc-4a1c-aab3-3ec3e1460c51','Business Unit 8888','BUSINESS_UNIT','FALSE'),
130 | ('df198e9d-ce4c-4c13-8d3f-bb138bd661ff',86341642,'a35da513-294a-4a45-aaae-86006eef6510','Cost Center 9558','COST_CENTER','FALSE'),
131 | ('300b3cea-3ae0-43c3-b07d-4d0b9e91306a',39913294,'0bcb6c5a-e2f7-45d5-ab87-49a2f5121515','Cost Center 9519','COST_CENTER','TRUE'),
132 | ('abea7ee3-788a-4a1f-8826-185b85f4d579',53963421,'977695b6-d1bd-4ad9-b250-71ea7ebf010e','Business Unit 4104','BUSINESS_UNIT','FALSE'),
133 | ('d5cedcb8-eff3-4c6b-82a0-86dc93c57944',99830984,'bcde1581-c99c-4207-98b1-219d15ff99b4','Cost Center 4037','COST_CENTER','TRUE'),
134 | ('b256af9f-8012-4bdc-aa3e-18c3e3c3763b',30832163,'5a447713-b8dc-4a1c-aab3-3ec3e1460c51','Cost Center 2065','COST_CENTER','FALSE'),
135 | ('939a3149-6e18-490c-b4bc-b86822d1ca68',40640578,'f6d0055d-6697-42eb-aecf-e8d6ad68ed95','Art therapist Department','DEPARTMENT','TRUE'),
136 | ('5c3d162f-54b5-4dbc-8fa5-e5f5171097b3',77592829,'9ebf104e-963f-4441-af79-440b77ee3ffd','Cost Center 8968','COST_CENTER','TRUE'),
137 | ('8b3caf64-5b4d-41d0-b585-479cf88d1ff6',81473320,'04181780-ac6d-47e3-8126-cbf773966335','Business Unit 215','BUSINESS_UNIT','FALSE'),
138 | ('2df0ae47-2570-450b-be98-b376d0573969',13618612,'b256af9f-8012-4bdc-aa3e-18c3e3c3763b','Group 3416','GROUP','FALSE'),
139 | ('f002ee36-6481-4150-b240-4b00bf371d8b',48599683,'04181780-ac6d-47e3-8126-cbf773966335','Therapist, sports Department','DEPARTMENT','FALSE'),
140 | ('c113c784-80ec-40b1-ba68-713da5327f99',85617294,'aeead089-00c2-4446-a7cf-1d3bb8584895','Animator Department','DEPARTMENT','FALSE'),
141 | ('eef4e627-e22b-4ade-865d-643a76f4d8f0',69616207,'06de6277-bde1-43c8-b3be-9e94ea92393c','Psychiatric nurse Department','DEPARTMENT','TRUE'),
142 | ('c734f48d-572e-4d9d-900d-226936605b96',27003253,'606f3cf3-85f8-4cff-9766-1b78228b1f8c','Cost Center 6327','COST_CENTER','TRUE'),
143 | ('c786220f-71c9-466d-8233-f3f818ed5781',53343350,'3063d441-8375-4313-8e57-26bbafeda1b8','Theme park manager Department','DEPARTMENT','TRUE'),
144 | ('515b8deb-95da-4930-b74a-3fb43b425aa7',45696510,'27f2a393-c77d-4b91-8501-434256f52d3b','Group 6765','GROUP','FALSE'),
145 | ('f4fe6d7a-98fa-4e3a-b4a2-83d9caac3e4f',55443287,'1a2ed032-7acd-4c53-bc75-22dc71be5cbd','Group 5769','GROUP','FALSE'),
146 | ('0c1b3501-5bfc-48c2-a309-90e33d58549a',60061042,'0bcb6c5a-e2f7-45d5-ab87-49a2f5121515','Accountant, chartered certified Team','TEAM','FALSE'),
147 | ('a108d4ae-cdbd-4e44-ac57-0bafb2b76df9',83262234,'c113c784-80ec-40b1-ba68-713da5327f99','Group 9592','GROUP','TRUE'),
148 | ('b90d3726-fff2-4c4a-872e-dd3091bf0c44',47183580,'515b8deb-95da-4930-b74a-3fb43b425aa7','Veterinary surgeon Team','TEAM','TRUE'),
149 | ('6429c582-ed9e-4530-86c2-e2b1d31a3b92',31335523,'550f743c-4ece-4d8f-9299-bd34cdd45335','Television camera operator Team','TEAM','FALSE'),
150 | ('856542fb-0faa-4a89-bcd1-7d771806d918',97844833,'dc7faa72-2045-4f8c-8186-ba0f38e6db0f','Business Unit 1309','BUSINESS_UNIT','TRUE'),
151 | ('e834a380-e1f8-4148-8b06-f4fbcc7b180c',91168613,'cdc70c48-27b9-4f13-aaef-08ba2f4867aa','Cost Center 1015','COST_CENTER','FALSE'),
152 | ('930d79b1-c130-48f2-b26f-30f9f8677f4a',69807144,'d5a21ad4-9afe-4d0f-9838-4b13bb2fe477','Therapist, drama Department','DEPARTMENT','TRUE'),
153 | ('d1056987-b78f-49af-8ca5-28f8955f4464',85259132,'562a1979-da9f-400c-92b1-f72d2a4c3d11','Group 1556','GROUP','FALSE'),
154 | ('b697cd77-5381-448c-85e7-5fc748f6a71a',70247034,'3063d441-8375-4313-8e57-26bbafeda1b8','Business Unit 5949','BUSINESS_UNIT','TRUE'),
155 | ('aeead089-00c2-4446-a7cf-1d3bb8584895',84492052,'1f264398-6315-4cdc-810e-9e94dea6794f','Group 2758','GROUP','FALSE'),
156 | ('be475fed-b0d7-4942-b2c3-3405f82f0f3b',29589018,'b80f9630-36e6-4bb1-9d98-7b5599f6949a','Group 2498','GROUP','FALSE'),
157 | ('34008cf4-6c47-44e7-bb9f-88dab099e2f6',34588239,'3d195912-b964-4fdf-b73b-42d7634be2e2','Microbiologist Department','DEPARTMENT','TRUE'),
158 | ('b91f019a-cda0-4700-8061-b37c265a9a98',90959828,'550f743c-4ece-4d8f-9299-bd34cdd45335','Cost Center 5136','COST_CENTER','TRUE');
159 |
--------------------------------------------------------------------------------
/data/sql/paygroup.sql:
--------------------------------------------------------------------------------
1 | CREATE TABLE IF NOT EXISTS paygroup (
2 | id TEXT,
3 | remote_id INTEGER,
4 | pay_group_name TEXT,
5 | remote_was_deleted TEXT
6 | );
7 | INSERT INTO paygroup VALUES ('e3811390-eb58-4df1-b584-45cec0fb782a',94082417,'Intern','FALSE'),
8 | ('960fc07d-5498-4434-9ede-260f71953829',14527836,'Intern','TRUE'),
9 | ('04790b29-05b1-4940-a6c2-10a531eec967',38075219,'Part Time','FALSE'),
10 | ('741688e4-24af-40e9-a8c1-a17858b65182',16969302,'Part Time','FALSE'),
11 | ('888d5ee8-f2b8-4d26-af49-4d2db2dbd4cc',60159115,'Full Time','TRUE'),
12 | ('32fe1da2-17a9-45c1-a739-750b87ff3208',44695553,'Part Time','TRUE'),
13 | ('864829a5-e07d-4b8f-9071-114241cef12d',58005905,'Full Time','TRUE'),
14 | ('0d590557-b001-49b3-9bb7-9ced50d9f0f1',19846453,'Temp','FALSE'),
15 | ('565a25f6-c3a8-4170-898c-e2c849ed36e4',81443812,'Temp','TRUE'),
16 | ('87fdc45a-257f-40a3-96a9-d3ce4b678555',50179773,'Contractor','FALSE'),
17 | ('b5fad43e-0e69-45ee-ac2f-4781b560d69b',25315543,'Intern','FALSE'),
18 | ('d85019d4-19c2-4329-8dc2-fdb7e8ec68f9',14396625,'Temp','FALSE'),
19 | ('6c47f4a0-f48c-4736-ad24-d5559c374e01',72729901,'Intern','TRUE'),
20 | ('a2124a15-ee9f-4f95-842d-cc628b981aa7',13270787,'Full Time','TRUE'),
21 | ('3cc64f00-0a08-4ba0-86de-9336d27a5231',94726128,'Temp','FALSE'),
22 | ('99d0217a-a881-47a8-aaa5-60252b9b1f2a',18376598,'Full Time','TRUE'),
23 | ('263b3fa3-fd1a-4958-adef-14074b7bbb89',84449283,'Part Time','FALSE');
24 |
--------------------------------------------------------------------------------
/data/sql/payrollrun.sql:
--------------------------------------------------------------------------------
1 | CREATE TABLE IF NOT EXISTS payrollrun (
2 | id TEXT,
3 | remote_id INTEGER,
4 | run_state TEXT,
5 | run_type TEXT,
6 | start_date TEXT,
7 | end_date TEXT,
8 | check_date TEXT,
9 | remote_was_deleted TEXT
10 | );
11 | INSERT INTO payrollrun VALUES ('3888d17b-094f-4fbe-b6e2-72de31e25b88',86137645,'CLOSED','SIGN_ON_BONUS','2022-09-20 00:00:00','2022-10-04 00:00:00','2022-10-04 00:00:00','TRUE'),
12 | ('0562335a-25b5-4bfd-bc1e-f3fe8bd70cd0',35245594,'CLOSED','TERMINATION','2022-10-04 00:00:00','2022-10-18 00:00:00','2022-10-18 00:00:00','TRUE'),
13 | ('d7d4f172-c018-4db3-8cb9-d08f12b25322',39524768,'PAID','REGULAR','2022-10-18 00:00:00','2022-11-01 00:00:00','2022-11-01 00:00:00','FALSE'),
14 | ('8790b00f-e440-482a-b7fc-a3fb95c0d760',22630897,'FAILED','CORRECTION','2022-11-01 00:00:00','2022-11-15 00:00:00','2022-11-15 00:00:00','FALSE'),
15 | ('7adc002c-5276-4ba7-b3ae-1a018d6abf0b',53756676,'PAID','REGULAR','2022-11-15 00:00:00','2022-11-29 00:00:00','2022-11-29 00:00:00','FALSE'),
16 | ('03f33031-d61c-4b55-babc-a46d94003bd0',59174739,'DRAFT','CORRECTION','2022-11-29 00:00:00','2022-12-13 00:00:00','2022-12-13 00:00:00','TRUE'),
17 | ('9798a296-e5d8-4480-94c2-d5681fdb0ac7',15247010,'APPROVED','SIGN_ON_BONUS','2022-12-13 00:00:00','2022-12-27 00:00:00','2022-12-27 00:00:00','FALSE'),
18 | ('528e3da5-47d7-4f10-a4df-8bbfcd158329',79542196,'FAILED','TERMINATION','2022-12-27 00:00:00','2023-01-10 00:00:00','2023-01-10 00:00:00','TRUE'),
19 | ('d6d1dfbf-1fef-477a-8943-844feddb183c',97959295,'APPROVED','CORRECTION','2023-01-10 00:00:00','2023-01-24 00:00:00','2023-01-24 00:00:00','FALSE'),
20 | ('b8a164a2-ff9c-499e-9181-c695ff50f5d1',54741712,'PAID','SIGN_ON_BONUS','2023-01-24 00:00:00','2023-02-07 00:00:00','2023-02-07 00:00:00','TRUE'),
21 | ('44cf3101-aea6-4dea-bd56-4ddb53600701',59441461,'DRAFT','OFF_CYCLE','2023-02-07 00:00:00','2023-02-21 00:00:00','2023-02-21 00:00:00','TRUE'),
22 | ('9491c51c-aa1d-476a-9e13-ccf0a2928ef0',39072415,'DRAFT','TERMINATION','2023-02-21 00:00:00','2023-03-07 00:00:00','2023-03-07 00:00:00','TRUE'),
23 | ('21872d6d-a0bc-4073-bbb0-9d3423e4f5c8',23250898,'DRAFT','CORRECTION','2023-03-07 00:00:00','2023-03-21 00:00:00','2023-03-21 00:00:00','FALSE'),
24 | ('8961771d-735d-4cae-83f1-226c75cfb04e',45166759,'PAID','SIGN_ON_BONUS','2023-03-21 00:00:00','2023-04-04 00:00:00','2023-04-04 00:00:00','FALSE'),
25 | ('914ec8c0-2bfb-4155-8b50-d0af1eb943ae',87155677,'DRAFT','CORRECTION','2023-04-04 00:00:00','2023-04-18 00:00:00','2023-04-18 00:00:00','TRUE'),
26 | ('0886a73a-4efe-4268-9341-048fd9ee56a0',18984129,'DRAFT','SIGN_ON_BONUS','2023-04-18 00:00:00','2023-05-02 00:00:00','2023-05-02 00:00:00','TRUE'),
27 | ('2783b97e-fc42-40ae-8d52-d510e883350b',17607453,'APPROVED','OFF_CYCLE','2023-05-02 00:00:00','2023-05-16 00:00:00','2023-05-16 00:00:00','FALSE');
28 |
--------------------------------------------------------------------------------
/images/stage1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nirav0999/NL2SQL-LLM/a28b819232ca6ff8d1873f1447dc13a6f977b94a/images/stage1.png
--------------------------------------------------------------------------------
/images/stage2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nirav0999/NL2SQL-LLM/a28b819232ca6ff8d1873f1447dc13a6f977b94a/images/stage2.png
--------------------------------------------------------------------------------
/images/stage3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nirav0999/NL2SQL-LLM/a28b819232ca6ff8d1873f1447dc13a6f977b94a/images/stage3.png
--------------------------------------------------------------------------------
/images/stage4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nirav0999/NL2SQL-LLM/a28b819232ca6ff8d1873f1447dc13a6f977b94a/images/stage4.png
--------------------------------------------------------------------------------
/images/stage5a.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nirav0999/NL2SQL-LLM/a28b819232ca6ff8d1873f1447dc13a6f977b94a/images/stage5a.png
--------------------------------------------------------------------------------
/images/stage5b.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nirav0999/NL2SQL-LLM/a28b819232ca6ff8d1873f1447dc13a6f977b94a/images/stage5b.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | # This file may be used to create an environment using:
2 | # $ conda create --name <env> --file <this file>
3 | # platform: osx-64
4 | _anaconda_depends=2023.03=py310_0
5 | aiohttp=3.8.4=pypi_0
6 | aiosignal=1.3.1=pypi_0
7 | alabaster=0.7.12=pyhd3eb1b0_0
8 | anaconda=custom=py310_1
9 | anyio=3.5.0=py310hecd8cb5_0
10 | appdirs=1.4.4=pyhd3eb1b0_0
11 | applaunchservices=0.3.0=py310hecd8cb5_0
12 | appnope=0.1.2=py310hecd8cb5_1001
13 | appscript=1.1.2=py310hca72f7f_0
14 | argon2-cffi=21.3.0=pyhd3eb1b0_0
15 | argon2-cffi-bindings=21.2.0=py310hca72f7f_0
16 | arrow=1.2.3=py310hecd8cb5_1
17 | astroid=2.14.2=py310hecd8cb5_0
18 | astropy=5.1=py310h4e76f89_0
19 | asttokens=2.0.5=pyhd3eb1b0_0
20 | async-timeout=4.0.2=pypi_0
21 | atomicwrites=1.4.0=py_0
22 | attrs=22.1.0=py310hecd8cb5_0
23 | automat=20.2.0=py_0
24 | autopep8=1.6.0=pyhd3eb1b0_1
25 | babel=2.11.0=py310hecd8cb5_0
26 | backcall=0.2.0=pyhd3eb1b0_0
27 | bcrypt=3.2.0=py310hca72f7f_1
28 | beautifulsoup4=4.12.2=py310hecd8cb5_0
29 | binaryornot=0.4.4=pyhd3eb1b0_1
30 | black=23.3.0=py310hecd8cb5_0
31 | blas=1.0=openblas
32 | bleach=4.1.0=pyhd3eb1b0_0
33 | blosc=1.21.3=hcec6c5f_0
34 | bokeh=2.4.3=py310hecd8cb5_0
35 | bottleneck=1.3.5=py310h4e76f89_0
36 | brotli=1.0.9=hca72f7f_7
37 | brotli-bin=1.0.9=hca72f7f_7
38 | brotlipy=0.7.0=py310hca72f7f_1002
39 | brunsli=0.1=h23ab428_0
40 | bzip2=1.0.8=h1de35cc_0
41 | c-ares=1.19.0=h6c40b1e_0
42 | ca-certificates=2023.5.7=h8857fd0_0
43 | certifi=2023.5.7=pyhd8ed1ab_0
44 | cffi=1.15.1=py310h6c40b1e_3
45 | cfitsio=3.470=hbd21bf8_7
46 | chardet=4.0.0=py310hecd8cb5_1003
47 | charls=2.2.0=h23ab428_0
48 | charset-normalizer=2.0.4=pyhd3eb1b0_0
49 | click=8.0.4=py310hecd8cb5_0
50 | cloudpickle=2.2.1=py310hecd8cb5_0
51 | colorama=0.4.6=py310hecd8cb5_0
52 | colorcet=3.0.1=py310hecd8cb5_0
53 | comm=0.1.2=py310hecd8cb5_0
54 | constantly=15.1.0=py310hecd8cb5_0
55 | contourpy=1.0.5=py310haf03e11_0
56 | cookiecutter=1.7.3=pyhd3eb1b0_0
57 | cryptography=39.0.1=py310hf6deb26_0
58 | cssselect=1.1.0=pyhd3eb1b0_0
59 | curl=7.88.1=h6c40b1e_0
60 | cycler=0.11.0=pyhd3eb1b0_0
61 | cytoolz=0.12.0=py310hca72f7f_0
62 | dask=2023.4.1=py310hecd8cb5_0
63 | dask-core=2023.4.1=py310hecd8cb5_0
64 | dataclasses-json=0.5.7=pypi_0
65 | datashader=0.14.4=py310hecd8cb5_0
66 | datashape=0.5.4=py310hecd8cb5_1
67 | debugpy=1.5.1=py310he9d5cce_0
68 | decorator=5.1.1=pyhd3eb1b0_0
69 | defusedxml=0.7.1=pyhd3eb1b0_0
70 | diff-match-patch=20200713=pyhd3eb1b0_0
71 | dill=0.3.6=py310hecd8cb5_0
72 | distributed=2023.4.1=py310hecd8cb5_0
73 | docstring-to-markdown=0.11=py310hecd8cb5_0
74 | docutils=0.18.1=py310hecd8cb5_3
75 | entrypoints=0.4=py310hecd8cb5_0
76 | et_xmlfile=1.1.0=py310hecd8cb5_0
77 | exceptiongroup=1.0.4=py310hecd8cb5_0
78 | executing=0.8.3=pyhd3eb1b0_0
79 | filelock=3.9.0=py310hecd8cb5_0
80 | flake8=6.0.0=py310hecd8cb5_0
81 | flask=2.2.2=py310hecd8cb5_0
82 | fonttools=4.25.0=pyhd3eb1b0_0
83 | freetype=2.12.1=hd8bbffd_0
84 | frozenlist=1.3.3=pypi_0
85 | fsspec=2023.4.0=py310hecd8cb5_0
86 | future=0.18.3=py310hecd8cb5_0
87 | gensim=4.3.0=py310h3ea8b11_0
88 | gettext=0.21.0=he85b6c0_1
89 | giflib=5.2.1=h6c40b1e_3
90 | glib=2.69.1=hfff2838_2
91 | gmp=6.2.1=he9d5cce_3
92 | gmpy2=2.1.2=py310hd5de756_0
93 | greenlet=2.0.1=py310hcec6c5f_0
94 | gst-plugins-base=1.14.1=hcec6c5f_1
95 | gstreamer=1.14.1=h6c40b1e_1
96 | h5py=3.7.0=py310h6c517f8_0
97 | hdf5=1.10.6=h10fe05b_1
98 | heapdict=1.0.1=pyhd3eb1b0_0
99 | holoviews=1.15.4=py310hecd8cb5_0
100 | huggingface_hub=0.10.1=py310hecd8cb5_0
101 | hvplot=0.8.2=py310hecd8cb5_0
102 | hyperlink=21.0.0=pyhd3eb1b0_0
103 | icu=58.2=h0a44026_3
104 | idna=3.4=py310hecd8cb5_0
105 | imagecodecs=2021.8.26=py310hf5cf8d7_2
106 | imageio=2.26.0=py310hecd8cb5_0
107 | imagesize=1.4.1=py310hecd8cb5_0
108 | imbalanced-learn=0.10.1=py310hecd8cb5_0
109 | importlib-metadata=6.0.0=py310hecd8cb5_0
110 | importlib_metadata=6.0.0=hd3eb1b0_0
111 | incremental=21.3.0=pyhd3eb1b0_0
112 | inflection=0.5.1=py310hecd8cb5_0
113 | iniconfig=1.1.1=pyhd3eb1b0_0
114 | intake=0.6.8=py310hecd8cb5_0
115 | intervaltree=3.1.0=pyhd3eb1b0_0
116 | ipykernel=6.19.2=py310h20db666_0
117 | ipython=8.12.0=py310hecd8cb5_0
118 | ipython_genutils=0.2.0=pyhd3eb1b0_1
119 | ipywidgets=8.0.4=py310hecd8cb5_0
120 | isort=5.9.3=pyhd3eb1b0_0
121 | itemadapter=0.3.0=pyhd3eb1b0_0
122 | itemloaders=1.0.4=pyhd3eb1b0_1
123 | itsdangerous=2.0.1=pyhd3eb1b0_0
124 | jaraco.classes=3.2.1=pyhd3eb1b0_0
125 | jedi=0.18.1=py310hecd8cb5_1
126 | jellyfish=0.9.0=py310hca72f7f_0
127 | jinja2=3.1.2=py310hecd8cb5_0
128 | jinja2-time=0.2.0=pyhd3eb1b0_3
129 | jmespath=0.10.0=pyhd3eb1b0_0
130 | joblib=1.1.1=py310hecd8cb5_0
131 | jpeg=9e=h6c40b1e_1
132 | jq=1.6=h9ed2024_1000
133 | json5=0.9.6=pyhd3eb1b0_0
134 | jsonschema=4.17.3=py310hecd8cb5_0
135 | jupyter=1.0.0=py310hecd8cb5_8
136 | jupyter_client=8.1.0=py310hecd8cb5_0
137 | jupyter_console=6.6.3=py310hecd8cb5_0
138 | jupyter_core=5.3.0=py310hecd8cb5_0
139 | jupyter_server=1.23.4=py310hecd8cb5_0
140 | jupyterlab=3.5.3=py310hecd8cb5_0
141 | jupyterlab_pygments=0.1.2=py_0
142 | jupyterlab_server=2.22.0=py310hecd8cb5_0
143 | jupyterlab_widgets=3.0.5=py310hecd8cb5_0
144 | jxrlib=1.1=haf1e3a3_2
145 | keyring=23.13.1=py310hecd8cb5_0
146 | kiwisolver=1.4.4=py310hcec6c5f_0
147 | krb5=1.19.4=hdba6334_0
148 | langchain=0.0.171=pypi_0
149 | lazy-object-proxy=1.6.0=py310hca72f7f_0
150 | lazy_loader=0.1=py310hecd8cb5_0
151 | lcms2=2.12=hf1fd2bf_0
152 | lerc=3.0=he9d5cce_0
153 | libaec=1.0.4=hb1e8313_1
154 | libbrotlicommon=1.0.9=hca72f7f_7
155 | libbrotlidec=1.0.9=hca72f7f_7
156 | libbrotlienc=1.0.9=hca72f7f_7
157 | libclang=14.0.6=default_hd95374b_1
158 | libclang13=14.0.6=default_habbcc1a_1
159 | libcurl=7.88.1=ha585b31_0
160 | libcxx=14.0.6=h9765a3e_0
161 | libdeflate=1.17=hb664fd8_0
162 | libedit=3.1.20221030=h6c40b1e_0
163 | libev=4.33=h9ed2024_1
164 | libffi=3.4.4=hecd8cb5_0
165 | libgfortran=5.0.0=11_3_0_hecd8cb5_28
166 | libgfortran5=11.3.0=h9dfd629_28
167 | libiconv=1.16=hca72f7f_2
168 | libllvm14=14.0.6=h91fad77_3
169 | libnghttp2=1.46.0=ha29bfda_0
170 | libopenblas=0.3.21=h54e7dc3_0
171 | libpng=1.6.39=h6c40b1e_0
172 | libpq=12.9=h1c9f633_3
173 | libprotobuf=3.20.3=hfff2838_0
174 | libsodium=1.0.18=h1de35cc_0
175 | libspatialindex=1.9.3=h23ab428_0
176 | libssh2=1.10.0=h0a4fc7d_0
177 | libtiff=4.5.0=hcec6c5f_2
178 | libuv=1.44.2=h6c40b1e_0
179 | libwebp=1.2.4=hf6ce154_1
180 | libwebp-base=1.2.4=h6c40b1e_1
181 | libxml2=2.10.3=h930c0e2_0
182 | libxslt=1.1.37=h6d1eb0e_0
183 | libzopfli=1.0.3=hb1e8313_0
184 | llvm-openmp=14.0.6=h0dcd299_0
185 | llvmlite=0.40.0=py310hfff2838_0
186 | locket=1.0.0=py310hecd8cb5_0
187 | lxml=4.9.2=py310h6c40b1e_0
188 | lz4-c=1.9.4=hcec6c5f_0
189 | lzo=2.10=haf1e3a3_2
190 | markdown=3.4.1=py310hecd8cb5_0
191 | markupsafe=2.1.1=py310hca72f7f_0
192 | marshmallow=3.19.0=pypi_0
193 | marshmallow-enum=1.5.1=pypi_0
194 | matplotlib=3.7.1=py310hecd8cb5_1
195 | matplotlib-base=3.7.1=py310ha533b9c_1
196 | matplotlib-inline=0.1.6=py310hecd8cb5_0
197 | mccabe=0.7.0=pyhd3eb1b0_0
198 | mistune=0.8.4=py310hca72f7f_1000
199 | mock=4.0.3=pyhd3eb1b0_0
200 | more-itertools=8.12.0=pyhd3eb1b0_0
201 | mpc=1.1.0=h6ef4df4_1
202 | mpfr=4.0.2=h9066e36_1
203 | mpmath=1.2.1=pypi_0
204 | msgpack-python=1.0.3=py310haf03e11_0
205 | multidict=6.0.4=pypi_0
206 | multipledispatch=0.6.0=py310hecd8cb5_0
207 | munkres=1.1.4=py_0
208 | mypy_extensions=0.4.3=py310hecd8cb5_1
209 | nbclassic=0.5.5=py310hecd8cb5_0
210 | nbclient=0.5.13=py310hecd8cb5_0
211 | nbconvert=6.5.4=py310hecd8cb5_0
212 | nbformat=5.7.0=py310hecd8cb5_0
213 | ncurses=6.4=hcec6c5f_0
214 | nest-asyncio=1.5.6=py310hecd8cb5_0
215 | networkx=2.8.4=py310hecd8cb5_1
216 | ninja=1.10.2=hecd8cb5_5
217 | ninja-base=1.10.2=haf03e11_5
218 | nltk=3.7=pyhd3eb1b0_0
219 | notebook=6.5.4=py310hecd8cb5_0
220 | notebook-shim=0.2.2=py310hecd8cb5_0
221 | nspr=4.33=he9d5cce_0
222 | nss=3.74=h47edf6a_0
223 | numba=0.57.0=py310h3ea8b11_0
224 | numexpr=2.8.4=py310he50c29a_1
225 | numpy=1.24.3=py310he50c29a_0
226 | numpy-base=1.24.3=py310h992e150_0
227 | numpydoc=1.5.0=py310hecd8cb5_0
228 | oniguruma=6.9.7.1=h9ed2024_0
229 | openai=0.27.6=pypi_0
230 | openapi-schema-pydantic=1.2.4=pypi_0
231 | openjpeg=2.4.0=h66ea3da_0
232 | openpyxl=3.0.10=py310hca72f7f_0
233 | openssl=1.1.1t=hfd90126_0
234 | packaging=23.0=py310hecd8cb5_0
235 | pandas=1.5.3=py310h3ea8b11_0
236 | pandocfilters=1.5.0=pyhd3eb1b0_0
237 | panel=0.14.3=py310hecd8cb5_0
238 | param=1.12.3=py310hecd8cb5_0
239 | parsel=1.6.0=py310hecd8cb5_0
240 | parso=0.8.3=pyhd3eb1b0_0
241 | partd=1.2.0=pyhd3eb1b0_1
242 | pathspec=0.10.3=py310hecd8cb5_0
243 | patsy=0.5.3=py310hecd8cb5_0
244 | pcre=8.45=h23ab428_0
245 | pep8=1.7.1=py310hecd8cb5_1
246 | pexpect=4.8.0=pyhd3eb1b0_3
247 | pickleshare=0.7.5=pyhd3eb1b0_1003
248 | pillow=9.4.0=py310hcec6c5f_0
249 | pip=23.0.1=py310hecd8cb5_0
250 | platformdirs=2.5.2=py310hecd8cb5_0
251 | plotly=5.9.0=py310hecd8cb5_0
252 | pluggy=1.0.0=py310hecd8cb5_1
253 | ply=3.11=py310hecd8cb5_0
254 | pooch=1.4.0=pyhd3eb1b0_0
255 | poyo=0.5.0=pyhd3eb1b0_0
256 | prometheus_client=0.14.1=py310hecd8cb5_0
257 | prompt-toolkit=3.0.36=py310hecd8cb5_0
258 | prompt_toolkit=3.0.36=hd3eb1b0_0
259 | protego=0.1.16=py_0
260 | protobuf=3.20.1=pypi_0
261 | psutil=5.9.0=py310hca72f7f_0
262 | ptyprocess=0.7.0=pyhd3eb1b0_2
263 | pure_eval=0.2.2=pyhd3eb1b0_0
264 | pyasn1=0.4.8=pyhd3eb1b0_0
265 | pyasn1-modules=0.2.8=py_0
266 | pycodestyle=2.10.0=py310hecd8cb5_0
267 | pycparser=2.21=pyhd3eb1b0_0
268 | pyct=0.5.0=py310hecd8cb5_0
269 | pycurl=7.45.2=py310hdb2fb19_0
270 | pydantic=1.10.7=pypi_0
271 | pydispatcher=2.0.5=py310hecd8cb5_2
272 | pydocstyle=6.3.0=py310hecd8cb5_0
273 | pyerfa=2.0.0=py310hca72f7f_0
274 | pyflakes=3.0.1=py310hecd8cb5_0
275 | pygments=2.15.1=py310hecd8cb5_0
276 | pylint=2.16.2=py310hecd8cb5_0
277 | pylint-venv=2.3.0=py310hecd8cb5_0
278 | pyls-spyder=0.4.0=pyhd3eb1b0_0
279 | pyobjc-core=9.0=py310h9205ec4_1
280 | pyobjc-framework-cocoa=9.0=py310h9205ec4_0
281 | pyobjc-framework-coreservices=9.0=py310h46256e1_0
282 | pyobjc-framework-fsevents=9.0=py310hecd8cb5_0
283 | pyodbc=4.0.34=py310he9d5cce_0
284 | pyopenssl=23.0.0=py310hecd8cb5_0
285 | pyparsing=3.0.9=py310hecd8cb5_0
286 | pyqt=5.15.7=py310he9d5cce_0
287 | pyqt5-sip=12.11.0=pypi_0
288 | pyqtwebengine=5.15.7=py310he9d5cce_0
289 | pyrsistent=0.18.0=py310hca72f7f_0
290 | pysocks=1.7.1=py310hecd8cb5_0
291 | pytables=3.7.0=py310h59775c6_1
292 | pytest=7.3.1=py310hecd8cb5_0
293 | python=3.10.11=h218abb5_2
294 | python-dateutil=2.8.2=pyhd3eb1b0_0
295 | python-fastjsonschema=2.16.2=py310hecd8cb5_0
296 | python-lmdb=1.4.1=py310hcec6c5f_0
297 | python-lsp-black=1.2.1=py310hecd8cb5_0
298 | python-lsp-jsonrpc=1.0.0=pyhd3eb1b0_0
299 | python-lsp-server=1.7.2=py310hecd8cb5_0
300 | python-slugify=5.0.2=pyhd3eb1b0_0
301 | python-snappy=0.6.1=py310hcec6c5f_0
302 | python.app=3=py310hca72f7f_0
303 | python_abi=3.10=2_cp310
304 | pytoolconfig=1.2.5=py310hecd8cb5_1
305 | pytorch=1.13.1=cpu_py310h9e40b02_0
306 | pytz=2022.7=py310hecd8cb5_0
307 | pyviz_comms=2.0.2=pyhd3eb1b0_0
308 | pywavelets=1.4.1=py310h6c40b1e_0
309 | pyyaml=6.0=py310h6c40b1e_1
310 | pyzmq=25.0.2=py310hcec6c5f_0
311 | qdarkstyle=3.0.2=pyhd3eb1b0_0
312 | qstylizer=0.2.2=py310hecd8cb5_0
313 | qt-main=5.15.2=h51e0635_8
314 | qt-webengine=5.15.9=h90a370e_4
315 | qtawesome=1.2.2=py310hecd8cb5_0
316 | qtconsole=5.4.2=py310hecd8cb5_0
317 | qtpy=2.2.0=py310hecd8cb5_0
318 | qtwebkit=5.212=hbfab81c_5
319 | queuelib=1.5.0=py310hecd8cb5_0
320 | readline=8.2=hca72f7f_0
321 | regex=2022.7.9=py310hca72f7f_0
322 | requests=2.29.0=py310hecd8cb5_0
323 | requests-file=1.5.1=pyhd3eb1b0_0
324 | rope=1.7.0=py310hecd8cb5_0
325 | rtree=1.0.1=py310hecd8cb5_0
326 | scikit-image=0.20.0=py310hcec6c5f_0
327 | scikit-learn=1.2.2=py310hcec6c5f_0
328 | scipy=1.10.1=py310ha516a68_1
329 | scrapy=2.8.0=py310hecd8cb5_0
330 | seaborn=0.12.2=py310hecd8cb5_0
331 | send2trash=1.8.0=pyhd3eb1b0_1
332 | sentence-transformers=2.2.2=pyhd8ed1ab_0
333 | sentencepiece=0.1.95=pypi_0
334 | service_identity=18.1.0=pyhd3eb1b0_1
335 | setuptools=66.0.0=py310hecd8cb5_0
336 | sip=6.6.2=py310he9d5cce_0
337 | six=1.16.0=pyhd3eb1b0_1
338 | smart_open=5.2.1=py310hecd8cb5_0
339 | snappy=1.1.9=he9d5cce_0
340 | sniffio=1.2.0=py310hecd8cb5_1
341 | snowballstemmer=2.2.0=pyhd3eb1b0_0
342 | sortedcontainers=2.4.0=pyhd3eb1b0_0
343 | soupsieve=2.4=py310hecd8cb5_0
344 | sphinx=5.0.2=py310hecd8cb5_0
345 | sphinxcontrib-applehelp=1.0.2=pyhd3eb1b0_0
346 | sphinxcontrib-devhelp=1.0.2=pyhd3eb1b0_0
347 | sphinxcontrib-htmlhelp=2.0.0=pyhd3eb1b0_0
348 | sphinxcontrib-jsmath=1.0.1=pyhd3eb1b0_0
349 | sphinxcontrib-qthelp=1.0.3=pyhd3eb1b0_0
350 | sphinxcontrib-serializinghtml=1.1.5=pyhd3eb1b0_0
351 | spyder=5.4.3=py310hecd8cb5_1
352 | spyder-kernels=2.4.3=py310hecd8cb5_0
353 | sqlalchemy=1.4.39=py310hca72f7f_0
354 | sqlite=3.41.2=h6c40b1e_0
355 | stack_data=0.2.0=pyhd3eb1b0_0
356 | statsmodels=0.13.5=py310h7b7cdfe_1
357 | sympy=1.11.1=py310hecd8cb5_0
358 | tabulate=0.8.10=py310hecd8cb5_0
359 | tbb=2021.8.0=ha357a0b_0
360 | tbb4py=2021.8.0=py310ha357a0b_0
361 | tblib=1.7.0=pyhd3eb1b0_0
362 | tenacity=8.2.2=py310hecd8cb5_0
363 | terminado=0.17.1=py310hecd8cb5_0
364 | text-unidecode=1.3=pyhd3eb1b0_0
365 | textdistance=4.2.1=pyhd3eb1b0_0
366 | threadpoolctl=2.2.0=pyh0d69192_0
367 | three-merge=0.1.1=pyhd3eb1b0_0
368 | tifffile=2021.7.2=pyhd3eb1b0_2
369 | tinycss2=1.2.1=py310hecd8cb5_0
370 | tk=8.6.12=h5d9f67b_0
371 | tldextract=3.2.0=pyhd3eb1b0_0
372 | tokenizers=0.11.4=py310h8776b5c_1
373 | toml=0.10.2=pyhd3eb1b0_0
374 | tomli=2.0.1=py310hecd8cb5_0
375 | tomlkit=0.11.1=py310hecd8cb5_0
376 | toolz=0.12.0=py310hecd8cb5_0
377 | torchvision=0.14.1=cpu_py310hd5ee960_0
378 | tornado=6.2=py310hca72f7f_0
379 | tqdm=4.65.0=py310h20db666_0
380 | traitlets=5.7.1=py310hecd8cb5_0
381 | transformers=4.24.0=py310hecd8cb5_0
382 | twisted=22.10.0=py310h6c40b1e_0
383 | typing-extensions=4.5.0=py310hecd8cb5_0
384 | typing-inspect=0.8.0=pypi_0
385 | typing_extensions=4.5.0=py310hecd8cb5_0
386 | tzdata=2023c=h04d1e81_0
387 | ujson=5.4.0=py310he9d5cce_0
388 | unidecode=1.2.0=pyhd3eb1b0_0
389 | unixodbc=2.3.11=hb456775_0
390 | urllib3=1.26.15=py310hecd8cb5_0
391 | w3lib=1.21.0=pyhd3eb1b0_0
392 | watchdog=2.1.6=py310h999c104_0
393 | wcwidth=0.2.5=pyhd3eb1b0_0
394 | webencodings=0.5.1=py310hecd8cb5_1
395 | websocket-client=0.58.0=py310hecd8cb5_4
396 | werkzeug=2.2.3=py310hecd8cb5_0
397 | whatthepatch=1.0.2=py310hecd8cb5_0
398 | wheel=0.38.4=py310hecd8cb5_0
399 | widgetsnbextension=4.0.5=py310hecd8cb5_0
400 | wrapt=1.14.1=py310hca72f7f_0
401 | wurlitzer=3.0.2=py310hecd8cb5_0
402 | xarray=2022.11.0=py310hecd8cb5_0
403 | xlwings=0.29.1=py310hecd8cb5_0
404 | xz=5.4.2=h6c40b1e_0
405 | yaml=0.2.5=haf1e3a3_0
406 | yapf=0.31.0=pyhd3eb1b0_0
407 | yarl=1.9.2=pypi_0
408 | zeromq=4.3.4=h23ab428_0
409 | zfp=0.5.5=he9d5cce_6
410 | zict=2.2.0=py310hecd8cb5_0
411 | zipp=3.11.0=py310hecd8cb5_0
412 | zlib=1.2.13=h4dc903c_0
413 | zope=1.0=py310hecd8cb5_1
414 | zope.interface=5.4.0=py310hca72f7f_0
415 | zstd=1.5.5=hc035e20_0
416 |
--------------------------------------------------------------------------------
/src/check_retr_data.py:
--------------------------------------------------------------------------------
1 | from utils import loadJsonFile, dumpJsonFile
2 | import os
3 | import sqlite3
4 | import pandas as pd
5 | import random
6 |
7 | # Table Name
8 | table_name = "group_final"
9 |
10 | # Loading SQL Database
11 | DB_FILEPATH = "../data/database/final_db.db"
12 | conn = sqlite3.connect(DB_FILEPATH)
13 |
14 | # Load the example queries
15 | df = pd.read_csv("../data/example_queries/complete_set/" + table_name + ".csv", header="infer", delimiter="|")
16 |
17 | final_examples = {}
18 |
19 | print(df.columns)
20 |
21 | # Check if the query is valid
22 | for index, row in df.iterrows():
23 |     question = row['Question']
24 |     query = row['SQL Query']
25 |     print(" ========= SQL Result ========= ")
26 |     try:
27 |         print("Question: ", question)
28 |         print("Query: ", query)
29 |         print(conn.execute(query).fetchall())
30 |         final_examples[question] = query
31 |     except Exception:
32 |         print("Query Failed")
33 |
34 | conn.close()
35 |
36 | print("Total Number of Successful Examples: ", len(final_examples))
37 |
38 | final_dataset = []
39 | for example in final_examples:
40 |     final_dataset.append([example, final_examples[example]])
41 |
42 | random.shuffle(final_dataset)
43 |
44 | # Picking the first 5 examples as test set
45 | test_data = pd.DataFrame(final_dataset[:5], columns = ['Question', 'SQL Query'])
46 |
47 | # Picking the rest as retrieval set
48 | retr_data = pd.DataFrame(final_dataset[5:], columns = ['Question', 'SQL Query'])
49 |
50 | test_data.to_csv("../data/example_queries/test_set/final_" + table_name + ".csv", sep="|", index=False)
51 | retr_data.to_csv("../data/example_queries/retr_set/final_" + table_name + ".csv", sep="|", index=False)
--------------------------------------------------------------------------------
/src/create_db.py:
--------------------------------------------------------------------------------
1 | import os
2 | from utils import dumpJsonFile, loadJsonFile, createSQLDB
3 | import sqlite3
4 |
5 | DB_FILEPATH = "../data/database/final_db.db"
6 |
7 | earning_db_path = "../data/sql/earning.sql"
8 | employee_db_path = "../data/sql/employee.sql"
9 | employeepayrollrun_db_path = "../data/sql/employeepayrollrun.sql"
10 | group_db_path = "../data/sql/group_final.sql"
11 | payrollrun_db_path = "../data/sql/payrollrun.sql"
12 | paygroup_db_path = "../data/sql/paygroup.sql"
13 |
14 | conn = sqlite3.connect(DB_FILEPATH)
15 |
16 | # Each script is executed exactly once: the INSERT statements are not idempotent,
17 | # so running a script twice would duplicate its rows.
18 | with open(earning_db_path, 'r') as sql_file:
19 |     conn.executescript(sql_file.read())
20 |
21 | with open(employee_db_path, 'r') as sql_file:
22 |     conn.executescript(sql_file.read())
23 |
24 | with open(employeepayrollrun_db_path, 'r') as sql_file:
25 |     conn.executescript(sql_file.read())
26 |
27 | with open(group_db_path, 'r') as sql_file:
28 |     conn.executescript(sql_file.read())
29 |
30 | with open(payrollrun_db_path, 'r') as sql_file:
31 |     conn.executescript(sql_file.read())
32 |
33 | with open(paygroup_db_path, 'r') as sql_file:
34 |     conn.executescript(sql_file.read())
35 |
36 | conn.close()
37 |
--------------------------------------------------------------------------------
/src/main.py:
--------------------------------------------------------------------------------
1 | import os
2 | from utils import dumpJsonFile, loadJsonFile
3 | import sqlite3
4 | import pandas as pd
5 | from text_sim import *
6 | from tqdm import tqdm
7 | import time
8 |
9 | import warnings
10 | warnings.filterwarnings('ignore')
11 |
12 |
13 | from typing import List
14 | from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
15 |
16 |
17 | # Model checkpoint hosted on the Hugging Face Hub
18 | MODEL_PATH = "juierror/flan-t5-text2sql-with-schema"
19 | TOKENIZER_PATH = "juierror/flan-t5-text2sql-with-schema"
20 | COLUMNS_JSON_FILE = "../data/columns.json"
21 |
22 | def prepare_input(question: str, table: List[str]):
23 |     table_prefix = "table:"
24 |     table_name_prefix = "table_name:"
25 |     sample_prefix = ""
26 |     question_prefix = "question:"
27 |     join_table = ",".join(table)
28 |     inputs = f"""
29 | You are an SQL Query expert who can write SQL queries for the below table.
30 |
31 | {table_prefix} {join_table}
32 |
33 | Answer the following question:
34 | question : {question}
35 | """
36 |     # print("\t ---- Prompt ----- \t")
37 |     # print(inputs)
38 |
39 |     input_ids = tokenizer(inputs, max_length=512, return_tensors="pt").input_ids
40 |     return input_ids
41 |
42 |
43 | def cot_prepare_input(question: str, table: List[str], questions : List[str], example_queries : List[str]):
44 |     table_prefix = "table:"
45 |     table_name_prefix = "table_name:"
46 |     sample_prefix = ""
47 |     question_prefix = "question:"
48 |     join_table = ",".join(table)
49 |     inputs = f"""
50 | You are an SQL Query expert who can write SQL queries for the below table.
51 | {table_prefix} {join_table}
52 | For the below questions, you are given the example queries. You need to write the SQL query for the last question.
53 | """
54 |
55 |     for question_no, s_question in enumerate(questions):
56 |         inputs += f"""
57 | {s_question}
58 | {example_queries[question_no]}
59 | """
60 |
61 |     inputs += f"""
62 | Only answer the following question:
63 | {question},
64 | """
65 |
66 |     input_ids = tokenizer(inputs, max_length=512, return_tensors="pt").input_ids
67 |     return input_ids
68 |
69 |
70 | def inference(question: str, table: List[str]) -> str:
71 |     input_data = prepare_input(question=question, table=table)
72 |     input_data = input_data.to(model.device)
73 |     outputs = model.generate(inputs=input_data, num_beams=10, top_k=10, max_length=1024)
74 |     result = tokenizer.decode(token_ids=outputs[0], skip_special_tokens=True)
75 |     return result
76 |
77 | def cot_inference(question: str, table: List[str], questions : List[str], example_queries : List[str]) -> str:
78 |     input_data = cot_prepare_input(question=question, table=table, questions = questions, example_queries = example_queries)
79 |     input_data = input_data.to(model.device)
80 |     outputs = model.generate(inputs=input_data, num_beams=10, top_k=10, max_length=1024)
81 |     result = tokenizer.decode(token_ids=outputs[0], skip_special_tokens=True)
82 |     return result
83 |
84 |
85 |
86 | def test(question, DB_FILEPATH = "../data/database/final_db.db", k = 5, table_name = "employee"):
87 |
88 |     # Retrieve the samples
89 |     test_df = pd.read_csv("../data/example_queries/test_set/final_" + table_name + ".csv", delimiter="|")
90 |     retr_df = pd.read_csv("../data/example_queries/retr_set/final_" + table_name + ".csv", delimiter="|")
91 |
92 |     sample_questions = []
93 |     sample_queries = []
94 |
95 |     print("\n")
96 |
97 |     for index, row in retr_df.iterrows():
98 |         s_question = row["Question"]
99 |         sql_query = row["SQL Query"]
100 |         sample_questions.append(s_question)
101 |         sample_queries.append(sql_query)
102 |
103 |     print("Loading SQL Database ...")
104 |     conn = sqlite3.connect(DB_FILEPATH)
105 |     for i in tqdm(range(2)):
106 |         time.sleep(1)
107 |
108 |
109 |     print("\n")
110 |
111 |     print("Retrieving top " + str(k) + " similar queries from the dataset to perform Chain of Thought (CoT) Prompting ...")
112 |     top_k_indices = get_top_k_similar(question, sample_questions, k = k)
113 |
114 |     sample_questions = [sample_questions[i] for i in top_k_indices]
115 |     sample_queries = [sample_queries[i] for i in top_k_indices]
116 |
117 |     for question_no, s_question in enumerate(sample_questions):
118 |         print("Question : ", sample_questions[question_no])
119 |         print("SQL Query :", sample_queries[question_no])
120 |
121 |     print("\n")
122 |
123 |     # print(" ========= Zero-Shot Test SQL ========= ")
124 |     gen_query1 = inference(question, table)
125 |     gen_query1 = gen_query1.replace(" table", " " + table_name)
126 |
127 |     print("Generated Query using Zero-Shot Prompting = ", gen_query1)
128 |     try:
129 |         print(conn.execute(gen_query1).fetchall())
130 |         print("Zero-Shot Query Works!")
131 |     except Exception:
132 |         print("Error in Zero-Shot SQL Query")
133 |
134 |     # print(" ========= CoT Test SQL ========= ")
135 |     gen_query2 = cot_inference(question, table, sample_questions, sample_queries)
136 |     gen_query2 = gen_query2.replace(" table", " " + table_name)
137 |
138 |     print("Generated Query using Chain of Thought (CoT) Prompting = ", gen_query2)
139 |     try:
140 |         print(conn.execute(gen_query2).fetchall())
141 |         print("CoT Query Works!")
142 |     except Exception:
143 |         print("Error in CoT SQL Query!")
144 |
145 |     conn.close()
146 |
147 | def test_dataset(DB_FILEPATH = "../data/database/final_db.db", k = 5, table_name = "employee"):
148 |
149 |     test_df = pd.read_csv("../data/example_queries/test_set/final_" + table_name + ".csv", delimiter="|")
150 |     retr_df = pd.read_csv("../data/example_queries/retr_set/final_" + table_name + ".csv", delimiter="|")
151 |
152 |     sample_questions = []
153 |     sample_queries = []
154 |
155 |
156 |     print("\n")
157 |
158 |     for index, row in retr_df.iterrows():
159 |         s_question = row["Question"]
160 |         sql_query = row["SQL Query"]
161 |         sample_questions.append(s_question)
162 |         sample_queries.append(sql_query)
163 |
164 |     print("Loading SQL Database ...")
165 |     conn = sqlite3.connect(DB_FILEPATH)
166 |     for i in tqdm(range(2)):
167 |         time.sleep(1)
168 |
169 |     for index, row in test_df.iterrows():
170 |         print("\n\n\n\n\n")
171 |         print(" ========= Test Query ========= ")
172 |         question = row["Question"]
173 |
174 |         top_k_indices = get_top_k_similar(question, sample_questions, k=k)
175 |
176 |         print("Loading Sample ")
177 |         # Select into new lists so the full retrieval pool stays intact for the next question.
178 |         top_questions = [sample_questions[i] for i in top_k_indices]
179 |         top_queries = [sample_queries[i] for i in top_k_indices]
180 |
181 |         print("Sample Questions = ", top_questions)
182 |         print("Sample Queries = ", top_queries)
183 |
184 |         print(" ========= Zero-Shot Test SQL ========= ")
185 |         sql_query = row["SQL Query"]
186 |         gen_query = inference(question, table)
187 |         gen_query = gen_query.replace(" table", " " + table_name)
188 |         print("Generated Query = ", gen_query)
189 |         print("Original Query = ", sql_query)
190 |         try:
191 |             print(conn.execute(gen_query).fetchall())
192 |         except Exception as err:
193 |             print("Error in SQL Query:", err)
194 |
195 |
196 |         print("\n\n\n\n\n")
197 |
198 |         print(" ========= CoT Test SQL ========= ")
199 |         gen_query = cot_inference(question, table, top_questions, top_queries)
200 |         gen_query = gen_query.replace(" table", " " + table_name)
201 |         print("Generated Query = ", gen_query)
202 |         print("Original Query = ", sql_query)
203 |         try:
204 |             print(conn.execute(gen_query).fetchall())
205 |         except Exception as err:
206 |             print("Error in SQL Query:", err)
207 |
208 |     conn.close()
209 |
210 |
211 |
212 | print("\n\n\n")
213 | print("Hi! I am a SQL Query Generator. I can generate SQL queries for you. Please enter the table name for which you want to generate the SQL query.")
214 | table_name = input("Enter the table name : ")
215 |
216 | columns = loadJsonFile("../data/columns.json", verbose=False)
217 | table = columns[table_name]
218 |
219 | print("\n")
220 |
221 | print("Loading the model ...")
222 | tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH)
223 | model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_PATH)
224 |
225 | test_bool = input("Do you want to enter your own question (y) or perform an evaluation on the test dataset (n)? ").strip().lower()
226 |
227 | print("\n")
228 |
229 | if test_bool == "y":
230 |     question = input("Enter your question : ")
231 |     test(question, DB_FILEPATH = "../data/database/final_db.db", k = 5, table_name = table_name)
232 | else:
233 |     test_dataset(DB_FILEPATH = "../data/database/final_db.db", k = 5, table_name = table_name)
--------------------------------------------------------------------------------
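The evaluation code in `main.py` runs each generated query inside a `try`/`except` and prints a generic message on failure. The following is a minimal sketch of a helper that returns either the rows or the database error text, so that failures can be inspected or counted. The name `run_generated_query` and the in-memory `employee` table are assumptions made for illustration; they are not part of the repository.

```python
import sqlite3


def run_generated_query(conn, sql):
    """Execute one generated SQL statement and return (rows, error_message)."""
    try:
        rows = conn.execute(sql).fetchall()
        return rows, None
    except (sqlite3.Error, sqlite3.Warning) as err:
        # Surface the database error text instead of hiding it.
        return None, str(err)


if __name__ == "__main__":
    # Hypothetical in-memory table standing in for final_db.db.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO employee VALUES (1, 'Ada')")
    print(run_generated_query(conn, "SELECT name FROM employee"))
    print(run_generated_query(conn, "SELECT missing FROM employee"))
    conn.close()
```

Returning the error rather than printing it would make it straightforward to tally how many zero-shot and few-shot queries execute successfully across the test set.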
/src/text_sim.py:
--------------------------------------------------------------------------------
1 | from sentence_transformers import SentenceTransformer
2 | import pandas as pd
3 | import numpy as np
4 |
5 | model = SentenceTransformer('all-MiniLM-L6-v2')
6 |
7 | def cosine_similarity(a, b):
8 |     return np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))
9 |
10 | def get_embedding(sentence):
11 |     embed = model.encode(sentence)
12 |     return embed
13 |
14 | def get_similarity(query, sentence):
15 |     query_embedding = get_embedding(query)
16 |     sentence_embedding = get_embedding(sentence)
17 |     return cosine_similarity(query_embedding, sentence_embedding)
18 |
19 | def get_top_k_similar(query, sentences, k=5):
20 |     # Embed the query once instead of re-encoding it for every candidate sentence.
21 |     query_embedding = get_embedding(query)
22 |     similarities = np.array(
23 |         [cosine_similarity(query_embedding, get_embedding(sentence)) for sentence in sentences]
24 |     )
25 |     # argsort is ascending, so the last k indices are the k most similar (most similar last).
26 |     return np.argsort(similarities, axis=0)[-k:]
27 |
28 | if __name__ == "__main__":
29 |     pass
--------------------------------------------------------------------------------
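The retrieval step in `text_sim.py` ranks the stored questions by cosine similarity to the query and keeps the top `k`. Below is a dependency-free sketch of that ranking, assuming the embeddings have already been computed (for example by `model.encode`). The names `top_k_indices` and `select_examples` are hypothetical. Unlike `get_top_k_similar`, this sketch returns the most similar index first, and it builds new example lists rather than overwriting the retrieval pool.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_indices(query_vec, candidate_vecs, k=5):
    """Indices of the k candidates most similar to the query, best match first."""
    scores = [cosine(query_vec, vec) for vec in candidate_vecs]
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return ranked[:k]


def select_examples(indices, questions, queries):
    """Pick the retrieved question/query pairs without mutating the input lists."""
    return [questions[i] for i in indices], [queries[i] for i in indices]


if __name__ == "__main__":
    # Toy 2-D vectors standing in for sentence embeddings.
    pool = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    idx = top_k_indices([1.0, 0.0], pool, k=2)
    print(idx)
    print(select_examples(idx, ["q0", "q1", "q2"], ["sql0", "sql1", "sql2"]))
```

Whether the best match should come first or last in the prompt is a design choice; the original function places it last.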
/src/utils.py:
--------------------------------------------------------------------------------
1 | import json
2 | import csv
3 | import collections
4 | import pickle
5 | import os
6 | import sqlite3
7 |
8 |
9 | #-------------------------JSON Functions-------------------------------------------------------
10 | def dumpJsonFile(dictionary, filepath, verbose = True, print_dict = False):
11 |     """
12 |     Dump a dictionary to a JSON file
13 |     """
14 |     if verbose: print("Dumping a dictionary to filepath", filepath, "...............")
15 |
16 |     with open(filepath,"w+") as jsonFile:
17 |         json.dump(dictionary, jsonFile, indent = 4, sort_keys = True)
18 |
19 |     if print_dict: print(json.dumps(dictionary,indent = 4))
20 |     if verbose: print("Dumped Successfully")
21 |
22 |
23 | def loadJsonFile(filepath, verbose = True, print_dict = False):
24 |     """
25 |     Load a dictionary from a JSON file
26 |     """
27 |     if verbose: print("Loading a dictionary from filepath", filepath, ".........")
28 |     dictionary = {}
29 |
30 |     with open(filepath) as jsonFile:
31 |         dictionary = json.load(jsonFile)
32 |
33 |     if verbose: print("Loaded Successfully")
34 |     if print_dict: print(json.dumps(dictionary,indent = 4))
35 |
36 |     return dictionary
37 |
38 | def createSQLDB(db_path, sql_path):
39 |     """
40 |     Create a SQL database from a SQL file
41 |     """
42 |     conn = sqlite3.connect(db_path)
43 |     with open(sql_path, 'r') as sql_file:
44 |         conn.executescript(sql_file.read())
45 |     conn.close()
46 |
47 |
48 |
49 | def convert_delimiter(input_file, output_file):
50 |     with open(input_file, 'r', newline='') as file:
51 |         reader = csv.reader(file, delimiter='\t')
52 |         rows = list(reader)
53 |
54 |     with open(output_file, 'w', newline='') as file:
55 |         writer = csv.writer(file, delimiter='|')
56 |         writer.writerows(rows)
57 |
58 |     print(f"Conversion complete. Converted file saved as {output_file}")
59 |
60 | if __name__ == "__main__":
61 |     pass
--------------------------------------------------------------------------------
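`createSQLDB` in `utils.py` opens a connection, runs a SQL script, and closes the connection, but the close is skipped if the script raises. The sketch below shows one possible variant that always closes the connection by using `contextlib.closing`. The function name `create_sql_db`, the temporary paths, and the `paygroup` columns are assumptions made for illustration and may not match the files under `data/sql`.

```python
import contextlib
import os
import sqlite3
import tempfile


def create_sql_db(db_path, sql_path):
    """Create (or update) a SQLite database by running the script at sql_path."""
    with open(sql_path, "r") as sql_file:
        script = sql_file.read()
    # closing() guarantees conn.close() runs even if the script fails.
    with contextlib.closing(sqlite3.connect(db_path)) as conn:
        conn.executescript(script)
        conn.commit()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sql_path = os.path.join(tmp, "paygroup.sql")
        db_path = os.path.join(tmp, "demo.db")
        # Hypothetical schema; the real paygroup.sql may differ.
        with open(sql_path, "w") as f:
            f.write(
                "CREATE TABLE paygroup (id INTEGER, name TEXT);\n"
                "INSERT INTO paygroup VALUES (1, 'Weekly');\n"
            )
        create_sql_db(db_path, sql_path)
        with contextlib.closing(sqlite3.connect(db_path)) as conn:
            print(conn.execute("SELECT name FROM paygroup").fetchall())
```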