├── English_to_Python.ipynb ├── LICENSE ├── README.md └── res ├── attention_python_code_generator.png ├── resources.md └── transformer_multihead.png /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Divyam Shah 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # English-to-Python-Converter 2 | This is an attempt to use transformers and self-attention in order to convert English descriptions into Python code. 
3 | 4 | [Notebook](https://github.com/divyam96/English-to-Python-Converter/blob/main/English_to_Python.ipynb) 5 | 6 | [Pretrained Model](https://drive.google.com/file/d/1-YLd_DTt3W8R_vqga70zdJ8pdK3x2nWm/view?usp=sharing) 7 | 8 | [Dataset](https://drive.google.com/file/d/1rHb0FQ5z5ZpaY2HpyFGY6CeyDG0kTLoO/view?usp=sharing) 9 | 10 | ## Data Cleaning 11 | 12 | We will be using this pre-curated [Dataset](https://drive.google.com/file/d/1rHb0FQ5z5ZpaY2HpyFGY6CeyDG0kTLoO/view?usp=sharing) for training our transformer model. The format of the data is as follows: 13 | 14 | ``` 15 | # English Description 1 16 | Python Code 1 17 | 18 | 19 | # English Description 2 20 | Python Code 2 21 | 22 | 23 | # English Description 3 24 | Python Code 3 25 | 26 | 27 | . 28 | . 29 | . 30 | ``` 31 | 32 | Each English description/question starts with a '#' and is followed by its corresponding Python code; each data point therefore comprises a question and its Python solution. We can look at the first character of each line to detect the start of the next data point: all lines between two lines starting with a '#' form part of the Python solution. 33 | 34 | To further parse out the Python code we make use of Python's source code [tokenizer](https://docs.python.org/3/library/tokenize.html), which effectively deals with code syntax and indentation (spaces and tabs). 35 | 36 | ## Data Augmentation - Random Variable Replacement 37 | 38 | Since we have a mere 5,000 data points, we use data augmentation to increase the size of our dataset. While tokenizing the Python code, we randomly mask the names of certain variables (with 'var_1', 'var_2', etc.) to ensure that the model we train does not merely fixate on the way variables are named but instead learns the inherent logic and syntax of the Python code. 
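This masking can be sketched with Python's standard `tokenize` module. Note that `mask_variables` below is a hypothetical helper written for illustration, not the notebook's exact code; it keeps Python keywords and function names intact and renames each remaining name with probability `p`:

```python
import io
import keyword
import random
import tokenize

def mask_variables(code, p=0.5, seed=None):
    """Randomly replace variable names with var_1, var_2, ...

    Hypothetical sketch of the augmentation idea; the notebook's
    actual implementation may differ in its details.
    """
    rng = random.Random(seed)
    mapping = {}   # original name -> replacement, or None to keep it
    result = []
    prev = ""
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        string = tok.string
        if (tok.type == tokenize.NAME
                and not keyword.iskeyword(string)
                and prev != "def"):          # leave function names intact
            if string not in mapping:
                if rng.random() < p:
                    n_masked = sum(v is not None for v in mapping.values())
                    mapping[string] = f"var_{n_masked + 1}"
                else:
                    mapping[string] = None   # keep this name unmasked
            string = mapping[string] or string
        result.append((tok.type, string))
        prev = tok.string
    return tokenize.untokenize(result)
```

Running this repeatedly with different seeds yields several distinct variants of each original program, and `tokenize.untokenize` guarantees that each variant is still syntactically valid code.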
39 | 40 | For example, consider the following program: 41 | 42 | ``` 43 | def add_two_numbers (num1 ,num2 ): 44 | sum =num1 +num2 45 | return sum 46 | ``` 47 | 48 | We can replace some of the above variables to create new data points. The following are valid augmentations: 49 | 50 | 1. 51 | ``` 52 | def add_two_numbers (var_1 ,num2 ): 53 | sum =var_1 +num2 54 | return sum 55 | ``` 56 | 2. 57 | ``` 58 | def add_two_numbers (num1 ,var_1 ): 59 | sum =num1 +var_1 60 | return sum 61 | ``` 62 | 3. 63 | ``` 64 | def add_two_numbers (var_1 ,var_2 ): 65 | sum = var_1 + var_2 66 | return sum 67 | ``` 68 | 69 | In the above example, we have therefore expanded a single data point into three additional data points using our random variable replacement technique. 70 | 71 | ## Model Architecture 72 | ![Transformer](/res/transformer_multihead.png) 73 | 74 | We will be using the transformer model as explained in this [blog](https://ai.plainenglish.io/lets-pay-attention-to-transformers-a1c2dc566dbd) to perform sequence-to-sequence learning on our dataset. Here we treat the English description/question as our source (SRC) and the corresponding Python code as the target (TRG) for training. 75 | 76 | ### Tokenizing SRC and TRG sequences 77 | 78 | We use spaCy's default tokenizer to tokenize our SRC sequence. 79 | ``` 80 | SRC = [' ', 'write', 'a', 'python', 'function', 'to', 'add', 'two', 'user', 'provided', 'numbers', 'and', 'return', 'the', 'sum'] 81 | ``` 82 | 83 | We use Python's source code [tokenizer](https://docs.python.org/3/library/tokenize.html) to tokenize our TRG. Python's tokenizer returns several attributes for each token; we extract only the token type and the corresponding string attribute in the form of a tuple, i.e., (token_type_int, token_string), as the final token. Our TRG is a sequence of such tuples. 
84 | ``` 85 | TRG = [(57, 'utf-8'), (1, 'def'), (1, 'add_two_numbers'), (53, '('), (1, 'num1'), (53, ','), (1, 'var_1'), (53, ')'), (53, ':'), (4, '\n'), (5, ' '), (1, 'sum'), (53, '='), (1, 'num1'), (53, '+'), (1, 'var_1'), (4, '\n'), (1, 'return'), (1, 'sum'), (4, ''), (6, ''), (0, '')] 86 | ``` 87 | 88 | ## Loss function - Cross Entropy with label smoothing 89 | 90 | We have used augmentations in our dataset to mask variable names. This means that our model can predict a variety of values for a particular variable, and all of them are correct as long as the predictions are consistent throughout the code. Our training labels are therefore not fully certain, so it makes more sense to treat them as correct with probability 1 - smooth_eps and incorrect otherwise. This is what label smoothing does. By adding [label smoothing](https://arxiv.org/abs/1906.02629) to Cross-Entropy, we ensure that the model does not become too confident in predicting the variables that can be replaced via augmentations. 91 | 92 | We use the validation loss and training loss to determine when our model is trained. The model with the minimum validation loss is used as the final trained model. 93 | 94 | Note that label smoothing leads to much higher loss values than training without it. This is expected, as we do not intend to be certain about our label predictions. This is particularly the case with variables, as there can be multiple correct options so long as the predictions are consistent throughout the target code sequence. 
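As a concrete illustration, here is a toy single-prediction version of cross-entropy with label smoothing in plain Python (`smoothed_cross_entropy` is a hypothetical helper for exposition; the notebook presumably uses a vectorized implementation over whole batches):

```python
import math

def smoothed_cross_entropy(logits, target, smooth_eps=0.1):
    """Cross-entropy against a smoothed target distribution.

    The target token gets probability 1 - smooth_eps; the remaining
    mass is spread uniformly over the other vocabulary entries.
    Toy single-prediction version for illustration only.
    """
    # log-softmax over the logits (numerically stabilised)
    m = max(logits)
    z = sum(math.exp(l - m) for l in logits)
    log_probs = [l - m - math.log(z) for l in logits]

    n = len(logits)
    loss = 0.0
    for i, lp in enumerate(log_probs):
        # smoothed target distribution q
        q = 1 - smooth_eps if i == target else smooth_eps / (n - 1)
        loss -= q * lp
    return loss
```

With smooth_eps = 0 this reduces to ordinary cross-entropy, and any smooth_eps > 0 raises the loss whenever the model assigns the labelled token more probability than the average alternative, which is exactly the higher-loss behaviour described above.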
95 | 96 | # Sample Multi-Head Attention Map 97 | ![Attention Map for a function that adds two numbers](/res/attention_python_code_generator.png) 98 | 99 | # Example Outputs 100 | 101 | ### Example 1 102 | **Input:** 103 | 104 | "program to sort a list of dictionaries by key" 105 | 106 | **Output:** 107 | ``` 108 | var_1 ={'Nikhil':{'roll':24 ,'marks':17 }, 109 | 'Akshat':{'roll':54 ,'marks':12 }, 110 | 'Akash':{'roll':15 },'marks':15 }} 111 | sort_key ='marks' 112 | res ='marks' 113 | res =var_2 (test_dict .items (),key =lambda x :x [1 ][sort_key ]) 114 | print ("The sorted dictionary by marks is : "+str (res )) 115 | ``` 116 | ### Example 2 117 | **Input:** 118 | 119 | "function to merge two lists" 120 | 121 | **Output:** 122 | ``` 123 | def merge_lists (l1 :list ,var_1 :list ): 124 | return list (zip (l1 ,var_1 )) 125 | ``` 126 | ### Example 3 127 | **Input:** 128 | 129 | "program to find gcd" 130 | 131 | **Output:** 132 | ``` 133 | def compute_hcf (x ,y ): 134 | 135 | while (y ): 136 | x ,y =y ,x %y 137 | return x 138 | ``` 139 | ### Example 4 140 | **Input:** 141 | 142 | "program to calculate simple interest" 143 | 144 | **Output:** 145 | ``` 146 | princ_amount =float (input (" Please Enter the Principal Amount : ")) 147 | var_1 =float (input (" Please Enter the Rate Of Interest : ")) 148 | var_2 =float (input (" Please Enter Time period in Years : ")) 149 | 150 | simple_interest =(princ_amount *var_1 *var_2 )/100 151 | 152 | print ("\nSimple Interest for Principal Amount {0} = {1}".format (princ_amount ,simple_interest )) 153 | ``` 154 | ### Example 5 155 | **Input:** 156 | 157 | "function to sum odd elements of list" 158 | 159 | **Output:** 160 | ``` 161 | def sum_odd_elements (l :list ): 162 | return sum ([i for i in l if i %2 ==1 ]) 163 | ``` 164 | ### Example 6 165 | **Input:** 166 | 167 | "program to multiply integers in a list" 168 | 169 | **Output:** 170 | ``` 171 | def multiplyList (myList ): 172 | result =1 173 | for x in myList : 174 | result =result 
*x 175 | return result 176 | ``` 177 | ### Example 7 178 | **Input:** 179 | 180 | "program to reverse a string" 181 | 182 | **Output:** 183 | ``` 184 | var_1 ='Today is bad day' 185 | var_1 [::-1 ] 186 | ``` 187 | ### Example 8 188 | **Input:** 189 | 190 | "function to find reverse of a string" 191 | 192 | **Output:** 193 | ``` 194 | def revStr (inp ): 195 | inp =inp [::-1 ] 196 | return inp 197 | ``` 198 | 199 | ### Example 9 200 | **Input:** 201 | 202 | "program to find the length of tuple" 203 | 204 | **Output:** 205 | ``` 206 | thistuple =("apple","banana","cherry") 207 | print (len (thistuple )) 208 | ``` 209 | 210 | ### Example 10 211 | **Input:** 212 | 213 | "program to find the area of a square" 214 | 215 | **Output:** 216 | ``` 217 | side =5 218 | print (f'Area: {side * side}') 219 | ``` 220 | 221 | ### Example 11 222 | **Input:** 223 | 224 | "program to print epoch timestamp" 225 | 226 | **Output:** 227 | ``` 228 | import time 229 | import datetime 230 | str1 ="20/01/2020" 231 | element =datetime .strptime (str1 ,"%d/%m/%Y") 232 | timestamp =datetime .timestamp (element ) 233 | print (timestamp ) 234 | ``` 235 | 236 | ### Example 12 237 | **Input:** 238 | 239 | "program to find ascii value of charecter" 240 | 241 | **Output:** 242 | ``` 243 | var_1 ='5' 244 | print (ord (var_1 )) 245 | ``` 246 | 247 | ### Example 13 248 | **Input:** 249 | 250 | "function to find fibonacci sequence" 251 | 252 | **Output:** 253 | ``` 254 | def recur_fibo (n ): 255 | if n <=1 : 256 | return n 257 | else : 258 | return (recur_fibo (n -1 )+recur_fibo (n -2 )) 259 | ``` 260 | 261 | ### Example 14 262 | **Input:** 263 | 264 | "function to find largest element in a dictionary" 265 | 266 | **Output:** 267 | ``` 268 | def dic_largest (var_1 ): 269 | return max (var_1 .values ()) 270 | ``` 271 | 272 | ### Example 15 273 | **Input:** 274 | 275 | "program to convert list of tuples into a dictionary" 276 | 277 | **Output:** 278 | ``` 279 | var_1 =[("x",1 ),("x",2 ),("y",1 ),("y",1 
),("y",2 ),("z",1 )] 280 | d ={} 281 | for a ,var_2 in var_1 : 282 | d .setdefault (a ,[]).append (var_2 ) 283 | print (d ) 284 | ``` 285 | 286 | ### Example 16 287 | **Input:** 288 | 289 | "function to convert to binary" 290 | 291 | **Output:** 292 | ``` 293 | def convertToBinary (n ): 294 | if n >1 : 295 | convertToBinary (n //2 ) 296 | print (n %2 ,end ='') 297 | ``` 298 | ### Example 17 299 | **Input:** 300 | 301 | "program to implement a linked list" 302 | 303 | **Output:** 304 | ``` 305 | class Node : 306 | 307 | def __init__ (self ,data ): 308 | self .data =data 309 | self .next =None 310 | 311 | class LinkedList : 312 | 313 | def __init__ (self ): 314 | self .head =None 315 | 316 | def reverse (self ): 317 | prev =None 318 | current =self .next 319 | while (current is not None ): 320 | next =current .next 321 | current =current =current 322 | current =current 323 | current =current 324 | self .next =current 325 | 326 | def push (self ,new_data ): 327 | next =Node (new_data ) 328 | self .next =self .next =self .next 329 | 330 | def printList (self ): 331 | temp =self .head =temp 332 | while (temp ): 333 | print (temp ) 334 | temp =temp =temp .next 335 | 336 | 337 | llist =LinkedList () 338 | llist .push (20 ) 339 | llist .push (4 ) 340 | llist .push (15 ) 341 | llist .push (85 ) 342 | 343 | print ("Given Linked List") 344 | llist .printList () 345 | llist .printList () 346 | llist .printList () 347 | llist .printList ("\nReversed Linked List") 348 | llist .printList () 349 | ``` 350 | ### Example 18 351 | **Input:** 352 | 353 | "program to add lists elementwise" 354 | 355 | **Output:** 356 | ``` 357 | var_1 =[1 ,2 ,3 ,4 ,5 ] 358 | list2 =[5 ,6 ,7 ] 359 | final =[a +b for a in var_1 for b in var_1 ] 360 | print (f"sum of every pair of numbers from two lists:{final}") 361 | ``` 362 | ### Example 19 363 | **Input:** 364 | 365 | "program to find common values between two sets" 366 | 367 | **Output:** 368 | ``` 369 | var_1 ={10 ,20 ,30 ,40 ,50 } 370 | set2 ={60 
,70 ,80 ,90 ,10 } 371 | if var_1 ==1 : 372 | print ("Two sets have no items in common") 373 | else : 374 | print ("Two sets have items in common") 375 | print (var_1 .intersection (y )) 376 | ``` 377 | ### Example 20 378 | **Input:** 379 | 380 | "program to find number of unique values in a list" 381 | 382 | **Output:** 383 | ``` 384 | input_list =[1 ,2 ,2 ,5 ,4 ,4 ,3 ,8 ] 385 | 386 | var_1 =[] 387 | 388 | for var_2 in input_list : 389 | if var_2 not in var_1 : 390 | var_1 +=1 391 | 392 | print ("No of unique items are:",var_1 ) 393 | ``` 394 | ### Example 21 395 | **Input:** 396 | 397 | "function to remove empty lists from a list of lists" 398 | 399 | **Output:** 400 | ``` 401 | def removeEmptyList (li ): 402 | res =[ele for ele in li if ele !=[]] 403 | return res 404 | ``` 405 | ### Example 22 406 | **Input:** 407 | 408 | "write a function to capitalize a string" 409 | 410 | **Output:** 411 | ``` 412 | def capitalize (text ): 413 | return text .title () 414 | ``` 415 | ### Example 23 416 | **Input:** 417 | 418 | "write a function to find the area of a circle" 419 | 420 | **Output:** 421 | ``` 422 | def circle_area (r ): 423 | return 22 /7 *r **2 424 | ``` 425 | ### Example 24 426 | **Input:** 427 | 428 | "write a python program to merge two dictionaries" 429 | 430 | **Output:** 431 | ``` 432 | x ={'key1':'val1','key2':'val2'} 433 | y ={'key3':'val3','key4':'val4'} 434 | z ={**x ,**y }# z = x | y 435 | ``` 436 | ### Example 25 437 | **Input:** 438 | 439 | "write a function to find factorial" 440 | 441 | **Output:** 442 | ``` 443 | def factorial (n ): 444 | if n ==0 : 445 | return 1 446 | else : 447 | return n *factorial (n -1 ) 448 | ``` 449 | -------------------------------------------------------------------------------- /res/attention_python_code_generator.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/divyam96/English-to-Python-Converter/cd0e0bc2e71047188fe521a7c6463f1a7a277a80/res/attention_python_code_generator.png -------------------------------------------------------------------------------- /res/resources.md: -------------------------------------------------------------------------------- 1 | This folder contains all the media resources that are required to support the documentation of this repository. 2 | -------------------------------------------------------------------------------- /res/transformer_multihead.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/divyam96/English-to-Python-Converter/cd0e0bc2e71047188fe521a7c6463f1a7a277a80/res/transformer_multihead.png --------------------------------------------------------------------------------