├── English_to_Python.ipynb ├── LICENSE ├── README.md └── res ├── attention_python_code_generator.png ├── resources.md └── transformer_multihead.png /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Divyam Shah 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # English-to-Python-Converter 2 | This is an attempt to use transformers and self-attention in order to convert English descriptions into Python code. 
3 | 4 | [Notebook](https://github.com/divyam96/English-to-Python-Converter/blob/main/English_to_Python.ipynb) 5 | 6 | [Pretrained Model](https://drive.google.com/file/d/1-YLd_DTt3W8R_vqga70zdJ8pdK3x2nWm/view?usp=sharing) 7 | 8 | [Dataset](https://drive.google.com/file/d/1rHb0FQ5z5ZpaY2HpyFGY6CeyDG0kTLoO/view?usp=sharing) 9 | 10 | ## Data Cleaning 11 | 12 | We will be using this pre-curated [Dataset](https://drive.google.com/file/d/1rHb0FQ5z5ZpaY2HpyFGY6CeyDG0kTLoO/view?usp=sharing) for training our transformer model. The format of the data is as follows: 13 | 14 | ``` 15 | # English Description 1 16 | Python Code 1 17 | 18 | 19 | # English Description 2 20 | Python Code 2 21 | 22 | 23 | # English Description 3 24 | Python Code 3 25 | 26 | 27 | . 28 | . 29 | . 30 | ``` 31 | 32 | Each English description/question starts with a '#' and is followed by its corresponding Python code; each data point therefore comprises a question and its Python solution. We can look at the first character of each line to detect the start of the next data point: all lines between two lines starting with a '#' form part of the Python solution. 33 | 34 | To further parse out the Python code we make use of Python's source code [tokenizer](https://docs.python.org/3/library/tokenize.html), which effectively deals with code syntax and indentation (spaces and tabs). 35 | 36 | ## Data Augmentation - Random Variable Replacement 37 | 38 | Since we have a mere 5,000 data points, we use data augmentation to increase the size of our dataset. While tokenizing the Python code, we randomly mask the names of certain variables (with 'var_1', 'var_2', etc.) to ensure that the model we train does not merely fixate on the way variables are named but instead learns the inherent logic and syntax of the Python code. 
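This masking can be sketched with Python's standard `tokenize` module. Note that `mask_variables` below is a hypothetical helper written for illustration, not the notebook's exact code; it keeps Python keywords and function names intact and renames each remaining name with probability `p`:

```python
import io
import keyword
import random
import tokenize

def mask_variables(code, p=0.5, seed=None):
    """Randomly replace variable names with var_1, var_2, ...

    Hypothetical sketch of the augmentation idea; the notebook's
    actual implementation may differ in its details.
    """
    rng = random.Random(seed)
    mapping = {}   # original name -> replacement, or None to keep it
    result = []
    prev = ""
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        string = tok.string
        if (tok.type == tokenize.NAME
                and not keyword.iskeyword(string)
                and prev != "def"):          # leave function names intact
            if string not in mapping:
                if rng.random() < p:
                    n_masked = sum(v is not None for v in mapping.values())
                    mapping[string] = f"var_{n_masked + 1}"
                else:
                    mapping[string] = None   # keep this name unmasked
            string = mapping[string] or string
        result.append((tok.type, string))
        prev = tok.string
    return tokenize.untokenize(result)
```

Running this repeatedly with different seeds yields several distinct variants of each original program, and `tokenize.untokenize` guarantees that each variant is still syntactically valid code.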
39 | 40 | For example, consider the following program: 41 | 42 | ``` 43 | def add_two_numbers (num1 ,num2 ): 44 | sum =num1 +num2 45 | return sum 46 | ``` 47 | 48 | We can replace some of the above variables to create new data points. The following are valid augmentations: 49 | 50 | 1. 51 | ``` 52 | def add_two_numbers (var_1 ,num2 ): 53 | sum =var_1 +num2 54 | return sum 55 | ``` 56 | 2. 57 | ``` 58 | def add_two_numbers (num1 ,var_1 ): 59 | sum =num1 +var_1 60 | return sum 61 | ``` 62 | 3. 63 | ``` 64 | def add_two_numbers (var_1 ,var_2 ): 65 | sum = var_1 + var_2 66 | return sum 67 | ``` 68 | 69 | In the above example, we have therefore expanded a single data point into three additional data points using our random variable replacement technique. 70 | 71 | ## Model Architecture 72 | ![Transformer](/res/transformer_multihead.png) 73 | 74 | We will be using the transformer model as explained in this [blog](https://ai.plainenglish.io/lets-pay-attention-to-transformers-a1c2dc566dbd) to perform sequence-to-sequence learning on our dataset. Here we treat the English description/question as our source (SRC) and the corresponding Python code as the target (TRG) for training. 75 | 76 | ### Tokenizing SRC and TRG sequences 77 | 78 | We use spaCy's default tokenizer to tokenize our SRC sequence. 79 | ``` 80 | SRC = [' ', 'write', 'a', 'python', 'function', 'to', 'add', 'two', 'user', 'provided', 'numbers', 'and', 'return', 'the', 'sum'] 81 | ``` 82 | 83 | We use Python's source code [tokenizer](https://docs.python.org/3/library/tokenize.html) to tokenize our TRG. Python's tokenizer returns several attributes for each token; we extract only the token type and the corresponding string attribute in the form of a tuple, i.e., (token_type_int, token_string), as the final token. Our TRG is a sequence of such tuples. 
84 | ``` 85 | TRG = [(57, 'utf-8'), (1, 'def'), (1, 'add_two_numbers'), (53, '('), (1, 'num1'), (53, ','), (1, 'var_1'), (53, ')'), (53, ':'), (4, '\n'), (5, ' '), (1, 'sum'), (53, '='), (1, 'num1'), (53, '+'), (1, 'var_1'), (4, '\n'), (1, 'return'), (1, 'sum'), (4, ''), (6, ''), (0, '')] 86 | ``` 87 | 88 | ## Loss function - Cross Entropy with label smoothing 89 | 90 | We have used augmentations in our dataset to mask variable names. This means that our model can predict a variety of values for a particular variable, and all of them are correct as long as the predictions are consistent throughout the code. Our training labels are therefore not fully certain, so it makes more sense to treat them as correct with probability 1 - smooth_eps and incorrect otherwise. This is what label smoothing does. By adding [label smoothing](https://arxiv.org/abs/1906.02629) to Cross-Entropy, we ensure that the model does not become too confident in predicting the variables that can be replaced via augmentations. 91 | 92 | We use the validation loss and training loss to determine when our model is trained. The model with the minimum validation loss is used as the final trained model. 93 | 94 | Note that label smoothing leads to much higher loss values than training without it. This is expected, as we do not intend to be certain about our label predictions. This is particularly the case with variables, as there can be multiple correct options so long as the predictions are consistent throughout the target code sequence. 
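As a concrete illustration, here is a toy single-prediction version of cross-entropy with label smoothing in plain Python (`smoothed_cross_entropy` is a hypothetical helper for exposition; the notebook presumably uses a vectorized implementation over whole batches):

```python
import math

def smoothed_cross_entropy(logits, target, smooth_eps=0.1):
    """Cross-entropy against a smoothed target distribution.

    The target token gets probability 1 - smooth_eps; the remaining
    mass is spread uniformly over the other vocabulary entries.
    Toy single-prediction version for illustration only.
    """
    # log-softmax over the logits (numerically stabilised)
    m = max(logits)
    z = sum(math.exp(l - m) for l in logits)
    log_probs = [l - m - math.log(z) for l in logits]

    n = len(logits)
    loss = 0.0
    for i, lp in enumerate(log_probs):
        # smoothed target distribution q
        q = 1 - smooth_eps if i == target else smooth_eps / (n - 1)
        loss -= q * lp
    return loss
```

With smooth_eps = 0 this reduces to ordinary cross-entropy, and any smooth_eps > 0 raises the loss whenever the model assigns the labelled token more probability than the average alternative, which is exactly the higher-loss behaviour described above.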
95 | 96 | # Sample Multi-Head Attention Map 97 | ![Attention Map for a function that adds two numbers](/res/attention_python_code_generator.png) 98 | 99 | # Example Outputs 100 | 101 | ### Example 1 102 | **Input:** 103 | 104 | "program to sort a list of dictionaries by key" 105 | 106 | **Output:** 107 | ``` 108 | var_1 ={'Nikhil':{'roll':24 ,'marks':17 }, 109 | 'Akshat':{'roll':54 ,'marks':12 }, 110 | 'Akash':{'roll':15 },'marks':15 }} 111 | sort_key ='marks' 112 | res ='marks' 113 | res =var_2 (test_dict .items (),key =lambda x :x [1 ][sort_key ]) 114 | print ("The sorted dictionary by marks is : "+str (res )) 115 | ``` 116 | ### Example 2 117 | **Input:** 118 | 119 | "function to merge two lists" 120 | 121 | **Output:** 122 | ``` 123 | def merge_lists (l1 :list ,var_1 :list ): 124 | return list (zip (l1 ,var_1 )) 125 | ``` 126 | ### Example 3 127 | **Input:** 128 | 129 | "program to find gcd" 130 | 131 | **Output:** 132 | ``` 133 | def compute_hcf (x ,y ): 134 | 135 | while (y ): 136 | x ,y =y ,x %y 137 | return x 138 | ``` 139 | ### Example 4 140 | **Input:** 141 | 142 | "program to calculate simple interest" 143 | 144 | **Output:** 145 | ``` 146 | princ_amount =float (input (" Please Enter the Principal Amount : ")) 147 | var_1 =float (input (" Please Enter the Rate Of Interest : ")) 148 | var_2 =float (input (" Please Enter Time period in Years : ")) 149 | 150 | simple_interest =(princ_amount *var_1 *var_2 )/100 151 | 152 | print ("\nSimple Interest for Principal Amount {0} = {1}".format (princ_amount ,simple_interest )) 153 | ``` 154 | ### Example 5 155 | **Input:** 156 | 157 | "function to sum odd elements of list" 158 | 159 | **Output:** 160 | ``` 161 | def sum_odd_elements (l :list ): 162 | return sum ([i for i in l if i %2 ==1 ]) 163 | ``` 164 | ### Example 6 165 | **Input:** 166 | 167 | "program to multiply integers in a list" 168 | 169 | **Output:** 170 | ``` 171 | def multiplyList (myList ): 172 | result =1 173 | for x in myList : 174 | result =result 
*x 175 | return result 176 | ``` 177 | ### Example 7 178 | **Input:** 179 | 180 | "program to reverse a string" 181 | 182 | **Output:** 183 | ``` 184 | var_1 ='Today is bad day' 185 | var_1 [::-1 ] 186 | ``` 187 | ### Example 8 188 | **Input:** 189 | 190 | "function to find reverse of a string" 191 | 192 | **Output:** 193 | ``` 194 | def revStr (inp ): 195 | inp =inp [::-1 ] 196 | return inp 197 | ``` 198 | 199 | ### Example 9 200 | **Input:** 201 | 202 | "program to find the length of tuple" 203 | 204 | **Output:** 205 | ``` 206 | thistuple =("apple","banana","cherry") 207 | print (len (thistuple )) 208 | ``` 209 | 210 | ### Example 10 211 | **Input:** 212 | 213 | "program to find the area of a square" 214 | 215 | **Output:** 216 | ``` 217 | side =5 218 | print (f'Area: {side * side}') 219 | ``` 220 | 221 | ### Example 11 222 | **Input:** 223 | 224 | "program to print epoch timestamp" 225 | 226 | **Output:** 227 | ``` 228 | import time 229 | import datetime 230 | str1 ="20/01/2020" 231 | element =datetime .strptime (str1 ,"%d/%m/%Y") 232 | timestamp =datetime .timestamp (element ) 233 | print (timestamp ) 234 | ``` 235 | 236 | ### Example 12 237 | **Input:** 238 | 239 | "program to find ascii value of charecter" 240 | 241 | **Output:** 242 | ``` 243 | var_1 ='5' 244 | print (ord (var_1 )) 245 | ``` 246 | 247 | ### Example 13 248 | **Input:** 249 | 250 | "function to find fibonacci sequence" 251 | 252 | **Output:** 253 | ``` 254 | def recur_fibo (n ): 255 | if n <=1 : 256 | return n 257 | else : 258 | return (recur_fibo (n -1 )+recur_fibo (n -2 )) 259 | ``` 260 | 261 | ### Example 14 262 | **Input:** 263 | 264 | "function to find largest element in a dictionary" 265 | 266 | **Output:** 267 | ``` 268 | def dic_largest (var_1 ): 269 | return max (var_1 .values ()) 270 | ``` 271 | 272 | ### Example 15 273 | **Input:** 274 | 275 | "program to convert list of tuples into a dictionary" 276 | 277 | **Output:** 278 | ``` 279 | var_1 =[("x",1 ),("x",2 ),("y",1 ),("y",1 
),("y",2 ),("z",1 )] 280 | d ={} 281 | for a ,var_2 in var_1 : 282 | d .setdefault (a ,[]).append (var_2 ) 283 | print (d ) 284 | ``` 285 | 286 | ### Example 16 287 | **Input:** 288 | 289 | "function to convert to binary" 290 | 291 | **Output:** 292 | ``` 293 | def convertToBinary (n ): 294 | if n >1 : 295 | convertToBinary (n //2 ) 296 | print (n %2 ,end ='') 297 | ``` 298 | ### Example 17 299 | **Input:** 300 | 301 | "program to implement a linked list" 302 | 303 | **Output:** 304 | ``` 305 | class Node : 306 | 307 | def __init__ (self ,data ): 308 | self .data =data 309 | self .next =None 310 | 311 | class LinkedList : 312 | 313 | def __init__ (self ): 314 | self .head =None 315 | 316 | def reverse (self ): 317 | prev =None 318 | current =self .next 319 | while (current is not None ): 320 | next =current .next 321 | current =current =current 322 | current =current 323 | current =current 324 | self .next =current 325 | 326 | def push (self ,new_data ): 327 | next =Node (new_data ) 328 | self .next =self .next =self .next 329 | 330 | def printList (self ): 331 | temp =self .head =temp 332 | while (temp ): 333 | print (temp ) 334 | temp =temp =temp .next 335 | 336 | 337 | llist =LinkedList () 338 | llist .push (20 ) 339 | llist .push (4 ) 340 | llist .push (15 ) 341 | llist .push (85 ) 342 | 343 | print ("Given Linked List") 344 | llist .printList () 345 | llist .printList () 346 | llist .printList () 347 | llist .printList ("\nReversed Linked List") 348 | llist .printList () 349 | ``` 350 | ### Example 18 351 | **Input:** 352 | 353 | "program to add lists elementwise" 354 | 355 | **Output:** 356 | ``` 357 | var_1 =[1 ,2 ,3 ,4 ,5 ] 358 | list2 =[5 ,6 ,7 ] 359 | final =[a +b for a in var_1 for b in var_1 ] 360 | print (f"sum of every pair of numbers from two lists:{final}") 361 | ``` 362 | ### Example 19 363 | **Input:** 364 | 365 | "program to find common values between two sets" 366 | 367 | **Output:** 368 | ``` 369 | var_1 ={10 ,20 ,30 ,40 ,50 } 370 | set2 ={60 
,70 ,80 ,90 ,10 } 371 | if var_1 ==1 : 372 | print ("Two sets have no items in common") 373 | else : 374 | print ("Two sets have items in common") 375 | print (var_1 .intersection (y )) 376 | ``` 377 | ### Example 20 378 | **Input:** 379 | 380 | "program to find number of unique values in a list" 381 | 382 | **Output:** 383 | ``` 384 | input_list =[1 ,2 ,2 ,5 ,4 ,4 ,3 ,8 ] 385 | 386 | var_1 =[] 387 | 388 | for var_2 in input_list : 389 | if var_2 not in var_1 : 390 | var_1 +=1 391 | 392 | print ("No of unique items are:",var_1 ) 393 | ``` 394 | ### Example 21 395 | **Input:** 396 | 397 | "function to remove empty lists from a list of lists" 398 | 399 | **Output:** 400 | ``` 401 | def removeEmptyList (li ): 402 | res =[ele for ele in li if ele !=[]] 403 | return res 404 | ``` 405 | ### Example 22 406 | **Input:** 407 | 408 | "write a function to capitalize a string" 409 | 410 | **Output:** 411 | ``` 412 | def capitalize (text ): 413 | return text .title () 414 | ``` 415 | ### Example 23 416 | **Input:** 417 | 418 | "write a function to find the area of a circle" 419 | 420 | **Output:** 421 | ``` 422 | def circle_area (r ): 423 | return 22 /7 *r **2 424 | ``` 425 | ### Example 24 426 | **Input:** 427 | 428 | "write a python program to merge two dictionaries" 429 | 430 | **Output:** 431 | ``` 432 | x ={'key1':'val1','key2':'val2'} 433 | y ={'key3':'val3','key4':'val4'} 434 | z ={**x ,**y }# z = x | y 435 | ``` 436 | ### Example 25 437 | **Input:** 438 | 439 | "write a function to find factorial" 440 | 441 | **Output:** 442 | ``` 443 | def factorial (n ): 444 | if n ==0 : 445 | return 1 446 | else : 447 | return n *factorial (n -1 ) 448 | ``` 449 | -------------------------------------------------------------------------------- /res/attention_python_code_generator.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/divyam96/English-to-Python-Converter/cd0e0bc2e71047188fe521a7c6463f1a7a277a80/res/attention_python_code_generator.png -------------------------------------------------------------------------------- /res/resources.md: -------------------------------------------------------------------------------- 1 | This folder contains all the media resources that are required to support the documentation of this repository. 2 | -------------------------------------------------------------------------------- /res/transformer_multihead.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/divyam96/English-to-Python-Converter/cd0e0bc2e71047188fe521a7c6463f1a7a277a80/res/transformer_multihead.png --------------------------------------------------------------------------------