├── .gitignore
├── README.md
├── dtLSTM_Implementation
└── workspace
    ├── data
    │   ├── clean_test.py
    │   ├── defective_test.py
    │   └── testdata_test.py
    ├── main.py
    ├── modules
    │   ├── models
    │   │   ├── child_sum_tree_lstm.py
    │   │   ├── logistic_regression.py
    │   │   ├── rnn_example.py
    │   │   └── tree_lstm_base.py
    │   └── nn_modules
    │       ├── rnn_defect.py
    │       └── tree_lstm_defect.py
    ├── pipelines
    │   └── defect_prediction.py
    └── test_logistic_regression.py
/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | /__pycache__
3 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # DeepLearningPipelines
2 | 
3 | This is an experimental repository accompanying a 'Deep Learning for Code Generation' project. It collects referenced and self-written code generation tasks, implemented in Python 3.
4 | 
5 | 
6 | ## Structure
7 | 
8 | Each task can be found as a whole in 'pipelines'
9 | and split into its core processes in 'modules'.
10 | 
11 | 
12 | ## Tasks include:
13 | 
14 | - **Defect Prediction with Deep-Tree LSTM:**
15 | An implementation attempt of 'A deep tree-based model for software defect prediction'
16 | -> https://arxiv.org/abs/1802.00921
17 | 
18 | - **Prediction of semantic relatedness of two sentences &**
19 | - **Sentiment Classification with Tree-Structured LSTM:**
20 | Python adaptation and module extraction of
21 | 'Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks'
22 | -> https://arxiv.org/abs/1503.00075
23 | An already public implementation can be found here: https://github.com/stanfordnlp/treelstm
24 | 
--------------------------------------------------------------------------------
/dtLSTM_Implementation:
--------------------------------------------------------------------------------
1 | Progress Documentation:
2 | 
3 | 
4 | 
5 | A deep tree-based model for software defect prediction - Self Implementation
6 | 
7 | - the paper's algorithm has most preprocessing inside the LSTM unit calculation
8 |   -> will be extracted and fitted to our modules
9 | - the data is not embedded beforehand; instead an embedding matrix is used within the NN processes in order to look up the vectors
10 |   -> the embedding only covers the actual node names and doesn't represent any AST structure
11 |   -> that's how I will process it for now, but it might ignore important data?
12 | - the tree implementation will help here, but the LSTM unit processes will have to be adjusted (DefectTreeLSTM)
13 | 
14 | - internal PROBLEMS:
15 |   Everything regarding the parent prediction makes sense:
16 |   We train on clean data to see which AST child-parent configurations are the most common.
17 |   That is done by iterating over ASTs starting from the branches, predicting the parent from the children
18 |   (and context) and adjusting the weights based on the difference between prediction and outcome (a small sketch of this pair extraction follows below this list).
19 |   The resulting, trained network is then used to recursively iterate over AST nodes, doing some
20 |   LSTM processing on each node to obtain a vector, which is then classified(?)
21 |   BUT:
22 |   - Now how is the defective data used?
23 |   - How do training and predicting differ?
24 |   - How do we instantiate and train the classification process?
25 | 
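  A minimal sketch of this child->parent pair extraction (illustrative only; it mirrors what
  parents_children() in modules/nn_modules/rnn_defect.py does, and the helper name here is made up):

      import ast

      def extract_pairs(tree):
          # one (children, parent) training pair per AST node that has children
          pairs = []
          for node in ast.walk(tree):
              children = [c.__class__.__name__ for c in ast.iter_child_nodes(node)]
              if children:
                  pairs.append((children, node.__class__.__name__))
          return pairs

      # e.g. extract_pairs(ast.parse(open("workspace/data/clean_test.py").read()))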
26 | - Paper Documentation PROBLEMS:
27 |   - The actual defect prediction learning is not explained
28 |   - The Defect Prediction Algorithm doesn't say much about what it does
29 |   - Which TreeLSTM is actually used and how is it modified?? (ChildSum)
30 |   - What is a vector representation of an AST, how can one obtain it, and what does it actually stand for?
31 | 
32 | 
33 | => In order to understand the overall functionality of the paper, the Lua TreeLSTM will be put aside for now; the NN process will be simulated by an RNN and extended later on
34 | 
35 | 
36 | 
37 | Future Steps for the implementation only:
38 | (X)- Understanding/adapting the actual prediction model
39 | (*)- Include dummy RNN (note: doesn't really run yet)
40 | (X)- Adjust RNN training and predicting
41 | ( )- Research and add classification training
42 | ( )- Manage embedding training??
43 | ( )- Include TreeLSTM
44 | ( )- Build and understand TreeLSTM (ChildSum?)
45 | ( )- Adjust training and predicting
46 | 
47 | Future Steps for after connecting it to COGE™:
48 | ( )- Expanding data crawling (PROMISE dataset)
49 | 
50 | 
51 | 
52 | Explicit Comments:
53 | - Defect prediction is done like the training, i.e. all child nodes of an AST are fed into the LSTM and each parent node is predicted.
54 | + Defect prediction thereby basically becomes the accuracy calculation on the test data
55 | +/- This process is now directly tied to the accuracy of the LSTM
56 | - Defective data will be used to find the threshold between reconstruction accuracy and defect probability (a rough sketch of this follows below)
57 | - The threshold might be inaccurate
58 | 
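  A rough sketch of that accuracy-threshold classification (illustrative only; LogisticRegression.train()
  in modules/models/logistic_regression.py fits the threshold in a similar but weighted way, and these
  function names are made up):

      def fit_threshold(clean_accs, defective_accs):
          # clean files should reconstruct better, so the cut-off sits between the two means
          m_clean = sum(clean_accs) / len(clean_accs)
          m_def = sum(defective_accs) / len(defective_accs)
          return (m_clean + m_def) / 2

      def is_defective(reconstruction_acc, threshold):
          return reconstruction_acc < threshold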
59 | 
60 | 
61 | 
62 | 
63 | ----------------------------------------------------------------------------
64 | Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks - Reference Implementations
65 | 
66 | - The library consists of two tasks:
67 |   semantic relatedness | sentiment classification
68 | 
69 |   and can make use of 4 NN models, of which only the TreeLSTM is interesting:
70 |   (LSTM (normal, bidirectional) and TreeLSTM (ChildSum and N-ary))
71 | 
72 | - They have two different tree algorithms, and both probably have to be adapted for defect prediction:
73 |   - ChildSum Tree LSTM:
74 |   - N-ary Tree LSTM:
75 |     'can be used on tree structures where the branching factor
76 |     is at most N and where children are ordered'
77 |   -> I don't really understand which one has the better/more suitable functionality, so I'll combine them?
78 | 
79 | - The tasks don't fit the defect prediction model particularly well; nonetheless, the label classification process from the sentiment classification might suffice for its training process
80 | 
--------------------------------------------------------------------------------
/workspace/data/clean_test.py:
--------------------------------------------------------------------------------
1 | """
2 | clean-labeled file taken from the motivation example of the paper
3 | """
4 | 
5 | 
6 | def test_clean(stack):
7 |     x = 0
8 |     while x < 10:
9 |         y = 0
10 |         if not stack.empty():
11 |             y = stack.pop()
12 |         x += 1
13 | 
--------------------------------------------------------------------------------
/workspace/data/defective_test.py:
--------------------------------------------------------------------------------
1 | """
2 | defective-labeled file taken from the motivation example of the paper
3 | """
4 | 
5 | 
6 | def test_defective(stack):  # a testcomment
7 |     x = 0
8 |     if not stack.empty():
9 |         while x < 10:
10 |             y = 0
11 |             y = stack.pop()
12 |             x += 1
13 | 
14 | def test_defective2(stack):  # a testcomment
15 |     x = 0
16 |     if not stack.empty():
17 |         while x < 10:
18 |             y = 0
19 |             y = stack.pop()
20 |             x += 1
21 | 
--------------------------------------------------------------------------------
/workspace/data/testdata_test.py:
--------------------------------------------------------------------------------
1 | def return_first(x, y):
2 |     x = 3
3 |     return x
--------------------------------------------------------------------------------
/workspace/main.py:
--------------------------------------------------------------------------------
1 | import pipelines.defect_prediction as dp
2 | 
3 | if __name__ == "__main__":
4 |     """
5 |     Main to test-run specific pipelines or modules
6 |     """
7 |     dp_pipeline = dp.DefectPrediction('/home/emil/Documents/DeepLearningProject/PaperImplementation/DeepLearningPipelines/workspace/data/defective_test.py',
8 |                                       '/home/emil/Documents/DeepLearningProject/PaperImplementation/DeepLearningPipelines/workspace/data/clean_test.py',
9 |                                       '/home/emil/Documents/DeepLearningProject/PaperImplementation/DeepLearningPipelines/workspace/data/testdata_test.py')
10 |     dp_pipeline.run()
11 |     print("run completed")
12 | 
--------------------------------------------------------------------------------
/workspace/modules/models/child_sum_tree_lstm.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | import re
4 | import ast  # for python AST representation
5 | import random
6 | import torch
7 | import modules.models.tree_lstm_base as tlstm
8 | import torch.legacy.nn as nn
9 | 
10 | 
11 | class TreeLSTM(tlstm.TreeLSTM):
12 |     """
13 |     E for Experimental: a Tree LSTM inheriting from the TreeLSTM base that is used for label prediction
14 |     """
15 | 
16 |     def __init__(self, config):
17 |         super().__init__(config.emb_dim, config.mem_dim)
18 |         self.criterion = config.criterion
19 | 
20 |     def forward(self, tree, inputs):
21 |         pass
22 | 
23 |     def backward(self, tree, inputs, grad):
24 |         pass
25 | 
--------------------------------------------------------------------------------
/workspace/modules/models/logistic_regression.py:
--------------------------------------------------------------------------------
1 | import torch.nn as nn
2 | from torch.utils.data.dataset import Dataset
3 | import numpy as np
4 | import argparse
5 | import os
6 | import time
7 | import torch
8 | import torch.nn as nn
9 | from sklearn.linear_model import LogisticRegression as sk_lr
10 | 
11 | 
12 | def 
sigmoid(Z): 13 | return 1/(1+np.e**(-Z)) 14 | 15 | 16 | def logistic_loss(y, y_hat): 17 | return -np.mean(y*np.log(y_hat)+(1-y)*np.log(1-y_hat)) 18 | 19 | 20 | def set_pairs(X): 21 | # input list of single x1 values and make 2d vectors with x2 =1 22 | X_2d = [] 23 | for x1 in X: 24 | X_2d.append([x1, 1]) 25 | return X_2d 26 | 27 | 28 | class LogisticRegression: 29 | """ 30 | One-to-One Network Model that trains logistic regression threshold 31 | to predict binary classification : Y=(0/1) for X=(x1) 32 | """ 33 | 34 | def __init__(self, config=None): 35 | if config != None: 36 | self.epochs = config["epochs"] 37 | self.learning_rate = config["learning_rate"] 38 | else: 39 | self.epochs = 50 40 | self.learning_rate = 0.01 41 | self.T = 0 #treshold 42 | 43 | def train(self, X, Y): 44 | #not really training, just finign average threshold 45 | d_0 = [] # cln should have higher accuracies 46 | d_1 = [] # def 47 | for i in range(len(X)): 48 | if Y[i] == 0: 49 | d_0.append(X[i]) 50 | elif Y[i] == 1: 51 | d_1.append(X[i]) 52 | else: 53 | print("something wrong with training data") 54 | m_0 = np.average(d_0) 55 | m_1 = np.average(d_1) 56 | diff = m_0-m_1 57 | self.T = m_1+ diff*(len(d_1)/len(X)) 58 | #print(self.T) 59 | 60 | 61 | def test(self, X): 62 | """ 63 | predicts Y for X on trained model 64 | :param X: 2b-predicting float input 65 | :returns: true or false (for defect prediction true if defective) 66 | """ 67 | if X>self.T: 68 | return 0#not buggy 69 | else: 70 | return 1 71 | -------------------------------------------------------------------------------- /workspace/modules/models/rnn_example.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | from torch.utils.data.dataset import Dataset 3 | import numpy as np 4 | import argparse 5 | import os 6 | import time 7 | import torch 8 | import torch.nn as nn 9 | from torch.utils.data import DataLoader 10 | 11 | 12 | class RNN_Example(): 13 | """ 14 | Implementation of Recurrent Neural Network. 15 | For Emils Pipeline Implementation 16 | """ 17 | 18 | def __init__(self, config): 19 | self.dict = config.dictionary 20 | self.batch_size = config.batch_size 21 | self.epochs = 10 22 | self.input_length = config.emb_dim*config.in_len # TODO 23 | self.output_length = config.emb_dim 24 | self.dict_size = len(self.dict) 25 | self.saved_path = "/home/emil/Documents/DeepLearningProject/PaperImplementation/DeepLearningPipelines/workspace/dump" 26 | self.saved_file = os.path.join(self.saved_path, "best_trained_model") 27 | # TODO: current model not taking sequences of token, only 1 token 28 | 29 | def run(self, train_in, train_out, test_in, test_out): 30 | 31 | self.train_network(train_in, train_out, test_in, test_out) 32 | return list(self.predict_testing_output.numpy()) 33 | 34 | def train_network(self, train_in, train_out, test_in, test_out): 35 | """ 36 | Train network. 37 | 38 | Train each epoch by training set, evaluate model after each epoch 39 | using validation set, save the best model and test using test set. 40 | 41 | This function also prints out loss, accuracy each epoch 42 | and loss/accuracy of the best model. 
43 | 44 | :param datasets: list of input/output sets, 45 | :returns: none 46 | """ 47 | # Training and Testing data will look like a 2 dim array 48 | # where each index holds corresponding in [0] to output [1] 49 | # print("TRAINING IN OUT:", train) 50 | training_input = train_in 51 | print("Training IN RNN\n", training_input) 52 | training_output = train_out 53 | print("Training OUT RNN\n", training_output) 54 | validating_input = train_in 55 | validating_output = train_out 56 | testing_input = test_in 57 | testing_output = test_out 58 | 59 | # Used to compare with accuracy of model 60 | best_accuracy = 0.0 61 | 62 | params = { 63 | "batch_size": self.batch_size, 64 | "shuffle": True, 65 | "drop_last": True 66 | } 67 | 68 | # Datasets object generate data which will put into neural network 69 | # Datasets contain some specific functions to adapt nn in Pytorch 70 | train_data = Datasets(training_input, training_output) 71 | valid_data = Datasets(validating_input, validating_output) 72 | test_data = Datasets(testing_input, testing_output) 73 | 74 | # DataLoader used to load data equal to batch_size 75 | train_loader = DataLoader(train_data, **params) 76 | valid_loader = DataLoader(valid_data, **params) 77 | test_loader = DataLoader(test_data, **params) 78 | 79 | model = RNN(training_data=training_input, dict_size=self.dict_size) 80 | 81 | # Check if computer have graphic card, 82 | # model will be trained py GPU instead of CPU 83 | if torch.cuda.is_available(): 84 | model.cuda() 85 | 86 | # Loss function 87 | self.criterion = nn.CrossEntropyLoss() 88 | 89 | # Optimization 90 | optimizer = torch.optim.Adam(model.parameters(), lr=1e-3) 91 | 92 | # Number of iteration ( = length of data / batch_size) 93 | self.num_iter = int(train_data.__len__()/self.batch_size) 94 | 95 | for epoch in range(self.epochs): 96 | # Declare to start training phase 97 | model.train() 98 | for iter, (content, label) in enumerate(train_loader): 99 | start_time = time.time() 100 | if torch.cuda.is_available(): 101 | content = content.cuda() 102 | label = label.cuda() 103 | # Clean buffer to avoid accumulate value of buffer 104 | optimizer.zero_grad() 105 | 106 | # Training model to understand the content of traning set 107 | predicted_value = model(content) 108 | 109 | # Calculating loss 110 | loss = self.criterion(predicted_value, label) 111 | 112 | # Back propagation 113 | loss.backward() 114 | 115 | # Optimizing model based on loss 116 | optimizer.step() 117 | elapse_time = time.time() - start_time 118 | 119 | # validate this model at the end of each epoch 120 | self.access_model( 121 | model=model, 122 | data_loader=test_loader, 123 | access_data=test_data, 124 | criterion=self.criterion, 125 | num_iter=self.num_iter, 126 | epoch=epoch, 127 | best_accuracy=best_accuracy) 128 | 129 | try: 130 | model.load_state_dict(torch.load(self.saved_file)) 131 | except: 132 | FileNotFoundError 133 | print("can't save best model because there is none") 134 | 135 | self.access_model(model=model, 136 | data_loader=test_loader, 137 | access_data=test_data, 138 | criterion=self.criterion, 139 | mode="test", 140 | num_iter=self.num_iter) 141 | 142 | def test_network(self, test_data): 143 | """ 144 | separated training of 145 | :param test_data: data 2b tested with current/best model 146 | :returns: accuracy 147 | """ 148 | params = { 149 | "batch_size": self.batch_size, 150 | "shuffle": True, 151 | "drop_last": True 152 | } 153 | test_loader = DataLoader(test_data, **params) 154 | 155 | try: 156 | best_model = RNN([], self.dict_size) 
157 | best_model.load_state_dict(torch.load(self.saved_file)) 158 | except: 159 | FileNotFoundError 160 | print("can't load best model because there is none") 161 | 162 | accuracy = self.access_model(model=best_model, 163 | data_loader=test_loader, 164 | access_data=test_data, 165 | criterion=self.criterion, 166 | mode="test", 167 | num_iter=self.num_iter) 168 | return np.around(accuracy, decimals=3) 169 | 170 | def access_model(self, model, data_loader, access_data, criterion, 171 | num_iter, mode="validate", epoch=0, best_accuracy=0.0): 172 | """ 173 | Validate model after every epoch 174 | 175 | :param model: TODO @Annie @Thang 176 | :param data_loader: TODO @Annie @Thang 177 | :param access_data: TODO @Annie @Thang 178 | :param criterion: TODO @Annie @Thang 179 | :param num_iter: integer 180 | :param mode: string TODO @Annie @Thang 181 | :param epoch: integer 182 | :param best_accuracy: float 183 | """ 184 | # Declare to start validating phase 185 | model.eval() 186 | loss_list = [] 187 | accuracy_list = [] 188 | 189 | if mode == "test": 190 | self.predict_testing_output = torch.LongTensor([]) 191 | for iter, (content, label) in enumerate(data_loader): 192 | if torch.cuda.is_available(): 193 | content = content.cuda() 194 | label = label.cuda() 195 | # In testing phase, we don't optimize model, 196 | # we only use model to predict value in testing set 197 | with torch.no_grad(): 198 | predicted_value = model(content) 199 | prediction = torch.argmax(predicted_value, dim=1) 200 | 201 | if mode == "test": 202 | self.predict_testing_output = torch.cat( 203 | (self.predict_testing_output, prediction)) 204 | 205 | # Comparing between truth output and predicted output 206 | accuracy = get_accuracy(prediction=prediction, 207 | actual_value=label, 208 | dict=self.dict) 209 | if accuracy > best_accuracy: 210 | best_accuracy = accuracy 211 | if mode == "validate": 212 | torch.save(model.state_dict(), self.saved_file) 213 | 214 | loss = criterion(predicted_value, label) 215 | loss_list.append(loss * label.size()[0]) 216 | accuracy_list.append(accuracy * label.size()[0]) 217 | 218 | loss = sum(loss_list) / access_data.__len__() 219 | accuracy = sum(accuracy_list) / access_data.__len__() 220 | 221 | loss = np.around(loss, decimals=3) 222 | if mode == "validate": 223 | print("Epoch ", epoch+1, "/", self.epochs, ". Validation Loss: ", 224 | loss, " Validation Accuracy: ", np.around(accuracy, decimals=3)) 225 | 226 | if mode == "test": 227 | raccuracy = np.around(accuracy, decimals=3) 228 | # print("Best Model. Loss: ", loss, " Accuracy: ", raccuracy) 229 | # for defectiveness prediction 230 | return accuracy 231 | 232 | 233 | def get_accuracy(prediction, actual_value, dict): 234 | """ 235 | Calculate the accuracy of the model after every batch. 
236 | 237 | :param prediction: list of predicted values 238 | :param actual_value: list of actual values 239 | :param dict: vocabulary 240 | :returns: accuracy for this batch 241 | """ 242 | count = 0 243 | 244 | for i in range(len(prediction)): 245 | # check if the prediction is correct and not unknown 246 | if(prediction[i] == actual_value[i] and prediction[i] != len(dict)-1): 247 | count += 1 248 | 249 | return count/len(prediction) 250 | 251 | 252 | class Datasets(Dataset): 253 | 254 | def __init__(self, seq_ins, seq_outs): 255 | """ 256 | Initial function used to get 257 | embedded training input, output, dictionary 258 | 259 | :param training_input: embedded input 260 | :param training_output: embedded output 261 | :param dict: embedded dictionary 262 | """ 263 | super(Datasets, self).__init__() 264 | 265 | self.seq_ins = seq_ins 266 | self.seq_outs = seq_outs 267 | 268 | def __getitem__(self, index): 269 | """ 270 | __getitem__ is a required function of Pytorch if we want to use 271 | neural network (torch.nn), get the content and corresponding 272 | label of each word is the index of next word in dictionary. 273 | 274 | :param index: index of word in training set or test set 275 | :return: content and label of this word 276 | """ 277 | 278 | seq_in = self.seq_ins[index] 279 | # At the moment, we can only output 1 token, because the size will 280 | # grow exponentially with the length of the output sequences 281 | seq_out = self.seq_outs[index][0] 282 | 283 | return seq_in, seq_out 284 | 285 | def __len__(self): 286 | """ 287 | __len__ is a required function of neural network of Pytorch. 288 | :return: the length of training set or test set 289 | """ 290 | 291 | return len(self.seq_outs) 292 | 293 | 294 | class RNN(nn.Module): 295 | 296 | def __init__(self, training_data, dict_size): 297 | """ 298 | Initial function for RNN. 299 | 300 | :param training_data: embedded training input 301 | :param dict: embedded dictionary 302 | """ 303 | super(RNN, self).__init__() 304 | 305 | self.training_data = training_data 306 | 307 | # RNN with 1 input layer, 1 hidden layer, 1 output layer 308 | # Input layer: 8 unit, hidden layer: 50 unit, output layer: 9 unit 309 | # The number of unit of output layer = input layer + 1 310 | # (1 for a unknown word) 311 | self.RNN = nn.RNN(input_size=dict_size, hidden_size=50, 312 | num_layers=1, bidirectional=False) 313 | 314 | # Fully connected layer 315 | self.fc = nn.Linear(in_features=50, out_features=dict_size) 316 | 317 | def forward(self, input): 318 | """ 319 | Pipeline for Neural network in Pytorch (build-in function). 
320 | 321 | :param input: 2-dimensional tensor ( batch_size x input_size) 322 | :returns: final output of neural network, 323 | the dimension of neural network = number of classes 324 | """ 325 | 326 | # Increasing dimension of input by 1 327 | # Input shape: [batch_size x input_size] 328 | # Output shape: [1 x batch_size x input_size] 329 | 330 | output, _ = self.RNN(input.float()) 331 | 332 | output = output.permute(1, 0, 2) 333 | output = self.fc(output[-1]) 334 | 335 | # print(output) 336 | 337 | return output 338 | -------------------------------------------------------------------------------- /workspace/modules/models/tree_lstm_base.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import re 4 | import ast # for python AST respresentation 5 | import random 6 | import torch 7 | import torch.legacy.nn as nn 8 | from abc import ABC, abstractmethod 9 | 10 | 11 | class TreeLSTM(nn.Module, ABC): 12 | """ 13 | Tree LSTM Interface reimplemented from the paper 14 | 'Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks' 15 | https://arxiv.org/abs/1503.00075 16 | LUA implementation: https://github.com/stanfordnlp/treelstm 17 | 18 | 19 | """ 20 | 21 | def __init__(self, in_dim, mem_dim): 22 | super().__init__() 23 | self.in_dim = in_dim 24 | if self.in_dim == None: 25 | print('input dimension must be specified') 26 | self.mem_dim = mem_dim 27 | # memory initialized with zeros 28 | self.zeros = torch.zeros(self.mem_dim) 29 | # boolean to check if model is training or evaluating 30 | self.train = False 31 | 32 | @abstractmethod 33 | def forward(self, tree, inputs): 34 | pass 35 | 36 | @abstractmethod 37 | def backward(self, tree, inputs, grad): 38 | pass 39 | 40 | # TODO ? 41 | 42 | def allocate_module(self, tree, module): 43 | pass 44 | 45 | def free_module(self, tree, module): 46 | pass 47 | -------------------------------------------------------------------------------- /workspace/modules/nn_modules/rnn_defect.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import re 4 | import ast # for python AST respresentation 5 | import random 6 | import torch 7 | import torch.legacy.nn as nn 8 | import modules.models.rnn_example as rnn 9 | import sklearn.linear_model as lm 10 | import numpy as np 11 | 12 | 13 | class RNNDefect(): 14 | """ 15 | RNN Neural network for processes of Defect prediction. 
16 | Its use is to understand the core processes of the Defect prediction Task 17 | -> that is archieved by adjusting the processes that are meant for the treelstm 18 | """ 19 | 20 | def __init__(self, pipeline): 21 | """ 22 | :param pipeline: holds all necessary information for nnsimulation 23 | """ 24 | # global dictionary 25 | self.dictionary = pipeline.dictionary 26 | # vector length 27 | self.emb_dim = len(pipeline.emb_matrix[0]) 28 | self.emb_matrix = pipeline.emb_matrix 29 | # embedding matrix as lookup table 30 | self.emb_matrix_look = nn.LookupTable( 31 | len(pipeline.emb_matrix), self.emb_dim) 32 | self.emb_matrix_look.weight = self.emb_matrix 33 | # length of input sequence for RNN 34 | self.in_len = 5 35 | 36 | # rnn properties 37 | # memory dimension 38 | self.mem_dim = 150 39 | # learning rate 40 | self.learning_rate = 0.05 41 | # word vector embedding learning rate 42 | self.emb_learning_rate = 0.0 43 | # minibatch size 44 | self.batch_size = 25 45 | # regulation strength 46 | self.reg = 1e-4 47 | # simulation module hidden dimension 48 | self.sim_nhidden = 50 49 | 50 | # optimization configuration 51 | self.optim_state = {self.learning_rate} 52 | 53 | # negative log likeligood optimization objective 54 | self.criterion = nn.ClassNLLCriterion() 55 | 56 | # models 57 | # initialize code learning model 58 | self.rnn = rnn.RNN_Example(self) 59 | # initialize classification model 60 | self.log_reg = lm.LogisticRegression() 61 | 62 | def train_datasets(self, dataset_def, dataset_cln, test_cln): 63 | """ 64 | training for defect prediction 65 | 66 | ! consists of 2 main training steps: 67 | 1 training RNN to learn how code should look like 68 | 2 training a classifier with labeled(clean/defective) code 69 | 70 | :param dataset_def: dataset containing defective asts 71 | :param dataset_cln: dataset containing clean asts 72 | :param test_cln: dataset containing clean test asts 73 | """ 74 | # training rnn 75 | self.train_clean(dataset_cln, test_cln) 76 | # training classifier 77 | self.train_pred(dataset_def, dataset_cln) 78 | 79 | def predict(self, test): 80 | """ 81 | calls testdata upon RNN to obtain Reconstruction accuracy. 
82 | the Reconstruction accuracy will be classified to find out how defective files can be 83 | :param test: testdata whose defectiveness is tested 84 | :returns: true if likely to be defective; false if not 85 | """ 86 | 87 | def classify(accuracy): 88 | """ 89 | classification process to determine the probability of data based on 90 | NN code recunstruction accuracy 91 | :param accuracy: NN code recunstruction accuracy 92 | :returns: percentage of likelihood of defectiveness 93 | """ 94 | return 1-accuracy # test classifier 95 | 96 | accuracy = self.rnn.test_network( 97 | self.parents_children(test, self.in_len)) # TODO input weird 98 | bug_prob = classify(accuracy) 99 | if bug_prob > 0.5: 100 | return 1 101 | else: 102 | return 0 103 | 104 | ############################### TRAINIGS ############################# 105 | 106 | # Training of RNN # 107 | 108 | def train_clean(self, dataset_cln, test_cln): 109 | """ 110 | trains and tests RNN with clean datafiles 111 | 112 | :param dataset_cln: dataset containing clean training asts 113 | :param test_cln: dataset containing clean test asts 114 | :returns: void; saves best model in folder 115 | 116 | TODO delete/adjust when made use of TreeLSTM 117 | """ 118 | # preparing in and outputs for recurrent neural network 119 | train_out, train_in = self.prepare_parents_children(dataset_cln) 120 | test_out, test_in = self.prepare_parents_children(test_cln) 121 | 122 | # embedding of datasets | Makes only sense for RNN because we loose context 123 | emb_train_out, emb_train_in = self.embed(train_out, train_in) 124 | # print("Embedded Training IN OUT\n", emb_train) 125 | emb_test_out, emb_test_in = self.embed(test_out, test_in) 126 | self.rnn.run(emb_train_in, emb_train_out, emb_test_in, emb_test_out) 127 | 128 | def prepare_parents_children(self, datasets): 129 | """ 130 | creates all traing/testing/validation data in and outputs for RNN 131 | :param datasets: list containing ASTs 132 | :retuns: 2 dimensional list that holds list of all children for parent at same index 133 | -> for all datasets 134 | TODO this works without lstmcontext now because we use standard RNN 135 | that must be changed/deleted later and be processed in the tree LSTM 136 | """ 137 | # collect parent children pairs first 138 | all_parents = [] 139 | all_children = [] 140 | for tree in datasets: 141 | parents, chilren = self.parents_children(tree, self.in_len) 142 | all_parents.extend(parents) 143 | all_children.extend(chilren) 144 | return all_parents, all_children 145 | 146 | def parents_children(self, tree, sequencelength): 147 | """ 148 | extracts all parents with its children from given AST 149 | :param tree: 2b-extracted python AST 150 | :returns: 2 lists that represent list of all children for parent at same index 151 | """ 152 | parents = [] 153 | children = [] 154 | for node in ast.walk(tree): 155 | loc_children = [] 156 | loc_children_ast = ast.iter_child_nodes(node) 157 | # test if node is branch, if yes then its ignored 158 | for child_ast in loc_children_ast: 159 | loc_children.append(child_ast.__class__.__name__) 160 | if not len(loc_children) == 0: 161 | parents.append(node.__class__.__name__) 162 | 163 | # im sorry for everyone who has to see this 164 | while len(loc_children) < sequencelength: 165 | loc_children.append("") 166 | 167 | children.append(loc_children) 168 | 169 | return parents, children 170 | 171 | def embed(self, parents, childrens): 172 | """ 173 | embedding for ast node names of parents and children 174 | 175 | :param parents: list holding parent 
tokens
176 |         :param children: list holding children tokens
177 |         :returns: index representation of the parents and vector representation of the children
178 |         """
179 | 
180 |         all_parents = []
181 |         for parent in parents:
182 |             all_parents.append([self.ast2index(parent)])
183 |         all_children = []
184 |         for children in childrens:
185 |             embedded_children = []
186 |             for child in children:
187 |                 # creating a combined vector of the children's vector values
188 |                 # (flat list of embedded child-node vectors per parent)
189 |                 embedded_children.append(self.ast2vec(child))
190 |             all_children.append(embedded_children)
191 |         # convert to numpy array
192 |         return all_parents, np.array(all_children, dtype=float)
193 | 
194 |     def ast2vec(self, ast_node):
195 |         """
196 |         embedding of a single AST node with the use of the local dictionary and embedding matrix
197 | 
198 |         :param ast_node: to-be-embedded AST node
199 |         :returns: vector representation of the node
200 |         """
201 |         # find the index first
202 |         index = self.ast2index(ast_node)
203 |         # look up the index in the embedding matrix
204 |         return self.emb_matrix[index]
205 | 
206 |     def ast2index(self, ast_node):
207 |         """
208 |         index embedding of a single AST node with the use of the local dictionary;
209 |         for one-hot
210 | 
211 |         :param ast_node: to-be-embedded AST node
212 |         :returns: index representation of the node
213 |         """
214 |         if ast_node in self.dictionary:
215 |             index = self.dictionary.index(ast_node)
216 |         else:
217 |             # last element in dictionary is the Unknown type; equals dictionary.index("UNK")
218 |             index = len(self.dictionary)-1
219 |         return index
220 | 
221 |     # Training the classification #
222 | 
223 |     def train_pred(self, def_data, cln_data):
224 |         """
225 |         trains the module's classifier with 2 types of data
226 |         :param def_data: defective-labeled data
227 |         :param cln_data: clean-labeled data
228 |         :returns: void; saves best model in folder
229 |         """
230 |         # obtaining lists containing reconstruction accuracies as inputs for logistic regression
231 |         # TODO we're creating a lot of subtrees of each AST (both inputs at this point hold only one AST)
232 |         def_in_out = self.parents_children(def_data[0], self.in_len)
233 |         cln_in_out = self.parents_children(cln_data[0], self.in_len)
234 |         def_in = []
235 |         cln_in = []
236 |         # TODO right now we're working with subtrees / that shouldn't stay like that because not all subtrees are defective
237 |         # BUT it's interesting because it looks at the internal AST structure - maybe that should happen in the NN instead
238 |         for i in range(len(def_in_out[0])):
239 |             # dimension 0 is parents and 1 is children
240 |             def_in.append(self.rnn.test_network(
241 |                 [def_in_out[0][i], def_in_out[1][i]]))
242 |         for i in range(len(cln_in_out[0])):
243 |             # dimension 0 is parents and 1 is children
244 |             cln_in.append(self.rnn.test_network(
245 |                 [cln_in_out[0][i], cln_in_out[1][i]]))
246 |         print("Defective Reconstruction Accuracies\n", def_in)
247 |         print("Clean Reconstruction Accuracies\n", cln_in)
248 |         # inputs and outputs for logistic regression (1 = defective, 0 = clean, matching X = def_in + cln_in)
249 |         X = def_in + cln_in
250 |         Y = [1 for i in range(len(def_in))] + [0 for i in range(len(cln_in))]
251 |         # TODO TRAIN
252 | 
--------------------------------------------------------------------------------
/workspace/modules/nn_modules/tree_lstm_defect.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | import re
4 | import ast  # for python AST representation
5 | import random
6 | import torch
7 | import torch.legacy.nn as nn
8 | 
9 | 
10 | class TreeLSTMDefect:
11 |     """
12 |     The actual TreeLSTM Module training and testing processes for 
DefectPrediction. 13 | with the instanciation of a TreeLSTM it will be able to do defect predictions for one file(one AST) 14 | """ 15 | 16 | def __init__(self, pipeline): 17 | """ 18 | :param pipeline: holds all necessary information for nnsimulation 19 | """ 20 | # global dictionary 21 | self.dictionary = pipeline.dictionary 22 | # vector length 23 | self.emb_dim = len(pipeline.emb_matrix[0]) 24 | self.emb_matrix = pipeline.emb_matrix 25 | # embedding matrix as lookup table 26 | self.emb_matrix_look = nn.LookupTable( 27 | len(pipeline.emb_matrix), self.emb_dim) 28 | self.emb_matrix_look.weight = self.emb_matrix 29 | # length of input sequence for RNN 30 | self.in_len = 3 31 | 32 | # lstm properties 33 | # memory dimension 34 | self.mem_dim = 150 35 | # learning rate 36 | self.learning_rate = 0.05 37 | # word vector embedding learning rate 38 | self.emb_learning_rate = 0.0 39 | # minibatch size 40 | self.batch_size = 25 41 | # regulation strength 42 | self.reg = 1e-4 43 | # simulation module hidden dimension 44 | self.sim_nhidden = 50 45 | 46 | # optimization configuration 47 | self.optim_state = {self.learning_rate} 48 | 49 | # negative log likeligood optimization objective 50 | self.criterion = nn.ClassNLLCriterion() 51 | 52 | ''' 53 | self.etree_lstm = etree.ETreeLSTM(self) 54 | try: 55 | self.params, self.grad_params = self.etree_lstm._flatten( 56 | self.etree_lstm.parameters()) 57 | except: 58 | self.params = self.grad_params = torch.zeros(1) 59 | ''' 60 | 61 | def train_datasets(self, dataset_def, dataset_cln, test_cln): 62 | """ 63 | training for the TreeLSTM 64 | """ 65 | 66 | def train_clean(self, dataset_cln, test_cln): 67 | """ 68 | trains and tests TreeLSTM with clean datafiles 69 | TODO in TreeLSTM 70 | consists of 3 steps for a tree: 71 | - recursively (from branch) walk over children and let them predict the parent node 72 | - Compare the prediction with actual node 73 | - adjust weights of model so that the difference is minimal 74 | 75 | :param dataset_cln: dataset containing clean training asts 76 | :param test_cln: dataset containing clean test asts 77 | :returns: void; saves best model in folder 78 | """ 79 | 80 | def predict_parent(self, children): 81 | """ 82 | predicting parent node based on child nodes 83 | : param children: list of children nodes 84 | : returns: most likely parent node 85 | """ 86 | pass 87 | 88 | def predict(self, tree): 89 | """ 90 | predicting defectiveness of a file/tree 91 | : param tree: 2b-evaluated abstract sytax tree 92 | : returns: likelihood of defectiveness 0-1 93 | """ 94 | pass 95 | 96 | def predict_def_datasets(self, dataset_def, dataset_cln): 97 | """ 98 | iterates over data and calculates the overall correctness of predictions 99 | : param dataset_def: dataset containing defective asts 100 | : param dataset_cln: dataset containing clean asts 101 | : returns: overall precision of Network 0-1 102 | """ 103 | pass 104 | 105 | ############################ 106 | """ 107 | TODO delete the following functions when everything runs 108 | theyre just here for some lookups but dont have any purpose 109 | """ 110 | 111 | def lstm_unit(self, ast_node, depth=0): 112 | """ 113 | Process of one LSTM unit. 114 | Recursively calls learning processes on all children in one tree 115 | 116 | : param ast_node: one Python AST node; First call will be with root Node 117 | : returns: hidden state and context of node; eventually for the whole AST 118 | """ 119 | weight = torch.tensor([]) # TODO weights with lstm calculation!! 
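        # Note (reference sketch, not yet wired in here): the Child-Sum Tree-LSTM update
        # from Tai et al. (2015), which this unit is meant to implement, is, per node j
        # with children k and learned parameters W_*, U_*, b_*:
        #   h_tilde = sum_k h_k
        #   i   = sigmoid(W_i x_j + U_i h_tilde + b_i)
        #   f_k = sigmoid(W_f x_j + U_f h_k + b_f)      (one forget gate per child)
        #   o   = sigmoid(W_o x_j + U_o h_tilde + b_o)
        #   u   = tanh(W_u x_j + U_u h_tilde + b_u)
        #   c   = i * u + sum_k f_k * c_k
        #   h   = o * tanh(c)
        # The gate computations below still apply the activations to an empty placeholder
        # tensor instead of these affine transforms.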
120 | w_t = ast2vec(ast_node, self.dictionary, 121 | self.emb_matrix) # embedding of tree 122 | # sum of children hidden outputs 123 | h_ = 0 124 | # child hidden state 125 | h_k = 0 126 | # context of child 127 | c_k = 0 128 | # forget gates 129 | f_tk = 0 130 | # childrem forgetrates times the context 131 | c_ = 0 132 | for k in ast.iter_child_nodes(ast_node): 133 | print(k, depth) 134 | h_k, c_k = self.lstm_unit(k, depth+1) 135 | f_tk = torch.nn.Sigmoid()(weight) 136 | h_ += h_k 137 | c_ += (f_tk * c_k) 138 | # input gate 139 | i_t = torch.nn.Sigmoid()(weight) 140 | # vector of new candidate values for t 141 | c_t_ = torch.nn.Tanh()(weight) 142 | # context 143 | c_t = i_t * c_t_ + c_ 144 | # output gate 145 | o_t = torch.nn.Sigmoid()(weight) 146 | h_t = o_t * torch.nn.Tanh()(c_t) 147 | 148 | return h_t, c_t 149 | 150 | def train_clean_trash(self, trees): 151 | """ 152 | consists of 3 steps for a tree: 153 | - recursively(from branch) walk over children and let them predict the parent node 154 | - Compare the prediction with actual node 155 | - adjust weights of model so that the difference is minimal 156 | """ 157 | bar = Bar('Training', max=len(trees)) 158 | self.etree_lstm.train = True 159 | indices = torch.randperm(len(trees)) 160 | zeros = torch.zeros(self.mem_dim) 161 | for i in range(1, len(trees)+1, self.batch_size): 162 | bar.next() # printing progress 163 | batch_size = min(i+self.batch_size - 1, len(trees))-i+1 164 | 165 | def f_eval(): 166 | pass 167 | 168 | # torch.optim.Adagrad(self.params, self.optim_state) 169 | bar.finish() 170 | -------------------------------------------------------------------------------- /workspace/pipelines/defect_prediction.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import re 4 | import ast # for python AST respresentation 5 | import random 6 | from progress.bar import Bar 7 | import torch 8 | import torch.legacy.nn as nn 9 | # for childsum tree LSTM model 10 | import modules.nn_modules.tree_lstm_defect as tlstm_dp 11 | import modules.nn_modules.rnn_defect as rnn_dp 12 | 13 | 14 | def read_file(path): 15 | """ 16 | reads file and returns it 17 | :param path: path to 2b-extracted file 18 | :return: string of file 19 | """ 20 | with open(path, 'r') as file: 21 | return file.read() 22 | 23 | 24 | def tidy(files): 25 | """ 26 | doing first tidying processes on code before parsing 27 | :param files: 2b-tidied code files 28 | :returns: string code 29 | """ 30 | # removing comments 31 | def stripComments(code_str): 32 | code_str = str(code) 33 | return re.sub(r'(?m)^ *#.*\n?', '', code_str) 34 | 35 | # removing docstrings TODO 36 | 37 | t_files = [] 38 | for code in files: 39 | t_files.append(stripComments(code)) 40 | return t_files 41 | 42 | 43 | def parse_data(data): 44 | """ 45 | parsing process for python ast representation 46 | :param data: already tidied code files 47 | :returns: list containing all parsed files 48 | """ 49 | ast_data = [] 50 | for code in data: 51 | ast_data.append(ast.parse(code)) 52 | return ast_data 53 | 54 | 55 | def create_dictionary(datasets, max_count): 56 | """ 57 | Builds a fixed sized dictionary, with an entry at last position 58 | 59 | :param dataset: tokenized list, optionally n dimensional 60 | :param max_count: defines how long our dictionary will be 61 | :return: top 'max_count' tokens of dictionary as a list 62 | """ 63 | 64 | def extract_highest_occurences(dataset, max_count): 65 | """ 66 | finds highest token occurences in datasets 67 | 68 
| :param dataset: tokenized list, optionally n dimensional 69 | :param max_count: defines how long our dictionary will be 70 | :return: top 'max_count' tokens of dictionary as a list 71 | """ 72 | # global fixed size dictionary as set 73 | global_dict = {} 74 | # iterating over all training files 75 | for tree in dataset: 76 | # a complete dictionary for one file 77 | local_dict = {} 78 | # iterating over all words of a file 79 | for ast_node in ast.walk(tree): 80 | # if word not in the local dictionary then we add it 81 | # otherwise we rise count 82 | node = ast_node.__class__.__name__ 83 | if node in local_dict: 84 | local_dict[node] += 1 85 | else: 86 | local_dict[node] = 1 87 | # local dict counts will now be merged into fix sized, global dictionary 88 | for ast_node in local_dict: 89 | if ast_node in global_dict: 90 | global_dict[ast_node] += local_dict[ast_node] 91 | else: 92 | global_dict[ast_node] = local_dict[ast_node] 93 | 94 | # global dict will be filled with highest counts 95 | # first we find highest count 96 | highest_count = 0 97 | for ast_node in global_dict: 98 | if highest_count < global_dict[ast_node]: 99 | highest_count = global_dict[ast_node] 100 | # now we create an updated highest count global dict 101 | new_global_dict = {} 102 | while len(new_global_dict) < max_count: 103 | if highest_count < 1 or len(new_global_dict) >= 2*len(global_dict): 104 | break 105 | # filling new global 106 | for ast_node in global_dict: 107 | if (global_dict[ast_node] == highest_count and 108 | len(new_global_dict) < max_count): 109 | new_global_dict[ast_node] = highest_count 110 | highest_count -= 1 111 | global_dict = new_global_dict 112 | # print("maxcout", global_dict) 113 | return list(global_dict) 114 | 115 | # running through dimensions 116 | for dataset in datasets: 117 | dictionary = extract_highest_occurences(dataset, max_count-1) 118 | dictionary.append("UNK") 119 | return dictionary 120 | 121 | 122 | def truncate(f, n): 123 | '''Truncates/pads a float f to n decimal places without rounding''' 124 | s = '{}'.format(f) 125 | if 'e' in s or 'E' in s: 126 | return '{0:.{1}f}'.format(f, n) 127 | i, p, d = s.partition('.') 128 | return '.'.join([i, (d+'0'*n)[:n]]) 129 | 130 | 131 | def random_embed(dictionary, vector_length): 132 | """ 133 | embeds dictionary with random valued vectors for initializing random weights. 
134 | values lay between -1 and 1 135 | 136 | :param dictionary: 2b-embedded dictionary 137 | :param vector_length: desired length for vectors 138 | :returns: embedded dictionary as matrix 139 | """ 140 | e_mat = [] 141 | # create matric with dimension dictionarysize x vector length 142 | for mi in range(len(dictionary)): 143 | e_vec = [] 144 | for vi in range(vector_length): 145 | e_vec.append(truncate(random.uniform(-1.0, 1.0), 2)) 146 | e_mat.append(e_vec) 147 | return e_mat 148 | 149 | def one_hot(dictionary): 150 | """ 151 | embeds dictionary with one hot vectors : zero vector with 1 at specified position 152 | 153 | :param dictionary: 2b-embedded dictionary 154 | :returns: embedded dictionary as matrix 155 | """ 156 | e_mat = [] 157 | for mi in range(len(dictionary)): 158 | e_vec = [0 for i in range(len(dictionary))] 159 | e_vec[mi] = 1 160 | e_mat.append(e_vec) 161 | return e_mat 162 | 163 | 164 | class DefectPrediction: # main pipeline 165 | """ 166 | implementation Attemt to a published Paper: 167 | 'A deep tree-based model for software defect prediction' 168 | Reference: https://arxiv.org/abs/1802.00921 169 | 170 | TASK: Predicting Probability of a Code Being Defective or not 171 | """ 172 | 173 | def __init__(self, data_defective, data_clean, data_test): 174 | """ 175 | :param data_defective: path to datacorpus code labled as defective 176 | :param data_clean: path to datacorpus code labled as clean 177 | """ 178 | self.raw_data_defective = data_defective 179 | self.raw_data_clean = data_clean 180 | self.raw_data_test = data_test 181 | # vocabulary/dictionary size 182 | self.voc_size = 100 183 | self.vec_length = 3 # = voc size for one hot 184 | 185 | def run(self): 186 | """ 187 | runs whole Pipeline with already initialized defective and clean datasets 188 | """ 189 | # PREPROCESSING ########### 190 | 191 | # cleaning and opening files TODO manage datacorpus with lables and crawling 192 | data_def = tidy([read_file(self.raw_data_defective)]) 193 | data_cln = tidy([read_file(self.raw_data_clean)]) 194 | data_test = tidy([read_file(self.raw_data_test)]) 195 | 196 | # transforming file strings to AST and filling datasets 197 | # parsing with own funciton ast_data_def_exp = code2ast(data_def) TODO manage error/empty files 198 | # parsing with ast.parse 199 | 200 | ast_data_def = parse_data(data_def) 201 | ast_data_cln = parse_data(data_cln) 202 | ast_data_test = parse_data(data_test) 203 | 204 | # print ast data 205 | #("Defective Data AST:\n", ast.dump(ast_data_def[0])) 206 | #print("Clean Data AST:\n", ast.dump(ast_data_cln[0])) 207 | print("Test Data AST:\n", ast.dump(ast_data_test[0])) 208 | 209 | # vocabulary of highest occurences 210 | self.dictionary = create_dictionary( 211 | [ast_data_cln, ast_data_def], self.voc_size) # TODO manage whole datastorage 212 | print("Dictionary:\n", self.dictionary) 213 | 214 | # EMBEDDING ########### TODO learning? 215 | # random embedding of dictionary; as initializing! 
216 |         self.emb_matrix = one_hot(self.dictionary)
217 |         # print("Embedding Matrix:\n", self.emb_matrix)
218 | 
219 |         # NEURAL NETWORK ########### TODO
220 |         # initializing the model
221 |         model = rnn_dp.RNNDefect(self)
222 | 
223 |         # training parent prediction on clean data (the test data is also used for general testing)
224 |         model.train_datasets(ast_data_def, ast_data_cln, ast_data_test)
225 | 
226 |         # predicting defectiveness of the test data
227 |         # RESULTS ###########
228 |         defective = model.predict(ast_data_test)
229 |         if defective:
230 |             print("The Test Data is likely to be defective")
231 |         else:
232 |             print("The Test Data is not likely to be defective")
233 | 
--------------------------------------------------------------------------------
/workspace/test_logistic_regression.py:
--------------------------------------------------------------------------------
1 | import unittest
2 | import torch
3 | import numpy as np
4 | import os
5 | import sys
6 | 
7 | 
8 | import modules.models.logistic_regression as lr
9 | 
10 | lo_reg_config = {"epochs": 400, "learning_rate": 0.02}
11 | _lr = lr.LogisticRegression(lo_reg_config)
12 | X_train = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
13 | Y_train = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0])
14 | 
15 | 
16 | class TestPublicFunctions(unittest.TestCase):
17 |     # checks if calculations are correct
18 |     def test_assert_sigmoid_calc(self):
19 |         cal = lr.sigmoid(2)
20 |         self.assertAlmostEqual(cal, 0.8807970779778823)
21 | 
22 | 
23 | class TestLoRegInit(unittest.TestCase):
24 |     # checks if the default init works correctly
25 |     def test_assert_default_init(self):
26 |         de_lr = lr.LogisticRegression()
27 |         self.assertTrue(de_lr.epochs == 50 and de_lr.learning_rate ==
28 |                         0.01)
29 | 
30 |     # checks if a non-default init works
31 |     def test_assert_non_default_init(self):
32 |         self.assertTrue(_lr.epochs == 400 and _lr.learning_rate ==
33 |                         0.02)
34 | 
35 | 
36 | class TestLoRegTraining(unittest.TestCase):
37 |     # checks if training accepts the input
38 |     def test_assert_training_input(self):
39 |         _lr.train(X_train, Y_train)
40 | 
41 | 
42 | class TestLogRegTesting(unittest.TestCase):
43 |     # checks if the model's accuracy makes sense
44 |     def test_assert_testaccuracy(self):
45 |         _lr.train(X_train, Y_train)
46 |         prediction1 = _lr.test(0.1)
47 | 
48 |         self.assertTrue(prediction1 == 1)  # and prediction2 == 1)
49 | 
50 | 
51 | if __name__ == "__main__":
52 |     unittest.main()
53 | 
--------------------------------------------------------------------------------
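A quick sanity check of the threshold fit on the unit-test data above (assuming the nine-element Y_train): LogisticRegression.train() gets m_1 = mean(0.1, 0.2, 0.3, 0.4) = 0.25 for the defective-labeled inputs and m_0 = mean(0.5, ..., 0.9) = 0.7 for the clean-labeled ones, so T = 0.25 + (0.7 - 0.25) * (4/9) = 0.45; test(0.1) is not above T and therefore returns 1 (defective), which is what test_assert_testaccuracy expects.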