├── .gitignore
├── README.md
├── dtLSTM_Implementation
└── workspace
    ├── data
    │   ├── clean_test.py
    │   ├── defective_test.py
    │   └── testdata_test.py
    ├── main.py
    ├── modules
    │   ├── models
    │   │   ├── child_sum_tree_lstm.py
    │   │   ├── logistic_regression.py
    │   │   ├── rnn_example.py
    │   │   └── tree_lstm_base.py
    │   └── nn_modules
    │       ├── rnn_defect.py
    │       └── tree_lstm_defect.py
    ├── pipelines
    │   └── defect_prediction.py
    └── test_logistic_regression.py
/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | /__pycache__
3 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # DeepLearningPipelines
2 | 
3 | This is an experimental repository accompanying a 'Deep Learning for Code Generation' project. It collects referenced and self-written code generation tasks, implemented in Python 3.
4 | 
5 | 
6 | ## Structure
7 | 
8 | Each task can be found as a whole in 'pipelines'
9 | and split into its core processes in 'modules'.
10 | 
11 | 
12 | ## Tasks include:
13 | 
14 | - **Defect Prediction with Deep-Tree LSTM:**
15 | An implementation attempt of 'A deep tree-based model for software defect prediction'
16 | -> https://arxiv.org/abs/1802.00921
17 | 
18 | - **Prediction of semantic relatedness of two sentences &**
19 | - **Sentiment Classification with Tree-Structured LSTM:**
20 | Python adaptation and module extraction of
21 | 'Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks'
22 | -> https://arxiv.org/abs/1503.00075
23 | An already public implementation can be found here: https://github.com/stanfordnlp/treelstm
24 | 
--------------------------------------------------------------------------------
/dtLSTM_Implementation:
--------------------------------------------------------------------------------
1 | Progress Documentation:
2 | 
3 | 
4 | 
5 | A deep tree-based model for software defect prediction - Self Implementation
6 | 
7 | - the paper's algorithm has most preprocessing inside the LSTM unit calculation
8 |   -> will be extracted and fitted to our modules
9 | - the data is not embedded beforehand; instead an embedding matrix is used within the NN processes in order to look up the vectors
10 |   -> the embedding only covers the actual node names and doesn't represent any AST structure
11 |   -> that's how I will process it for now, but it might ignore important data?
12 | - the tree implementation will help here, but the LSTM unit processes will have to be adjusted (DefectTreeLSTM)
13 | 
14 | - internal PROBLEMS:
15 |   Everything regarding the parent prediction makes sense:
16 |   We train on clean data to see which AST child-parent configurations are the most common.
17 |   That is done by iterating over ASTs starting from the branches, predicting the parent from the children
18 |   (and context) and adjusting the weights based on the difference between prediction and outcome (a small sketch of this pair extraction follows below this list).
19 |   The resulting, trained network is then used to recursively iterate over AST nodes, doing some
20 |   LSTM processing on each node to obtain a vector, which is then classified(?)
21 |   BUT:
22 |   - Now how is the defective data used?
23 |   - How do training and predicting differ?
24 |   - How do we instantiate and train the classification process?
25 | 
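  A minimal sketch of this child->parent pair extraction (illustrative only; it mirrors what
  parents_children() in modules/nn_modules/rnn_defect.py does, and the helper name here is made up):

      import ast

      def extract_pairs(tree):
          # one (children, parent) training pair per AST node that has children
          pairs = []
          for node in ast.walk(tree):
              children = [c.__class__.__name__ for c in ast.iter_child_nodes(node)]
              if children:
                  pairs.append((children, node.__class__.__name__))
          return pairs

      # e.g. extract_pairs(ast.parse(open("workspace/data/clean_test.py").read()))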
26 | - Paper Documentation PROBLEMS:
27 |   - The actual defect prediction learning is not explained
28 |   - The Defect Prediction Algorithm doesn't say much about what it does
29 |   - Which TreeLSTM is actually used and how is it modified?? (ChildSum)
30 |   - What is a vector representation of an AST, how can one obtain it, and what does it actually stand for?
31 | 
32 | 
33 | => In order to understand the overall functionality of the paper, the Lua TreeLSTM will be put aside for now; the NN process will be simulated by an RNN and extended later on
34 | 
35 | 
36 | 
37 | Future Steps for the implementation only:
38 | (X)- Understanding/adapting the actual prediction model
39 | (*)- Include dummy RNN (note: doesn't really run yet)
40 | (X)- Adjust RNN training and predicting
41 | ( )- Research and add classification training
42 | ( )- Manage embedding training??
43 | ( )- Include TreeLSTM
44 | ( )- Build and understand TreeLSTM (ChildSum?)
45 | ( )- Adjust training and predicting
46 | 
47 | Future Steps for after connecting it to COGE™:
48 | ( )- Expanding data crawling (PROMISE dataset)
49 | 
50 | 
51 | 
52 | Explicit Comments:
53 | - Defect prediction is done like the training, i.e. all child nodes of an AST are fed into the LSTM and each parent node is predicted.
54 | + Defect prediction thereby basically becomes the accuracy calculation on the test data
55 | +/- This process is now directly tied to the accuracy of the LSTM
56 | - Defective data will be used to find the threshold between reconstruction accuracy and defect probability (a rough sketch of this follows below)
57 | - The threshold might be inaccurate
58 | 
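  A rough sketch of that accuracy-threshold classification (illustrative only; LogisticRegression.train()
  in modules/models/logistic_regression.py fits the threshold in a similar but weighted way, and these
  function names are made up):

      def fit_threshold(clean_accs, defective_accs):
          # clean files should reconstruct better, so the cut-off sits between the two means
          m_clean = sum(clean_accs) / len(clean_accs)
          m_def = sum(defective_accs) / len(defective_accs)
          return (m_clean + m_def) / 2

      def is_defective(reconstruction_acc, threshold):
          return reconstruction_acc < threshold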
59 | 
60 | 
61 | 
62 | 
63 | ----------------------------------------------------------------------------
64 | Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks - Reference Implementations
65 | 
66 | - The library consists of two tasks:
67 |   semantic relatedness | sentiment classification
68 | 
69 |   and can make use of 4 NN models, of which only the TreeLSTM is interesting:
70 |   (LSTM (normal, bidirectional) and TreeLSTM (ChildSum and N-ary))
71 | 
72 | - They have two different tree algorithms, and both probably have to be adapted for defect prediction:
73 |   - ChildSum Tree LSTM:
74 |   - N-ary Tree LSTM:
75 |     'can be used on tree structures where the branching factor
76 |     is at most N and where children are ordered'
77 |   -> I don't really understand which one has the better/more suitable functionality, so I'll combine them?
78 | 
79 | - The tasks don't fit the defect prediction model particularly well; nonetheless, the label classification process from the sentiment classification might suffice for its training process
80 | 
--------------------------------------------------------------------------------
/workspace/data/clean_test.py:
--------------------------------------------------------------------------------
1 | """
2 | clean-labeled file taken from the motivation example of the paper
3 | """
4 | 
5 | 
6 | def test_clean(stack):
7 |     x = 0
8 |     while x < 10:
9 |         y = 0
10 |         if not stack.empty():
11 |             y = stack.pop()
12 |         x += 1
13 | 
--------------------------------------------------------------------------------
/workspace/data/defective_test.py:
--------------------------------------------------------------------------------
1 | """
2 | defective-labeled file taken from the motivation example of the paper
3 | """
4 | 
5 | 
6 | def test_defective(stack):  # a testcomment
7 |     x = 0
8 |     if not stack.empty():
9 |         while x < 10:
10 |             y = 0
11 |             y = stack.pop()
12 |             x += 1
13 | 
14 | def test_defective2(stack):  # a testcomment
15 |     x = 0
16 |     if not stack.empty():
17 |         while x < 10:
18 |             y = 0
19 |             y = stack.pop()
20 |             x += 1
21 | 
--------------------------------------------------------------------------------
/workspace/data/testdata_test.py:
--------------------------------------------------------------------------------
1 | def return_first(x, y):
2 |     x = 3
3 |     return x
--------------------------------------------------------------------------------
/workspace/main.py:
--------------------------------------------------------------------------------
1 | import pipelines.defect_prediction as dp
2 | 
3 | if __name__ == "__main__":
4 |     """
5 |     Main to test-run specific pipelines or modules
6 |     """
7 |     dp_pipeline = dp.DefectPrediction('/home/emil/Documents/DeepLearningProject/PaperImplementation/DeepLearningPipelines/workspace/data/defective_test.py',
8 |                                       '/home/emil/Documents/DeepLearningProject/PaperImplementation/DeepLearningPipelines/workspace/data/clean_test.py',
9 |                                       '/home/emil/Documents/DeepLearningProject/PaperImplementation/DeepLearningPipelines/workspace/data/testdata_test.py')
10 |     dp_pipeline.run()
11 |     print("run completed")
12 | 
--------------------------------------------------------------------------------
/workspace/modules/models/child_sum_tree_lstm.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | import re
4 | import ast  # for python AST representation
5 | import random
6 | import torch
7 | import modules.models.tree_lstm_base as tlstm
8 | import torch.legacy.nn as nn
9 | 
10 | 
11 | class TreeLSTM(tlstm.TreeLSTM):
12 |     """
13 |     E for Experimental: a Tree LSTM inheriting from the TreeLSTM base that is used for label prediction
14 |     """
15 | 
16 |     def __init__(self, config):
17 |         super().__init__(config.emb_dim, config.mem_dim)
18 |         self.criterion = config.criterion
19 | 
20 |     def forward(self, tree, inputs):
21 |         pass
22 | 
23 |     def backward(self, tree, inputs, grad):
24 |         pass
25 | 
--------------------------------------------------------------------------------
/workspace/modules/models/logistic_regression.py:
--------------------------------------------------------------------------------
1 | import torch.nn as nn
2 | from torch.utils.data.dataset import Dataset
3 | import numpy as np
4 | import argparse
5 | import os
6 | import time
7 | import torch
8 | import torch.nn as nn
9 | from sklearn.linear_model import LogisticRegression as sk_lr
10 | 
11 | 
12 | def 
sigmoid(Z): 13 | return 1/(1+np.e**(-Z)) 14 | 15 | 16 | def logistic_loss(y, y_hat): 17 | return -np.mean(y*np.log(y_hat)+(1-y)*np.log(1-y_hat)) 18 | 19 | 20 | def set_pairs(X): 21 | # input list of single x1 values and make 2d vectors with x2 =1 22 | X_2d = [] 23 | for x1 in X: 24 | X_2d.append([x1, 1]) 25 | return X_2d 26 | 27 | 28 | class LogisticRegression: 29 | """ 30 | One-to-One Network Model that trains logistic regression threshold 31 | to predict binary classification : Y=(0/1) for X=(x1) 32 | """ 33 | 34 | def __init__(self, config=None): 35 | if config != None: 36 | self.epochs = config["epochs"] 37 | self.learning_rate = config["learning_rate"] 38 | else: 39 | self.epochs = 50 40 | self.learning_rate = 0.01 41 | self.T = 0 #treshold 42 | 43 | def train(self, X, Y): 44 | #not really training, just finign average threshold 45 | d_0 = [] # cln should have higher accuracies 46 | d_1 = [] # def 47 | for i in range(len(X)): 48 | if Y[i] == 0: 49 | d_0.append(X[i]) 50 | elif Y[i] == 1: 51 | d_1.append(X[i]) 52 | else: 53 | print("something wrong with training data") 54 | m_0 = np.average(d_0) 55 | m_1 = np.average(d_1) 56 | diff = m_0-m_1 57 | self.T = m_1+ diff*(len(d_1)/len(X)) 58 | #print(self.T) 59 | 60 | 61 | def test(self, X): 62 | """ 63 | predicts Y for X on trained model 64 | :param X: 2b-predicting float input 65 | :returns: true or false (for defect prediction true if defective) 66 | """ 67 | if X>self.T: 68 | return 0#not buggy 69 | else: 70 | return 1 71 | -------------------------------------------------------------------------------- /workspace/modules/models/rnn_example.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | from torch.utils.data.dataset import Dataset 3 | import numpy as np 4 | import argparse 5 | import os 6 | import time 7 | import torch 8 | import torch.nn as nn 9 | from torch.utils.data import DataLoader 10 | 11 | 12 | class RNN_Example(): 13 | """ 14 | Implementation of Recurrent Neural Network. 15 | For Emils Pipeline Implementation 16 | """ 17 | 18 | def __init__(self, config): 19 | self.dict = config.dictionary 20 | self.batch_size = config.batch_size 21 | self.epochs = 10 22 | self.input_length = config.emb_dim*config.in_len # TODO 23 | self.output_length = config.emb_dim 24 | self.dict_size = len(self.dict) 25 | self.saved_path = "/home/emil/Documents/DeepLearningProject/PaperImplementation/DeepLearningPipelines/workspace/dump" 26 | self.saved_file = os.path.join(self.saved_path, "best_trained_model") 27 | # TODO: current model not taking sequences of token, only 1 token 28 | 29 | def run(self, train_in, train_out, test_in, test_out): 30 | 31 | self.train_network(train_in, train_out, test_in, test_out) 32 | return list(self.predict_testing_output.numpy()) 33 | 34 | def train_network(self, train_in, train_out, test_in, test_out): 35 | """ 36 | Train network. 37 | 38 | Train each epoch by training set, evaluate model after each epoch 39 | using validation set, save the best model and test using test set. 40 | 41 | This function also prints out loss, accuracy each epoch 42 | and loss/accuracy of the best model. 
43 | 44 | :param datasets: list of input/output sets, 45 | :returns: none 46 | """ 47 | # Training and Testing data will look like a 2 dim array 48 | # where each index holds corresponding in [0] to output [1] 49 | # print("TRAINING IN OUT:", train) 50 | training_input = train_in 51 | print("Training IN RNN\n", training_input) 52 | training_output = train_out 53 | print("Training OUT RNN\n", training_output) 54 | validating_input = train_in 55 | validating_output = train_out 56 | testing_input = test_in 57 | testing_output = test_out 58 | 59 | # Used to compare with accuracy of model 60 | best_accuracy = 0.0 61 | 62 | params = { 63 | "batch_size": self.batch_size, 64 | "shuffle": True, 65 | "drop_last": True 66 | } 67 | 68 | # Datasets object generate data which will put into neural network 69 | # Datasets contain some specific functions to adapt nn in Pytorch 70 | train_data = Datasets(training_input, training_output) 71 | valid_data = Datasets(validating_input, validating_output) 72 | test_data = Datasets(testing_input, testing_output) 73 | 74 | # DataLoader used to load data equal to batch_size 75 | train_loader = DataLoader(train_data, **params) 76 | valid_loader = DataLoader(valid_data, **params) 77 | test_loader = DataLoader(test_data, **params) 78 | 79 | model = RNN(training_data=training_input, dict_size=self.dict_size) 80 | 81 | # Check if computer have graphic card, 82 | # model will be trained py GPU instead of CPU 83 | if torch.cuda.is_available(): 84 | model.cuda() 85 | 86 | # Loss function 87 | self.criterion = nn.CrossEntropyLoss() 88 | 89 | # Optimization 90 | optimizer = torch.optim.Adam(model.parameters(), lr=1e-3) 91 | 92 | # Number of iteration ( = length of data / batch_size) 93 | self.num_iter = int(train_data.__len__()/self.batch_size) 94 | 95 | for epoch in range(self.epochs): 96 | # Declare to start training phase 97 | model.train() 98 | for iter, (content, label) in enumerate(train_loader): 99 | start_time = time.time() 100 | if torch.cuda.is_available(): 101 | content = content.cuda() 102 | label = label.cuda() 103 | # Clean buffer to avoid accumulate value of buffer 104 | optimizer.zero_grad() 105 | 106 | # Training model to understand the content of traning set 107 | predicted_value = model(content) 108 | 109 | # Calculating loss 110 | loss = self.criterion(predicted_value, label) 111 | 112 | # Back propagation 113 | loss.backward() 114 | 115 | # Optimizing model based on loss 116 | optimizer.step() 117 | elapse_time = time.time() - start_time 118 | 119 | # validate this model at the end of each epoch 120 | self.access_model( 121 | model=model, 122 | data_loader=test_loader, 123 | access_data=test_data, 124 | criterion=self.criterion, 125 | num_iter=self.num_iter, 126 | epoch=epoch, 127 | best_accuracy=best_accuracy) 128 | 129 | try: 130 | model.load_state_dict(torch.load(self.saved_file)) 131 | except: 132 | FileNotFoundError 133 | print("can't save best model because there is none") 134 | 135 | self.access_model(model=model, 136 | data_loader=test_loader, 137 | access_data=test_data, 138 | criterion=self.criterion, 139 | mode="test", 140 | num_iter=self.num_iter) 141 | 142 | def test_network(self, test_data): 143 | """ 144 | separated training of 145 | :param test_data: data 2b tested with current/best model 146 | :returns: accuracy 147 | """ 148 | params = { 149 | "batch_size": self.batch_size, 150 | "shuffle": True, 151 | "drop_last": True 152 | } 153 | test_loader = DataLoader(test_data, **params) 154 | 155 | try: 156 | best_model = RNN([], self.dict_size) 
157 | best_model.load_state_dict(torch.load(self.saved_file)) 158 | except: 159 | FileNotFoundError 160 | print("can't load best model because there is none") 161 | 162 | accuracy = self.access_model(model=best_model, 163 | data_loader=test_loader, 164 | access_data=test_data, 165 | criterion=self.criterion, 166 | mode="test", 167 | num_iter=self.num_iter) 168 | return np.around(accuracy, decimals=3) 169 | 170 | def access_model(self, model, data_loader, access_data, criterion, 171 | num_iter, mode="validate", epoch=0, best_accuracy=0.0): 172 | """ 173 | Validate model after every epoch 174 | 175 | :param model: TODO @Annie @Thang 176 | :param data_loader: TODO @Annie @Thang 177 | :param access_data: TODO @Annie @Thang 178 | :param criterion: TODO @Annie @Thang 179 | :param num_iter: integer 180 | :param mode: string TODO @Annie @Thang 181 | :param epoch: integer 182 | :param best_accuracy: float 183 | """ 184 | # Declare to start validating phase 185 | model.eval() 186 | loss_list = [] 187 | accuracy_list = [] 188 | 189 | if mode == "test": 190 | self.predict_testing_output = torch.LongTensor([]) 191 | for iter, (content, label) in enumerate(data_loader): 192 | if torch.cuda.is_available(): 193 | content = content.cuda() 194 | label = label.cuda() 195 | # In testing phase, we don't optimize model, 196 | # we only use model to predict value in testing set 197 | with torch.no_grad(): 198 | predicted_value = model(content) 199 | prediction = torch.argmax(predicted_value, dim=1) 200 | 201 | if mode == "test": 202 | self.predict_testing_output = torch.cat( 203 | (self.predict_testing_output, prediction)) 204 | 205 | # Comparing between truth output and predicted output 206 | accuracy = get_accuracy(prediction=prediction, 207 | actual_value=label, 208 | dict=self.dict) 209 | if accuracy > best_accuracy: 210 | best_accuracy = accuracy 211 | if mode == "validate": 212 | torch.save(model.state_dict(), self.saved_file) 213 | 214 | loss = criterion(predicted_value, label) 215 | loss_list.append(loss * label.size()[0]) 216 | accuracy_list.append(accuracy * label.size()[0]) 217 | 218 | loss = sum(loss_list) / access_data.__len__() 219 | accuracy = sum(accuracy_list) / access_data.__len__() 220 | 221 | loss = np.around(loss, decimals=3) 222 | if mode == "validate": 223 | print("Epoch ", epoch+1, "/", self.epochs, ". Validation Loss: ", 224 | loss, " Validation Accuracy: ", np.around(accuracy, decimals=3)) 225 | 226 | if mode == "test": 227 | raccuracy = np.around(accuracy, decimals=3) 228 | # print("Best Model. Loss: ", loss, " Accuracy: ", raccuracy) 229 | # for defectiveness prediction 230 | return accuracy 231 | 232 | 233 | def get_accuracy(prediction, actual_value, dict): 234 | """ 235 | Calculate the accuracy of the model after every batch. 
236 | 237 | :param prediction: list of predicted values 238 | :param actual_value: list of actual values 239 | :param dict: vocabulary 240 | :returns: accuracy for this batch 241 | """ 242 | count = 0 243 | 244 | for i in range(len(prediction)): 245 | # check if the prediction is correct and not unknown 246 | if(prediction[i] == actual_value[i] and prediction[i] != len(dict)-1): 247 | count += 1 248 | 249 | return count/len(prediction) 250 | 251 | 252 | class Datasets(Dataset): 253 | 254 | def __init__(self, seq_ins, seq_outs): 255 | """ 256 | Initial function used to get 257 | embedded training input, output, dictionary 258 | 259 | :param training_input: embedded input 260 | :param training_output: embedded output 261 | :param dict: embedded dictionary 262 | """ 263 | super(Datasets, self).__init__() 264 | 265 | self.seq_ins = seq_ins 266 | self.seq_outs = seq_outs 267 | 268 | def __getitem__(self, index): 269 | """ 270 | __getitem__ is a required function of Pytorch if we want to use 271 | neural network (torch.nn), get the content and corresponding 272 | label of each word is the index of next word in dictionary. 273 | 274 | :param index: index of word in training set or test set 275 | :return: content and label of this word 276 | """ 277 | 278 | seq_in = self.seq_ins[index] 279 | # At the moment, we can only output 1 token, because the size will 280 | # grow exponentially with the length of the output sequences 281 | seq_out = self.seq_outs[index][0] 282 | 283 | return seq_in, seq_out 284 | 285 | def __len__(self): 286 | """ 287 | __len__ is a required function of neural network of Pytorch. 288 | :return: the length of training set or test set 289 | """ 290 | 291 | return len(self.seq_outs) 292 | 293 | 294 | class RNN(nn.Module): 295 | 296 | def __init__(self, training_data, dict_size): 297 | """ 298 | Initial function for RNN. 299 | 300 | :param training_data: embedded training input 301 | :param dict: embedded dictionary 302 | """ 303 | super(RNN, self).__init__() 304 | 305 | self.training_data = training_data 306 | 307 | # RNN with 1 input layer, 1 hidden layer, 1 output layer 308 | # Input layer: 8 unit, hidden layer: 50 unit, output layer: 9 unit 309 | # The number of unit of output layer = input layer + 1 310 | # (1 for a unknown word) 311 | self.RNN = nn.RNN(input_size=dict_size, hidden_size=50, 312 | num_layers=1, bidirectional=False) 313 | 314 | # Fully connected layer 315 | self.fc = nn.Linear(in_features=50, out_features=dict_size) 316 | 317 | def forward(self, input): 318 | """ 319 | Pipeline for Neural network in Pytorch (build-in function). 
320 | 321 | :param input: 2-dimensional tensor ( batch_size x input_size) 322 | :returns: final output of neural network, 323 | the dimension of neural network = number of classes 324 | """ 325 | 326 | # Increasing dimension of input by 1 327 | # Input shape: [batch_size x input_size] 328 | # Output shape: [1 x batch_size x input_size] 329 | 330 | output, _ = self.RNN(input.float()) 331 | 332 | output = output.permute(1, 0, 2) 333 | output = self.fc(output[-1]) 334 | 335 | # print(output) 336 | 337 | return output 338 | -------------------------------------------------------------------------------- /workspace/modules/models/tree_lstm_base.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import re 4 | import ast # for python AST respresentation 5 | import random 6 | import torch 7 | import torch.legacy.nn as nn 8 | from abc import ABC, abstractmethod 9 | 10 | 11 | class TreeLSTM(nn.Module, ABC): 12 | """ 13 | Tree LSTM Interface reimplemented from the paper 14 | 'Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks' 15 | https://arxiv.org/abs/1503.00075 16 | LUA implementation: https://github.com/stanfordnlp/treelstm 17 | 18 | 19 | """ 20 | 21 | def __init__(self, in_dim, mem_dim): 22 | super().__init__() 23 | self.in_dim = in_dim 24 | if self.in_dim == None: 25 | print('input dimension must be specified') 26 | self.mem_dim = mem_dim 27 | # memory initialized with zeros 28 | self.zeros = torch.zeros(self.mem_dim) 29 | # boolean to check if model is training or evaluating 30 | self.train = False 31 | 32 | @abstractmethod 33 | def forward(self, tree, inputs): 34 | pass 35 | 36 | @abstractmethod 37 | def backward(self, tree, inputs, grad): 38 | pass 39 | 40 | # TODO ? 41 | 42 | def allocate_module(self, tree, module): 43 | pass 44 | 45 | def free_module(self, tree, module): 46 | pass 47 | -------------------------------------------------------------------------------- /workspace/modules/nn_modules/rnn_defect.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import re 4 | import ast # for python AST respresentation 5 | import random 6 | import torch 7 | import torch.legacy.nn as nn 8 | import modules.models.rnn_example as rnn 9 | import sklearn.linear_model as lm 10 | import numpy as np 11 | 12 | 13 | class RNNDefect(): 14 | """ 15 | RNN Neural network for processes of Defect prediction. 
16 | Its use is to understand the core processes of the Defect prediction Task 17 | -> that is archieved by adjusting the processes that are meant for the treelstm 18 | """ 19 | 20 | def __init__(self, pipeline): 21 | """ 22 | :param pipeline: holds all necessary information for nnsimulation 23 | """ 24 | # global dictionary 25 | self.dictionary = pipeline.dictionary 26 | # vector length 27 | self.emb_dim = len(pipeline.emb_matrix[0]) 28 | self.emb_matrix = pipeline.emb_matrix 29 | # embedding matrix as lookup table 30 | self.emb_matrix_look = nn.LookupTable( 31 | len(pipeline.emb_matrix), self.emb_dim) 32 | self.emb_matrix_look.weight = self.emb_matrix 33 | # length of input sequence for RNN 34 | self.in_len = 5 35 | 36 | # rnn properties 37 | # memory dimension 38 | self.mem_dim = 150 39 | # learning rate 40 | self.learning_rate = 0.05 41 | # word vector embedding learning rate 42 | self.emb_learning_rate = 0.0 43 | # minibatch size 44 | self.batch_size = 25 45 | # regulation strength 46 | self.reg = 1e-4 47 | # simulation module hidden dimension 48 | self.sim_nhidden = 50 49 | 50 | # optimization configuration 51 | self.optim_state = {self.learning_rate} 52 | 53 | # negative log likeligood optimization objective 54 | self.criterion = nn.ClassNLLCriterion() 55 | 56 | # models 57 | # initialize code learning model 58 | self.rnn = rnn.RNN_Example(self) 59 | # initialize classification model 60 | self.log_reg = lm.LogisticRegression() 61 | 62 | def train_datasets(self, dataset_def, dataset_cln, test_cln): 63 | """ 64 | training for defect prediction 65 | 66 | ! consists of 2 main training steps: 67 | 1 training RNN to learn how code should look like 68 | 2 training a classifier with labeled(clean/defective) code 69 | 70 | :param dataset_def: dataset containing defective asts 71 | :param dataset_cln: dataset containing clean asts 72 | :param test_cln: dataset containing clean test asts 73 | """ 74 | # training rnn 75 | self.train_clean(dataset_cln, test_cln) 76 | # training classifier 77 | self.train_pred(dataset_def, dataset_cln) 78 | 79 | def predict(self, test): 80 | """ 81 | calls testdata upon RNN to obtain Reconstruction accuracy. 
82 | the Reconstruction accuracy will be classified to find out how defective files can be 83 | :param test: testdata whose defectiveness is tested 84 | :returns: true if likely to be defective; false if not 85 | """ 86 | 87 | def classify(accuracy): 88 | """ 89 | classification process to determine the probability of data based on 90 | NN code recunstruction accuracy 91 | :param accuracy: NN code recunstruction accuracy 92 | :returns: percentage of likelihood of defectiveness 93 | """ 94 | return 1-accuracy # test classifier 95 | 96 | accuracy = self.rnn.test_network( 97 | self.parents_children(test, self.in_len)) # TODO input weird 98 | bug_prob = classify(accuracy) 99 | if bug_prob > 0.5: 100 | return 1 101 | else: 102 | return 0 103 | 104 | ############################### TRAINIGS ############################# 105 | 106 | # Training of RNN # 107 | 108 | def train_clean(self, dataset_cln, test_cln): 109 | """ 110 | trains and tests RNN with clean datafiles 111 | 112 | :param dataset_cln: dataset containing clean training asts 113 | :param test_cln: dataset containing clean test asts 114 | :returns: void; saves best model in folder 115 | 116 | TODO delete/adjust when made use of TreeLSTM 117 | """ 118 | # preparing in and outputs for recurrent neural network 119 | train_out, train_in = self.prepare_parents_children(dataset_cln) 120 | test_out, test_in = self.prepare_parents_children(test_cln) 121 | 122 | # embedding of datasets | Makes only sense for RNN because we loose context 123 | emb_train_out, emb_train_in = self.embed(train_out, train_in) 124 | # print("Embedded Training IN OUT\n", emb_train) 125 | emb_test_out, emb_test_in = self.embed(test_out, test_in) 126 | self.rnn.run(emb_train_in, emb_train_out, emb_test_in, emb_test_out) 127 | 128 | def prepare_parents_children(self, datasets): 129 | """ 130 | creates all traing/testing/validation data in and outputs for RNN 131 | :param datasets: list containing ASTs 132 | :retuns: 2 dimensional list that holds list of all children for parent at same index 133 | -> for all datasets 134 | TODO this works without lstmcontext now because we use standard RNN 135 | that must be changed/deleted later and be processed in the tree LSTM 136 | """ 137 | # collect parent children pairs first 138 | all_parents = [] 139 | all_children = [] 140 | for tree in datasets: 141 | parents, chilren = self.parents_children(tree, self.in_len) 142 | all_parents.extend(parents) 143 | all_children.extend(chilren) 144 | return all_parents, all_children 145 | 146 | def parents_children(self, tree, sequencelength): 147 | """ 148 | extracts all parents with its children from given AST 149 | :param tree: 2b-extracted python AST 150 | :returns: 2 lists that represent list of all children for parent at same index 151 | """ 152 | parents = [] 153 | children = [] 154 | for node in ast.walk(tree): 155 | loc_children = [] 156 | loc_children_ast = ast.iter_child_nodes(node) 157 | # test if node is branch, if yes then its ignored 158 | for child_ast in loc_children_ast: 159 | loc_children.append(child_ast.__class__.__name__) 160 | if not len(loc_children) == 0: 161 | parents.append(node.__class__.__name__) 162 | 163 | # im sorry for everyone who has to see this 164 | while len(loc_children) < sequencelength: 165 | loc_children.append("") 166 | 167 | children.append(loc_children) 168 | 169 | return parents, children 170 | 171 | def embed(self, parents, childrens): 172 | """ 173 | embedding for ast node names of parents and children 174 | 175 | :param parents: list holding parent 
tokens
176 |         :param children: list holding children tokens
177 |         :returns: index representation of the parents and vector representation of the children
178 |         """
179 | 
180 |         all_parents = []
181 |         for parent in parents:
182 |             all_parents.append([self.ast2index(parent)])
183 |         all_children = []
184 |         for children in childrens:
185 |             embedded_children = []
186 |             for child in children:
187 |                 # creating a combined vector of the children's vector values
188 |                 # (flat list of embedded child-node vectors per parent)
189 |                 embedded_children.append(self.ast2vec(child))
190 |             all_children.append(embedded_children)
191 |         # convert to numpy array
192 |         return all_parents, np.array(all_children, dtype=float)
193 | 
194 |     def ast2vec(self, ast_node):
195 |         """
196 |         embedding of a single AST node with the use of the local dictionary and embedding matrix
197 | 
198 |         :param ast_node: to-be-embedded AST node
199 |         :returns: vector representation of the node
200 |         """
201 |         # find the index first
202 |         index = self.ast2index(ast_node)
203 |         # look up the index in the embedding matrix
204 |         return self.emb_matrix[index]
205 | 
206 |     def ast2index(self, ast_node):
207 |         """
208 |         index embedding of a single AST node with the use of the local dictionary;
209 |         for one-hot
210 | 
211 |         :param ast_node: to-be-embedded AST node
212 |         :returns: index representation of the node
213 |         """
214 |         if ast_node in self.dictionary:
215 |             index = self.dictionary.index(ast_node)
216 |         else:
217 |             # last element in dictionary is the Unknown type; equals dictionary.index("UNK")
218 |             index = len(self.dictionary)-1
219 |         return index
220 | 
221 |     # Training the classification #
222 | 
223 |     def train_pred(self, def_data, cln_data):
224 |         """
225 |         trains the module's classifier with 2 types of data
226 |         :param def_data: defective-labeled data
227 |         :param cln_data: clean-labeled data
228 |         :returns: void; saves best model in folder
229 |         """
230 |         # obtaining lists containing reconstruction accuracies as inputs for logistic regression
231 |         # TODO we're creating a lot of subtrees of each AST (both inputs at this point hold only one AST)
232 |         def_in_out = self.parents_children(def_data[0], self.in_len)
233 |         cln_in_out = self.parents_children(cln_data[0], self.in_len)
234 |         def_in = []
235 |         cln_in = []
236 |         # TODO right now we're working with subtrees / that shouldn't stay like that because not all subtrees are defective
237 |         # BUT it's interesting because it looks at the internal AST structure - maybe that should happen in the NN instead
238 |         for i in range(len(def_in_out[0])):
239 |             # dimension 0 is parents and 1 is children
240 |             def_in.append(self.rnn.test_network(
241 |                 [def_in_out[0][i], def_in_out[1][i]]))
242 |         for i in range(len(cln_in_out[0])):
243 |             # dimension 0 is parents and 1 is children
244 |             cln_in.append(self.rnn.test_network(
245 |                 [cln_in_out[0][i], cln_in_out[1][i]]))
246 |         print("Defective Reconstruction Accuracies\n", def_in)
247 |         print("Clean Reconstruction Accuracies\n", cln_in)
248 |         # inputs and outputs for logistic regression (1 = defective, 0 = clean, matching X = def_in + cln_in)
249 |         X = def_in + cln_in
250 |         Y = [1 for i in range(len(def_in))] + [0 for i in range(len(cln_in))]
251 |         # TODO TRAIN
252 | 
--------------------------------------------------------------------------------
/workspace/modules/nn_modules/tree_lstm_defect.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | import re
4 | import ast  # for python AST representation
5 | import random
6 | import torch
7 | import torch.legacy.nn as nn
8 | 
9 | 
10 | class TreeLSTMDefect:
11 |     """
12 |     The actual TreeLSTM Module training and testing processes for 
DefectPrediction. 13 | with the instanciation of a TreeLSTM it will be able to do defect predictions for one file(one AST) 14 | """ 15 | 16 | def __init__(self, pipeline): 17 | """ 18 | :param pipeline: holds all necessary information for nnsimulation 19 | """ 20 | # global dictionary 21 | self.dictionary = pipeline.dictionary 22 | # vector length 23 | self.emb_dim = len(pipeline.emb_matrix[0]) 24 | self.emb_matrix = pipeline.emb_matrix 25 | # embedding matrix as lookup table 26 | self.emb_matrix_look = nn.LookupTable( 27 | len(pipeline.emb_matrix), self.emb_dim) 28 | self.emb_matrix_look.weight = self.emb_matrix 29 | # length of input sequence for RNN 30 | self.in_len = 3 31 | 32 | # lstm properties 33 | # memory dimension 34 | self.mem_dim = 150 35 | # learning rate 36 | self.learning_rate = 0.05 37 | # word vector embedding learning rate 38 | self.emb_learning_rate = 0.0 39 | # minibatch size 40 | self.batch_size = 25 41 | # regulation strength 42 | self.reg = 1e-4 43 | # simulation module hidden dimension 44 | self.sim_nhidden = 50 45 | 46 | # optimization configuration 47 | self.optim_state = {self.learning_rate} 48 | 49 | # negative log likeligood optimization objective 50 | self.criterion = nn.ClassNLLCriterion() 51 | 52 | ''' 53 | self.etree_lstm = etree.ETreeLSTM(self) 54 | try: 55 | self.params, self.grad_params = self.etree_lstm._flatten( 56 | self.etree_lstm.parameters()) 57 | except: 58 | self.params = self.grad_params = torch.zeros(1) 59 | ''' 60 | 61 | def train_datasets(self, dataset_def, dataset_cln, test_cln): 62 | """ 63 | training for the TreeLSTM 64 | """ 65 | 66 | def train_clean(self, dataset_cln, test_cln): 67 | """ 68 | trains and tests TreeLSTM with clean datafiles 69 | TODO in TreeLSTM 70 | consists of 3 steps for a tree: 71 | - recursively (from branch) walk over children and let them predict the parent node 72 | - Compare the prediction with actual node 73 | - adjust weights of model so that the difference is minimal 74 | 75 | :param dataset_cln: dataset containing clean training asts 76 | :param test_cln: dataset containing clean test asts 77 | :returns: void; saves best model in folder 78 | """ 79 | 80 | def predict_parent(self, children): 81 | """ 82 | predicting parent node based on child nodes 83 | : param children: list of children nodes 84 | : returns: most likely parent node 85 | """ 86 | pass 87 | 88 | def predict(self, tree): 89 | """ 90 | predicting defectiveness of a file/tree 91 | : param tree: 2b-evaluated abstract sytax tree 92 | : returns: likelihood of defectiveness 0-1 93 | """ 94 | pass 95 | 96 | def predict_def_datasets(self, dataset_def, dataset_cln): 97 | """ 98 | iterates over data and calculates the overall correctness of predictions 99 | : param dataset_def: dataset containing defective asts 100 | : param dataset_cln: dataset containing clean asts 101 | : returns: overall precision of Network 0-1 102 | """ 103 | pass 104 | 105 | ############################ 106 | """ 107 | TODO delete the following functions when everything runs 108 | theyre just here for some lookups but dont have any purpose 109 | """ 110 | 111 | def lstm_unit(self, ast_node, depth=0): 112 | """ 113 | Process of one LSTM unit. 114 | Recursively calls learning processes on all children in one tree 115 | 116 | : param ast_node: one Python AST node; First call will be with root Node 117 | : returns: hidden state and context of node; eventually for the whole AST 118 | """ 119 | weight = torch.tensor([]) # TODO weights with lstm calculation!! 
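        # Note (reference sketch, not yet wired in here): the Child-Sum Tree-LSTM update
        # from Tai et al. (2015), which this unit is meant to implement, is, per node j
        # with children k and learned parameters W_*, U_*, b_*:
        #   h_tilde = sum_k h_k
        #   i   = sigmoid(W_i x_j + U_i h_tilde + b_i)
        #   f_k = sigmoid(W_f x_j + U_f h_k + b_f)      (one forget gate per child)
        #   o   = sigmoid(W_o x_j + U_o h_tilde + b_o)
        #   u   = tanh(W_u x_j + U_u h_tilde + b_u)
        #   c   = i * u + sum_k f_k * c_k
        #   h   = o * tanh(c)
        # The gate computations below still apply the activations to an empty placeholder
        # tensor instead of these affine transforms.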
120 | w_t = ast2vec(ast_node, self.dictionary, 121 | self.emb_matrix) # embedding of tree 122 | # sum of children hidden outputs 123 | h_ = 0 124 | # child hidden state 125 | h_k = 0 126 | # context of child 127 | c_k = 0 128 | # forget gates 129 | f_tk = 0 130 | # childrem forgetrates times the context 131 | c_ = 0 132 | for k in ast.iter_child_nodes(ast_node): 133 | print(k, depth) 134 | h_k, c_k = self.lstm_unit(k, depth+1) 135 | f_tk = torch.nn.Sigmoid()(weight) 136 | h_ += h_k 137 | c_ += (f_tk * c_k) 138 | # input gate 139 | i_t = torch.nn.Sigmoid()(weight) 140 | # vector of new candidate values for t 141 | c_t_ = torch.nn.Tanh()(weight) 142 | # context 143 | c_t = i_t * c_t_ + c_ 144 | # output gate 145 | o_t = torch.nn.Sigmoid()(weight) 146 | h_t = o_t * torch.nn.Tanh()(c_t) 147 | 148 | return h_t, c_t 149 | 150 | def train_clean_trash(self, trees): 151 | """ 152 | consists of 3 steps for a tree: 153 | - recursively(from branch) walk over children and let them predict the parent node 154 | - Compare the prediction with actual node 155 | - adjust weights of model so that the difference is minimal 156 | """ 157 | bar = Bar('Training', max=len(trees)) 158 | self.etree_lstm.train = True 159 | indices = torch.randperm(len(trees)) 160 | zeros = torch.zeros(self.mem_dim) 161 | for i in range(1, len(trees)+1, self.batch_size): 162 | bar.next() # printing progress 163 | batch_size = min(i+self.batch_size - 1, len(trees))-i+1 164 | 165 | def f_eval(): 166 | pass 167 | 168 | # torch.optim.Adagrad(self.params, self.optim_state) 169 | bar.finish() 170 | -------------------------------------------------------------------------------- /workspace/pipelines/defect_prediction.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import re 4 | import ast # for python AST respresentation 5 | import random 6 | from progress.bar import Bar 7 | import torch 8 | import torch.legacy.nn as nn 9 | # for childsum tree LSTM model 10 | import modules.nn_modules.tree_lstm_defect as tlstm_dp 11 | import modules.nn_modules.rnn_defect as rnn_dp 12 | 13 | 14 | def read_file(path): 15 | """ 16 | reads file and returns it 17 | :param path: path to 2b-extracted file 18 | :return: string of file 19 | """ 20 | with open(path, 'r') as file: 21 | return file.read() 22 | 23 | 24 | def tidy(files): 25 | """ 26 | doing first tidying processes on code before parsing 27 | :param files: 2b-tidied code files 28 | :returns: string code 29 | """ 30 | # removing comments 31 | def stripComments(code_str): 32 | code_str = str(code) 33 | return re.sub(r'(?m)^ *#.*\n?', '', code_str) 34 | 35 | # removing docstrings TODO 36 | 37 | t_files = [] 38 | for code in files: 39 | t_files.append(stripComments(code)) 40 | return t_files 41 | 42 | 43 | def parse_data(data): 44 | """ 45 | parsing process for python ast representation 46 | :param data: already tidied code files 47 | :returns: list containing all parsed files 48 | """ 49 | ast_data = [] 50 | for code in data: 51 | ast_data.append(ast.parse(code)) 52 | return ast_data 53 | 54 | 55 | def create_dictionary(datasets, max_count): 56 | """ 57 | Builds a fixed sized dictionary, with an entry at last position 58 | 59 | :param dataset: tokenized list, optionally n dimensional 60 | :param max_count: defines how long our dictionary will be 61 | :return: top 'max_count' tokens of dictionary as a list 62 | """ 63 | 64 | def extract_highest_occurences(dataset, max_count): 65 | """ 66 | finds highest token occurences in datasets 67 | 68 
| :param dataset: tokenized list, optionally n dimensional 69 | :param max_count: defines how long our dictionary will be 70 | :return: top 'max_count' tokens of dictionary as a list 71 | """ 72 | # global fixed size dictionary as set 73 | global_dict = {} 74 | # iterating over all training files 75 | for tree in dataset: 76 | # a complete dictionary for one file 77 | local_dict = {} 78 | # iterating over all words of a file 79 | for ast_node in ast.walk(tree): 80 | # if word not in the local dictionary then we add it 81 | # otherwise we rise count 82 | node = ast_node.__class__.__name__ 83 | if node in local_dict: 84 | local_dict[node] += 1 85 | else: 86 | local_dict[node] = 1 87 | # local dict counts will now be merged into fix sized, global dictionary 88 | for ast_node in local_dict: 89 | if ast_node in global_dict: 90 | global_dict[ast_node] += local_dict[ast_node] 91 | else: 92 | global_dict[ast_node] = local_dict[ast_node] 93 | 94 | # global dict will be filled with highest counts 95 | # first we find highest count 96 | highest_count = 0 97 | for ast_node in global_dict: 98 | if highest_count < global_dict[ast_node]: 99 | highest_count = global_dict[ast_node] 100 | # now we create an updated highest count global dict 101 | new_global_dict = {} 102 | while len(new_global_dict) < max_count: 103 | if highest_count < 1 or len(new_global_dict) >= 2*len(global_dict): 104 | break 105 | # filling new global 106 | for ast_node in global_dict: 107 | if (global_dict[ast_node] == highest_count and 108 | len(new_global_dict) < max_count): 109 | new_global_dict[ast_node] = highest_count 110 | highest_count -= 1 111 | global_dict = new_global_dict 112 | # print("maxcout", global_dict) 113 | return list(global_dict) 114 | 115 | # running through dimensions 116 | for dataset in datasets: 117 | dictionary = extract_highest_occurences(dataset, max_count-1) 118 | dictionary.append("UNK") 119 | return dictionary 120 | 121 | 122 | def truncate(f, n): 123 | '''Truncates/pads a float f to n decimal places without rounding''' 124 | s = '{}'.format(f) 125 | if 'e' in s or 'E' in s: 126 | return '{0:.{1}f}'.format(f, n) 127 | i, p, d = s.partition('.') 128 | return '.'.join([i, (d+'0'*n)[:n]]) 129 | 130 | 131 | def random_embed(dictionary, vector_length): 132 | """ 133 | embeds dictionary with random valued vectors for initializing random weights. 
134 | values lay between -1 and 1 135 | 136 | :param dictionary: 2b-embedded dictionary 137 | :param vector_length: desired length for vectors 138 | :returns: embedded dictionary as matrix 139 | """ 140 | e_mat = [] 141 | # create matric with dimension dictionarysize x vector length 142 | for mi in range(len(dictionary)): 143 | e_vec = [] 144 | for vi in range(vector_length): 145 | e_vec.append(truncate(random.uniform(-1.0, 1.0), 2)) 146 | e_mat.append(e_vec) 147 | return e_mat 148 | 149 | def one_hot(dictionary): 150 | """ 151 | embeds dictionary with one hot vectors : zero vector with 1 at specified position 152 | 153 | :param dictionary: 2b-embedded dictionary 154 | :returns: embedded dictionary as matrix 155 | """ 156 | e_mat = [] 157 | for mi in range(len(dictionary)): 158 | e_vec = [0 for i in range(len(dictionary))] 159 | e_vec[mi] = 1 160 | e_mat.append(e_vec) 161 | return e_mat 162 | 163 | 164 | class DefectPrediction: # main pipeline 165 | """ 166 | implementation Attemt to a published Paper: 167 | 'A deep tree-based model for software defect prediction' 168 | Reference: https://arxiv.org/abs/1802.00921 169 | 170 | TASK: Predicting Probability of a Code Being Defective or not 171 | """ 172 | 173 | def __init__(self, data_defective, data_clean, data_test): 174 | """ 175 | :param data_defective: path to datacorpus code labled as defective 176 | :param data_clean: path to datacorpus code labled as clean 177 | """ 178 | self.raw_data_defective = data_defective 179 | self.raw_data_clean = data_clean 180 | self.raw_data_test = data_test 181 | # vocabulary/dictionary size 182 | self.voc_size = 100 183 | self.vec_length = 3 # = voc size for one hot 184 | 185 | def run(self): 186 | """ 187 | runs whole Pipeline with already initialized defective and clean datasets 188 | """ 189 | # PREPROCESSING ########### 190 | 191 | # cleaning and opening files TODO manage datacorpus with lables and crawling 192 | data_def = tidy([read_file(self.raw_data_defective)]) 193 | data_cln = tidy([read_file(self.raw_data_clean)]) 194 | data_test = tidy([read_file(self.raw_data_test)]) 195 | 196 | # transforming file strings to AST and filling datasets 197 | # parsing with own funciton ast_data_def_exp = code2ast(data_def) TODO manage error/empty files 198 | # parsing with ast.parse 199 | 200 | ast_data_def = parse_data(data_def) 201 | ast_data_cln = parse_data(data_cln) 202 | ast_data_test = parse_data(data_test) 203 | 204 | # print ast data 205 | #("Defective Data AST:\n", ast.dump(ast_data_def[0])) 206 | #print("Clean Data AST:\n", ast.dump(ast_data_cln[0])) 207 | print("Test Data AST:\n", ast.dump(ast_data_test[0])) 208 | 209 | # vocabulary of highest occurences 210 | self.dictionary = create_dictionary( 211 | [ast_data_cln, ast_data_def], self.voc_size) # TODO manage whole datastorage 212 | print("Dictionary:\n", self.dictionary) 213 | 214 | # EMBEDDING ########### TODO learning? 215 | # random embedding of dictionary; as initializing! 
216 |         self.emb_matrix = one_hot(self.dictionary)
217 |         # print("Embedding Matrix:\n", self.emb_matrix)
218 | 
219 |         # NEURAL NETWORK ########### TODO
220 |         # initializing the model
221 |         model = rnn_dp.RNNDefect(self)
222 | 
223 |         # training parent prediction on clean data (the test data is also used for general testing)
224 |         model.train_datasets(ast_data_def, ast_data_cln, ast_data_test)
225 | 
226 |         # predicting defectiveness of the test data
227 |         # RESULTS ###########
228 |         defective = model.predict(ast_data_test)
229 |         if defective:
230 |             print("The Test Data is likely to be defective")
231 |         else:
232 |             print("The Test Data is not likely to be defective")
233 | 
--------------------------------------------------------------------------------
/workspace/test_logistic_regression.py:
--------------------------------------------------------------------------------
1 | import unittest
2 | import torch
3 | import numpy as np
4 | import os
5 | import sys
6 | 
7 | 
8 | import modules.models.logistic_regression as lr
9 | 
10 | lo_reg_config = {"epochs": 400, "learning_rate": 0.02}
11 | _lr = lr.LogisticRegression(lo_reg_config)
12 | X_train = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
13 | Y_train = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0])
14 | 
15 | 
16 | class TestPublicFunctions(unittest.TestCase):
17 |     # checks if calculations are correct
18 |     def test_assert_sigmoid_calc(self):
19 |         cal = lr.sigmoid(2)
20 |         self.assertAlmostEqual(cal, 0.8807970779778823)
21 | 
22 | 
23 | class TestLoRegInit(unittest.TestCase):
24 |     # checks if the default init works correctly
25 |     def test_assert_default_init(self):
26 |         de_lr = lr.LogisticRegression()
27 |         self.assertTrue(de_lr.epochs == 50 and de_lr.learning_rate ==
28 |                         0.01)
29 | 
30 |     # checks if a non-default init works
31 |     def test_assert_non_default_init(self):
32 |         self.assertTrue(_lr.epochs == 400 and _lr.learning_rate ==
33 |                         0.02)
34 | 
35 | 
36 | class TestLoRegTraining(unittest.TestCase):
37 |     # checks if training accepts the input
38 |     def test_assert_training_input(self):
39 |         _lr.train(X_train, Y_train)
40 | 
41 | 
42 | class TestLogRegTesting(unittest.TestCase):
43 |     # checks if the model's accuracy makes sense
44 |     def test_assert_testaccuracy(self):
45 |         _lr.train(X_train, Y_train)
46 |         prediction1 = _lr.test(0.1)
47 | 
48 |         self.assertTrue(prediction1 == 1)  # and prediction2 == 1)
49 | 
50 | 
51 | if __name__ == "__main__":
52 |     unittest.main()
53 | 
--------------------------------------------------------------------------------
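A quick sanity check of the threshold fit on the unit-test data above (assuming the nine-element Y_train): LogisticRegression.train() gets m_1 = mean(0.1, 0.2, 0.3, 0.4) = 0.25 for the defective-labeled inputs and m_0 = mean(0.5, ..., 0.9) = 0.7 for the clean-labeled ones, so T = 0.25 + (0.7 - 0.25) * (4/9) = 0.45; test(0.1) is not above T and therefore returns 1 (defective), which is what test_assert_testaccuracy expects.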