├── .travis.yml ├── README.md ├── brnn_model.png ├── markov_chain.py ├── naive_bayes.py ├── others ├── brnn_sequence_analyzer.py ├── brnn_sequence_analyzer_gen.py ├── rnn_sequence_analyzer_gen.py ├── sequence_analyzer.py └── sequence_analyzer_gen.py ├── requirements.txt ├── rnn_model.png └── rnn_sequence_analyzer.py /.travis.yml: -------------------------------------------------------------------------------- 1 | before_install: 2 | - sudo apt-get install libblas-dev liblapack-dev libatlas-base-dev gfortran 3 | language: python 4 | python: 5 | - "2.7" 6 | # command to install dependencies 7 | install: "pip install -r requirements.txt" 8 | # command to run tests 9 | script: nosetests 10 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # sequence-rnn-py 2 | 3 | [![Build Status](https://travis-ci.org/fluency03/sequence-rnn-py.svg?branch=master)](https://travis-ci.org/fluency03/sequence-rnn-py) 4 | 5 | This program analyze the sequence using (Uni-directional and Bi-directional) Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) based on the python library Keras ([Documents](http://keras.io/) and [Github](https://github.com/fchollet/keras)). 6 | It is based on this [lstm_text_generation.py](https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py) and this [imdb_bidirectional_lstm.py]( https://github.com/fchollet/keras/blob/master/examples/imdb_bidirectional_lstm.py) examples of Keras. 7 | 8 | 9 | *This is part of my master thesis project and still in development.* 10 | 11 | ## Requirements 12 | 13 | - [Python 2.7](https://www.python.org/downloads/) 14 | - [NumPy](http://www.numpy.org/): The fundamental package needed for scientific computing with Python. 15 | - [SciPy](http://scipy.org/): Python-based ecosystem of open-source software for mathematics, science, and engineering. 16 | - [Theano](http://deeplearning.net/software/theano/): A Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. 17 | - [Tensorflow](https://www.tensorflow.org/): An open source software library for numerical computation using data flow graphs. 18 | - [Keras>=1.0](http://keras.io/): A minimalist, highly modular neural networks library, written in Python and capable of running on top of either TensorFlow or Theano. Update the Keras: 19 | 20 | `pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps` . 21 | 22 | - **GPU Support** (optional but highly recommended). Instructions of enabling GPU are here: [for Theano](http://deeplearning.net/software/theano/install.html#using-the-gpu) and [for TensorFlow](https://www.tensorflow.org/versions/r0.7/get_started/os_setup.html#optional-linux-enable-gpu-support). 23 | - [pydot](https://github.com/erocarrera/pydot) and [graphviz](http://www.graphviz.org/) (optional, if you want to plot the model) 24 | - [HDF5](https://www.hdfgroup.org/HDF5/) and [h5py](http://www.h5py.org/) (optional, if you use model saving/loading functions) 25 | 26 | 27 | ## Materials 28 | 29 | A serias of Recurrent Neural Networks Tutorial: 30 | 31 | 1. [Part 1 - Introduction to RNNs](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) 32 | 2. 
[Part 2 - Implementing a RNN with Python, Numpy and Theano](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/) 33 | 3. [Part 3 - Backpropagation Through Time and Vanishing Gradients](http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/) 34 | 4. [Part 4 - Implementing a GRU/LSTM RNN with Python and Theano](http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/) 35 | 36 | Two great materials about LSTM: [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) by [Christopher Olah](http://colah.github.io/) and [Understanding LSTM and its diagrams](https://medium.com/@shiyan/understanding-lstm-and-its-diagrams-37e2f46f1714#.5hkwmotmr) by [Shi Yan](https://medium.com/@shiyan). 37 | 38 | The best post on [Andrej Karpathy's blog](http://karpathy.github.io/) regarding sequence prediction using RNNs: [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) 39 | 40 | A deeper treatment of RNNs: [Chapter 10 - Sequence Modeling: Recurrent and Recursive Nets](http://www.deeplearningbook.org/contents/rnn.html) from the book [MIT Deep Learning](http://www.deeplearningbook.org/). 41 | 42 | 43 | ## Model 44 | 45 | 46 | - Uni-directional RNN model with two LSTM layers: 47 | 48 | ![ RNN LSTM ](https://github.com/fluency03/sequence-rnn-py/blob/master/rnn_model.png "RNN LSTM") 49 | 50 | 51 | - Bi-directional RNN model with one LSTM layer: 52 | 53 | ![ BRNN LSTM ](https://github.com/fluency03/sequence-rnn-py/blob/master/brnn_model.png "BRNN LSTM") 54 | 55 | 56 | - Naive Bayes model: 57 | 58 | [naive_bayes.py](https://github.com/fluency03/sequence-rnn-py/blob/master/naive_bayes.py) is a simple Naive Bayes model used for comparison. 59 | 60 | 61 | ## Data 62 | 63 | - Training Set 64 | 65 | - Validation Set 66 | 67 | - Test Set 68 | 69 | 70 | 71 | ## Training 72 | 73 | [hyperas](https://github.com/maxpumperla/hyperas) may help here. It is *a very simple convenience wrapper around [hyperopt](https://github.com/hyperopt/hyperopt) for fast prototyping with keras models* and is used for hyper-parameter optimization. An example can be found [here](https://github.com/maxpumperla/hyperas/blob/master/examples/lstm.py). 74 | 75 | Two good materials: 76 | 77 | - [CHAPTER 3: Improving the way neural networks learn](http://neuralnetworksanddeeplearning.com/chap3.html) by [Michael Nielsen](http://michaelnielsen.org/) 78 | - [Neural Networks Part 2: Setting up the Data and the Loss](http://cs231n.github.io/neural-networks-2/) and [Neural Networks Part 3: Learning and Evaluation](http://cs231n.github.io/neural-networks-3/) from the Stanford CS class [CS231n: Convolutional Neural Networks for Visual Recognition](http://cs231n.github.io/) 79 | 80 | Considerations: 81 | 82 | - **Batch Size**: how many streams of data are processed in parallel at one time. 83 | 84 | 85 | - **Samples per epoch** and **Batches per epoch**: how many samples or batches are considered per epoch. Based on some of my experiments: (i) the more samples there are, the higher the accuracy and the lower the loss that can be reached at the stable stage; (ii) the more batches (the integer ratio #samples/batch_size) there are, the higher the accuracy and the lower the loss at the stable stage, and the fewer iterations it takes to reach the same loss/accuracy value. A small sketch of how these quantities follow from the sampling scheme is given below.
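For illustration, here is a minimal sketch (with purely hypothetical numbers, not taken from this repository's data) of how the number of samples and batches per epoch follows from the sequence length, the sentence length, the sampling step and the batch size, mirroring the sampling loop used in `get_data()`:

```python
# Hypothetical numbers, for illustration only.
sequence_length = 100000   # total length of the training sequence
sentence_length = 80       # length of each training sentence
step = 80                  # sampling step
batch_size = 128           # number of sentences per batch

# get_data() takes sentences starting at 0, step, 2*step, ... as long as
# a full sentence and its shifted target still fit in the sequence.
nb_samples = len(range(0, sequence_length - sentence_length, step))
nb_batches = nb_samples // batch_size   # integer ratio #samples / batch_size

print(nb_samples)   # 1249 sentences per epoch
print(nb_batches)   # 9 full batches of 128 sentences each
```

Adjust the numbers to your own data; nothing above is specific to this repository beyond the shape of the sampling loop.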
86 | 87 | 88 | - **Sentence Length**: according to [char-rnn](https://github.com/karpathy/char-rnn): 89 | > The length of each data stream, which is also the limit at which the gradients can propagate backwards in time. For example, if seq_length is 20, then the gradient signal will never backpropagate more than 20 time steps, and the model might not find dependencies longer than this length in number of characters. 90 | 91 | This is effectively the limit of the model's long-term memory. 92 | 93 | > Thus, if you have a very difficult dataset where there are a lot of long-term dependencies, you will want to increase this setting. 94 | 95 | 96 | - **Offset during sampling**: the offset is the start index used when sampling X_train and y_train from the original sequence. The offset can be a fixed value or a random value in the range 0 ~ step-1. 97 | 98 | 99 | - **Data size vs. #parameters** in total: 100 | - #layers: the number of layers; [here](https://github.com/karpathy/char-rnn) it is suggested to always use a num_layers of either 2 or 3. 101 | - layer size: the number of units per layer. 102 | 103 | According to [char-rnn](https://github.com/karpathy/char-rnn), the two important quantities to keep track of here are: 104 | - The total number of parameters in your model. 105 | - The size of your dataset. 106 | These two should be about the same order of magnitude. 107 | 108 | **How to calculate the number of parameters in an RNN?** For example, consider one LSTM layer: 109 | - if it has a layer size of `H=512`; 110 | - if the vocabulary size is `C=3000` (the number of unique classes); 111 | - in a simplified, vanilla-RNN style view, the layer has three parameter matrices: `U` with dimension `(H, C)=(512, 3000)`, `V` with dimension `(C, H)=(3000, 512)`, `W` with dimension `(H, H)=(512, 512)` (an actual LSTM layer has four gates and correspondingly more weights); 112 | - the total number of parameters for one layer will then be `2HC + H^2`, which is **3,334,144** in this case. 113 | - That is over 3 million parameters for only one layer! 114 | 115 | 116 | - **Learning Rate**: this ratio influences the speed (the step size of gradient descent) and the quality of learning. The greater the ratio, the faster the network trains; the lower the ratio, the more accurate the training is. According to [LSTM: A Search Space Odyssey](http://arxiv.org/pdf/1503.04069v1.pdf) [\[1\]](https://github.com/fluency03/sequence-rnn-py#1-greff-klaus-rupesh-kumar-srivastava-jan-koutník-bas-r-steunebrink-and-jürgen-schmidhuber-lstm-a-search-space-odyssey-arxiv-preprint-arxiv150304069-2015): 117 | > The learning rate is by far the most important hyperparameter. And based on their suggestion, while searching for a good learning rate for the LSTM, it is sufficient to do a coarse search by starting with a high value (e.g. 1.0) and dividing it by ten until performance stops increasing. 118 | 119 | 120 | - **[Dropout](http://keras.io/layers/core/#dropout)**: a float between 0 and 1, indicating what fraction of a hidden layer's output is ignored when feeding the next layer. It is a powerful regularization method, mainly used to avoid overfitting. If your model is overfitting, it is better to increase the dropout value. A minimal sketch combining dropout with the layer sizes above is shown below.
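To make the dropout and parameter-count considerations above concrete, here is a minimal sketch. It assumes the Keras 1.x `Sequential` API and reuses the hypothetical sizes `H=512` and `C=3000` from the example above; it is not necessarily the exact model built in this repository:

```python
# Minimal sketch, not the repository's actual model definition.
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense
from keras.layers.wrappers import TimeDistributed

H, C, sentence_length = 512, 3000, 80   # hidden size, #classes, sentence length

model = Sequential()
model.add(LSTM(H, return_sequences=True, input_shape=(sentence_length, C)))
model.add(Dropout(0.2))                  # randomly drop 20% of the activations during training
model.add(LSTM(H, return_sequences=True))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(C, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# Simplified estimate from the bullet above (treating the layer as a plain
# RNN with matrices U, V, W); an actual LSTM layer has four gates and more weights.
print(2 * H * C + H * H)                 # 3334144, roughly 3.3 million per layer
```

The repository's own models are defined with the functional API in `brnn_sequence_analyzer.py` and the related scripts; the sketch only shows where `Dropout` sits between recurrent layers and how the rough parameter estimate is obtained.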
121 | 122 | 123 | - **Reinforcement learning function**: The *temperature* parameter is dividing the predicted log probabilities before the *[Softmax](https://en.wikipedia.org/wiki/Softmax_function)*, so lower temperature will cause the model to make more likely, but also more boring and conservative predictions. Higher temperatures cause the model to take more chances and increase diversity of results, but at a cost of more mistakes. 124 | 125 | 126 | - **Loss function**: [categorical_crossentropy](http://keras.io/objectives/) 127 | 128 | 129 | - **Optimizer**: [RMSprop](http://keras.io/optimizers/#rmsprop), you can try other options like simple [SGD](http://keras.io/optimizers/#sgd), [Adagrad](http://keras.io/optimizers/#adagrad) and [Adam](http://keras.io/optimizers/#adam). 130 | 131 | 132 | ## Reference 133 | 134 | ###### [1] Greff, Klaus, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. "*[LSTM: A search space odyssey.](http://arxiv.org/pdf/1503.04069v1.pdf)*" arXiv preprint arXiv:1503.04069 (2015). 135 | -------------------------------------------------------------------------------- /brnn_model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fluency03/sequence-rnn-py/0a55a8fcc93644bca216afc660564d3a606886ab/brnn_model.png -------------------------------------------------------------------------------- /markov_chain.py: -------------------------------------------------------------------------------- 1 | """ 2 | Markov Chain, a comparable model to RNN as a baseline. 3 | 4 | The LabeledMarkovPredictor is riginally written by Erik Ylipaa at SICS. 5 | 6 | Author: Chang Liu (fluency03) 7 | Data: 2016-04-15 8 | """ 9 | 10 | # import unittest 11 | # from collections import Counter, defaultdict 12 | 13 | import glob 14 | import numpy as np 15 | from hmmlearn.hmm import MultinomialHMM 16 | 17 | 18 | class LabeledMarkovPredictor(object): 19 | """ 20 | Model which builds a first order markov model of the labeled data. 21 | """ 22 | def __init__(self, num_classes, # pylint: disable=W0613 23 | eval_during_training=False, **kwargs): 24 | """ 25 | Create a new LabeledMarkov predictor. 26 | 27 | Arguments: 28 | num_classes: {integer}, the number of class labels in the data. 29 | eval_during_training: {bool}, if True, loss will be calculated 30 | during training. For the Markov chain this means very little. 31 | Disabling this speeds up training. 32 | kwargs: 33 | """ 34 | self.num_classes = num_classes 35 | self.eval_during_training = eval_during_training 36 | self.dirty_counts = True # 37 | self.setup_params() 38 | 39 | def setup_params(self): 40 | """ 41 | Set up other perematers. The model is basically just a matrix. Each row 42 | of the matrix is the conditional probabilty for the next symbol in the 43 | sequence, given the current symbol. We set all entries to 1, giving us 44 | a uniform distribution as a prior. 45 | """ 46 | # the matrix initialized with all ones 47 | self.W = np.ones((self.num_classes, self.num_classes), np.uint64) 48 | 49 | # Give the class count two dimensions, but put the second to 1, so it's 50 | # broadcastable over W when we wish to divide. 
Set to to the number of 51 | # classes, so that it will give us the uniform distribution as a prior 52 | self.class_counts = np.full(self.num_classes, 53 | self.num_classes, 54 | dtype=np.uint64) 55 | 56 | self.log_class_counts = np.log(np.full(self.num_classes, 57 | self.num_classes, 58 | dtype=np.uint64)) 59 | self.dirty_counts = False 60 | 61 | def train(self, training_arguments, *args, **kwargs): # pylint: disable=W0613 62 | """ 63 | Updates the model based on the input batch. The input should be a tuple 64 | of two ndarray training-batches. 65 | 66 | Arguments: 67 | training_arguments: {tuple}, should be a tuple of x- and y-batches. 68 | (x_batch, y_bath). The batches should be ndarray matrices of 69 | integer labels. The first dimension is the time dimension, the 70 | second the batch dimension. The shape is considered to have the 71 | semantics: (sequence_length, batch_size). 72 | args: 73 | kwargs: 74 | Returns: {tuple}, (training_loss, info_dict). The training loss will 75 | be the average negative log of the probability of the y_batch before 76 | training on the x_batch. The info_dict is an empty dictionary for 77 | this model. If eval_during_training was set to False when the model 78 | was instantiated, None is returned instead of the loss. 79 | """ 80 | # We disregard any arguments except the training arguments tuple 81 | try: 82 | x_batch, y_batch, mask = training_arguments # pylint: disable=W0612 83 | except ValueError: 84 | x_batch, y_batch = training_arguments 85 | mask = None 86 | 87 | sequence_length, batch_size = x_batch.shape 88 | 89 | # We go over each timestep and increase all the columns denoted by the 90 | # y's for the rows denoted by the x's 91 | for t in range(sequence_length): 92 | for batch_num in range(batch_size): 93 | x = x_batch[t, batch_num] 94 | y = y_batch[t, batch_num] 95 | self.W[x, y] += 1 96 | self.class_counts[x] += 1 97 | self.dirty_counts = True 98 | 99 | info_dict = dict() 100 | if self.eval_during_training: 101 | loss = self.evaluate(training_arguments) 102 | else: 103 | loss = None 104 | return loss, info_dict 105 | 106 | def evaluate(self, training_argument): 107 | """ 108 | Get the average negative log probability for the y_batch, using the 109 | model predicted probabilities from the x_batch. 110 | 111 | Arguments: 112 | training_argument: {tuple}, a pair of ndarrays (x_batch, y_batch). 113 | The batches should be matrices of integers of the same shape, 114 | where the first dimension is time, the second is over batches. 115 | Returns: {float}, The average negative log probability the model 116 | assigned the correct answers of the y_batch given the x_batch. 117 | """ 118 | # We disregard any arguments except the training arguments tuple 119 | try: 120 | x_batch, y_batch, mask = training_argument # pylint: disable=W0612 121 | except ValueError: 122 | x_batch, y_batch = training_argument 123 | mask = None 124 | 125 | x_batch = x_batch.astype(np.int) 126 | y_batch = y_batch.astype(np.int) 127 | sequence_length, batch_size = x_batch.shape 128 | 129 | # np.seterr(divide='ignore'). We ignore division by zero, since we will 130 | # be performing many of them. We will return the negative log likelihood 131 | # per sequence. This will be the logarithm of normalized value for each 132 | # of the entries in the matrix. 
The matrix needs to be normalized by row 133 | # P = np.divide(self.W, self.class_counts) 134 | # P[np.where(np.isnan(P))] = 1/self.num_classes 135 | # Any rows with NaN, we replace with a uniform score 136 | flat_x = x_batch.flatten() 137 | flat_y = y_batch.flatten() 138 | if self.dirty_counts: 139 | self.log_class_counts = np.log(self.class_counts) 140 | self.dirty_counts = False 141 | 142 | # We should take the negative log of the probabilities, this is the same 143 | # as taking the log of the W[x,y]/count[x], which is the same as 144 | # log(W[x,y]) - log(count[x]) 145 | # probs = self.W[flat_x, flat_y] / self.class_counts[flat_x] 146 | # log_probs = np.log(probs) 147 | log_probs = (np.log(self.W[flat_x, flat_y]) - 148 | self.log_class_counts[flat_x]) 149 | loss = - float(np.sum(log_probs)) 150 | # for consistency, divide the negative log loss with the batch size and 151 | # sequence length returning the same loss as the RNN models 152 | sequence_loss = loss / (batch_size * sequence_length) 153 | return sequence_loss 154 | 155 | def predict(self, x_batch): 156 | """ 157 | Arguments: 158 | x_batch: {np.array}, An ndarray of integer labels. 159 | Returns: {integer}. The predicted label the same shape as x_batch. 160 | """ 161 | x_batch = x_batch.astype(np.int) 162 | # for each entry in x_batch, it will pick out a row for W. 163 | label_counts = self.W[x_batch] 164 | # along each row picked by the x_batch 165 | # return the index of the highest count 166 | return np.argmax(label_counts, axis=-1) 167 | 168 | 169 | def transpose(theList): 170 | """ 171 | Transpose matrix for Markov Chain model. 172 | 173 | Arguments: 174 | theList: {list}, the input list. 175 | Returns: {np.array}, the transposed np.array. 176 | """ 177 | return np.asarray(theList).transpose() 178 | 179 | 180 | def get_sequence(filepath): 181 | """ 182 | Get the original sequence from file. 183 | 184 | Arguments: 185 | filename: {string}, the name/path of input log sequence file. 186 | Returns: 187 | {list}, the log sequence. 188 | {integer}, the size of vocabulary. 189 | """ 190 | # read file and convert ids of each line into array of numbers 191 | seqfiles = glob.glob(filepath) 192 | sequence = [] 193 | 194 | for seqfile in seqfiles: 195 | with open(seqfile, 'r') as f: 196 | one_sequence = [int(id_) for id_ in f] 197 | print " %s, sequence length: %d" %(seqfile, 198 | len(one_sequence)) 199 | sequence.extend(one_sequence) 200 | 201 | # add two extra positions for 'unknown-log' and 'no-log' 202 | vocab_size = max(sequence) + 2 203 | 204 | return sequence, vocab_size 205 | 206 | 207 | def get_data(sequence, sentence_length=40, random_offset=False): 208 | """ 209 | Retrieves data from a plain txt file and formats it using one-hot vector. 210 | 211 | Arguments: 212 | sequence: {lsit}, the original input sequence 213 | sentence_length: {integer}, the length of each training sentence. 214 | random_offset: {bool}, the offset is random between step or is 0. 
215 | Returns: 216 | {list}, training input data X 217 | {list}, training target data y 218 | """ 219 | X_sentences = [] 220 | y_sentences = [] 221 | 222 | offset = np.random.randint(0, sentence_length) if random_offset else 0 223 | 224 | # creat batch data and next sentences 225 | for i in range(offset, len(sequence) - sentence_length, sentence_length): 226 | X_sentences.append(sequence[i : i + sentence_length]) 227 | y_sentences.append(sequence[i + 1 : i + sentence_length + 1]) 228 | 229 | return X_sentences, y_sentences 230 | 231 | 232 | def train(sentence_length=40): 233 | """ 234 | Train the markov chain. 235 | 236 | Arguments: 237 | sentence_length: {integer}, length of one sentence in the data set. 238 | """ 239 | # get parameters and dimensions of the model 240 | print "Loading training data..." 241 | train_sequence, input_len1 = get_sequence("./train_data/*") 242 | print "Loading validation data..." 243 | val_sequence, input_len2 = get_sequence("./validation_data/*") 244 | nb_classes = max(input_len1, input_len2) 245 | 246 | print "Training sequence length: %d" %len(train_sequence) 247 | print "Validation sequence length: %d" %len(val_sequence) 248 | print "#classes: %d\n" %nb_classes 249 | 250 | X_train, y_train = get_data(train_sequence, 251 | sentence_length=sentence_length, 252 | random_offset=False) 253 | X_val, y_val = get_data(val_sequence, 254 | sentence_length=sentence_length, 255 | random_offset=False) 256 | 257 | print "Build Markov Chain..." 258 | model = LabeledMarkovPredictor(nb_classes) 259 | 260 | print "Train the model..." 261 | model.train((transpose(X_train), transpose(y_train))) 262 | 263 | print "Validating..." 264 | validation_loss = 0 265 | validation_loss = model.evaluate((transpose(X_val), transpose(y_val))) 266 | 267 | print "Validation loss: {}".format(validation_loss) 268 | 269 | 270 | # TODO: not working yet 271 | def train_hmm(): 272 | """ 273 | HMM for sequence learning. 274 | """ 275 | print "Loading training data..." 276 | train_sequence, num_classes = get_sequence("./train_data/*") 277 | 278 | print "Build HMM..." 279 | model = MultinomialHMM(n_components=2) 280 | 281 | print "Train HMM..." 282 | model.fit([train_sequence]) 283 | 284 | 285 | 286 | if __name__ == '__main__': 287 | train() 288 | # train_hmm() 289 | -------------------------------------------------------------------------------- /naive_bayes.py: -------------------------------------------------------------------------------- 1 | """ 2 | Simple Naive Bayes classifier implimentation for sequence prediction. 3 | 4 | Author: Chang Liu (fluency03) 5 | Data: 2016-05-12 6 | """ 7 | 8 | import cPickle as pickle 9 | import glob 10 | import os 11 | import time 12 | from math import log 13 | import numpy as np 14 | from rnn_sequence_analyzer import plot_hist, plot_and_write_prob 15 | 16 | 17 | class NaiveBayes(object): 18 | """ 19 | Simple Naive Bayes classifier implimentation for sequence prediction. 20 | """ 21 | def __init__(self, window_size, nb_classes, alpha=1.0): 22 | """ 23 | Initialization. Set up some parameters. Build up the matrix. 24 | 25 | Arguments: 26 | window_size: {integer}, the size of input window. 27 | nb_classes: {integer}, number of uniques classes. 28 | alpha: {float}, the smoothing priors alpha >= 0 accounts for 29 | features not present in the learning samples and prevents zero 30 | probabilities in further computations. Setting alpha = 1 is 31 | called Laplace smoothing, while alpha < 1 is called 32 | Lidstone smoothing. 
33 | 34 | """ 35 | self.window_size = window_size 36 | self.nb_classes = nb_classes 37 | self.alpha = alpha 38 | self.build() 39 | 40 | def build(self): 41 | """ 42 | Build up the matrix. 43 | """ 44 | self.ny = np.zeros((self.nb_classes,), dtype=np.int) 45 | self.nx_y = np.zeros((self.window_size, 46 | self.nb_classes, 47 | self.nb_classes), dtype=np.int) 48 | 49 | def train(self, X, y): 50 | """ 51 | Train the model. 52 | 53 | Arguments: 54 | X: {array}, X training data. 55 | y: {array}, y training data. 56 | """ 57 | N = len(y) 58 | for i in xrange(N): 59 | self.ny[y[i]] += 1 60 | for j in xrange(self.window_size): 61 | self.nx_y[j, X[i, j], y[i]] += 1 62 | 63 | def save_model(self, filename): 64 | """ 65 | Save the model information to a file. 66 | """ 67 | print " |-Write the model into %s ..." %filename 68 | with open(filename, 'w') as pkl_file: 69 | pickle.dump({'ny': self.ny, 'nx_y': self.nx_y, 70 | 'window_size': self.window_size, 71 | 'nb_classes': self.nb_classes, 72 | 'alpha': self.alpha}, pkl_file) 73 | 74 | def load_model(self, filename): 75 | """ 76 | Load the model information from a file. 77 | """ 78 | if os.path.isfile(filename): 79 | print "%s existing, loading it...\n" %filename 80 | with open(filename) as pkl_file: 81 | model = pickle.load(pkl_file) 82 | self.ny = model['ny'] 83 | self.nx_y = model['nx_y'] 84 | # self.window_size = model['window_size'] 85 | # self.nb_classes = model['nb_classes'] 86 | # self.alpha = model['alpha'] 87 | else: 88 | print "File does not exist!" 89 | 90 | def evaluate(self, X, y, normalization=True, log_scale=False): 91 | """ 92 | Evaluate the model. 93 | 94 | Arguments: 95 | X: {array}, X evaluation data. 96 | y: {array}, y evaluation data. 97 | normalization: {bool}, whether do the normalization. 98 | log_scale: {bool}, whether transfer probabilities on log scale. 99 | """ 100 | def scale(p): 101 | """ 102 | Probability in log scale. 103 | """ 104 | return log(p) if log_scale else p 105 | 106 | def normalize(py_x): 107 | """ 108 | Normalize the probabilities. 109 | """ 110 | py_x_sum = np.sum(py_x) 111 | return np.asarray([py_x[p] / py_x_sum 112 | for p in xrange(self.nb_classes)]) 113 | 114 | N = np.sum(self.ny) 115 | length = len(y) 116 | print "length: %d " %length 117 | correct = 0 118 | 119 | probs = np.zeros(length) 120 | if not log_scale: 121 | probs[:self.window_size] = 1.0 122 | 123 | # ------------------- Prior ------------------- # 124 | py = np.zeros(self.nb_classes) 125 | for i in xrange(self.nb_classes): 126 | py[i] = ((self.ny[i] + self.alpha) / 127 | (N + self.alpha * self.nb_classes)) 128 | 129 | for i in xrange(length): 130 | print "evaluating %d ..." 
%i 131 | # ------------------- Likelihood ------------------- # 132 | px_y = np.zeros((self.nb_classes, self.window_size)) 133 | for p in xrange(self.nb_classes): 134 | for k in xrange(self.window_size): 135 | px_y[p, k] = ((self.nx_y[k, X[i, k], p] + 136 | self.alpha) / 137 | (self.ny[p] + 138 | self.alpha * self.nb_classes)) 139 | # ------------------- Posterior ------------------- # 140 | py_x = np.zeros(self.nb_classes) 141 | for j in xrange(self.nb_classes): 142 | py_x[j] = py[j] * np.prod(px_y[j]) 143 | 144 | # ------------------- Normalization ------------------- # 145 | if normalization: 146 | py_x = normalize(py_x) 147 | 148 | # ------------------- Prediction ------------------- # 149 | # check the prediction 150 | y_pred = np.argmax(py_x) 151 | y_true = y[i] 152 | 153 | max_prob = scale(py_x[y_pred]) 154 | print ("y_pred: %d , max_prod: %.8f, y_true_prob: %.8f ," 155 | %(y_pred, max_prob, scale(py_x[y_true]))) 156 | 157 | if y_true == y_pred: 158 | correct += 1 159 | 160 | probs[i + self.window_size] = max_prob 161 | 162 | accuracy = (correct * 100.0) / length 163 | print "Accuracy: %.4f%%" %accuracy 164 | 165 | print " |-Plot figures ..." 166 | plot_and_write_prob(probs, 167 | "nb_prob_", 168 | [0, 50000, 0, 1], 169 | 'Log' if log_scale else 'Normal') 170 | 171 | def evaluate_all(self, X, y, nb_options=3, normalization=True): # pylint: disable=R0912 172 | """ 173 | Evaluate the model. 174 | 175 | Arguments: 176 | X: {array}, X evaluation data. 177 | y: {array}, y evaluation data. 178 | nb_options: {interger}, number of predicted options. 179 | normalization: {bool}, whether do the normalization. 180 | """ 181 | N = np.sum(self.ny) 182 | length = len(y) 183 | print "length: %d " %length 184 | 185 | probs = np.zeros((nb_options+1, length + self.window_size)) 186 | for o in xrange(nb_options+1): 187 | probs[o][:self.window_size] = 1.0 188 | 189 | # probability in negative log scale 190 | log_probs = np.zeros((nb_options+1, length + self.window_size)) 191 | 192 | # count the number of correct predictions 193 | nb_correct = [0] * (nb_options+1) 194 | 195 | # ------------------- Prior ------------------- # 196 | py = np.zeros(self.nb_classes) 197 | for i in xrange(self.nb_classes): 198 | py[i] = ((self.ny[i] + self.alpha) / 199 | (N + self.alpha * self.nb_classes)) 200 | 201 | try: 202 | for i in xrange(length): 203 | print "evaluating %d ..." 
%i 204 | # ------------------- Likelihood ------------------- # 205 | px_y = np.zeros((self.nb_classes, self.window_size)) 206 | for p in xrange(self.nb_classes): 207 | for k in xrange(self.window_size): 208 | px_y[p, k] = ((self.nx_y[k, X[i, k], p] + 209 | self.alpha) / 210 | (self.ny[p] + 211 | self.alpha * self.nb_classes)) 212 | # ------------------- Posterior ------------------- # 213 | py_x = np.zeros(self.nb_classes) 214 | for j in xrange(self.nb_classes): 215 | py_x[j] = py[j] * np.prod(px_y[j]) 216 | 217 | # ------------------- Normalization ------------------- # 218 | if normalization: 219 | py_x_sum = np.sum(py_x) 220 | py_x = np.asarray([py_x[p] / py_x_sum 221 | for p in xrange(self.nb_classes)]) 222 | 223 | # ------------------- Prediction ------------------- # 224 | # check the prediction 225 | y_pred = np.argsort(py_x)[-nb_options:][::-1] 226 | y_true = y[i] 227 | print y_pred, y_true 228 | 229 | next_probs = [0.0] * (nb_options+1) 230 | next_probs[0] = py_x[y_true] 231 | 232 | for o in xrange(nb_options): 233 | if y_true == y_pred[o]: 234 | next_probs[o+1] = 1.0 235 | nb_correct[o+1] += 1 236 | 237 | next_probs = np.maximum.accumulate(next_probs) 238 | print next_probs 239 | 240 | for k in xrange(nb_options+1): 241 | probs[k, i + self.window_size] = next_probs[k] 242 | # get the negative log probability 243 | log_probs[k, i + self.window_size] = -log(next_probs[k]) 244 | 245 | except: 246 | print "KeyboardInterrupt" 247 | 248 | nb_correct = np.add.accumulate(nb_correct) 249 | for n in xrange(nb_options+1): 250 | print "Accuracy %d: %.4f%%" %(n, (nb_correct[n] * 100.0 / (i + 1))) # pylint: disable=W0631 251 | 252 | print " |-Plot figures ..." 253 | for q in xrange(nb_options+1): 254 | plot_and_write_prob(probs[q], 255 | "nb_prob_"+str(q), 256 | [0, 50000, 0, 1], 257 | 'Normal') 258 | plot_and_write_prob(log_probs[q], 259 | "nb_log_prob_"+str(q), 260 | [0, 50000, 0, 25], 261 | 'Log') 262 | def predict(self, X): 263 | """ 264 | Predict next sequence. 265 | """ 266 | pass 267 | 268 | 269 | 270 | def get_sequence(filepath): 271 | """ 272 | Get the original sequence from file. 273 | 274 | Arguments: 275 | filename: {string}, the name/path of input log sequence file. 276 | Returns: 277 | {list}, the log sequence. 278 | {integer}, the size of vocabulary. 279 | {integer}, total length of the sequences. 280 | """ 281 | # read file and convert ids of each line into array of numbers 282 | seqfiles = glob.glob(filepath) 283 | sequences = [] 284 | total_length = 0 285 | max_value = 0 286 | 287 | for seqfile in seqfiles: 288 | sequence = [] 289 | with open(seqfile, 'r') as f: 290 | one_sequence = [int(id_) for id_ in f] 291 | print " %s, sequence length: %d" %(seqfile, 292 | len(one_sequence)) 293 | sequence.extend(one_sequence) 294 | total_length += len(one_sequence) 295 | max_new = np.amax(sequence) 296 | max_value = max_new if max_new > max_value else max_value 297 | sequences.append(sequence) 298 | 299 | # add two extra positions for 'unknown-log' and 'no-log' 300 | vocab_size = max_value + 2 301 | 302 | return sequences, vocab_size, total_length 303 | 304 | 305 | def get_data(sequence, sentence_length=40, step=3, random_offset=True): 306 | """ 307 | Retrieves data from a plain txt file and formats it using one-hot vector. 308 | 309 | Arguments: 310 | sequence: {lsit}, the original input sequence 311 | vocab_size: {integer}, the number of unique id classes 312 | sentence_length: {integer}, the length of each training sentence. 313 | step: {integer}, the sample steps. 
314 | random_offset: {bool}, the offset is random between step or is 0. 315 | Returns: 316 | {np.array}, training input data X 317 | {np.array}, training target data y 318 | """ 319 | X_sentences = [] 320 | next_ids = [] 321 | 322 | offset = np.random.randint(0, step) if random_offset else 0 323 | 324 | # creat batch data and next sentences 325 | for i in range(offset, len(sequence) - sentence_length, step): 326 | X_sentences.append(sequence[i : i + sentence_length]) 327 | next_ids.append(sequence[i + sentence_length]) 328 | 329 | # number of sampes 330 | # nb_samples = len(X_sentences) 331 | # print "total # of sentences: %d" %nb_samples 332 | 333 | return np.asarray(X_sentences), np.asarray(next_ids) 334 | 335 | 336 | def main(sentence_length=3, mode='train'): 337 | """ 338 | Train the model. 339 | 340 | Arguments: 341 | sentence_length: {integer}, the length of each training sentence. 342 | """ 343 | # get parameters and dimensions of the model 344 | print "Loading training data..." 345 | train_sequence, input_len1, total_length1 = get_sequence("./train_data/*") 346 | 347 | print "Loading validation data..." 348 | val_sequence, input_len2, total_length2 = get_sequence("./validation_data/*") 349 | 350 | input_len = max(input_len1, input_len2) 351 | 352 | print "Training sequence length: %d" %total_length1 353 | print "Validation sequence length: %d" %total_length2 354 | print "#classes: %d\n" %input_len 355 | 356 | start_time = time.time() 357 | 358 | nb = NaiveBayes(window_size=sentence_length, 359 | nb_classes=input_len, 360 | alpha=1.0/input_len) 361 | 362 | if mode == 'train': 363 | print "Train the model...\n" 364 | for sequence in train_sequence: 365 | X_train, y_train = get_data(sequence, sentence_length=sentence_length, 366 | step=1, random_offset=False) 367 | nb.train(X_train, y_train) 368 | # nb.save_model('2.pkl') 369 | elif mode == 'load': 370 | nb.load_model('2.pkl') 371 | 372 | print "Evaluate the model...\n" 373 | # for sequence in val_sequence: 374 | # X_val, y_val = get_data(sequence, sentence_length=sentence_length, 375 | # step=1, random_offset=False) 376 | # nb.evaluate(X_val, y_val, normalization=True, log_scale=False) 377 | 378 | for sequence in val_sequence: 379 | X_val, y_val = get_data(sequence, sentence_length=sentence_length, 380 | step=1, random_offset=False) 381 | nb.evaluate_all(X_val, y_val, nb_options=3, normalization=True) 382 | 383 | stop_time = time.time() 384 | print "Stop...\n" 385 | print "--- %s seconds ---\n" % (stop_time - start_time) 386 | 387 | if __name__ == '__main__': 388 | main() 389 | -------------------------------------------------------------------------------- /others/brnn_sequence_analyzer.py: -------------------------------------------------------------------------------- 1 | """ 2 | This program analyze the integer sequence using Bi-diractional Recurrent Neural 3 | Network (BRNN) with Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) 4 | based on the python library Keras. 5 | 6 | "Keras is a minimalist, highly modular neural networks library, written in 7 | Python and capable of running on top of either TensorFlow or Theano." 
8 | ---- Keras (http://keras.io/) 9 | 10 | It is based on this Keras example - imdb_bidirectional_lstm.py: 11 | https://github.com/fchollet/keras/blob/master/examples/imdb_bidirectional_lstm.py 12 | 13 | Author: Chang Liu (fluency03) 14 | Data: 2016-03-26 15 | """ 16 | 17 | import glob 18 | # import os 19 | import sys 20 | import csv 21 | import time 22 | import matplotlib.pyplot as plt 23 | import numpy as np 24 | 25 | from keras.callbacks import Callback, ModelCheckpoint 26 | from keras.layers import Input, Dense, Dropout, LSTM, GRU, merge 27 | from keras.layers.wrappers import TimeDistributed 28 | from keras.models import Model 29 | from keras.optimizers import RMSprop # pylint: disable=W0611 30 | from keras.utils.visualize_util import plot 31 | 32 | 33 | # random number generator with a fixed value for reproducibility 34 | np.random.seed(1337) 35 | 36 | 37 | def override(f): 38 | """ 39 | Override decorator. 40 | """ 41 | return f 42 | 43 | 44 | class SequenceAnalyzer(object): 45 | """ 46 | Sequence analyzer based on RNN Graph model. 47 | """ 48 | def __init__(self, sentence_length, input_len, hidden_len, output_len): 49 | self.sentence_length = sentence_length 50 | self.input_len = input_len 51 | self.hidden_len = hidden_len 52 | self.output_len = output_len 53 | self.model = None 54 | 55 | def build(self, layer='LSTM', mapping='m2m', learning_rate=0.001, 56 | nb_layers=2, dropout=0.2): 57 | """ 58 | Bidirectional RNN with specified dropout rate (default 0.2), built with 59 | softmax activation, cross entropy loss and rmsprop optimizer. 60 | 61 | Arguments: 62 | layer: {string}, the type of the layers in the RNN Model. 63 | 'LSTM': LSTM layers 64 | 'GRU': GRU layers 65 | mapping: {string}, input to output mapping. 66 | 'o2o': one-to-one 67 | 'm2m': many-to-many 68 | learning_rate: {float}, learning rate. 69 | nb_layers: {integer}, number of layers in total. 70 | dropout: {float}, dropout value. 71 | """ 72 | print "Building Model..." 73 | print (" layer = %d-%s , mapping = %s , learning rate = %.5f " 74 | "nb_layers = %d , dropout = %.2f" 75 | %(self.hidden_len, layer, mapping, learning_rate, 76 | nb_layers, dropout)) 77 | 78 | # check the layer type: LSTM or GRU 79 | if layer == 'LSTM': 80 | class LAYER(LSTM): 81 | """ 82 | LAYER as LSTM. 83 | """ 84 | pass 85 | elif layer == 'GRU': 86 | class LAYER(GRU): 87 | """ 88 | LAYER as GRU. 
89 | """ 90 | pass 91 | 92 | # check whether return sequence for each of the layers 93 | return_sequences = [] 94 | if mapping == 'o2o': 95 | # if mapping is one-to-one 96 | for nl in range(nb_layers): 97 | if nl == nb_layers-1: 98 | return_sequences.append(False) 99 | else: 100 | return_sequences.append(True) 101 | elif mapping == 'm2m': 102 | # if mapping is many-to-many 103 | for _ in range(nb_layers): 104 | return_sequences.append(True) 105 | 106 | # add input 107 | input_layer = Input(shape=(self.sentence_length, self.input_len), 108 | dtype='float32') 109 | 110 | # first Bi-directional LSTM layer 111 | forward1 = LAYER(self.hidden_len, 112 | return_sequences=return_sequences[0])(input_layer) 113 | forward_dropout1 = Dropout(dropout)(forward1) # pylint: disable=W0612 114 | backward1 = LAYER(self.hidden_len, 115 | return_sequences=return_sequences[0], 116 | go_backwards=True)(input_layer) 117 | backward_dropout1 = Dropout(dropout)(backward1) # pylint: disable=W0612 118 | 119 | # following Bi-directional layers 120 | for nl in range(nb_layers-1): 121 | exec("%s = LAYER(self.hidden_len, return_sequences=%s)(%s)" 122 | %('forward' + str(nl+2), 123 | return_sequences[nl+1], 124 | 'forward_dropout' + str(nl+1))) 125 | exec("%s = Dropout(dropout)(%s)" 126 | %('forward_dropout' + str(nl+2), 127 | 'forward' + str(nl+2))) 128 | exec(("%s = LAYER(self.hidden_len, return_sequences=%s, " 129 | "go_backwards=True)(%s)") 130 | %('backward' + str(nl+2), 131 | return_sequences[nl+1], 132 | 'backward_dropout' + str(nl+1))) 133 | exec("%s = Dropout(dropout)(%s)" 134 | %('backward_dropout' + str(nl+2), 135 | 'backward' + str(nl+2))) 136 | 137 | merged_layer = merge([locals()['forward_dropout' + str(nb_layers)], 138 | locals()['backward_dropout' + str(nb_layers)]], 139 | mode='concat', concat_axis=-1) 140 | 141 | if mapping == 'o2o': 142 | output_layer = Dense(self.output_len, 143 | activation='softmax')(merged_layer) 144 | elif mapping == 'm2m': 145 | output_layer = TimeDistributed( 146 | Dense(self.output_len, activation='softmax'))(merged_layer) 147 | 148 | # add ouput 149 | self.model = Model(input=input_layer, output=output_layer) 150 | 151 | rms = RMSprop(lr=learning_rate) 152 | # try using different optimizers and different optimizer configs 153 | self.model.compile(loss='categorical_crossentropy', 154 | optimizer=rms, 155 | metrics=['accuracy']) 156 | 157 | def save_model(self, filename, overwrite=False): 158 | """ 159 | Save the model weight into a hdf5 file. 160 | 161 | Arguments: 162 | filename: {string}, the name/path to the file 163 | to which the weights are going to be saved. 164 | overwrite: {bool}, overwrite existing file. 165 | """ 166 | print "Save Weights %s ..." %filename 167 | self.model.save_weights(filename, overwrite=overwrite) 168 | 169 | def load_model(self, filename): 170 | """ 171 | Load the model weight into a hdf5 file. 172 | 173 | Arguments: 174 | filename: {string}, the name/path to the file 175 | to which the weights are going to be loaded. 176 | """ 177 | print "Load Weights %s ..." %filename 178 | self.model.load_weights(filename) 179 | 180 | def plot_model(self, filename='brnn_model.png'): 181 | """ 182 | Plot model. 183 | 184 | Arguments: 185 | filename: {string}, the name/path to the file 186 | to which the weights are going to be plotted. 187 | """ 188 | print "Plot Model %s ..." %filename 189 | plot(self.model, to_file=filename) 190 | 191 | 192 | class History(Callback): 193 | """ 194 | Record the loss and accuracy history. 
195 | """ 196 | @override 197 | def on_train_begin(self, logs={}): # pylint: disable=W0102 198 | """ 199 | A method starting at the begining of the training. 200 | 201 | Arguments: 202 | logs: {dictionary}, recording the training and validation 203 | losses and accuracy of every epoch. 204 | """ 205 | # training loss and accuracy 206 | self.train_losses = [] 207 | self.train_acc = [] 208 | # validation loss and accuracy 209 | self.val_losses = [] 210 | self.val_acc = [] 211 | 212 | @override 213 | def on_epoch_end(self, epoch, logs={}): # pylint: disable=W0102 214 | """ 215 | A method starting at the begining of the training. 216 | 217 | Arguments: 218 | epoch: {integer}, the current epoch. 219 | logs: {dictionary}, recording the training and validation 220 | losses and accuracy of every epoch. 221 | """ 222 | # record training loss and accuracy 223 | self.train_losses.append(logs.get('loss')) 224 | self.train_acc.append(logs.get('acc')) 225 | # record validation loss and accuracy 226 | self.val_losses.append(logs.get('val_loss')) 227 | self.val_acc.append(logs.get('val_acc')) 228 | 229 | # continutously save the train_loss, train_acc, val_loss, val_acc 230 | # into a csv file with 4 columns respeactively 231 | csv_name = 'history.csv' 232 | with open(csv_name, 'a') as csvfile: 233 | his_writer = csv.writer(csvfile) 234 | print "\n Save loss and accuracy into %s" %csv_name 235 | his_writer.writerow((logs.get('loss'), logs.get('acc'), 236 | logs.get('val_loss'), logs.get('val_acc'))) 237 | 238 | 239 | def sample(prob, temperature=0.2): 240 | """ 241 | Softmax function for reinforcement learning. 242 | 243 | Arguments: 244 | prob: {list}, a list of probabilities of each of the classes. 245 | temperature: {float}, Softmax temperature. 246 | Returns: 247 | {integer}, the most possible sample. 248 | """ 249 | prob = np.log(prob) / temperature 250 | prob = np.exp(prob) / np.sum(np.exp(prob)) 251 | return np.argmax(np.random.multinomial(1, prob, 1)) 252 | 253 | 254 | def get_sequence(filepath): 255 | """ 256 | Get the original sequence from file. 257 | 258 | Arguments: 259 | filename: {string}, the name/path of input log sequence file. 260 | Returns: 261 | {list}, the log sequence. 262 | {integer}, the size of vocabulary. 263 | """ 264 | # read file and convert ids of each line into array of numbers 265 | seqfiles = glob.glob(filepath) 266 | sequence = [] 267 | 268 | for seqfile in seqfiles: 269 | with open(seqfile, 'r') as f: 270 | one_sequence = [int(id_) for id_ in f] 271 | print " %s, sequence length: %d" %(seqfile, 272 | len(one_sequence)) 273 | sequence.extend(one_sequence) 274 | 275 | # add two extra positions for 'unknown-log' and 'no-log' 276 | vocab_size = max(sequence) + 2 277 | 278 | return sequence, vocab_size 279 | 280 | 281 | def get_data(sequence, vocab_size, mapping='m2m', sentence_length=40, step=3, 282 | random_offset=True): 283 | """ 284 | Retrieves data from a plain txt file and formats it using one-hot vector. 285 | 286 | Arguments: 287 | sequence: {lsit}, the original input sequence 288 | vocab_size: {integer}, the number of unique id classes 289 | mapping: {string}, input to output mapping. 290 | 'o2o': one-to-one 291 | 'm2m': many-to-many 292 | sentence_length: {integer}, the length of each training sentence. 293 | step: {integer}, the sample steps. 294 | random_offset: {bool}, the offset is random between step or is 0. 
295 | Returns: 296 | {np.array}, training input data X 297 | {np.array}, training target data y 298 | """ 299 | X_sentences = [] 300 | y_sentences = [] 301 | next_ids = [] 302 | 303 | offset = np.random.randint(0, step) if random_offset else 0 304 | 305 | # creat batch data and next sentences 306 | for i in range(offset, len(sequence) - sentence_length, step): 307 | X_sentences.append(sequence[i : i + sentence_length]) 308 | if mapping == 'o2o': 309 | # if mapping is one-to-one 310 | next_ids.append(sequence[i + sentence_length]) 311 | elif mapping == 'm2m': 312 | # if mapping is many-to-many 313 | y_sentences.append(sequence[i + 1 : i + sentence_length + 1]) 314 | 315 | # number of sampes 316 | nb_samples = len(X_sentences) 317 | # print "total # of sentences: %d" %nb_samples 318 | 319 | # one-hot vector (all zeros except for a single one at 320 | # the exact postion of this id number) 321 | X_train = np.zeros((nb_samples, sentence_length, vocab_size), dtype=np.bool) 322 | # expected outputs for each sentence 323 | if mapping == 'o2o': 324 | # if mapping is one-to-one 325 | y_train = np.zeros((nb_samples, vocab_size), dtype=np.bool) 326 | elif mapping == 'm2m': 327 | # if mapping is many-to-many 328 | y_train = np.zeros((nb_samples, sentence_length, vocab_size), 329 | dtype=np.bool) 330 | 331 | for i, x_sentence in enumerate(X_sentences): 332 | for t, id_ in enumerate(x_sentence): 333 | # mark the each corresponding character in a sentence as 1 334 | X_train[i, t, id_] = 1 335 | # if mapping is many-to-many 336 | if mapping == 'm2m': 337 | y_train[i, t, y_sentences[i][t]] = 1 338 | # if mapping is one-to-one 339 | # mark the corresponding character in expected output as 1 340 | if mapping == 'o2o': 341 | y_train[i, next_ids[i]] = 1 342 | 343 | return X_train, y_train 344 | 345 | 346 | def predict(sequence, input_len, analyzer, nb_predictions=80, 347 | mapping='m2m', sentence_length=40): 348 | """ 349 | Predict the next sequences using existing model and weights given some seed. 350 | 351 | Arguments: 352 | sequence: {lsit}, the original input sequence 353 | input_len: {integer}, the number of unique id classes 354 | analyzer: {SequenceAnalyzer}, the sequence analyzer 355 | nb_predictions: {integer}, number of predictions after giving the seed 356 | mapping: {string}, input to output mapping. 357 | 'o2o': one-to-one 358 | 'm2m': many-to-many 359 | sentence_length: {integer}, the length of each sentence. 
360 | """ 361 | # generate elements 362 | for _ in range(nb_predictions): 363 | # start index of the seed, random number in range 364 | start_index = np.random.randint(0, len(sequence) - sentence_length - 1) 365 | # seed sentence 366 | sentence = sequence[start_index : start_index + sentence_length] 367 | 368 | # Y_true 369 | y_true = sequence[start_index + 1 : start_index + sentence_length + 1] 370 | print "X: " + ' '.join(str(s).ljust(4) for s in sentence) 371 | 372 | seed = np.zeros((1, sentence_length, input_len)) 373 | # format input 374 | for t in range(0, sentence_length): 375 | seed[0, t, sentence[t]] = 1 376 | 377 | # get predictions 378 | # verbose = 0, no logging 379 | predictions = analyzer.model.predict(seed, verbose=0)[0] 380 | 381 | # y_predicted 382 | if mapping == 'o2o': 383 | next_id = np.argmax(predictions) 384 | sys.stdout.write(' ' + str(next_id)) 385 | sys.stdout.flush() 386 | elif mapping == 'm2m': 387 | next_sentence = [] 388 | for pred in predictions: 389 | next_sentence.append(np.argmax(pred)) 390 | print "y_pred: " + ' '.join(str(id_).ljust(4) 391 | for id_ in next_sentence) 392 | # next_id = np.argmax(predictions[-1]) 393 | 394 | # y_true 395 | print "y_true: " + ' '.join(str(s).ljust(4) for s in y_true) 396 | 397 | print "\n" 398 | 399 | 400 | def train(analyzer, train_sequence, val_sequence, input_len, 401 | batch_size=128, nb_epoch=50, nb_iterations=4, 402 | sentence_length=40, step=40, mapping='m2m'): 403 | """ 404 | Trains the network. 405 | 406 | Arguments: 407 | analyzer: {SequenceAnalyzer}. 408 | train_sequence: {list}, training sequence. 409 | val_sequence: {list}, validation sequence. 410 | input_len: {integer}, the number of classes, i.e., the input length of 411 | neural network. 412 | batch_size: {interger}, the number of sentences per batch. 413 | nb_epoch: {integer}, number of epoches per iteration. 414 | nb_iterations: {integer}, number of iterations. 415 | sentence_length: {integer}, the length of each training sentence. 416 | step: {integer}, the sample steps. 417 | mapping: {string}, input to output mapping. 418 | 'o2o': one-to-one 419 | 'm2m': many-to-many 420 | """ 421 | for iteration in range(1, nb_iterations+1): 422 | # create training data, randomize the offset between steps 423 | X_train, y_train = get_data(train_sequence, input_len, mapping=mapping, 424 | sentence_length=sentence_length, step=step, 425 | random_offset=False) 426 | X_val, y_val = get_data(val_sequence, input_len, mapping=mapping, 427 | sentence_length=sentence_length, step=step, 428 | random_offset=False) 429 | print "" 430 | print "------------------------ Start Training ------------------------" 431 | print "Iteration: ", iteration 432 | print "Number of epoch per iteration: ", nb_epoch 433 | 434 | # history of losses and accuracy 435 | history = History() 436 | 437 | # saves the model weights after each epoch 438 | # if the validation loss decreased 439 | checkpointer = ModelCheckpoint(filepath="weights.hdf5", 440 | verbose=1, save_best_only=True) 441 | 442 | # train the model 443 | analyzer.model.fit(X_train, y_train, 444 | batch_size=batch_size, nb_epoch=nb_epoch, verbose=1, 445 | callbacks=[history, checkpointer], 446 | validation_data=(X_val, y_val)) 447 | 448 | analyzer.save_model("weights-after-iteration.hdf5") 449 | 450 | 451 | def detect(sequence, input_len, analyzer, mapping='m2m', sentence_length=40): 452 | """ 453 | Scan the given sequence for detecting anormalies. 
454 | 455 | Arguments: 456 | sequence: {lsit}, the original input sequence 457 | input_len: {integer}, the number of unique id classes 458 | analyzer: {SequenceAnalyzer}, the sequence analyzer 459 | mapping: {string}, input to output mapping. 460 | 'o2o': one-to-one 461 | 'm2m': many-to-many 462 | sentence_length: {integer}, the length of each sentence. 463 | """ 464 | # sequence length 465 | length = len(sequence) 466 | 467 | # predicted probabilities for each id 468 | # we assume the first sentence_length ids are true 469 | prob = [1] * sentence_length + [0] * (length - sentence_length) 470 | 471 | start_time = time.time() 472 | try: 473 | # generate elements 474 | for start_index in xrange(length - sentence_length): 475 | # seed sentence 476 | X = sequence[start_index : start_index + sentence_length] 477 | # print "X: " + ' '.join(str(s).ljust(4) for s in sentence) 478 | 479 | # Y_true 480 | # y_true = sequence[start_index + 1 : start_index + sentence_length + 1] 481 | # print "y_true: " + ' '.join(str(s).ljust(4) for s in y_true) 482 | y_next_true = sequence[start_index + sentence_length] 483 | 484 | seed = np.zeros((1, sentence_length, input_len)) 485 | # format input 486 | for t in range(0, sentence_length): 487 | seed[0, t, X[t]] = 1 488 | 489 | # get predictionsverbose = 0, no logging 490 | predictions = analyzer.model.predict(seed, verbose=0)[0] 491 | 492 | # y_predicted 493 | y_next_pred = 0 494 | next_prob = 0 495 | if mapping == 'o2o': 496 | next_prob = predictions[y_next_true] 497 | prob[start_index + sentence_length] = next_prob 498 | y_next_pred = np.argmax(predictions) 499 | elif mapping == 'm2m': 500 | # next_sentence = [] 501 | # for pred in predictions: 502 | # next_sentence.append(np.argmax(pred)) 503 | # y_next_pred = next_sentence[-1] 504 | # print "y_pred: " + ' '.join(str(id_).ljust(4) 505 | # for id_ in next_sentence) 506 | y_next_pred = np.argmax(predictions[-1]) 507 | next_prob = predictions[-1][y_next_true] 508 | prob[start_index + sentence_length] = next_prob 509 | 510 | print start_index, next_prob 511 | except KeyboardInterrupt: 512 | # print " |-Write the clusters into %s ..." %self.cluster_file 513 | with open('prob.txt', 'w') as prob_file: 514 | for p in prob: 515 | prob_file.write(str(p) + '\n') 516 | 517 | plt.plot(prob, 'r*') 518 | plt.xlim(0, 1000) 519 | plt.ylim(0, 1) 520 | plt.savefig("prob.png") 521 | plt.clf() 522 | plt.cla() 523 | 524 | stop_time = time.time() 525 | print "--- %s seconds ---\n" % (stop_time - start_time) 526 | 527 | return prob 528 | 529 | 530 | def run(hidden_len=512, batch_size=128, nb_epoch=50, nb_iterations=4, 531 | learning_rate=0.001, nb_predictions=20, mapping='m2m', 532 | sentence_length=80, step=80, mode='train'): 533 | """ 534 | Train, evaluate, or predict. 535 | 536 | Arguments: 537 | hidden_len: {integer}, the size of a hidden layer. 538 | batch_size: {interger}, the number of sentences per batch. 539 | nb_epoch: {interger}, number of epoches per iteration. 540 | nb_iterations: {integer}, number of iterations. 541 | learning_rate: {float}, learning rate. 542 | nb_predictions: {integer}, number of the ids predicted. 543 | mapping: {string}, input to output mapping. 544 | 'o2o': one-to-one 545 | 'm2m': many-to-many 546 | sentence_length: {integer}, the length of each training sentence. 547 | step: {integer}, the sample steps. 
548 | mode: {string}, th running mode of this programm 549 | 'train': train and predict 550 | 'predict': only predict by loading existing model weights 551 | 'evaluate': evaluate the model in evaluation data set 552 | 'detect': detect a new log sequence for the probabilities 553 | """ 554 | # get parameters and dimensions of the model 555 | print "Loading training data..." 556 | train_sequence, input_len1 = get_sequence("./train_data/*") 557 | print "Loading validation data..." 558 | val_sequence, input_len2 = get_sequence("./validation_data/*") 559 | input_len = max(input_len1, input_len2) 560 | 561 | print "Training sequence length: %d" %len(train_sequence) 562 | print "Validation sequence length: %d" %len(val_sequence) 563 | print "#classes: %d\n" %input_len 564 | 565 | # two layered LSTM 512 hidden nodes and a dropout rate of 0.2 566 | brnn = SequenceAnalyzer(sentence_length, input_len, hidden_len, input_len) 567 | 568 | # build model 569 | brnn.build(layer='LSTM', mapping=mapping, learning_rate=learning_rate, 570 | nb_layers=2, dropout=0.2) 571 | 572 | # plot model 573 | # brnn.plot_model() 574 | 575 | # load the previous model weights 576 | # brnn.load_model("weightsf4-61.hdf5") 577 | 578 | if mode == 'predict': 579 | print "Predict..." 580 | predict(val_sequence, input_len, brnn, nb_predictions=nb_predictions, 581 | mapping=mapping, sentence_length=sentence_length) 582 | elif mode == 'evaluate': 583 | print "Evaluate..." 584 | print "Metrics: " + ', '.join(brnn.model.metrics_names) 585 | X_val, y_val = get_data(val_sequence, input_len, mapping=mapping, 586 | sentence_length=sentence_length, step=step, 587 | random_offset=False) 588 | results = brnn.model.evaluate(X_val, y_val, #pylint: disable=W0612 589 | batch_size=batch_size, 590 | verbose=1) 591 | print "Loss: ", results[0] 592 | print "Accuracy: ", results[1] 593 | elif mode == 'train': 594 | print "Train..." 595 | try: 596 | train(brnn, train_sequence, val_sequence, input_len, 597 | batch_size=batch_size, nb_epoch=nb_epoch, 598 | nb_iterations=nb_iterations, 599 | sentence_length=sentence_length, 600 | step=step, mapping=mapping) 601 | except KeyboardInterrupt: 602 | brnn.save_model("weights-stop.hdf5") 603 | elif mode == 'detect': 604 | print "Detect..." 605 | detect(val_sequence, input_len, brnn, mapping=mapping, 606 | sentence_length=sentence_length) 607 | else: 608 | print "The mode = %s is not correct!!!" %mode 609 | 610 | return mode 611 | 612 | 613 | if __name__ == '__main__': 614 | run() 615 | -------------------------------------------------------------------------------- /others/brnn_sequence_analyzer_gen.py: -------------------------------------------------------------------------------- 1 | """ 2 | This program analyze the integer sequence using Bi-diractional Recurrent Neural 3 | Network (BRNN) with Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) 4 | based on the python library Keras. 5 | 6 | Input data is Generator and the training is by calling model.fit_generator(). 7 | 8 | "Keras is a minimalist, highly modular neural networks library, written in 9 | Python and capable of running on top of either TensorFlow or Theano." 
10 | ---- Keras (http://keras.io/) 11 | 12 | It is based on this Keras example - imdb_bidirectional_lstm.py: 13 | https://github.com/fchollet/keras/blob/master/examples/imdb_bidirectional_lstm.py 14 | 15 | Author: Chang Liu (fluency03) 16 | Data: 2016-04-03 17 | """ 18 | 19 | import glob 20 | # import os 21 | import sys 22 | import csv 23 | import time 24 | import matplotlib.pyplot as plt 25 | import numpy as np 26 | 27 | from keras.callbacks import Callback, ModelCheckpoint 28 | from keras.layers import Input, Dense, Dropout, LSTM, GRU, merge 29 | from keras.layers.wrappers import TimeDistributed 30 | from keras.models import Model 31 | from keras.optimizers import RMSprop # pylint: disable=W0611 32 | from keras.utils.visualize_util import plot 33 | 34 | 35 | # random number generator with a fixed value for reproducibility 36 | np.random.seed(1337) 37 | 38 | 39 | def override(f): 40 | """ 41 | Override decorator. 42 | """ 43 | return f 44 | 45 | 46 | class SequenceAnalyzer(object): 47 | """ 48 | Sequence analyzer based on RNN Graph model. 49 | """ 50 | def __init__(self, sentence_length, input_len, hidden_len, output_len): 51 | self.sentence_length = sentence_length 52 | self.input_len = input_len 53 | self.hidden_len = hidden_len 54 | self.output_len = output_len 55 | self.model = None 56 | 57 | def build(self, layer='LSTM', mapping='m2m', learning_rate=0.001, 58 | nb_layers=2, dropout=0.2): 59 | """ 60 | Bidirectional RNN with specified dropout rate (default 0.2), built with 61 | softmax activation, cross entropy loss and rmsprop optimizer. 62 | 63 | Arguments: 64 | layer: {string}, the type of the layers in the RNN Model. 65 | 'LSTM': LSTM layers 66 | 'GRU': GRU layers 67 | mapping: {string}, input to output mapping. 68 | 'o2o': one-to-one 69 | 'm2m': many-to-many 70 | learning_rate: {float}, learning rate. 71 | nb_layers: {integer}, number of layers in total. 72 | dropout: {float}, dropout value. 73 | """ 74 | print "Building Model..." 75 | print (" layer = %d-%s , mapping = %s , learning rate = %.5f " 76 | "nb_layers = %d , dropout = %.2f" 77 | %(self.hidden_len, layer, mapping, learning_rate, 78 | nb_layers, dropout)) 79 | 80 | # check the layer type: LSTM or GRU 81 | if layer == 'LSTM': 82 | class LAYER(LSTM): 83 | """ 84 | LAYER as LSTM. 85 | """ 86 | pass 87 | elif layer == 'GRU': 88 | class LAYER(GRU): 89 | """ 90 | LAYER as GRU. 
91 | """ 92 | pass 93 | 94 | # check whether return sequence for each of the layers 95 | return_sequences = [] 96 | if mapping == 'o2o': 97 | # if mapping is one-to-one 98 | for nl in range(nb_layers): 99 | if nl == nb_layers-1: 100 | return_sequences.append(False) 101 | else: 102 | return_sequences.append(True) 103 | elif mapping == 'm2m': 104 | # if mapping is many-to-many 105 | for _ in range(nb_layers): 106 | return_sequences.append(True) 107 | 108 | # add input 109 | input_layer = Input(shape=(self.sentence_length, self.input_len), 110 | dtype='float32') 111 | 112 | # first Bi-directional LSTM layer 113 | forward1 = LAYER(self.hidden_len, 114 | return_sequences=return_sequences[0])(input_layer) 115 | forward_dropout1 = Dropout(dropout)(forward1) # pylint: disable=W0612 116 | backward1 = LAYER(self.hidden_len, 117 | return_sequences=return_sequences[0], 118 | go_backwards=True)(input_layer) 119 | backward_dropout1 = Dropout(dropout)(backward1) # pylint: disable=W0612 120 | 121 | # following Bi-directional layers 122 | for nl in range(nb_layers-1): 123 | exec("%s = LAYER(self.hidden_len, return_sequences=%s)(%s)" 124 | %('forward' + str(nl+2), 125 | return_sequences[nl+1], 126 | 'forward_dropout' + str(nl+1))) 127 | exec("%s = Dropout(dropout)(%s)" 128 | %('forward_dropout' + str(nl+2), 129 | 'forward' + str(nl+2))) 130 | exec(("%s = LAYER(self.hidden_len, return_sequences=%s, " 131 | "go_backwards=True)(%s)") 132 | %('backward' + str(nl+2), 133 | return_sequences[nl+1], 134 | 'backward_dropout' + str(nl+1))) 135 | exec("%s = Dropout(dropout)(%s)" 136 | %('backward_dropout' + str(nl+2), 137 | 'backward' + str(nl+2))) 138 | 139 | merged_layer = merge([locals()['forward_dropout' + str(nb_layers)], 140 | locals()['backward_dropout' + str(nb_layers)]], 141 | mode='concat', concat_axis=-1) 142 | 143 | if mapping == 'o2o': 144 | output_layer = Dense(self.output_len, 145 | activation='softmax')(merged_layer) 146 | elif mapping == 'm2m': 147 | output_layer = TimeDistributed( 148 | Dense(self.output_len, activation='softmax'))(merged_layer) 149 | 150 | # add ouput 151 | self.model = Model(input=input_layer, output=output_layer) 152 | 153 | rms = RMSprop(lr=learning_rate) 154 | # try using different optimizers and different optimizer configs 155 | self.model.compile(loss='categorical_crossentropy', 156 | optimizer=rms, 157 | metrics=['accuracy']) 158 | 159 | def save_model(self, filename, overwrite=False): 160 | """ 161 | Save the model weight into a hdf5 file. 162 | 163 | Arguments: 164 | filename: {string}, the name/path to the file 165 | to which the weights are going to be saved. 166 | overwrite: {bool}, overwrite existing file. 167 | """ 168 | print "Save Weights %s ..." %filename 169 | self.model.save_weights(filename, overwrite=overwrite) 170 | 171 | def load_model(self, filename): 172 | """ 173 | Load the model weight into a hdf5 file. 174 | 175 | Arguments: 176 | filename: {string}, the name/path to the file 177 | to which the weights are going to be loaded. 178 | """ 179 | print "Load Weights %s ..." %filename 180 | self.model.load_weights(filename) 181 | 182 | def plot_model(self, filename='brnn_model.png'): 183 | """ 184 | Plot model. 185 | 186 | Arguments: 187 | filename: {string}, the name/path to the file 188 | to which the weights are going to be plotted. 189 | """ 190 | print "Plot Model %s ..." %filename 191 | plot(self.model, to_file=filename) 192 | 193 | 194 | class History(Callback): 195 | """ 196 | Record the loss and accuracy history. 
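
    After every epoch the four metrics are appended as one row to
    'history.csv'. A minimal sketch for inspecting that file afterwards
    (assumes the file exists and contains at least one epoch):

        import numpy as np
        hist = np.loadtxt('history.csv', delimiter=',', ndmin=2)
        # columns: loss, acc, val_loss, val_acc
        print hist[:, 0]   # training loss per epoch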
197 | """ 198 | @override 199 | def on_train_begin(self, logs={}): # pylint: disable=W0102 200 | """ 201 | A method starting at the begining of the training. 202 | 203 | Arguments: 204 | logs: {dictionary}, recording the training and validation 205 | losses and accuracy of every epoch. 206 | """ 207 | # training loss and accuracy 208 | self.train_losses = [] 209 | self.train_acc = [] 210 | # validation loss and accuracy 211 | self.val_losses = [] 212 | self.val_acc = [] 213 | 214 | @override 215 | def on_epoch_end(self, epoch, logs={}): # pylint: disable=W0102 216 | """ 217 | A method starting at the begining of the training. 218 | 219 | Arguments: 220 | epoch: {integer}, the current epoch. 221 | logs: {dictionary}, recording the training and validation 222 | losses and accuracy of every epoch. 223 | """ 224 | # record training loss and accuracy 225 | self.train_losses.append(logs.get('loss')) 226 | self.train_acc.append(logs.get('acc')) 227 | # record validation loss and accuracy 228 | self.val_losses.append(logs.get('val_loss')) 229 | self.val_acc.append(logs.get('val_acc')) 230 | 231 | # continutously save the train_loss, train_acc, val_loss, val_acc 232 | # into a csv file with 4 columns respeactively 233 | csv_name = 'history.csv' 234 | with open(csv_name, 'a') as csvfile: 235 | his_writer = csv.writer(csvfile) 236 | print "\n Save loss and accuracy into %s" %csv_name 237 | his_writer.writerow((logs.get('loss'), logs.get('acc'), 238 | logs.get('val_loss'), logs.get('val_acc'))) 239 | 240 | 241 | def sample(prob, temperature=0.2): 242 | """ 243 | Softmax function for reinforcement learning. 244 | 245 | Arguments: 246 | prob: {list}, a list of probabilities of each of the classes. 247 | temperature: {float}, Softmax temperature. 248 | Returns: 249 | {integer}, the most possible sample. 250 | """ 251 | prob = np.log(prob) / temperature 252 | prob = np.exp(prob) / np.sum(np.exp(prob)) 253 | return np.argmax(np.random.multinomial(1, prob, 1)) 254 | 255 | 256 | def get_sequence(filepath): 257 | """ 258 | Get the original sequence from file. 259 | 260 | Arguments: 261 | filename: {string}, the name/path of input log sequence file. 262 | Returns: 263 | {list}, the log sequence. 264 | {integer}, the size of vocabulary. 265 | """ 266 | # read file and convert ids of each line into array of numbers 267 | seqfiles = glob.glob(filepath) 268 | sequence = [] 269 | 270 | for seqfile in seqfiles: 271 | with open(seqfile, 'r') as f: 272 | one_sequence = [int(id_) for id_ in f] 273 | print " %s, sequence length: %d" %(seqfile, 274 | len(one_sequence)) 275 | sequence.extend(one_sequence) 276 | 277 | # add two extra positions for 'unknown-log' and 'no-log' 278 | vocab_size = max(sequence) + 2 279 | 280 | return sequence, vocab_size 281 | 282 | 283 | def data_generator(sequence, vocab_size, mapping='m2m', sentence_length=40, 284 | step=3, random_offset=True, batch_size=128): 285 | """ 286 | Retrieves data from a plain txt file and formats it using one-hot vector. 287 | This method returns a data generator yeilding a batch of (X_train, y_train) 288 | every time being called. 289 | 290 | Arguments: 291 | sequence: {lsit}, the original input sequence 292 | vocab_size: {integer}, the number of unique id classes 293 | mapping: {string}, input to output mapping. 294 | 'o2o': one-to-one 295 | 'm2m': many-to-many 296 | sentence_length: {integer}, the length of each training sentence. 297 | step: {integer}, the sample steps. 298 | random_offset: {bool}, the offset is random between step or is 0. 
299 | batch_size: {integer}, the number of sample per batch. 300 | Yields: 301 | {np.array}, training input data X 302 | {np.array}, training target data y 303 | """ 304 | # the number of current sample 305 | sample_count = 0 306 | 307 | # one-hot vector (all zeros except for a single one at 308 | # the exact postion of this id number) 309 | X_train = np.zeros((batch_size, sentence_length, vocab_size), 310 | dtype=np.bool) 311 | # expected outputs for each sentence 312 | if mapping == 'o2o': 313 | # if mapping is one-to-one 314 | y_train = np.zeros((batch_size, vocab_size), dtype=np.bool) 315 | elif mapping == 'm2m': 316 | # if mapping is many-to-many 317 | y_train = np.zeros((batch_size, sentence_length, vocab_size), 318 | dtype=np.bool) 319 | 320 | # continuousy creat batch data and next sentences 321 | while True: 322 | offset = np.random.randint(0, step) if random_offset else 0 323 | for i in range(offset, len(sequence) - sentence_length, step): 324 | # index of a this sample in this batch 325 | batch_index = sample_count % batch_size 326 | 327 | # re-initialzing the batch 328 | if batch_index == 0: 329 | X_train.fill(0) 330 | y_train.fill(0) 331 | 332 | # current sample and target outputs 333 | X_sentence = [] 334 | y_sentence = [] 335 | next_id = [] 336 | 337 | X_sentence = sequence[i : i + sentence_length] 338 | if mapping == 'o2o': 339 | # if mapping is one-to-one 340 | next_id = sequence[i + sentence_length] 341 | elif mapping == 'm2m': 342 | # if mapping is many-to-many 343 | y_sentence = sequence[i + 1 : i + sentence_length + 1] 344 | 345 | for t, id_ in enumerate(X_sentence): 346 | # mark the each corresponding character in a sentence as 1 347 | X_train[batch_index, t, id_] = 1 348 | # if mapping is many-to-many 349 | if mapping == 'm2m': 350 | y_train[batch_index, t, y_sentence[t]] = 1 351 | # if mapping is one-to-one 352 | # mark the corresponding character in expected output as 1 353 | if mapping == 'o2o': 354 | y_train[batch_index, next_id] = 1 355 | 356 | # sample count plus 1 357 | sample_count += 1 358 | 359 | if batch_index == batch_size-1: 360 | yield X_train, y_train 361 | 362 | 363 | def predict(sequence, input_len, analyzer, nb_predictions=80, 364 | mapping='m2m', sentence_length=40): 365 | """ 366 | Predict the next sequences using existing model and weights given some seed. 367 | 368 | Arguments: 369 | sequence: {lsit}, the original input sequence 370 | input_len: {integer}, the number of unique id classes 371 | analyzer: {SequenceAnalyzer}, the sequence analyzer 372 | nb_predictions: {integer}, number of predictions after giving the seed 373 | mapping: {string}, input to output mapping. 374 | 'o2o': one-to-one 375 | 'm2m': many-to-many 376 | sentence_length: {integer}, the length of each sentence. 
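
    A minimal usage sketch (assumes weights were already trained and saved,
    e.g. as 'weights.hdf5' by the ModelCheckpoint callback used in train()):

        val_sequence, vocab_size = get_sequence("./validation_data/*")
        brnn = SequenceAnalyzer(40, vocab_size, 512, vocab_size)
        brnn.build(layer='LSTM', mapping='m2m')
        brnn.load_model("weights.hdf5")
        predict(val_sequence, vocab_size, brnn, nb_predictions=10,
                mapping='m2m', sentence_length=40)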
377 | """ 378 | # generate elements 379 | for _ in range(nb_predictions): 380 | # start index of the seed, random number in range 381 | start_index = np.random.randint(0, len(sequence) - sentence_length - 1) 382 | # seed sentence 383 | sentence = sequence[start_index : start_index + sentence_length] 384 | 385 | # Y_true 386 | y_true = sequence[start_index + 1 : start_index + sentence_length + 1] 387 | print "X: " + ' '.join(str(s).ljust(4) for s in sentence) 388 | 389 | seed = np.zeros((1, sentence_length, input_len)) 390 | # format input 391 | for t in range(0, sentence_length): 392 | seed[0, t, sentence[t]] = 1 393 | 394 | # get predictions 395 | # verbose = 0, no logging 396 | predictions = analyzer.model.predict(seed, verbose=0)[0] 397 | 398 | # y_predicted 399 | if mapping == 'o2o': 400 | next_id = np.argmax(predictions) 401 | sys.stdout.write(' ' + str(next_id)) 402 | sys.stdout.flush() 403 | elif mapping == 'm2m': 404 | next_sentence = [] 405 | for pred in predictions: 406 | next_sentence.append(np.argmax(pred)) 407 | print "y_pred: " + ' '.join(str(id_).ljust(4) 408 | for id_ in next_sentence) 409 | # next_id = np.argmax(predictions[-1]) 410 | 411 | # y_true 412 | print "y_true: " + ' '.join(str(s).ljust(4) for s in y_true) 413 | 414 | print "\n" 415 | 416 | 417 | def train(analyzer, train_data, nb_training_samples, 418 | val_data, nb_validation_samples, 419 | nb_epoch=50, nb_iterations=4): 420 | """ 421 | Trains the network. 422 | 423 | Arguments: 424 | analyzer: {SequenceAnalyzer}. 425 | train_data: {tuple}, training data (X_train, y_train). 426 | val_data: {tuple}, validation data (X_val, y_val). 427 | nb_training_samples: {integer}, the number training samples. 428 | nb_validation_samples: {integer}, the number validation samples. 429 | nb_iterations: {integer}, number of iterations. 430 | sentence_length: {integer}, the length of each training sentence. 431 | """ 432 | for iteration in range(1, nb_iterations+1): 433 | print "" 434 | print "------------------------ Start Training ------------------------" 435 | print "Iteration: ", iteration 436 | print "Number of epoch per iteration: ", nb_epoch 437 | 438 | # history of losses and accuracy 439 | history = History() 440 | 441 | # saves the model weights after each epoch 442 | # if the validation loss decreased 443 | checkpointer = ModelCheckpoint(filepath="weights.hdf5", 444 | verbose=1, save_best_only=True) 445 | 446 | # train the model with data generator 447 | analyzer.model.fit_generator(train_data, 448 | samples_per_epoch=nb_training_samples, 449 | nb_epoch=nb_epoch, verbose=1, 450 | callbacks=[history, checkpointer], 451 | validation_data=val_data, 452 | nb_val_samples=nb_validation_samples) 453 | 454 | analyzer.save_model("weights-after-iteration.hdf5") 455 | 456 | 457 | def detect(sequence, input_len, analyzer, mapping='m2m', sentence_length=40): 458 | """ 459 | Scan the given sequence for detecting anormalies. 460 | 461 | Arguments: 462 | sequence: {lsit}, the original input sequence 463 | input_len: {integer}, the number of unique id classes 464 | analyzer: {SequenceAnalyzer}, the sequence analyzer 465 | mapping: {string}, input to output mapping. 466 | 'o2o': one-to-one 467 | 'm2m': many-to-many 468 | sentence_length: {integer}, the length of each sentence. 
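    Returns:
        {list}, prob, where prob[i] is the probability the model assigned to
        the true id at position i; the first sentence_length entries are
        fixed to 1 because no prediction is made for them. If the scan is
        interrupted with Ctrl-C, the probabilities collected so far are
        written to 'prob.txt'; in either case a scatter plot of prob is
        saved to 'prob.png'.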
469 | """ 470 | # sequence length 471 | length = len(sequence) 472 | 473 | # predicted probabilities for each id 474 | # we assume the first sentence_length ids are true 475 | prob = [1] * sentence_length + [0] * (length - sentence_length) 476 | 477 | start_time = time.time() 478 | try: 479 | # generate elements 480 | for start_index in xrange(length - sentence_length): 481 | # seed sentence 482 | X = sequence[start_index : start_index + sentence_length] 483 | # print "X: " + ' '.join(str(s).ljust(4) for s in sentence) 484 | 485 | # Y_true 486 | # y_true = sequence[start_index + 1 : start_index + sentence_length + 1] 487 | # print "y_true: " + ' '.join(str(s).ljust(4) for s in y_true) 488 | y_next_true = sequence[start_index + sentence_length] 489 | 490 | seed = np.zeros((1, sentence_length, input_len)) 491 | # format input 492 | for t in range(0, sentence_length): 493 | seed[0, t, X[t]] = 1 494 | 495 | # get predictionsverbose = 0, no logging 496 | predictions = analyzer.model.predict(seed, verbose=0)[0] 497 | 498 | # y_predicted 499 | y_next_pred = 0 500 | next_prob = 0 501 | if mapping == 'o2o': 502 | next_prob = predictions[y_next_true] 503 | prob[start_index + sentence_length] = next_prob 504 | y_next_pred = np.argmax(predictions) 505 | elif mapping == 'm2m': 506 | # next_sentence = [] 507 | # for pred in predictions: 508 | # next_sentence.append(np.argmax(pred)) 509 | # y_next_pred = next_sentence[-1] 510 | # print "y_pred: " + ' '.join(str(id_).ljust(4) 511 | # for id_ in next_sentence) 512 | y_next_pred = np.argmax(predictions[-1]) 513 | next_prob = predictions[-1][y_next_true] 514 | prob[start_index + sentence_length] = next_prob 515 | 516 | print start_index, next_prob 517 | except KeyboardInterrupt: 518 | # print " |-Write the clusters into %s ..." %self.cluster_file 519 | with open('prob.txt', 'w') as prob_file: 520 | for p in prob: 521 | prob_file.write(str(p) + '\n') 522 | 523 | plt.plot(prob, 'r*') 524 | plt.xlim(0, 1000) 525 | plt.ylim(0, 1) 526 | plt.savefig("prob.png") 527 | plt.clf() 528 | plt.cla() 529 | 530 | stop_time = time.time() 531 | print "--- %s seconds ---\n" % (stop_time - start_time) 532 | 533 | return prob 534 | 535 | 536 | def run(hidden_len=512, batch_size=128, nb_batch=200, nb_epoch=50, 537 | nb_iterations=4, learning_rate=0.001, nb_predictions=20, 538 | mapping='m2m', sentence_length=80, step=80, mode='train'): 539 | """ 540 | Train, evaluate, or predict. 541 | 542 | Arguments: 543 | hidden_len: {integer}, the size of a hidden layer. 544 | batch_size: {interger}, the number of sentences per batch. 545 | nb_batch: {integer}, number of batches to be trained durign each epoch. 546 | nb_epoch: {interger}, number of epoches per iteration. 547 | nb_iterations: {integer}, number of iterations. 548 | learning_rate: {float}, learning rate. 549 | nb_predictions: {integer}, number of the ids predicted. 550 | mapping: {string}, input to output mapping. 551 | 'o2o': one-to-one 552 | 'm2m': many-to-many 553 | sentence_length: {integer}, the length of each training sentence. 554 | step: {integer}, the sample steps. 555 | mode: {string}, th running mode of this programm 556 | 'train': train and predict 557 | 'predict': only predict by loading existing model weights 558 | 'evaluate': evaluate the model in evaluation data set 559 | 'detect': detect a new log sequence for the probabilities 560 | """ 561 | # get parameters and dimensions of the model 562 | print "Loading training data..." 
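    # get_sequence() reads the plain-text files matched by the glob pattern,
    # one integer id per line, and concatenates them into a single sequence.
    # A file under ./train_data/ might, for example, look like (hypothetical):
    #     12
    #     7
    #     3
    # The reported vocabulary size is max(id) + 2, which leaves two extra
    # slots for the 'unknown-log' and 'no-log' ids.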
563 |     train_sequence, input_len1 = get_sequence("./train_data/*")
564 |     print "Loading validation data..."
565 |     val_sequence, input_len2 = get_sequence("./validation_data/*")
566 |     input_len = max(input_len1, input_len2)
567 | 
568 |     print "Training sequence length: %d" %len(train_sequence)
569 |     print "Validation sequence length: %d" %len(val_sequence)
570 |     print "#classes: %d\n" %input_len
571 | 
572 |     # data generator of X_train and y_train, with random offset
573 |     train_data = data_generator(train_sequence, input_len, mapping=mapping,
574 |                                 sentence_length=sentence_length, step=step,
575 |                                 random_offset=True, batch_size=batch_size)
576 | 
577 |     # data generator of X_val and y_val, with random offset
578 |     val_data = data_generator(val_sequence, input_len, mapping=mapping,
579 |                               sentence_length=sentence_length, step=step,
580 |                               random_offset=True, batch_size=batch_size)
581 | 
582 |     # two-layer LSTM with 512 hidden nodes and a dropout rate of 0.2
583 |     brnn = SequenceAnalyzer(sentence_length, input_len, hidden_len, input_len)
584 | 
585 |     # build model
586 |     brnn.build(layer='LSTM', mapping=mapping, learning_rate=learning_rate,
587 |                nb_layers=2, dropout=0.2)
588 | 
589 |     # plot model
590 |     # brnn.plot_model()
591 | 
592 |     # load the previous model weights
593 |     # brnn.load_model("weightsf4-61.hdf5")
594 | 
595 |     if mode == 'predict':
596 |         print "Predict..."
597 |         predict(val_sequence, input_len, brnn, nb_predictions=nb_predictions,
598 |                 mapping=mapping, sentence_length=sentence_length)
599 |     elif mode == 'evaluate':
600 |         print "Evaluate..."
601 |         print "Metrics: " + ', '.join(brnn.model.metrics_names)
602 |         X_val, y_val = next(data_generator(
603 |             val_sequence, input_len, mapping=mapping, step=step,
604 |             sentence_length=sentence_length, random_offset=False,
605 |             batch_size=batch_size))  # evaluate on a single generated batch
606 |         results = brnn.model.evaluate(X_val, y_val, #pylint: disable=W0612
607 |                                       batch_size=batch_size,
608 |                                       verbose=1)
609 |         print "Loss: ", results[0]
610 |         print "Accuracy: ", results[1]
611 |     elif mode == 'train':
612 |         print "Train..."
613 |         # number of training samples and validation samples
614 |         nb_training_samples = batch_size * nb_batch
615 |         nb_validation_samples = int(nb_training_samples * 0.05)
616 | 
617 |         try:
618 |             train(brnn, train_data, nb_training_samples,
619 |                   val_data, nb_validation_samples,
620 |                   nb_epoch=nb_epoch, nb_iterations=nb_iterations)
621 |         except KeyboardInterrupt:
622 |             brnn.save_model("weights-stop.hdf5")
623 |     elif mode == 'detect':
624 |         print "Detect..."
625 |         detect(val_sequence, input_len, brnn, mapping=mapping,
626 |                sentence_length=sentence_length)
627 |     else:
628 |         print "The mode = %s is not supported!" %mode
629 | 
630 |     return mode
631 | 
632 | 
633 | if __name__ == '__main__':
634 |     run()
635 | 
--------------------------------------------------------------------------------
/others/rnn_sequence_analyzer_gen.py:
--------------------------------------------------------------------------------
1 | """
2 | This program analyzes an integer sequence using a Uni-directional Recurrent
3 | Neural Network (RNN) with Long Short-Term Memory (LSTM) or Gated Recurrent
4 | Unit (GRU) layers, based on the Python library Keras.
5 | 
6 | Input data comes from a generator; training is done via model.fit_generator().
7 | 
8 | "Keras is a minimalist, highly modular neural networks library, written in
9 | Python and capable of running on top of either TensorFlow or Theano."
10 | ---- Keras (http://keras.io/) 11 | 12 | It is based on this Keras example - lstm_text_generation: 13 | https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py 14 | 15 | Author: Chang Liu (fluency03) 16 | Data: 2016-04-01 17 | """ 18 | 19 | import glob 20 | # import os 21 | import sys 22 | import csv 23 | import time 24 | import matplotlib.pyplot as plt 25 | import numpy as np 26 | 27 | from keras.callbacks import Callback, ModelCheckpoint 28 | from keras.layers import Activation, Dense, Dropout, LSTM, GRU 29 | from keras.layers.wrappers import TimeDistributed 30 | from keras.models import Sequential 31 | from keras.optimizers import RMSprop # pylint: disable=W0611 32 | from keras.utils.visualize_util import plot 33 | 34 | 35 | # random number generator with a fixed value for reproducibility 36 | np.random.seed(1337) 37 | 38 | 39 | def override(f): 40 | """ 41 | Override decorator. 42 | """ 43 | return f 44 | 45 | 46 | class SequenceAnalyzer(object): 47 | """ 48 | Sequence analyzer based on RNN Sequential Model. 49 | """ 50 | def __init__(self, sentence_length, input_len, hidden_len, output_len): 51 | self.sentence_length = sentence_length 52 | self.input_len = input_len 53 | self.hidden_len = hidden_len 54 | self.output_len = output_len 55 | self.model = Sequential() 56 | 57 | def build(self, layer='LSTM', mapping='m2m', learning_rate=0.001, 58 | nb_layers=2, dropout=0.2): 59 | """ 60 | Stacked RNN with specified dropout rate (default 0.2), built with 61 | softmax activation, cross entropy loss and rmsprop optimizer. 62 | 63 | Arguments: 64 | layer: {string}, the type of the layers in the RNN Model. 65 | 'LSTM': LSTM layers 66 | 'GRU': GRU layers 67 | mapping: {string}, input to output mapping. 68 | 'o2o': one-to-one 69 | 'm2m': many-to-many 70 | learning_rate: {float}, learning rate. 71 | nb_layers: {integer}, number of layers in total. 72 | dropout: {float}, dropout value. 73 | """ 74 | print "Building Model..." 75 | print (" layer = %d-%s , mapping = %s , learning rate = %.5f " 76 | "nb_layers = %d , dropout = %.2f" 77 | %(self.hidden_len, layer, mapping, learning_rate, 78 | nb_layers, dropout)) 79 | 80 | # check the layer type: LSTM or GRU 81 | if layer == 'LSTM': 82 | class LAYER(LSTM): 83 | """ 84 | LAYER as LSTM. 85 | """ 86 | pass 87 | elif layer == 'GRU': 88 | class LAYER(GRU): 89 | """ 90 | LAYER as GRU. 91 | """ 92 | pass 93 | 94 | # check whether return sequence for each of the layers 95 | return_sequences = [] 96 | if mapping == 'o2o': 97 | # if mapping is one-to-one 98 | for nl in range(nb_layers): 99 | if nl == nb_layers-1: 100 | return_sequences.append(False) 101 | else: 102 | return_sequences.append(True) 103 | elif mapping == 'm2m': 104 | # if mapping is many-to-many 105 | for _ in range(nb_layers): 106 | return_sequences.append(True) 107 | 108 | # first layer RNN with specified number of nodes in the hidden layer. 
109 | self.model.add(LAYER(self.hidden_len, 110 | return_sequences=return_sequences[0], 111 | input_shape=(self.sentence_length, 112 | self.input_len))) 113 | self.model.add(Dropout(dropout)) 114 | 115 | # the following layers 116 | for nl in range(nb_layers-1): 117 | self.model.add(LAYER(self.hidden_len, 118 | return_sequences=return_sequences[nl+1])) 119 | self.model.add(Dropout(dropout)) 120 | 121 | if mapping == 'o2o': 122 | # if mapping is one-to-one 123 | self.model.add(Dense(self.output_len)) 124 | elif mapping == 'm2m': 125 | # if mapping is many-to-many 126 | self.model.add(TimeDistributed(Dense(self.output_len))) 127 | 128 | self.model.add(Activation('softmax')) 129 | 130 | rms = RMSprop(lr=learning_rate) 131 | self.model.compile(loss='categorical_crossentropy', 132 | optimizer=rms, 133 | metrics=['accuracy']) 134 | 135 | def save_model(self, filename, overwrite=False): 136 | """ 137 | Save the model weight into a hdf5 file. 138 | 139 | Arguments: 140 | filename: {string}, the name/path to the file 141 | to which the weights are going to be saved. 142 | overwrite: {bool}, overwrite existing file. 143 | """ 144 | print "Save Weights %s ..." %filename 145 | self.model.save_weights(filename, overwrite=overwrite) 146 | 147 | def load_model(self, filename): 148 | """ 149 | Load the model weight into a hdf5 file. 150 | 151 | Arguments: 152 | filename: {string}, the name/path to the file 153 | to which the weights are going to be loaded. 154 | """ 155 | print "Load Weights %s ..." %filename 156 | self.model.load_weights(filename) 157 | 158 | def plot_model(self, filename='rnn_model.png'): 159 | """ 160 | Plot model. 161 | 162 | Arguments: 163 | filename: {string}, the name/path to the file 164 | to which the weights are going to be plotted. 165 | """ 166 | print "Plot Model %s ..." %filename 167 | plot(self.model, to_file=filename) 168 | 169 | 170 | class History(Callback): 171 | """ 172 | Record the loss and accuracy history. 173 | """ 174 | @override 175 | def on_train_begin(self, logs={}): # pylint: disable=W0102 176 | """ 177 | A method starting at the begining of the training. 178 | 179 | Arguments: 180 | logs: {dictionary}, recording the training and validation 181 | losses and accuracy of every epoch. 182 | """ 183 | # training loss and accuracy 184 | self.train_losses = [] 185 | self.train_acc = [] 186 | # validation loss and accuracy 187 | self.val_losses = [] 188 | self.val_acc = [] 189 | 190 | @override 191 | def on_epoch_end(self, epoch, logs={}): # pylint: disable=W0102 192 | """ 193 | A method starting at the begining of the training. 194 | 195 | Arguments: 196 | epoch: {integer}, the current epoch. 197 | logs: {dictionary}, recording the training and validation 198 | losses and accuracy of every epoch. 
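
        After every epoch these values are also appended as one row to
        'history.csv', in the column order: loss, acc, val_loss, val_acc.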
199 | """ 200 | # record training loss and accuracy 201 | self.train_losses.append(logs.get('loss')) 202 | self.train_acc.append(logs.get('acc')) 203 | # record validation loss and accuracy 204 | self.val_losses.append(logs.get('val_loss')) 205 | self.val_acc.append(logs.get('val_acc')) 206 | 207 | # continutously save the train_loss, train_acc, val_loss, val_acc 208 | # into a csv file with 4 columns respeactively 209 | csv_name = 'history.csv' 210 | with open(csv_name, 'a') as csvfile: 211 | his_writer = csv.writer(csvfile) 212 | print "\n Save loss and accuracy into %s" %csv_name 213 | his_writer.writerow((logs.get('loss'), logs.get('acc'), 214 | logs.get('val_loss'), logs.get('val_acc'))) 215 | 216 | 217 | def sample(prob, temperature=0.2): 218 | """ 219 | Softmax function for reinforcement learning. 220 | 221 | Arguments: 222 | prob: {list}, a list of probabilities of each of the classes. 223 | temperature: {float}, Softmax temperature. 224 | Returns: 225 | {integer}, the most possible sample. 226 | """ 227 | prob = np.log(prob) / temperature 228 | prob = np.exp(prob) / np.sum(np.exp(prob)) 229 | return np.argmax(np.random.multinomial(1, prob, 1)) 230 | 231 | 232 | def get_sequence(filepath): 233 | """ 234 | Get the original sequence from file. 235 | 236 | Arguments: 237 | filename: {string}, the name/path of input log sequence file. 238 | Returns: 239 | {list}, the log sequence. 240 | {integer}, the size of vocabulary. 241 | """ 242 | # read file and convert ids of each line into array of numbers 243 | seqfiles = glob.glob(filepath) 244 | sequence = [] 245 | 246 | for seqfile in seqfiles: 247 | with open(seqfile, 'r') as f: 248 | one_sequence = [int(id_) for id_ in f] 249 | print " %s, sequence length: %d" %(seqfile, 250 | len(one_sequence)) 251 | sequence.extend(one_sequence) 252 | 253 | # add two extra positions for 'unknown-log' and 'no-log' 254 | vocab_size = max(sequence) + 2 255 | 256 | return sequence, vocab_size 257 | 258 | 259 | def data_generator(sequence, vocab_size, mapping='m2m', sentence_length=40, 260 | step=3, random_offset=True, batch_size=64): 261 | """ 262 | Retrieves data from a plain txt file and formats it using one-hot vector. 263 | This method returns a data generator yeilding a batch of (X_train, y_train) 264 | every time being called. 265 | 266 | Arguments: 267 | sequence: {lsit}, the original input sequence 268 | vocab_size: {integer}, the number of unique id classes 269 | mapping: {string}, input to output mapping. 270 | 'o2o': one-to-one 271 | 'm2m': many-to-many 272 | sentence_length: {integer}, the length of each training sentence. 273 | step: {integer}, the sample steps. 274 | random_offset: {bool}, the offset is random between step or is 0. 275 | batch_size: {integer}, the number of sample per batch. 
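
    The generator loops over the sequence indefinitely. When random_offset
    is True, a fresh start offset in [0, step) is drawn at the beginning of
    every pass, and a batch is yielded each time batch_size samples have
    been filled in.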
276 | Yields: 277 | {np.array}, training input data X 278 | {np.array}, training target data y 279 | """ 280 | # the number of current sample 281 | sample_count = 0 282 | 283 | # one-hot vector (all zeros except for a single one at 284 | # the exact postion of this id number) 285 | X_train = np.zeros((batch_size, sentence_length, vocab_size), dtype=np.bool) 286 | 287 | # expected outputs for each sentence 288 | if mapping == 'o2o': 289 | # if mapping is one-to-one 290 | y_train = np.zeros((batch_size, vocab_size), dtype=np.bool) 291 | elif mapping == 'm2m': 292 | # if mapping is many-to-many 293 | y_train = np.zeros((batch_size, sentence_length, vocab_size), 294 | dtype=np.bool) 295 | 296 | # continuousy creat batch data and next sentences 297 | while True: 298 | offset = np.random.randint(0, step) if random_offset else 0 299 | for i in range(offset, len(sequence) - sentence_length, step): 300 | # index of a this sample in this batch 301 | batch_index = sample_count % batch_size 302 | # print sample_count 303 | # print batch_index 304 | 305 | # re-initialzing the batch 306 | if batch_index == 0: 307 | X_train.fill(0) 308 | y_train.fill(0) 309 | 310 | # current sample and target outputs 311 | X_sentence = [] 312 | y_sentence = [] 313 | next_id = [] 314 | 315 | X_sentence = sequence[i : i + sentence_length] 316 | if mapping == 'o2o': 317 | # if mapping is one-to-one 318 | next_id = sequence[i + sentence_length] 319 | elif mapping == 'm2m': 320 | # if mapping is many-to-many 321 | y_sentence = sequence[i + 1 : i + sentence_length + 1] 322 | 323 | for t, id_ in enumerate(X_sentence): 324 | # mark the each corresponding character in a sentence as 1 325 | X_train[batch_index, t, id_] = 1 326 | # if mapping is many-to-many 327 | if mapping == 'm2m': 328 | y_train[batch_index, t, y_sentence[t]] = 1 329 | # if mapping is one-to-one 330 | # mark the corresponding character in expected output as 1 331 | if mapping == 'o2o': 332 | y_train[batch_index, next_id] = 1 333 | 334 | # sample count plus 1 335 | sample_count += 1 336 | 337 | if batch_index == batch_size-1: 338 | yield X_train, y_train 339 | 340 | 341 | def predict(sequence, input_len, analyzer, nb_predictions=80, 342 | mapping='m2m', sentence_length=40): 343 | """ 344 | Predict the next sequences using existing model and weights given some seed. 345 | 346 | Arguments: 347 | sequence: {lsit}, the original input sequence 348 | input_len: {integer}, the number of unique id classes 349 | analyzer: {SequenceAnalyzer}, the sequence analyzer 350 | nb_predictions: {integer}, number of predictions after giving the seed 351 | mapping: {string}, input to output mapping. 352 | 'o2o': one-to-one 353 | 'm2m': many-to-many 354 | sentence_length: {integer}, the length of each sentence. 
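
    With mapping='m2m' the model returns one probability distribution per
    time step, so a whole predicted sentence is printed and compared with
    y_true; with mapping='o2o' it returns a single distribution over the
    next id and only that id is printed.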
355 | """ 356 | # generate elements 357 | for _ in range(nb_predictions): 358 | # start index of the seed, random number in range 359 | start_index = np.random.randint(0, len(sequence) - sentence_length - 1) 360 | # seed sentence 361 | sentence = sequence[start_index : start_index + sentence_length] 362 | 363 | # Y_true 364 | y_true = sequence[start_index + 1 : start_index + sentence_length + 1] 365 | print "X: " + ' '.join(str(s).ljust(4) for s in sentence) 366 | 367 | seed = np.zeros((1, sentence_length, input_len)) 368 | # format input 369 | for t in range(0, sentence_length): 370 | seed[0, t, sentence[t]] = 1 371 | 372 | # get predictions 373 | # verbose = 0, no logging 374 | predictions = analyzer.model.predict(seed, verbose=0)[0] 375 | 376 | # y_predicted 377 | if mapping == 'o2o': 378 | next_id = np.argmax(predictions) 379 | sys.stdout.write(' ' + str(next_id)) 380 | sys.stdout.flush() 381 | elif mapping == 'm2m': 382 | next_sentence = [] 383 | for pred in predictions: 384 | next_sentence.append(np.argmax(pred)) 385 | print "y_pred: " + ' '.join(str(id_).ljust(4) 386 | for id_ in next_sentence) 387 | # next_id = np.argmax(predictions[-1]) 388 | 389 | # y_true 390 | print "y_true: " + ' '.join(str(s).ljust(4) for s in y_true) 391 | 392 | print "\n" 393 | 394 | 395 | def train(analyzer, train_data, nb_training_samples, 396 | val_data, nb_validation_samples, 397 | nb_epoch=50, nb_iterations=4): 398 | """ 399 | Trains the network. 400 | 401 | Arguments: 402 | analyzer: {SequenceAnalyzer}. 403 | train_data: {tuple}, training data (X_train, y_train). 404 | val_data: {tuple}, validation data (X_val, y_val). 405 | nb_training_samples: {integer}, the number training samples. 406 | nb_validation_samples: {integer}, the number validation samples. 407 | nb_iterations: {integer}, number of iterations. 408 | sentence_length: {integer}, the length of each training sentence. 409 | """ 410 | for iteration in range(1, nb_iterations+1): 411 | print "" 412 | print "------------------------ Start Training ------------------------" 413 | print "Iteration: ", iteration 414 | print "Number of epoch per iteration: ", nb_epoch 415 | 416 | # history of losses and accuracy 417 | history = History() 418 | 419 | # saves the model weights after each epoch 420 | # if the validation loss decreased 421 | checkpointer = ModelCheckpoint(filepath="weights.hdf5", 422 | verbose=1, save_best_only=True) 423 | 424 | # train the model with data generator 425 | analyzer.model.fit_generator(train_data, 426 | samples_per_epoch=nb_training_samples, 427 | nb_epoch=nb_epoch, verbose=1, 428 | callbacks=[history, checkpointer], 429 | validation_data=val_data, 430 | nb_val_samples=nb_validation_samples) 431 | 432 | analyzer.save_model("weights-after-iteration.hdf5") 433 | 434 | 435 | def detect(sequence, input_len, analyzer, mapping='m2m', sentence_length=40): 436 | """ 437 | Scan the given sequence for detecting anormalies. 438 | 439 | Arguments: 440 | sequence: {lsit}, the original input sequence 441 | input_len: {integer}, the number of unique id classes 442 | analyzer: {SequenceAnalyzer}, the sequence analyzer 443 | mapping: {string}, input to output mapping. 444 | 'o2o': one-to-one 445 | 'm2m': many-to-many 446 | sentence_length: {integer}, the length of each sentence. 
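
    Note: the scan issues one model.predict() call per position, i.e.
    len(sequence) - sentence_length forward passes in total, so it can be
    slow on long sequences.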
447 | """ 448 | # sequence length 449 | length = len(sequence) 450 | 451 | # predicted probabilities for each id 452 | # we assume the first sentence_length ids are true 453 | prob = [1] * sentence_length + [0] * (length - sentence_length) 454 | 455 | start_time = time.time() 456 | try: 457 | # generate elements 458 | for start_index in xrange(length - sentence_length): 459 | # seed sentence 460 | X = sequence[start_index : start_index + sentence_length] 461 | # print "X: " + ' '.join(str(s).ljust(4) for s in sentence) 462 | 463 | # Y_true 464 | # y_true = sequence[start_index + 1 : start_index + sentence_length + 1] 465 | # print "y_true: " + ' '.join(str(s).ljust(4) for s in y_true) 466 | y_next_true = sequence[start_index + sentence_length] 467 | 468 | seed = np.zeros((1, sentence_length, input_len)) 469 | # format input 470 | for t in range(0, sentence_length): 471 | seed[0, t, X[t]] = 1 472 | 473 | # get predictionsverbose = 0, no logging 474 | predictions = analyzer.model.predict(seed, verbose=0)[0] 475 | 476 | # y_predicted 477 | y_next_pred = 0 478 | next_prob = 0 479 | if mapping == 'o2o': 480 | next_prob = predictions[y_next_true] 481 | prob[start_index + sentence_length] = next_prob 482 | y_next_pred = np.argmax(predictions) 483 | elif mapping == 'm2m': 484 | # next_sentence = [] 485 | # for pred in predictions: 486 | # next_sentence.append(np.argmax(pred)) 487 | # y_next_pred = next_sentence[-1] 488 | # print "y_pred: " + ' '.join(str(id_).ljust(4) 489 | # for id_ in next_sentence) 490 | y_next_pred = np.argmax(predictions[-1]) 491 | next_prob = predictions[-1][y_next_true] 492 | prob[start_index + sentence_length] = next_prob 493 | 494 | print start_index, next_prob 495 | except KeyboardInterrupt: 496 | # print " |-Write the clusters into %s ..." %self.cluster_file 497 | with open('prob.txt', 'w') as prob_file: 498 | for p in prob: 499 | prob_file.write(str(p) + '\n') 500 | 501 | plt.plot(prob, 'r*') 502 | plt.xlim(0, 1000) 503 | plt.ylim(0, 1) 504 | plt.savefig("prob.png") 505 | plt.clf() 506 | plt.cla() 507 | 508 | stop_time = time.time() 509 | print "--- %s seconds ---\n" % (stop_time - start_time) 510 | 511 | return prob 512 | 513 | 514 | def run(hidden_len=512, batch_size=128, nb_batch=200, nb_epoch=50, 515 | nb_iterations=4, learning_rate=0.001, nb_predictions=20, mapping='m2m', 516 | sentence_length=80, step=80, mode='train'): 517 | """ 518 | Train, evaluate, or predict. 519 | 520 | Arguments: 521 | hidden_len: {integer}, the size of a hidden layer. 522 | batch_size: {interger}, the number of sentences per batch. 523 | nb_batch: {integer}, number of batches to be trained durign each epoch. 524 | nb_epoch: {interger}, number of epoches per iteration. 525 | nb_iterations: {integer}, number of iterations. 526 | learning_rate: {float}, learning rate. 527 | nb_predictions: {integer}, number of the ids predicted. 528 | mapping: {string}, input to output mapping. 529 | 'o2o': one-to-one 530 | 'm2m': many-to-many 531 | sentence_length: {integer}, the length of each training sentence. 532 | step: {integer}, the sample steps. 533 | mode: {string}, th running mode of this programm 534 | 'train': train and predict 535 | 'predict': only predict by loading existing model weights 536 | 'evaluate': evaluate the model in evaluation data set 537 | 'detect': detect a new log sequence for the probabilities 538 | """ 539 | # get parameters and dimensions of the model 540 | print "Loading training data..." 
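    # The training and validation vocabulary sizes are merged below with
    # max(), so both data generators emit one-hot vectors of the same width
    # covering every id class seen in either data set.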
541 | train_sequence, input_len1 = get_sequence("./train_data/*") 542 | print "Loading validation data..." 543 | val_sequence, input_len2 = get_sequence("./validation_data/*") 544 | input_len = max(input_len1, input_len2) 545 | 546 | print "Training sequence length: %d" %len(train_sequence) 547 | print "Validation sequence length: %d" %len(val_sequence) 548 | print "#classes: %d\n" %input_len 549 | 550 | # data generator of X_train and y_train, with random offset 551 | train_data = data_generator(train_sequence, input_len, mapping=mapping, 552 | sentence_length=sentence_length, step=step, 553 | random_offset=True, batch_size=batch_size) 554 | 555 | # data generator of X_val and y _val, with random offset 556 | val_data = data_generator(val_sequence, input_len, mapping=mapping, 557 | sentence_length=sentence_length, step=step, 558 | random_offset=True, batch_size=batch_size) 559 | 560 | # two layered LSTM 512 hidden nodes and a dropout rate of 0.2 561 | rnn = SequenceAnalyzer(sentence_length, input_len, hidden_len, input_len) 562 | 563 | # build model 564 | rnn.build(layer='LSTM', mapping=mapping, learning_rate=learning_rate, 565 | nb_layers=2, dropout=0.2) 566 | 567 | # plot model 568 | # rnn.plot_model() 569 | 570 | # load the previous model weights 571 | # rnn.load_model("weightsf4-61.hdf5") 572 | 573 | if mode == 'predict': 574 | print "Predict..." 575 | predict(val_sequence, input_len, rnn, nb_predictions=nb_predictions, 576 | mapping=mapping, sentence_length=sentence_length) 577 | elif mode == 'evaluate': 578 | print "Evaluate..." 579 | print "Metrics: " + ', '.join(rnn.model.metrics_names) 580 | X_val, y_val = data_generator(val_sequence, input_len, mapping=mapping, 581 | sentence_length=sentence_length, 582 | step=step, random_offset=False, 583 | batch_size=batch_size) 584 | results = rnn.model.evaluate(X_val, y_val, #pylint: disable=W0612 585 | batch_size=batch_size, 586 | verbose=1) 587 | print "Loss: ", results[0] 588 | print "Accuracy: ", results[1] 589 | elif mode == 'train': 590 | print "Train..." 591 | # number of training sampes and validation samples 592 | nb_training_samples = batch_size * nb_batch 593 | nb_validation_samples = int(nb_training_samples * 0.05) 594 | 595 | try: 596 | train(rnn, train_data, nb_training_samples, 597 | val_data, nb_validation_samples, 598 | nb_epoch=nb_epoch, nb_iterations=nb_iterations) 599 | except KeyboardInterrupt: 600 | rnn.save_model("weights-stop.hdf5") 601 | elif mode == 'detect': 602 | print "Detect..." 603 | detect(val_sequence, input_len, rnn, mapping=mapping, 604 | sentence_length=sentence_length) 605 | else: 606 | print "The mode = %s is not correct!!!" %mode 607 | 608 | return mode 609 | 610 | 611 | if __name__ == '__main__': 612 | run() 613 | -------------------------------------------------------------------------------- /others/sequence_analyzer.py: -------------------------------------------------------------------------------- 1 | """ 2 | This program analyze the integer sequence using (Uni-directional and 3 | Bi-directional) Recurrent Neural Network (RNN) with Long Short-Term Memory 4 | (LSTM) and Gated Recurrent Unit (GRU) based on the python library Keras. 5 | 6 | "Keras is a minimalist, highly modular neural networks library, written in 7 | Python and capable of running on top of either TensorFlow or Theano." 
8 | ---- Keras (http://keras.io/) 9 | 10 | Uni-directional model is based on the Keras example - lstm_text_generation: 11 | https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py 12 | 13 | Bi-directional model is based on the Keras example - imdb_bidirectional_lstm.py: 14 | https://github.com/fchollet/keras/blob/master/examples/imdb_bidirectional_lstm.py 15 | 16 | Author: Chang Liu (fluency03) 17 | Data: 2016-03-27 18 | """ 19 | 20 | import glob 21 | # import os 22 | import sys 23 | import csv 24 | import time 25 | import matplotlib.pyplot as plt 26 | import numpy as np 27 | 28 | from keras.callbacks import Callback, ModelCheckpoint 29 | from keras.layers import Input, Activation, Dense, Dropout, LSTM, GRU, merge 30 | from keras.layers.wrappers import TimeDistributed 31 | from keras.models import Sequential, Model 32 | from keras.optimizers import RMSprop # pylint: disable=W0611 33 | from keras.utils.visualize_util import plot 34 | 35 | 36 | # random number generator with a fixed value for reproducibility 37 | np.random.seed(1337) 38 | 39 | 40 | def override(f): 41 | """ 42 | Override decorator. 43 | """ 44 | return f 45 | 46 | 47 | class SequenceAnalyzer(object): 48 | """ 49 | Sequence analyzer based on RNN. 50 | """ 51 | def __init__(self, sentence_length, input_len, hidden_len, output_len): 52 | self.sentence_length = sentence_length 53 | self.input_len = input_len 54 | self.hidden_len = hidden_len 55 | self.output_len = output_len 56 | # model is defined at child class 57 | self.model = None 58 | 59 | def build(self, layer, mapping, learning_rate, nb_layers, dropout): 60 | """ 61 | Build model. 62 | """ 63 | pass 64 | 65 | def save_model(self, filename, overwrite=False): 66 | """ 67 | Save the model weight into a hdf5 file. 68 | 69 | Arguments: 70 | filename: {string}, the name/path to the file 71 | to which the weights are going to be saved. 72 | overwrite: {bool}, overwrite existing file. 73 | """ 74 | print "Save Weights %s ..." %filename 75 | self.model.save_weights(filename, overwrite=overwrite) 76 | 77 | def load_model(self, filename): 78 | """ 79 | Load the model weight into a hdf5 file. 80 | 81 | Arguments: 82 | filename: {string}, the name/path to the file 83 | to which the weights are going to be loaded. 84 | """ 85 | print "Load Weights %s ..." %filename 86 | self.model.load_weights(filename) 87 | 88 | def plot_model(self, filename): 89 | """ 90 | Plot model. 91 | 92 | Arguments: 93 | filename: {string}, the name/path to the file 94 | to which the model graphic is plotted. 95 | """ 96 | print "Plot Model %s ..." %filename 97 | plot(self.model, to_file=filename) 98 | 99 | 100 | class URNN(SequenceAnalyzer): 101 | """ 102 | Uni-directional RNN model of the sequence analyzer. Sequential Model. 103 | """ 104 | def __init__(self, sentence_length, input_len, hidden_len, output_len): 105 | super(URNN, self).__init__(sentence_length, 106 | input_len, hidden_len, output_len, 107 | return_sequence=True) 108 | self.model = Sequential() 109 | 110 | @override 111 | def build(self, layer='LSTM', mapping='m2m', learning_rate=0.001, 112 | nb_layers=2, dropout=0.2): 113 | """ 114 | Stacked RNN with specified dropout rate (default 0.2), built with 115 | softmax activation, cross entropy loss and rmsprop optimizer. 116 | 117 | Arguments: 118 | layer: {string}, the type of the layers in the RNN Model. 119 | 'LSTM': LSTM layers 120 | 'GRU': GRU layers 121 | mapping: {string}, input to output mapping. 
122 | 'o2o': one-to-one 123 | 'm2m': many-to-many 124 | learning_rate: {float}, learning rate. 125 | nb_layers: {integer}, number of layers in total. 126 | dropout: {float}, dropout value. 127 | """ 128 | print "Building Model..." 129 | print (" layer = %d-%s , mapping = %s , learning rate = %.5f " 130 | "nb_layers = %d , dropout = %.2f" 131 | %(self.hidden_len, layer, mapping, learning_rate, 132 | nb_layers, dropout)) 133 | 134 | # check the layer type: LSTM or GRU 135 | if layer == 'LSTM': 136 | class LAYER(LSTM): 137 | """ 138 | LAYER as LSTM. 139 | """ 140 | pass 141 | elif layer == 'GRU': 142 | class LAYER(GRU): 143 | """ 144 | LAYER as GRU. 145 | """ 146 | pass 147 | 148 | # check whether return sequence for each of the layers 149 | return_sequences = [] 150 | if mapping == 'o2o': 151 | # if mapping is one-to-one 152 | for nl in range(nb_layers): 153 | if nl == nb_layers-1: 154 | return_sequences.append(False) 155 | else: 156 | return_sequences.append(True) 157 | elif mapping == 'm2m': 158 | # if mapping is many-to-many 159 | for _ in range(nb_layers): 160 | return_sequences.append(True) 161 | 162 | # first layer RNN with specified number of nodes in the hidden layer. 163 | self.model.add(LAYER(self.hidden_len, 164 | return_sequences=return_sequences[0], 165 | input_shape=(self.sentence_length, 166 | self.input_len))) 167 | self.model.add(Dropout(dropout)) 168 | 169 | # the following layers 170 | for nl in range(nb_layers-1): 171 | self.model.add(LAYER(self.hidden_len, 172 | return_sequences=return_sequences[nl+1])) 173 | self.model.add(Dropout(dropout)) 174 | 175 | if mapping == 'o2o': 176 | # if mapping is one-to-one 177 | self.model.add(Dense(self.output_len)) 178 | elif mapping == 'm2m': 179 | # if mapping is many-to-many 180 | self.model.add(TimeDistributed(Dense(self.output_len))) 181 | 182 | self.model.add(Activation('softmax')) 183 | 184 | rms = RMSprop(lr=learning_rate) 185 | self.model.compile(loss='categorical_crossentropy', 186 | optimizer=rms, 187 | metrics=['accuracy']) 188 | 189 | 190 | class BRNN(SequenceAnalyzer): 191 | """ 192 | Bi-directional RNN model of the sequence analyzer. Graph Model. 193 | """ 194 | def __init__(self, sentence_length, input_len, hidden_len, output_len): 195 | super(BRNN, self).__init__(sentence_length, 196 | input_len, hidden_len, output_len) 197 | 198 | @override 199 | def build(self, layer='LSTM', mapping='m2m', learning_rate=0.001, 200 | nb_layers=2, dropout=0.2): 201 | """ 202 | Bidirectional RNN with specified dropout rate (default 0.2), built with 203 | softmax activation, cross entropy loss and rmsprop optimizer. 204 | 205 | Arguments: 206 | layer: {string}, the type of the layers in the RNN Model. 207 | 'LSTM': LSTM layers 208 | 'GRU': GRU layers 209 | mapping: {string}, input to output mapping. 210 | 'o2o': one-to-one 211 | 'm2m': many-to-many 212 | learning_rate: {float}, learning rate. 213 | nb_layers: {integer}, number of layers in total. 214 | dropout: {float}, dropout value. 215 | """ 216 | print "Building Model..." 217 | print (" layer = %d-%s , mapping = %s , " 218 | "nb_layers = %d , dropout = %.2f" 219 | %(self.hidden_len, layer, mapping, nb_layers, dropout)) 220 | 221 | # check the layer type: LSTM or GRU 222 | if layer == 'LSTM': 223 | class LAYER(LSTM): 224 | """ 225 | LAYER as LSTM. 226 | """ 227 | pass 228 | elif layer == 'GRU': 229 | class LAYER(GRU): 230 | """ 231 | LAYER as GRU. 
232 | """ 233 | pass 234 | 235 | # check whether return sequence for each of the layers 236 | return_sequences = [] 237 | if mapping == 'o2o': 238 | # if mapping is one-to-one 239 | for nl in range(nb_layers): 240 | if nl == nb_layers-1: 241 | return_sequences.append(False) 242 | else: 243 | return_sequences.append(True) 244 | elif mapping == 'm2m': 245 | # if mapping is many-to-many 246 | for _ in range(nb_layers): 247 | return_sequences.append(True) 248 | 249 | # add input 250 | input_layer = Input(shape=(self.sentence_length, self.input_len), 251 | dtype='float32') 252 | 253 | # first Bi-directional LSTM layer 254 | forward1 = LAYER(self.hidden_len, 255 | return_sequences=return_sequences[0])(input_layer) 256 | forward_dropout1 = Dropout(dropout)(forward1) # pylint: disable=W0612 257 | backward1 = LAYER(self.hidden_len, 258 | return_sequences=return_sequences[0], 259 | go_backwards=True)(input_layer) 260 | backward_dropout1 = Dropout(dropout)(backward1) # pylint: disable=W0612 261 | 262 | # following Bi-directional layers 263 | for nl in range(nb_layers-1): 264 | exec("%s = LAYER(self.hidden_len, return_sequences=%s)(%s)" 265 | %('forward' + str(nl+2), 266 | return_sequences[nl+1], 267 | 'forward_dropout' + str(nl+1))) 268 | exec("%s = Dropout(dropout)(%s)" 269 | %('forward_dropout' + str(nl+2), 270 | 'forward' + str(nl+2))) 271 | exec(("%s = LAYER(self.hidden_len, return_sequences=%s, " 272 | "go_backwards=True)(%s)") 273 | %('backward' + str(nl+2), 274 | return_sequences[nl+1], 275 | 'backward_dropout' + str(nl+1))) 276 | exec("%s = Dropout(dropout)(%s)" 277 | %('backward_dropout' + str(nl+2), 278 | 'backward' + str(nl+2))) 279 | 280 | merged_layer = merge([locals()['forward_dropout' + str(nb_layers)], 281 | locals()['backward_dropout' + str(nb_layers)]], 282 | mode='concat', concat_axis=-1) 283 | 284 | if mapping == 'o2o': 285 | output_layer = Dense(self.output_len, 286 | activation='softmax')(merged_layer) 287 | elif mapping == 'm2m': 288 | output_layer = TimeDistributed( 289 | Dense(self.output_len, activation='softmax'))(merged_layer) 290 | 291 | # add ouput 292 | self.model = Model(input=input_layer, output=output_layer) 293 | 294 | rms = RMSprop(lr=learning_rate) 295 | # try using different optimizers and different optimizer configs 296 | self.model.compile(loss='categorical_crossentropy', 297 | optimizer=rms, 298 | metrics=['accuracy']) 299 | 300 | 301 | class History(Callback): 302 | """ 303 | Record the loss and accuracy history. 304 | """ 305 | @override 306 | def on_train_begin(self, logs={}): # pylint: disable=W0102 307 | """ 308 | A method starting at the begining of the training. 309 | 310 | Arguments: 311 | logs: {dictionary}, recording the training and validation 312 | losses and accuracy of every epoch. 313 | """ 314 | # training loss and accuracy 315 | self.train_losses = [] 316 | self.train_acc = [] 317 | # validation loss and accuracy 318 | self.val_losses = [] 319 | self.val_acc = [] 320 | 321 | @override 322 | def on_epoch_end(self, epoch, logs={}): # pylint: disable=W0102 323 | """ 324 | A method starting at the begining of the training. 325 | 326 | Arguments: 327 | epoch: {integer}, the current epoch. 328 | logs: {dictionary}, recording the training and validation 329 | losses and accuracy of every epoch. 
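
        Besides appending to 'history.csv', the callback keeps the metrics
        in memory in self.train_losses, self.train_acc, self.val_losses and
        self.val_acc for later inspection or plotting.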
330 | """ 331 | # record training loss and accuracy 332 | self.train_losses.append(logs.get('loss')) 333 | self.train_acc.append(logs.get('acc')) 334 | # record validation loss and accuracy 335 | self.val_losses.append(logs.get('val_loss')) 336 | self.val_acc.append(logs.get('val_acc')) 337 | 338 | # continutously save the train_loss, train_acc, val_loss, val_acc 339 | # into a csv file with 4 columns respeactively 340 | csv_name = 'history.csv' 341 | with open(csv_name, 'a') as csvfile: 342 | his_writer = csv.writer(csvfile) 343 | print "\n Save loss and accuracy into %s" %csv_name 344 | his_writer.writerow((logs.get('loss'), logs.get('acc'), 345 | logs.get('val_loss'), logs.get('val_acc'))) 346 | 347 | 348 | def sample(prob, temperature=0.2): 349 | """ 350 | Softmax function for reinforcement learning. 351 | 352 | Arguments: 353 | prob: {list}, a list of probabilities of each of the classes. 354 | temperature: {float}, Softmax temperature. 355 | Returns: 356 | {integer}, the most possible sample. 357 | """ 358 | prob = np.log(prob) / temperature 359 | prob = np.exp(prob) / np.sum(np.exp(prob)) 360 | return np.argmax(np.random.multinomial(1, prob, 1)) 361 | 362 | 363 | def get_sequence(filepath): 364 | """ 365 | Get the original sequence from file. 366 | 367 | Arguments: 368 | filename: {string}, the name/path of input log sequence file. 369 | Returns: 370 | {list}, the log sequence. 371 | {integer}, the size of vocabulary. 372 | """ 373 | # read file and convert ids of each line into array of numbers 374 | seqfiles = glob.glob(filepath) 375 | sequence = [] 376 | 377 | for seqfile in seqfiles: 378 | with open(seqfile, 'r') as f: 379 | one_sequence = [int(id_) for id_ in f] 380 | print " %s, sequence length: %d" %(seqfile, 381 | len(one_sequence)) 382 | sequence.extend(one_sequence) 383 | 384 | # add two extra positions for 'unknown-log' and 'no-log' 385 | vocab_size = max(sequence) + 2 386 | 387 | return sequence, vocab_size 388 | 389 | 390 | def get_data(sequence, vocab_size, mapping='m2m', sentence_length=40, step=3, 391 | random_offset=True): 392 | """ 393 | Retrieves data from a plain txt file and formats it using one-hot vector. 394 | 395 | Arguments: 396 | sequence: {lsit}, the original input sequence 397 | vocab_size: {integer}, the number of unique id classes 398 | mapping: {string}, input to output mapping. 399 | 'o2o': one-to-one 400 | 'm2m': many-to-many 401 | sentence_length: {integer}, the length of each training sentence. 402 | step: {integer}, the sample steps. 403 | random_offset: {bool}, the offset is random between step or is 0. 
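
    Unlike the generator-based variants, this function materializes the
    whole one-hot encoded data set in memory at once: X_train alone holds
    roughly nb_samples * sentence_length * vocab_size boolean entries.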
404 | Returns: 405 | {np.array}, training input data X 406 | {np.array}, training target data y 407 | """ 408 | X_sentences = [] 409 | y_sentences = [] 410 | next_ids = [] 411 | 412 | offset = np.random.randint(0, step) if random_offset else 0 413 | 414 | # creat batch data and next sentences 415 | for i in range(offset, len(sequence) - sentence_length, step): 416 | X_sentences.append(sequence[i : i + sentence_length]) 417 | if mapping == 'o2o': 418 | # if mapping is one-to-one 419 | next_ids.append(sequence[i + sentence_length]) 420 | elif mapping == 'm2m': 421 | # if mapping is many-to-many 422 | y_sentences.append(sequence[i + 1 : i + sentence_length + 1]) 423 | 424 | # number of sampes 425 | nb_samples = len(X_sentences) 426 | # print "total # of sentences: %d" %nb_samples 427 | 428 | # one-hot vector (all zeros except for a single one at 429 | # the exact postion of this id number) 430 | X_train = np.zeros((nb_samples, sentence_length, vocab_size), dtype=np.bool) 431 | # expected outputs for each sentence 432 | if mapping == 'o2o': 433 | # if mapping is one-to-one 434 | y_train = np.zeros((nb_samples, vocab_size), dtype=np.bool) 435 | elif mapping == 'm2m': 436 | # if mapping is many-to-many 437 | y_train = np.zeros((nb_samples, sentence_length, vocab_size), 438 | dtype=np.bool) 439 | 440 | for i, x_sentence in enumerate(X_sentences): 441 | for t, id_ in enumerate(x_sentence): 442 | # mark the each corresponding character in a sentence as 1 443 | X_train[i, t, id_] = 1 444 | # if mapping is many-to-many 445 | if mapping == 'm2m': 446 | y_train[i, t, y_sentences[i][t]] = 1 447 | # if mapping is one-to-one 448 | # mark the corresponding character in expected output as 1 449 | if mapping == 'o2o': 450 | y_train[i, next_ids[i]] = 1 451 | 452 | return X_train, y_train 453 | 454 | 455 | def predict(sequence, input_len, analyzer, nb_predictions=80, 456 | mapping='m2m', sentence_length=40): 457 | """ 458 | Predict the next sequences using existing model and weights given some seed. 459 | 460 | Arguments: 461 | sequence: {lsit}, the original input sequence 462 | input_len: {integer}, the number of unique id classes 463 | analyzer: {SequenceAnalyzer}, the sequence analyzer 464 | nb_predictions: {integer}, number of predictions after giving the seed 465 | mapping: {string}, input to output mapping. 466 | 'o2o': one-to-one 467 | 'm2m': many-to-many 468 | sentence_length: {integer}, the length of each sentence. 
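
    The predicted id at each step is taken with np.argmax. The sample()
    helper defined above could be used instead to draw stochastically from
    the softmax output with a temperature, if more varied predictions are
    wanted; that is not done here.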
469 | """ 470 | # generate elements 471 | for _ in range(nb_predictions): 472 | # start index of the seed, random number in range 473 | start_index = np.random.randint(0, len(sequence) - sentence_length - 1) 474 | # seed sentence 475 | sentence = sequence[start_index : start_index + sentence_length] 476 | 477 | # Y_true 478 | y_true = sequence[start_index + 1 : start_index + sentence_length + 1] 479 | print "X: " + ' '.join(str(s).ljust(4) for s in sentence) 480 | 481 | seed = np.zeros((1, sentence_length, input_len)) 482 | # format input 483 | for t in range(0, sentence_length): 484 | seed[0, t, sentence[t]] = 1 485 | 486 | # get predictions 487 | # verbose = 0, no logging 488 | predictions = analyzer.model.predict(seed, verbose=0)[0] 489 | 490 | # y_predicted 491 | if mapping == 'o2o': 492 | next_id = np.argmax(predictions) 493 | sys.stdout.write(' ' + str(next_id)) 494 | sys.stdout.flush() 495 | elif mapping == 'm2m': 496 | next_sentence = [] 497 | for pred in predictions: 498 | next_sentence.append(np.argmax(pred)) 499 | print "y_pred: " + ' '.join(str(id_).ljust(4) 500 | for id_ in next_sentence) 501 | # next_id = np.argmax(predictions[-1]) 502 | 503 | # y_true 504 | print "y_true: " + ' '.join(str(s).ljust(4) for s in y_true) 505 | 506 | print "\n" 507 | 508 | 509 | def train(analyzer, train_sequence, val_sequence, input_len, 510 | batch_size=128, nb_epoch=50, nb_iterations=4, 511 | sentence_length=40, step=40, mapping='m2m'): 512 | """ 513 | Trains the network. 514 | 515 | Arguments: 516 | analyzer: {SequenceAnalyzer}. 517 | train_sequence: {list}, training sequence. 518 | val_sequence: {list}, validation sequence. 519 | input_len: {integer}, the number of classes, i.e., the input length of 520 | neural network. 521 | batch_size: {interger}, the number of sentences per batch. 522 | nb_epoch: {integer}, number of epoches per iteration. 523 | nb_iterations: {integer}, number of iterations. 524 | sentence_length: {integer}, the length of each training sentence. 525 | step: {integer}, the sample steps. 526 | mapping: {string}, input to output mapping. 527 | 'o2o': one-to-one 528 | 'm2m': many-to-many 529 | """ 530 | for iteration in range(1, nb_iterations+1): 531 | # create training data, randomize the offset between steps 532 | X_train, y_train = get_data(train_sequence, input_len, mapping=mapping, 533 | sentence_length=sentence_length, step=step, 534 | random_offset=False) 535 | X_val, y_val = get_data(val_sequence, input_len, mapping=mapping, 536 | sentence_length=sentence_length, step=step, 537 | random_offset=False) 538 | print "" 539 | print "------------------------ Start Training ------------------------" 540 | print "Iteration: ", iteration 541 | print "Number of epoch per iteration: ", nb_epoch 542 | 543 | # history of losses and accuracy 544 | history = History() 545 | 546 | # saves the model weights after each epoch 547 | # if the validation loss decreased 548 | checkpointer = ModelCheckpoint(filepath="weights.hdf5", 549 | verbose=1, save_best_only=True) 550 | 551 | # train the model 552 | analyzer.model.fit(X_train, y_train, 553 | batch_size=batch_size, nb_epoch=nb_epoch, verbose=1, 554 | callbacks=[history, checkpointer], 555 | validation_data=(X_val, y_val)) 556 | 557 | analyzer.save_model("weights-after-iteration.hdf5") 558 | 559 | 560 | def detect(sequence, input_len, analyzer, mapping='m2m', sentence_length=40): 561 | """ 562 | Scan the given sequence for detecting anormalies. 
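    The scan slides a window of sentence_length ids over the sequence and,
    for every position, records the probability the model assigns to the
    true next id; unusually low probabilities point at anomalous ids.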
563 | 564 | Arguments: 565 | sequence: {lsit}, the original input sequence 566 | input_len: {integer}, the number of unique id classes 567 | analyzer: {SequenceAnalyzer}, the sequence analyzer 568 | mapping: {string}, input to output mapping. 569 | 'o2o': one-to-one 570 | 'm2m': many-to-many 571 | sentence_length: {integer}, the length of each sentence. 572 | """ 573 | # sequence length 574 | length = len(sequence) 575 | 576 | # predicted probabilities for each id 577 | # we assume the first sentence_length ids are true 578 | prob = [1] * sentence_length + [0] * (length - sentence_length) 579 | 580 | start_time = time.time() 581 | try: 582 | # generate elements 583 | for start_index in xrange(length - sentence_length): 584 | # seed sentence 585 | X = sequence[start_index : start_index + sentence_length] 586 | # print "X: " + ' '.join(str(s).ljust(4) for s in sentence) 587 | 588 | # Y_true 589 | # y_true = sequence[start_index + 1 : start_index + sentence_length + 1] 590 | # print "y_true: " + ' '.join(str(s).ljust(4) for s in y_true) 591 | y_next_true = sequence[start_index + sentence_length] 592 | 593 | seed = np.zeros((1, sentence_length, input_len)) 594 | # format input 595 | for t in range(0, sentence_length): 596 | seed[0, t, X[t]] = 1 597 | 598 | # get predictionsverbose = 0, no logging 599 | predictions = analyzer.model.predict(seed, verbose=0)[0] 600 | 601 | # y_predicted 602 | y_next_pred = 0 603 | next_prob = 0 604 | if mapping == 'o2o': 605 | next_prob = predictions[y_next_true] 606 | prob[start_index + sentence_length] = next_prob 607 | y_next_pred = np.argmax(predictions) 608 | elif mapping == 'm2m': 609 | # next_sentence = [] 610 | # for pred in predictions: 611 | # next_sentence.append(np.argmax(pred)) 612 | # y_next_pred = next_sentence[-1] 613 | # print "y_pred: " + ' '.join(str(id_).ljust(4) 614 | # for id_ in next_sentence) 615 | y_next_pred = np.argmax(predictions[-1]) 616 | next_prob = predictions[-1][y_next_true] 617 | prob[start_index + sentence_length] = next_prob 618 | 619 | print start_index, next_prob 620 | except KeyboardInterrupt: 621 | # print " |-Write the clusters into %s ..." %self.cluster_file 622 | with open('prob.txt', 'w') as prob_file: 623 | for p in prob: 624 | prob_file.write(str(p) + '\n') 625 | 626 | plt.plot(prob, 'r*') 627 | plt.xlim(0, 1000) 628 | plt.ylim(0, 1) 629 | plt.savefig("prob.png") 630 | plt.clf() 631 | plt.cla() 632 | 633 | stop_time = time.time() 634 | print "--- %s seconds ---\n" % (stop_time - start_time) 635 | 636 | return prob 637 | 638 | 639 | def run(hidden_len=512, batch_size=128, nb_epoch=50, nb_iterations=4, 640 | learning_rate=0.001, nb_predictions=20, mapping='m2m', 641 | sentence_length=80, step=80, mode='train'): 642 | """ 643 | Train, evaluate, or predict. 644 | 645 | Arguments: 646 | hidden_len: {integer}, the size of a hidden layer. 647 | batch_size: {interger}, the number of sentences per batch. 648 | nb_epoch: {interger}, number of epoches per iteration. 649 | nb_iterations: {integer}, number of iterations. 650 | learning_rate: {float}, learning rate. 651 | nb_predictions: {integer}, number of the ids predicted. 652 | mapping: {string}, input to output mapping. 653 | 'o2o': one-to-one 654 | 'm2m': many-to-many 655 | sentence_length: {integer}, the length of each training sentence. 656 | step: {integer}, the sample steps. 
657 | mode: {string}, th running mode of this programm 658 | 'train': train and predict 659 | 'predict': only predict by loading existing model weights 660 | 'evaluate': evaluate the model in evaluation data set 661 | 'detect': detect a new log sequence for the probabilities 662 | """ 663 | # get parameters and dimensions of the model 664 | print "Loading training data..." 665 | train_sequence, input_len1 = get_sequence("./train_data/*") 666 | print "Loading validation data..." 667 | val_sequence, input_len2 = get_sequence("./validation_data/*") 668 | input_len = max(input_len1, input_len2) 669 | 670 | print "Training sequence length: %d" %len(train_sequence) 671 | print "Validation sequence length: %d" %len(val_sequence) 672 | print "#classes: %d\n" %input_len 673 | 674 | # two layered LSTM 512 hidden nodes and a dropout rate of 0.2 675 | analyzer = SequenceAnalyzer(sentence_length, 676 | input_len, hidden_len, input_len) 677 | 678 | # build model 679 | analyzer.build(layer='LSTM', mapping=mapping, learning_rate=learning_rate, 680 | nb_layers=2, dropout=0.2) 681 | 682 | # plot model 683 | # analyzer.plot_model() 684 | 685 | # load the previous model weights 686 | # analyzer.load_model("weightsf4-61.hdf5") 687 | 688 | if mode == 'predict': 689 | print "Predict..." 690 | predict(val_sequence, input_len, analyzer, 691 | nb_predictions=nb_predictions, mapping=mapping, 692 | sentence_length=sentence_length) 693 | elif mode == 'evaluate': 694 | print "Evaluate..." 695 | print "Metrics: " + ', '.join(analyzer.model.metrics_names) 696 | X_val, y_val = get_data(val_sequence, input_len, mapping=mapping, 697 | sentence_length=sentence_length, step=step, 698 | random_offset=False) 699 | results = analyzer.model.evaluate(X_val, y_val, #pylint: disable=W0612 700 | batch_size=batch_size, 701 | verbose=1) 702 | print "Loss: ", results[0] 703 | print "Accuracy: ", results[1] 704 | elif mode == 'train': 705 | print "Train..." 706 | try: 707 | train(analyzer, train_sequence, val_sequence, input_len, 708 | batch_size=batch_size, nb_epoch=nb_epoch, 709 | nb_iterations=nb_iterations, 710 | sentence_length=sentence_length, 711 | step=step, mapping=mapping) 712 | except KeyboardInterrupt: 713 | analyzer.save_model("weights-stop.hdf5") 714 | elif mode == 'detect': 715 | print "Detect..." 716 | detect(val_sequence, input_len, analyzer, mapping=mapping, 717 | sentence_length=sentence_length) 718 | else: 719 | print "The mode = %s is not correct!!!" %mode 720 | 721 | return mode 722 | 723 | 724 | if __name__ == '__main__': 725 | run() 726 | -------------------------------------------------------------------------------- /others/sequence_analyzer_gen.py: -------------------------------------------------------------------------------- 1 | """ 2 | This program analyze the integer sequence using (Uni-directional and 3 | Bi-directional) Recurrent Neural Network (RNN) with Long Short-Term Memory 4 | (LSTM) and Gated Recurrent Unit (GRU) based on the python library Keras. 5 | 6 | Input data is Generator and the training is by calling model.fit_generator(). 7 | 8 | "Keras is a minimalist, highly modular neural networks library, written in 9 | Python and capable of running on top of either TensorFlow or Theano." 
10 |                                     ---- Keras (http://keras.io/)
11 | 
12 | Uni-directional model is based on the Keras example - lstm_text_generation:
13 | https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py
14 | 
15 | Bi-directional model is based on the Keras example - imdb_bidirectional_lstm.py:
16 | https://github.com/fchollet/keras/blob/master/examples/imdb_bidirectional_lstm.py
17 | 
18 | Author: Chang Liu (fluency03)
19 | Date: 2016-04-03
20 | """
21 | 
22 | import glob
23 | # import os
24 | import sys
25 | import csv
26 | import time
27 | import matplotlib.pyplot as plt
28 | import numpy as np
29 | 
30 | from keras.callbacks import Callback, ModelCheckpoint
31 | from keras.layers import Input, Activation, Dense, Dropout, LSTM, GRU, merge
32 | from keras.layers.wrappers import TimeDistributed
33 | from keras.models import Sequential, Model
34 | from keras.optimizers import RMSprop # pylint: disable=W0611
35 | from keras.utils.visualize_util import plot
36 | 
37 | 
38 | # random number generator with a fixed value for reproducibility
39 | np.random.seed(1337)
40 | 
41 | 
42 | def override(f):
43 |     """
44 |     Override decorator.
45 |     """
46 |     return f
47 | 
48 | 
49 | class SequenceAnalyzer(object):
50 |     """
51 |     Sequence analyzer based on RNN.
52 |     """
53 |     def __init__(self, sentence_length, input_len, hidden_len, output_len):
54 |         self.sentence_length = sentence_length
55 |         self.input_len = input_len
56 |         self.hidden_len = hidden_len
57 |         self.output_len = output_len
58 |         # model is defined at child class
59 |         self.model = None
60 | 
61 |     def build(self, layer, mapping, learning_rate, nb_layers, dropout):
62 |         """
63 |         Build model.
64 |         """
65 |         pass
66 | 
67 |     def save_model(self, filename, overwrite=False):
68 |         """
69 |         Save the model weights into a hdf5 file.
70 | 
71 |         Arguments:
72 |             filename: {string}, the name/path to the file
73 |                 to which the weights are going to be saved.
74 |             overwrite: {bool}, overwrite existing file.
75 |         """
76 |         print "Save Weights %s ..." %filename
77 |         self.model.save_weights(filename, overwrite=overwrite)
78 | 
79 |     def load_model(self, filename):
80 |         """
81 |         Load the model weights from a hdf5 file.
82 | 
83 |         Arguments:
84 |             filename: {string}, the name/path to the file
85 |                 from which the weights are going to be loaded.
86 |         """
87 |         print "Load Weights %s ..." %filename
88 |         self.model.load_weights(filename)
89 | 
90 |     def plot_model(self, filename):
91 |         """
92 |         Plot model.
93 | 
94 |         Arguments:
95 |             filename: {string}, the name/path to the file
96 |                 to which the model graphic is plotted.
97 |         """
98 |         print "Plot Model %s ..." %filename
99 |         plot(self.model, to_file=filename)
100 | 
101 | 
102 | class URNN(SequenceAnalyzer):
103 |     """
104 |     Uni-directional RNN model of the sequence analyzer. Sequential Model.
105 |     """
106 |     def __init__(self, sentence_length, input_len, hidden_len, output_len):
107 |         # the parent constructor only takes the four dimension arguments
108 |         super(URNN, self).__init__(sentence_length,
109 |                                    input_len, hidden_len, output_len)
110 |         self.model = Sequential()
111 | 
112 |     @override
113 |     def build(self, layer='LSTM', mapping='m2m', learning_rate=0.001,
114 |               nb_layers=2, dropout=0.2):
115 |         """
116 |         Stacked RNN with specified dropout rate (default 0.2), built with
117 |         softmax activation, cross entropy loss and rmsprop optimizer.
118 | 
119 |         Arguments:
120 |             layer: {string}, the type of the layers in the RNN Model.
121 |                 'LSTM': LSTM layers
122 |                 'GRU': GRU layers
123 |             mapping: {string}, input to output mapping. 
124 | 'o2o': one-to-one 125 | 'm2m': many-to-many 126 | learning_rate: {float}, learning rate. 127 | nb_layers: {integer}, number of layers in total. 128 | dropout: {float}, dropout value. 129 | """ 130 | print "Building Model..." 131 | print (" layer = %d-%s , mapping = %s , learning rate = %.5f " 132 | "nb_layers = %d , dropout = %.2f" 133 | %(self.hidden_len, layer, mapping, learning_rate, 134 | nb_layers, dropout)) 135 | 136 | # check the layer type: LSTM or GRU 137 | if layer == 'LSTM': 138 | class LAYER(LSTM): 139 | """ 140 | LAYER as LSTM. 141 | """ 142 | pass 143 | elif layer == 'GRU': 144 | class LAYER(GRU): 145 | """ 146 | LAYER as GRU. 147 | """ 148 | pass 149 | 150 | # check whether return sequence for each of the layers 151 | return_sequences = [] 152 | if mapping == 'o2o': 153 | # if mapping is one-to-one 154 | for nl in range(nb_layers): 155 | if nl == nb_layers-1: 156 | return_sequences.append(False) 157 | else: 158 | return_sequences.append(True) 159 | elif mapping == 'm2m': 160 | # if mapping is many-to-many 161 | for _ in range(nb_layers): 162 | return_sequences.append(True) 163 | 164 | # first layer RNN with specified number of nodes in the hidden layer. 165 | self.model.add(LAYER(self.hidden_len, 166 | return_sequences=return_sequences[0], 167 | input_shape=(self.sentence_length, 168 | self.input_len))) 169 | self.model.add(Dropout(dropout)) 170 | 171 | # the following layers 172 | for nl in range(nb_layers-1): 173 | self.model.add(LAYER(self.hidden_len, 174 | return_sequences=return_sequences[nl+1])) 175 | self.model.add(Dropout(dropout)) 176 | 177 | if mapping == 'o2o': 178 | # if mapping is one-to-one 179 | self.model.add(Dense(self.output_len)) 180 | elif mapping == 'm2m': 181 | # if mapping is many-to-many 182 | self.model.add(TimeDistributed(Dense(self.output_len))) 183 | 184 | self.model.add(Activation('softmax')) 185 | 186 | rms = RMSprop(lr=learning_rate) 187 | self.model.compile(loss='categorical_crossentropy', 188 | optimizer=rms, 189 | metrics=['accuracy']) 190 | 191 | 192 | class BRNN(SequenceAnalyzer): 193 | """ 194 | Bi-directional RNN model of the sequence analyzer. Graph Model. 195 | """ 196 | def __init__(self, sentence_length, input_len, hidden_len, output_len): 197 | super(BRNN, self).__init__(sentence_length, 198 | input_len, hidden_len, output_len) 199 | 200 | @override 201 | def build(self, layer='LSTM', mapping='m2m', learning_rate=0.001, 202 | nb_layers=2, dropout=0.2): 203 | """ 204 | Bidirectional RNN with specified dropout rate (default 0.2), built with 205 | softmax activation, cross entropy loss and rmsprop optimizer. 206 | 207 | Arguments: 208 | layer: {string}, the type of the layers in the RNN Model. 209 | 'LSTM': LSTM layers 210 | 'GRU': GRU layers 211 | mapping: {string}, input to output mapping. 212 | 'o2o': one-to-one 213 | 'm2m': many-to-many 214 | learning_rate: {float}, learning rate. 215 | nb_layers: {integer}, number of layers in total. 216 | dropout: {float}, dropout value. 217 | """ 218 | print "Building Model..." 219 | print (" layer = %d-%s , mapping = %s , learning rate = %.5f " 220 | "nb_layers = %d , dropout = %.2f" 221 | %(self.hidden_len, layer, mapping, learning_rate, 222 | nb_layers, dropout)) 223 | 224 | # check the layer type: LSTM or GRU 225 | if layer == 'LSTM': 226 | class LAYER(LSTM): 227 | """ 228 | LAYER as LSTM. 229 | """ 230 | pass 231 | elif layer == 'GRU': 232 | class LAYER(GRU): 233 | """ 234 | LAYER as GRU. 
235 | """ 236 | pass 237 | 238 | # check whether return sequence for each of the layers 239 | return_sequences = [] 240 | if mapping == 'o2o': 241 | # if mapping is one-to-one 242 | for nl in range(nb_layers): 243 | if nl == nb_layers-1: 244 | return_sequences.append(False) 245 | else: 246 | return_sequences.append(True) 247 | elif mapping == 'm2m': 248 | # if mapping is many-to-many 249 | for _ in range(nb_layers): 250 | return_sequences.append(True) 251 | 252 | # add input 253 | input_layer = Input(shape=(self.sentence_length, self.input_len), 254 | dtype='float32') 255 | 256 | # first Bi-directional LSTM layer 257 | forward1 = LAYER(self.hidden_len, 258 | return_sequences=return_sequences[0])(input_layer) 259 | forward_dropout1 = Dropout(dropout)(forward1) # pylint: disable=W0612 260 | backward1 = LAYER(self.hidden_len, 261 | return_sequences=return_sequences[0], 262 | go_backwards=True)(input_layer) 263 | backward_dropout1 = Dropout(dropout)(backward1) # pylint: disable=W0612 264 | 265 | # following Bi-directional layers 266 | for nl in range(nb_layers-1): 267 | exec("%s = LAYER(self.hidden_len, return_sequences=%s)(%s)" 268 | %('forward' + str(nl+2), 269 | return_sequences[nl+1], 270 | 'forward_dropout' + str(nl+1))) 271 | exec("%s = Dropout(dropout)(%s)" 272 | %('forward_dropout' + str(nl+2), 273 | 'forward' + str(nl+2))) 274 | exec(("%s = LAYER(self.hidden_len, return_sequences=%s, " 275 | "go_backwards=True)(%s)") 276 | %('backward' + str(nl+2), 277 | return_sequences[nl+1], 278 | 'backward_dropout' + str(nl+1))) 279 | exec("%s = Dropout(dropout)(%s)" 280 | %('backward_dropout' + str(nl+2), 281 | 'backward' + str(nl+2))) 282 | 283 | merged_layer = merge([locals()['forward_dropout' + str(nb_layers)], 284 | locals()['backward_dropout' + str(nb_layers)]], 285 | mode='concat', concat_axis=-1) 286 | 287 | if mapping == 'o2o': 288 | output_layer = Dense(self.output_len, 289 | activation='softmax')(merged_layer) 290 | elif mapping == 'm2m': 291 | output_layer = TimeDistributed( 292 | Dense(self.output_len, activation='softmax'))(merged_layer) 293 | 294 | # add ouput 295 | self.model = Model(input=input_layer, output=output_layer) 296 | 297 | rms = RMSprop(lr=learning_rate) 298 | # try using different optimizers and different optimizer configs 299 | self.model.compile(loss='categorical_crossentropy', 300 | optimizer=rms, 301 | metrics=['accuracy']) 302 | 303 | 304 | class History(Callback): 305 | """ 306 | Record the loss and accuracy history. 307 | """ 308 | @override 309 | def on_train_begin(self, logs={}): # pylint: disable=W0102 310 | """ 311 | A method starting at the begining of the training. 312 | 313 | Arguments: 314 | logs: {dictionary}, recording the training and validation 315 | losses and accuracy of every epoch. 316 | """ 317 | # training loss and accuracy 318 | self.train_losses = [] 319 | self.train_acc = [] 320 | # validation loss and accuracy 321 | self.val_losses = [] 322 | self.val_acc = [] 323 | 324 | @override 325 | def on_epoch_end(self, epoch, logs={}): # pylint: disable=W0102 326 | """ 327 | A method starting at the begining of the training. 328 | 329 | Arguments: 330 | epoch: {integer}, the current epoch. 331 | logs: {dictionary}, recording the training and validation 332 | losses and accuracy of every epoch. 
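        Example (sketch for reading the history back; it assumes the
        four-column order written below and that numpy is available):

            import numpy as np
            train_loss, train_acc, val_loss, val_acc = np.loadtxt(
                'history.csv', delimiter=',').T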
333 |         """
334 |         # record training loss and accuracy
335 |         self.train_losses.append(logs.get('loss'))
336 |         self.train_acc.append(logs.get('acc'))
337 |         # record validation loss and accuracy
338 |         self.val_losses.append(logs.get('val_loss'))
339 |         self.val_acc.append(logs.get('val_acc'))
340 | 
341 |         # continuously save the train_loss, train_acc, val_loss, val_acc
342 |         # into a csv file with 4 columns respectively
343 |         csv_name = 'history.csv'
344 |         with open(csv_name, 'a') as csvfile:
345 |             his_writer = csv.writer(csvfile)
346 |             print "\n Save loss and accuracy into %s" %csv_name
347 |             his_writer.writerow((logs.get('loss'), logs.get('acc'),
348 |                                  logs.get('val_loss'), logs.get('val_acc')))
349 | 
350 | 
351 | def sample(prob, temperature=0.2):
352 |     """
353 |     Sample an index from a probability array, rescaled by a softmax temperature.
354 | 
355 |     Arguments:
356 |         prob: {list}, a list of probabilities of each of the classes.
357 |         temperature: {float}, Softmax temperature.
358 |     Returns:
359 |         {integer}, the index of the sampled class.
360 |     """
361 |     prob = np.log(prob) / temperature
362 |     prob = np.exp(prob) / np.sum(np.exp(prob))
363 |     return np.argmax(np.random.multinomial(1, prob, 1))
364 | 
365 | 
366 | def get_sequence(filepath):
367 |     """
368 |     Get the original sequence from file.
369 | 
370 |     Arguments:
371 |         filepath: {string}, the path/glob pattern of the input log sequence file(s).
372 |     Returns:
373 |         {list}, the log sequence.
374 |         {integer}, the size of vocabulary.
375 |     """
376 |     # read file and convert ids of each line into array of numbers
377 |     seqfiles = glob.glob(filepath)
378 |     sequence = []
379 | 
380 |     for seqfile in seqfiles:
381 |         with open(seqfile, 'r') as f:
382 |             one_sequence = [int(id_) for id_ in f]
383 |             print " %s, sequence length: %d" %(seqfile,
384 |                                                len(one_sequence))
385 |             sequence.extend(one_sequence)
386 | 
387 |     # add two extra positions for 'unknown-log' and 'no-log'
388 |     vocab_size = max(sequence) + 2
389 | 
390 |     return sequence, vocab_size
391 | 
392 | 
393 | def data_generator(sequence, vocab_size, mapping='m2m', sentence_length=40,
394 |                    step=3, random_offset=True, batch_size=128):
395 |     """
396 |     Retrieves data from a plain txt file and formats it using one-hot vectors.
397 |     This method returns a data generator yielding one batch of
398 |     (X_train, y_train) at a time.
399 | 
400 |     Arguments:
401 |         sequence: {list}, the original input sequence
402 |         vocab_size: {integer}, the number of unique id classes
403 |         mapping: {string}, input to output mapping.
404 |             'o2o': one-to-one
405 |             'm2m': many-to-many
406 |         sentence_length: {integer}, the length of each training sentence.
407 |         step: {integer}, the sample steps.
408 |         random_offset: {bool}, the offset is random between step or is 0.
409 |         batch_size: {integer}, the number of samples per batch. 
410 | Yields: 411 | {np.array}, training input data X 412 | {np.array}, training target data y 413 | """ 414 | # the number of current sample 415 | sample_count = 0 416 | 417 | # one-hot vector (all zeros except for a single one at 418 | # the exact postion of this id number) 419 | X_train = np.zeros((batch_size, sentence_length, vocab_size), 420 | dtype=np.bool) 421 | # expected outputs for each sentence 422 | if mapping == 'o2o': 423 | # if mapping is one-to-one 424 | y_train = np.zeros((batch_size, vocab_size), dtype=np.bool) 425 | elif mapping == 'm2m': 426 | # if mapping is many-to-many 427 | y_train = np.zeros((batch_size, sentence_length, vocab_size), 428 | dtype=np.bool) 429 | 430 | # continuousy creat batch data and next sentences 431 | while True: 432 | offset = np.random.randint(0, step) if random_offset else 0 433 | for i in range(offset, len(sequence) - sentence_length, step): 434 | # index of a this sample in this batch 435 | batch_index = sample_count % batch_size 436 | 437 | # re-initialzing the batch 438 | if batch_index == 0: 439 | X_train.fill(0) 440 | y_train.fill(0) 441 | 442 | # current sample and target outputs 443 | X_sentence = [] 444 | y_sentence = [] 445 | next_id = [] 446 | 447 | X_sentence = sequence[i : i + sentence_length] 448 | if mapping == 'o2o': 449 | # if mapping is one-to-one 450 | next_id = sequence[i + sentence_length] 451 | elif mapping == 'm2m': 452 | # if mapping is many-to-many 453 | y_sentence = sequence[i + 1 : i + sentence_length + 1] 454 | 455 | for t, id_ in enumerate(X_sentence): 456 | # mark the each corresponding character in a sentence as 1 457 | X_train[batch_index, t, id_] = 1 458 | # if mapping is many-to-many 459 | if mapping == 'm2m': 460 | y_train[batch_index, t, y_sentence[t]] = 1 461 | # if mapping is one-to-one 462 | # mark the corresponding character in expected output as 1 463 | if mapping == 'o2o': 464 | y_train[batch_index, next_id] = 1 465 | 466 | # sample count plus 1 467 | sample_count += 1 468 | 469 | if batch_index == batch_size-1: 470 | yield X_train, y_train 471 | 472 | 473 | def predict(sequence, input_len, analyzer, nb_predictions=80, 474 | mapping='m2m', sentence_length=40): 475 | """ 476 | Predict the next sequences using existing model and weights given some seed. 477 | 478 | Arguments: 479 | sequence: {lsit}, the original input sequence 480 | input_len: {integer}, the number of unique id classes 481 | analyzer: {SequenceAnalyzer}, the sequence analyzer 482 | nb_predictions: {integer}, number of predictions after giving the seed 483 | mapping: {string}, input to output mapping. 484 | 'o2o': one-to-one 485 | 'm2m': many-to-many 486 | sentence_length: {integer}, the length of each sentence. 
487 | """ 488 | # generate elements 489 | for _ in range(nb_predictions): 490 | # start index of the seed, random number in range 491 | start_index = np.random.randint(0, len(sequence) - sentence_length - 1) 492 | # seed sentence 493 | sentence = sequence[start_index : start_index + sentence_length] 494 | 495 | # Y_true 496 | y_true = sequence[start_index + 1 : start_index + sentence_length + 1] 497 | print "X: " + ' '.join(str(s).ljust(4) for s in sentence) 498 | 499 | seed = np.zeros((1, sentence_length, input_len)) 500 | # format input 501 | for t in range(0, sentence_length): 502 | seed[0, t, sentence[t]] = 1 503 | 504 | # get predictions 505 | # verbose = 0, no logging 506 | predictions = analyzer.model.predict(seed, verbose=0)[0] 507 | 508 | # y_predicted 509 | if mapping == 'o2o': 510 | next_id = np.argmax(predictions) 511 | sys.stdout.write(' ' + str(next_id)) 512 | sys.stdout.flush() 513 | elif mapping == 'm2m': 514 | next_sentence = [] 515 | for pred in predictions: 516 | next_sentence.append(np.argmax(pred)) 517 | print "y_pred: " + ' '.join(str(id_).ljust(4) 518 | for id_ in next_sentence) 519 | # next_id = np.argmax(predictions[-1]) 520 | 521 | # y_true 522 | print "y_true: " + ' '.join(str(s).ljust(4) for s in y_true) 523 | 524 | print "\n" 525 | 526 | 527 | def train(analyzer, train_data, nb_training_samples, 528 | val_data, nb_validation_samples, 529 | nb_epoch=50, nb_iterations=4): 530 | """ 531 | Trains the network. 532 | 533 | Arguments: 534 | analyzer: {SequenceAnalyzer}. 535 | train_data: {tuple}, training data (X_train, y_train). 536 | val_data: {tuple}, validation data (X_val, y_val). 537 | nb_training_samples: {integer}, the number training samples. 538 | nb_validation_samples: {integer}, the number validation samples. 539 | nb_iterations: {integer}, number of iterations. 540 | sentence_length: {integer}, the length of each training sentence. 541 | """ 542 | for iteration in range(1, nb_iterations+1): 543 | print "" 544 | print "------------------------ Start Training ------------------------" 545 | print "Iteration: ", iteration 546 | print "Number of epoch per iteration: ", nb_epoch 547 | 548 | # history of losses and accuracy 549 | history = History() 550 | 551 | # saves the model weights after each epoch 552 | # if the validation loss decreased 553 | checkpointer = ModelCheckpoint(filepath="weights.hdf5", 554 | verbose=1, save_best_only=True) 555 | 556 | # train the model with data generator 557 | analyzer.model.fit_generator(train_data, 558 | samples_per_epoch=nb_training_samples, 559 | nb_epoch=nb_epoch, verbose=1, 560 | callbacks=[history, checkpointer], 561 | validation_data=val_data, 562 | nb_val_samples=nb_validation_samples) 563 | 564 | analyzer.save_model("weights-after-iteration.hdf5") 565 | 566 | 567 | def detect(sequence, input_len, analyzer, mapping='m2m', sentence_length=40): 568 | """ 569 | Scan the given sequence for detecting anormalies. 570 | 571 | Arguments: 572 | sequence: {lsit}, the original input sequence 573 | input_len: {integer}, the number of unique id classes 574 | analyzer: {SequenceAnalyzer}, the sequence analyzer 575 | mapping: {string}, input to output mapping. 576 | 'o2o': one-to-one 577 | 'm2m': many-to-many 578 | sentence_length: {integer}, the length of each sentence. 
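    Example (illustrative sketch only; it assumes a trained `analyzer` and a
    sequence loaded as in run()):

        seq, nb_classes = get_sequence("./validation_data/*")
        prob = detect(seq, nb_classes, analyzer, mapping='m2m',
                      sentence_length=40)

    prob[i] is the probability the model assigned to the id that actually
    occurs at position i (the first sentence_length entries are fixed to 1);
    unusually small values point at anomalous log ids.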
579 | """ 580 | # sequence length 581 | length = len(sequence) 582 | 583 | # predicted probabilities for each id 584 | # we assume the first sentence_length ids are true 585 | prob = [1] * sentence_length + [0] * (length - sentence_length) 586 | 587 | start_time = time.time() 588 | try: 589 | # generate elements 590 | for start_index in xrange(length - sentence_length): 591 | # seed sentence 592 | X = sequence[start_index : start_index + sentence_length] 593 | # print "X: " + ' '.join(str(s).ljust(4) for s in sentence) 594 | 595 | # Y_true 596 | # y_true = sequence[start_index + 1 : start_index + sentence_length + 1] 597 | # print "y_true: " + ' '.join(str(s).ljust(4) for s in y_true) 598 | y_next_true = sequence[start_index + sentence_length] 599 | 600 | seed = np.zeros((1, sentence_length, input_len)) 601 | # format input 602 | for t in range(0, sentence_length): 603 | seed[0, t, X[t]] = 1 604 | 605 | # get predictionsverbose = 0, no logging 606 | predictions = analyzer.model.predict(seed, verbose=0)[0] 607 | 608 | # y_predicted 609 | y_next_pred = 0 610 | next_prob = 0 611 | if mapping == 'o2o': 612 | next_prob = predictions[y_next_true] 613 | prob[start_index + sentence_length] = next_prob 614 | y_next_pred = np.argmax(predictions) 615 | elif mapping == 'm2m': 616 | # next_sentence = [] 617 | # for pred in predictions: 618 | # next_sentence.append(np.argmax(pred)) 619 | # y_next_pred = next_sentence[-1] 620 | # print "y_pred: " + ' '.join(str(id_).ljust(4) 621 | # for id_ in next_sentence) 622 | y_next_pred = np.argmax(predictions[-1]) 623 | next_prob = predictions[-1][y_next_true] 624 | prob[start_index + sentence_length] = next_prob 625 | 626 | print start_index, next_prob 627 | except KeyboardInterrupt: 628 | # print " |-Write the clusters into %s ..." %self.cluster_file 629 | with open('prob.txt', 'w') as prob_file: 630 | for p in prob: 631 | prob_file.write(str(p) + '\n') 632 | 633 | plt.plot(prob, 'r*') 634 | plt.xlim(0, 1000) 635 | plt.ylim(0, 1) 636 | plt.savefig("prob.png") 637 | plt.clf() 638 | plt.cla() 639 | 640 | stop_time = time.time() 641 | print "--- %s seconds ---\n" % (stop_time - start_time) 642 | 643 | return prob 644 | 645 | 646 | def run(hidden_len=512, batch_size=128, nb_batch=200, nb_epoch=50, 647 | nb_iterations=4, learning_rate=0.001, nb_predictions=20, 648 | mapping='m2m', sentence_length=80, step=80, mode='train'): 649 | """ 650 | Train, evaluate, or predict. 651 | 652 | Arguments: 653 | hidden_len: {integer}, the size of a hidden layer. 654 | batch_size: {interger}, the number of sentences per batch. 655 | nb_batch: {integer}, number of batches to be trained durign each epoch. 656 | nb_epoch: {interger}, number of epoches per iteration. 657 | nb_iterations: {integer}, number of iterations. 658 | learning_rate: {float}, learning rate. 659 | nb_predictions: {integer}, number of the ids predicted. 660 | mapping: {string}, input to output mapping. 661 | 'o2o': one-to-one 662 | 'm2m': many-to-many 663 | sentence_length: {integer}, the length of each training sentence. 664 | step: {integer}, the sample steps. 665 | mode: {string}, th running mode of this programm 666 | 'train': train and predict 667 | 'predict': only predict by loading existing model weights 668 | 'evaluate': evaluate the model in evaluation data set 669 | 'detect': detect a new log sequence for the probabilities 670 | """ 671 | # get parameters and dimensions of the model 672 | print "Loading training data..." 
673 |     train_sequence, input_len1 = get_sequence("./train_data/*")
674 |     print "Loading validation data..."
675 |     val_sequence, input_len2 = get_sequence("./validation_data/*")
676 |     input_len = max(input_len1, input_len2)
677 | 
678 |     print "Training sequence length: %d" %len(train_sequence)
679 |     print "Validation sequence length: %d" %len(val_sequence)
680 |     print "#classes: %d\n" %input_len
681 | 
682 |     # data generator of X_train and y_train, with random offset
683 |     train_data = data_generator(train_sequence, input_len, mapping=mapping,
684 |                                 sentence_length=sentence_length, step=step,
685 |                                 random_offset=True, batch_size=batch_size)
686 | 
687 |     # data generator of X_val and y_val, with random offset
688 |     val_data = data_generator(val_sequence, input_len, mapping=mapping,
689 |                               sentence_length=sentence_length, step=step,
690 |                               random_offset=True, batch_size=batch_size)
691 | 
692 |     # uni-directional RNN: two LSTM layers, 512 hidden nodes, dropout rate 0.2
693 |     analyzer = URNN(sentence_length,
694 |                     input_len, hidden_len, input_len)
695 | 
696 |     # build model
697 |     analyzer.build(layer='LSTM', mapping=mapping, learning_rate=learning_rate,
698 |                    nb_layers=2, dropout=0.2)
699 | 
700 |     # plot model
701 |     # analyzer.plot_model()
702 | 
703 |     # load the previous model weights
704 |     # analyzer.load_model("weightsf4-61.hdf5")
705 | 
706 |     if mode == 'predict':
707 |         print "Predict..."
708 |         predict(val_sequence, input_len, analyzer, nb_predictions=nb_predictions,
709 |                 mapping=mapping, sentence_length=sentence_length)
710 |     elif mode == 'evaluate':
711 |         print "Evaluate..."
712 |         print "Metrics: " + ', '.join(analyzer.model.metrics_names)
713 |         # evaluate on a single batch drawn from the validation data generator
714 |         X_val, y_val = next(data_generator(val_sequence, input_len,
715 |                                            mapping=mapping,
716 |                                            sentence_length=sentence_length,
717 |                                            step=step, random_offset=False,
718 |                                            batch_size=batch_size))
719 |         results = analyzer.model.evaluate(X_val, y_val, #pylint: disable=W0612
720 |                                           batch_size=batch_size, verbose=1)
721 |         print "Loss: ", results[0]
722 |         print "Accuracy: ", results[1]
723 |     elif mode == 'train':
724 |         print "Train..."
725 |         # number of training samples and validation samples
726 |         nb_training_samples = batch_size * nb_batch
727 |         nb_validation_samples = int(nb_training_samples * 0.05)
728 | 
729 |         try:
730 |             train(analyzer, train_data, nb_training_samples,
731 |                   val_data, nb_validation_samples,
732 |                   nb_epoch=nb_epoch, nb_iterations=nb_iterations)
733 |         except KeyboardInterrupt:
734 |             analyzer.save_model("weights-stop.hdf5")
735 |     elif mode == 'detect':
736 |         print "Detect..."
737 |         detect(val_sequence, input_len, analyzer, mapping=mapping,
738 |                sentence_length=sentence_length)
739 |     else:
740 |         print "The mode = %s is not correct!!!" 
%mode 740 | 741 | return mode 742 | 743 | 744 | if __name__ == '__main__': 745 | run() 746 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | git+https://github.com/fchollet/keras.git 2 | git+https://github.com/Theano/Theano.git 3 | git+https://github.com/scipy/scipy.git 4 | git+https://github.com/numpy/numpy.git 5 | cython 6 | -------------------------------------------------------------------------------- /rnn_model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fluency03/sequence-rnn-py/0a55a8fcc93644bca216afc660564d3a606886ab/rnn_model.png -------------------------------------------------------------------------------- /rnn_sequence_analyzer.py: -------------------------------------------------------------------------------- 1 | """ 2 | This program analyze the integer sequence using Uni-diractional Recurrent Neural 3 | Network (RNN) with Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) 4 | based on the python library Keras. 5 | 6 | "Keras is a minimalist, highly modular neural networks library, written in 7 | Python and capable of running on top of either TensorFlow or Theano." 8 | ---- Keras (http://keras.io/) 9 | 10 | It is based on this Keras example - lstm_text_generation: 11 | https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py 12 | 13 | Author: Chang Liu (fluency03) 14 | Data: 2016-03-17 15 | """ 16 | 17 | from math import log 18 | import glob 19 | # import os 20 | import sys 21 | import csv 22 | import time 23 | import matplotlib.pyplot as plt 24 | import numpy as np 25 | 26 | from keras.callbacks import Callback, ModelCheckpoint 27 | from keras.layers import Activation, Dense, Dropout, LSTM, GRU 28 | from keras.layers.wrappers import TimeDistributed 29 | from keras.models import Sequential 30 | from keras.optimizers import RMSprop # pylint: disable=W0611 31 | from keras.utils.visualize_util import plot 32 | 33 | 34 | # random number generator with a fixed value for reproducibility 35 | np.random.seed(1337) 36 | 37 | 38 | def override(f): 39 | """ 40 | Override decorator. 41 | """ 42 | return f 43 | 44 | 45 | class SequenceAnalyzer(object): 46 | """ 47 | Sequence analyzer based on RNN Sequential Model. 48 | """ 49 | def __init__(self, sentence_length, input_len, hidden_len, output_len): 50 | self.sentence_length = sentence_length 51 | self.input_len = input_len 52 | self.hidden_len = hidden_len 53 | self.output_len = output_len 54 | self.model = Sequential() 55 | 56 | def build(self, layer='LSTM', mapping='m2m', learning_rate=0.001, 57 | nb_layers=2, dropout=0.2): 58 | """ 59 | Stacked RNN with specified dropout rate (default 0.2), built with 60 | softmax activation, cross entropy loss and rmsprop optimizer. 61 | 62 | Arguments: 63 | layer: {string}, the type of the layers in the RNN Model. 64 | 'LSTM': LSTM layers 65 | 'GRU': GRU layers 66 | mapping: {string}, input to output mapping. 67 | 'o2o': one-to-one 68 | 'm2m': many-to-many 69 | learning_rate: {float}, learning rate. 70 | nb_layers: {integer}, number of layers in total. 71 | dropout: {float}, dropout value. 72 | """ 73 | print "Building Model..." 
74 | print (" layer = %d-%s , mapping = %s , learning rate = %.5f, " 75 | "nb_layers = %d , dropout = %.2f" 76 | %(self.hidden_len, layer, mapping, learning_rate, 77 | nb_layers, dropout)) 78 | 79 | # check the layer type: LSTM or GRU 80 | if layer == 'LSTM': 81 | class LAYER(LSTM): 82 | """ 83 | LAYER as LSTM. 84 | """ 85 | pass 86 | elif layer == 'GRU': 87 | class LAYER(GRU): 88 | """ 89 | LAYER as GRU. 90 | """ 91 | pass 92 | 93 | # check whether return sequence for each of the layers 94 | return_sequences = [] 95 | if mapping == 'o2o': 96 | # if mapping is one-to-one 97 | for nl in range(nb_layers): 98 | if nl == nb_layers-1: 99 | return_sequences.append(False) 100 | else: 101 | return_sequences.append(True) 102 | elif mapping == 'm2m': 103 | # if mapping is many-to-many 104 | for _ in range(nb_layers): 105 | return_sequences.append(True) 106 | 107 | # first layer RNN with specified number of nodes in the hidden layer. 108 | self.model.add(LAYER(self.hidden_len, 109 | return_sequences=return_sequences[0], 110 | input_shape=(self.sentence_length, 111 | self.input_len))) 112 | self.model.add(Dropout(dropout)) 113 | 114 | # the following layers 115 | for nl in range(nb_layers-1): 116 | self.model.add(LAYER(self.hidden_len, 117 | return_sequences=return_sequences[nl+1])) 118 | self.model.add(Dropout(dropout)) 119 | 120 | if mapping == 'o2o': 121 | # if mapping is one-to-one 122 | self.model.add(Dense(self.output_len)) 123 | elif mapping == 'm2m': 124 | # if mapping is many-to-many 125 | self.model.add(TimeDistributed(Dense(self.output_len))) 126 | 127 | self.model.add(Activation('softmax')) 128 | 129 | rms = RMSprop(lr=learning_rate) 130 | self.model.compile(loss='categorical_crossentropy', 131 | optimizer=rms, 132 | metrics=['accuracy']) 133 | 134 | def save_model(self, filename, overwrite=False): 135 | """ 136 | Save the model weight into a hdf5 file. 137 | 138 | Arguments: 139 | filename: {string}, the name/path to the file 140 | to which the weights are going to be saved. 141 | overwrite: {bool}, overwrite existing file. 142 | """ 143 | print "Save Weights %s ..." %filename 144 | self.model.save_weights(filename, overwrite=overwrite) 145 | 146 | def load_model(self, filename): 147 | """ 148 | Load the model weight into a hdf5 file. 149 | 150 | Arguments: 151 | filename: {string}, the name/path to the file 152 | to which the weights are going to be loaded. 153 | """ 154 | print "Load Weights %s ..." %filename 155 | self.model.load_weights(filename) 156 | 157 | def plot_model(self, filename='rnn_model.png'): 158 | """ 159 | Plot model. 160 | 161 | Arguments: 162 | filename: {string}, the name/path to the file 163 | to which the weights are going to be plotted. 164 | """ 165 | print "Plot Model %s ..." %filename 166 | plot(self.model, to_file=filename) 167 | 168 | 169 | class History(Callback): 170 | """ 171 | Record the loss and accuracy history. 172 | """ 173 | @override 174 | def on_train_begin(self, logs={}): # pylint: disable=W0102 175 | """ 176 | A method starting at the begining of the training. 177 | 178 | Arguments: 179 | logs: {dictionary}, recording the training and validation 180 | losses and accuracy of every epoch. 181 | """ 182 | # training loss and accuracy 183 | self.train_losses = [] 184 | self.train_acc = [] 185 | # validation loss and accuracy 186 | self.val_losses = [] 187 | self.val_acc = [] 188 | 189 | @override 190 | def on_epoch_end(self, epoch, logs={}): # pylint: disable=W0102 191 | """ 192 | A method starting at the begining of the training. 
193 | 194 | Arguments: 195 | epoch: {integer}, the current epoch. 196 | logs: {dictionary}, recording the training and validation 197 | losses and accuracy of every epoch. 198 | """ 199 | # record training loss and accuracy 200 | self.train_losses.append(logs.get('loss')) 201 | self.train_acc.append(logs.get('acc')) 202 | # record validation loss and accuracy 203 | self.val_losses.append(logs.get('val_loss')) 204 | self.val_acc.append(logs.get('val_acc')) 205 | 206 | # continutously save the train_loss, train_acc, val_loss, val_acc 207 | # into a csv file with 4 columns respeactively 208 | csv_name = 'history.csv' 209 | with open(csv_name, 'a') as csvfile: 210 | his_writer = csv.writer(csvfile) 211 | print "\n Save loss and accuracy into %s" %csv_name 212 | his_writer.writerow((logs.get('loss'), logs.get('acc'), 213 | logs.get('val_loss'), logs.get('val_acc'))) 214 | 215 | 216 | def sample(prob, temperature=0.2): 217 | """ 218 | Softmax function for reinforcement learning. 219 | 220 | Arguments: 221 | prob: {list}, a list of probabilities of each of the classes. 222 | temperature: {float}, Softmax temperature. 223 | Returns: 224 | {integer}, the most possible sample. 225 | """ 226 | prob = np.log(prob) / temperature 227 | prob = np.exp(prob) / np.sum(np.exp(prob)) 228 | return np.argmax(np.random.multinomial(1, prob, 1)) 229 | 230 | 231 | def get_sequence(filepath): 232 | """ 233 | Get the original sequence from file. 234 | 235 | Arguments: 236 | filename: {string}, the name/path of input log sequence file. 237 | Returns: 238 | {list}, the log sequence. 239 | {integer}, the size of vocabulary. 240 | """ 241 | # read file and convert ids of each line into array of numbers 242 | seqfiles = glob.glob(filepath) 243 | sequence = [] 244 | 245 | for seqfile in seqfiles: 246 | with open(seqfile, 'r') as f: 247 | one_sequence = [int(id_) for id_ in f] 248 | print " %s, sequence length: %d" %(seqfile, 249 | len(one_sequence)) 250 | sequence.extend(one_sequence) 251 | 252 | # add two extra positions for 'unknown-log' and 'no-log' 253 | vocab_size = max(sequence) + 2 254 | 255 | return sequence, vocab_size 256 | 257 | 258 | def get_data(sequence, vocab_size, mapping='m2m', sentence_length=40, step=3, 259 | random_offset=True): 260 | """ 261 | Retrieves data from a plain txt file and formats it using one-hot vector. 262 | 263 | Arguments: 264 | sequence: {lsit}, the original input sequence 265 | vocab_size: {integer}, the number of unique id classes 266 | mapping: {string}, input to output mapping. 267 | 'o2o': one-to-one 268 | 'm2m': many-to-many 269 | sentence_length: {integer}, the length of each training sentence. 270 | step: {integer}, the sample steps. 271 | random_offset: {bool}, the offset is random between step or is 0. 
272 | Returns: 273 | {np.array}, training input data X 274 | {np.array}, training target data y 275 | """ 276 | X_sentences = [] 277 | y_sentences = [] 278 | next_ids = [] 279 | 280 | offset = np.random.randint(0, step) if random_offset else 0 281 | 282 | # creat batch data and next sentences 283 | for i in range(offset, len(sequence) - sentence_length, step): 284 | X_sentences.append(sequence[i : i + sentence_length]) 285 | if mapping == 'o2o': 286 | # if mapping is one-to-one 287 | next_ids.append(sequence[i + sentence_length]) 288 | elif mapping == 'm2m': 289 | # if mapping is many-to-many 290 | y_sentences.append(sequence[i + 1 : i + sentence_length + 1]) 291 | 292 | # number of sampes 293 | nb_samples = len(X_sentences) 294 | # print "total # of sentences: %d" %nb_samples 295 | 296 | # one-hot vector (all zeros except for a single one at 297 | # the exact postion of this id number) 298 | X_train = np.zeros((nb_samples, sentence_length, vocab_size), dtype=np.bool) 299 | # expected outputs for each sentence 300 | if mapping == 'o2o': 301 | # if mapping is one-to-one 302 | y_train = np.zeros((nb_samples, vocab_size), dtype=np.bool) 303 | elif mapping == 'm2m': 304 | # if mapping is many-to-many 305 | y_train = np.zeros((nb_samples, sentence_length, vocab_size), 306 | dtype=np.bool) 307 | 308 | for i, x_sentence in enumerate(X_sentences): 309 | for t, id_ in enumerate(x_sentence): 310 | # mark the each corresponding character in a sentence as 1 311 | X_train[i, t, id_] = 1 312 | # if mapping is many-to-many 313 | if mapping == 'm2m': 314 | y_train[i, t, y_sentences[i][t]] = 1 315 | # if mapping is one-to-one 316 | # mark the corresponding character in expected output as 1 317 | if mapping == 'o2o': 318 | y_train[i, next_ids[i]] = 1 319 | 320 | return X_train, y_train 321 | 322 | 323 | def predict(sequence, input_len, analyzer, nb_predictions=80, 324 | mapping='m2m', sentence_length=40): 325 | """ 326 | Predict the next sequences using existing model and weights given some seed. 327 | 328 | Arguments: 329 | sequence: {lsit}, the original input sequence 330 | input_len: {integer}, the number of unique id classes 331 | analyzer: {SequenceAnalyzer}, the sequence analyzer 332 | nb_predictions: {integer}, number of predictions after giving the seed 333 | mapping: {string}, input to output mapping. 334 | 'o2o': one-to-one 335 | 'm2m': many-to-many 336 | sentence_length: {integer}, the length of each sentence. 
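    Example (illustrative sketch with the 'o2o' mapping; it assumes a trained
    `analyzer` built with mapping='o2o'):

        seq, nb_classes = get_sequence("./validation_data/*")
        predict(seq, nb_classes, analyzer, nb_predictions=5,
                mapping='o2o', sentence_length=40)

    With 'o2o' only the single id following each random seed is written to
    stdout; with 'm2m' the whole shifted sentence y_pred is printed next to
    y_true.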
337 | """ 338 | # generate elements 339 | for _ in range(nb_predictions): 340 | # start index of the seed, random number in range 341 | start_index = np.random.randint(0, len(sequence) - sentence_length - 1) 342 | # seed sentence 343 | sentence = sequence[start_index : start_index + sentence_length] 344 | 345 | # Y_true 346 | y_true = sequence[start_index + 1 : start_index + sentence_length + 1] 347 | print "X: " + ' '.join(str(s).ljust(4) for s in sentence) 348 | 349 | seed = np.zeros((1, sentence_length, input_len)) 350 | # format input 351 | for t in range(0, sentence_length): 352 | seed[0, t, sentence[t]] = 1 353 | 354 | # get predictions 355 | # verbose = 0, no logging 356 | predictions = analyzer.model.predict(seed, verbose=0)[0] 357 | 358 | # y_predicted 359 | if mapping == 'o2o': 360 | next_id = np.argmax(predictions) 361 | sys.stdout.write(' ' + str(next_id)) 362 | sys.stdout.flush() 363 | elif mapping == 'm2m': 364 | next_sentence = [] 365 | for pred in predictions: 366 | next_sentence.append(np.argmax(pred)) 367 | print "y_pred: " + ' '.join(str(id_).ljust(4) 368 | for id_ in next_sentence) 369 | # next_id = np.argmax(predictions[-1]) 370 | 371 | # y_true 372 | print "y_true: " + ' '.join(str(s).ljust(4) for s in y_true) 373 | 374 | print "\n" 375 | 376 | 377 | def train(analyzer, train_sequence, val_sequence, input_len, 378 | batch_size=128, nb_epoch=50, nb_iterations=4, 379 | sentence_length=40, step=40, mapping='m2m'): 380 | """ 381 | Trains the network. 382 | 383 | Arguments: 384 | analyzer: {SequenceAnalyzer}. 385 | train_sequence: {list}, training sequence. 386 | val_sequence: {list}, validation sequence. 387 | input_len: {integer}, the number of classes, i.e., the input length of 388 | neural network. 389 | batch_size: {interger}, the number of sentences per batch. 390 | nb_epoch: {integer}, number of epoches per iteration. 391 | nb_iterations: {integer}, number of iterations. 392 | sentence_length: {integer}, the length of each training sentence. 393 | step: {integer}, the sample steps. 394 | mapping: {string}, input to output mapping. 395 | 'o2o': one-to-one 396 | 'm2m': many-to-many 397 | """ 398 | for iteration in range(1, nb_iterations+1): 399 | # create training data, randomize the offset between steps 400 | X_train, y_train = get_data(train_sequence, input_len, mapping=mapping, 401 | sentence_length=sentence_length, step=step, 402 | random_offset=False) 403 | X_val, y_val = get_data(val_sequence, input_len, mapping=mapping, 404 | sentence_length=sentence_length, step=step, 405 | random_offset=False) 406 | print "" 407 | print "------------------------ Start Training ------------------------" 408 | print "Iteration: ", iteration 409 | print "Number of epoch per iteration: ", nb_epoch 410 | 411 | # history of losses and accuracy 412 | history = History() 413 | 414 | # saves the model weights after each epoch 415 | # if the validation loss decreased 416 | checkpointer = ModelCheckpoint(filepath="weights.hdf5", 417 | verbose=1, save_best_only=True) 418 | 419 | # train the model 420 | analyzer.model.fit(X_train, y_train, 421 | batch_size=batch_size, nb_epoch=nb_epoch, verbose=1, 422 | callbacks=[history, checkpointer], 423 | validation_data=(X_val, y_val)) 424 | 425 | analyzer.save_model("weights-after-iteration.hdf5", overwrite=True) 426 | 427 | 428 | def detect(sequence, input_len, analyzer, mapping='m2m', sentence_length=40, 429 | nb_options=1): 430 | """ 431 | Scan the given sequence for detecting anormalies. 
432 | 433 | Arguments: 434 | sequence: {lsit}, the original input sequence 435 | input_len: {integer}, the number of unique id classes 436 | analyzer: {SequenceAnalyzer}, the sequence analyzer 437 | mapping: {string}, input to output mapping. 438 | 'o2o': one-to-one 439 | 'm2m': many-to-many 440 | sentence_length: {integer}, the length of each sentence. 441 | nb_options: {interger}, number of predicted options. 442 | """ 443 | # sequence length 444 | length = len(sequence) 445 | 446 | # predicted probabilities for each id 447 | # we assume the first sentence_length ids are true 448 | probs = np.zeros((nb_options+1, length)) 449 | for o in xrange(nb_options+1): 450 | probs[o][:sentence_length] = 1.0 451 | 452 | # probability in negative log scale 453 | log_probs = np.zeros((nb_options+1, length)) 454 | 455 | # count the number of correct predictions 456 | nb_correct = [0] * (nb_options+1) 457 | 458 | start_time = time.time() 459 | try: 460 | # generate elements 461 | for start_index in xrange(length - sentence_length): 462 | # seed sentence 463 | X = sequence[start_index : start_index + sentence_length] 464 | y_next_true = sequence[start_index + sentence_length] 465 | 466 | seed = np.zeros((1, sentence_length, input_len)) 467 | # format input 468 | for t in range(0, sentence_length): 469 | seed[0, t, X[t]] = 1 470 | 471 | # get predictions, verbose = 0, no logging 472 | predictions = np.asarray(analyzer.model.predict(seed, verbose=0)[0]) 473 | 474 | # y_predicted 475 | y_next_pred = [] 476 | next_probs = [0.0] * (nb_options+1) 477 | if mapping == 'o2o': 478 | # y_next_pred[np.argmax(predictions)] = True 479 | # get the top-nb_options predictions with the high probability 480 | y_next_pred = np.argsort(predictions)[-nb_options:][::-1] 481 | # get the probability of the y_true 482 | next_probs[0] = predictions[y_next_true] 483 | elif mapping == 'm2m': 484 | # y_next_pred[np.argmax(predictions[-1])] = True 485 | # get the top-nb_options predictions with the high probability 486 | y_next_pred = np.argsort(predictions[-1])[-nb_options:][::-1] 487 | # get the probability of the y_true 488 | next_probs[0] = predictions[-1][y_next_true] 489 | 490 | print y_next_pred, y_next_true 491 | # chech whether the y_true is in the top-predicted options 492 | for i in xrange(nb_options): 493 | if y_next_true == y_next_pred[i]: 494 | next_probs[i+1] = 1.0 495 | nb_correct[i+1] += 1 496 | 497 | next_probs = np.maximum.accumulate(next_probs) 498 | print next_probs 499 | 500 | for j in xrange(nb_options+1): 501 | probs[j, start_index + sentence_length] = next_probs[j] 502 | # get the negative log probability 503 | log_probs[j, start_index + sentence_length] = -log(next_probs[j]) 504 | 505 | print start_index, next_probs 506 | 507 | except KeyboardInterrupt: 508 | print "KeyboardInterrupt" 509 | 510 | nb_correct = np.add.accumulate(nb_correct) 511 | for p in xrange(nb_options+1): 512 | print "Accuracy %d: %.4f%%" %(p, (nb_correct[p] * 100.0 / 513 | (start_index + 1))) # pylint: disable=W0631 514 | 515 | print " |-Plot figures ..." 
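    # probs[0] holds the probability the model assigned to the true next id at
    # each position; row j >= 1 is the same value pushed up to 1.0 whenever the
    # true id appeared among the top-j predicted options (rows were made
    # monotone with np.maximum.accumulate above). For every row, a figure and a
    # text file are written in normal scale (prob_<j>) and in negative-log
    # scale (log_prob_<j>).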
516 | for q in xrange(nb_options+1): 517 | plot_and_write_prob(probs[q], 518 | "prob_"+str(q), 519 | [0, 50000, 0, 1], 520 | 'Normal') 521 | plot_and_write_prob(log_probs[q], 522 | "log_prob_"+str(q), 523 | [0, 50000, 0, 25], 524 | 'Log') 525 | 526 | stop_time = time.time() 527 | print "--- %s seconds ---\n" % (stop_time - start_time) 528 | 529 | return probs 530 | 531 | 532 | def plot_hist(prob, filename, plot_range, scale, cumulative, normed=True): 533 | """ 534 | Plot and write the (cumulative) probabilties distribution. 535 | """ 536 | if scale == 'Log': 537 | prob = [-p for p in prob] 538 | plt.hist(prob, bins=100, normed=normed, cumulative=cumulative) 539 | plt.ylabel('Probability in %s Scale' %scale) 540 | plt.ylabel('Distribution: Normalized=%s, Cumulated=%s.' %(normed, 541 | cumulative)) 542 | plt.grid(True) 543 | plt.axis(plot_range) 544 | plt.savefig(filename + ".png") 545 | plt.clf() 546 | plt.cla() 547 | 548 | 549 | def plot_and_write_prob(prob, filename, plot_range, scale): 550 | """ 551 | Plot and write the probabilties for each of the log. 552 | """ 553 | # print " |-Plot figures ..." 554 | plt.plot(prob, 'r*') 555 | plt.xlabel('Log') 556 | plt.ylabel('Probability in %s Scale' %scale) 557 | plt.axis(plot_range) 558 | plt.savefig(filename + ".png") 559 | plt.clf() 560 | plt.cla() 561 | 562 | # print " |-Write probabilities ..." 563 | with open(filename + '.txt', 'w') as prob_file: 564 | for p in prob: 565 | prob_file.write(str(p) + '\n') 566 | 567 | 568 | def run(hidden_len=512, batch_size=128, nb_epoch=50, nb_iterations=5, 569 | learning_rate=0.001, nb_predictions=20, mapping='m2m', 570 | sentence_length=40, step=40, mode='train'): 571 | """ 572 | Train, evaluate, or predict. 573 | 574 | Arguments: 575 | hidden_len: {integer}, the size of a hidden layer. 576 | batch_size: {interger}, the number of sentences per batch. 577 | nb_epoch: {interger}, number of epoches per iteration. 578 | nb_iterations: {integer}, number of iterations. 579 | learning_rate: {float}, learning rate. 580 | nb_predictions: {integer}, number of the ids predicted. 581 | mapping: {string}, input to output mapping. 582 | 'o2o': one-to-one 583 | 'm2m': many-to-many 584 | sentence_length: {integer}, the length of each training sentence. 585 | step: {integer}, the sample steps. 586 | mode: {string}, th running mode of this programm 587 | 'train': train and predict 588 | 'predict': only predict by loading existing model weights 589 | 'evaluate': evaluate the model in evaluation data set 590 | 'detect': detect a new log sequence for the probabilities 591 | """ 592 | # get parameters and dimensions of the model 593 | print "Loading training data..." 594 | train_sequence, input_len1 = get_sequence("./train_data/*") 595 | print "Loading validation data..." 
596 | val_sequence, input_len2 = get_sequence("./validation_data/*") 597 | input_len = max(input_len1, input_len2) 598 | 599 | print "Training sequence length: %d" %len(train_sequence) 600 | print "Validation sequence length: %d" %len(val_sequence) 601 | print "#classes: %d\n" %input_len 602 | 603 | # two layered LSTM 512 hidden nodes and a dropout rate of 0.2 604 | rnn = SequenceAnalyzer(sentence_length, input_len, hidden_len, input_len) 605 | 606 | # build model 607 | rnn.build(layer='LSTM', mapping=mapping, learning_rate=learning_rate, 608 | nb_layers=3, dropout=0.5) 609 | 610 | # plot model 611 | # rnn.plot_model() 612 | 613 | # load the previous model weights 614 | # rnn.load_model("weights-after-iteration-l1.hdf5") 615 | 616 | if mode == 'predict': 617 | print "Predict..." 618 | predict(val_sequence, input_len, rnn, nb_predictions=nb_predictions, 619 | mapping=mapping, sentence_length=sentence_length) 620 | elif mode == 'evaluate': 621 | print "Evaluate..." 622 | print "Metrics: " + ', '.join(rnn.model.metrics_names) 623 | X_val, y_val = get_data(val_sequence, input_len, mapping=mapping, 624 | sentence_length=sentence_length, step=step, 625 | random_offset=False) 626 | results = rnn.model.evaluate(X_val, y_val, #pylint: disable=W0612 627 | batch_size=batch_size, 628 | verbose=1) 629 | print "Loss: ", results[0] 630 | print "Accuracy: ", results[1] 631 | elif mode == 'train': 632 | print "Train..." 633 | try: 634 | train(rnn, train_sequence, val_sequence, input_len, 635 | batch_size=batch_size, nb_epoch=nb_epoch, 636 | nb_iterations=nb_iterations, 637 | sentence_length=sentence_length, 638 | step=step, mapping=mapping) 639 | except KeyboardInterrupt: 640 | rnn.save_model("weights-stop.hdf5", overwrite=True) 641 | elif mode == 'detect': 642 | print "Detect..." 643 | detect(val_sequence, input_len, rnn, mapping=mapping, 644 | sentence_length=sentence_length, nb_options=3) 645 | else: 646 | print "The mode = %s is not correct!!!" %mode 647 | 648 | return mode 649 | 650 | 651 | if __name__ == '__main__': 652 | run() 653 | --------------------------------------------------------------------------------
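A minimal end-to-end usage sketch (illustrative only; it assumes `./train_data/` and `./validation_data/` directories containing text files with one integer id per line, which is the format `get_sequence()` expects):

```python
# Sketch of driving rnn_sequence_analyzer.py; all keyword arguments below
# exist in run() with these defaults. Switch mode to 'evaluate', 'predict'
# or 'detect' once weights.hdf5 has been written during training.
from rnn_sequence_analyzer import run

run(hidden_len=512, batch_size=128, nb_epoch=50, nb_iterations=5,
    learning_rate=0.001, mapping='m2m', sentence_length=40, step=40,
    mode='train')
```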