├── .travis.yml ├── README.md ├── brnn_model.png ├── markov_chain.py ├── naive_bayes.py ├── others ├── brnn_sequence_analyzer.py ├── brnn_sequence_analyzer_gen.py ├── rnn_sequence_analyzer_gen.py ├── sequence_analyzer.py └── sequence_analyzer_gen.py ├── requirements.txt ├── rnn_model.png └── rnn_sequence_analyzer.py /.travis.yml: -------------------------------------------------------------------------------- 1 | before_install: 2 | - sudo apt-get install libblas-dev liblapack-dev libatlas-base-dev gfortran 3 | language: python 4 | python: 5 | - "2.7" 6 | # command to install dependencies 7 | install: "pip install -r requirements.txt" 8 | # command to run tests 9 | script: nosetests 10 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # sequence-rnn-py 2 | 3 | [![Build Status](https://travis-ci.org/fluency03/sequence-rnn-py.svg?branch=master)](https://travis-ci.org/fluency03/sequence-rnn-py) 4 | 5 | This program analyze the sequence using (Uni-directional and Bi-directional) Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) based on the python library Keras ([Documents](http://keras.io/) and [Github](https://github.com/fchollet/keras)). 6 | It is based on this [lstm_text_generation.py](https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py) and this [imdb_bidirectional_lstm.py]( https://github.com/fchollet/keras/blob/master/examples/imdb_bidirectional_lstm.py) examples of Keras. 7 | 8 | 9 | *This is part of my master thesis project and still in development.* 10 | 11 | ## Requirements 12 | 13 | - [Python 2.7](https://www.python.org/downloads/) 14 | - [NumPy](http://www.numpy.org/): The fundamental package needed for scientific computing with Python. 15 | - [SciPy](http://scipy.org/): Python-based ecosystem of open-source software for mathematics, science, and engineering. 16 | - [Theano](http://deeplearning.net/software/theano/): A Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. 17 | - [Tensorflow](https://www.tensorflow.org/): An open source software library for numerical computation using data flow graphs. 18 | - [Keras>=1.0](http://keras.io/): A minimalist, highly modular neural networks library, written in Python and capable of running on top of either TensorFlow or Theano. Update the Keras: 19 | 20 | `pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps` . 21 | 22 | - **GPU Support** (optional but highly recommended). Instructions of enabling GPU are here: [for Theano](http://deeplearning.net/software/theano/install.html#using-the-gpu) and [for TensorFlow](https://www.tensorflow.org/versions/r0.7/get_started/os_setup.html#optional-linux-enable-gpu-support). 23 | - [pydot](https://github.com/erocarrera/pydot) and [graphviz](http://www.graphviz.org/) (optional, if you want to plot the model) 24 | - [HDF5](https://www.hdfgroup.org/HDF5/) and [h5py](http://www.h5py.org/) (optional, if you use model saving/loading functions) 25 | 26 | 27 | ## Materials 28 | 29 | A serias of Recurrent Neural Networks Tutorial: 30 | 31 | 1. [Part 1 - Introduction to RNNs](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) 32 | 2. 
[Part 2 - Implementing a RNN with Python, Numpy and Theano](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/) 33 | 3. [Part 3 - Backpropagation Through Time and Vanishing Gradients](http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/) 34 | 4. [Part 4 - Implementing a GRU/LSTM RNN with Python and Theano](http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/) 35 | 36 | Two great materials about LSTM: [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) by [Christopher Olah](http://colah.github.io/) and [Understanding LSTM and its diagrams](https://medium.com/@shiyan/understanding-lstm-and-its-diagrams-37e2f46f1714#.5hkwmotmr) by [Shi Yan](https://medium.com/@shiyan). 37 | 38 | The best post on [Andrej Karpathy's blog](http://karpathy.github.io/) regarding sequence prediction using RNNs: [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) 39 | 40 | A deeper treatment of RNNs: [Chapter 10 - Sequence Modeling: Recurrent and Recursive Nets](http://www.deeplearningbook.org/contents/rnn.html) from the book [MIT Deep Learning](http://www.deeplearningbook.org/). 41 | 42 | 43 | ## Model 44 | 45 | 46 | - Uni-directional RNN model with two LSTM layers: 47 | 48 | ![ RNN LSTM ](https://github.com/fluency03/sequence-rnn-py/blob/master/rnn_model.png "RNN LSTM") 49 | 50 | 51 | - Bi-directional RNN model with one LSTM layer: 52 | 53 | ![ BRNN LSTM ](https://github.com/fluency03/sequence-rnn-py/blob/master/brnn_model.png "BRNN LSTM") 54 | 55 | 56 | - Naive Bayes model: 57 | 58 | [naive_bayes.py](https://github.com/fluency03/sequence-rnn-py/blob/master/naive_bayes.py) is a simple Naive Bayes model used for comparison. 59 | 60 | 61 | ## Data 62 | 63 | - Training Set 64 | 65 | - Validation Set 66 | 67 | - Test Set 68 | 69 | 70 | 71 | ## Training 72 | 73 | [hyperas](https://github.com/maxpumperla/hyperas) may help here. It is *a very simple convenience wrapper around [hyperopt](https://github.com/hyperopt/hyperopt) for fast prototyping with keras models* and is used for hyper-parameter optimization. An example can be found [here](https://github.com/maxpumperla/hyperas/blob/master/examples/lstm.py). 74 | 75 | Two good materials: 76 | 77 | - [CHAPTER 3: Improving the way neural networks learn](http://neuralnetworksanddeeplearning.com/chap3.html) by [Michael Nielsen](http://michaelnielsen.org/) 78 | - [Neural Networks Part 2: Setting up the Data and the Loss](http://cs231n.github.io/neural-networks-2/) and [Neural Networks Part 3: Learning and Evaluation](http://cs231n.github.io/neural-networks-3/) from the Stanford CS class [CS231n: Convolutional Neural Networks for Visual Recognition](http://cs231n.github.io/) 79 | 80 | Considerations: 81 | 82 | - **Batch Size**: how many streams of data are processed in parallel at one time. 83 | 84 | 85 | - **Samples per epoch** and **Batches per epoch**: how many samples or batches are considered per epoch. Based on some of my experiments: (i) the more samples there are, the higher the accuracy and the lower the loss that can be reached at the stable stage; (ii) the more batches (the integer ratio #samples/batch_size) there are, the higher the accuracy and the lower the loss at the stable stage, and the fewer iterations it takes to reach the same loss/accuracy value. A small sketch of how these quantities follow from the sampling scheme is given below.
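For illustration, here is a minimal sketch (with purely hypothetical numbers, not taken from this repository's data) of how the number of samples and batches per epoch follows from the sequence length, the sentence length, the sampling step and the batch size, mirroring the sampling loop used in `get_data()`:

```python
# Hypothetical numbers, for illustration only.
sequence_length = 100000   # total length of the training sequence
sentence_length = 80       # length of each training sentence
step = 80                  # sampling step
batch_size = 128           # number of sentences per batch

# get_data() takes sentences starting at 0, step, 2*step, ... as long as
# a full sentence and its shifted target still fit in the sequence.
nb_samples = len(range(0, sequence_length - sentence_length, step))
nb_batches = nb_samples // batch_size   # integer ratio #samples / batch_size

print(nb_samples)   # 1249 sentences per epoch
print(nb_batches)   # 9 full batches of 128 sentences each
```

Adjust the numbers to your own data; nothing above is specific to this repository beyond the shape of the sampling loop.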
86 | 87 | 88 | - **Sentence Length**: according to [char-rnn](https://github.com/karpathy/char-rnn): 89 | > The length of each data stream, which is also the limit at which the gradients can propagate backwards in time. For example, if seq_length is 20, then the gradient signal will never backpropagate more than 20 time steps, and the model might not find dependencies longer than this length in number of characters. 90 | 91 | This is effectively the limit of the model's long-term memory. 92 | 93 | > Thus, if you have a very difficult dataset where there are a lot of long-term dependencies, you will want to increase this setting. 94 | 95 | 96 | - **Offset during sampling**: the offset is the start index used when sampling X_train and y_train from the original sequence. The offset can be a fixed value or a random value in the range 0 ~ step-1. 97 | 98 | 99 | - **Data size vs. #parameters** in total: 100 | - #layers: the number of layers; [here](https://github.com/karpathy/char-rnn) it is suggested to always use a num_layers of either 2 or 3. 101 | - layer size: the number of units per layer. 102 | 103 | According to [char-rnn](https://github.com/karpathy/char-rnn), the two important quantities to keep track of here are: 104 | - The total number of parameters in your model. 105 | - The size of your dataset. 106 | These two should be about the same order of magnitude. 107 | 108 | **How to calculate the number of parameters in an RNN?** For example, consider one LSTM layer: 109 | - if it has a layer size of `H=512`; 110 | - if the vocabulary size is `C=3000` (the number of unique classes); 111 | - in a simplified, vanilla-RNN style view, the layer has three parameter matrices: `U` with dimension `(H, C)=(512, 3000)`, `V` with dimension `(C, H)=(3000, 512)`, `W` with dimension `(H, H)=(512, 512)` (an actual LSTM layer has four gates and correspondingly more weights); 112 | - the total number of parameters for one layer will then be `2HC + H^2`, which is **3,334,144** in this case. 113 | - That is over 3 million parameters for only one layer! 114 | 115 | 116 | - **Learning Rate**: this ratio influences the speed (the step size of gradient descent) and the quality of learning. The greater the ratio, the faster the network trains; the lower the ratio, the more accurate the training is. According to [LSTM: A Search Space Odyssey](http://arxiv.org/pdf/1503.04069v1.pdf) [\[1\]](https://github.com/fluency03/sequence-rnn-py#1-greff-klaus-rupesh-kumar-srivastava-jan-koutník-bas-r-steunebrink-and-jürgen-schmidhuber-lstm-a-search-space-odyssey-arxiv-preprint-arxiv150304069-2015): 117 | > The learning rate is by far the most important hyperparameter. And based on their suggestion, while searching for a good learning rate for the LSTM, it is sufficient to do a coarse search by starting with a high value (e.g. 1.0) and dividing it by ten until performance stops increasing. 118 | 119 | 120 | - **[Dropout](http://keras.io/layers/core/#dropout)**: a float between 0 and 1, indicating what fraction of a hidden layer's output is ignored when feeding the next layer. It is a powerful regularization method, mainly used to avoid overfitting. If your model is overfitting, it is better to increase the dropout value. A minimal sketch combining dropout with the layer sizes above is shown below.
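To make the dropout and parameter-count considerations above concrete, here is a minimal sketch. It assumes the Keras 1.x `Sequential` API and reuses the hypothetical sizes `H=512` and `C=3000` from the example above; it is not necessarily the exact model built in this repository:

```python
# Minimal sketch, not the repository's actual model definition.
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense
from keras.layers.wrappers import TimeDistributed

H, C, sentence_length = 512, 3000, 80   # hidden size, #classes, sentence length

model = Sequential()
model.add(LSTM(H, return_sequences=True, input_shape=(sentence_length, C)))
model.add(Dropout(0.2))                  # randomly drop 20% of the activations during training
model.add(LSTM(H, return_sequences=True))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(C, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# Simplified estimate from the bullet above (treating the layer as a plain
# RNN with matrices U, V, W); an actual LSTM layer has four gates and more weights.
print(2 * H * C + H * H)                 # 3334144, roughly 3.3 million per layer
```

The repository's own models are defined with the functional API in `brnn_sequence_analyzer.py` and the related scripts; the sketch only shows where `Dropout` sits between recurrent layers and how the rough parameter estimate is obtained.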
121 | 122 | 123 | - **Reinforcement learning function**: The *temperature* parameter is dividing the predicted log probabilities before the *[Softmax](https://en.wikipedia.org/wiki/Softmax_function)*, so lower temperature will cause the model to make more likely, but also more boring and conservative predictions. Higher temperatures cause the model to take more chances and increase diversity of results, but at a cost of more mistakes. 124 | 125 | 126 | - **Loss function**: [categorical_crossentropy](http://keras.io/objectives/) 127 | 128 | 129 | - **Optimizer**: [RMSprop](http://keras.io/optimizers/#rmsprop), you can try other options like simple [SGD](http://keras.io/optimizers/#sgd), [Adagrad](http://keras.io/optimizers/#adagrad) and [Adam](http://keras.io/optimizers/#adam). 130 | 131 | 132 | ## Reference 133 | 134 | ###### [1] Greff, Klaus, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. "*[LSTM: A search space odyssey.](http://arxiv.org/pdf/1503.04069v1.pdf)*" arXiv preprint arXiv:1503.04069 (2015). 135 | -------------------------------------------------------------------------------- /brnn_model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fluency03/sequence-rnn-py/0a55a8fcc93644bca216afc660564d3a606886ab/brnn_model.png -------------------------------------------------------------------------------- /markov_chain.py: -------------------------------------------------------------------------------- 1 | """ 2 | Markov Chain, a comparable model to RNN as a baseline. 3 | 4 | The LabeledMarkovPredictor is riginally written by Erik Ylipaa at SICS. 5 | 6 | Author: Chang Liu (fluency03) 7 | Data: 2016-04-15 8 | """ 9 | 10 | # import unittest 11 | # from collections import Counter, defaultdict 12 | 13 | import glob 14 | import numpy as np 15 | from hmmlearn.hmm import MultinomialHMM 16 | 17 | 18 | class LabeledMarkovPredictor(object): 19 | """ 20 | Model which builds a first order markov model of the labeled data. 21 | """ 22 | def __init__(self, num_classes, # pylint: disable=W0613 23 | eval_during_training=False, **kwargs): 24 | """ 25 | Create a new LabeledMarkov predictor. 26 | 27 | Arguments: 28 | num_classes: {integer}, the number of class labels in the data. 29 | eval_during_training: {bool}, if True, loss will be calculated 30 | during training. For the Markov chain this means very little. 31 | Disabling this speeds up training. 32 | kwargs: 33 | """ 34 | self.num_classes = num_classes 35 | self.eval_during_training = eval_during_training 36 | self.dirty_counts = True # 37 | self.setup_params() 38 | 39 | def setup_params(self): 40 | """ 41 | Set up other perematers. The model is basically just a matrix. Each row 42 | of the matrix is the conditional probabilty for the next symbol in the 43 | sequence, given the current symbol. We set all entries to 1, giving us 44 | a uniform distribution as a prior. 45 | """ 46 | # the matrix initialized with all ones 47 | self.W = np.ones((self.num_classes, self.num_classes), np.uint64) 48 | 49 | # Give the class count two dimensions, but put the second to 1, so it's 50 | # broadcastable over W when we wish to divide. 
Set to to the number of 51 | # classes, so that it will give us the uniform distribution as a prior 52 | self.class_counts = np.full(self.num_classes, 53 | self.num_classes, 54 | dtype=np.uint64) 55 | 56 | self.log_class_counts = np.log(np.full(self.num_classes, 57 | self.num_classes, 58 | dtype=np.uint64)) 59 | self.dirty_counts = False 60 | 61 | def train(self, training_arguments, *args, **kwargs): # pylint: disable=W0613 62 | """ 63 | Updates the model based on the input batch. The input should be a tuple 64 | of two ndarray training-batches. 65 | 66 | Arguments: 67 | training_arguments: {tuple}, should be a tuple of x- and y-batches. 68 | (x_batch, y_bath). The batches should be ndarray matrices of 69 | integer labels. The first dimension is the time dimension, the 70 | second the batch dimension. The shape is considered to have the 71 | semantics: (sequence_length, batch_size). 72 | args: 73 | kwargs: 74 | Returns: {tuple}, (training_loss, info_dict). The training loss will 75 | be the average negative log of the probability of the y_batch before 76 | training on the x_batch. The info_dict is an empty dictionary for 77 | this model. If eval_during_training was set to False when the model 78 | was instantiated, None is returned instead of the loss. 79 | """ 80 | # We disregard any arguments except the training arguments tuple 81 | try: 82 | x_batch, y_batch, mask = training_arguments # pylint: disable=W0612 83 | except ValueError: 84 | x_batch, y_batch = training_arguments 85 | mask = None 86 | 87 | sequence_length, batch_size = x_batch.shape 88 | 89 | # We go over each timestep and increase all the columns denoted by the 90 | # y's for the rows denoted by the x's 91 | for t in range(sequence_length): 92 | for batch_num in range(batch_size): 93 | x = x_batch[t, batch_num] 94 | y = y_batch[t, batch_num] 95 | self.W[x, y] += 1 96 | self.class_counts[x] += 1 97 | self.dirty_counts = True 98 | 99 | info_dict = dict() 100 | if self.eval_during_training: 101 | loss = self.evaluate(training_arguments) 102 | else: 103 | loss = None 104 | return loss, info_dict 105 | 106 | def evaluate(self, training_argument): 107 | """ 108 | Get the average negative log probability for the y_batch, using the 109 | model predicted probabilities from the x_batch. 110 | 111 | Arguments: 112 | training_argument: {tuple}, a pair of ndarrays (x_batch, y_batch). 113 | The batches should be matrices of integers of the same shape, 114 | where the first dimension is time, the second is over batches. 115 | Returns: {float}, The average negative log probability the model 116 | assigned the correct answers of the y_batch given the x_batch. 117 | """ 118 | # We disregard any arguments except the training arguments tuple 119 | try: 120 | x_batch, y_batch, mask = training_argument # pylint: disable=W0612 121 | except ValueError: 122 | x_batch, y_batch = training_argument 123 | mask = None 124 | 125 | x_batch = x_batch.astype(np.int) 126 | y_batch = y_batch.astype(np.int) 127 | sequence_length, batch_size = x_batch.shape 128 | 129 | # np.seterr(divide='ignore'). We ignore division by zero, since we will 130 | # be performing many of them. We will return the negative log likelihood 131 | # per sequence. This will be the logarithm of normalized value for each 132 | # of the entries in the matrix. 
The matrix needs to be normalized by row 133 | # P = np.divide(self.W, self.class_counts) 134 | # P[np.where(np.isnan(P))] = 1/self.num_classes 135 | # Any rows with NaN, we replace with a uniform score 136 | flat_x = x_batch.flatten() 137 | flat_y = y_batch.flatten() 138 | if self.dirty_counts: 139 | self.log_class_counts = np.log(self.class_counts) 140 | self.dirty_counts = False 141 | 142 | # We should take the negative log of the probabilities, this is the same 143 | # as taking the log of the W[x,y]/count[x], which is the same as 144 | # log(W[x,y]) - log(count[x]) 145 | # probs = self.W[flat_x, flat_y] / self.class_counts[flat_x] 146 | # log_probs = np.log(probs) 147 | log_probs = (np.log(self.W[flat_x, flat_y]) - 148 | self.log_class_counts[flat_x]) 149 | loss = - float(np.sum(log_probs)) 150 | # for consistency, divide the negative log loss with the batch size and 151 | # sequence length returning the same loss as the RNN models 152 | sequence_loss = loss / (batch_size * sequence_length) 153 | return sequence_loss 154 | 155 | def predict(self, x_batch): 156 | """ 157 | Arguments: 158 | x_batch: {np.array}, An ndarray of integer labels. 159 | Returns: {integer}. The predicted label the same shape as x_batch. 160 | """ 161 | x_batch = x_batch.astype(np.int) 162 | # for each entry in x_batch, it will pick out a row for W. 163 | label_counts = self.W[x_batch] 164 | # along each row picked by the x_batch 165 | # return the index of the highest count 166 | return np.argmax(label_counts, axis=-1) 167 | 168 | 169 | def transpose(theList): 170 | """ 171 | Transpose matrix for Markov Chain model. 172 | 173 | Arguments: 174 | theList: {list}, the input list. 175 | Returns: {np.array}, the transposed np.array. 176 | """ 177 | return np.asarray(theList).transpose() 178 | 179 | 180 | def get_sequence(filepath): 181 | """ 182 | Get the original sequence from file. 183 | 184 | Arguments: 185 | filename: {string}, the name/path of input log sequence file. 186 | Returns: 187 | {list}, the log sequence. 188 | {integer}, the size of vocabulary. 189 | """ 190 | # read file and convert ids of each line into array of numbers 191 | seqfiles = glob.glob(filepath) 192 | sequence = [] 193 | 194 | for seqfile in seqfiles: 195 | with open(seqfile, 'r') as f: 196 | one_sequence = [int(id_) for id_ in f] 197 | print " %s, sequence length: %d" %(seqfile, 198 | len(one_sequence)) 199 | sequence.extend(one_sequence) 200 | 201 | # add two extra positions for 'unknown-log' and 'no-log' 202 | vocab_size = max(sequence) + 2 203 | 204 | return sequence, vocab_size 205 | 206 | 207 | def get_data(sequence, sentence_length=40, random_offset=False): 208 | """ 209 | Retrieves data from a plain txt file and formats it using one-hot vector. 210 | 211 | Arguments: 212 | sequence: {lsit}, the original input sequence 213 | sentence_length: {integer}, the length of each training sentence. 214 | random_offset: {bool}, the offset is random between step or is 0. 
215 | Returns: 216 | {list}, training input data X 217 | {list}, training target data y 218 | """ 219 | X_sentences = [] 220 | y_sentences = [] 221 | 222 | offset = np.random.randint(0, sentence_length) if random_offset else 0 223 | 224 | # creat batch data and next sentences 225 | for i in range(offset, len(sequence) - sentence_length, sentence_length): 226 | X_sentences.append(sequence[i : i + sentence_length]) 227 | y_sentences.append(sequence[i + 1 : i + sentence_length + 1]) 228 | 229 | return X_sentences, y_sentences 230 | 231 | 232 | def train(sentence_length=40): 233 | """ 234 | Train the markov chain. 235 | 236 | Arguments: 237 | sentence_length: {integer}, length of one sentence in the data set. 238 | """ 239 | # get parameters and dimensions of the model 240 | print "Loading training data..." 241 | train_sequence, input_len1 = get_sequence("./train_data/*") 242 | print "Loading validation data..." 243 | val_sequence, input_len2 = get_sequence("./validation_data/*") 244 | nb_classes = max(input_len1, input_len2) 245 | 246 | print "Training sequence length: %d" %len(train_sequence) 247 | print "Validation sequence length: %d" %len(val_sequence) 248 | print "#classes: %d\n" %nb_classes 249 | 250 | X_train, y_train = get_data(train_sequence, 251 | sentence_length=sentence_length, 252 | random_offset=False) 253 | X_val, y_val = get_data(val_sequence, 254 | sentence_length=sentence_length, 255 | random_offset=False) 256 | 257 | print "Build Markov Chain..." 258 | model = LabeledMarkovPredictor(nb_classes) 259 | 260 | print "Train the model..." 261 | model.train((transpose(X_train), transpose(y_train))) 262 | 263 | print "Validating..." 264 | validation_loss = 0 265 | validation_loss = model.evaluate((transpose(X_val), transpose(y_val))) 266 | 267 | print "Validation loss: {}".format(validation_loss) 268 | 269 | 270 | # TODO: not working yet 271 | def train_hmm(): 272 | """ 273 | HMM for sequence learning. 274 | """ 275 | print "Loading training data..." 276 | train_sequence, num_classes = get_sequence("./train_data/*") 277 | 278 | print "Build HMM..." 279 | model = MultinomialHMM(n_components=2) 280 | 281 | print "Train HMM..." 282 | model.fit([train_sequence]) 283 | 284 | 285 | 286 | if __name__ == '__main__': 287 | train() 288 | # train_hmm() 289 | -------------------------------------------------------------------------------- /naive_bayes.py: -------------------------------------------------------------------------------- 1 | """ 2 | Simple Naive Bayes classifier implimentation for sequence prediction. 3 | 4 | Author: Chang Liu (fluency03) 5 | Data: 2016-05-12 6 | """ 7 | 8 | import cPickle as pickle 9 | import glob 10 | import os 11 | import time 12 | from math import log 13 | import numpy as np 14 | from rnn_sequence_analyzer import plot_hist, plot_and_write_prob 15 | 16 | 17 | class NaiveBayes(object): 18 | """ 19 | Simple Naive Bayes classifier implimentation for sequence prediction. 20 | """ 21 | def __init__(self, window_size, nb_classes, alpha=1.0): 22 | """ 23 | Initialization. Set up some parameters. Build up the matrix. 24 | 25 | Arguments: 26 | window_size: {integer}, the size of input window. 27 | nb_classes: {integer}, number of uniques classes. 28 | alpha: {float}, the smoothing priors alpha >= 0 accounts for 29 | features not present in the learning samples and prevents zero 30 | probabilities in further computations. Setting alpha = 1 is 31 | called Laplace smoothing, while alpha < 1 is called 32 | Lidstone smoothing. 
33 | 34 | """ 35 | self.window_size = window_size 36 | self.nb_classes = nb_classes 37 | self.alpha = alpha 38 | self.build() 39 | 40 | def build(self): 41 | """ 42 | Build up the matrix. 43 | """ 44 | self.ny = np.zeros((self.nb_classes,), dtype=np.int) 45 | self.nx_y = np.zeros((self.window_size, 46 | self.nb_classes, 47 | self.nb_classes), dtype=np.int) 48 | 49 | def train(self, X, y): 50 | """ 51 | Train the model. 52 | 53 | Arguments: 54 | X: {array}, X training data. 55 | y: {array}, y training data. 56 | """ 57 | N = len(y) 58 | for i in xrange(N): 59 | self.ny[y[i]] += 1 60 | for j in xrange(self.window_size): 61 | self.nx_y[j, X[i, j], y[i]] += 1 62 | 63 | def save_model(self, filename): 64 | """ 65 | Save the model information to a file. 66 | """ 67 | print " |-Write the model into %s ..." %filename 68 | with open(filename, 'w') as pkl_file: 69 | pickle.dump({'ny': self.ny, 'nx_y': self.nx_y, 70 | 'window_size': self.window_size, 71 | 'nb_classes': self.nb_classes, 72 | 'alpha': self.alpha}, pkl_file) 73 | 74 | def load_model(self, filename): 75 | """ 76 | Load the model information from a file. 77 | """ 78 | if os.path.isfile(filename): 79 | print "%s existing, loading it...\n" %filename 80 | with open(filename) as pkl_file: 81 | model = pickle.load(pkl_file) 82 | self.ny = model['ny'] 83 | self.nx_y = model['nx_y'] 84 | # self.window_size = model['window_size'] 85 | # self.nb_classes = model['nb_classes'] 86 | # self.alpha = model['alpha'] 87 | else: 88 | print "File does not exist!" 89 | 90 | def evaluate(self, X, y, normalization=True, log_scale=False): 91 | """ 92 | Evaluate the model. 93 | 94 | Arguments: 95 | X: {array}, X evaluation data. 96 | y: {array}, y evaluation data. 97 | normalization: {bool}, whether do the normalization. 98 | log_scale: {bool}, whether transfer probabilities on log scale. 99 | """ 100 | def scale(p): 101 | """ 102 | Probability in log scale. 103 | """ 104 | return log(p) if log_scale else p 105 | 106 | def normalize(py_x): 107 | """ 108 | Normalize the probabilities. 109 | """ 110 | py_x_sum = np.sum(py_x) 111 | return np.asarray([py_x[p] / py_x_sum 112 | for p in xrange(self.nb_classes)]) 113 | 114 | N = np.sum(self.ny) 115 | length = len(y) 116 | print "length: %d " %length 117 | correct = 0 118 | 119 | probs = np.zeros(length) 120 | if not log_scale: 121 | probs[:self.window_size] = 1.0 122 | 123 | # ------------------- Prior ------------------- # 124 | py = np.zeros(self.nb_classes) 125 | for i in xrange(self.nb_classes): 126 | py[i] = ((self.ny[i] + self.alpha) / 127 | (N + self.alpha * self.nb_classes)) 128 | 129 | for i in xrange(length): 130 | print "evaluating %d ..." 
%i 131 | # ------------------- Likelihood ------------------- # 132 | px_y = np.zeros((self.nb_classes, self.window_size)) 133 | for p in xrange(self.nb_classes): 134 | for k in xrange(self.window_size): 135 | px_y[p, k] = ((self.nx_y[k, X[i, k], p] + 136 | self.alpha) / 137 | (self.ny[p] + 138 | self.alpha * self.nb_classes)) 139 | # ------------------- Posterior ------------------- # 140 | py_x = np.zeros(self.nb_classes) 141 | for j in xrange(self.nb_classes): 142 | py_x[j] = py[j] * np.prod(px_y[j]) 143 | 144 | # ------------------- Normalization ------------------- # 145 | if normalization: 146 | py_x = normalize(py_x) 147 | 148 | # ------------------- Prediction ------------------- # 149 | # check the prediction 150 | y_pred = np.argmax(py_x) 151 | y_true = y[i] 152 | 153 | max_prob = scale(py_x[y_pred]) 154 | print ("y_pred: %d , max_prod: %.8f, y_true_prob: %.8f ," 155 | %(y_pred, max_prob, scale(py_x[y_true]))) 156 | 157 | if y_true == y_pred: 158 | correct += 1 159 | 160 | probs[i + self.window_size] = max_prob 161 | 162 | accuracy = (correct * 100.0) / length 163 | print "Accuracy: %.4f%%" %accuracy 164 | 165 | print " |-Plot figures ..." 166 | plot_and_write_prob(probs, 167 | "nb_prob_", 168 | [0, 50000, 0, 1], 169 | 'Log' if log_scale else 'Normal') 170 | 171 | def evaluate_all(self, X, y, nb_options=3, normalization=True): # pylint: disable=R0912 172 | """ 173 | Evaluate the model. 174 | 175 | Arguments: 176 | X: {array}, X evaluation data. 177 | y: {array}, y evaluation data. 178 | nb_options: {interger}, number of predicted options. 179 | normalization: {bool}, whether do the normalization. 180 | """ 181 | N = np.sum(self.ny) 182 | length = len(y) 183 | print "length: %d " %length 184 | 185 | probs = np.zeros((nb_options+1, length + self.window_size)) 186 | for o in xrange(nb_options+1): 187 | probs[o][:self.window_size] = 1.0 188 | 189 | # probability in negative log scale 190 | log_probs = np.zeros((nb_options+1, length + self.window_size)) 191 | 192 | # count the number of correct predictions 193 | nb_correct = [0] * (nb_options+1) 194 | 195 | # ------------------- Prior ------------------- # 196 | py = np.zeros(self.nb_classes) 197 | for i in xrange(self.nb_classes): 198 | py[i] = ((self.ny[i] + self.alpha) / 199 | (N + self.alpha * self.nb_classes)) 200 | 201 | try: 202 | for i in xrange(length): 203 | print "evaluating %d ..." 
%i 204 | # ------------------- Likelihood ------------------- # 205 | px_y = np.zeros((self.nb_classes, self.window_size)) 206 | for p in xrange(self.nb_classes): 207 | for k in xrange(self.window_size): 208 | px_y[p, k] = ((self.nx_y[k, X[i, k], p] + 209 | self.alpha) / 210 | (self.ny[p] + 211 | self.alpha * self.nb_classes)) 212 | # ------------------- Posterior ------------------- # 213 | py_x = np.zeros(self.nb_classes) 214 | for j in xrange(self.nb_classes): 215 | py_x[j] = py[j] * np.prod(px_y[j]) 216 | 217 | # ------------------- Normalization ------------------- # 218 | if normalization: 219 | py_x_sum = np.sum(py_x) 220 | py_x = np.asarray([py_x[p] / py_x_sum 221 | for p in xrange(self.nb_classes)]) 222 | 223 | # ------------------- Prediction ------------------- # 224 | # check the prediction 225 | y_pred = np.argsort(py_x)[-nb_options:][::-1] 226 | y_true = y[i] 227 | print y_pred, y_true 228 | 229 | next_probs = [0.0] * (nb_options+1) 230 | next_probs[0] = py_x[y_true] 231 | 232 | for o in xrange(nb_options): 233 | if y_true == y_pred[o]: 234 | next_probs[o+1] = 1.0 235 | nb_correct[o+1] += 1 236 | 237 | next_probs = np.maximum.accumulate(next_probs) 238 | print next_probs 239 | 240 | for k in xrange(nb_options+1): 241 | probs[k, i + self.window_size] = next_probs[k] 242 | # get the negative log probability 243 | log_probs[k, i + self.window_size] = -log(next_probs[k]) 244 | 245 | except: 246 | print "KeyboardInterrupt" 247 | 248 | nb_correct = np.add.accumulate(nb_correct) 249 | for n in xrange(nb_options+1): 250 | print "Accuracy %d: %.4f%%" %(n, (nb_correct[n] * 100.0 / (i + 1))) # pylint: disable=W0631 251 | 252 | print " |-Plot figures ..." 253 | for q in xrange(nb_options+1): 254 | plot_and_write_prob(probs[q], 255 | "nb_prob_"+str(q), 256 | [0, 50000, 0, 1], 257 | 'Normal') 258 | plot_and_write_prob(log_probs[q], 259 | "nb_log_prob_"+str(q), 260 | [0, 50000, 0, 25], 261 | 'Log') 262 | def predict(self, X): 263 | """ 264 | Predict next sequence. 265 | """ 266 | pass 267 | 268 | 269 | 270 | def get_sequence(filepath): 271 | """ 272 | Get the original sequence from file. 273 | 274 | Arguments: 275 | filename: {string}, the name/path of input log sequence file. 276 | Returns: 277 | {list}, the log sequence. 278 | {integer}, the size of vocabulary. 279 | {integer}, total length of the sequences. 280 | """ 281 | # read file and convert ids of each line into array of numbers 282 | seqfiles = glob.glob(filepath) 283 | sequences = [] 284 | total_length = 0 285 | max_value = 0 286 | 287 | for seqfile in seqfiles: 288 | sequence = [] 289 | with open(seqfile, 'r') as f: 290 | one_sequence = [int(id_) for id_ in f] 291 | print " %s, sequence length: %d" %(seqfile, 292 | len(one_sequence)) 293 | sequence.extend(one_sequence) 294 | total_length += len(one_sequence) 295 | max_new = np.amax(sequence) 296 | max_value = max_new if max_new > max_value else max_value 297 | sequences.append(sequence) 298 | 299 | # add two extra positions for 'unknown-log' and 'no-log' 300 | vocab_size = max_value + 2 301 | 302 | return sequences, vocab_size, total_length 303 | 304 | 305 | def get_data(sequence, sentence_length=40, step=3, random_offset=True): 306 | """ 307 | Retrieves data from a plain txt file and formats it using one-hot vector. 308 | 309 | Arguments: 310 | sequence: {lsit}, the original input sequence 311 | vocab_size: {integer}, the number of unique id classes 312 | sentence_length: {integer}, the length of each training sentence. 313 | step: {integer}, the sample steps. 
314 | random_offset: {bool}, the offset is random between step or is 0. 315 | Returns: 316 | {np.array}, training input data X 317 | {np.array}, training target data y 318 | """ 319 | X_sentences = [] 320 | next_ids = [] 321 | 322 | offset = np.random.randint(0, step) if random_offset else 0 323 | 324 | # creat batch data and next sentences 325 | for i in range(offset, len(sequence) - sentence_length, step): 326 | X_sentences.append(sequence[i : i + sentence_length]) 327 | next_ids.append(sequence[i + sentence_length]) 328 | 329 | # number of sampes 330 | # nb_samples = len(X_sentences) 331 | # print "total # of sentences: %d" %nb_samples 332 | 333 | return np.asarray(X_sentences), np.asarray(next_ids) 334 | 335 | 336 | def main(sentence_length=3, mode='train'): 337 | """ 338 | Train the model. 339 | 340 | Arguments: 341 | sentence_length: {integer}, the length of each training sentence. 342 | """ 343 | # get parameters and dimensions of the model 344 | print "Loading training data..." 345 | train_sequence, input_len1, total_length1 = get_sequence("./train_data/*") 346 | 347 | print "Loading validation data..." 348 | val_sequence, input_len2, total_length2 = get_sequence("./validation_data/*") 349 | 350 | input_len = max(input_len1, input_len2) 351 | 352 | print "Training sequence length: %d" %total_length1 353 | print "Validation sequence length: %d" %total_length2 354 | print "#classes: %d\n" %input_len 355 | 356 | start_time = time.time() 357 | 358 | nb = NaiveBayes(window_size=sentence_length, 359 | nb_classes=input_len, 360 | alpha=1.0/input_len) 361 | 362 | if mode == 'train': 363 | print "Train the model...\n" 364 | for sequence in train_sequence: 365 | X_train, y_train = get_data(sequence, sentence_length=sentence_length, 366 | step=1, random_offset=False) 367 | nb.train(X_train, y_train) 368 | # nb.save_model('2.pkl') 369 | elif mode == 'load': 370 | nb.load_model('2.pkl') 371 | 372 | print "Evaluate the model...\n" 373 | # for sequence in val_sequence: 374 | # X_val, y_val = get_data(sequence, sentence_length=sentence_length, 375 | # step=1, random_offset=False) 376 | # nb.evaluate(X_val, y_val, normalization=True, log_scale=False) 377 | 378 | for sequence in val_sequence: 379 | X_val, y_val = get_data(sequence, sentence_length=sentence_length, 380 | step=1, random_offset=False) 381 | nb.evaluate_all(X_val, y_val, nb_options=3, normalization=True) 382 | 383 | stop_time = time.time() 384 | print "Stop...\n" 385 | print "--- %s seconds ---\n" % (stop_time - start_time) 386 | 387 | if __name__ == '__main__': 388 | main() 389 | -------------------------------------------------------------------------------- /others/brnn_sequence_analyzer.py: -------------------------------------------------------------------------------- 1 | """ 2 | This program analyze the integer sequence using Bi-diractional Recurrent Neural 3 | Network (BRNN) with Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) 4 | based on the python library Keras. 5 | 6 | "Keras is a minimalist, highly modular neural networks library, written in 7 | Python and capable of running on top of either TensorFlow or Theano." 
8 | ---- Keras (http://keras.io/) 9 | 10 | It is based on this Keras example - imdb_bidirectional_lstm.py: 11 | https://github.com/fchollet/keras/blob/master/examples/imdb_bidirectional_lstm.py 12 | 13 | Author: Chang Liu (fluency03) 14 | Data: 2016-03-26 15 | """ 16 | 17 | import glob 18 | # import os 19 | import sys 20 | import csv 21 | import time 22 | import matplotlib.pyplot as plt 23 | import numpy as np 24 | 25 | from keras.callbacks import Callback, ModelCheckpoint 26 | from keras.layers import Input, Dense, Dropout, LSTM, GRU, merge 27 | from keras.layers.wrappers import TimeDistributed 28 | from keras.models import Model 29 | from keras.optimizers import RMSprop # pylint: disable=W0611 30 | from keras.utils.visualize_util import plot 31 | 32 | 33 | # random number generator with a fixed value for reproducibility 34 | np.random.seed(1337) 35 | 36 | 37 | def override(f): 38 | """ 39 | Override decorator. 40 | """ 41 | return f 42 | 43 | 44 | class SequenceAnalyzer(object): 45 | """ 46 | Sequence analyzer based on RNN Graph model. 47 | """ 48 | def __init__(self, sentence_length, input_len, hidden_len, output_len): 49 | self.sentence_length = sentence_length 50 | self.input_len = input_len 51 | self.hidden_len = hidden_len 52 | self.output_len = output_len 53 | self.model = None 54 | 55 | def build(self, layer='LSTM', mapping='m2m', learning_rate=0.001, 56 | nb_layers=2, dropout=0.2): 57 | """ 58 | Bidirectional RNN with specified dropout rate (default 0.2), built with 59 | softmax activation, cross entropy loss and rmsprop optimizer. 60 | 61 | Arguments: 62 | layer: {string}, the type of the layers in the RNN Model. 63 | 'LSTM': LSTM layers 64 | 'GRU': GRU layers 65 | mapping: {string}, input to output mapping. 66 | 'o2o': one-to-one 67 | 'm2m': many-to-many 68 | learning_rate: {float}, learning rate. 69 | nb_layers: {integer}, number of layers in total. 70 | dropout: {float}, dropout value. 71 | """ 72 | print "Building Model..." 73 | print (" layer = %d-%s , mapping = %s , learning rate = %.5f " 74 | "nb_layers = %d , dropout = %.2f" 75 | %(self.hidden_len, layer, mapping, learning_rate, 76 | nb_layers, dropout)) 77 | 78 | # check the layer type: LSTM or GRU 79 | if layer == 'LSTM': 80 | class LAYER(LSTM): 81 | """ 82 | LAYER as LSTM. 83 | """ 84 | pass 85 | elif layer == 'GRU': 86 | class LAYER(GRU): 87 | """ 88 | LAYER as GRU. 
89 | """ 90 | pass 91 | 92 | # check whether return sequence for each of the layers 93 | return_sequences = [] 94 | if mapping == 'o2o': 95 | # if mapping is one-to-one 96 | for nl in range(nb_layers): 97 | if nl == nb_layers-1: 98 | return_sequences.append(False) 99 | else: 100 | return_sequences.append(True) 101 | elif mapping == 'm2m': 102 | # if mapping is many-to-many 103 | for _ in range(nb_layers): 104 | return_sequences.append(True) 105 | 106 | # add input 107 | input_layer = Input(shape=(self.sentence_length, self.input_len), 108 | dtype='float32') 109 | 110 | # first Bi-directional LSTM layer 111 | forward1 = LAYER(self.hidden_len, 112 | return_sequences=return_sequences[0])(input_layer) 113 | forward_dropout1 = Dropout(dropout)(forward1) # pylint: disable=W0612 114 | backward1 = LAYER(self.hidden_len, 115 | return_sequences=return_sequences[0], 116 | go_backwards=True)(input_layer) 117 | backward_dropout1 = Dropout(dropout)(backward1) # pylint: disable=W0612 118 | 119 | # following Bi-directional layers 120 | for nl in range(nb_layers-1): 121 | exec("%s = LAYER(self.hidden_len, return_sequences=%s)(%s)" 122 | %('forward' + str(nl+2), 123 | return_sequences[nl+1], 124 | 'forward_dropout' + str(nl+1))) 125 | exec("%s = Dropout(dropout)(%s)" 126 | %('forward_dropout' + str(nl+2), 127 | 'forward' + str(nl+2))) 128 | exec(("%s = LAYER(self.hidden_len, return_sequences=%s, " 129 | "go_backwards=True)(%s)") 130 | %('backward' + str(nl+2), 131 | return_sequences[nl+1], 132 | 'backward_dropout' + str(nl+1))) 133 | exec("%s = Dropout(dropout)(%s)" 134 | %('backward_dropout' + str(nl+2), 135 | 'backward' + str(nl+2))) 136 | 137 | merged_layer = merge([locals()['forward_dropout' + str(nb_layers)], 138 | locals()['backward_dropout' + str(nb_layers)]], 139 | mode='concat', concat_axis=-1) 140 | 141 | if mapping == 'o2o': 142 | output_layer = Dense(self.output_len, 143 | activation='softmax')(merged_layer) 144 | elif mapping == 'm2m': 145 | output_layer = TimeDistributed( 146 | Dense(self.output_len, activation='softmax'))(merged_layer) 147 | 148 | # add ouput 149 | self.model = Model(input=input_layer, output=output_layer) 150 | 151 | rms = RMSprop(lr=learning_rate) 152 | # try using different optimizers and different optimizer configs 153 | self.model.compile(loss='categorical_crossentropy', 154 | optimizer=rms, 155 | metrics=['accuracy']) 156 | 157 | def save_model(self, filename, overwrite=False): 158 | """ 159 | Save the model weight into a hdf5 file. 160 | 161 | Arguments: 162 | filename: {string}, the name/path to the file 163 | to which the weights are going to be saved. 164 | overwrite: {bool}, overwrite existing file. 165 | """ 166 | print "Save Weights %s ..." %filename 167 | self.model.save_weights(filename, overwrite=overwrite) 168 | 169 | def load_model(self, filename): 170 | """ 171 | Load the model weight into a hdf5 file. 172 | 173 | Arguments: 174 | filename: {string}, the name/path to the file 175 | to which the weights are going to be loaded. 176 | """ 177 | print "Load Weights %s ..." %filename 178 | self.model.load_weights(filename) 179 | 180 | def plot_model(self, filename='brnn_model.png'): 181 | """ 182 | Plot model. 183 | 184 | Arguments: 185 | filename: {string}, the name/path to the file 186 | to which the weights are going to be plotted. 187 | """ 188 | print "Plot Model %s ..." %filename 189 | plot(self.model, to_file=filename) 190 | 191 | 192 | class History(Callback): 193 | """ 194 | Record the loss and accuracy history. 
195 | """ 196 | @override 197 | def on_train_begin(self, logs={}): # pylint: disable=W0102 198 | """ 199 | A method starting at the begining of the training. 200 | 201 | Arguments: 202 | logs: {dictionary}, recording the training and validation 203 | losses and accuracy of every epoch. 204 | """ 205 | # training loss and accuracy 206 | self.train_losses = [] 207 | self.train_acc = [] 208 | # validation loss and accuracy 209 | self.val_losses = [] 210 | self.val_acc = [] 211 | 212 | @override 213 | def on_epoch_end(self, epoch, logs={}): # pylint: disable=W0102 214 | """ 215 | A method starting at the begining of the training. 216 | 217 | Arguments: 218 | epoch: {integer}, the current epoch. 219 | logs: {dictionary}, recording the training and validation 220 | losses and accuracy of every epoch. 221 | """ 222 | # record training loss and accuracy 223 | self.train_losses.append(logs.get('loss')) 224 | self.train_acc.append(logs.get('acc')) 225 | # record validation loss and accuracy 226 | self.val_losses.append(logs.get('val_loss')) 227 | self.val_acc.append(logs.get('val_acc')) 228 | 229 | # continutously save the train_loss, train_acc, val_loss, val_acc 230 | # into a csv file with 4 columns respeactively 231 | csv_name = 'history.csv' 232 | with open(csv_name, 'a') as csvfile: 233 | his_writer = csv.writer(csvfile) 234 | print "\n Save loss and accuracy into %s" %csv_name 235 | his_writer.writerow((logs.get('loss'), logs.get('acc'), 236 | logs.get('val_loss'), logs.get('val_acc'))) 237 | 238 | 239 | def sample(prob, temperature=0.2): 240 | """ 241 | Softmax function for reinforcement learning. 242 | 243 | Arguments: 244 | prob: {list}, a list of probabilities of each of the classes. 245 | temperature: {float}, Softmax temperature. 246 | Returns: 247 | {integer}, the most possible sample. 248 | """ 249 | prob = np.log(prob) / temperature 250 | prob = np.exp(prob) / np.sum(np.exp(prob)) 251 | return np.argmax(np.random.multinomial(1, prob, 1)) 252 | 253 | 254 | def get_sequence(filepath): 255 | """ 256 | Get the original sequence from file. 257 | 258 | Arguments: 259 | filename: {string}, the name/path of input log sequence file. 260 | Returns: 261 | {list}, the log sequence. 262 | {integer}, the size of vocabulary. 263 | """ 264 | # read file and convert ids of each line into array of numbers 265 | seqfiles = glob.glob(filepath) 266 | sequence = [] 267 | 268 | for seqfile in seqfiles: 269 | with open(seqfile, 'r') as f: 270 | one_sequence = [int(id_) for id_ in f] 271 | print " %s, sequence length: %d" %(seqfile, 272 | len(one_sequence)) 273 | sequence.extend(one_sequence) 274 | 275 | # add two extra positions for 'unknown-log' and 'no-log' 276 | vocab_size = max(sequence) + 2 277 | 278 | return sequence, vocab_size 279 | 280 | 281 | def get_data(sequence, vocab_size, mapping='m2m', sentence_length=40, step=3, 282 | random_offset=True): 283 | """ 284 | Retrieves data from a plain txt file and formats it using one-hot vector. 285 | 286 | Arguments: 287 | sequence: {lsit}, the original input sequence 288 | vocab_size: {integer}, the number of unique id classes 289 | mapping: {string}, input to output mapping. 290 | 'o2o': one-to-one 291 | 'm2m': many-to-many 292 | sentence_length: {integer}, the length of each training sentence. 293 | step: {integer}, the sample steps. 294 | random_offset: {bool}, the offset is random between step or is 0. 
295 | Returns: 296 | {np.array}, training input data X 297 | {np.array}, training target data y 298 | """ 299 | X_sentences = [] 300 | y_sentences = [] 301 | next_ids = [] 302 | 303 | offset = np.random.randint(0, step) if random_offset else 0 304 | 305 | # creat batch data and next sentences 306 | for i in range(offset, len(sequence) - sentence_length, step): 307 | X_sentences.append(sequence[i : i + sentence_length]) 308 | if mapping == 'o2o': 309 | # if mapping is one-to-one 310 | next_ids.append(sequence[i + sentence_length]) 311 | elif mapping == 'm2m': 312 | # if mapping is many-to-many 313 | y_sentences.append(sequence[i + 1 : i + sentence_length + 1]) 314 | 315 | # number of sampes 316 | nb_samples = len(X_sentences) 317 | # print "total # of sentences: %d" %nb_samples 318 | 319 | # one-hot vector (all zeros except for a single one at 320 | # the exact postion of this id number) 321 | X_train = np.zeros((nb_samples, sentence_length, vocab_size), dtype=np.bool) 322 | # expected outputs for each sentence 323 | if mapping == 'o2o': 324 | # if mapping is one-to-one 325 | y_train = np.zeros((nb_samples, vocab_size), dtype=np.bool) 326 | elif mapping == 'm2m': 327 | # if mapping is many-to-many 328 | y_train = np.zeros((nb_samples, sentence_length, vocab_size), 329 | dtype=np.bool) 330 | 331 | for i, x_sentence in enumerate(X_sentences): 332 | for t, id_ in enumerate(x_sentence): 333 | # mark the each corresponding character in a sentence as 1 334 | X_train[i, t, id_] = 1 335 | # if mapping is many-to-many 336 | if mapping == 'm2m': 337 | y_train[i, t, y_sentences[i][t]] = 1 338 | # if mapping is one-to-one 339 | # mark the corresponding character in expected output as 1 340 | if mapping == 'o2o': 341 | y_train[i, next_ids[i]] = 1 342 | 343 | return X_train, y_train 344 | 345 | 346 | def predict(sequence, input_len, analyzer, nb_predictions=80, 347 | mapping='m2m', sentence_length=40): 348 | """ 349 | Predict the next sequences using existing model and weights given some seed. 350 | 351 | Arguments: 352 | sequence: {lsit}, the original input sequence 353 | input_len: {integer}, the number of unique id classes 354 | analyzer: {SequenceAnalyzer}, the sequence analyzer 355 | nb_predictions: {integer}, number of predictions after giving the seed 356 | mapping: {string}, input to output mapping. 357 | 'o2o': one-to-one 358 | 'm2m': many-to-many 359 | sentence_length: {integer}, the length of each sentence. 
360 | """ 361 | # generate elements 362 | for _ in range(nb_predictions): 363 | # start index of the seed, random number in range 364 | start_index = np.random.randint(0, len(sequence) - sentence_length - 1) 365 | # seed sentence 366 | sentence = sequence[start_index : start_index + sentence_length] 367 | 368 | # Y_true 369 | y_true = sequence[start_index + 1 : start_index + sentence_length + 1] 370 | print "X: " + ' '.join(str(s).ljust(4) for s in sentence) 371 | 372 | seed = np.zeros((1, sentence_length, input_len)) 373 | # format input 374 | for t in range(0, sentence_length): 375 | seed[0, t, sentence[t]] = 1 376 | 377 | # get predictions 378 | # verbose = 0, no logging 379 | predictions = analyzer.model.predict(seed, verbose=0)[0] 380 | 381 | # y_predicted 382 | if mapping == 'o2o': 383 | next_id = np.argmax(predictions) 384 | sys.stdout.write(' ' + str(next_id)) 385 | sys.stdout.flush() 386 | elif mapping == 'm2m': 387 | next_sentence = [] 388 | for pred in predictions: 389 | next_sentence.append(np.argmax(pred)) 390 | print "y_pred: " + ' '.join(str(id_).ljust(4) 391 | for id_ in next_sentence) 392 | # next_id = np.argmax(predictions[-1]) 393 | 394 | # y_true 395 | print "y_true: " + ' '.join(str(s).ljust(4) for s in y_true) 396 | 397 | print "\n" 398 | 399 | 400 | def train(analyzer, train_sequence, val_sequence, input_len, 401 | batch_size=128, nb_epoch=50, nb_iterations=4, 402 | sentence_length=40, step=40, mapping='m2m'): 403 | """ 404 | Trains the network. 405 | 406 | Arguments: 407 | analyzer: {SequenceAnalyzer}. 408 | train_sequence: {list}, training sequence. 409 | val_sequence: {list}, validation sequence. 410 | input_len: {integer}, the number of classes, i.e., the input length of 411 | neural network. 412 | batch_size: {interger}, the number of sentences per batch. 413 | nb_epoch: {integer}, number of epoches per iteration. 414 | nb_iterations: {integer}, number of iterations. 415 | sentence_length: {integer}, the length of each training sentence. 416 | step: {integer}, the sample steps. 417 | mapping: {string}, input to output mapping. 418 | 'o2o': one-to-one 419 | 'm2m': many-to-many 420 | """ 421 | for iteration in range(1, nb_iterations+1): 422 | # create training data, randomize the offset between steps 423 | X_train, y_train = get_data(train_sequence, input_len, mapping=mapping, 424 | sentence_length=sentence_length, step=step, 425 | random_offset=False) 426 | X_val, y_val = get_data(val_sequence, input_len, mapping=mapping, 427 | sentence_length=sentence_length, step=step, 428 | random_offset=False) 429 | print "" 430 | print "------------------------ Start Training ------------------------" 431 | print "Iteration: ", iteration 432 | print "Number of epoch per iteration: ", nb_epoch 433 | 434 | # history of losses and accuracy 435 | history = History() 436 | 437 | # saves the model weights after each epoch 438 | # if the validation loss decreased 439 | checkpointer = ModelCheckpoint(filepath="weights.hdf5", 440 | verbose=1, save_best_only=True) 441 | 442 | # train the model 443 | analyzer.model.fit(X_train, y_train, 444 | batch_size=batch_size, nb_epoch=nb_epoch, verbose=1, 445 | callbacks=[history, checkpointer], 446 | validation_data=(X_val, y_val)) 447 | 448 | analyzer.save_model("weights-after-iteration.hdf5") 449 | 450 | 451 | def detect(sequence, input_len, analyzer, mapping='m2m', sentence_length=40): 452 | """ 453 | Scan the given sequence for detecting anormalies. 
454 | 455 | Arguments: 456 | sequence: {lsit}, the original input sequence 457 | input_len: {integer}, the number of unique id classes 458 | analyzer: {SequenceAnalyzer}, the sequence analyzer 459 | mapping: {string}, input to output mapping. 460 | 'o2o': one-to-one 461 | 'm2m': many-to-many 462 | sentence_length: {integer}, the length of each sentence. 463 | """ 464 | # sequence length 465 | length = len(sequence) 466 | 467 | # predicted probabilities for each id 468 | # we assume the first sentence_length ids are true 469 | prob = [1] * sentence_length + [0] * (length - sentence_length) 470 | 471 | start_time = time.time() 472 | try: 473 | # generate elements 474 | for start_index in xrange(length - sentence_length): 475 | # seed sentence 476 | X = sequence[start_index : start_index + sentence_length] 477 | # print "X: " + ' '.join(str(s).ljust(4) for s in sentence) 478 | 479 | # Y_true 480 | # y_true = sequence[start_index + 1 : start_index + sentence_length + 1] 481 | # print "y_true: " + ' '.join(str(s).ljust(4) for s in y_true) 482 | y_next_true = sequence[start_index + sentence_length] 483 | 484 | seed = np.zeros((1, sentence_length, input_len)) 485 | # format input 486 | for t in range(0, sentence_length): 487 | seed[0, t, X[t]] = 1 488 | 489 | # get predictionsverbose = 0, no logging 490 | predictions = analyzer.model.predict(seed, verbose=0)[0] 491 | 492 | # y_predicted 493 | y_next_pred = 0 494 | next_prob = 0 495 | if mapping == 'o2o': 496 | next_prob = predictions[y_next_true] 497 | prob[start_index + sentence_length] = next_prob 498 | y_next_pred = np.argmax(predictions) 499 | elif mapping == 'm2m': 500 | # next_sentence = [] 501 | # for pred in predictions: 502 | # next_sentence.append(np.argmax(pred)) 503 | # y_next_pred = next_sentence[-1] 504 | # print "y_pred: " + ' '.join(str(id_).ljust(4) 505 | # for id_ in next_sentence) 506 | y_next_pred = np.argmax(predictions[-1]) 507 | next_prob = predictions[-1][y_next_true] 508 | prob[start_index + sentence_length] = next_prob 509 | 510 | print start_index, next_prob 511 | except KeyboardInterrupt: 512 | # print " |-Write the clusters into %s ..." %self.cluster_file 513 | with open('prob.txt', 'w') as prob_file: 514 | for p in prob: 515 | prob_file.write(str(p) + '\n') 516 | 517 | plt.plot(prob, 'r*') 518 | plt.xlim(0, 1000) 519 | plt.ylim(0, 1) 520 | plt.savefig("prob.png") 521 | plt.clf() 522 | plt.cla() 523 | 524 | stop_time = time.time() 525 | print "--- %s seconds ---\n" % (stop_time - start_time) 526 | 527 | return prob 528 | 529 | 530 | def run(hidden_len=512, batch_size=128, nb_epoch=50, nb_iterations=4, 531 | learning_rate=0.001, nb_predictions=20, mapping='m2m', 532 | sentence_length=80, step=80, mode='train'): 533 | """ 534 | Train, evaluate, or predict. 535 | 536 | Arguments: 537 | hidden_len: {integer}, the size of a hidden layer. 538 | batch_size: {interger}, the number of sentences per batch. 539 | nb_epoch: {interger}, number of epoches per iteration. 540 | nb_iterations: {integer}, number of iterations. 541 | learning_rate: {float}, learning rate. 542 | nb_predictions: {integer}, number of the ids predicted. 543 | mapping: {string}, input to output mapping. 544 | 'o2o': one-to-one 545 | 'm2m': many-to-many 546 | sentence_length: {integer}, the length of each training sentence. 547 | step: {integer}, the sample steps. 
548 | mode: {string}, th running mode of this programm 549 | 'train': train and predict 550 | 'predict': only predict by loading existing model weights 551 | 'evaluate': evaluate the model in evaluation data set 552 | 'detect': detect a new log sequence for the probabilities 553 | """ 554 | # get parameters and dimensions of the model 555 | print "Loading training data..." 556 | train_sequence, input_len1 = get_sequence("./train_data/*") 557 | print "Loading validation data..." 558 | val_sequence, input_len2 = get_sequence("./validation_data/*") 559 | input_len = max(input_len1, input_len2) 560 | 561 | print "Training sequence length: %d" %len(train_sequence) 562 | print "Validation sequence length: %d" %len(val_sequence) 563 | print "#classes: %d\n" %input_len 564 | 565 | # two layered LSTM 512 hidden nodes and a dropout rate of 0.2 566 | brnn = SequenceAnalyzer(sentence_length, input_len, hidden_len, input_len) 567 | 568 | # build model 569 | brnn.build(layer='LSTM', mapping=mapping, learning_rate=learning_rate, 570 | nb_layers=2, dropout=0.2) 571 | 572 | # plot model 573 | # brnn.plot_model() 574 | 575 | # load the previous model weights 576 | # brnn.load_model("weightsf4-61.hdf5") 577 | 578 | if mode == 'predict': 579 | print "Predict..." 580 | predict(val_sequence, input_len, brnn, nb_predictions=nb_predictions, 581 | mapping=mapping, sentence_length=sentence_length) 582 | elif mode == 'evaluate': 583 | print "Evaluate..." 584 | print "Metrics: " + ', '.join(brnn.model.metrics_names) 585 | X_val, y_val = get_data(val_sequence, input_len, mapping=mapping, 586 | sentence_length=sentence_length, step=step, 587 | random_offset=False) 588 | results = brnn.model.evaluate(X_val, y_val, #pylint: disable=W0612 589 | batch_size=batch_size, 590 | verbose=1) 591 | print "Loss: ", results[0] 592 | print "Accuracy: ", results[1] 593 | elif mode == 'train': 594 | print "Train..." 595 | try: 596 | train(brnn, train_sequence, val_sequence, input_len, 597 | batch_size=batch_size, nb_epoch=nb_epoch, 598 | nb_iterations=nb_iterations, 599 | sentence_length=sentence_length, 600 | step=step, mapping=mapping) 601 | except KeyboardInterrupt: 602 | brnn.save_model("weights-stop.hdf5") 603 | elif mode == 'detect': 604 | print "Detect..." 605 | detect(val_sequence, input_len, brnn, mapping=mapping, 606 | sentence_length=sentence_length) 607 | else: 608 | print "The mode = %s is not correct!!!" %mode 609 | 610 | return mode 611 | 612 | 613 | if __name__ == '__main__': 614 | run() 615 | -------------------------------------------------------------------------------- /others/brnn_sequence_analyzer_gen.py: -------------------------------------------------------------------------------- 1 | """ 2 | This program analyze the integer sequence using Bi-diractional Recurrent Neural 3 | Network (BRNN) with Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) 4 | based on the python library Keras. 5 | 6 | Input data is Generator and the training is by calling model.fit_generator(). 7 | 8 | "Keras is a minimalist, highly modular neural networks library, written in 9 | Python and capable of running on top of either TensorFlow or Theano." 
10 | ---- Keras (http://keras.io/) 11 | 12 | It is based on this Keras example - imdb_bidirectional_lstm.py: 13 | https://github.com/fchollet/keras/blob/master/examples/imdb_bidirectional_lstm.py 14 | 15 | Author: Chang Liu (fluency03) 16 | Data: 2016-04-03 17 | """ 18 | 19 | import glob 20 | # import os 21 | import sys 22 | import csv 23 | import time 24 | import matplotlib.pyplot as plt 25 | import numpy as np 26 | 27 | from keras.callbacks import Callback, ModelCheckpoint 28 | from keras.layers import Input, Dense, Dropout, LSTM, GRU, merge 29 | from keras.layers.wrappers import TimeDistributed 30 | from keras.models import Model 31 | from keras.optimizers import RMSprop # pylint: disable=W0611 32 | from keras.utils.visualize_util import plot 33 | 34 | 35 | # random number generator with a fixed value for reproducibility 36 | np.random.seed(1337) 37 | 38 | 39 | def override(f): 40 | """ 41 | Override decorator. 42 | """ 43 | return f 44 | 45 | 46 | class SequenceAnalyzer(object): 47 | """ 48 | Sequence analyzer based on RNN Graph model. 49 | """ 50 | def __init__(self, sentence_length, input_len, hidden_len, output_len): 51 | self.sentence_length = sentence_length 52 | self.input_len = input_len 53 | self.hidden_len = hidden_len 54 | self.output_len = output_len 55 | self.model = None 56 | 57 | def build(self, layer='LSTM', mapping='m2m', learning_rate=0.001, 58 | nb_layers=2, dropout=0.2): 59 | """ 60 | Bidirectional RNN with specified dropout rate (default 0.2), built with 61 | softmax activation, cross entropy loss and rmsprop optimizer. 62 | 63 | Arguments: 64 | layer: {string}, the type of the layers in the RNN Model. 65 | 'LSTM': LSTM layers 66 | 'GRU': GRU layers 67 | mapping: {string}, input to output mapping. 68 | 'o2o': one-to-one 69 | 'm2m': many-to-many 70 | learning_rate: {float}, learning rate. 71 | nb_layers: {integer}, number of layers in total. 72 | dropout: {float}, dropout value. 73 | """ 74 | print "Building Model..." 75 | print (" layer = %d-%s , mapping = %s , learning rate = %.5f " 76 | "nb_layers = %d , dropout = %.2f" 77 | %(self.hidden_len, layer, mapping, learning_rate, 78 | nb_layers, dropout)) 79 | 80 | # check the layer type: LSTM or GRU 81 | if layer == 'LSTM': 82 | class LAYER(LSTM): 83 | """ 84 | LAYER as LSTM. 85 | """ 86 | pass 87 | elif layer == 'GRU': 88 | class LAYER(GRU): 89 | """ 90 | LAYER as GRU. 
91 | """ 92 | pass 93 | 94 | # check whether return sequence for each of the layers 95 | return_sequences = [] 96 | if mapping == 'o2o': 97 | # if mapping is one-to-one 98 | for nl in range(nb_layers): 99 | if nl == nb_layers-1: 100 | return_sequences.append(False) 101 | else: 102 | return_sequences.append(True) 103 | elif mapping == 'm2m': 104 | # if mapping is many-to-many 105 | for _ in range(nb_layers): 106 | return_sequences.append(True) 107 | 108 | # add input 109 | input_layer = Input(shape=(self.sentence_length, self.input_len), 110 | dtype='float32') 111 | 112 | # first Bi-directional LSTM layer 113 | forward1 = LAYER(self.hidden_len, 114 | return_sequences=return_sequences[0])(input_layer) 115 | forward_dropout1 = Dropout(dropout)(forward1) # pylint: disable=W0612 116 | backward1 = LAYER(self.hidden_len, 117 | return_sequences=return_sequences[0], 118 | go_backwards=True)(input_layer) 119 | backward_dropout1 = Dropout(dropout)(backward1) # pylint: disable=W0612 120 | 121 | # following Bi-directional layers 122 | for nl in range(nb_layers-1): 123 | exec("%s = LAYER(self.hidden_len, return_sequences=%s)(%s)" 124 | %('forward' + str(nl+2), 125 | return_sequences[nl+1], 126 | 'forward_dropout' + str(nl+1))) 127 | exec("%s = Dropout(dropout)(%s)" 128 | %('forward_dropout' + str(nl+2), 129 | 'forward' + str(nl+2))) 130 | exec(("%s = LAYER(self.hidden_len, return_sequences=%s, " 131 | "go_backwards=True)(%s)") 132 | %('backward' + str(nl+2), 133 | return_sequences[nl+1], 134 | 'backward_dropout' + str(nl+1))) 135 | exec("%s = Dropout(dropout)(%s)" 136 | %('backward_dropout' + str(nl+2), 137 | 'backward' + str(nl+2))) 138 | 139 | merged_layer = merge([locals()['forward_dropout' + str(nb_layers)], 140 | locals()['backward_dropout' + str(nb_layers)]], 141 | mode='concat', concat_axis=-1) 142 | 143 | if mapping == 'o2o': 144 | output_layer = Dense(self.output_len, 145 | activation='softmax')(merged_layer) 146 | elif mapping == 'm2m': 147 | output_layer = TimeDistributed( 148 | Dense(self.output_len, activation='softmax'))(merged_layer) 149 | 150 | # add ouput 151 | self.model = Model(input=input_layer, output=output_layer) 152 | 153 | rms = RMSprop(lr=learning_rate) 154 | # try using different optimizers and different optimizer configs 155 | self.model.compile(loss='categorical_crossentropy', 156 | optimizer=rms, 157 | metrics=['accuracy']) 158 | 159 | def save_model(self, filename, overwrite=False): 160 | """ 161 | Save the model weight into a hdf5 file. 162 | 163 | Arguments: 164 | filename: {string}, the name/path to the file 165 | to which the weights are going to be saved. 166 | overwrite: {bool}, overwrite existing file. 167 | """ 168 | print "Save Weights %s ..." %filename 169 | self.model.save_weights(filename, overwrite=overwrite) 170 | 171 | def load_model(self, filename): 172 | """ 173 | Load the model weight into a hdf5 file. 174 | 175 | Arguments: 176 | filename: {string}, the name/path to the file 177 | to which the weights are going to be loaded. 178 | """ 179 | print "Load Weights %s ..." %filename 180 | self.model.load_weights(filename) 181 | 182 | def plot_model(self, filename='brnn_model.png'): 183 | """ 184 | Plot model. 185 | 186 | Arguments: 187 | filename: {string}, the name/path to the file 188 | to which the weights are going to be plotted. 189 | """ 190 | print "Plot Model %s ..." %filename 191 | plot(self.model, to_file=filename) 192 | 193 | 194 | class History(Callback): 195 | """ 196 | Record the loss and accuracy history. 
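
    After every epoch the four metrics are appended as one row to
    'history.csv'. A minimal sketch for inspecting that file afterwards
    (assumes the file exists and contains at least one epoch):

        import numpy as np
        hist = np.loadtxt('history.csv', delimiter=',', ndmin=2)
        # columns: loss, acc, val_loss, val_acc
        print hist[:, 0]   # training loss per epoch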
197 | """ 198 | @override 199 | def on_train_begin(self, logs={}): # pylint: disable=W0102 200 | """ 201 | A method starting at the begining of the training. 202 | 203 | Arguments: 204 | logs: {dictionary}, recording the training and validation 205 | losses and accuracy of every epoch. 206 | """ 207 | # training loss and accuracy 208 | self.train_losses = [] 209 | self.train_acc = [] 210 | # validation loss and accuracy 211 | self.val_losses = [] 212 | self.val_acc = [] 213 | 214 | @override 215 | def on_epoch_end(self, epoch, logs={}): # pylint: disable=W0102 216 | """ 217 | A method starting at the begining of the training. 218 | 219 | Arguments: 220 | epoch: {integer}, the current epoch. 221 | logs: {dictionary}, recording the training and validation 222 | losses and accuracy of every epoch. 223 | """ 224 | # record training loss and accuracy 225 | self.train_losses.append(logs.get('loss')) 226 | self.train_acc.append(logs.get('acc')) 227 | # record validation loss and accuracy 228 | self.val_losses.append(logs.get('val_loss')) 229 | self.val_acc.append(logs.get('val_acc')) 230 | 231 | # continutously save the train_loss, train_acc, val_loss, val_acc 232 | # into a csv file with 4 columns respeactively 233 | csv_name = 'history.csv' 234 | with open(csv_name, 'a') as csvfile: 235 | his_writer = csv.writer(csvfile) 236 | print "\n Save loss and accuracy into %s" %csv_name 237 | his_writer.writerow((logs.get('loss'), logs.get('acc'), 238 | logs.get('val_loss'), logs.get('val_acc'))) 239 | 240 | 241 | def sample(prob, temperature=0.2): 242 | """ 243 | Softmax function for reinforcement learning. 244 | 245 | Arguments: 246 | prob: {list}, a list of probabilities of each of the classes. 247 | temperature: {float}, Softmax temperature. 248 | Returns: 249 | {integer}, the most possible sample. 250 | """ 251 | prob = np.log(prob) / temperature 252 | prob = np.exp(prob) / np.sum(np.exp(prob)) 253 | return np.argmax(np.random.multinomial(1, prob, 1)) 254 | 255 | 256 | def get_sequence(filepath): 257 | """ 258 | Get the original sequence from file. 259 | 260 | Arguments: 261 | filename: {string}, the name/path of input log sequence file. 262 | Returns: 263 | {list}, the log sequence. 264 | {integer}, the size of vocabulary. 265 | """ 266 | # read file and convert ids of each line into array of numbers 267 | seqfiles = glob.glob(filepath) 268 | sequence = [] 269 | 270 | for seqfile in seqfiles: 271 | with open(seqfile, 'r') as f: 272 | one_sequence = [int(id_) for id_ in f] 273 | print " %s, sequence length: %d" %(seqfile, 274 | len(one_sequence)) 275 | sequence.extend(one_sequence) 276 | 277 | # add two extra positions for 'unknown-log' and 'no-log' 278 | vocab_size = max(sequence) + 2 279 | 280 | return sequence, vocab_size 281 | 282 | 283 | def data_generator(sequence, vocab_size, mapping='m2m', sentence_length=40, 284 | step=3, random_offset=True, batch_size=128): 285 | """ 286 | Retrieves data from a plain txt file and formats it using one-hot vector. 287 | This method returns a data generator yeilding a batch of (X_train, y_train) 288 | every time being called. 289 | 290 | Arguments: 291 | sequence: {lsit}, the original input sequence 292 | vocab_size: {integer}, the number of unique id classes 293 | mapping: {string}, input to output mapping. 294 | 'o2o': one-to-one 295 | 'm2m': many-to-many 296 | sentence_length: {integer}, the length of each training sentence. 297 | step: {integer}, the sample steps. 298 | random_offset: {bool}, the offset is random between step or is 0. 
299 | batch_size: {integer}, the number of sample per batch. 300 | Yields: 301 | {np.array}, training input data X 302 | {np.array}, training target data y 303 | """ 304 | # the number of current sample 305 | sample_count = 0 306 | 307 | # one-hot vector (all zeros except for a single one at 308 | # the exact postion of this id number) 309 | X_train = np.zeros((batch_size, sentence_length, vocab_size), 310 | dtype=np.bool) 311 | # expected outputs for each sentence 312 | if mapping == 'o2o': 313 | # if mapping is one-to-one 314 | y_train = np.zeros((batch_size, vocab_size), dtype=np.bool) 315 | elif mapping == 'm2m': 316 | # if mapping is many-to-many 317 | y_train = np.zeros((batch_size, sentence_length, vocab_size), 318 | dtype=np.bool) 319 | 320 | # continuousy creat batch data and next sentences 321 | while True: 322 | offset = np.random.randint(0, step) if random_offset else 0 323 | for i in range(offset, len(sequence) - sentence_length, step): 324 | # index of a this sample in this batch 325 | batch_index = sample_count % batch_size 326 | 327 | # re-initialzing the batch 328 | if batch_index == 0: 329 | X_train.fill(0) 330 | y_train.fill(0) 331 | 332 | # current sample and target outputs 333 | X_sentence = [] 334 | y_sentence = [] 335 | next_id = [] 336 | 337 | X_sentence = sequence[i : i + sentence_length] 338 | if mapping == 'o2o': 339 | # if mapping is one-to-one 340 | next_id = sequence[i + sentence_length] 341 | elif mapping == 'm2m': 342 | # if mapping is many-to-many 343 | y_sentence = sequence[i + 1 : i + sentence_length + 1] 344 | 345 | for t, id_ in enumerate(X_sentence): 346 | # mark the each corresponding character in a sentence as 1 347 | X_train[batch_index, t, id_] = 1 348 | # if mapping is many-to-many 349 | if mapping == 'm2m': 350 | y_train[batch_index, t, y_sentence[t]] = 1 351 | # if mapping is one-to-one 352 | # mark the corresponding character in expected output as 1 353 | if mapping == 'o2o': 354 | y_train[batch_index, next_id] = 1 355 | 356 | # sample count plus 1 357 | sample_count += 1 358 | 359 | if batch_index == batch_size-1: 360 | yield X_train, y_train 361 | 362 | 363 | def predict(sequence, input_len, analyzer, nb_predictions=80, 364 | mapping='m2m', sentence_length=40): 365 | """ 366 | Predict the next sequences using existing model and weights given some seed. 367 | 368 | Arguments: 369 | sequence: {lsit}, the original input sequence 370 | input_len: {integer}, the number of unique id classes 371 | analyzer: {SequenceAnalyzer}, the sequence analyzer 372 | nb_predictions: {integer}, number of predictions after giving the seed 373 | mapping: {string}, input to output mapping. 374 | 'o2o': one-to-one 375 | 'm2m': many-to-many 376 | sentence_length: {integer}, the length of each sentence. 
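
    A minimal usage sketch (assumes weights were already trained and saved,
    e.g. as 'weights.hdf5' by the ModelCheckpoint callback used in train()):

        val_sequence, vocab_size = get_sequence("./validation_data/*")
        brnn = SequenceAnalyzer(40, vocab_size, 512, vocab_size)
        brnn.build(layer='LSTM', mapping='m2m')
        brnn.load_model("weights.hdf5")
        predict(val_sequence, vocab_size, brnn, nb_predictions=10,
                mapping='m2m', sentence_length=40)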
377 | """ 378 | # generate elements 379 | for _ in range(nb_predictions): 380 | # start index of the seed, random number in range 381 | start_index = np.random.randint(0, len(sequence) - sentence_length - 1) 382 | # seed sentence 383 | sentence = sequence[start_index : start_index + sentence_length] 384 | 385 | # Y_true 386 | y_true = sequence[start_index + 1 : start_index + sentence_length + 1] 387 | print "X: " + ' '.join(str(s).ljust(4) for s in sentence) 388 | 389 | seed = np.zeros((1, sentence_length, input_len)) 390 | # format input 391 | for t in range(0, sentence_length): 392 | seed[0, t, sentence[t]] = 1 393 | 394 | # get predictions 395 | # verbose = 0, no logging 396 | predictions = analyzer.model.predict(seed, verbose=0)[0] 397 | 398 | # y_predicted 399 | if mapping == 'o2o': 400 | next_id = np.argmax(predictions) 401 | sys.stdout.write(' ' + str(next_id)) 402 | sys.stdout.flush() 403 | elif mapping == 'm2m': 404 | next_sentence = [] 405 | for pred in predictions: 406 | next_sentence.append(np.argmax(pred)) 407 | print "y_pred: " + ' '.join(str(id_).ljust(4) 408 | for id_ in next_sentence) 409 | # next_id = np.argmax(predictions[-1]) 410 | 411 | # y_true 412 | print "y_true: " + ' '.join(str(s).ljust(4) for s in y_true) 413 | 414 | print "\n" 415 | 416 | 417 | def train(analyzer, train_data, nb_training_samples, 418 | val_data, nb_validation_samples, 419 | nb_epoch=50, nb_iterations=4): 420 | """ 421 | Trains the network. 422 | 423 | Arguments: 424 | analyzer: {SequenceAnalyzer}. 425 | train_data: {tuple}, training data (X_train, y_train). 426 | val_data: {tuple}, validation data (X_val, y_val). 427 | nb_training_samples: {integer}, the number training samples. 428 | nb_validation_samples: {integer}, the number validation samples. 429 | nb_iterations: {integer}, number of iterations. 430 | sentence_length: {integer}, the length of each training sentence. 431 | """ 432 | for iteration in range(1, nb_iterations+1): 433 | print "" 434 | print "------------------------ Start Training ------------------------" 435 | print "Iteration: ", iteration 436 | print "Number of epoch per iteration: ", nb_epoch 437 | 438 | # history of losses and accuracy 439 | history = History() 440 | 441 | # saves the model weights after each epoch 442 | # if the validation loss decreased 443 | checkpointer = ModelCheckpoint(filepath="weights.hdf5", 444 | verbose=1, save_best_only=True) 445 | 446 | # train the model with data generator 447 | analyzer.model.fit_generator(train_data, 448 | samples_per_epoch=nb_training_samples, 449 | nb_epoch=nb_epoch, verbose=1, 450 | callbacks=[history, checkpointer], 451 | validation_data=val_data, 452 | nb_val_samples=nb_validation_samples) 453 | 454 | analyzer.save_model("weights-after-iteration.hdf5") 455 | 456 | 457 | def detect(sequence, input_len, analyzer, mapping='m2m', sentence_length=40): 458 | """ 459 | Scan the given sequence for detecting anormalies. 460 | 461 | Arguments: 462 | sequence: {lsit}, the original input sequence 463 | input_len: {integer}, the number of unique id classes 464 | analyzer: {SequenceAnalyzer}, the sequence analyzer 465 | mapping: {string}, input to output mapping. 466 | 'o2o': one-to-one 467 | 'm2m': many-to-many 468 | sentence_length: {integer}, the length of each sentence. 
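    Returns:
        {list}, prob, where prob[i] is the probability the model assigned to
        the true id at position i; the first sentence_length entries are
        fixed to 1 because no prediction is made for them. If the scan is
        interrupted with Ctrl-C, the probabilities collected so far are
        written to 'prob.txt'; in either case a scatter plot of prob is
        saved to 'prob.png'.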
469 | """ 470 | # sequence length 471 | length = len(sequence) 472 | 473 | # predicted probabilities for each id 474 | # we assume the first sentence_length ids are true 475 | prob = [1] * sentence_length + [0] * (length - sentence_length) 476 | 477 | start_time = time.time() 478 | try: 479 | # generate elements 480 | for start_index in xrange(length - sentence_length): 481 | # seed sentence 482 | X = sequence[start_index : start_index + sentence_length] 483 | # print "X: " + ' '.join(str(s).ljust(4) for s in sentence) 484 | 485 | # Y_true 486 | # y_true = sequence[start_index + 1 : start_index + sentence_length + 1] 487 | # print "y_true: " + ' '.join(str(s).ljust(4) for s in y_true) 488 | y_next_true = sequence[start_index + sentence_length] 489 | 490 | seed = np.zeros((1, sentence_length, input_len)) 491 | # format input 492 | for t in range(0, sentence_length): 493 | seed[0, t, X[t]] = 1 494 | 495 | # get predictionsverbose = 0, no logging 496 | predictions = analyzer.model.predict(seed, verbose=0)[0] 497 | 498 | # y_predicted 499 | y_next_pred = 0 500 | next_prob = 0 501 | if mapping == 'o2o': 502 | next_prob = predictions[y_next_true] 503 | prob[start_index + sentence_length] = next_prob 504 | y_next_pred = np.argmax(predictions) 505 | elif mapping == 'm2m': 506 | # next_sentence = [] 507 | # for pred in predictions: 508 | # next_sentence.append(np.argmax(pred)) 509 | # y_next_pred = next_sentence[-1] 510 | # print "y_pred: " + ' '.join(str(id_).ljust(4) 511 | # for id_ in next_sentence) 512 | y_next_pred = np.argmax(predictions[-1]) 513 | next_prob = predictions[-1][y_next_true] 514 | prob[start_index + sentence_length] = next_prob 515 | 516 | print start_index, next_prob 517 | except KeyboardInterrupt: 518 | # print " |-Write the clusters into %s ..." %self.cluster_file 519 | with open('prob.txt', 'w') as prob_file: 520 | for p in prob: 521 | prob_file.write(str(p) + '\n') 522 | 523 | plt.plot(prob, 'r*') 524 | plt.xlim(0, 1000) 525 | plt.ylim(0, 1) 526 | plt.savefig("prob.png") 527 | plt.clf() 528 | plt.cla() 529 | 530 | stop_time = time.time() 531 | print "--- %s seconds ---\n" % (stop_time - start_time) 532 | 533 | return prob 534 | 535 | 536 | def run(hidden_len=512, batch_size=128, nb_batch=200, nb_epoch=50, 537 | nb_iterations=4, learning_rate=0.001, nb_predictions=20, 538 | mapping='m2m', sentence_length=80, step=80, mode='train'): 539 | """ 540 | Train, evaluate, or predict. 541 | 542 | Arguments: 543 | hidden_len: {integer}, the size of a hidden layer. 544 | batch_size: {interger}, the number of sentences per batch. 545 | nb_batch: {integer}, number of batches to be trained durign each epoch. 546 | nb_epoch: {interger}, number of epoches per iteration. 547 | nb_iterations: {integer}, number of iterations. 548 | learning_rate: {float}, learning rate. 549 | nb_predictions: {integer}, number of the ids predicted. 550 | mapping: {string}, input to output mapping. 551 | 'o2o': one-to-one 552 | 'm2m': many-to-many 553 | sentence_length: {integer}, the length of each training sentence. 554 | step: {integer}, the sample steps. 555 | mode: {string}, th running mode of this programm 556 | 'train': train and predict 557 | 'predict': only predict by loading existing model weights 558 | 'evaluate': evaluate the model in evaluation data set 559 | 'detect': detect a new log sequence for the probabilities 560 | """ 561 | # get parameters and dimensions of the model 562 | print "Loading training data..." 
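    # get_sequence() reads the plain-text files matched by the glob pattern,
    # one integer id per line, and concatenates them into a single sequence.
    # A file under ./train_data/ might, for example, look like (hypothetical):
    #     12
    #     7
    #     3
    # The reported vocabulary size is max(id) + 2, which leaves two extra
    # slots for the 'unknown-log' and 'no-log' ids.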
563 |     train_sequence, input_len1 = get_sequence("./train_data/*")
564 |     print "Loading validation data..."
565 |     val_sequence, input_len2 = get_sequence("./validation_data/*")
566 |     input_len = max(input_len1, input_len2)
567 | 
568 |     print "Training sequence length: %d" %len(train_sequence)
569 |     print "Validation sequence length: %d" %len(val_sequence)
570 |     print "#classes: %d\n" %input_len
571 | 
572 |     # data generator of X_train and y_train, with random offset
573 |     train_data = data_generator(train_sequence, input_len, mapping=mapping,
574 |                                 sentence_length=sentence_length, step=step,
575 |                                 random_offset=True, batch_size=batch_size)
576 | 
577 |     # data generator of X_val and y_val, with random offset
578 |     val_data = data_generator(val_sequence, input_len, mapping=mapping,
579 |                               sentence_length=sentence_length, step=step,
580 |                               random_offset=True, batch_size=batch_size)
581 | 
582 |     # two-layer LSTM with 512 hidden nodes and a dropout rate of 0.2
583 |     brnn = SequenceAnalyzer(sentence_length, input_len, hidden_len, input_len)
584 | 
585 |     # build model
586 |     brnn.build(layer='LSTM', mapping=mapping, learning_rate=learning_rate,
587 |                nb_layers=2, dropout=0.2)
588 | 
589 |     # plot model
590 |     # brnn.plot_model()
591 | 
592 |     # load the previous model weights
593 |     # brnn.load_model("weightsf4-61.hdf5")
594 | 
595 |     if mode == 'predict':
596 |         print "Predict..."
597 |         predict(val_sequence, input_len, brnn, nb_predictions=nb_predictions,
598 |                 mapping=mapping, sentence_length=sentence_length)
599 |     elif mode == 'evaluate':
600 |         print "Evaluate..."
601 |         print "Metrics: " + ', '.join(brnn.model.metrics_names)
602 |         X_val, y_val = next(data_generator(
603 |             val_sequence, input_len, mapping=mapping, step=step,
604 |             sentence_length=sentence_length, random_offset=False,
605 |             batch_size=batch_size))  # evaluate on a single generated batch
606 |         results = brnn.model.evaluate(X_val, y_val, #pylint: disable=W0612
607 |                                       batch_size=batch_size,
608 |                                       verbose=1)
609 |         print "Loss: ", results[0]
610 |         print "Accuracy: ", results[1]
611 |     elif mode == 'train':
612 |         print "Train..."
613 |         # number of training samples and validation samples
614 |         nb_training_samples = batch_size * nb_batch
615 |         nb_validation_samples = int(nb_training_samples * 0.05)
616 | 
617 |         try:
618 |             train(brnn, train_data, nb_training_samples,
619 |                   val_data, nb_validation_samples,
620 |                   nb_epoch=nb_epoch, nb_iterations=nb_iterations)
621 |         except KeyboardInterrupt:
622 |             brnn.save_model("weights-stop.hdf5")
623 |     elif mode == 'detect':
624 |         print "Detect..."
625 |         detect(val_sequence, input_len, brnn, mapping=mapping,
626 |                sentence_length=sentence_length)
627 |     else:
628 |         print "The mode = %s is not supported!" %mode
629 | 
630 |     return mode
631 | 
632 | 
633 | if __name__ == '__main__':
634 |     run()
635 | 
--------------------------------------------------------------------------------
/others/rnn_sequence_analyzer_gen.py:
--------------------------------------------------------------------------------
1 | """
2 | This program analyzes an integer sequence using a Uni-directional Recurrent
3 | Neural Network (RNN) with Long Short-Term Memory (LSTM) or Gated Recurrent
4 | Unit (GRU) layers, based on the Python library Keras.
5 | 
6 | Input data comes from a generator; training is done via model.fit_generator().
7 | 
8 | "Keras is a minimalist, highly modular neural networks library, written in
9 | Python and capable of running on top of either TensorFlow or Theano."
10 | ---- Keras (http://keras.io/) 11 | 12 | It is based on this Keras example - lstm_text_generation: 13 | https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py 14 | 15 | Author: Chang Liu (fluency03) 16 | Data: 2016-04-01 17 | """ 18 | 19 | import glob 20 | # import os 21 | import sys 22 | import csv 23 | import time 24 | import matplotlib.pyplot as plt 25 | import numpy as np 26 | 27 | from keras.callbacks import Callback, ModelCheckpoint 28 | from keras.layers import Activation, Dense, Dropout, LSTM, GRU 29 | from keras.layers.wrappers import TimeDistributed 30 | from keras.models import Sequential 31 | from keras.optimizers import RMSprop # pylint: disable=W0611 32 | from keras.utils.visualize_util import plot 33 | 34 | 35 | # random number generator with a fixed value for reproducibility 36 | np.random.seed(1337) 37 | 38 | 39 | def override(f): 40 | """ 41 | Override decorator. 42 | """ 43 | return f 44 | 45 | 46 | class SequenceAnalyzer(object): 47 | """ 48 | Sequence analyzer based on RNN Sequential Model. 49 | """ 50 | def __init__(self, sentence_length, input_len, hidden_len, output_len): 51 | self.sentence_length = sentence_length 52 | self.input_len = input_len 53 | self.hidden_len = hidden_len 54 | self.output_len = output_len 55 | self.model = Sequential() 56 | 57 | def build(self, layer='LSTM', mapping='m2m', learning_rate=0.001, 58 | nb_layers=2, dropout=0.2): 59 | """ 60 | Stacked RNN with specified dropout rate (default 0.2), built with 61 | softmax activation, cross entropy loss and rmsprop optimizer. 62 | 63 | Arguments: 64 | layer: {string}, the type of the layers in the RNN Model. 65 | 'LSTM': LSTM layers 66 | 'GRU': GRU layers 67 | mapping: {string}, input to output mapping. 68 | 'o2o': one-to-one 69 | 'm2m': many-to-many 70 | learning_rate: {float}, learning rate. 71 | nb_layers: {integer}, number of layers in total. 72 | dropout: {float}, dropout value. 73 | """ 74 | print "Building Model..." 75 | print (" layer = %d-%s , mapping = %s , learning rate = %.5f " 76 | "nb_layers = %d , dropout = %.2f" 77 | %(self.hidden_len, layer, mapping, learning_rate, 78 | nb_layers, dropout)) 79 | 80 | # check the layer type: LSTM or GRU 81 | if layer == 'LSTM': 82 | class LAYER(LSTM): 83 | """ 84 | LAYER as LSTM. 85 | """ 86 | pass 87 | elif layer == 'GRU': 88 | class LAYER(GRU): 89 | """ 90 | LAYER as GRU. 91 | """ 92 | pass 93 | 94 | # check whether return sequence for each of the layers 95 | return_sequences = [] 96 | if mapping == 'o2o': 97 | # if mapping is one-to-one 98 | for nl in range(nb_layers): 99 | if nl == nb_layers-1: 100 | return_sequences.append(False) 101 | else: 102 | return_sequences.append(True) 103 | elif mapping == 'm2m': 104 | # if mapping is many-to-many 105 | for _ in range(nb_layers): 106 | return_sequences.append(True) 107 | 108 | # first layer RNN with specified number of nodes in the hidden layer. 
109 | self.model.add(LAYER(self.hidden_len, 110 | return_sequences=return_sequences[0], 111 | input_shape=(self.sentence_length, 112 | self.input_len))) 113 | self.model.add(Dropout(dropout)) 114 | 115 | # the following layers 116 | for nl in range(nb_layers-1): 117 | self.model.add(LAYER(self.hidden_len, 118 | return_sequences=return_sequences[nl+1])) 119 | self.model.add(Dropout(dropout)) 120 | 121 | if mapping == 'o2o': 122 | # if mapping is one-to-one 123 | self.model.add(Dense(self.output_len)) 124 | elif mapping == 'm2m': 125 | # if mapping is many-to-many 126 | self.model.add(TimeDistributed(Dense(self.output_len))) 127 | 128 | self.model.add(Activation('softmax')) 129 | 130 | rms = RMSprop(lr=learning_rate) 131 | self.model.compile(loss='categorical_crossentropy', 132 | optimizer=rms, 133 | metrics=['accuracy']) 134 | 135 | def save_model(self, filename, overwrite=False): 136 | """ 137 | Save the model weight into a hdf5 file. 138 | 139 | Arguments: 140 | filename: {string}, the name/path to the file 141 | to which the weights are going to be saved. 142 | overwrite: {bool}, overwrite existing file. 143 | """ 144 | print "Save Weights %s ..." %filename 145 | self.model.save_weights(filename, overwrite=overwrite) 146 | 147 | def load_model(self, filename): 148 | """ 149 | Load the model weight into a hdf5 file. 150 | 151 | Arguments: 152 | filename: {string}, the name/path to the file 153 | to which the weights are going to be loaded. 154 | """ 155 | print "Load Weights %s ..." %filename 156 | self.model.load_weights(filename) 157 | 158 | def plot_model(self, filename='rnn_model.png'): 159 | """ 160 | Plot model. 161 | 162 | Arguments: 163 | filename: {string}, the name/path to the file 164 | to which the weights are going to be plotted. 165 | """ 166 | print "Plot Model %s ..." %filename 167 | plot(self.model, to_file=filename) 168 | 169 | 170 | class History(Callback): 171 | """ 172 | Record the loss and accuracy history. 173 | """ 174 | @override 175 | def on_train_begin(self, logs={}): # pylint: disable=W0102 176 | """ 177 | A method starting at the begining of the training. 178 | 179 | Arguments: 180 | logs: {dictionary}, recording the training and validation 181 | losses and accuracy of every epoch. 182 | """ 183 | # training loss and accuracy 184 | self.train_losses = [] 185 | self.train_acc = [] 186 | # validation loss and accuracy 187 | self.val_losses = [] 188 | self.val_acc = [] 189 | 190 | @override 191 | def on_epoch_end(self, epoch, logs={}): # pylint: disable=W0102 192 | """ 193 | A method starting at the begining of the training. 194 | 195 | Arguments: 196 | epoch: {integer}, the current epoch. 197 | logs: {dictionary}, recording the training and validation 198 | losses and accuracy of every epoch. 
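
        After every epoch these values are also appended as one row to
        'history.csv', in the column order: loss, acc, val_loss, val_acc.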
199 | """ 200 | # record training loss and accuracy 201 | self.train_losses.append(logs.get('loss')) 202 | self.train_acc.append(logs.get('acc')) 203 | # record validation loss and accuracy 204 | self.val_losses.append(logs.get('val_loss')) 205 | self.val_acc.append(logs.get('val_acc')) 206 | 207 | # continutously save the train_loss, train_acc, val_loss, val_acc 208 | # into a csv file with 4 columns respeactively 209 | csv_name = 'history.csv' 210 | with open(csv_name, 'a') as csvfile: 211 | his_writer = csv.writer(csvfile) 212 | print "\n Save loss and accuracy into %s" %csv_name 213 | his_writer.writerow((logs.get('loss'), logs.get('acc'), 214 | logs.get('val_loss'), logs.get('val_acc'))) 215 | 216 | 217 | def sample(prob, temperature=0.2): 218 | """ 219 | Softmax function for reinforcement learning. 220 | 221 | Arguments: 222 | prob: {list}, a list of probabilities of each of the classes. 223 | temperature: {float}, Softmax temperature. 224 | Returns: 225 | {integer}, the most possible sample. 226 | """ 227 | prob = np.log(prob) / temperature 228 | prob = np.exp(prob) / np.sum(np.exp(prob)) 229 | return np.argmax(np.random.multinomial(1, prob, 1)) 230 | 231 | 232 | def get_sequence(filepath): 233 | """ 234 | Get the original sequence from file. 235 | 236 | Arguments: 237 | filename: {string}, the name/path of input log sequence file. 238 | Returns: 239 | {list}, the log sequence. 240 | {integer}, the size of vocabulary. 241 | """ 242 | # read file and convert ids of each line into array of numbers 243 | seqfiles = glob.glob(filepath) 244 | sequence = [] 245 | 246 | for seqfile in seqfiles: 247 | with open(seqfile, 'r') as f: 248 | one_sequence = [int(id_) for id_ in f] 249 | print " %s, sequence length: %d" %(seqfile, 250 | len(one_sequence)) 251 | sequence.extend(one_sequence) 252 | 253 | # add two extra positions for 'unknown-log' and 'no-log' 254 | vocab_size = max(sequence) + 2 255 | 256 | return sequence, vocab_size 257 | 258 | 259 | def data_generator(sequence, vocab_size, mapping='m2m', sentence_length=40, 260 | step=3, random_offset=True, batch_size=64): 261 | """ 262 | Retrieves data from a plain txt file and formats it using one-hot vector. 263 | This method returns a data generator yeilding a batch of (X_train, y_train) 264 | every time being called. 265 | 266 | Arguments: 267 | sequence: {lsit}, the original input sequence 268 | vocab_size: {integer}, the number of unique id classes 269 | mapping: {string}, input to output mapping. 270 | 'o2o': one-to-one 271 | 'm2m': many-to-many 272 | sentence_length: {integer}, the length of each training sentence. 273 | step: {integer}, the sample steps. 274 | random_offset: {bool}, the offset is random between step or is 0. 275 | batch_size: {integer}, the number of sample per batch. 
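
    The generator loops over the sequence indefinitely. When random_offset
    is True, a fresh start offset in [0, step) is drawn at the beginning of
    every pass, and a batch is yielded each time batch_size samples have
    been filled in.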
276 | Yields: 277 | {np.array}, training input data X 278 | {np.array}, training target data y 279 | """ 280 | # the number of current sample 281 | sample_count = 0 282 | 283 | # one-hot vector (all zeros except for a single one at 284 | # the exact postion of this id number) 285 | X_train = np.zeros((batch_size, sentence_length, vocab_size), dtype=np.bool) 286 | 287 | # expected outputs for each sentence 288 | if mapping == 'o2o': 289 | # if mapping is one-to-one 290 | y_train = np.zeros((batch_size, vocab_size), dtype=np.bool) 291 | elif mapping == 'm2m': 292 | # if mapping is many-to-many 293 | y_train = np.zeros((batch_size, sentence_length, vocab_size), 294 | dtype=np.bool) 295 | 296 | # continuousy creat batch data and next sentences 297 | while True: 298 | offset = np.random.randint(0, step) if random_offset else 0 299 | for i in range(offset, len(sequence) - sentence_length, step): 300 | # index of a this sample in this batch 301 | batch_index = sample_count % batch_size 302 | # print sample_count 303 | # print batch_index 304 | 305 | # re-initialzing the batch 306 | if batch_index == 0: 307 | X_train.fill(0) 308 | y_train.fill(0) 309 | 310 | # current sample and target outputs 311 | X_sentence = [] 312 | y_sentence = [] 313 | next_id = [] 314 | 315 | X_sentence = sequence[i : i + sentence_length] 316 | if mapping == 'o2o': 317 | # if mapping is one-to-one 318 | next_id = sequence[i + sentence_length] 319 | elif mapping == 'm2m': 320 | # if mapping is many-to-many 321 | y_sentence = sequence[i + 1 : i + sentence_length + 1] 322 | 323 | for t, id_ in enumerate(X_sentence): 324 | # mark the each corresponding character in a sentence as 1 325 | X_train[batch_index, t, id_] = 1 326 | # if mapping is many-to-many 327 | if mapping == 'm2m': 328 | y_train[batch_index, t, y_sentence[t]] = 1 329 | # if mapping is one-to-one 330 | # mark the corresponding character in expected output as 1 331 | if mapping == 'o2o': 332 | y_train[batch_index, next_id] = 1 333 | 334 | # sample count plus 1 335 | sample_count += 1 336 | 337 | if batch_index == batch_size-1: 338 | yield X_train, y_train 339 | 340 | 341 | def predict(sequence, input_len, analyzer, nb_predictions=80, 342 | mapping='m2m', sentence_length=40): 343 | """ 344 | Predict the next sequences using existing model and weights given some seed. 345 | 346 | Arguments: 347 | sequence: {lsit}, the original input sequence 348 | input_len: {integer}, the number of unique id classes 349 | analyzer: {SequenceAnalyzer}, the sequence analyzer 350 | nb_predictions: {integer}, number of predictions after giving the seed 351 | mapping: {string}, input to output mapping. 352 | 'o2o': one-to-one 353 | 'm2m': many-to-many 354 | sentence_length: {integer}, the length of each sentence. 
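
    With mapping='m2m' the model returns one probability distribution per
    time step, so a whole predicted sentence is printed and compared with
    y_true; with mapping='o2o' it returns a single distribution over the
    next id and only that id is printed.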
355 | """ 356 | # generate elements 357 | for _ in range(nb_predictions): 358 | # start index of the seed, random number in range 359 | start_index = np.random.randint(0, len(sequence) - sentence_length - 1) 360 | # seed sentence 361 | sentence = sequence[start_index : start_index + sentence_length] 362 | 363 | # Y_true 364 | y_true = sequence[start_index + 1 : start_index + sentence_length + 1] 365 | print "X: " + ' '.join(str(s).ljust(4) for s in sentence) 366 | 367 | seed = np.zeros((1, sentence_length, input_len)) 368 | # format input 369 | for t in range(0, sentence_length): 370 | seed[0, t, sentence[t]] = 1 371 | 372 | # get predictions 373 | # verbose = 0, no logging 374 | predictions = analyzer.model.predict(seed, verbose=0)[0] 375 | 376 | # y_predicted 377 | if mapping == 'o2o': 378 | next_id = np.argmax(predictions) 379 | sys.stdout.write(' ' + str(next_id)) 380 | sys.stdout.flush() 381 | elif mapping == 'm2m': 382 | next_sentence = [] 383 | for pred in predictions: 384 | next_sentence.append(np.argmax(pred)) 385 | print "y_pred: " + ' '.join(str(id_).ljust(4) 386 | for id_ in next_sentence) 387 | # next_id = np.argmax(predictions[-1]) 388 | 389 | # y_true 390 | print "y_true: " + ' '.join(str(s).ljust(4) for s in y_true) 391 | 392 | print "\n" 393 | 394 | 395 | def train(analyzer, train_data, nb_training_samples, 396 | val_data, nb_validation_samples, 397 | nb_epoch=50, nb_iterations=4): 398 | """ 399 | Trains the network. 400 | 401 | Arguments: 402 | analyzer: {SequenceAnalyzer}. 403 | train_data: {tuple}, training data (X_train, y_train). 404 | val_data: {tuple}, validation data (X_val, y_val). 405 | nb_training_samples: {integer}, the number training samples. 406 | nb_validation_samples: {integer}, the number validation samples. 407 | nb_iterations: {integer}, number of iterations. 408 | sentence_length: {integer}, the length of each training sentence. 409 | """ 410 | for iteration in range(1, nb_iterations+1): 411 | print "" 412 | print "------------------------ Start Training ------------------------" 413 | print "Iteration: ", iteration 414 | print "Number of epoch per iteration: ", nb_epoch 415 | 416 | # history of losses and accuracy 417 | history = History() 418 | 419 | # saves the model weights after each epoch 420 | # if the validation loss decreased 421 | checkpointer = ModelCheckpoint(filepath="weights.hdf5", 422 | verbose=1, save_best_only=True) 423 | 424 | # train the model with data generator 425 | analyzer.model.fit_generator(train_data, 426 | samples_per_epoch=nb_training_samples, 427 | nb_epoch=nb_epoch, verbose=1, 428 | callbacks=[history, checkpointer], 429 | validation_data=val_data, 430 | nb_val_samples=nb_validation_samples) 431 | 432 | analyzer.save_model("weights-after-iteration.hdf5") 433 | 434 | 435 | def detect(sequence, input_len, analyzer, mapping='m2m', sentence_length=40): 436 | """ 437 | Scan the given sequence for detecting anormalies. 438 | 439 | Arguments: 440 | sequence: {lsit}, the original input sequence 441 | input_len: {integer}, the number of unique id classes 442 | analyzer: {SequenceAnalyzer}, the sequence analyzer 443 | mapping: {string}, input to output mapping. 444 | 'o2o': one-to-one 445 | 'm2m': many-to-many 446 | sentence_length: {integer}, the length of each sentence. 
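
    Note: the scan issues one model.predict() call per position, i.e.
    len(sequence) - sentence_length forward passes in total, so it can be
    slow on long sequences.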
447 | """ 448 | # sequence length 449 | length = len(sequence) 450 | 451 | # predicted probabilities for each id 452 | # we assume the first sentence_length ids are true 453 | prob = [1] * sentence_length + [0] * (length - sentence_length) 454 | 455 | start_time = time.time() 456 | try: 457 | # generate elements 458 | for start_index in xrange(length - sentence_length): 459 | # seed sentence 460 | X = sequence[start_index : start_index + sentence_length] 461 | # print "X: " + ' '.join(str(s).ljust(4) for s in sentence) 462 | 463 | # Y_true 464 | # y_true = sequence[start_index + 1 : start_index + sentence_length + 1] 465 | # print "y_true: " + ' '.join(str(s).ljust(4) for s in y_true) 466 | y_next_true = sequence[start_index + sentence_length] 467 | 468 | seed = np.zeros((1, sentence_length, input_len)) 469 | # format input 470 | for t in range(0, sentence_length): 471 | seed[0, t, X[t]] = 1 472 | 473 | # get predictionsverbose = 0, no logging 474 | predictions = analyzer.model.predict(seed, verbose=0)[0] 475 | 476 | # y_predicted 477 | y_next_pred = 0 478 | next_prob = 0 479 | if mapping == 'o2o': 480 | next_prob = predictions[y_next_true] 481 | prob[start_index + sentence_length] = next_prob 482 | y_next_pred = np.argmax(predictions) 483 | elif mapping == 'm2m': 484 | # next_sentence = [] 485 | # for pred in predictions: 486 | # next_sentence.append(np.argmax(pred)) 487 | # y_next_pred = next_sentence[-1] 488 | # print "y_pred: " + ' '.join(str(id_).ljust(4) 489 | # for id_ in next_sentence) 490 | y_next_pred = np.argmax(predictions[-1]) 491 | next_prob = predictions[-1][y_next_true] 492 | prob[start_index + sentence_length] = next_prob 493 | 494 | print start_index, next_prob 495 | except KeyboardInterrupt: 496 | # print " |-Write the clusters into %s ..." %self.cluster_file 497 | with open('prob.txt', 'w') as prob_file: 498 | for p in prob: 499 | prob_file.write(str(p) + '\n') 500 | 501 | plt.plot(prob, 'r*') 502 | plt.xlim(0, 1000) 503 | plt.ylim(0, 1) 504 | plt.savefig("prob.png") 505 | plt.clf() 506 | plt.cla() 507 | 508 | stop_time = time.time() 509 | print "--- %s seconds ---\n" % (stop_time - start_time) 510 | 511 | return prob 512 | 513 | 514 | def run(hidden_len=512, batch_size=128, nb_batch=200, nb_epoch=50, 515 | nb_iterations=4, learning_rate=0.001, nb_predictions=20, mapping='m2m', 516 | sentence_length=80, step=80, mode='train'): 517 | """ 518 | Train, evaluate, or predict. 519 | 520 | Arguments: 521 | hidden_len: {integer}, the size of a hidden layer. 522 | batch_size: {interger}, the number of sentences per batch. 523 | nb_batch: {integer}, number of batches to be trained durign each epoch. 524 | nb_epoch: {interger}, number of epoches per iteration. 525 | nb_iterations: {integer}, number of iterations. 526 | learning_rate: {float}, learning rate. 527 | nb_predictions: {integer}, number of the ids predicted. 528 | mapping: {string}, input to output mapping. 529 | 'o2o': one-to-one 530 | 'm2m': many-to-many 531 | sentence_length: {integer}, the length of each training sentence. 532 | step: {integer}, the sample steps. 533 | mode: {string}, th running mode of this programm 534 | 'train': train and predict 535 | 'predict': only predict by loading existing model weights 536 | 'evaluate': evaluate the model in evaluation data set 537 | 'detect': detect a new log sequence for the probabilities 538 | """ 539 | # get parameters and dimensions of the model 540 | print "Loading training data..." 
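    # The training and validation vocabulary sizes are merged below with
    # max(), so both data generators emit one-hot vectors of the same width
    # covering every id class seen in either data set.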
541 | train_sequence, input_len1 = get_sequence("./train_data/*") 542 | print "Loading validation data..." 543 | val_sequence, input_len2 = get_sequence("./validation_data/*") 544 | input_len = max(input_len1, input_len2) 545 | 546 | print "Training sequence length: %d" %len(train_sequence) 547 | print "Validation sequence length: %d" %len(val_sequence) 548 | print "#classes: %d\n" %input_len 549 | 550 | # data generator of X_train and y_train, with random offset 551 | train_data = data_generator(train_sequence, input_len, mapping=mapping, 552 | sentence_length=sentence_length, step=step, 553 | random_offset=True, batch_size=batch_size) 554 | 555 | # data generator of X_val and y _val, with random offset 556 | val_data = data_generator(val_sequence, input_len, mapping=mapping, 557 | sentence_length=sentence_length, step=step, 558 | random_offset=True, batch_size=batch_size) 559 | 560 | # two layered LSTM 512 hidden nodes and a dropout rate of 0.2 561 | rnn = SequenceAnalyzer(sentence_length, input_len, hidden_len, input_len) 562 | 563 | # build model 564 | rnn.build(layer='LSTM', mapping=mapping, learning_rate=learning_rate, 565 | nb_layers=2, dropout=0.2) 566 | 567 | # plot model 568 | # rnn.plot_model() 569 | 570 | # load the previous model weights 571 | # rnn.load_model("weightsf4-61.hdf5") 572 | 573 | if mode == 'predict': 574 | print "Predict..." 575 | predict(val_sequence, input_len, rnn, nb_predictions=nb_predictions, 576 | mapping=mapping, sentence_length=sentence_length) 577 | elif mode == 'evaluate': 578 | print "Evaluate..." 579 | print "Metrics: " + ', '.join(rnn.model.metrics_names) 580 | X_val, y_val = data_generator(val_sequence, input_len, mapping=mapping, 581 | sentence_length=sentence_length, 582 | step=step, random_offset=False, 583 | batch_size=batch_size) 584 | results = rnn.model.evaluate(X_val, y_val, #pylint: disable=W0612 585 | batch_size=batch_size, 586 | verbose=1) 587 | print "Loss: ", results[0] 588 | print "Accuracy: ", results[1] 589 | elif mode == 'train': 590 | print "Train..." 591 | # number of training sampes and validation samples 592 | nb_training_samples = batch_size * nb_batch 593 | nb_validation_samples = int(nb_training_samples * 0.05) 594 | 595 | try: 596 | train(rnn, train_data, nb_training_samples, 597 | val_data, nb_validation_samples, 598 | nb_epoch=nb_epoch, nb_iterations=nb_iterations) 599 | except KeyboardInterrupt: 600 | rnn.save_model("weights-stop.hdf5") 601 | elif mode == 'detect': 602 | print "Detect..." 603 | detect(val_sequence, input_len, rnn, mapping=mapping, 604 | sentence_length=sentence_length) 605 | else: 606 | print "The mode = %s is not correct!!!" %mode 607 | 608 | return mode 609 | 610 | 611 | if __name__ == '__main__': 612 | run() 613 | -------------------------------------------------------------------------------- /others/sequence_analyzer.py: -------------------------------------------------------------------------------- 1 | """ 2 | This program analyze the integer sequence using (Uni-directional and 3 | Bi-directional) Recurrent Neural Network (RNN) with Long Short-Term Memory 4 | (LSTM) and Gated Recurrent Unit (GRU) based on the python library Keras. 5 | 6 | "Keras is a minimalist, highly modular neural networks library, written in 7 | Python and capable of running on top of either TensorFlow or Theano." 
8 | ---- Keras (http://keras.io/) 9 | 10 | Uni-directional model is based on the Keras example - lstm_text_generation: 11 | https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py 12 | 13 | Bi-directional model is based on the Keras example - imdb_bidirectional_lstm.py: 14 | https://github.com/fchollet/keras/blob/master/examples/imdb_bidirectional_lstm.py 15 | 16 | Author: Chang Liu (fluency03) 17 | Data: 2016-03-27 18 | """ 19 | 20 | import glob 21 | # import os 22 | import sys 23 | import csv 24 | import time 25 | import matplotlib.pyplot as plt 26 | import numpy as np 27 | 28 | from keras.callbacks import Callback, ModelCheckpoint 29 | from keras.layers import Input, Activation, Dense, Dropout, LSTM, GRU, merge 30 | from keras.layers.wrappers import TimeDistributed 31 | from keras.models import Sequential, Model 32 | from keras.optimizers import RMSprop # pylint: disable=W0611 33 | from keras.utils.visualize_util import plot 34 | 35 | 36 | # random number generator with a fixed value for reproducibility 37 | np.random.seed(1337) 38 | 39 | 40 | def override(f): 41 | """ 42 | Override decorator. 43 | """ 44 | return f 45 | 46 | 47 | class SequenceAnalyzer(object): 48 | """ 49 | Sequence analyzer based on RNN. 50 | """ 51 | def __init__(self, sentence_length, input_len, hidden_len, output_len): 52 | self.sentence_length = sentence_length 53 | self.input_len = input_len 54 | self.hidden_len = hidden_len 55 | self.output_len = output_len 56 | # model is defined at child class 57 | self.model = None 58 | 59 | def build(self, layer, mapping, learning_rate, nb_layers, dropout): 60 | """ 61 | Build model. 62 | """ 63 | pass 64 | 65 | def save_model(self, filename, overwrite=False): 66 | """ 67 | Save the model weight into a hdf5 file. 68 | 69 | Arguments: 70 | filename: {string}, the name/path to the file 71 | to which the weights are going to be saved. 72 | overwrite: {bool}, overwrite existing file. 73 | """ 74 | print "Save Weights %s ..." %filename 75 | self.model.save_weights(filename, overwrite=overwrite) 76 | 77 | def load_model(self, filename): 78 | """ 79 | Load the model weight into a hdf5 file. 80 | 81 | Arguments: 82 | filename: {string}, the name/path to the file 83 | to which the weights are going to be loaded. 84 | """ 85 | print "Load Weights %s ..." %filename 86 | self.model.load_weights(filename) 87 | 88 | def plot_model(self, filename): 89 | """ 90 | Plot model. 91 | 92 | Arguments: 93 | filename: {string}, the name/path to the file 94 | to which the model graphic is plotted. 95 | """ 96 | print "Plot Model %s ..." %filename 97 | plot(self.model, to_file=filename) 98 | 99 | 100 | class URNN(SequenceAnalyzer): 101 | """ 102 | Uni-directional RNN model of the sequence analyzer. Sequential Model. 103 | """ 104 | def __init__(self, sentence_length, input_len, hidden_len, output_len): 105 | super(URNN, self).__init__(sentence_length, 106 | input_len, hidden_len, output_len, 107 | return_sequence=True) 108 | self.model = Sequential() 109 | 110 | @override 111 | def build(self, layer='LSTM', mapping='m2m', learning_rate=0.001, 112 | nb_layers=2, dropout=0.2): 113 | """ 114 | Stacked RNN with specified dropout rate (default 0.2), built with 115 | softmax activation, cross entropy loss and rmsprop optimizer. 116 | 117 | Arguments: 118 | layer: {string}, the type of the layers in the RNN Model. 119 | 'LSTM': LSTM layers 120 | 'GRU': GRU layers 121 | mapping: {string}, input to output mapping. 
122 | 'o2o': one-to-one 123 | 'm2m': many-to-many 124 | learning_rate: {float}, learning rate. 125 | nb_layers: {integer}, number of layers in total. 126 | dropout: {float}, dropout value. 127 | """ 128 | print "Building Model..." 129 | print (" layer = %d-%s , mapping = %s , learning rate = %.5f " 130 | "nb_layers = %d , dropout = %.2f" 131 | %(self.hidden_len, layer, mapping, learning_rate, 132 | nb_layers, dropout)) 133 | 134 | # check the layer type: LSTM or GRU 135 | if layer == 'LSTM': 136 | class LAYER(LSTM): 137 | """ 138 | LAYER as LSTM. 139 | """ 140 | pass 141 | elif layer == 'GRU': 142 | class LAYER(GRU): 143 | """ 144 | LAYER as GRU. 145 | """ 146 | pass 147 | 148 | # check whether return sequence for each of the layers 149 | return_sequences = [] 150 | if mapping == 'o2o': 151 | # if mapping is one-to-one 152 | for nl in range(nb_layers): 153 | if nl == nb_layers-1: 154 | return_sequences.append(False) 155 | else: 156 | return_sequences.append(True) 157 | elif mapping == 'm2m': 158 | # if mapping is many-to-many 159 | for _ in range(nb_layers): 160 | return_sequences.append(True) 161 | 162 | # first layer RNN with specified number of nodes in the hidden layer. 163 | self.model.add(LAYER(self.hidden_len, 164 | return_sequences=return_sequences[0], 165 | input_shape=(self.sentence_length, 166 | self.input_len))) 167 | self.model.add(Dropout(dropout)) 168 | 169 | # the following layers 170 | for nl in range(nb_layers-1): 171 | self.model.add(LAYER(self.hidden_len, 172 | return_sequences=return_sequences[nl+1])) 173 | self.model.add(Dropout(dropout)) 174 | 175 | if mapping == 'o2o': 176 | # if mapping is one-to-one 177 | self.model.add(Dense(self.output_len)) 178 | elif mapping == 'm2m': 179 | # if mapping is many-to-many 180 | self.model.add(TimeDistributed(Dense(self.output_len))) 181 | 182 | self.model.add(Activation('softmax')) 183 | 184 | rms = RMSprop(lr=learning_rate) 185 | self.model.compile(loss='categorical_crossentropy', 186 | optimizer=rms, 187 | metrics=['accuracy']) 188 | 189 | 190 | class BRNN(SequenceAnalyzer): 191 | """ 192 | Bi-directional RNN model of the sequence analyzer. Graph Model. 193 | """ 194 | def __init__(self, sentence_length, input_len, hidden_len, output_len): 195 | super(BRNN, self).__init__(sentence_length, 196 | input_len, hidden_len, output_len) 197 | 198 | @override 199 | def build(self, layer='LSTM', mapping='m2m', learning_rate=0.001, 200 | nb_layers=2, dropout=0.2): 201 | """ 202 | Bidirectional RNN with specified dropout rate (default 0.2), built with 203 | softmax activation, cross entropy loss and rmsprop optimizer. 204 | 205 | Arguments: 206 | layer: {string}, the type of the layers in the RNN Model. 207 | 'LSTM': LSTM layers 208 | 'GRU': GRU layers 209 | mapping: {string}, input to output mapping. 210 | 'o2o': one-to-one 211 | 'm2m': many-to-many 212 | learning_rate: {float}, learning rate. 213 | nb_layers: {integer}, number of layers in total. 214 | dropout: {float}, dropout value. 215 | """ 216 | print "Building Model..." 217 | print (" layer = %d-%s , mapping = %s , " 218 | "nb_layers = %d , dropout = %.2f" 219 | %(self.hidden_len, layer, mapping, nb_layers, dropout)) 220 | 221 | # check the layer type: LSTM or GRU 222 | if layer == 'LSTM': 223 | class LAYER(LSTM): 224 | """ 225 | LAYER as LSTM. 226 | """ 227 | pass 228 | elif layer == 'GRU': 229 | class LAYER(GRU): 230 | """ 231 | LAYER as GRU. 
232 | """ 233 | pass 234 | 235 | # check whether return sequence for each of the layers 236 | return_sequences = [] 237 | if mapping == 'o2o': 238 | # if mapping is one-to-one 239 | for nl in range(nb_layers): 240 | if nl == nb_layers-1: 241 | return_sequences.append(False) 242 | else: 243 | return_sequences.append(True) 244 | elif mapping == 'm2m': 245 | # if mapping is many-to-many 246 | for _ in range(nb_layers): 247 | return_sequences.append(True) 248 | 249 | # add input 250 | input_layer = Input(shape=(self.sentence_length, self.input_len), 251 | dtype='float32') 252 | 253 | # first Bi-directional LSTM layer 254 | forward1 = LAYER(self.hidden_len, 255 | return_sequences=return_sequences[0])(input_layer) 256 | forward_dropout1 = Dropout(dropout)(forward1) # pylint: disable=W0612 257 | backward1 = LAYER(self.hidden_len, 258 | return_sequences=return_sequences[0], 259 | go_backwards=True)(input_layer) 260 | backward_dropout1 = Dropout(dropout)(backward1) # pylint: disable=W0612 261 | 262 | # following Bi-directional layers 263 | for nl in range(nb_layers-1): 264 | exec("%s = LAYER(self.hidden_len, return_sequences=%s)(%s)" 265 | %('forward' + str(nl+2), 266 | return_sequences[nl+1], 267 | 'forward_dropout' + str(nl+1))) 268 | exec("%s = Dropout(dropout)(%s)" 269 | %('forward_dropout' + str(nl+2), 270 | 'forward' + str(nl+2))) 271 | exec(("%s = LAYER(self.hidden_len, return_sequences=%s, " 272 | "go_backwards=True)(%s)") 273 | %('backward' + str(nl+2), 274 | return_sequences[nl+1], 275 | 'backward_dropout' + str(nl+1))) 276 | exec("%s = Dropout(dropout)(%s)" 277 | %('backward_dropout' + str(nl+2), 278 | 'backward' + str(nl+2))) 279 | 280 | merged_layer = merge([locals()['forward_dropout' + str(nb_layers)], 281 | locals()['backward_dropout' + str(nb_layers)]], 282 | mode='concat', concat_axis=-1) 283 | 284 | if mapping == 'o2o': 285 | output_layer = Dense(self.output_len, 286 | activation='softmax')(merged_layer) 287 | elif mapping == 'm2m': 288 | output_layer = TimeDistributed( 289 | Dense(self.output_len, activation='softmax'))(merged_layer) 290 | 291 | # add ouput 292 | self.model = Model(input=input_layer, output=output_layer) 293 | 294 | rms = RMSprop(lr=learning_rate) 295 | # try using different optimizers and different optimizer configs 296 | self.model.compile(loss='categorical_crossentropy', 297 | optimizer=rms, 298 | metrics=['accuracy']) 299 | 300 | 301 | class History(Callback): 302 | """ 303 | Record the loss and accuracy history. 304 | """ 305 | @override 306 | def on_train_begin(self, logs={}): # pylint: disable=W0102 307 | """ 308 | A method starting at the begining of the training. 309 | 310 | Arguments: 311 | logs: {dictionary}, recording the training and validation 312 | losses and accuracy of every epoch. 313 | """ 314 | # training loss and accuracy 315 | self.train_losses = [] 316 | self.train_acc = [] 317 | # validation loss and accuracy 318 | self.val_losses = [] 319 | self.val_acc = [] 320 | 321 | @override 322 | def on_epoch_end(self, epoch, logs={}): # pylint: disable=W0102 323 | """ 324 | A method starting at the begining of the training. 325 | 326 | Arguments: 327 | epoch: {integer}, the current epoch. 328 | logs: {dictionary}, recording the training and validation 329 | losses and accuracy of every epoch. 
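
        Besides appending to 'history.csv', the callback keeps the metrics
        in memory in self.train_losses, self.train_acc, self.val_losses and
        self.val_acc for later inspection or plotting.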
330 | """ 331 | # record training loss and accuracy 332 | self.train_losses.append(logs.get('loss')) 333 | self.train_acc.append(logs.get('acc')) 334 | # record validation loss and accuracy 335 | self.val_losses.append(logs.get('val_loss')) 336 | self.val_acc.append(logs.get('val_acc')) 337 | 338 | # continutously save the train_loss, train_acc, val_loss, val_acc 339 | # into a csv file with 4 columns respeactively 340 | csv_name = 'history.csv' 341 | with open(csv_name, 'a') as csvfile: 342 | his_writer = csv.writer(csvfile) 343 | print "\n Save loss and accuracy into %s" %csv_name 344 | his_writer.writerow((logs.get('loss'), logs.get('acc'), 345 | logs.get('val_loss'), logs.get('val_acc'))) 346 | 347 | 348 | def sample(prob, temperature=0.2): 349 | """ 350 | Softmax function for reinforcement learning. 351 | 352 | Arguments: 353 | prob: {list}, a list of probabilities of each of the classes. 354 | temperature: {float}, Softmax temperature. 355 | Returns: 356 | {integer}, the most possible sample. 357 | """ 358 | prob = np.log(prob) / temperature 359 | prob = np.exp(prob) / np.sum(np.exp(prob)) 360 | return np.argmax(np.random.multinomial(1, prob, 1)) 361 | 362 | 363 | def get_sequence(filepath): 364 | """ 365 | Get the original sequence from file. 366 | 367 | Arguments: 368 | filename: {string}, the name/path of input log sequence file. 369 | Returns: 370 | {list}, the log sequence. 371 | {integer}, the size of vocabulary. 372 | """ 373 | # read file and convert ids of each line into array of numbers 374 | seqfiles = glob.glob(filepath) 375 | sequence = [] 376 | 377 | for seqfile in seqfiles: 378 | with open(seqfile, 'r') as f: 379 | one_sequence = [int(id_) for id_ in f] 380 | print " %s, sequence length: %d" %(seqfile, 381 | len(one_sequence)) 382 | sequence.extend(one_sequence) 383 | 384 | # add two extra positions for 'unknown-log' and 'no-log' 385 | vocab_size = max(sequence) + 2 386 | 387 | return sequence, vocab_size 388 | 389 | 390 | def get_data(sequence, vocab_size, mapping='m2m', sentence_length=40, step=3, 391 | random_offset=True): 392 | """ 393 | Retrieves data from a plain txt file and formats it using one-hot vector. 394 | 395 | Arguments: 396 | sequence: {lsit}, the original input sequence 397 | vocab_size: {integer}, the number of unique id classes 398 | mapping: {string}, input to output mapping. 399 | 'o2o': one-to-one 400 | 'm2m': many-to-many 401 | sentence_length: {integer}, the length of each training sentence. 402 | step: {integer}, the sample steps. 403 | random_offset: {bool}, the offset is random between step or is 0. 
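
    Unlike the generator-based variants, this function materializes the
    whole one-hot encoded data set in memory at once: X_train alone holds
    roughly nb_samples * sentence_length * vocab_size boolean entries.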
404 | Returns: 405 | {np.array}, training input data X 406 | {np.array}, training target data y 407 | """ 408 | X_sentences = [] 409 | y_sentences = [] 410 | next_ids = [] 411 | 412 | offset = np.random.randint(0, step) if random_offset else 0 413 | 414 | # creat batch data and next sentences 415 | for i in range(offset, len(sequence) - sentence_length, step): 416 | X_sentences.append(sequence[i : i + sentence_length]) 417 | if mapping == 'o2o': 418 | # if mapping is one-to-one 419 | next_ids.append(sequence[i + sentence_length]) 420 | elif mapping == 'm2m': 421 | # if mapping is many-to-many 422 | y_sentences.append(sequence[i + 1 : i + sentence_length + 1]) 423 | 424 | # number of sampes 425 | nb_samples = len(X_sentences) 426 | # print "total # of sentences: %d" %nb_samples 427 | 428 | # one-hot vector (all zeros except for a single one at 429 | # the exact postion of this id number) 430 | X_train = np.zeros((nb_samples, sentence_length, vocab_size), dtype=np.bool) 431 | # expected outputs for each sentence 432 | if mapping == 'o2o': 433 | # if mapping is one-to-one 434 | y_train = np.zeros((nb_samples, vocab_size), dtype=np.bool) 435 | elif mapping == 'm2m': 436 | # if mapping is many-to-many 437 | y_train = np.zeros((nb_samples, sentence_length, vocab_size), 438 | dtype=np.bool) 439 | 440 | for i, x_sentence in enumerate(X_sentences): 441 | for t, id_ in enumerate(x_sentence): 442 | # mark the each corresponding character in a sentence as 1 443 | X_train[i, t, id_] = 1 444 | # if mapping is many-to-many 445 | if mapping == 'm2m': 446 | y_train[i, t, y_sentences[i][t]] = 1 447 | # if mapping is one-to-one 448 | # mark the corresponding character in expected output as 1 449 | if mapping == 'o2o': 450 | y_train[i, next_ids[i]] = 1 451 | 452 | return X_train, y_train 453 | 454 | 455 | def predict(sequence, input_len, analyzer, nb_predictions=80, 456 | mapping='m2m', sentence_length=40): 457 | """ 458 | Predict the next sequences using existing model and weights given some seed. 459 | 460 | Arguments: 461 | sequence: {lsit}, the original input sequence 462 | input_len: {integer}, the number of unique id classes 463 | analyzer: {SequenceAnalyzer}, the sequence analyzer 464 | nb_predictions: {integer}, number of predictions after giving the seed 465 | mapping: {string}, input to output mapping. 466 | 'o2o': one-to-one 467 | 'm2m': many-to-many 468 | sentence_length: {integer}, the length of each sentence. 
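
    The predicted id at each step is taken with np.argmax. The sample()
    helper defined above could be used instead to draw stochastically from
    the softmax output with a temperature, if more varied predictions are
    wanted; that is not done here.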
469 | """ 470 | # generate elements 471 | for _ in range(nb_predictions): 472 | # start index of the seed, random number in range 473 | start_index = np.random.randint(0, len(sequence) - sentence_length - 1) 474 | # seed sentence 475 | sentence = sequence[start_index : start_index + sentence_length] 476 | 477 | # Y_true 478 | y_true = sequence[start_index + 1 : start_index + sentence_length + 1] 479 | print "X: " + ' '.join(str(s).ljust(4) for s in sentence) 480 | 481 | seed = np.zeros((1, sentence_length, input_len)) 482 | # format input 483 | for t in range(0, sentence_length): 484 | seed[0, t, sentence[t]] = 1 485 | 486 | # get predictions 487 | # verbose = 0, no logging 488 | predictions = analyzer.model.predict(seed, verbose=0)[0] 489 | 490 | # y_predicted 491 | if mapping == 'o2o': 492 | next_id = np.argmax(predictions) 493 | sys.stdout.write(' ' + str(next_id)) 494 | sys.stdout.flush() 495 | elif mapping == 'm2m': 496 | next_sentence = [] 497 | for pred in predictions: 498 | next_sentence.append(np.argmax(pred)) 499 | print "y_pred: " + ' '.join(str(id_).ljust(4) 500 | for id_ in next_sentence) 501 | # next_id = np.argmax(predictions[-1]) 502 | 503 | # y_true 504 | print "y_true: " + ' '.join(str(s).ljust(4) for s in y_true) 505 | 506 | print "\n" 507 | 508 | 509 | def train(analyzer, train_sequence, val_sequence, input_len, 510 | batch_size=128, nb_epoch=50, nb_iterations=4, 511 | sentence_length=40, step=40, mapping='m2m'): 512 | """ 513 | Trains the network. 514 | 515 | Arguments: 516 | analyzer: {SequenceAnalyzer}. 517 | train_sequence: {list}, training sequence. 518 | val_sequence: {list}, validation sequence. 519 | input_len: {integer}, the number of classes, i.e., the input length of 520 | neural network. 521 | batch_size: {interger}, the number of sentences per batch. 522 | nb_epoch: {integer}, number of epoches per iteration. 523 | nb_iterations: {integer}, number of iterations. 524 | sentence_length: {integer}, the length of each training sentence. 525 | step: {integer}, the sample steps. 526 | mapping: {string}, input to output mapping. 527 | 'o2o': one-to-one 528 | 'm2m': many-to-many 529 | """ 530 | for iteration in range(1, nb_iterations+1): 531 | # create training data, randomize the offset between steps 532 | X_train, y_train = get_data(train_sequence, input_len, mapping=mapping, 533 | sentence_length=sentence_length, step=step, 534 | random_offset=False) 535 | X_val, y_val = get_data(val_sequence, input_len, mapping=mapping, 536 | sentence_length=sentence_length, step=step, 537 | random_offset=False) 538 | print "" 539 | print "------------------------ Start Training ------------------------" 540 | print "Iteration: ", iteration 541 | print "Number of epoch per iteration: ", nb_epoch 542 | 543 | # history of losses and accuracy 544 | history = History() 545 | 546 | # saves the model weights after each epoch 547 | # if the validation loss decreased 548 | checkpointer = ModelCheckpoint(filepath="weights.hdf5", 549 | verbose=1, save_best_only=True) 550 | 551 | # train the model 552 | analyzer.model.fit(X_train, y_train, 553 | batch_size=batch_size, nb_epoch=nb_epoch, verbose=1, 554 | callbacks=[history, checkpointer], 555 | validation_data=(X_val, y_val)) 556 | 557 | analyzer.save_model("weights-after-iteration.hdf5") 558 | 559 | 560 | def detect(sequence, input_len, analyzer, mapping='m2m', sentence_length=40): 561 | """ 562 | Scan the given sequence for detecting anormalies. 
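    The scan slides a window of sentence_length ids over the sequence and,
    for every position, records the probability the model assigns to the
    true next id; unusually low probabilities point at anomalous ids.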
563 | 564 | Arguments: 565 | sequence: {lsit}, the original input sequence 566 | input_len: {integer}, the number of unique id classes 567 | analyzer: {SequenceAnalyzer}, the sequence analyzer 568 | mapping: {string}, input to output mapping. 569 | 'o2o': one-to-one 570 | 'm2m': many-to-many 571 | sentence_length: {integer}, the length of each sentence. 572 | """ 573 | # sequence length 574 | length = len(sequence) 575 | 576 | # predicted probabilities for each id 577 | # we assume the first sentence_length ids are true 578 | prob = [1] * sentence_length + [0] * (length - sentence_length) 579 | 580 | start_time = time.time() 581 | try: 582 | # generate elements 583 | for start_index in xrange(length - sentence_length): 584 | # seed sentence 585 | X = sequence[start_index : start_index + sentence_length] 586 | # print "X: " + ' '.join(str(s).ljust(4) for s in sentence) 587 | 588 | # Y_true 589 | # y_true = sequence[start_index + 1 : start_index + sentence_length + 1] 590 | # print "y_true: " + ' '.join(str(s).ljust(4) for s in y_true) 591 | y_next_true = sequence[start_index + sentence_length] 592 | 593 | seed = np.zeros((1, sentence_length, input_len)) 594 | # format input 595 | for t in range(0, sentence_length): 596 | seed[0, t, X[t]] = 1 597 | 598 | # get predictionsverbose = 0, no logging 599 | predictions = analyzer.model.predict(seed, verbose=0)[0] 600 | 601 | # y_predicted 602 | y_next_pred = 0 603 | next_prob = 0 604 | if mapping == 'o2o': 605 | next_prob = predictions[y_next_true] 606 | prob[start_index + sentence_length] = next_prob 607 | y_next_pred = np.argmax(predictions) 608 | elif mapping == 'm2m': 609 | # next_sentence = [] 610 | # for pred in predictions: 611 | # next_sentence.append(np.argmax(pred)) 612 | # y_next_pred = next_sentence[-1] 613 | # print "y_pred: " + ' '.join(str(id_).ljust(4) 614 | # for id_ in next_sentence) 615 | y_next_pred = np.argmax(predictions[-1]) 616 | next_prob = predictions[-1][y_next_true] 617 | prob[start_index + sentence_length] = next_prob 618 | 619 | print start_index, next_prob 620 | except KeyboardInterrupt: 621 | # print " |-Write the clusters into %s ..." %self.cluster_file 622 | with open('prob.txt', 'w') as prob_file: 623 | for p in prob: 624 | prob_file.write(str(p) + '\n') 625 | 626 | plt.plot(prob, 'r*') 627 | plt.xlim(0, 1000) 628 | plt.ylim(0, 1) 629 | plt.savefig("prob.png") 630 | plt.clf() 631 | plt.cla() 632 | 633 | stop_time = time.time() 634 | print "--- %s seconds ---\n" % (stop_time - start_time) 635 | 636 | return prob 637 | 638 | 639 | def run(hidden_len=512, batch_size=128, nb_epoch=50, nb_iterations=4, 640 | learning_rate=0.001, nb_predictions=20, mapping='m2m', 641 | sentence_length=80, step=80, mode='train'): 642 | """ 643 | Train, evaluate, or predict. 644 | 645 | Arguments: 646 | hidden_len: {integer}, the size of a hidden layer. 647 | batch_size: {interger}, the number of sentences per batch. 648 | nb_epoch: {interger}, number of epoches per iteration. 649 | nb_iterations: {integer}, number of iterations. 650 | learning_rate: {float}, learning rate. 651 | nb_predictions: {integer}, number of the ids predicted. 652 | mapping: {string}, input to output mapping. 653 | 'o2o': one-to-one 654 | 'm2m': many-to-many 655 | sentence_length: {integer}, the length of each training sentence. 656 | step: {integer}, the sample steps. 
657 | mode: {string}, th running mode of this programm 658 | 'train': train and predict 659 | 'predict': only predict by loading existing model weights 660 | 'evaluate': evaluate the model in evaluation data set 661 | 'detect': detect a new log sequence for the probabilities 662 | """ 663 | # get parameters and dimensions of the model 664 | print "Loading training data..." 665 | train_sequence, input_len1 = get_sequence("./train_data/*") 666 | print "Loading validation data..." 667 | val_sequence, input_len2 = get_sequence("./validation_data/*") 668 | input_len = max(input_len1, input_len2) 669 | 670 | print "Training sequence length: %d" %len(train_sequence) 671 | print "Validation sequence length: %d" %len(val_sequence) 672 | print "#classes: %d\n" %input_len 673 | 674 | # two layered LSTM 512 hidden nodes and a dropout rate of 0.2 675 | analyzer = SequenceAnalyzer(sentence_length, 676 | input_len, hidden_len, input_len) 677 | 678 | # build model 679 | analyzer.build(layer='LSTM', mapping=mapping, learning_rate=learning_rate, 680 | nb_layers=2, dropout=0.2) 681 | 682 | # plot model 683 | # analyzer.plot_model() 684 | 685 | # load the previous model weights 686 | # analyzer.load_model("weightsf4-61.hdf5") 687 | 688 | if mode == 'predict': 689 | print "Predict..." 690 | predict(val_sequence, input_len, analyzer, 691 | nb_predictions=nb_predictions, mapping=mapping, 692 | sentence_length=sentence_length) 693 | elif mode == 'evaluate': 694 | print "Evaluate..." 695 | print "Metrics: " + ', '.join(analyzer.model.metrics_names) 696 | X_val, y_val = get_data(val_sequence, input_len, mapping=mapping, 697 | sentence_length=sentence_length, step=step, 698 | random_offset=False) 699 | results = analyzer.model.evaluate(X_val, y_val, #pylint: disable=W0612 700 | batch_size=batch_size, 701 | verbose=1) 702 | print "Loss: ", results[0] 703 | print "Accuracy: ", results[1] 704 | elif mode == 'train': 705 | print "Train..." 706 | try: 707 | train(analyzer, train_sequence, val_sequence, input_len, 708 | batch_size=batch_size, nb_epoch=nb_epoch, 709 | nb_iterations=nb_iterations, 710 | sentence_length=sentence_length, 711 | step=step, mapping=mapping) 712 | except KeyboardInterrupt: 713 | analyzer.save_model("weights-stop.hdf5") 714 | elif mode == 'detect': 715 | print "Detect..." 716 | detect(val_sequence, input_len, analyzer, mapping=mapping, 717 | sentence_length=sentence_length) 718 | else: 719 | print "The mode = %s is not correct!!!" %mode 720 | 721 | return mode 722 | 723 | 724 | if __name__ == '__main__': 725 | run() 726 | -------------------------------------------------------------------------------- /others/sequence_analyzer_gen.py: -------------------------------------------------------------------------------- 1 | """ 2 | This program analyze the integer sequence using (Uni-directional and 3 | Bi-directional) Recurrent Neural Network (RNN) with Long Short-Term Memory 4 | (LSTM) and Gated Recurrent Unit (GRU) based on the python library Keras. 5 | 6 | Input data is Generator and the training is by calling model.fit_generator(). 7 | 8 | "Keras is a minimalist, highly modular neural networks library, written in 9 | Python and capable of running on top of either TensorFlow or Theano." 
10 |                                     ---- Keras (http://keras.io/)
11 | 
12 | Uni-directional model is based on the Keras example - lstm_text_generation:
13 | https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py
14 | 
15 | Bi-directional model is based on the Keras example - imdb_bidirectional_lstm.py:
16 | https://github.com/fchollet/keras/blob/master/examples/imdb_bidirectional_lstm.py
17 | 
18 | Author: Chang Liu (fluency03)
19 | Date: 2016-04-03
20 | """
21 | 
22 | import glob
23 | # import os
24 | import sys
25 | import csv
26 | import time
27 | import matplotlib.pyplot as plt
28 | import numpy as np
29 | 
30 | from keras.callbacks import Callback, ModelCheckpoint
31 | from keras.layers import Input, Activation, Dense, Dropout, LSTM, GRU, merge
32 | from keras.layers.wrappers import TimeDistributed
33 | from keras.models import Sequential, Model
34 | from keras.optimizers import RMSprop # pylint: disable=W0611
35 | from keras.utils.visualize_util import plot
36 | 
37 | 
38 | # random number generator with a fixed value for reproducibility
39 | np.random.seed(1337)
40 | 
41 | 
42 | def override(f):
43 |     """
44 |     Override decorator.
45 |     """
46 |     return f
47 | 
48 | 
49 | class SequenceAnalyzer(object):
50 |     """
51 |     Sequence analyzer based on RNN.
52 |     """
53 |     def __init__(self, sentence_length, input_len, hidden_len, output_len):
54 |         self.sentence_length = sentence_length
55 |         self.input_len = input_len
56 |         self.hidden_len = hidden_len
57 |         self.output_len = output_len
58 |         # model is defined at child class
59 |         self.model = None
60 | 
61 |     def build(self, layer, mapping, learning_rate, nb_layers, dropout):
62 |         """
63 |         Build model.
64 |         """
65 |         pass
66 | 
67 |     def save_model(self, filename, overwrite=False):
68 |         """
69 |         Save the model weights into a hdf5 file.
70 | 
71 |         Arguments:
72 |             filename: {string}, the name/path to the file
73 |                 to which the weights are going to be saved.
74 |             overwrite: {bool}, overwrite existing file.
75 |         """
76 |         print "Save Weights %s ..." %filename
77 |         self.model.save_weights(filename, overwrite=overwrite)
78 | 
79 |     def load_model(self, filename):
80 |         """
81 |         Load the model weights from a hdf5 file.
82 | 
83 |         Arguments:
84 |             filename: {string}, the name/path to the file
85 |                 from which the weights are going to be loaded.
86 |         """
87 |         print "Load Weights %s ..." %filename
88 |         self.model.load_weights(filename)
89 | 
90 |     def plot_model(self, filename):
91 |         """
92 |         Plot model.
93 | 
94 |         Arguments:
95 |             filename: {string}, the name/path to the file
96 |                 to which the model graphic is plotted.
97 |         """
98 |         print "Plot Model %s ..." %filename
99 |         plot(self.model, to_file=filename)
100 | 
101 | 
102 | class URNN(SequenceAnalyzer):
103 |     """
104 |     Uni-directional RNN model of the sequence analyzer. Sequential Model.
105 |     """
106 |     def __init__(self, sentence_length, input_len, hidden_len, output_len):
107 |         # the parent constructor only takes the four dimension arguments
108 |         super(URNN, self).__init__(sentence_length,
109 |                                    input_len, hidden_len, output_len)
110 |         self.model = Sequential()
111 | 
112 |     @override
113 |     def build(self, layer='LSTM', mapping='m2m', learning_rate=0.001,
114 |               nb_layers=2, dropout=0.2):
115 |         """
116 |         Stacked RNN with specified dropout rate (default 0.2), built with
117 |         softmax activation, cross entropy loss and rmsprop optimizer.
118 | 
119 |         Arguments:
120 |             layer: {string}, the type of the layers in the RNN Model.
121 |                 'LSTM': LSTM layers
122 |                 'GRU': GRU layers
123 |             mapping: {string}, input to output mapping. 
124 | 'o2o': one-to-one 125 | 'm2m': many-to-many 126 | learning_rate: {float}, learning rate. 127 | nb_layers: {integer}, number of layers in total. 128 | dropout: {float}, dropout value. 129 | """ 130 | print "Building Model..." 131 | print (" layer = %d-%s , mapping = %s , learning rate = %.5f " 132 | "nb_layers = %d , dropout = %.2f" 133 | %(self.hidden_len, layer, mapping, learning_rate, 134 | nb_layers, dropout)) 135 | 136 | # check the layer type: LSTM or GRU 137 | if layer == 'LSTM': 138 | class LAYER(LSTM): 139 | """ 140 | LAYER as LSTM. 141 | """ 142 | pass 143 | elif layer == 'GRU': 144 | class LAYER(GRU): 145 | """ 146 | LAYER as GRU. 147 | """ 148 | pass 149 | 150 | # check whether return sequence for each of the layers 151 | return_sequences = [] 152 | if mapping == 'o2o': 153 | # if mapping is one-to-one 154 | for nl in range(nb_layers): 155 | if nl == nb_layers-1: 156 | return_sequences.append(False) 157 | else: 158 | return_sequences.append(True) 159 | elif mapping == 'm2m': 160 | # if mapping is many-to-many 161 | for _ in range(nb_layers): 162 | return_sequences.append(True) 163 | 164 | # first layer RNN with specified number of nodes in the hidden layer. 165 | self.model.add(LAYER(self.hidden_len, 166 | return_sequences=return_sequences[0], 167 | input_shape=(self.sentence_length, 168 | self.input_len))) 169 | self.model.add(Dropout(dropout)) 170 | 171 | # the following layers 172 | for nl in range(nb_layers-1): 173 | self.model.add(LAYER(self.hidden_len, 174 | return_sequences=return_sequences[nl+1])) 175 | self.model.add(Dropout(dropout)) 176 | 177 | if mapping == 'o2o': 178 | # if mapping is one-to-one 179 | self.model.add(Dense(self.output_len)) 180 | elif mapping == 'm2m': 181 | # if mapping is many-to-many 182 | self.model.add(TimeDistributed(Dense(self.output_len))) 183 | 184 | self.model.add(Activation('softmax')) 185 | 186 | rms = RMSprop(lr=learning_rate) 187 | self.model.compile(loss='categorical_crossentropy', 188 | optimizer=rms, 189 | metrics=['accuracy']) 190 | 191 | 192 | class BRNN(SequenceAnalyzer): 193 | """ 194 | Bi-directional RNN model of the sequence analyzer. Graph Model. 195 | """ 196 | def __init__(self, sentence_length, input_len, hidden_len, output_len): 197 | super(BRNN, self).__init__(sentence_length, 198 | input_len, hidden_len, output_len) 199 | 200 | @override 201 | def build(self, layer='LSTM', mapping='m2m', learning_rate=0.001, 202 | nb_layers=2, dropout=0.2): 203 | """ 204 | Bidirectional RNN with specified dropout rate (default 0.2), built with 205 | softmax activation, cross entropy loss and rmsprop optimizer. 206 | 207 | Arguments: 208 | layer: {string}, the type of the layers in the RNN Model. 209 | 'LSTM': LSTM layers 210 | 'GRU': GRU layers 211 | mapping: {string}, input to output mapping. 212 | 'o2o': one-to-one 213 | 'm2m': many-to-many 214 | learning_rate: {float}, learning rate. 215 | nb_layers: {integer}, number of layers in total. 216 | dropout: {float}, dropout value. 217 | """ 218 | print "Building Model..." 219 | print (" layer = %d-%s , mapping = %s , learning rate = %.5f " 220 | "nb_layers = %d , dropout = %.2f" 221 | %(self.hidden_len, layer, mapping, learning_rate, 222 | nb_layers, dropout)) 223 | 224 | # check the layer type: LSTM or GRU 225 | if layer == 'LSTM': 226 | class LAYER(LSTM): 227 | """ 228 | LAYER as LSTM. 229 | """ 230 | pass 231 | elif layer == 'GRU': 232 | class LAYER(GRU): 233 | """ 234 | LAYER as GRU. 
235 | """ 236 | pass 237 | 238 | # check whether return sequence for each of the layers 239 | return_sequences = [] 240 | if mapping == 'o2o': 241 | # if mapping is one-to-one 242 | for nl in range(nb_layers): 243 | if nl == nb_layers-1: 244 | return_sequences.append(False) 245 | else: 246 | return_sequences.append(True) 247 | elif mapping == 'm2m': 248 | # if mapping is many-to-many 249 | for _ in range(nb_layers): 250 | return_sequences.append(True) 251 | 252 | # add input 253 | input_layer = Input(shape=(self.sentence_length, self.input_len), 254 | dtype='float32') 255 | 256 | # first Bi-directional LSTM layer 257 | forward1 = LAYER(self.hidden_len, 258 | return_sequences=return_sequences[0])(input_layer) 259 | forward_dropout1 = Dropout(dropout)(forward1) # pylint: disable=W0612 260 | backward1 = LAYER(self.hidden_len, 261 | return_sequences=return_sequences[0], 262 | go_backwards=True)(input_layer) 263 | backward_dropout1 = Dropout(dropout)(backward1) # pylint: disable=W0612 264 | 265 | # following Bi-directional layers 266 | for nl in range(nb_layers-1): 267 | exec("%s = LAYER(self.hidden_len, return_sequences=%s)(%s)" 268 | %('forward' + str(nl+2), 269 | return_sequences[nl+1], 270 | 'forward_dropout' + str(nl+1))) 271 | exec("%s = Dropout(dropout)(%s)" 272 | %('forward_dropout' + str(nl+2), 273 | 'forward' + str(nl+2))) 274 | exec(("%s = LAYER(self.hidden_len, return_sequences=%s, " 275 | "go_backwards=True)(%s)") 276 | %('backward' + str(nl+2), 277 | return_sequences[nl+1], 278 | 'backward_dropout' + str(nl+1))) 279 | exec("%s = Dropout(dropout)(%s)" 280 | %('backward_dropout' + str(nl+2), 281 | 'backward' + str(nl+2))) 282 | 283 | merged_layer = merge([locals()['forward_dropout' + str(nb_layers)], 284 | locals()['backward_dropout' + str(nb_layers)]], 285 | mode='concat', concat_axis=-1) 286 | 287 | if mapping == 'o2o': 288 | output_layer = Dense(self.output_len, 289 | activation='softmax')(merged_layer) 290 | elif mapping == 'm2m': 291 | output_layer = TimeDistributed( 292 | Dense(self.output_len, activation='softmax'))(merged_layer) 293 | 294 | # add ouput 295 | self.model = Model(input=input_layer, output=output_layer) 296 | 297 | rms = RMSprop(lr=learning_rate) 298 | # try using different optimizers and different optimizer configs 299 | self.model.compile(loss='categorical_crossentropy', 300 | optimizer=rms, 301 | metrics=['accuracy']) 302 | 303 | 304 | class History(Callback): 305 | """ 306 | Record the loss and accuracy history. 307 | """ 308 | @override 309 | def on_train_begin(self, logs={}): # pylint: disable=W0102 310 | """ 311 | A method starting at the begining of the training. 312 | 313 | Arguments: 314 | logs: {dictionary}, recording the training and validation 315 | losses and accuracy of every epoch. 316 | """ 317 | # training loss and accuracy 318 | self.train_losses = [] 319 | self.train_acc = [] 320 | # validation loss and accuracy 321 | self.val_losses = [] 322 | self.val_acc = [] 323 | 324 | @override 325 | def on_epoch_end(self, epoch, logs={}): # pylint: disable=W0102 326 | """ 327 | A method starting at the begining of the training. 328 | 329 | Arguments: 330 | epoch: {integer}, the current epoch. 331 | logs: {dictionary}, recording the training and validation 332 | losses and accuracy of every epoch. 
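        Example (sketch for reading the history back; it assumes the
        four-column order written below and that numpy is available):

            import numpy as np
            train_loss, train_acc, val_loss, val_acc = np.loadtxt(
                'history.csv', delimiter=',').T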
333 |         """
334 |         # record training loss and accuracy
335 |         self.train_losses.append(logs.get('loss'))
336 |         self.train_acc.append(logs.get('acc'))
337 |         # record validation loss and accuracy
338 |         self.val_losses.append(logs.get('val_loss'))
339 |         self.val_acc.append(logs.get('val_acc'))
340 | 
341 |         # continuously save the train_loss, train_acc, val_loss, val_acc
342 |         # into a csv file with 4 columns respectively
343 |         csv_name = 'history.csv'
344 |         with open(csv_name, 'a') as csvfile:
345 |             his_writer = csv.writer(csvfile)
346 |             print "\n Save loss and accuracy into %s" %csv_name
347 |             his_writer.writerow((logs.get('loss'), logs.get('acc'),
348 |                                  logs.get('val_loss'), logs.get('val_acc')))
349 | 
350 | 
351 | def sample(prob, temperature=0.2):
352 |     """
353 |     Sample an index from a probability array, rescaled by a softmax temperature.
354 | 
355 |     Arguments:
356 |         prob: {list}, a list of probabilities of each of the classes.
357 |         temperature: {float}, Softmax temperature.
358 |     Returns:
359 |         {integer}, the index of the sampled class.
360 |     """
361 |     prob = np.log(prob) / temperature
362 |     prob = np.exp(prob) / np.sum(np.exp(prob))
363 |     return np.argmax(np.random.multinomial(1, prob, 1))
364 | 
365 | 
366 | def get_sequence(filepath):
367 |     """
368 |     Get the original sequence from file.
369 | 
370 |     Arguments:
371 |         filepath: {string}, the path/glob pattern of the input log sequence file(s).
372 |     Returns:
373 |         {list}, the log sequence.
374 |         {integer}, the size of vocabulary.
375 |     """
376 |     # read file and convert ids of each line into array of numbers
377 |     seqfiles = glob.glob(filepath)
378 |     sequence = []
379 | 
380 |     for seqfile in seqfiles:
381 |         with open(seqfile, 'r') as f:
382 |             one_sequence = [int(id_) for id_ in f]
383 |             print " %s, sequence length: %d" %(seqfile,
384 |                                                len(one_sequence))
385 |             sequence.extend(one_sequence)
386 | 
387 |     # add two extra positions for 'unknown-log' and 'no-log'
388 |     vocab_size = max(sequence) + 2
389 | 
390 |     return sequence, vocab_size
391 | 
392 | 
393 | def data_generator(sequence, vocab_size, mapping='m2m', sentence_length=40,
394 |                    step=3, random_offset=True, batch_size=128):
395 |     """
396 |     Retrieves data from a plain txt file and formats it using one-hot vectors.
397 |     This method returns a data generator yielding one batch of
398 |     (X_train, y_train) at a time.
399 | 
400 |     Arguments:
401 |         sequence: {list}, the original input sequence
402 |         vocab_size: {integer}, the number of unique id classes
403 |         mapping: {string}, input to output mapping.
404 |             'o2o': one-to-one
405 |             'm2m': many-to-many
406 |         sentence_length: {integer}, the length of each training sentence.
407 |         step: {integer}, the sample steps.
408 |         random_offset: {bool}, the offset is random between step or is 0.
409 |         batch_size: {integer}, the number of samples per batch. 
410 | Yields: 411 | {np.array}, training input data X 412 | {np.array}, training target data y 413 | """ 414 | # the number of current sample 415 | sample_count = 0 416 | 417 | # one-hot vector (all zeros except for a single one at 418 | # the exact postion of this id number) 419 | X_train = np.zeros((batch_size, sentence_length, vocab_size), 420 | dtype=np.bool) 421 | # expected outputs for each sentence 422 | if mapping == 'o2o': 423 | # if mapping is one-to-one 424 | y_train = np.zeros((batch_size, vocab_size), dtype=np.bool) 425 | elif mapping == 'm2m': 426 | # if mapping is many-to-many 427 | y_train = np.zeros((batch_size, sentence_length, vocab_size), 428 | dtype=np.bool) 429 | 430 | # continuousy creat batch data and next sentences 431 | while True: 432 | offset = np.random.randint(0, step) if random_offset else 0 433 | for i in range(offset, len(sequence) - sentence_length, step): 434 | # index of a this sample in this batch 435 | batch_index = sample_count % batch_size 436 | 437 | # re-initialzing the batch 438 | if batch_index == 0: 439 | X_train.fill(0) 440 | y_train.fill(0) 441 | 442 | # current sample and target outputs 443 | X_sentence = [] 444 | y_sentence = [] 445 | next_id = [] 446 | 447 | X_sentence = sequence[i : i + sentence_length] 448 | if mapping == 'o2o': 449 | # if mapping is one-to-one 450 | next_id = sequence[i + sentence_length] 451 | elif mapping == 'm2m': 452 | # if mapping is many-to-many 453 | y_sentence = sequence[i + 1 : i + sentence_length + 1] 454 | 455 | for t, id_ in enumerate(X_sentence): 456 | # mark the each corresponding character in a sentence as 1 457 | X_train[batch_index, t, id_] = 1 458 | # if mapping is many-to-many 459 | if mapping == 'm2m': 460 | y_train[batch_index, t, y_sentence[t]] = 1 461 | # if mapping is one-to-one 462 | # mark the corresponding character in expected output as 1 463 | if mapping == 'o2o': 464 | y_train[batch_index, next_id] = 1 465 | 466 | # sample count plus 1 467 | sample_count += 1 468 | 469 | if batch_index == batch_size-1: 470 | yield X_train, y_train 471 | 472 | 473 | def predict(sequence, input_len, analyzer, nb_predictions=80, 474 | mapping='m2m', sentence_length=40): 475 | """ 476 | Predict the next sequences using existing model and weights given some seed. 477 | 478 | Arguments: 479 | sequence: {lsit}, the original input sequence 480 | input_len: {integer}, the number of unique id classes 481 | analyzer: {SequenceAnalyzer}, the sequence analyzer 482 | nb_predictions: {integer}, number of predictions after giving the seed 483 | mapping: {string}, input to output mapping. 484 | 'o2o': one-to-one 485 | 'm2m': many-to-many 486 | sentence_length: {integer}, the length of each sentence. 
487 | """ 488 | # generate elements 489 | for _ in range(nb_predictions): 490 | # start index of the seed, random number in range 491 | start_index = np.random.randint(0, len(sequence) - sentence_length - 1) 492 | # seed sentence 493 | sentence = sequence[start_index : start_index + sentence_length] 494 | 495 | # Y_true 496 | y_true = sequence[start_index + 1 : start_index + sentence_length + 1] 497 | print "X: " + ' '.join(str(s).ljust(4) for s in sentence) 498 | 499 | seed = np.zeros((1, sentence_length, input_len)) 500 | # format input 501 | for t in range(0, sentence_length): 502 | seed[0, t, sentence[t]] = 1 503 | 504 | # get predictions 505 | # verbose = 0, no logging 506 | predictions = analyzer.model.predict(seed, verbose=0)[0] 507 | 508 | # y_predicted 509 | if mapping == 'o2o': 510 | next_id = np.argmax(predictions) 511 | sys.stdout.write(' ' + str(next_id)) 512 | sys.stdout.flush() 513 | elif mapping == 'm2m': 514 | next_sentence = [] 515 | for pred in predictions: 516 | next_sentence.append(np.argmax(pred)) 517 | print "y_pred: " + ' '.join(str(id_).ljust(4) 518 | for id_ in next_sentence) 519 | # next_id = np.argmax(predictions[-1]) 520 | 521 | # y_true 522 | print "y_true: " + ' '.join(str(s).ljust(4) for s in y_true) 523 | 524 | print "\n" 525 | 526 | 527 | def train(analyzer, train_data, nb_training_samples, 528 | val_data, nb_validation_samples, 529 | nb_epoch=50, nb_iterations=4): 530 | """ 531 | Trains the network. 532 | 533 | Arguments: 534 | analyzer: {SequenceAnalyzer}. 535 | train_data: {tuple}, training data (X_train, y_train). 536 | val_data: {tuple}, validation data (X_val, y_val). 537 | nb_training_samples: {integer}, the number training samples. 538 | nb_validation_samples: {integer}, the number validation samples. 539 | nb_iterations: {integer}, number of iterations. 540 | sentence_length: {integer}, the length of each training sentence. 541 | """ 542 | for iteration in range(1, nb_iterations+1): 543 | print "" 544 | print "------------------------ Start Training ------------------------" 545 | print "Iteration: ", iteration 546 | print "Number of epoch per iteration: ", nb_epoch 547 | 548 | # history of losses and accuracy 549 | history = History() 550 | 551 | # saves the model weights after each epoch 552 | # if the validation loss decreased 553 | checkpointer = ModelCheckpoint(filepath="weights.hdf5", 554 | verbose=1, save_best_only=True) 555 | 556 | # train the model with data generator 557 | analyzer.model.fit_generator(train_data, 558 | samples_per_epoch=nb_training_samples, 559 | nb_epoch=nb_epoch, verbose=1, 560 | callbacks=[history, checkpointer], 561 | validation_data=val_data, 562 | nb_val_samples=nb_validation_samples) 563 | 564 | analyzer.save_model("weights-after-iteration.hdf5") 565 | 566 | 567 | def detect(sequence, input_len, analyzer, mapping='m2m', sentence_length=40): 568 | """ 569 | Scan the given sequence for detecting anormalies. 570 | 571 | Arguments: 572 | sequence: {lsit}, the original input sequence 573 | input_len: {integer}, the number of unique id classes 574 | analyzer: {SequenceAnalyzer}, the sequence analyzer 575 | mapping: {string}, input to output mapping. 576 | 'o2o': one-to-one 577 | 'm2m': many-to-many 578 | sentence_length: {integer}, the length of each sentence. 
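    Example (illustrative sketch only; it assumes a trained `analyzer` and a
    sequence loaded as in run()):

        seq, nb_classes = get_sequence("./validation_data/*")
        prob = detect(seq, nb_classes, analyzer, mapping='m2m',
                      sentence_length=40)

    prob[i] is the probability the model assigned to the id that actually
    occurs at position i (the first sentence_length entries are fixed to 1);
    unusually small values point at anomalous log ids.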
579 | """ 580 | # sequence length 581 | length = len(sequence) 582 | 583 | # predicted probabilities for each id 584 | # we assume the first sentence_length ids are true 585 | prob = [1] * sentence_length + [0] * (length - sentence_length) 586 | 587 | start_time = time.time() 588 | try: 589 | # generate elements 590 | for start_index in xrange(length - sentence_length): 591 | # seed sentence 592 | X = sequence[start_index : start_index + sentence_length] 593 | # print "X: " + ' '.join(str(s).ljust(4) for s in sentence) 594 | 595 | # Y_true 596 | # y_true = sequence[start_index + 1 : start_index + sentence_length + 1] 597 | # print "y_true: " + ' '.join(str(s).ljust(4) for s in y_true) 598 | y_next_true = sequence[start_index + sentence_length] 599 | 600 | seed = np.zeros((1, sentence_length, input_len)) 601 | # format input 602 | for t in range(0, sentence_length): 603 | seed[0, t, X[t]] = 1 604 | 605 | # get predictionsverbose = 0, no logging 606 | predictions = analyzer.model.predict(seed, verbose=0)[0] 607 | 608 | # y_predicted 609 | y_next_pred = 0 610 | next_prob = 0 611 | if mapping == 'o2o': 612 | next_prob = predictions[y_next_true] 613 | prob[start_index + sentence_length] = next_prob 614 | y_next_pred = np.argmax(predictions) 615 | elif mapping == 'm2m': 616 | # next_sentence = [] 617 | # for pred in predictions: 618 | # next_sentence.append(np.argmax(pred)) 619 | # y_next_pred = next_sentence[-1] 620 | # print "y_pred: " + ' '.join(str(id_).ljust(4) 621 | # for id_ in next_sentence) 622 | y_next_pred = np.argmax(predictions[-1]) 623 | next_prob = predictions[-1][y_next_true] 624 | prob[start_index + sentence_length] = next_prob 625 | 626 | print start_index, next_prob 627 | except KeyboardInterrupt: 628 | # print " |-Write the clusters into %s ..." %self.cluster_file 629 | with open('prob.txt', 'w') as prob_file: 630 | for p in prob: 631 | prob_file.write(str(p) + '\n') 632 | 633 | plt.plot(prob, 'r*') 634 | plt.xlim(0, 1000) 635 | plt.ylim(0, 1) 636 | plt.savefig("prob.png") 637 | plt.clf() 638 | plt.cla() 639 | 640 | stop_time = time.time() 641 | print "--- %s seconds ---\n" % (stop_time - start_time) 642 | 643 | return prob 644 | 645 | 646 | def run(hidden_len=512, batch_size=128, nb_batch=200, nb_epoch=50, 647 | nb_iterations=4, learning_rate=0.001, nb_predictions=20, 648 | mapping='m2m', sentence_length=80, step=80, mode='train'): 649 | """ 650 | Train, evaluate, or predict. 651 | 652 | Arguments: 653 | hidden_len: {integer}, the size of a hidden layer. 654 | batch_size: {interger}, the number of sentences per batch. 655 | nb_batch: {integer}, number of batches to be trained durign each epoch. 656 | nb_epoch: {interger}, number of epoches per iteration. 657 | nb_iterations: {integer}, number of iterations. 658 | learning_rate: {float}, learning rate. 659 | nb_predictions: {integer}, number of the ids predicted. 660 | mapping: {string}, input to output mapping. 661 | 'o2o': one-to-one 662 | 'm2m': many-to-many 663 | sentence_length: {integer}, the length of each training sentence. 664 | step: {integer}, the sample steps. 665 | mode: {string}, th running mode of this programm 666 | 'train': train and predict 667 | 'predict': only predict by loading existing model weights 668 | 'evaluate': evaluate the model in evaluation data set 669 | 'detect': detect a new log sequence for the probabilities 670 | """ 671 | # get parameters and dimensions of the model 672 | print "Loading training data..." 
673 |     train_sequence, input_len1 = get_sequence("./train_data/*")
674 |     print "Loading validation data..."
675 |     val_sequence, input_len2 = get_sequence("./validation_data/*")
676 |     input_len = max(input_len1, input_len2)
677 | 
678 |     print "Training sequence length: %d" %len(train_sequence)
679 |     print "Validation sequence length: %d" %len(val_sequence)
680 |     print "#classes: %d\n" %input_len
681 | 
682 |     # data generator of X_train and y_train, with random offset
683 |     train_data = data_generator(train_sequence, input_len, mapping=mapping,
684 |                                 sentence_length=sentence_length, step=step,
685 |                                 random_offset=True, batch_size=batch_size)
686 | 
687 |     # data generator of X_val and y_val, with random offset
688 |     val_data = data_generator(val_sequence, input_len, mapping=mapping,
689 |                               sentence_length=sentence_length, step=step,
690 |                               random_offset=True, batch_size=batch_size)
691 | 
692 |     # uni-directional RNN: two LSTM layers, 512 hidden nodes, dropout rate 0.2
693 |     analyzer = URNN(sentence_length,
694 |                     input_len, hidden_len, input_len)
695 | 
696 |     # build model
697 |     analyzer.build(layer='LSTM', mapping=mapping, learning_rate=learning_rate,
698 |                    nb_layers=2, dropout=0.2)
699 | 
700 |     # plot model
701 |     # analyzer.plot_model()
702 | 
703 |     # load the previous model weights
704 |     # analyzer.load_model("weightsf4-61.hdf5")
705 | 
706 |     if mode == 'predict':
707 |         print "Predict..."
708 |         predict(val_sequence, input_len, analyzer, nb_predictions=nb_predictions,
709 |                 mapping=mapping, sentence_length=sentence_length)
710 |     elif mode == 'evaluate':
711 |         print "Evaluate..."
712 |         print "Metrics: " + ', '.join(analyzer.model.metrics_names)
713 |         # evaluate on a single batch drawn from the validation data generator
714 |         X_val, y_val = next(data_generator(val_sequence, input_len,
715 |                                            mapping=mapping,
716 |                                            sentence_length=sentence_length,
717 |                                            step=step, random_offset=False,
718 |                                            batch_size=batch_size))
719 |         results = analyzer.model.evaluate(X_val, y_val, #pylint: disable=W0612
720 |                                           batch_size=batch_size, verbose=1)
721 |         print "Loss: ", results[0]
722 |         print "Accuracy: ", results[1]
723 |     elif mode == 'train':
724 |         print "Train..."
725 |         # number of training samples and validation samples
726 |         nb_training_samples = batch_size * nb_batch
727 |         nb_validation_samples = int(nb_training_samples * 0.05)
728 | 
729 |         try:
730 |             train(analyzer, train_data, nb_training_samples,
731 |                   val_data, nb_validation_samples,
732 |                   nb_epoch=nb_epoch, nb_iterations=nb_iterations)
733 |         except KeyboardInterrupt:
734 |             analyzer.save_model("weights-stop.hdf5")
735 |     elif mode == 'detect':
736 |         print "Detect..."
737 |         detect(val_sequence, input_len, analyzer, mapping=mapping,
738 |                sentence_length=sentence_length)
739 |     else:
740 |         print "The mode = %s is not correct!!!" 
%mode 740 | 741 | return mode 742 | 743 | 744 | if __name__ == '__main__': 745 | run() 746 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | git+https://github.com/fchollet/keras.git 2 | git+https://github.com/Theano/Theano.git 3 | git+https://github.com/scipy/scipy.git 4 | git+https://github.com/numpy/numpy.git 5 | cython 6 | -------------------------------------------------------------------------------- /rnn_model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fluency03/sequence-rnn-py/0a55a8fcc93644bca216afc660564d3a606886ab/rnn_model.png -------------------------------------------------------------------------------- /rnn_sequence_analyzer.py: -------------------------------------------------------------------------------- 1 | """ 2 | This program analyze the integer sequence using Uni-diractional Recurrent Neural 3 | Network (RNN) with Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) 4 | based on the python library Keras. 5 | 6 | "Keras is a minimalist, highly modular neural networks library, written in 7 | Python and capable of running on top of either TensorFlow or Theano." 8 | ---- Keras (http://keras.io/) 9 | 10 | It is based on this Keras example - lstm_text_generation: 11 | https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py 12 | 13 | Author: Chang Liu (fluency03) 14 | Data: 2016-03-17 15 | """ 16 | 17 | from math import log 18 | import glob 19 | # import os 20 | import sys 21 | import csv 22 | import time 23 | import matplotlib.pyplot as plt 24 | import numpy as np 25 | 26 | from keras.callbacks import Callback, ModelCheckpoint 27 | from keras.layers import Activation, Dense, Dropout, LSTM, GRU 28 | from keras.layers.wrappers import TimeDistributed 29 | from keras.models import Sequential 30 | from keras.optimizers import RMSprop # pylint: disable=W0611 31 | from keras.utils.visualize_util import plot 32 | 33 | 34 | # random number generator with a fixed value for reproducibility 35 | np.random.seed(1337) 36 | 37 | 38 | def override(f): 39 | """ 40 | Override decorator. 41 | """ 42 | return f 43 | 44 | 45 | class SequenceAnalyzer(object): 46 | """ 47 | Sequence analyzer based on RNN Sequential Model. 48 | """ 49 | def __init__(self, sentence_length, input_len, hidden_len, output_len): 50 | self.sentence_length = sentence_length 51 | self.input_len = input_len 52 | self.hidden_len = hidden_len 53 | self.output_len = output_len 54 | self.model = Sequential() 55 | 56 | def build(self, layer='LSTM', mapping='m2m', learning_rate=0.001, 57 | nb_layers=2, dropout=0.2): 58 | """ 59 | Stacked RNN with specified dropout rate (default 0.2), built with 60 | softmax activation, cross entropy loss and rmsprop optimizer. 61 | 62 | Arguments: 63 | layer: {string}, the type of the layers in the RNN Model. 64 | 'LSTM': LSTM layers 65 | 'GRU': GRU layers 66 | mapping: {string}, input to output mapping. 67 | 'o2o': one-to-one 68 | 'm2m': many-to-many 69 | learning_rate: {float}, learning rate. 70 | nb_layers: {integer}, number of layers in total. 71 | dropout: {float}, dropout value. 72 | """ 73 | print "Building Model..." 
74 | print (" layer = %d-%s , mapping = %s , learning rate = %.5f, " 75 | "nb_layers = %d , dropout = %.2f" 76 | %(self.hidden_len, layer, mapping, learning_rate, 77 | nb_layers, dropout)) 78 | 79 | # check the layer type: LSTM or GRU 80 | if layer == 'LSTM': 81 | class LAYER(LSTM): 82 | """ 83 | LAYER as LSTM. 84 | """ 85 | pass 86 | elif layer == 'GRU': 87 | class LAYER(GRU): 88 | """ 89 | LAYER as GRU. 90 | """ 91 | pass 92 | 93 | # check whether return sequence for each of the layers 94 | return_sequences = [] 95 | if mapping == 'o2o': 96 | # if mapping is one-to-one 97 | for nl in range(nb_layers): 98 | if nl == nb_layers-1: 99 | return_sequences.append(False) 100 | else: 101 | return_sequences.append(True) 102 | elif mapping == 'm2m': 103 | # if mapping is many-to-many 104 | for _ in range(nb_layers): 105 | return_sequences.append(True) 106 | 107 | # first layer RNN with specified number of nodes in the hidden layer. 108 | self.model.add(LAYER(self.hidden_len, 109 | return_sequences=return_sequences[0], 110 | input_shape=(self.sentence_length, 111 | self.input_len))) 112 | self.model.add(Dropout(dropout)) 113 | 114 | # the following layers 115 | for nl in range(nb_layers-1): 116 | self.model.add(LAYER(self.hidden_len, 117 | return_sequences=return_sequences[nl+1])) 118 | self.model.add(Dropout(dropout)) 119 | 120 | if mapping == 'o2o': 121 | # if mapping is one-to-one 122 | self.model.add(Dense(self.output_len)) 123 | elif mapping == 'm2m': 124 | # if mapping is many-to-many 125 | self.model.add(TimeDistributed(Dense(self.output_len))) 126 | 127 | self.model.add(Activation('softmax')) 128 | 129 | rms = RMSprop(lr=learning_rate) 130 | self.model.compile(loss='categorical_crossentropy', 131 | optimizer=rms, 132 | metrics=['accuracy']) 133 | 134 | def save_model(self, filename, overwrite=False): 135 | """ 136 | Save the model weight into a hdf5 file. 137 | 138 | Arguments: 139 | filename: {string}, the name/path to the file 140 | to which the weights are going to be saved. 141 | overwrite: {bool}, overwrite existing file. 142 | """ 143 | print "Save Weights %s ..." %filename 144 | self.model.save_weights(filename, overwrite=overwrite) 145 | 146 | def load_model(self, filename): 147 | """ 148 | Load the model weight into a hdf5 file. 149 | 150 | Arguments: 151 | filename: {string}, the name/path to the file 152 | to which the weights are going to be loaded. 153 | """ 154 | print "Load Weights %s ..." %filename 155 | self.model.load_weights(filename) 156 | 157 | def plot_model(self, filename='rnn_model.png'): 158 | """ 159 | Plot model. 160 | 161 | Arguments: 162 | filename: {string}, the name/path to the file 163 | to which the weights are going to be plotted. 164 | """ 165 | print "Plot Model %s ..." %filename 166 | plot(self.model, to_file=filename) 167 | 168 | 169 | class History(Callback): 170 | """ 171 | Record the loss and accuracy history. 172 | """ 173 | @override 174 | def on_train_begin(self, logs={}): # pylint: disable=W0102 175 | """ 176 | A method starting at the begining of the training. 177 | 178 | Arguments: 179 | logs: {dictionary}, recording the training and validation 180 | losses and accuracy of every epoch. 181 | """ 182 | # training loss and accuracy 183 | self.train_losses = [] 184 | self.train_acc = [] 185 | # validation loss and accuracy 186 | self.val_losses = [] 187 | self.val_acc = [] 188 | 189 | @override 190 | def on_epoch_end(self, epoch, logs={}): # pylint: disable=W0102 191 | """ 192 | A method starting at the begining of the training. 
193 | 194 | Arguments: 195 | epoch: {integer}, the current epoch. 196 | logs: {dictionary}, recording the training and validation 197 | losses and accuracy of every epoch. 198 | """ 199 | # record training loss and accuracy 200 | self.train_losses.append(logs.get('loss')) 201 | self.train_acc.append(logs.get('acc')) 202 | # record validation loss and accuracy 203 | self.val_losses.append(logs.get('val_loss')) 204 | self.val_acc.append(logs.get('val_acc')) 205 | 206 | # continutously save the train_loss, train_acc, val_loss, val_acc 207 | # into a csv file with 4 columns respeactively 208 | csv_name = 'history.csv' 209 | with open(csv_name, 'a') as csvfile: 210 | his_writer = csv.writer(csvfile) 211 | print "\n Save loss and accuracy into %s" %csv_name 212 | his_writer.writerow((logs.get('loss'), logs.get('acc'), 213 | logs.get('val_loss'), logs.get('val_acc'))) 214 | 215 | 216 | def sample(prob, temperature=0.2): 217 | """ 218 | Softmax function for reinforcement learning. 219 | 220 | Arguments: 221 | prob: {list}, a list of probabilities of each of the classes. 222 | temperature: {float}, Softmax temperature. 223 | Returns: 224 | {integer}, the most possible sample. 225 | """ 226 | prob = np.log(prob) / temperature 227 | prob = np.exp(prob) / np.sum(np.exp(prob)) 228 | return np.argmax(np.random.multinomial(1, prob, 1)) 229 | 230 | 231 | def get_sequence(filepath): 232 | """ 233 | Get the original sequence from file. 234 | 235 | Arguments: 236 | filename: {string}, the name/path of input log sequence file. 237 | Returns: 238 | {list}, the log sequence. 239 | {integer}, the size of vocabulary. 240 | """ 241 | # read file and convert ids of each line into array of numbers 242 | seqfiles = glob.glob(filepath) 243 | sequence = [] 244 | 245 | for seqfile in seqfiles: 246 | with open(seqfile, 'r') as f: 247 | one_sequence = [int(id_) for id_ in f] 248 | print " %s, sequence length: %d" %(seqfile, 249 | len(one_sequence)) 250 | sequence.extend(one_sequence) 251 | 252 | # add two extra positions for 'unknown-log' and 'no-log' 253 | vocab_size = max(sequence) + 2 254 | 255 | return sequence, vocab_size 256 | 257 | 258 | def get_data(sequence, vocab_size, mapping='m2m', sentence_length=40, step=3, 259 | random_offset=True): 260 | """ 261 | Retrieves data from a plain txt file and formats it using one-hot vector. 262 | 263 | Arguments: 264 | sequence: {lsit}, the original input sequence 265 | vocab_size: {integer}, the number of unique id classes 266 | mapping: {string}, input to output mapping. 267 | 'o2o': one-to-one 268 | 'm2m': many-to-many 269 | sentence_length: {integer}, the length of each training sentence. 270 | step: {integer}, the sample steps. 271 | random_offset: {bool}, the offset is random between step or is 0. 
272 | Returns: 273 | {np.array}, training input data X 274 | {np.array}, training target data y 275 | """ 276 | X_sentences = [] 277 | y_sentences = [] 278 | next_ids = [] 279 | 280 | offset = np.random.randint(0, step) if random_offset else 0 281 | 282 | # creat batch data and next sentences 283 | for i in range(offset, len(sequence) - sentence_length, step): 284 | X_sentences.append(sequence[i : i + sentence_length]) 285 | if mapping == 'o2o': 286 | # if mapping is one-to-one 287 | next_ids.append(sequence[i + sentence_length]) 288 | elif mapping == 'm2m': 289 | # if mapping is many-to-many 290 | y_sentences.append(sequence[i + 1 : i + sentence_length + 1]) 291 | 292 | # number of sampes 293 | nb_samples = len(X_sentences) 294 | # print "total # of sentences: %d" %nb_samples 295 | 296 | # one-hot vector (all zeros except for a single one at 297 | # the exact postion of this id number) 298 | X_train = np.zeros((nb_samples, sentence_length, vocab_size), dtype=np.bool) 299 | # expected outputs for each sentence 300 | if mapping == 'o2o': 301 | # if mapping is one-to-one 302 | y_train = np.zeros((nb_samples, vocab_size), dtype=np.bool) 303 | elif mapping == 'm2m': 304 | # if mapping is many-to-many 305 | y_train = np.zeros((nb_samples, sentence_length, vocab_size), 306 | dtype=np.bool) 307 | 308 | for i, x_sentence in enumerate(X_sentences): 309 | for t, id_ in enumerate(x_sentence): 310 | # mark the each corresponding character in a sentence as 1 311 | X_train[i, t, id_] = 1 312 | # if mapping is many-to-many 313 | if mapping == 'm2m': 314 | y_train[i, t, y_sentences[i][t]] = 1 315 | # if mapping is one-to-one 316 | # mark the corresponding character in expected output as 1 317 | if mapping == 'o2o': 318 | y_train[i, next_ids[i]] = 1 319 | 320 | return X_train, y_train 321 | 322 | 323 | def predict(sequence, input_len, analyzer, nb_predictions=80, 324 | mapping='m2m', sentence_length=40): 325 | """ 326 | Predict the next sequences using existing model and weights given some seed. 327 | 328 | Arguments: 329 | sequence: {lsit}, the original input sequence 330 | input_len: {integer}, the number of unique id classes 331 | analyzer: {SequenceAnalyzer}, the sequence analyzer 332 | nb_predictions: {integer}, number of predictions after giving the seed 333 | mapping: {string}, input to output mapping. 334 | 'o2o': one-to-one 335 | 'm2m': many-to-many 336 | sentence_length: {integer}, the length of each sentence. 
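    Example (illustrative sketch with the 'o2o' mapping; it assumes a trained
    `analyzer` built with mapping='o2o'):

        seq, nb_classes = get_sequence("./validation_data/*")
        predict(seq, nb_classes, analyzer, nb_predictions=5,
                mapping='o2o', sentence_length=40)

    With 'o2o' only the single id following each random seed is written to
    stdout; with 'm2m' the whole shifted sentence y_pred is printed next to
    y_true.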
337 | """ 338 | # generate elements 339 | for _ in range(nb_predictions): 340 | # start index of the seed, random number in range 341 | start_index = np.random.randint(0, len(sequence) - sentence_length - 1) 342 | # seed sentence 343 | sentence = sequence[start_index : start_index + sentence_length] 344 | 345 | # Y_true 346 | y_true = sequence[start_index + 1 : start_index + sentence_length + 1] 347 | print "X: " + ' '.join(str(s).ljust(4) for s in sentence) 348 | 349 | seed = np.zeros((1, sentence_length, input_len)) 350 | # format input 351 | for t in range(0, sentence_length): 352 | seed[0, t, sentence[t]] = 1 353 | 354 | # get predictions 355 | # verbose = 0, no logging 356 | predictions = analyzer.model.predict(seed, verbose=0)[0] 357 | 358 | # y_predicted 359 | if mapping == 'o2o': 360 | next_id = np.argmax(predictions) 361 | sys.stdout.write(' ' + str(next_id)) 362 | sys.stdout.flush() 363 | elif mapping == 'm2m': 364 | next_sentence = [] 365 | for pred in predictions: 366 | next_sentence.append(np.argmax(pred)) 367 | print "y_pred: " + ' '.join(str(id_).ljust(4) 368 | for id_ in next_sentence) 369 | # next_id = np.argmax(predictions[-1]) 370 | 371 | # y_true 372 | print "y_true: " + ' '.join(str(s).ljust(4) for s in y_true) 373 | 374 | print "\n" 375 | 376 | 377 | def train(analyzer, train_sequence, val_sequence, input_len, 378 | batch_size=128, nb_epoch=50, nb_iterations=4, 379 | sentence_length=40, step=40, mapping='m2m'): 380 | """ 381 | Trains the network. 382 | 383 | Arguments: 384 | analyzer: {SequenceAnalyzer}. 385 | train_sequence: {list}, training sequence. 386 | val_sequence: {list}, validation sequence. 387 | input_len: {integer}, the number of classes, i.e., the input length of 388 | neural network. 389 | batch_size: {interger}, the number of sentences per batch. 390 | nb_epoch: {integer}, number of epoches per iteration. 391 | nb_iterations: {integer}, number of iterations. 392 | sentence_length: {integer}, the length of each training sentence. 393 | step: {integer}, the sample steps. 394 | mapping: {string}, input to output mapping. 395 | 'o2o': one-to-one 396 | 'm2m': many-to-many 397 | """ 398 | for iteration in range(1, nb_iterations+1): 399 | # create training data, randomize the offset between steps 400 | X_train, y_train = get_data(train_sequence, input_len, mapping=mapping, 401 | sentence_length=sentence_length, step=step, 402 | random_offset=False) 403 | X_val, y_val = get_data(val_sequence, input_len, mapping=mapping, 404 | sentence_length=sentence_length, step=step, 405 | random_offset=False) 406 | print "" 407 | print "------------------------ Start Training ------------------------" 408 | print "Iteration: ", iteration 409 | print "Number of epoch per iteration: ", nb_epoch 410 | 411 | # history of losses and accuracy 412 | history = History() 413 | 414 | # saves the model weights after each epoch 415 | # if the validation loss decreased 416 | checkpointer = ModelCheckpoint(filepath="weights.hdf5", 417 | verbose=1, save_best_only=True) 418 | 419 | # train the model 420 | analyzer.model.fit(X_train, y_train, 421 | batch_size=batch_size, nb_epoch=nb_epoch, verbose=1, 422 | callbacks=[history, checkpointer], 423 | validation_data=(X_val, y_val)) 424 | 425 | analyzer.save_model("weights-after-iteration.hdf5", overwrite=True) 426 | 427 | 428 | def detect(sequence, input_len, analyzer, mapping='m2m', sentence_length=40, 429 | nb_options=1): 430 | """ 431 | Scan the given sequence for detecting anormalies. 
432 | 433 | Arguments: 434 | sequence: {lsit}, the original input sequence 435 | input_len: {integer}, the number of unique id classes 436 | analyzer: {SequenceAnalyzer}, the sequence analyzer 437 | mapping: {string}, input to output mapping. 438 | 'o2o': one-to-one 439 | 'm2m': many-to-many 440 | sentence_length: {integer}, the length of each sentence. 441 | nb_options: {interger}, number of predicted options. 442 | """ 443 | # sequence length 444 | length = len(sequence) 445 | 446 | # predicted probabilities for each id 447 | # we assume the first sentence_length ids are true 448 | probs = np.zeros((nb_options+1, length)) 449 | for o in xrange(nb_options+1): 450 | probs[o][:sentence_length] = 1.0 451 | 452 | # probability in negative log scale 453 | log_probs = np.zeros((nb_options+1, length)) 454 | 455 | # count the number of correct predictions 456 | nb_correct = [0] * (nb_options+1) 457 | 458 | start_time = time.time() 459 | try: 460 | # generate elements 461 | for start_index in xrange(length - sentence_length): 462 | # seed sentence 463 | X = sequence[start_index : start_index + sentence_length] 464 | y_next_true = sequence[start_index + sentence_length] 465 | 466 | seed = np.zeros((1, sentence_length, input_len)) 467 | # format input 468 | for t in range(0, sentence_length): 469 | seed[0, t, X[t]] = 1 470 | 471 | # get predictions, verbose = 0, no logging 472 | predictions = np.asarray(analyzer.model.predict(seed, verbose=0)[0]) 473 | 474 | # y_predicted 475 | y_next_pred = [] 476 | next_probs = [0.0] * (nb_options+1) 477 | if mapping == 'o2o': 478 | # y_next_pred[np.argmax(predictions)] = True 479 | # get the top-nb_options predictions with the high probability 480 | y_next_pred = np.argsort(predictions)[-nb_options:][::-1] 481 | # get the probability of the y_true 482 | next_probs[0] = predictions[y_next_true] 483 | elif mapping == 'm2m': 484 | # y_next_pred[np.argmax(predictions[-1])] = True 485 | # get the top-nb_options predictions with the high probability 486 | y_next_pred = np.argsort(predictions[-1])[-nb_options:][::-1] 487 | # get the probability of the y_true 488 | next_probs[0] = predictions[-1][y_next_true] 489 | 490 | print y_next_pred, y_next_true 491 | # chech whether the y_true is in the top-predicted options 492 | for i in xrange(nb_options): 493 | if y_next_true == y_next_pred[i]: 494 | next_probs[i+1] = 1.0 495 | nb_correct[i+1] += 1 496 | 497 | next_probs = np.maximum.accumulate(next_probs) 498 | print next_probs 499 | 500 | for j in xrange(nb_options+1): 501 | probs[j, start_index + sentence_length] = next_probs[j] 502 | # get the negative log probability 503 | log_probs[j, start_index + sentence_length] = -log(next_probs[j]) 504 | 505 | print start_index, next_probs 506 | 507 | except KeyboardInterrupt: 508 | print "KeyboardInterrupt" 509 | 510 | nb_correct = np.add.accumulate(nb_correct) 511 | for p in xrange(nb_options+1): 512 | print "Accuracy %d: %.4f%%" %(p, (nb_correct[p] * 100.0 / 513 | (start_index + 1))) # pylint: disable=W0631 514 | 515 | print " |-Plot figures ..." 
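    # probs[0] holds the probability the model assigned to the true next id at
    # each position; row j >= 1 is the same value pushed up to 1.0 whenever the
    # true id appeared among the top-j predicted options (rows were made
    # monotone with np.maximum.accumulate above). For every row, a figure and a
    # text file are written in normal scale (prob_<j>) and in negative-log
    # scale (log_prob_<j>).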
516 | for q in xrange(nb_options+1): 517 | plot_and_write_prob(probs[q], 518 | "prob_"+str(q), 519 | [0, 50000, 0, 1], 520 | 'Normal') 521 | plot_and_write_prob(log_probs[q], 522 | "log_prob_"+str(q), 523 | [0, 50000, 0, 25], 524 | 'Log') 525 | 526 | stop_time = time.time() 527 | print "--- %s seconds ---\n" % (stop_time - start_time) 528 | 529 | return probs 530 | 531 | 532 | def plot_hist(prob, filename, plot_range, scale, cumulative, normed=True): 533 | """ 534 | Plot and write the (cumulative) probabilties distribution. 535 | """ 536 | if scale == 'Log': 537 | prob = [-p for p in prob] 538 | plt.hist(prob, bins=100, normed=normed, cumulative=cumulative) 539 | plt.ylabel('Probability in %s Scale' %scale) 540 | plt.ylabel('Distribution: Normalized=%s, Cumulated=%s.' %(normed, 541 | cumulative)) 542 | plt.grid(True) 543 | plt.axis(plot_range) 544 | plt.savefig(filename + ".png") 545 | plt.clf() 546 | plt.cla() 547 | 548 | 549 | def plot_and_write_prob(prob, filename, plot_range, scale): 550 | """ 551 | Plot and write the probabilties for each of the log. 552 | """ 553 | # print " |-Plot figures ..." 554 | plt.plot(prob, 'r*') 555 | plt.xlabel('Log') 556 | plt.ylabel('Probability in %s Scale' %scale) 557 | plt.axis(plot_range) 558 | plt.savefig(filename + ".png") 559 | plt.clf() 560 | plt.cla() 561 | 562 | # print " |-Write probabilities ..." 563 | with open(filename + '.txt', 'w') as prob_file: 564 | for p in prob: 565 | prob_file.write(str(p) + '\n') 566 | 567 | 568 | def run(hidden_len=512, batch_size=128, nb_epoch=50, nb_iterations=5, 569 | learning_rate=0.001, nb_predictions=20, mapping='m2m', 570 | sentence_length=40, step=40, mode='train'): 571 | """ 572 | Train, evaluate, or predict. 573 | 574 | Arguments: 575 | hidden_len: {integer}, the size of a hidden layer. 576 | batch_size: {interger}, the number of sentences per batch. 577 | nb_epoch: {interger}, number of epoches per iteration. 578 | nb_iterations: {integer}, number of iterations. 579 | learning_rate: {float}, learning rate. 580 | nb_predictions: {integer}, number of the ids predicted. 581 | mapping: {string}, input to output mapping. 582 | 'o2o': one-to-one 583 | 'm2m': many-to-many 584 | sentence_length: {integer}, the length of each training sentence. 585 | step: {integer}, the sample steps. 586 | mode: {string}, th running mode of this programm 587 | 'train': train and predict 588 | 'predict': only predict by loading existing model weights 589 | 'evaluate': evaluate the model in evaluation data set 590 | 'detect': detect a new log sequence for the probabilities 591 | """ 592 | # get parameters and dimensions of the model 593 | print "Loading training data..." 594 | train_sequence, input_len1 = get_sequence("./train_data/*") 595 | print "Loading validation data..." 
596 | val_sequence, input_len2 = get_sequence("./validation_data/*") 597 | input_len = max(input_len1, input_len2) 598 | 599 | print "Training sequence length: %d" %len(train_sequence) 600 | print "Validation sequence length: %d" %len(val_sequence) 601 | print "#classes: %d\n" %input_len 602 | 603 | # two layered LSTM 512 hidden nodes and a dropout rate of 0.2 604 | rnn = SequenceAnalyzer(sentence_length, input_len, hidden_len, input_len) 605 | 606 | # build model 607 | rnn.build(layer='LSTM', mapping=mapping, learning_rate=learning_rate, 608 | nb_layers=3, dropout=0.5) 609 | 610 | # plot model 611 | # rnn.plot_model() 612 | 613 | # load the previous model weights 614 | # rnn.load_model("weights-after-iteration-l1.hdf5") 615 | 616 | if mode == 'predict': 617 | print "Predict..." 618 | predict(val_sequence, input_len, rnn, nb_predictions=nb_predictions, 619 | mapping=mapping, sentence_length=sentence_length) 620 | elif mode == 'evaluate': 621 | print "Evaluate..." 622 | print "Metrics: " + ', '.join(rnn.model.metrics_names) 623 | X_val, y_val = get_data(val_sequence, input_len, mapping=mapping, 624 | sentence_length=sentence_length, step=step, 625 | random_offset=False) 626 | results = rnn.model.evaluate(X_val, y_val, #pylint: disable=W0612 627 | batch_size=batch_size, 628 | verbose=1) 629 | print "Loss: ", results[0] 630 | print "Accuracy: ", results[1] 631 | elif mode == 'train': 632 | print "Train..." 633 | try: 634 | train(rnn, train_sequence, val_sequence, input_len, 635 | batch_size=batch_size, nb_epoch=nb_epoch, 636 | nb_iterations=nb_iterations, 637 | sentence_length=sentence_length, 638 | step=step, mapping=mapping) 639 | except KeyboardInterrupt: 640 | rnn.save_model("weights-stop.hdf5", overwrite=True) 641 | elif mode == 'detect': 642 | print "Detect..." 643 | detect(val_sequence, input_len, rnn, mapping=mapping, 644 | sentence_length=sentence_length, nb_options=3) 645 | else: 646 | print "The mode = %s is not correct!!!" %mode 647 | 648 | return mode 649 | 650 | 651 | if __name__ == '__main__': 652 | run() 653 | --------------------------------------------------------------------------------
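A minimal end-to-end usage sketch (illustrative only; it assumes `./train_data/` and `./validation_data/` directories containing text files with one integer id per line, which is the format `get_sequence()` expects):

```python
# Sketch of driving rnn_sequence_analyzer.py; all keyword arguments below
# exist in run() with these defaults. Switch mode to 'evaluate', 'predict'
# or 'detect' once weights.hdf5 has been written during training.
from rnn_sequence_analyzer import run

run(hidden_len=512, batch_size=128, nb_epoch=50, nb_iterations=5,
    learning_rate=0.001, mapping='m2m', sentence_length=40, step=40,
    mode='train')
```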