├── Deep Neural Network.ipynb
├── Gradient Boosting Machine.ipynb
├── Modeling.ipynb
├── Preprocessing.ipynb
├── README.md
└── utilities.py


/Deep Neural Network.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Introduction: Deep Neural Network\n",
  8 |     "\n",
  9 |     "In this notebook we will develop a deep neural network to apply to the building energy prediction task. This is a supervised regression problem where the objective is build a model that is trained on past electricity consumption and weather information and then makes predictions for future energy consumption. \n",
 10 |     "\n",
 11 |     "Deep neural networks have gained immense popularity in recent years with extraordinary performance on many tasks. This includes above human level performance in computer vision - using convolutional neural networks - and natural language processing - using recurrent neural networks - problems. However, neural networks require a considerable amount of data points in order to learn the mapping from the features to the target, particularly as the depth of the network increases. For that reason, neural networks are typically not as succesful on small to medium sized datasets such as the building energy data used by EDIFES. The majority of the building datasets are under 1e6 observations, which may not be enough for a neural network to learn. Nonetheless, we will build a neural network that can then be tested on all of the buildings for performance relative to the other models.\n",
 12 |     "\n",
 13 |     "As in previous notebooks, we will go through the implementation step-by-step, and then refactor the code into a single function. The end objective is a function that can take in the training features, training targets, testing features, and testing targets, and return the model performance. This can then be integrate into the previous develop `evaluate_models` function. After the addition of the deep neural network to the set of models, there will be eight models that can be run on hundreds of building datasets. "
 14 |    ]
 15 |   },
 16 |   {
 17 |    "cell_type": "markdown",
 18 |    "metadata": {},
 19 |    "source": [
 20 |     "### Imports \n",
 21 |     "\n",
 22 |     "We will use a standard stack of data science libraries: `pandas`, `numpy`, `sklearn`, `matplotlib` along with `keras` for the deep neural network implementation. See the `requirements.txt` file for the correct version of these libraries to install. "
 23 |    ]
 24 |   },
 25 |   {
 26 |    "cell_type": "code",
 27 |    "execution_count": 1,
 28 |    "metadata": {},
 29 |    "outputs": [
 30 |     {
 31 |      "name": "stderr",
 32 |      "output_type": "stream",
 33 |      "text": [
 34 |       "Using TensorFlow backend.\n"
 35 |      ]
 36 |     }
 37 |    ],
 38 |    "source": [
 39 |     "# numpy and pandas for data manipulation\n",
 40 |     "import pandas as pd\n",
 41 |     "import numpy as np\n",
 42 |     "\n",
 43 |     "# Sklearn preprocessing functionality\n",
 44 |     "from sklearn.preprocessing import LabelEncoder, MinMaxScaler\n",
 45 |     "\n",
 46 |     "# Matplotlib for visualizations\n",
 47 |     "import matplotlib.pyplot as plt\n",
 48 |     "\n",
 49 |     "# Adjust default font size \n",
 50 |     "plt.rcParams['font.size'] = 18\n",
 51 |     "\n",
 52 |     "# Keras for neural networks\n",
 53 |     "from keras import models, layers, optimizers, losses, metrics, callbacks\n",
 54 |     "\n",
 55 |     "# Timer for recording runtime\n",
 56 |     "from timeit import default_timer as timer\n",
 57 |     "\n",
 58 |     "# Utilities developed for project\n",
 59 |     "from utilities import preprocess_data"
 60 |    ]
 61 |   },
 62 |   {
 63 |    "cell_type": "code",
 64 |    "execution_count": 2,
 65 |    "metadata": {},
 66 |    "outputs": [],
 67 |    "source": [
 68 |     "df = pd.read_csv('../data/f-APS_weather.csv')\n",
 69 |     "train, train_targets, test, test_targets = preprocess_data(df)"
 70 |    ]
 71 |   },
 72 |   {
 73 |    "cell_type": "markdown",
 74 |    "metadata": {},
 75 |    "source": [
 76 |     "# Deep Neural Network\n",
 77 |     "\n",
 78 |     "A fully-connected deep neural network is made up of many layers consisting of matrix multiplies and bias additions. The outputs of the matrix multiplies are passed through a non-linearity which is where the neural network gets its power to approximate any function (provided it has enough hidden units). Each layer is made up of a number of units (also called neurons), generally a multiple of 2. The number of neurons per layer and the number of layers determine the model capacity: a model with more layers and more neurons per layer will have greater capacity to learn. However, the more total neurons, the greater the chance the model will overfit to the training data, especially with a limited amount of training data. A general technique is to start with a shallow network (a few layers) and keep adding more layers until the model overfits. Then, either remove layers as needed or employ some form of model regularization. Choosing an architecture is the most critical part of the deep neural network modeling, and the ideal architecture depends on the problem. We are going to use the same neural network across all buildings even though the best model probably differs significantly from building to building. Optimizing neural network architectures automatically is an open research question, and applying a random search through model architectures might be one option although for this problem we will only use one architecture.\n",
 79 |     "\n",
 80 |     "# Deep Neural Network Hyperparameters\n",
 81 |     "\n",
 82 |     "There are many hyperparameters we need to choose for a deep neural network:\n",
 83 |     "\n",
 84 |     "* Number of layers\n",
 85 |     "* Number of units per layer\n",
 86 |     "* Non-linear activation on hidden layers\n",
 87 |     "* Regularization methods\n",
 88 |     "* Optimizer (algorithm) used for minimizing the objective function\n",
 89 |     "* Learning rate of the optimizer\n",
 90 |     "* Number of training iterations\n",
 91 |     "* Batch size\n",
 92 |     "\n",
 93 |     "## Number of Layers\n",
 94 |     "\n",
 95 |     "We will use a total of 7 layers: 1 input, 5 hidden, and 1 output layer. Generally, as the depth of the network increases, the model performs better on the training data because it has a greater capacity. To prevent overfitting though, we do not want to make the model too deep. 5 hidden layers was selected based on observing the training curves and validation scores of recorded by different models across several building datasets. More layers generally tended to not improve performance on the validation set, and fewer layers meant the model was not able to learn even the training data. \n",
 96 |     "\n",
 97 |     "## Units per Layer\n",
 98 |     "\n",
 99 |     "The number of units per layers will be: 32, 64, 128, 256, 512, 1024, 1. The number of output units must be 1 because the network outputs a single prediction.\n",
100 |     "\n",
101 |     "## Activations\n",
102 |     "\n",
103 |     "The output layer in a regression problem must have no activation because we are interested in predicting a continous value that can take on any positive value (the output layer in a binary classification task has a `sigmoid` activation and the output layer for a multiclass classification task has a `softmax` activation).\n",
104 |     "\n",
105 |     "The current recommedation for developing fully-connected deep neural networks is to use a ReLU (rectified linear) activation on the hidden layers along with Batch Normalization. The \n",
106 |     "definition of the ReLU function is $$f(x) = max(0, x)$$. This activation prevents the vanishing gradient problem common with saturating activation functions such as `tanh` or `sigmoid`. ReLU is very quick to train and has demonstrated good accuracy. The drawbacks are that it can lead to \"dead\" neurons because any value less than 0 is set to 0. Moreover, the mean activation will be greater than 0, which can lead to issues with training stability. The mean activation issue can be addressed by applying a Batch Normalization layer after each activation. There are a number of other activation functions which have been developed in recent years, but ReLU + Batch Norm remains a good default choice with fast training times and reasonable generalization to the test data. \n",
107 |     "\n",
108 |     "## Regularization Methods\n",
109 |     "\n",
110 |     "Regularization is used to address overfitting on the training data. This can be done by placing penalties on the L1 or L2 norm of the weight matrices that are added to the objective function. This encourages smaller magnitude weights which reduces the variance of the model. Other methods of regularization include data augmentation (typically for image data) or adding noise to the targets to encourage robustness to small perturbances. We will employ two methods of regularization: early stopping (discussion in the training iterations section below) and dropout. \n",
111 |     "\n",
112 |     "#### Dropout \n",
113 |     "\n",
114 |     "[Improving Neural Networks by Preventing Co-Adaptation of Feature Detectors](https://arxiv.org/pdf/1207.0580.pdf)\n",
115 |     "\n",
116 |     "Dropout has become one of the most popular methods for regularization. It is a techique that randomly drops (sets the activations to 0) a fraction of the units in a layer during training. The fraction is usually around 0.5 and can be different across layers. The idea behind dropout is it encourages the model to be robust and have less variance because neurons in a subsequent layer cannot depend on the neuron output in the previous  layer. Dropout is applied to the outputs of fully connected hidden layers typically before the activation function. Dropout must not be used during testing! (Keras will handle this for us automatically). We will use dropout with a rate of 0.5 after every hidden layer except for the final hidden layer before the output. "
117 |    ]
118 |   },
119 |   {
120 |    "cell_type": "markdown",
121 |    "metadata": {},
122 |    "source": [
123 |     "### Architecture Implementation\n",
124 |     "\n",
125 |     "Below is the code to create the basic architecture described above. We use fully connected layers with \"relu\" activations for all of the hidden layers followed by dropout with a rate of 0.5 (except for the final hidden layer) and batch norm applied after every hidden layer. The final output layer has no activation because this is a regression problem."
126 |    ]
127 |   },
128 |   {
129 |    "cell_type": "code",
130 |    "execution_count": 3,
131 |    "metadata": {},
132 |    "outputs": [
133 |     {
134 |      "name": "stdout",
135 |      "output_type": "stream",
136 |      "text": [
137 |       "_________________________________________________________________\n",
138 |       "Layer (type)                 Output Shape              Param #   \n",
139 |       "=================================================================\n",
140 |       "dense_1 (Dense)              (None, 32)                704       \n",
141 |       "_________________________________________________________________\n",
142 |       "dropout_1 (Dropout)          (None, 32)                0         \n",
143 |       "_________________________________________________________________\n",
144 |       "batch_normalization_1 (Batch (None, 32)                128       \n",
145 |       "_________________________________________________________________\n",
146 |       "dense_2 (Dense)              (None, 64)                2112      \n",
147 |       "_________________________________________________________________\n",
148 |       "dropout_2 (Dropout)          (None, 64)                0         \n",
149 |       "_________________________________________________________________\n",
150 |       "batch_normalization_2 (Batch (None, 64)                256       \n",
151 |       "_________________________________________________________________\n",
152 |       "dense_3 (Dense)              (None, 128)               8320      \n",
153 |       "_________________________________________________________________\n",
154 |       "dropout_3 (Dropout)          (None, 128)               0         \n",
155 |       "_________________________________________________________________\n",
156 |       "batch_normalization_3 (Batch (None, 128)               512       \n",
157 |       "_________________________________________________________________\n",
158 |       "dense_4 (Dense)              (None, 256)               33024     \n",
159 |       "_________________________________________________________________\n",
160 |       "dropout_4 (Dropout)          (None, 256)               0         \n",
161 |       "_________________________________________________________________\n",
162 |       "batch_normalization_4 (Batch (None, 256)               1024      \n",
163 |       "_________________________________________________________________\n",
164 |       "dense_5 (Dense)              (None, 512)               131584    \n",
165 |       "_________________________________________________________________\n",
166 |       "dropout_5 (Dropout)          (None, 512)               0         \n",
167 |       "_________________________________________________________________\n",
168 |       "batch_normalization_5 (Batch (None, 512)               2048      \n",
169 |       "_________________________________________________________________\n",
170 |       "dense_6 (Dense)              (None, 1024)              525312    \n",
171 |       "_________________________________________________________________\n",
172 |       "batch_normalization_6 (Batch (None, 1024)              4096      \n",
173 |       "_________________________________________________________________\n",
174 |       "dense_7 (Dense)              (None, 1)                 1025      \n",
175 |       "=================================================================\n",
176 |       "Total params: 710,145\n",
177 |       "Trainable params: 706,113\n",
178 |       "Non-trainable params: 4,032\n",
179 |       "_________________________________________________________________\n"
180 |      ]
181 |     }
182 |    ],
183 |    "source": [
184 |     "model = models.Sequential()\n",
185 |     "\n",
186 |     "# Input layer\n",
187 |     "model.add(layers.Dense(32, activation=\"relu\", input_shape = (train.shape[1], )))\n",
188 |     "model.add(layers.Dropout(0.5))\n",
189 |     "model.add(layers.BatchNormalization())\n",
190 |     "\n",
191 |     "# Five hidden layers\n",
192 |     "model.add(layers.Dense(64, activation = \"relu\"))\n",
193 |     "model.add(layers.Dropout(0.5))\n",
194 |     "model.add(layers.BatchNormalization())\n",
195 |     "model.add(layers.Dense(128, activation = \"relu\"))\n",
196 |     "model.add(layers.Dropout(0.5))\n",
197 |     "model.add(layers.BatchNormalization())\n",
198 |     "model.add(layers.Dense(256, activation = \"relu\"))\n",
199 |     "model.add(layers.Dropout(0.5))\n",
200 |     "model.add(layers.BatchNormalization())\n",
201 |     "model.add(layers.Dense(512, activation = \"relu\"))\n",
202 |     "model.add(layers.Dropout(0.5))\n",
203 |     "model.add(layers.BatchNormalization())\n",
204 |     "model.add(layers.Dense(1024, activation = \"relu\"))\n",
205 |     "model.add(layers.BatchNormalization())\n",
206 |     "\n",
207 |     "# Output layer\n",
208 |     "model.add(layers.Dense(1, activation = None))\n",
209 |     "\n",
210 |     "model.summary()"
211 |    ]
212 |   },
213 |   {
214 |    "cell_type": "markdown",
215 |    "metadata": {},
216 |    "source": [
217 |     "## Optimizer\n",
218 |     "\n",
219 |     "All of the optimizers now used in neural networks are variants on vanilla gradient descent. For example, Nesterov momentum uses the concept of momentum based on exponentially decaying moving averages of past gradients to encourage stable training. Another family of optimizers, represented by AdaGrad, RMSProp, and Adam use adaptive learning rates to encourage faster training and better convergence. The currently recommended default optimizer is Adam (Adaptive Moment Estimation) which adaptively computes the learning rate for each individual parameter based of estimates of the first and second moments of the gradients. There are a number of hyperparamters associated with the Adam optimizer besides the learning rate, but these can usually be left at the defaults. \n",
220 |     "\n",
221 |     "### Learning Rate of the Optimizer\n",
222 |     "\n",
223 |     "A learning rate that is too slow will take too long to converge, while a learning rate that is too high will not lead to the model \"jumping\" around the optimum of the objective function. While a good rule is to leave the learning rate low and train for a large number of epochs, training can go faster by starting with a higher learning rate and gradually decreasing it over the number of epochs. This is called learning rate decay.  \n",
224 |     "\n",
225 |     "#### Learning Rate Decay\n",
226 |     "\n",
227 |     "Learning rate decay gradually decreases the learning rate as the model trains. The idea is that the optimizer can take large steps at the beginning of training and then smaller steps as it gets closer to an optimum of the objective function. Learning rate decay is often linear or exponential. However, with the Adam optimization algorithm, the __learning rate is adaptively computed for individual parameters__. When using Adam, a learning rate decay schedule is typically not employed because of the adaptive learning rate. \n",
228 |     "\n",
229 |     "We will therefore use the Adam optimizer with the default parameters in Keras:\n",
230 |     "\n",
231 |     "* `lr = 0.001`\n",
232 |     "* `beta_1 = 0.9`\n",
233 |     "* `beta_2 = 0.999`\n",
234 |     "* `epsilon = None`\n",
235 |     "* `decay = 0.0`\n",
236 |     "* `amsgrad = False` "
237 |    ]
238 |   },
239 |   {
240 |    "cell_type": "code",
241 |    "execution_count": 4,
242 |    "metadata": {},
243 |    "outputs": [],
244 |    "source": [
245 |     "# Define the optimizer\n",
246 |     "opt = optimizers.Adam(lr = 0.001, beta_1 = 0.9, beta_2 = 0.999,\n",
247 |     "                      epsilon = None, decay = 0.0, amsgrad = False)\n",
248 |     "\n",
249 |     "# Compile the model with specified optimizer\n",
250 |     "model.compile(optimizer = opt, loss = \"mean_absolute_percentage_error\",\n",
251 |     "              metrics = [\"mean_absolute_percentage_error\"])"
252 |    ]
253 |   },
254 |   {
255 |    "cell_type": "markdown",
256 |    "metadata": {},
257 |    "source": [
258 |     "## Number of Training Epochs\n",
259 |     "\n",
260 |     "The number of training epochs is another crucial hyperparamter of a neural network. This represents the number of complete passes through the training data. Too few passes and the network will not learn the relationship, and too many passes and the network will overfit. Fortunately, there is a simple method to determine the best number of iterations that we have already employed when building a gradient boosting machine: early stopping.\n",
261 |     "\n",
262 |     "### Early Stopping\n",
263 |     "\n",
264 |     "Early stopping refers to training until the loss on a validation set does not decrease for a certain number of epochs. Early stopping is used to prevent overfitting by stopping training when the generalization error (as estimated by the validation set) has reached a minimum. We will implement early stopping by cutting off training after the validation loss has not decresed for 10 iterations. We will use 20% of the training data for validation (this will be randomly sampled from the data on every epoch). When implementing early stopping, we need to save the model parameters (weights and biases) every time the validation loss decreases. Then, when the training has stopped, we load the model weights that achieved the lowest error on the validation data. This model is then used to make precitions. Unfortunately this still requires writing the model weights to disk after every decrease in validation error which is not very efficient. Early stopping is always recommended as a simple method of regularization. The maximum number of epochs will be set at 100. \n",
265 |     "\n",
266 |     "## Batch Size \n",
267 |     "\n",
268 |     "The batch size refers to the number of training examples passed through the network at a time. All the neural network optimizers operate on minibatches of examples rather than processing the whole training set at once. Each batch is fed forward through the network, then backpropagation is used to calculate the gradients of the objective function with respect to the model parameters. Then the optimizer updates the weights according to the gradients and the learning rate. The next batch can then be passed through with the updated parameters. One pass of all the training data through the network is referred to as one epoch. Smaller batch sizes typically lead to better generalization performance although they may increase training time. Batch sizes are almost always a power of 2 to take advantage of computer architectures. We will use a batch size of 16 which means passing 16 training datapoints to the model at a time. The number of iterations per epoch will be (number of training datapoints - number of validation datapoints) / batch size."
269 |    ]
270 |   },
271 |   {
272 |    "cell_type": "markdown",
273 |    "metadata": {},
274 |    "source": [
275 |     "### Training Implementation\n",
276 |     "\n",
277 |     "We will train the model implementing the above description. The model uses early stopping with a patience of 10 epochs and 20% of the data used for validation. We save a copy of the model weights to disk every time the validation loss decreases. Training information will be available in the `history` variable."
278 |    ]
279 |   },
280 |   {
281 |    "cell_type": "code",
282 |    "execution_count": 5,
283 |    "metadata": {},
284 |    "outputs": [
285 |     {
286 |      "name": "stdout",
287 |      "output_type": "stream",
288 |      "text": [
289 |       "Train on 71416 samples, validate on 17855 samples\n",
290 |       "Epoch 1/100\n",
291 |       "71416/71416 [==============================] - 50s 700us/step - loss: 33.5104 - mean_absolute_percentage_error: 33.5104 - val_loss: 33.9968 - val_mean_absolute_percentage_error: 33.9968\n",
292 |       "Epoch 2/100\n",
293 |       "71416/71416 [==============================] - 49s 680us/step - loss: 22.1322 - mean_absolute_percentage_error: 22.1322 - val_loss: 29.0347 - val_mean_absolute_percentage_error: 29.0347\n",
294 |       "Epoch 3/100\n",
295 |       "71416/71416 [==============================] - 47s 660us/step - loss: 20.2166 - mean_absolute_percentage_error: 20.2166 - val_loss: 68.6281 - val_mean_absolute_percentage_error: 68.6281\n",
296 |       "Epoch 4/100\n",
297 |       "71416/71416 [==============================] - 49s 686us/step - loss: 19.0316 - mean_absolute_percentage_error: 19.0316 - val_loss: 41.4244 - val_mean_absolute_percentage_error: 41.4244\n",
298 |       "Epoch 5/100\n",
299 |       "71416/71416 [==============================] - 48s 666us/step - loss: 18.6532 - mean_absolute_percentage_error: 18.6532 - val_loss: 79.8662 - val_mean_absolute_percentage_error: 79.8662\n",
300 |       "Epoch 6/100\n",
301 |       "71416/71416 [==============================] - 48s 670us/step - loss: 17.7975 - mean_absolute_percentage_error: 17.7975 - val_loss: 66.6839 - val_mean_absolute_percentage_error: 66.6839\n",
302 |       "Epoch 7/100\n",
303 |       "71416/71416 [==============================] - 50s 695us/step - loss: 17.4435 - mean_absolute_percentage_error: 17.4435 - val_loss: 68.6557 - val_mean_absolute_percentage_error: 68.6557\n",
304 |       "Epoch 8/100\n",
305 |       "71416/71416 [==============================] - 55s 776us/step - loss: 17.1454 - mean_absolute_percentage_error: 17.1454 - val_loss: 33.1360 - val_mean_absolute_percentage_error: 33.1360\n",
306 |       "Epoch 9/100\n",
307 |       "71416/71416 [==============================] - 51s 716us/step - loss: 16.9902 - mean_absolute_percentage_error: 16.9902 - val_loss: 69.9849 - val_mean_absolute_percentage_error: 69.9849\n",
308 |       "Epoch 10/100\n",
309 |       "71416/71416 [==============================] - 48s 678us/step - loss: 16.7171 - mean_absolute_percentage_error: 16.7171 - val_loss: 45.7916 - val_mean_absolute_percentage_error: 45.7916\n",
310 |       "Epoch 11/100\n",
311 |       "71416/71416 [==============================] - 50s 695us/step - loss: 16.5480 - mean_absolute_percentage_error: 16.5480 - val_loss: 48.3490 - val_mean_absolute_percentage_error: 48.3490\n",
312 |       "Epoch 12/100\n",
313 |       "71416/71416 [==============================] - 50s 704us/step - loss: 16.8150 - mean_absolute_percentage_error: 16.8150 - val_loss: 67.3812 - val_mean_absolute_percentage_error: 67.3812\n"
314 |      ]
315 |     }
316 |    ],
317 |    "source": [
318 |     "# Early stopping and model checkpoint\n",
319 |     "callback_list = [callbacks.EarlyStopping(monitor = \"val_loss\", patience=10),\n",
320 |     "                 callbacks.ModelCheckpoint(filepath = \"models/aps_model.h5\",\n",
321 |     "                                           monitor = \"val_loss\",\n",
322 |     "                                           save_best_only = True,\n",
323 |     "                                           save_weights_only = True)]\n",
324 |     "\n",
325 |     "# Train the model\n",
326 |     "history = model.fit(train, train_targets, batch_size = 32,\n",
327 |     "                    callbacks=callback_list, epochs = 100,\n",
328 |     "                    validation_split = 0.2)"
329 |    ]
330 |   },
331 |   {
332 |    "cell_type": "markdown",
333 |    "metadata": {},
334 |    "source": [
335 |     "## Making Predictions \n",
336 |     "\n",
337 |     "To make predictions, we need to load in the best model weights (which we saved periodically during training). Then we simply called predict on the model."
338 |    ]
339 |   },
340 |   {
341 |    "cell_type": "code",
342 |    "execution_count": 6,
343 |    "metadata": {},
344 |    "outputs": [],
345 |    "source": [
346 |     "# Load the model weights\n",
347 |     "model.load_weights('models/aps_model.h5')\n",
348 |     "\n",
349 |     "# Make predictions on the test data\n",
350 |     "predictions = model.predict(test)"
351 |    ]
352 |   },
353 |   {
354 |    "cell_type": "markdown",
355 |    "metadata": {},
356 |    "source": [
357 |     "## Visualizing Training Curves\n",
358 |     "\n",
359 |     "One diagnostic tool of a deep neural network model are the training curves. These show the training and validation error as the epochs progress. Training curves allow us to determine if the model is overfitting or if the validation error was still decreasing when training stopped. Based on this information, if the model was overfitting, we could add regularization or decrease the capacity of the model (fewer layers / fewer neurons per layer). If the validation loss was still decreasing at the end of training, then we would want to increase the number of training epochs. The training and validation losses are saved in the `history` variable. We can write a short function to visualize the training curves."
360 |    ]
361 |   },
362 |   {
363 |    "cell_type": "code",
364 |    "execution_count": 7,
365 |    "metadata": {},
366 |    "outputs": [],
367 |    "source": [
368 |     "# Plot the training history of a keras model\n",
369 |     "def plot_history(history):\n",
370 |     "    val_loss = history.history['val_loss']\n",
371 |     "    train_loss = history.history['loss']\n",
372 |     "    epochs = [int(i) for i in list(range(1, len(val_loss) + 1))]\n",
373 |     "    \n",
374 |     "    plt.figure(figsize=(8, 6))\n",
375 |     "    \n",
376 |     "    plt.plot(epochs, train_loss, 'bo-', label = 'training loss')\n",
377 |     "    plt.plot(epochs, val_loss, 'ro-', label = 'validation loss')\n",
378 |     "    plt.xlabel('Epoch'); plt.ylabel('MAPE'); plt.title('Training Curves')\n",
379 |     "    plt.legend();\n",
380 |     "    plt.show()"
381 |    ]
382 |   },
383 |   {
384 |    "cell_type": "code",
385 |    "execution_count": 8,
386 |    "metadata": {},
387 |    "outputs": [
388 |     {
389 |      "data": {
390 |       "image/png": "iVBORw0KGgoAAAANSUhEUgAAAfEAAAGDCAYAAAA72Cm3AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzs3Xl8lNX1P/DPIYQl7LsIEkBB2VdpIgqiohKsC0FrjS36s6K0tVpr69a61NJqSy1fq2Bx6ddvG1dcW1ELCoJVk8ywb4qy72EJW1gCub8/zjxkEiazZZ55nmfm83698prMZJaTEHLm3nvuuWKMAREREXlPPacDICIiovgwiRMREXkUkzgREZFHMYkTERF5FJM4ERGRRzGJExEReRSTOJGHiEiGiBwUkS6JvC8ReROTOJGNAknU+qgUkcNB1wtifT5jzAljTFNjzMZE3jceInKOiMwUkd0iUiYii0XkLhHh3xWiJOF/NiIbBZJoU2NMUwAbAXw36LbCmvcXkfrJjzJ2ItIDwJcA1gLoa4xpCeD7AHIBZMXxfJ74vonchkmcyEEi8jsReU1EXhGRAwBuFJFcEfkyMLrdJiJPiUhm4P71RcSISNfA9X8Gvv6BiBwQkS9EpFus9w18fYyIfC0i+0TkryLyXxG5qZbQHwPwqTHmV8aYbQBgjFlljPmeMeagiFwiIutrfK+bReTCWr7v+0WkXERaBN3/XBHZaSV4EfmRiKwWkb2B7+GMwO31At/XzkDsS0Wkd53+YYg8gkmcyHnXAHgZQAsArwE4DuBOAG0BDAdwOYDbwjz+BgC/AdAaOtp/LNb7ikh7AK8D+GXgddcBGBbmeS4BMDP8txVR8Pc9BYAPwLgasb5ujDkuIuMDsV0FoB2AosBjAWAMgBwAPQC0AnA9gD11jI3IE5jEiZz3mTHmX8aYSmPMYWNMiTGmyBhz3BizFsAMACPDPH6mMcZnjKkAUAhgYBz3vQLAYmPMu4Gv/QXArjDP0xrAtmi/wVpU+76hSfn7gI6uAXwPVYn6NgC/N8Z8ZYw5DuB3AIaJSCcAFQCaAzgHAIwxK40x2+sYG5EnMIkTOW9T8JVAwdj7IrJdRPYD+C10dFyb4IRVDqBpHPc9PTgOoycjbQ7zPHsAdAzz9WhsqnH9DQAXiEgHAKMAHDHGfB74WjaAZwJLDGXQNxiVADobY/4D4FkA0wHsEJFnRaRZHWMj8gQmcSLn1TxK8G8AlgM4yxjTHMBDAMTmGLYB6GxdEREB0CnM/ecAyA/z9UMIKnALrGu3qXGfat+3MWY3gE8AXAudSn8l6MubANxijGkZ9NHYGFMUeOxUY8xgAH0B9AZwd5jYiFIGkziR+zQDsA/AIRHphfDr4YnybwCDReS7gYR7J3TtuTYPAbhQRP4gIqcBgIj0FJGXRaQpgNUAmonIZYGivIcBZEYRx8sAJkDXxl8Ouv1ZAA8Gfh4QkZaBdXKIyLDAR33om4djAE5E/60TeReTOJH7/AKayA5AR+Wv2f2Cxpgd0DXoJwHsBnAmgEUAjtZy/6+h28l6AlgZmOJ+HbrtrNwYsxfAHQBeArAFOv0ezTr1O9CR9EZjzIqg13sjENsbgSWGpQAuC3y5JYAXAJQBWA+dVfhLlN86kaeJLn0REVURkQwAWwGMN8YscDoeIgqNI3EiAgCIyOUi0kJEGkK3oR0HUOxwWEQUBpM4EVnOh3Zg2wXdm361MSbkdDoRuQOn04mIiDyKI3EiIiKPYhInIiLyKE+cHNS2bVvTtWtXp8MgIiJKCr/fv8sYE65XAwCPJPGuXbvC5/M5HQYREVFSiMiGaO7H6XQiIiKPYhInIiLyKCZxIiIij/LEmjgREUWvoqICmzdvxpEjR5wOhSJo1KgROnfujMzMaM4HOhWTOBFRitm8eTOaNWuGrl27Qk+VJTcyxmD37t3YvHkzunXrFtdzcDqdiCjFHDlyBG3atGECdzkRQZs2beo0Y8IkTkSUgpjAvaGu/05M4kRElFBlZWWYNm1aXI/Ny8tDWVlZ2Ps89NBDmDNnTlzPX1PXrl2xa9euhDyXE2xN4iLycxFZISLLReQVEWkkIt1EpEhE1ojIayLSwM4YiIgovMJCoGtXoF49vSwsrNvzhUviJ06cCPvYWbNmoWXLlmHv89vf/haXXHJJ3PGlEtuSuIh0AvAzAEONMX0BZAC4HsATAP5ijOkBYC+AW+yKgchWif7LR+SAwkJg4kRgwwbAGL2cOLFuv8733Xcfvv32WwwcOBC//OUvMW/ePIwaNQo33HAD+vXrBwC4+uqrMWTIEPTp0wczZsw4+VhrZLx+/Xr06tULt956K/r06YNLL70Uhw8fBgDcdNNNmDlz5sn7P/zwwxg8eDD69euH1atXAwBKS0sxevRoDB48GLfddhuys7MjjriffPJJ9O3bF3379sXUqVMBAIcOHcLYsWMxYMAA9O3bF6+99trJ77F3797o378/7rnnnvh/WHVkd3V6fQCNRaQCQBaAbQAuAnBD4OsvAXgEwHSb4yBKLOsvX3m5Xrf+8gFAQYFzcRHVcNddwOLFtX/9yy+BozVOjS8vB265BXjuudCPGTgQCOS4kB5//HEsX74ciwMvPG/ePBQXF2P58uUnq7BffPFFtG7dGocPH8a5556L/Px8tGnTptrzrFmzBq+88gqee+45XHfddXjzzTdx4403nvJ6bdu2xcKFCzFt2jRMmTIFzz//PB599FFcdNFFuP/++/Hhhx9We6MQit/vx9///ncUFRXBGIPvfOc7GDlyJNauXYvTTz8d77//PgBg37592LNnD95++22sXr0aIhJx+t9Oto3EjTFbAEwBsBGavPcB8AMoM8YcD9xtM4BOdsVAZJsHH6xK4Jbycr2dyENqJvBIt8dr2LBh1bZRPfXUUxgwYABycnKwadMmrFmz5pTHdOvWDQMHDgQADBkyBOvXrw/53OPGjTvlPp999hmuv/56AMDll1+OVq1ahY3vs88+wzXXXIMmTZqgadOmGDduHBYsWIB+/fphzpw5uPfee7FgwQK0aNECzZs3R6NGjfCjH/0Ib731FrKysmL9cSSMbSNxEWkF4CoA3QCUAXgDwJgQdzW1PH4igIkA0KVLF5uiJIrTxo2x3U7kkHAjZkBXgjaEOGojOxuYNy9xcTRp0uTk5/PmzcOcOXPwxRdfICsrCxdeeGHIbVYNGzY8+XlGRsbJ6fTa7peRkYHjx3WMaEzI1FKr2u7fs2dP+P1+zJo1C/fffz8uvfRSPPTQQyguLsbHH3+MV199FU8//TQ++eSTmF4vUewsbLsEwDpjTKkxpgLAWwDOA9BSRKw3D50BbA31YGPMDGPMUGPM0HbtIp7GRpRctb2x5BtO8pjJk4GaA8msLL09Xs2aNcOBAwdq/fq+ffvQqlUrZGVlYfXq1fjyyy/jf7FanH/++Xj99dcBAP/5z3+wd+/esPcfMWIE3nnnHZSXl+PQoUN4++23ccEFF2Dr1q3IysrCjTfeiHvuuQcLFy7EwYMHsW/fPuTl5WHq1Kknlw2cYOea+EYAOSKSBeAwgIsB+ADMBTAewKsAJgB418YYiOwxeTJw881ARUXVbXX9y0fkAKuE48EHdSKpSxf9Na5LaUebNm0wfPhw9O3bF2PGjMHYsWOrff3yyy/Hs88+i/79++Pss89GTk5OHb6D0B5++GF8//vfx2uvvYaRI0eiY8eOaNasWa33Hzx4MG666SYMGzYMAPCjH/0IgwYNwkcffYRf/vKXqFevHjIzMzF9+nQcOHAAV111FY4cOQJjDP7yl78kPP5oSaxTDjE9ucijAL4H4DiARQB+BF0DfxVA68BtNxpjwq6+DB061PA8cXKdQYOAFSuqEvnEicDf/uZsTEQAVq1ahV69ejkdhqOOHj2KjIwM1K9fH1988QUmTZrk6Ig5nFD/XiLiN8YMjfRYW6vTjTEPA3i4xs1rAQyz83WJbGcMsG2bDldeeAEYPhx45x3giSeACHtcich+GzduxHXXXYfKyko0aNAAz9VWau9xPACFKB6bNwM7dgBDh+o+8Wee0c8ffhj4n/9xOjqitNejRw8sWrTI6TBsx7arRPGwlnfOPVcvBw8Gbr8dePppYOlS5+IiorTCJE4Uj5ISoH59oH//qtt+9zugVSvgJz/R6XYiIpsxiRPFw+fTBN6oUdVtrVsDjz8OfPYZW7ASUVIwiRPFyhhN4kNDFI7+v/8HDBsG/PKXwP79yY+NiNIKkzhRrNauBfburVoPD2YVue3YATzySNJDI/Kqpk2bAgC2bt2K8ePHh7zPhRdeiEjbjadOnYryoJbI0RxtGo1HHnkEU6ZMqfPzJBqTOFGsSkr0MtRI3Lr91luBp54Cli9PXlxE8XLRiXynn376yRPK4lEziUdztKmXMYkTxcrn07XwPn1qv8/vfw+0aAH89KcsciN3s+Es0nvvvbfaeeKPPPII/vznP+PgwYO4+OKLTx4b+u67pzbsXL9+Pfr27QsAOHz4MK6//nr0798f3/ve96r1Tp80aRKGDh2KPn364OGHtR3JU089ha1bt2LUqFEYNWoUgKqjTYHQR42GO/K0NosXL0ZOTg769++Pa6655mRL16eeeurk8aTW4SuffvopBg4ciIEDB2LQoEFh29HGxRjj+o8hQ4YYItcYOdKYnJzI93v2WWMAY15+2faQiIKtXLmy6sqdd+rvbG0fDRvq72nNj4YNa3/MnXeGff2FCxeaESNGnLzeq1cvs2HDBlNRUWH27dtnjDGmtLTUnHnmmaaystIYY0yTJk2MMcasW7fO9OnTxxhjzJ///Gdz8803G2OMWbJkicnIyDAlJSXGGGN2795tjDHm+PHjZuTIkWbJkiXGGGOys7NNaWnpyde2rvt8PtO3b19z8OBBc+DAAdO7d2+zcOFCs27dOpORkWEWLVpkjDHm2muvNf/4xz9O+Z4efvhh86c//ckYY0y/fv3MvHnzjDHG/OY3vzF3Bn4eHTt2NEeOHDHGGLN3715jjDFXXHGF+eyzz4wxxhw4cMBUVFSc8tzV/r0CAPhMFPmRI3GiWJw4Afj9tU+lB/vRj4AhQ4B77gES/e6bKFFsOIt00KBB2LlzJ7Zu3YolS5agVatW6NKlC4wxeOCBB9C/f39ccskl2LJlC3bs2FHr88yfP//k+eH9+/dH/6Atna+//joGDx6MQYMGYcWKFVi5cmXYmGo7ahSI/shTQA9vKSsrw8iRIwEAEyZMwPz580/GWFBQgH/+85+oX197qQ0fPhx33303nnrqKZSVlZ28PVHYsY0oFl9/DRw8GLqoraaMDC1yy8kBfvtb4E9/sj8+opocOot0/PjxmDlzJrZv335yarmwsBClpaXw+/3IzMxE165dQx5BGkxETrlt3bp1mDJlCkpKStCqVSvcdNNNEZ/HhFnWivbI00jef/99zJ8/H++99x4ee+wxrFixAvfddx/Gjh2LWbNmIScnB3PmzME555wT1/OHwpE4USwiFbXV9J3vALfcon9II4wUiBxhx1mkAK6//nq8+uqrmDlz5slq83379qF9+/bIzMzE3LlzsSHUm4cgI0aMQGFgbX758uVYGuiGuH//fjRp0gQtWrTAjh078MEHH5x8TG3HoNZ21GisWrRogVatWp0cxf/jH//AyJEjUVlZiU2bNmHUqFH44x//iLKyMhw8eBDffvst+vXrh3vvvRdDhw7F6tWrY37NcDgSJ4qFzwc0bQqcfXb0j/nDH4A33wTuuAOYMwcIMbIgcowdZ5EC6NOnDw4cOIBOnTqhY8eOgZcqwHe/+10MHToUAwcOjDginTRpEm6++Wb0798fAwcOPHlM6IABAzBo0CD06dMH3bt3x/Dhw08+ZuLEiRgzZgw6duyIuXPnnry9tqNGw02d1+all17C7bffjvLycnTv3h1///vfceLECdx4443Yt28fjDH4+c9/jpYtW+I3v/kN5s6di4yMDPTu3RtjxoyJ+fXCsfUo0kThUaTkGrm5QIMGwKefxva4adO0HetrrwHXXWdPbEQBPIrUW+pyFCmn04miVVEBLF4c3Xp4TbfdpueP3323rqkTESUAkzhRtFasAI4ciX49PJhV5LZlix6UQkSUAEziRNGqefxorHJzgZtuAp58Evjqq4SFRUTpi0mcKFolJXrUaPfu8T/HE09o5e8dd7CTG9nKC/VOVPd/JyZxomhZJ5fVpbq8fXvgsceA2bOBt95KXGxEQRo1aoTdu3czkbucMQa7d+9Go+AjjWPE6nSiaBw5AjRrpkeM/v73dXuu48e1k9vevcCqVUCTJomJkSigoqICmzdvjtgAhZzXqFEjdO7cGZmZmdVuj7Y6nfvEiaKxdKkm33jXw4PVr69FbhdcoG8I6thUg6imzMxMdOvWzekwKAk4nU4UjVg7tUVy/vnAD34ATJkCrFmTmOckorTDJE4UDZ9P17M7d07cc/7xj3qk6c9+xiI3IooLkzhRNHw+nUpPZMvU004DHn0U+PBDIMS5ykREkTCJE0Vy6JAeXpKoqfRgP/0p0LcvcNddQHl54p+fiFIakzhRJIsWAZWViSlqq8kqctuwAXj88cQ/PxGlNCZxokgSXdRW04gRwA036Br5t9/a8xpElJKYxIki8fmAM84AOnSw7zX+9CcgMxO48077XoOIUg6TOFEkJSX2jcItp58OPPII8P77wL/+Ze9rEVHKYBInCqesTPdx27EeXtPPfgb07q2j8cOH7X89IvI8JnGicPx+vbR7JA7odPrTTwPr1un6OBFRBEziROFYPfuTkcQBYNQo4Hvf00r1deuS85pE5FlM4kThlJQAZ56pR5Amy5QpQEaG7h0nIgqDSZwoHKtTWzJ17gw89BDw3nvArFnJfW0i8hQmcaLalJZqE5ZkTaUHu+su4JxztNiNx0kSUS2YxIlqk+z18GANGgB//as2f5kyJfmvT0SewCROVJuSEj3wZPBgZ17/kkuA8eP1zPENG5yJgYhcjUmcqDY+n05pN2vmXAxPPqlvJH7+c+diICLXsi2Ji8jZIrI46GO/iNwlIq1FZLaIrAlcJrHslyhKxuhIPNlFbTWdcQbw618Db78NfPSRs7EQkevYlsSNMV8ZYwYaYwYCGAKgHMDbAO4D8LExpgeAjwPXidxl61Zg+3Zn1sNruvtuoEcP4I47gKNHnY6G7FBYCHTtCtSrp5eFhU5HRB6RrOn0iwF8a4zZAOAqAC8Fbn8JwNVJioEoelZRm9MjcQBo2FCL3Nas0el1Si2FhcDEiVr3YIxeTpzIRE5RSVYSvx7AK4HPOxhjtgFA4LJ9kmIgil5JiZ71PWCA05Goyy4DrrkGeOwxYONGp6OhRHrwQaC8vPpt5eV6O1EEtidxEWkA4EoAb8T4uIki4hMRX2lpqT3BEdXG5wP69gUaN3Y6kip/+Yte3n23s3FQYtX2poxv1igKyRiJjwGw0BizI3B9h4h0BIDA5c5QDzLGzDDGDDXGDG3Xrl0SwiQKsIra3LAeHiw7G3jgAeDNN4HZs52OhhKlS5fYbicKkowk/n1UTaUDwHsAJgQ+nwDg3STEQBS99euBPXvcsR5e0z33aC/3O+4Ajh1zOhpKhMmTT53xycrS24kisDWJi0gWgNEA3gq6+XEAo0VkTeBrj9sZA8Uh3StlS0r00m0jcQBo1Ah46ingq6+qptfdLN1/l6JRUADcdlv12+64Q28niqC+nU9ujCkH0KbGbbuh1erkRlalrFVoY1XKAunzR8Xn04rwvn2djiS0vDzgyiu1yK2gQA9McSP+LkWvYUM9T37nTv33LCtzOiLyCDHGOB1DREOHDjU+a8sP2atr19AtPrOzdZo5HYwapYmnqMjpSGq3bh3Qu7cm89deczqa0M44A9i8+dTb0+l3KVqjRgGHDgHFxXqe/Lx52qsgI8PpyMghIuI3xkScDmTbVaou3StlKysBv9+dU+nBunUD7rsPeP114OOPnY5G7doFvPWWnrzWv3/oBA6kz+9StE6c0NmfYcP0+vjxOiL/7DNn4yJPYBKn6tK9Uvbrr4EDB9xZ1FbTr36lydypIrfdu7UdrJW027UD8vOB558HOnQAWrQI/bh0+V2K1qpVwMGDwHe+o9fHjNHahzffdDYu8gQmcapu8mQ9BjNYOlXKOnn8aKwaNwamTtUk8NRT9r+elbTvvFOb4LRtC4wbV5W0f/c7HT2WlekWuGee0d+dYOn0uxQta9nGSuJNmwKXX66zGpWVzsVFnsAkTtUVFOgRmJasLGDGjPQpRCop0e+5Vy+nI4nOd7+rhW6PPqprqIlUW9J+7jmgfftTk/aDDwLDh1e9CSwo0N+d7Oyq5/zrX9PndylaRUVAy5baH98yfjywZYu76zJIObwDg4VtdCqrsKt/f11z3bVLK2fTwfDh+p9xwQKnI4neN98AffroVPbLL8f/PLt3A/Pna1HVvHnA0qV6e+PG+nO58EL9OPfcU2drInn/feCKK/R5R46MP8ZUNHCgzmQEn1K3b58uT/zsZ8CUKc7FRuHV3IEBJGzgw8I2ik9lJbBwITB4sI7w9u8HPv/c6aiS4/hxYNEib6yHBzvrLF0ff+UVTZLR2rMHeOcd4K67NJG0axfbSDsWOTl6+cUXsT82lR06BCxbVjWVbmnRAhg9Gpg5UzsIkju5oO+9rfvEyYO+/VYT95AhwMUX6wh81qz0GD2tXAkcPuyN9fCa7r8f+Mc/gBtv1G1JmzZpAdnkyVUjgj17Th1pG1M10n7ssfhH2pG0aaPTxV9+mdjn9Tq/X98410zigE6pz5qlb6qHDEl+bBSZC3bzMIlTdX6/Xg4ZAjRvDlxwgf4heeIJZ+NKBjcdPxqrrCw95Wzq1KrbNmwAbrlFk/v27dWT9nnnAb/9bVXSbtjQ/hhzc4EPP9QYROx/PS+w1ryt7WXBrrxS35C9+SaTuFt16RK6r0YSd2BwOp2q8/t1FNanj17PywOWL0+Pvb0lJTqNeeaZTkcSn7feOvW2o0d1rbVtW03aCxYAe/cCc+YAv/41cP75yUnggCbxnTu1UQ2p4mLdJhjqkKc2bYCLLuKUups98siptyV5BwaTOFXn92tBmzWdOnasXs6a5VxMyeLz6VR6PY/+t9i0KfTtIs4k7Zpyc/WS6+JViopCT6Vb8vOBNWv0jTS5T/3AZHb79vr/LDs76bt5PPrXimxhzKnrb2efrSOFVE/iR48CS5Z4cz3c4vZGPX36AE2acF3csm2bvvEKl8SvvlqTAxu/uNO0aUDPnvpvWVmp7YSTvIWSSZyqfPutbm0JTuIiOqX+8cfAkSPOxWa3ZcuAigpvrodbJk92d3OV+vV17ZcjcRVuPdzSoQMwYoROqZO7LFqkv8uTJjk6e8ckTlUWLtTLmkU0eXm6bWL+/OTHlCxuPn40WsHNVRya2osoN1dnPGpuy0lHxcX6xmbQoPD3y88HVqzQ42fJPaZP1yLRCRMcDYNJnKpYRW01j+C88ELt5ZzKU+o+nxYXuWXqOV4FBTql59DUXkS5ubofn82bdCQ+YIAmgnDGjdNLTqm7R1mZNnq54QagVStHQ2ESpyp+P9CvX+je6aNGpXYSLynRUTi3PtmLTV/UiRP6OxduPdzSqZO++eGUunv83//pbNKPf+x0JEziFBCqqC1YXp5Wya5Zk9y4kqG8XKcrvTyV7hVt22qHuXQvbvvqKz0tL9x6eLD8fF2DXbvW3rgoMmO0oO0739HOlg5jEie1bp3uHw6XxIHUHI0vWqTTz14uavOS3Fwdiafz3ueaJ5dFwil195g7V9+EuWAUDjCJkyW4U1so3bsD55yTmkncS8ePpoLcXGDHDl2zT1dFRdpYqGfP6O7frZv+32QSd9706UDr1sB11zkdCQAmcbL4/donvWZRW7C8PO25fehQ0sJKipISXXfs2NHpSNIDm75oEh82LLatSfn5+rjamvqQ/bZu1eN5b7lFi31dgEmclFXUFq6bV14ecOwY8MknyYsrGaxObZQcffumd9OX8nLtSxDterglP18vQ7XXpeR47jldervtNqcjOYlJnHRt0u+PfMjC+ecDTZum1pT6vn26vsX18OSpX19/3uk6El+4UKvTo10Pt/TsqW+0OaXujIoK7btw+eWuOl+BSZx0bXLv3siVlg0bApdcokk8VYqSrAY3HIknV24usHixHv2abmItaguWn69nvG/fntiYKLL33tPp9EmTnI6kGiZxilzUFiwvT080W7nS3piSJRU6tXlROjd9KSoCunbVQzNiNX68voF+++2Eh0URTJumzaCsnTouwSROmsTr19epukhSbauZz6eVv23aOB1Jeknnpi/FxbGvh1t699ZDiTilnlyrVmkt0O236xnvLsIkTprE+/aNrtqyUydtFfn++/bHlQw+H9fDndCuna4rpltx244dwIYN8U2lA9pRMD9fd4ns2pXQ0CiMZ5/V3Tu33OJ0JKdgEk930Ra1BcvL03W5ffvsiysZdu3SJjecSndGOjZ9qct6uGX8eC2Me/fdxMRE4R06BPzv/wLXXhvfEojNmMTT3YYNwJ49sSfxEyeA2bPtiysZrFoAjsSdkZurBVobNjgdSfIUFel0bKSTy8IZOFCXgDilnhwvvwzs3++aDm01MYmnu1iK2iw5OUDLlt5fF7eK2lzQ/zgtpWPTl+JioH//U899j4WIjsbnzNHTtMg+Vp/0/v2B885zOpqQmMTTnVXU1r9/9I+pXx+47DLggw+08YFX+XxaJNS8udORpKd+/TSZpUsSr6zUJF6XqXRLfr7uW/7Xv+r+XFS7L7/UrZA//rFrTzhkEk93CxcCffrE3kIwL0+nQhcvtieuZCgp4VS6k6ymL+lS3PbVVzotm4gkfu65QOfOnFK327RpQLNmQEGB05HUikk8ncVT1Ga5/HJ9Z+rVKfWtW/WDRW3Oys3VU+TSoelLcbFexru9LFi9ejoa//BDPdKUEq+0FHj9dWDCBO1U6VJM4uls0yat0I4nibdvr6MBr241s5qMcCTuLKvpi1WbkcqKinTp5pxzEvN8+fnA0aPefSPtdi++qGdFuKxDW01M4uksnqK2YHl5+ofJi/tVfT6tEh440OlI0ls6NX1O0IEaAAAgAElEQVQpKtI3jbGcXBbOeecBHTpwSt0OJ07o3vALL9QGOy7GJJ7O/H5NZLEUtQXLy9Mp+Y8+SmxcyVBSorUAdakSprpr317Pqk/1dfHDh4GlSxOzHm7JyADGjdPZsPLyxD0v6TLF+vWu3VYWjEk8nfn9msgaN47v8UOGaOctr03nGcPjR90kHZq+LFqkywaJWA8Plp+vCdyLb6TdbNo04LTTgKuvdjqSiJjE01Vditos9eoBY8bou9YTJxIXm902bNAlAK6Hu0NuLrBtmx6sk6oS0aktlJEjte8/p9QTZ+1a3T47caK2WnU5JvF0tXmzVl/WJYkDOqW+Z09V5a0XWEVtHIm7Qzo0fSkq0hOwTjstsc9bv76OFv/1Ly1yo7r72990gHLrrU5HEhVbk7iItBSRmSKyWkRWiUiuiLQWkdkisiZw2crOGKgWVlFbXbuVXXqprs15aUq9pARo0CC6U9vIfv3765JOqifxRE+lW/Lzdf/5nDn2PH86OXIEeOEF4KqrdB++B9g9Ev8fAB8aY84BMADAKgD3AfjYGNMDwMeB65Rsfr++2xwwoG7P06qVVsl6aauZz6eJo2FDpyMhIPWbvuzcqUVSiZ5Kt1x8MdCiBTBzpj3Pn07eeAPYvdsTBW0W25K4iDQHMALACwBgjDlmjCkDcBWAlwJ3ewmA+ysHUpHfr1snElGdnZenhTtbt9b9uexWWanfO9fD3cVq+nLkiNORJJ611GRXEm/QALjySj3VrKLCntdIF9OmaSvmiy5yOpKo2TkS7w6gFMDfRWSRiDwvIk0AdDDGbAOAwGXIs91EZKKI+ETEV1paamOYaSgRRW3B8vL08sMPE/N8dvrmGz1Clevh7pKbqwkoFZu+WCeX2XnQTn4+sHevnjNO8Vm4UGeDJk1ybZ/0UOxM4vUBDAYw3RgzCMAhxDB1boyZYYwZaowZ2q5dO7tiTE9btugUX6KSeL9+QKdO3lgXZ1GbO6Vy05eiIqBvX6BJE/te49JLtTUop9TjN3261mZMmOB0JDGxM4lvBrDZGBPYW4GZ0KS+Q0Q6AkDgcqeNMVAode3UVpOIjsb/8x/3T+eVlOh/VJd3YUo7HTroGdmplsQrK/V3zq6pdEvjxsDYscA773hru6dblJUBhYV60EnLlk5HExPbkrgxZjuATSJyduCmiwGsBPAeAOutzgQA79oVA9XCKmpLZMvRvDw9iOG//03cc9rB5wMGDdJiKnKXVGz6smaNJgi7kzigU+o7dwKffWb/a6Wal17Srnou75Meit3V6XcAKBSRpQAGAvg9gMcBjBaRNQBGB65TMvn9QK9eiW05evHF2hjBzVPqx4/ruheL2tzJavqyaZPTkSSOXU1eQhkzRkfknFKPjTFa0JaTY2/dgk1sTeLGmMWBde3+xpirjTF7jTG7jTEXG2N6BC732BkDhbBwYeKm0i3Nmmn3KDcn8dWrtUUl18PdKRWbvhQX61p1ok4uC6dpUz0i+K23dBqfovPJJ8DXX3tqW1kwdmxLN1u3Atu3Jz6JAzqlvmKFtjV1o5ISveRI3J1SsemLdXJZRkZyXm/8eP0/bs0AUGTTpmnr2muvdTqSuDCJp5tEF7UFs7aauXU07vPpec49ejgdCYWSmamzJKnS9OXIEWDJkuRMpVuuuEL3jXNKPTpbtuj++ltuARo1cjqauDCJpxs7itosPXvqsZJuTeIlJfrmJVHnOVPi5ebqck8qNH1ZtEh3a9jVbjWU5s2B0aP1QJRUKhC0y3PP6dLDbbc5HUnc+Ncs3fj9uj5nx55Va6vZxx+774/wsWM6KuJ6uLtZTV8WLnQ6krqzu1NbbcaP1yWtVGyck0gVFcCMGVoQ2L2709HEjUk83SSyU1soeXm6VePTT+17jXgsW6aJnOvh7pZKxW1FRXqIxumnJ/d1r7xSt1DyeNLw3n1Xd0N4tKDNwiSeTrZt0w87k/iFF2pxktum1NmpzRtSqelLUVHyR+EA0Lo1MGqUrotzSr1206YB2dla0e9hTOLpxM6iNkvjxnp4gNuSeEmJVqB27ep0JBRJTo73i9tKS4G1a5O7Hh5s/Hg9J2DZMmde3+1WrQLmzgVuvz15OwdswiSeTvx+Xbe2o6gtWF6e/gH5+mt7XycWPp+Owj10sEHays3VqmEvN32xtjM6MRIHgKuv1gJOTqmHNn26VvH/v//ndCR1xiSeTvx+PWavaVN7X2fMGL10y2i8vBxYvpxT6V6RCuviRUWaRO2c9QqnfXvggguYxEM5eFDbrF57rf6cPI5JPJ3YXdRm6dZN27q6JYkvWaKHQrCozRsGDNA9u15P4n372v+GOZzx47X50urVzsXgRi+/DOzf7/mCNguTeLrYvl07OSVrZJCXpxXqBw8m5/XCYVGbt3i96Ysxur3MqfVwyzXX6CVH41WMAZ55Rt8oWjM+Hsckni6SUdQWLC9Pt3R98klyXi+ckhKgY0c985y8wWr6cvSo05HE7ptvgL17nVsPt3TqBJx3HpN4sC++AJYu1VF4itTHMImnC6uobdCg5Lze+efroShumFK3itrIO3Jz9U2gF5u+JPPkskjy87Vz3Nq1TkfiDtOmaVe7G25wOpKEYRJPF36/tkVt1iw5r9eggbZ/nDXL2b2qBw7omiDXw73Fy8VtRUXaEbF3b6cjAcaN00uOxvWs9TfeACZMcLZWIcGYxNNFsoraguXl6Tah5cuT+7rBFi7UNxEciXvLaafpnn4vJvHiYv19c8P+465dNRYeiAK8+KLO7kya5HQkCcUkng527NB9t8lO4m7Yambt12US9x4vNn05ehRYvNgdU+mW/Hx9Y+Hlffd1deIE8Oyz2smuVy+no0koJvF0YK0rJjuJn366NpZxMon7fNpasV0752Kg+OTmAps364dXLF6soz23JXEAeOstZ+Nw0gcf6KEwKbKtLBiTeDqwKtOTVdQWLC8P+O9/gbKy5L82oCNxrod7kxfXxa2iNqe3lwXr0QPo3z+9p9SnTdMdKldd5XQkCcckng6sorbmzZP/2nl5OpU1e3byX3vPHq3K5VS6N3mx6Utxsc5Ade7sdCTV5efrm+nt252OJPm+/Rb48ENg4kTtQZBimMTTgRNFbZbvfAdo1cqZKXWryQtH4t7UoIH+3nopiTt1clkk48drgefbbzsdSfL97W/aAvfWW52OxBZM4qmutFQLWpxK4vXr61F/H3wAVFYm97WtJD54cHJflxLHS01fdu/WRi9uTOK9ewPnnJN+U+qHDwMvvKAHwqRosycm8VSX7E5toeTlaYX8okXJfd2SEl1GaNkyua9LiWM1fUn27048iov10k3r4cHy87UV8q5dTkeSPG+8octqKVjQZmEST3VOFrVZLrtMu8W9/35yX5ed2rzPS8VtxcX6e+7W37nx47U+5d13nY4keaZN05MbR41yOhLbMImnOr9fq1NbtHAuhnbtdHSSzHXx7dt1a5Jb/6BSdDp21C2CXkjiRUVAnz7J64oYqwEDgO7d02dK3e/Xf5MU6pMeCpN4qvP73bEmnJenI5XS0uS8HovaUocXmr5YJ5e5cT3cIqJT6h9/rAe0pLrp04GsLOCHP3Q6ElsxiaeyXbuAjRudXQ+35OXpH7qPPkrO6/l8WpHq5DICJUZurhZnbtnidCS1W7tWC9vcuh5uGT8eqKgA/vUvpyOx1969em54QUHK18QwiacyNxS1WQYPBjp0SN6UekmJVuQ2aZKc1yP7eGFd3E0nl4Vz7rnAGWek/oEoL72klekpXNBmYRJPZVYSd8N0er162kv9ww+1uMZOxrCoLZUMHOj+pi9FRTp126eP05GEZ02pf/SRnvCXiiortaAtN1d/d1Ick3gq8/uBM890z3RSXp5Oc1mjFrts2qTHDnI9PDV4oelLUZHGWL++05FElp+v++6TvVskWT75BFizJi1G4QCTeGpzslNbKKNH6/GMdv/xsIraOBJPHTk57m36Yu1jd/tUuuW88/So11SdUp82DWjbVtf/0wCTeKravVtP7XFTEm/ZEhg+3P518ZIS7ZE8YIC9r0PJk5tbdcyn2yxZ4r6Ty8KpVw8YN07/H5aXOx1NYm3erPvgb7lFl2DSAJN4qnLq+NFI8vL0D7GdlcY+H9CvH9CwoX2vQcnl5uI2rxS1BcvP1wT+4YdOR5JYzz2nNTG33eZ0JEnDJJ6q3FTUFiwvTy/t+uNhFbVxPTy1nH460KWLe5P4aae57+SycEaMANq0Sa0p9YoKYMYM/RvTrZvT0SQNk3iq8vu1O1OrVk5HUl3fvvrHzq4p9W+/1bPLuR6eenJy3JnErSYvXuoKVr8+cM01ul/cjXUG8XjnHe3UmCYFbRYm8VTltqI2iwgwdqyeL37sWOKfv6RELzkSTz1ubPqydy/w9dfemkq35OfrNrPZs52OJDGmTdMR+GWXOR1JUjGJp6I9e4B169yZxAGd7jpwAPjvfxP/3D6fFrT07p345yZnWevibmrBap1c5sUkftFFeqZCKkypr1wJzJsH3H677oBJI0ziqcitRW2Wiy7Svb92bDUrKdEGD5mZiX9uctagQVqs6KYp9aIid59cFk6DBsBVV2k1d0WF09HUzfTp+v3cfLPTkSQdk3gqcmtRm6VpU2DkyMSvi584oW9gOJWemtzY9KW4GOjVC2je3OlI4pOfr0sCc+c6HUn8Dh7UNqvXXacnJqaZsElcRGr9zRSRLpGeXETWi8gyEVksIr7Aba1FZLaIrAlcuqzyKgX4/bo21Lq105HULi8PWLVKp/0TZfVq4NAhb46KKDo5Ofr7bUc9RayM0ZG4F6fSLZdeqm+qvTylXlioy3NpVtBmiTQSn2d9IiIf1/jaO1G+xihjzEBjjPWX9T4AHxtjegD4OHCdEsmtRW3BrK1mH3yQuOfk8aOpz01NX9at05MCvZzEGzUCrrgCePtt+880sIMxWtA2cKC+wUtDkZJ48J6JmsO6ePdTXAXgpcDnLwG4Os7noVD27tVjEd2exHv00L7uiZxSLynRUUXPnol7TnIXNzV9sYra3H78aCT5+UBpKbBggdORxO7zz4GlS3UU7qUtfgkUKYmbWj4Pdb22x/9HRPwiMjFwWwdjzDYACFy2D/VAEZkoIj4R8ZWWlkbxUgSgqqjNrevhFmur2Sef6JGBieDz6ZuXNKtOTSudOulRmm5I4kVFQOPG2h3Qy8aM0e/Di1Pq06ZpPcINNzgdiWMiJfH2InK3iPwi6HPrejQVBMONMYMBjAHwExEZEW1gxpgZxpihxpih7dKwWCFubjpDPJK8PE3gn35a9+eqqNApVq6Hpz63NH3x0sll4TRpoon8zTf1GE+v2LkTeOMN4Kab9HtIU5GS+HMAmgFoGvS5df35SE9ujNkauNwJ4G0AwwDsEJGOABC43Blv8BSC3w9kZ2tLRbcbOVJHAImYUl++XNdKuR6e+nJzgY0bga1bnYvh2DGd9fL6VLolPx/Yts1de/AjeeEFffM+aZLTkTgq7FtIY8yjACAibY0xu2J5YhFpAqCeMeZA4PNLAfwWwHsAJgB4PHD5bjyBUy28UNRmadQIuPhi3S/+P/9TtzUtHj+aPoKbvowb50wMy5bpm0YvF7UFu+IK3cI3c6YeVep2J04Azz6rPSfOOcfpaBwVaYvZFSJSCmCpiGwWkVj+dTsA+ExElgAoBvC+MeZDaPIeLSJrAIwOXKdEKCvT3uFeSeKATqmvXautK+uipET7xHfvnpi4yL0GDdKE4+SUuhdPLguneXPdbvbWW1rx7XazZulsTJpuKwsWaTr99wAuMMacDiAfwB+ifWJjzFpjzIDARx9jzOTA7buNMRcbY3oELvfEHz5V4/ZObaGMGaOXdZ1S9/l0FJ6mFapppWFD55u+FBUBHTroyWqpYvx4YMOGqroaNyosBLp2Ba68UgtYDx1yOiLHRUrix40xqwHAGFMEXQ8nt/JSUZula1ftc16XJH7kiE5vcj08fTjd9KWoSNfDU+lN45VXapHezJlORxJaYSEwcaK+0QB0Sn3SJL09jUVbnX63iNwd4jq5ycKFOjJo29bpSGIzdqxWqB88GN/jlywBjh/neng6yc3VN29LliT/tcvKgK++Sp2pdEurVrrG/Oab7pxSf+ABoLy8+m3l5cCDDzoTj0tEW51ufdSsUCc38VJRW7C8PK0y/bhmU8AoWcePMomnDyebvli/b6mWxAGdUv/mG53ZcoPjx/XvwqRJugYeSm23p4moqtNDERHOXbrJvn3AmjXAhAlORxK74cOBZs10Sv2qq2J/vM+n65OdOyc+NnKnzp3144svgJ/9LLmvbRW1peLyzdVX63GeM2cC/fs7E8PRo5q433xTT1jbvRvIytKPmiNxILXqEuIQ0ylmItJbRH4bqCyfblNMFI9Fi/TSiyPxzEytjJ01K75pvJISFrWlo9xcZ0biRUW6ralFi+S/tt3atQNGjEh+97bDh7V/+403Au3b6xLbzJnA5ZdrxXxpKTBjhibyYFlZwOTJyY3VZSImcRHJFpH7AlvF/gHgxwBGBx1oQm7gxaK2YHl5wObNsU/jHTyop6Gl4qiIwsvJ0SKnbduS95rGaM/0VJxKt4wfD6xcqf+v7HTgAPDqq8C112odz7hxeiBSfj7w739rR7Z//hO45hpN1gUFmsizs/UNe3a2Xi8osDdOl4u0T/xzALMAZAIYb4wZAuCAMWZ9EmKjWPj92lPaqy1qL79cL2OtUl+4UP+wcj08/QQ3fUmWDRs0uaRyEr/mGr20YzS+dy/wf/+ny2bt2gHf/74evPLDHwKzZwPbtwMvvqgj8YYNT318QQGwfr22h12/Pu0TOBB5JF4KLWLrgKpe6S4sWyTPFrVZTj9dm3jEmsTZqS19DR6c/KYv1np4qrRbDeX007VrW6KSeGkp8Nxz+ka9fXut21m4UNfe588HtmwBpk8HLrlEl9YoJmGTuDHmKgD9ACwE8KiIrAPQSkRS+DfYg/bv145nXk7igL77/vxzfbcerZISnYHo0MG+uMidGjbURJ7MJF5crO2CnSr6Spbx4/VAoW+/je/xW7cCTz8NjBoFnHaa7u9eswb4+c915mTDBmDqVOCCC3jqYB1FXBM3xuwzxrxojBkNIAfAwwCmisgm26Oj6Hi5qC1YXp42cJg9O/rH+HxcD09nOTn6O5Cspi9FRfrGIdVHjFZP+lhG4+vXA3/+s47iO3UC7rgD2LFD93cvWqRb1/74R12KqBdTTTWFEdNP0hizwxjzlDHmPADn2xQTxcoqanP7GeKRDBsGtG4d/ZT63r36h4FT6enLavqydKn9r1VRof/XUnk93JKdrW+OIyXxr78G/vAH/T/YrRtwzz26Deyxx7Q4buVK/XzgQO4esUnYfeIi8l6Ex1+ZwFgoXn6/vvP1+pRyRoaum33wgRauRHq3br154Ug8fQU3fbH7zdyyZfqGIZXXw4N17w689pr+P+zSRbdy3XCDHvv75pv6sXy53nfYMOCJJ7Sy/MwznY07zUQ6zT4XwCYArwAoAsC3Um7k9aK2YHl5wMsv6/cUKTlbRW2p8r1T7M44Q9/AfvGFTt/aqbhYL9NhJF5YqI1WAN39sWEDcNNNwC9+oVPkIsD55+u69rhx+u9AjoiUxE+DHhf6fQA3AHgfwCvGmBV2B0ZROnBAp7RuuMHpSBLjssv0D8SsWZGTeEkJcNZZ2vOZ0leymr4UFem2qK5d7X8tpz34oM46BDt+XDtDTp+und1OO82Z2KiaSNXpJ4wxHxpjJkCL2r4BME9EbH7LS1FbtEjfKafKaLRtWx3pRLMubh0/SuktJ0eLqrZvt/d1ior0dzMd1nZr60d+9KhuDWMCd41oOrY1FJFxAP4J4CcAngLwlt2BUZS83qktlLFjdZS9c2ft99m5U//QcD2cktH0Zd8+YPXq9FkPr60feZr3KXejSB3bXgLwOYDBAB41xpxrjHnMGLMlKdFRZH6/NmdIpXfGeXk6u/DRR7Xfh01eyGJt+bJzSt3n09/JdFgPB7SIjX3KPSHSSPwHAHoCuBPA5yKyP/BxQET22x8eRbRwYWqNwgHdjnLaaeGn1EtKdFpz0KDkxUXu1KiR/U1f0qFTWzD2KfeMSEeRcke+mx08qFN83/ue05EkVr16wJgxeqrR8eNA/RC/pj4f0KuXHmFKlJOjSaaiwp5GLEVFwNlnAy1bJv653aqggEnbA5ikvWzx4tQqaguWlweUlYVe5zSm6vhRIkDXxQ8ftqfpizGaxNNlFE6ewiTuZalY1GYZPVqbv4SaUt+yRfeqsqiNLMFNXxJt0yb9fUuX9XDyFCZxL/P7gY4d9SPVtGihzSRCJfGSEr3kSJwsZ5yhBZ52JHFrPZxJnFyISdzLUqlTWyhjxwJLlujIO5jPp+vkAwY4Exe5j4h9TV+KivTEtFQ/uYw8iUncqw4d0qK2VE7ieXl6+cEH1W8vKQH69gUaN05+TOReOTnAunU69Z1IxcW6C6JBg8Q+L1ECMIl71eLFekhIKifx3r21uUTwlLoxPH6UQrOj6cvx4+lzchl5EpO4V6VyUZtFREfjs2dXnRe9dq0eQcr1cKppyJDEN31ZvlyP1mQSJ5diEvcqv1+PHk3ForZgeXm6H37BAr1udWrjSJxqatRIp70TmcTTrckLeQ6TuFdZRW2pfhjDRRfpWqQ1pe7zaZFR377OxkXulJurNRMVFYl5vuJiPZSne/fEPB9RgjGJe9GhQ8CqVak9lW5p0gS48MKqJF5Som1Z7ejKRd6Xk6NNX5YtS8zzWU1eUv3NMnkWk7gXLVmS+kVtwcaO1Ur8b77RGQiuh1NtEtn0Zf9+YOVKroeTqzGJe1E6FLUFs7aaTZ2q6+NM4lSbLl20TiQRSdw6uYzr4eRiTOJe5PcD7dsDnTo5HUlynHWWFvE984xef+ABoLDQ2ZjInRLZ9KW4WC+ZxMnFmMS9KF2K2iyFhcDu3VXXt20DJk5kIqfQcnJ0K+LOnXV7nqIioEcPoHXrxMRFZAMmca8pL9d1unSZSgeABx/UphvBysv1dqKaEtH0hSeXkUcwiXvN0qXpVdQGABs3xnY7pbchQ7S3fl2m1Lds0RkfFrWRyzGJe026FbUBWqwUy+2U3ho3rnvTF55cRh7BJO41fj/Qrh3QubPTkSTP5MlAVlb127Ky9HaiUKymLzWXYaJVVKRNhnhSHrkck7jXpFtRGwAUFAAzZgDZ2fp9Z2fr9YICpyMjt8rJ0bqJpUvje3xRkTYVatgwsXERJZjtSVxEMkRkkYj8O3C9m4gUicgaEXlNRHi+X7QOHwZWrEivqXRLQQGwfr3WA6xfzwRO4dWluO34cd0jzql08oBkjMTvBLAq6PoTAP5ijOkBYC+AW5IQQ2pYuhQ4cSI9kzhRLLKzgdNOi29dfOVKnlxGnmFrEheRzgDGAng+cF0AXARgZuAuLwG42s4YUko6FrURxaMuTV9Y1EYeYvdIfCqAXwGoDFxvA6DMGGNVm2wGELLtmIhMFBGfiPhKS0ttDtMj/H49UemMM5yOhMj9cnKAb7+NvelLUZE2eDnzTHviIkog25K4iFwBYKcxxh98c4i7mlCPN8bMMMYMNcYMbdeunS0xek46FrURxctaF7dG1tEqLubJZeQZdo7EhwO4UkTWA3gVOo0+FUBLEakfuE9nAFttjCF1HDmiRW2DBzsdCZE3DB0ae9OXgwf1/xmn0skjbEvixpj7jTGdjTFdAVwP4BNjTAGAuQDGB+42AcC7dsWQUpYu1apZrocTRadxY90mFksS9/l0BwSTOHmEE/vE7wVwt4h8A10jf8GBGLyHRW1EscvN1enxaJu+WFPv555rX0xECZSUJG6MmWeMuSLw+VpjzDBjzFnGmGuNMUeTEYPn+f1abJOd7XQkRN5hNX1Ztiy6+xcXa0Fb27b2xkWUIOzY5hUsaiOKXaxNX4qKOJVOnsIk7gVHjgDLl3MqnShWXbsCHTpEty6+ZYt+8PhR8hAmcS9YvpxFbUTxiKXpS3GxXnIkTh7CJO4FLGojil9uLvDNN0CkplFFRUBmpla0E3kEk7gX+P1Aq1Y6NUhEscnJ0ctITV+sk8saNbI/JqIEYRL3Aha1EcUvmqYvJ07oHnGuh5PHMIm73dGjuj2GU+lE8cnKAgYMCJ/EV63Sbm1cDyePYRJ3u+XLgYoKJnGiuojU9IUnl5FHMYm7HYvaiOouJwc4dEjfFIdSVAS0bAn06JHcuIjqiEnc7ayitm7dnI6EyLsiNX0pKuLJZeRJTOJu5/fryWX840IUv27dgPbtQ6+LWyN0TqWTBzGJu9mxYyxqI0qEcE1f/H6eXEaexSTuZsuXayLnGeJEdZebC6xZA+zaVf12q6iN28vIg5jE3YxFbUSJYzV9qbkuXlSk0+3t2iU/JqI6YhJ3M78faNFCj0YkoroZOhTIyDg1iRcXcyqdPItJ3M1Y1EaUOE2anNr0Zds2YNMmJnHyLCZxtzp2DFi6lFPpRIlkNX05cUKvcz2cPI5J3K1WrNBEziROlDi5udpe1Wr6UlysfdUHDXI2LqI4MYm7FYvaiBKvZnFbUZFOsTdu7FxMRHWQXkm8sFCP86xXTy8LC52OqHYLFwLNm7OojSiRunfXKvQvvtAp9ZISroeTp9V3OoCkKSwEJk4Eysv1+oYNeh0ACgqci6s2VlFbvfR6n0Vkq+CmL6tXAwcOcD2cPC19MsSDD1YlcEt5ud7uNhUVwJIlnEonskNuLvD118AHH+h1jsTJw9IniW/cGNvtTlq5Us8RZxInSjxrXfyZZ7QPQ8+ezsZDVAdpk8QPtu4S8vZjjZprwnQTFrUR2WftWr1cv17/77/yiqPhENVF2iTxBzAZh5BV7bbjyECDw/uAgQOBBQsciiwEvx9o1gw46yynIyFKLYWFwB13VF0/ckRrY9xc5EoURtok8af3FOBWzMB6ZKMSgmxCCtQAABNDSURBVPXIxg/xEvIwCzh8GBgxArj1VmDvXqdDZVEbkV28VBtDFIW0yRJdugCvoADdsB4ZqEQ3rMcrKMDK7DHaWOWee4C//x3o1Qt49VXAGGcCPX6cRW1EdvFSbQxRFNImiU+eDGRlnXr7XXdBeyr/6U+6Z/SMM4Dvfx/IywPWrUt6nFi5Uqf4mMSJEq9L6NqYWm8ncrm0SeIFBcCMGUB2tm4V7dQJaNQIeOutqjbKGDRIOzlNnapr5H36AFOm6Og4WVjURmSfUO/ms7L0diIPSpskDmgiX78eqKwENm/WpL5gAfDkk0F3ysgA7rwTWLUKGD0a+OUvgXPP1VF6Mvj9QNOmQI8eyXk9onRS8918drZed2PDJ6IoiHFq7TcGQ4cONT6fL+HPawxw7bXAv/6lObp//xB3ePtt4Kc/BXbs0Mvf/U4rx+2SmwtkZgLz59v3GkRE5Goi4jfGDI10v7QaidckAjz7LNCqFXDjjSG2i4sA48bpqPz224G//hXo3Rt47z17AmJRGxERxSCtkzgAtG0LvPACsGwZ8NBDtdypRQvt7vTf/wItWwJXXQXk5wNbtiQ2mFWrdLsbkzgREUUh7ZM4AIwdC9x2mxaoh+35kpurp4v9/vfArFm6HW3aNF1kTwQWtRERUQyYxAOmTNFTCn/4Q2D//jB3zMwE7r8fWL5cD074yU+A4cN1KF9Xfr9ud2MvZyIiigKTeEDTpsA//qE9H37+8ygecOaZwH/+ow/65hvtsPbAAzodHq+FC3WbW0ZG/M9BRERpg0k8SG4ucN99wIsvAu++G8UDRLQibtUqvfzDH4B+/YA5c2J/8RMngMWLOZVORERRYxKv4eGHdTB8663Azp1RPqhtW23Z+vHH2u989GjgBz8ASkujf+HVq7WHM5M4ERFFybYkLiKNRKRYRJaIyAoReTRwezcRKRKRNSLymog0sCuGeDRoAPzzn7oufuutMbZQv+giYOlS4Ne/Bl57DTjnHOB//ze6J2FRGxERxcjOkfhRABcZYwYAGAjgchHJAfAEgL8YY3oA2AvgFhtjiEvv3joz/t57OsCOSaNGwGOP6dR4r17AzTcDF18MfP11+MdZRW1nnx133ERElF5sS+JGHQxczQx8GAAXAZgZuP0lAFfbFUNd3HknMGqUXsZ1Dkrv3tp17dlntWCtf3/t9nbsWOj7+/16rjmL2oiIKEq2romLSIaILAawE8BsAN8CKDPGWCeKbAbQqZbHThQRn4j4SmNZW06QevV0JrxePWDChKBDUmJ9kttu08K3q64CfvMbXXD/7LPq9ztxAli0iFPpREQUE1uTuDHmhDFmIIDOAIYB6BXqbrU8doYxZqgxZmi7du3sDLNWXboATz+tDWD+/Oc6PFHHjrpG/v77wKFDwAUXaHIvK9Ovf/UVi9qIiChmSalON8aUAZgHIAdASxGpH/hSZwBbkxFDvG68UTus/uY3WrNWJ3l5wIoVwC9+ATz/vBa+3XEHMGKEfv2++4DCwjrHTERE6cHO6vR2ItIy8HljAJcAWAVgLoDxgbtNABDNjmzHRDwkJVZNmmh7uJISoHFjHerv3q1f27YNmDiRiZyIiKJi50i8I4C5IrIUQAmA2caYfwO4F8DdIvINgDYAXrAxhoSI6pCUWA0eHLrnenk58OCDCXoRIiJKZWl9nnisbr8dmDEDmDevaga8TurVC72HXCRxh6oQEZHn8DxxG1iHpEyYEOGQlGh16RLb7UREREGYxGMQ8yEpkUyeDGRlVb8tK0tvJyIiioBJPEYxH5ISTkGBzs9nZ+sUena2Xi8oSEisRESU2rgmHodjx4CcHGDzZj1WvH17pyMiIqJUwjVxG9XpkBQiIqIEYRKPU50OSSEiIkoAJvE6qPMhKURERHXAJF4HwYek/PCHcR6SQkREFCcm8TqyDkn57LM6HpJCREQUIybxBEjoISlERERRYhJPgIQfkkJERBQFJvEEadtWG8AsW6YjciIiIrsxiSdQXh5w223aY33+fKejISKiVMcknmAJPySFiIioFkziCZbwQ1KIiIhqwSRug9xc4P77E3RIChERUS2YxG3y0EPAoEHaW33nTqejISKiVMQkbhMekkJERHZjErcRD0khIiI7MYnbLPiQlLVrnY6GiIhSCZO4zYIPSZkwgYekEBFR4jCJJwEPSSEiIjswiSeJdUjKr38NLFnidDRERJQKmMSTxDokpU0b4Ac/4CEpRERUd0ziScRDUoiIKJGYxJNszBgekkJERInBJO4AHpJCRESJwCTugOBDUu66y+loiIjIq5jEHWIdkvL3vwPt2+s+8q5dgcJCpyMjIiKvqO90AOmsRw+tWi8t1esbNgATJ+rnBQXOxUVERN7AkbiDHn741INRysuBe+91Jh4iIvIWjsQdtHFj6Nu3bNFR+ogR+jFyJJCdraN2IiIiC5O4g7p00Sn0mlq1Avr0Ad5+W/eVA8AZZ1Ql9REjgLPPZlInIkp3TOIOmjxZ18DLy6tuy8oC/vpXXROvrARWrND95PPnA3PmVBW+tW9fPan366fFcURElD7E1FyUdaGhQ4can8/ndBi2KCwEHnxQp9a7dNHEXltRmzHAmjVVSf3TT6um5Fu2BC64oCqpDx4M1OdbNCIiTxIRvzFmaMT7MYl724YNVUl9/nzg66/19iZNgOHDq5L6sGFAw4bOxkpERNFhEk9T27YBCxZUJfVly/T2hg2BnJyqpJ6bq4meiIjcJ9okbtsqqoicISJzRWSViKwQkTsDt7cWkdkisiZw2cquGNJRx47Addfp+eVLlwK7dgHvvAP85CfAoUM6XT96tE6/5+bqdrb33wfKyqqeo7BQG8+wAQ0RkbvZNhIXkY4AOhpjFopIMwB+AFcDuAnAHmPM4yJyH4BWxpiwO6M5Ek+c/fuBzz+vGqkXFwMVFVrpPnCgFszNm1f9qNSsLGDGDDagISJKFtdNp4vIuwCeDnxcaIzZFkj084wxZ4d7LJO4fcrLgaKiqqQ+d+6pDWgA3fb26qtAz55agMdKeCIi+7gqiYtIVwDzAfQFsNEY0zLoa3uNMWGn1JnEk6devdBJPFijRsBZZ+le9Z49q1+2bp2cOImIUlm0Sdz2TUgi0hTAmwDuMsbslyg7lIjIRAATAaBLly72BUjV1NaApnNnXRv/6iutgP/qK2D5cuDdd4Hjx6vu16ZN6OR+5pma/ImIKHFsHYmLSCaAfwP4yBjzZOC2r8DpdNcqLAzdgKa2NfGKCmD9+urJ3brctq3qfvXqaetYK6kHJ/hOnSJPz8eyn56IyOscH4mLDrlfALDKSuAB7wGYAODxwOW7dsVAsbMSY7QJMzNT+7z36HHq1w4c0IRuJXUrwf/3v8DBg1X3a9xYE3rN0XvPnlpFX/ONBU97IyJSdlannw9gAYBlACoDNz8AoAjA6wC6ANgI4FpjzJ5wz8WReGoxRkfpoUbv69YBJ05U3bd9e2DfvurV8pYOHYCPPgKaNQOaNtWPxo2T11OeswNEZBdXFbbVFZN4+jh2DFi7tnpyf/756B9fr15VQm/atHqCr3k92q81bHjqG4NYlx2IiGLBJE4po2vX0MV27dsD06fr1PzBgzp9H+rz2r4W7a9+/fqnJvulS2ufHSgu1qY7mZl1+raJKI05viZOlCi1nfb25JPAuHHxPacxwOHD8SX/gwdDJ3AA2LGj6uz3Dh20qj/4o1On6p83bhxf/EREAJM4eUCsxXbRENE3AllZOqKPVbjZgcmTgc2bqz6++Ua74AW3trW0aVM9sYdK+M2bRx8X1+mJ0gun04niEM+a+MGDwJYtmtity5ofpaWnPq5Zs9Aj+eDb2rQBXn6Z6/REqYLT6UQ2imd2oGnTqj3ytTl6FNi6tXpiD074K1dqZX9lZfXHNWyoVf3BjXcATeg//an2zG/aVE+uq+2ySZPEt9PlzACRvTgSJ/KY48d17b3mKH7KlLo/d+PGkZN9tJezZwO/+pXWHljcMDPANxbkBaxOJ0ozta3Tn3GGVswfPKjH0SbqMt4/HfXrA716aUJv3LiqNiGazyPdr0GD8H0C3Lo1kG8sqCZOpxOlmdqq+P/wB+C00xL7WlZ1f7gkX1sSOn5ce+mXl+tz7NhR9Xl5edVHzSWDaNSrFz7RL1hQfWYA0Nf68Y+1ADEzs+4fDRqE/3qkngNu6UjINxbRcfrnxJE4UQpx+g9KsNpmBrKztd9+OMZoX/6ayb2un3/5pR3faWwyMqon9bKy0G9YGjYELrlEDw5q3DjyRzT3a9RIXz8SzlhEH49dPydOpxORo9yYCMK9sVi7Vt84xPpx7Fh8j7M+nnmm9ngHD9Y3H9bHkSN6eexY/D+DzMzIif6TT6r/u1latNA6h+AZh0RdZmaGL6yM5vfJGP0ZWT+n4J9ZNJ/H+rhdu0IvK0XzRjUSJnEiclw6jZziFc+MxYkT1ZNJqEQf7iPSfRYvtvM7rp01SxEq0a9ff+ruC0ATf/PmVcm7LqzZjJqzGsHXgz9/9tnQzyMS33JQ9efgmjgROaygwF3rqHY0Dqqr2moZJk+u/TEZGVXbAu1Q2xuLLl30PIPgGYhkXX7zTehYKyuBG28Mn2yj+TzUGQmRfPBB7T+nZOFInIjIYZyxiKwuNRZ2ccOaeIJbOxARUawKCjQRVVbqpdOzFwUFmoiscwCys50vaps8WRNksEgzFnZzw8+JI3EiIvIEt81Y2Ilr4kRElFLcVmPhBpxOJyIi8igmcSIiIo9iEiciIvIoJnEiIiKPYhInIiLyKCZxIiIij2ISJyIi8igmcSIiIo9iEiciIvIoJnEiIiKP8kTvdBEpBRDi/JqU0hbALqeD8AD+nKLDn1Nk/BlFhz+n6CT655RtjGkX6U6eSOLpQER80TS7T3f8OUWHP6fI+DOKDn9O0XHq58TpdCIiIo9iEiciIvIoJnH3mOF0AB7Bn1N0+HOKjD+j6PDnFB1Hfk5cEyciIvIojsSJiIg8ikncYSJyhojMFZFVIrJCRO50Oia3EpEMEVkkIv92Oha3EpGWIjJTRFYHfqdynY7JjUTk54H/b8tF5BURaeR0TG4gIi+KyE4RWR50W2sRmS0iawKXrZyM0Q1q+Tn9KfD/bqmIvC0iLZMRC5O4844D+IUxpheAHAA/kf/f3v2EWFnFYRz/PjgS/slNkZhSUyRGSWpIREILTZASDVpYWEi1apG2KZO2EQURJkZRVhoNtTChNokyQRGZQWJFCQUlZo2phNk/zOxpcY8wzYxBoPe8b/f5wOWe98ww97kvzP3d8/45R7qqcqamWgPsqx2i4Z4Gttu+EphD9tcokqYDq4H5tmcD44Db66ZqjM3AkhF9DwODtmcCg2W7121m9H7aCcy2fQ3wJbCuG0FSxCuzPWR7T2n/TOdDd3rdVM0jaQZwC7CpdpamkjQFuBF4EcD2H7aP1U3VWH3ABEl9wETg+8p5GsH2e8CPI7qXA1tKewtwa1dDNdBY+8n2Dtt/ls0PgRndyJIi3iCS+oF5wO66SRppPfAQ8FftIA12OXAEeLmcdtgkaVLtUE1j+zvgSeAAMAT8ZHtH3VSNNtX2EHQGHcBFlfO0wT3A2914oRTxhpA0GXgDeMD28dp5mkTSUuCw7Y9rZ2m4PuBa4Fnb84BfyaHPUco53eXAZcDFwCRJd9ZNFf8Xkh6hc5p0oBuvlyLeAJLG0yngA7a31c7TQAuAZZL2A68DCyW9WjdSIx0EDto+fSRnK52iHv90E/CN7SO2TwLbgBsqZ2qyHyRNAyjPhyvnaSxJq4ClwEp36f7tFPHKJInOOcx9tp+qnaeJbK+zPcN2P50LkN6xnZHTCLYPAd9KmlW6FgFfVIzUVAeA6yVNLP9/i8gFgP/mLWBVaa8C3qyYpbEkLQHWAsts/9at100Rr28BcBed0eXe8ri5dqhorfuBAUmfAnOBxyrnaZxypGIrsAf4jM7nYGYlAyS9BuwCZkk6KOle4HFgsaSvgMVlu6edYT9tBM4HdpbP8ee6kiUztkVERLRTRuIREREtlSIeERHRUiniERERLZUiHhER0VIp4hERES2VIh7RAySdGnYL415JZ20mN0n9w1dzioju6asdICK64nfbc2uHiIizKyPxiB4mab+kJyR9VB5XlP5LJQ2WtZEHJV1S+qeWtZI/KY/T05WOk/RCWaN7h6QJ1d5URA9JEY/oDRNGHE5fMexnx21fR2fGqfWlbyPwSlkbeQDYUPo3AO/ankNnXvbPS/9M4BnbVwPHgNvO8fuJCDJjW0RPkPSL7clj9O8HFtr+uizEc8j2BZKOAtNsnyz9Q7YvlHQEmGH7xLC/0Q/stD2zbK8Fxtt+9Ny/s4jelpF4RPgM7TP9zlhODGufItfbRHRFinhErBj2vKu0P6CzYhzASuD90h4E7gOQNE7SlG6FjIjR8m05ojdMkLR32PZ226dvMztP0m46X+rvKH2rgZckPQgcAe4u/WuA58uqTafoFPShc54+IsaUc+IRPaycE59v+2jtLBHx3+VwekREREtlJB4REdFSGYlHRES0VIp4RERES6WIR0REtFSKeEREREuliEdERLRUinhERERL/Q376e0XPk6PiwAAAABJRU5ErkJggg==\n",
391 |       "text/plain": [
392 |        "<Figure size 576x432 with 1 Axes>"
393 |       ]
394 |      },
395 |      "metadata": {},
396 |      "output_type": "display_data"
397 |     }
398 |    ],
399 |    "source": [
400 |     "plot_history(history)"
401 |    ]
402 |   },
403 |   {
404 |    "cell_type": "markdown",
405 |    "metadata": {},
406 |    "source": [
407 |     "# Deep Neural Network Function\n",
408 |     "\n",
409 |     "Now we can take the code and put it into a function. The function will take in the standard training features, training targets, testing features, testing targets and return the training time, prediction (inference) time, and the mape on the test data. "
410 |    ]
411 |   },
412 |   {
413 |    "cell_type": "code",
414 |    "execution_count": 9,
415 |    "metadata": {},
416 |    "outputs": [],
417 |    "source": [
418 |     "def dnn_model(train, train_targets, test, test_targets, save_file):\n",
419 |     "    \"\"\"Train a deep neural network and make predictions\n",
420 |     "    \n",
421 |     "    Parameters\n",
422 |     "    --------\n",
423 |     "    train : dataframe, shape = [n_training_samples, n_features]\n",
424 |     "        Set of training features for training a model\n",
425 |     "    \n",
426 |     "    train_targets : array, shape = [n_training_samples]\n",
427 |     "        Array of training targets for training a model\n",
428 |     "        \n",
429 |     "    test : dataframe, shape = [n_testing_samples, n_features]\n",
430 |     "        Set of testing features for making predictions with a model\n",
431 |     "    \n",
432 |     "    test_targets : array, shape = [n_testing_samples]\n",
433 |     "        Array of testing targets for evaluating the model predictions\n",
434 |     "        \n",
435 |     "    save_file : string\n",
436 |     "        File name for saving the model weights. The model will be saved\n",
437 |     "        to the directory models with the save file in the h5 format: models/save_file.h5\n",
438 |     "        \n",
439 |     "    Returns\n",
440 |     "    --------\n",
441 |     "    \n",
442 |     "    results : array, shape = [4]\n",
443 |     "        Numpy array of results. \n",
444 |     "        First entry is the model, second is the training time,\n",
445 |     "        third is the testing time, and fourth is the MAPE. All entries\n",
446 |     "        are in strings and so will need to be converted to numbers.\n",
447 |     "    \n",
448 |     "    \"\"\"\n",
449 |     "    model = models.Sequential()\n",
450 |     "\n",
451 |     "    # Input layer\n",
452 |     "    model.add(layers.Dense(32, activation=\"relu\", input_shape = (train.shape[1], )))\n",
453 |     "    model.add(layers.Dropout(0.5))\n",
454 |     "    model.add(layers.BatchNormalization())\n",
455 |     "\n",
456 |     "    # Five hidden layers\n",
457 |     "    model.add(layers.Dense(64, activation = \"relu\"))\n",
458 |     "    model.add(layers.Dropout(0.5))\n",
459 |     "    model.add(layers.BatchNormalization())\n",
460 |     "    model.add(layers.Dense(128, activation = \"relu\"))\n",
461 |     "    model.add(layers.Dropout(0.5))\n",
462 |     "    model.add(layers.BatchNormalization())\n",
463 |     "    model.add(layers.Dense(256, activation = \"relu\"))\n",
464 |     "    model.add(layers.Dropout(0.5))\n",
465 |     "    model.add(layers.BatchNormalization())\n",
466 |     "    model.add(layers.Dense(512, activation = \"relu\"))\n",
467 |     "    model.add(layers.Dropout(0.5))\n",
468 |     "    model.add(layers.BatchNormalization())\n",
469 |     "    model.add(layers.Dense(1024, activation = \"relu\"))\n",
470 |     "    model.add(layers.BatchNormalization())\n",
471 |     "\n",
472 |     "    # Output layer\n",
473 |     "    model.add(layers.Dense(1, activation = None))\n",
474 |     "    \n",
475 |     "    # Define the optimizer\n",
476 |     "    opt = optimizers.Adam(lr = 0.001, beta_1 = 0.9, beta_2 = 0.999,\n",
477 |     "                          epsilon = None, decay = 0.0, amsgrad = False)\n",
478 |     "\n",
479 |     "    # Compile the model with specified optimizer\n",
480 |     "    model.compile(optimizer = opt, loss = \"mean_absolute_percentage_error\",\n",
481 |     "                  metrics = [\"mean_absolute_percentage_error\"])\n",
482 |     "    \n",
483 |     "    # Early stopping and model checkpoint\n",
484 |     "    callback_list = [callbacks.EarlyStopping(monitor = \"val_loss\", patience=10),\n",
485 |     "                     callbacks.ModelCheckpoint(filepath = \"models/%s.h5\" % save_file,\n",
486 |     "                                               monitor = \"val_loss\",\n",
487 |     "                                               save_best_only = True,\n",
488 |     "                                               save_weights_only = True)]\n",
489 |     "\n",
490 |     "    # Start the training timer\n",
491 |     "    start = timer()\n",
492 |     "    # Train the model\n",
493 |     "    history = model.fit(train, train_targets, batch_size = 32,\n",
494 |     "                        callbacks=callback_list, epochs = 100,\n",
495 |     "                        validation_split = 0.2)\n",
496 |     "    \n",
497 |     "    # Calculate the training time\n",
498 |     "    end = timer()\n",
499 |     "    train_time = end - start\n",
500 |     "    \n",
501 |     "    # Load the best validation model weights\n",
502 |     "    model.load_weights('models/%s.h5' % save_file)\n",
503 |     "\n",
504 |     "    # Start the testing time\n",
505 |     "    start = timer()\n",
506 |     "    # Make predictions on the test data\n",
507 |     "    predictions = model.predict(test)\n",
508 |     "    \n",
509 |     "    # Calculate the testing time\n",
510 |     "    end = timer()\n",
511 |     "    test_time = end - start\n",
512 |     "    \n",
513 |     "    # Calculate the mape\n",
514 |     "    mape = 100 * np.mean( abs(predictions - test_targets) / test_targets)\n",
515 |     "    \n",
516 |     "    return np.array(['dnn', train_time, test_time, mape])"
517 |    ]
518 |   },
519 |   {
520 |    "cell_type": "markdown",
521 |    "metadata": {},
522 |    "source": [
523 |     "## Interface with Other Models\n",
524 |     "\n",
525 |     "The `dnn_model` function has the standard inputs and outputs for models we defined in the project. Therefore it can be easily integrated into the `evaluate_models` function with the six Scikit-Learn models and the Gradient Boosting Machine. This is the final model we will add to the retinue of models to evaluate on the EDIFES data. The next step will be running this function across hundreds of buildings and recording the results."
526 |    ]
527 |   },
528 |   {
529 |    "cell_type": "code",
530 |    "execution_count": 10,
531 |    "metadata": {},
532 |    "outputs": [],
533 |    "source": [
534 |     "def evaluate_models(df):\n",
535 |     "    \"\"\"Evaluate machine learning models\n",
536 |     "    on a building energy dataset. More models can be added\n",
537 |     "    to the function as required. \n",
538 |     "    \n",
539 |     "    \n",
540 |     "    Parameters\n",
541 |     "    --------\n",
542 |     "    df : dataframe\n",
543 |     "        Building energy dataframe. Each row must have one observation\n",
544 |     "        and the columns must contain the features. The dataframe\n",
545 |     "        needs to have an \"elec_cons\" column to be used as targets. \n",
546 |     "    \n",
547 |     "    Return\n",
548 |     "    --------\n",
549 |     "    results : dataframe, shape = [n_models, 4]\n",
550 |     "        Modeling metrics. A dataframe with columns:\n",
551 |     "        model, train_time, test_time, mape. Used for comparing\n",
552 |     "        models for a given building dataset\n",
553 |     "        \n",
554 |     "    \"\"\"\n",
555 |     "    try:\n",
556 |     "        # Preprocess the data for machine learning\n",
557 |     "        train, train_targets, test, test_targets = preprocess_data(df, test_days = 183, scale = True)\n",
558 |     "    except Exception as e:\n",
559 |     "        print('Error processing data: ', e)\n",
560 |     "        return\n",
561 |     "        \n",
562 |     "    # elasticnet\n",
563 |     "    model = ElasticNet(alpha = 1.0, l1_ratio=0.5)\n",
564 |     "    elasticnet_results = implement_model(model, train, train_targets, test, \n",
565 |     "                                         test_targets, model_name = 'elasticnet')\n",
566 |     "    \n",
567 |     "    # knn\n",
568 |     "    model = KNeighborsRegressor()\n",
569 |     "    knn_results = implement_model(model, train, train_targets, test, \n",
570 |     "                                  test_targets, model_name = 'knn')\n",
571 |     "    \n",
572 |     "    # svm\n",
573 |     "    model = SVR()\n",
574 |     "    svm_results = implement_model(model, train, train_targets, test, \n",
575 |     "                                   test_targets, model_name = 'svm')\n",
576 |     "    \n",
577 |     "    # rf\n",
578 |     "    model = RandomForestRegressor(n_estimators = 100, n_jobs = -1)\n",
579 |     "    rf_results = implement_model(model, train, train_targets, test, \n",
580 |     "                                  test_targets, model_name = 'rf')\n",
581 |     "    \n",
582 |     "    # et\n",
583 |     "    model = ExtraTreesRegressor(n_estimators=100, n_jobs = -1)\n",
584 |     "    et_results = implement_model(model, train, train_targets, test, \n",
585 |     "                                  test_targets, model_name = 'et')\n",
586 |     "    \n",
587 |     "    # adaboost\n",
588 |     "    model = AdaBoostRegressor(n_estimators = 1000, learning_rate = 0.05, \n",
589 |     "                              loss = 'exponential')\n",
590 |     "    adaboost_results = implement_model(model, train, train_targets, test, \n",
591 |     "                                       test_targets, model_name = 'adaboost')\n",
592 |     "    \n",
593 |     "    # gbm\n",
594 |     "    gbm_results = gbm_model(train, train_targets, test, test_targets)\n",
595 |     "    \n",
596 |     "    dnn_results = dnn_model(train, train_targets, test, test_targets, save_file = 'dnn_model')\n",
597 |     "    \n",
598 |     "    # Put the results into a single array (stack the rows)\n",
599 |     "    results = np.vstack((elasticnet_results, knn_results, svm_results,\n",
600 |     "                         rf_results, et_results, adaboost_results,\n",
601 |     "                         gbm_results, dnn_results))\n",
602 |     "    \n",
603 |     "    # Convert the results to a dataframe\n",
604 |     "    results = pd.DataFrame(results, columns = ['model', 'train_time', 'test_time', 'mape'])\n",
605 |     "    \n",
606 |     "    # Convert the numeric results to numbers\n",
607 |     "    results.iloc[:, 1:] = results.iloc[:, 1:].astype(np.float32)\n",
608 |     "    \n",
609 |     "    return results"
610 |    ]
611 |   },
612 |   {
613 |    "cell_type": "markdown",
614 |    "metadata": {},
615 |    "source": [
616 |     "# Conclusions\n",
617 |     "\n",
618 |     "In this notebook we built a deep neural network for use on the building prediction problem. We walked through the steps and the reasoning behind the design choices. The final model was then implemented in a function with the same set of standard inputs and outputs used throughout the project. We will now be able to use this function to evaluate all of the models on hundreds of buildings and choose the best model for further development. I will see you in the next notebook! "
619 |    ]
620 |   }
621 |  ],
622 |  "metadata": {
623 |   "kernelspec": {
624 |    "display_name": "Python 3",
625 |    "language": "python",
626 |    "name": "python3"
627 |   },
628 |   "language_info": {
629 |    "codemirror_mode": {
630 |     "name": "ipython",
631 |     "version": 3
632 |    },
633 |    "file_extension": ".py",
634 |    "mimetype": "text/x-python",
635 |    "name": "python",
636 |    "nbconvert_exporter": "python",
637 |    "pygments_lexer": "ipython3",
638 |    "version": "3.6.5"
639 |   }
640 |  },
641 |  "nbformat": 4,
642 |  "nbformat_minor": 2
643 | }
644 | 


--------------------------------------------------------------------------------
/Gradient Boosting Machine.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |  "cells": [
   3 |   {
   4 |    "cell_type": "markdown",
   5 |    "metadata": {},
   6 |    "source": [
   7 |     "# Introduction: Gradient Boosting Machine\n",
   8 |     "\n",
   9 |     "Primary Sources: [Tree Boosting with XGBoost](https://brage.bibsys.no/xmlui/bitstream/handle/11250/2433761/16128_FULLTEXT.pdf?sequence=1&isAllowed=y), [Greedy Function Approximation: A Gradient Boosting Machine](http://statweb.stanford.edu/~jhf/ftp/trebst.pdf), and [Hands-On Machine Learning with Scikit-Learn and TensorFlow](http://shop.oreilly.com/product/0636920052289.do)\n",
  10 |     "\n",
  11 |     "The gradient boosting machine is an ensemble machine learning model that has risen to prominence in recent years due to exceptional performance on structured machine learning tasks. The gradient boosting machine not surprisingly is a boosting model, one of the two types of ensembles:\n",
  12 |     "\n",
  13 |     "* bagging (bootstrap aggregating): train individual learners independently and make a prediction by averaging individual predictions\n",
  14 |     "* boosting: train individual learners in sequence, with each individual learning from the mistakes of the previous. Predictions are made by weighting the predictions inversely proportional to the error of the individual. \n",
  15 |     "\n",
  16 |     "The modeling notebook showed examples of both ensembles: random forest and extra trees are bagging methods while AdaBoost (adaptive boosting) is a boosting ensemble. Whereas in adaptive boosting, the observations are re-weighted each iteration according to the magnitude of their residuals (with larger residuals receiving greater weight), in gradient boosting, the learners are trained directly on the residuals of the entire ensemble. Both methods are teaching each successive learner to focus on the errors of the previous learners, but using different methods. Furthermore, while adaptive boosting weights the individuals according to their performance on the training set, the contribution of each learner in gradient boosting is learned through an iterative gradient descent optimization method. The ultimate goal of machine learning is to approximate a function that maps the features to the targets, and gradient boosting can therefore be thought of as gradient descent in function space. \n",
  17 |     "\n",
  18 |     "Gradient Boosting is a general method that can be applied to any differentiable objective function and use any individual models (called weak learners). The most common weak learner is the decision tree leading to particular type of gradient boosting known as Gradient Boosted Regression Trees (GBRT). There are several open-source libraries for implementing Gradient Boosting in Python including [Scikit-Learn](http://scikit-learn.org/stable/index.html), [LightGBM](http://lightgbm.readthedocs.io/en/latest/), [XGBoost](https://xgboost.readthedocs.io/en/latest/), and [CatBoost](https://catboost.yandex/). Although Scikit-Learn is a go-to library for many machine learning algorithms, its version of the Gradient Boosting model is [less performant](https://datascience.stackexchange.com/questions/10943/why-is-xgboost-so-much-faster-than-sklearn-gradientboostingclassifier) with fewer customizations than the other options. In this notebook, we will implement a gradient boosting machine using the [LightGBM library](https://github.com/Microsoft/LightGBM) from Microsoft. This library includes a number of variations on the gradient boosting framework including [Dropout meets Multiple Addivive Regression Trees (DART)](https://arxiv.org/abs/1505.01866) and [Gradient-based One Sided Sampling (GOSS)](https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf). We will stick to using the Gradient Boosting Regression Trees method. \n",
  19 |     "\n",
  20 |     "This notebook will cover some of the best practices for building a gradient boosting machine collected from multiple machine learning competitions and published research. We will put these practices to use by building a gradient boosting model step by step for the supervised regression building energy prediction problem. The end outcome is a function that will allow us to evaluate the gradient boosting machine alongside the other models developed in the [Modeling notebook](https://bitbucket.org/willkoehrsen/prediction-documentation/src/670f76b3c327aed7916cd2d73f3f84dec8d71c98/notebooks/Modeling.ipynb?at=master&fileviewer=file-view-default). These models will be tested on hundreds of building datasets to determine the most promising model for further development. "
  21 |    ]
  22 |   },
  23 |   {
  24 |    "cell_type": "markdown",
  25 |    "metadata": {},
  26 |    "source": [
  27 |     "## Imports\n",
  28 |     "\n",
  29 |     "We will use a standard stack of data science tools along with the [LightGBM library](https://lightgbm.readthedocs.io/en/latest/): `pandas`, `numpy`, `matplotlib`, `seaborn`, `sklearn`, `lightgbm`. We also import `gc` for memory management (garbage collection), `warnings` to filer out warnings from `pandas`, and the `preprocess_data` function we wrote in the [Preprocessing notebook](https://bitbucket.org/willkoehrsen/prediction-documentation/src/670f76b3c327aed7916cd2d73f3f84dec8d71c98/notebooks/Preprocessing.ipynb?at=master&fileviewer=file-view-default). Please refer to the `requirements.txt` file for the correct version of the packages to install. "
  30 |    ]
  31 |   },
  32 |   {
  33 |    "cell_type": "code",
  34 |    "execution_count": 1,
  35 |    "metadata": {},
  36 |    "outputs": [],
  37 |    "source": [
  38 |     "# pandas and numpy for data manipulation\n",
  39 |     "import pandas as pd\n",
  40 |     "import numpy as np\n",
  41 |     "\n",
  42 |     "# matplotlib and seaborn for plotting\n",
  43 |     "import matplotlib.pyplot as plt\n",
  44 |     "import seaborn as sns\n",
  45 |     "\n",
  46 |     "# Suppress warnings from pandas\n",
  47 |     "import warnings\n",
  48 |     "warnings.filterwarnings('ignore')\n",
  49 |     "\n",
  50 |     "plt.style.use('fivethirtyeight')\n",
  51 |     "\n",
  52 |     "# Modeling library\n",
  53 |     "import lightgbm as lgb\n",
  54 |     "\n",
  55 |     "# Using KFold cross validation\n",
  56 |     "from sklearn.model_selection import KFold\n",
  57 |     "\n",
  58 |     "# Encoding categorical features\n",
  59 |     "from sklearn.preprocessing import LabelEncoder\n",
  60 |     "\n",
  61 |     "# Scikit-Learn Machine Learning models\n",
  62 |     "from sklearn.linear_model import ElasticNet\n",
  63 |     "from sklearn.neighbors import KNeighborsRegressor\n",
  64 |     "from sklearn.svm import SVR\n",
  65 |     "from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor\n",
  66 |     "\n",
  67 |     "# Memory management\n",
  68 |     "import gc\n",
  69 |     "\n",
  70 |     "# Prepocess data for machine learning\n",
  71 |     "from utilities import preprocess_data, implement_model\n",
  72 |     "\n",
  73 |     "# Timing utility\n",
  74 |     "from timeit import default_timer as timer"
  75 |    ]
  76 |   },
  77 |   {
  78 |    "cell_type": "markdown",
  79 |    "metadata": {},
  80 |    "source": [
  81 |     "#### Read in Example Data and Preprocess\n",
  82 |     "\n",
  83 |     "In this notebook we will work with two example datasets to illustrate the process of building and using the model. The preprocessing is done using the function developed in the Preprocessing notebook. As with the other six models demonstrated in the Modeling notebook, the gradient boosting machine will be eventually be run on hundreds of buildings in order to get an accurate assessment of its performance. "
  84 |    ]
  85 |   },
  86 |   {
  87 |    "cell_type": "code",
  88 |    "execution_count": 2,
  89 |    "metadata": {},
  90 |    "outputs": [
  91 |     {
  92 |      "data": {
  93 |       "text/html": [
  94 |        "<div>\n",
  95 |        "<style scoped>\n",
  96 |        "    .dataframe tbody tr th:only-of-type {\n",
  97 |        "        vertical-align: middle;\n",
  98 |        "    }\n",
  99 |        "\n",
 100 |        "    .dataframe tbody tr th {\n",
 101 |        "        vertical-align: top;\n",
 102 |        "    }\n",
 103 |        "\n",
 104 |        "    .dataframe thead th {\n",
 105 |        "        text-align: right;\n",
 106 |        "    }\n",
 107 |        "</style>\n",
 108 |        "<table border=\"1\" class=\"dataframe\">\n",
 109 |        "  <thead>\n",
 110 |        "    <tr style=\"text-align: right;\">\n",
 111 |        "      <th></th>\n",
 112 |        "      <th>timestamp</th>\n",
 113 |        "      <th>biz_day</th>\n",
 114 |        "      <th>week_day_end</th>\n",
 115 |        "      <th>ghi</th>\n",
 116 |        "      <th>dif</th>\n",
 117 |        "      <th>gti</th>\n",
 118 |        "      <th>temp</th>\n",
 119 |        "      <th>rh</th>\n",
 120 |        "      <th>pwat</th>\n",
 121 |        "      <th>ws</th>\n",
 122 |        "      <th>...</th>\n",
 123 |        "      <th>yday_cos</th>\n",
 124 |        "      <th>month_sin</th>\n",
 125 |        "      <th>month_cos</th>\n",
 126 |        "      <th>wday_sin</th>\n",
 127 |        "      <th>wday_cos</th>\n",
 128 |        "      <th>num_time_sin</th>\n",
 129 |        "      <th>num_time_cos</th>\n",
 130 |        "      <th>sun_rise_set_neither</th>\n",
 131 |        "      <th>sun_rise_set_rise</th>\n",
 132 |        "      <th>sun_rise_set_set</th>\n",
 133 |        "    </tr>\n",
 134 |        "  </thead>\n",
 135 |        "  <tbody>\n",
 136 |        "    <tr>\n",
 137 |        "      <th>0</th>\n",
 138 |        "      <td>0.000000</td>\n",
 139 |        "      <td>1.0</td>\n",
 140 |        "      <td>0.0</td>\n",
 141 |        "      <td>0.0</td>\n",
 142 |        "      <td>0.0</td>\n",
 143 |        "      <td>0.0</td>\n",
 144 |        "      <td>0.394397</td>\n",
 145 |        "      <td>0.245303</td>\n",
 146 |        "      <td>0.069231</td>\n",
 147 |        "      <td>0.295082</td>\n",
 148 |        "      <td>...</td>\n",
 149 |        "      <td>0.629749</td>\n",
 150 |        "      <td>0.066987</td>\n",
 151 |        "      <td>0.75</td>\n",
 152 |        "      <td>1.0</td>\n",
 153 |        "      <td>0.25</td>\n",
 154 |        "      <td>0.500000</td>\n",
 155 |        "      <td>1.000000</td>\n",
 156 |        "      <td>1.0</td>\n",
 157 |        "      <td>0.0</td>\n",
 158 |        "      <td>0.0</td>\n",
 159 |        "    </tr>\n",
 160 |        "    <tr>\n",
 161 |        "      <th>1</th>\n",
 162 |        "      <td>0.000011</td>\n",
 163 |        "      <td>1.0</td>\n",
 164 |        "      <td>0.0</td>\n",
 165 |        "      <td>0.0</td>\n",
 166 |        "      <td>0.0</td>\n",
 167 |        "      <td>0.0</td>\n",
 168 |        "      <td>0.390086</td>\n",
 169 |        "      <td>0.250522</td>\n",
 170 |        "      <td>0.067949</td>\n",
 171 |        "      <td>0.295082</td>\n",
 172 |        "      <td>...</td>\n",
 173 |        "      <td>0.629749</td>\n",
 174 |        "      <td>0.066987</td>\n",
 175 |        "      <td>0.75</td>\n",
 176 |        "      <td>1.0</td>\n",
 177 |        "      <td>0.25</td>\n",
 178 |        "      <td>0.533050</td>\n",
 179 |        "      <td>0.998907</td>\n",
 180 |        "      <td>1.0</td>\n",
 181 |        "      <td>0.0</td>\n",
 182 |        "      <td>0.0</td>\n",
 183 |        "    </tr>\n",
 184 |        "    <tr>\n",
 185 |        "      <th>2</th>\n",
 186 |        "      <td>0.000022</td>\n",
 187 |        "      <td>1.0</td>\n",
 188 |        "      <td>0.0</td>\n",
 189 |        "      <td>0.0</td>\n",
 190 |        "      <td>0.0</td>\n",
 191 |        "      <td>0.0</td>\n",
 192 |        "      <td>0.387931</td>\n",
 193 |        "      <td>0.256785</td>\n",
 194 |        "      <td>0.067949</td>\n",
 195 |        "      <td>0.295082</td>\n",
 196 |        "      <td>...</td>\n",
 197 |        "      <td>0.629749</td>\n",
 198 |        "      <td>0.066987</td>\n",
 199 |        "      <td>0.75</td>\n",
 200 |        "      <td>1.0</td>\n",
 201 |        "      <td>0.25</td>\n",
 202 |        "      <td>0.565955</td>\n",
 203 |        "      <td>0.995631</td>\n",
 204 |        "      <td>1.0</td>\n",
 205 |        "      <td>0.0</td>\n",
 206 |        "      <td>0.0</td>\n",
 207 |        "    </tr>\n",
 208 |        "    <tr>\n",
 209 |        "      <th>3</th>\n",
 210 |        "      <td>0.000034</td>\n",
 211 |        "      <td>1.0</td>\n",
 212 |        "      <td>0.0</td>\n",
 213 |        "      <td>0.0</td>\n",
 214 |        "      <td>0.0</td>\n",
 215 |        "      <td>0.0</td>\n",
 216 |        "      <td>0.383621</td>\n",
 217 |        "      <td>0.262004</td>\n",
 218 |        "      <td>0.067949</td>\n",
 219 |        "      <td>0.303279</td>\n",
 220 |        "      <td>...</td>\n",
 221 |        "      <td>0.629749</td>\n",
 222 |        "      <td>0.066987</td>\n",
 223 |        "      <td>0.75</td>\n",
 224 |        "      <td>1.0</td>\n",
 225 |        "      <td>0.25</td>\n",
 226 |        "      <td>0.598572</td>\n",
 227 |        "      <td>0.990187</td>\n",
 228 |        "      <td>1.0</td>\n",
 229 |        "      <td>0.0</td>\n",
 230 |        "      <td>0.0</td>\n",
 231 |        "    </tr>\n",
 232 |        "    <tr>\n",
 233 |        "      <th>4</th>\n",
 234 |        "      <td>0.000045</td>\n",
 235 |        "      <td>1.0</td>\n",
 236 |        "      <td>0.0</td>\n",
 237 |        "      <td>0.0</td>\n",
 238 |        "      <td>0.0</td>\n",
 239 |        "      <td>0.0</td>\n",
 240 |        "      <td>0.379310</td>\n",
 241 |        "      <td>0.268267</td>\n",
 242 |        "      <td>0.067949</td>\n",
 243 |        "      <td>0.303279</td>\n",
 244 |        "      <td>...</td>\n",
 245 |        "      <td>0.629749</td>\n",
 246 |        "      <td>0.066987</td>\n",
 247 |        "      <td>0.75</td>\n",
 248 |        "      <td>1.0</td>\n",
 249 |        "      <td>0.25</td>\n",
 250 |        "      <td>0.630758</td>\n",
 251 |        "      <td>0.982600</td>\n",
 252 |        "      <td>1.0</td>\n",
 253 |        "      <td>0.0</td>\n",
 254 |        "      <td>0.0</td>\n",
 255 |        "    </tr>\n",
 256 |        "  </tbody>\n",
 257 |        "</table>\n",
 258 |        "<p>5 rows × 21 columns</p>\n",
 259 |        "</div>"
 260 |       ],
 261 |       "text/plain": [
 262 |        "   timestamp  biz_day  week_day_end  ghi  dif  gti      temp        rh  \\\n",
 263 |        "0   0.000000      1.0           0.0  0.0  0.0  0.0  0.394397  0.245303   \n",
 264 |        "1   0.000011      1.0           0.0  0.0  0.0  0.0  0.390086  0.250522   \n",
 265 |        "2   0.000022      1.0           0.0  0.0  0.0  0.0  0.387931  0.256785   \n",
 266 |        "3   0.000034      1.0           0.0  0.0  0.0  0.0  0.383621  0.262004   \n",
 267 |        "4   0.000045      1.0           0.0  0.0  0.0  0.0  0.379310  0.268267   \n",
 268 |        "\n",
 269 |        "       pwat        ws        ...         yday_cos  month_sin  month_cos  \\\n",
 270 |        "0  0.069231  0.295082        ...         0.629749   0.066987       0.75   \n",
 271 |        "1  0.067949  0.295082        ...         0.629749   0.066987       0.75   \n",
 272 |        "2  0.067949  0.295082        ...         0.629749   0.066987       0.75   \n",
 273 |        "3  0.067949  0.303279        ...         0.629749   0.066987       0.75   \n",
 274 |        "4  0.067949  0.303279        ...         0.629749   0.066987       0.75   \n",
 275 |        "\n",
 276 |        "   wday_sin  wday_cos  num_time_sin  num_time_cos  sun_rise_set_neither  \\\n",
 277 |        "0       1.0      0.25      0.500000      1.000000                   1.0   \n",
 278 |        "1       1.0      0.25      0.533050      0.998907                   1.0   \n",
 279 |        "2       1.0      0.25      0.565955      0.995631                   1.0   \n",
 280 |        "3       1.0      0.25      0.598572      0.990187                   1.0   \n",
 281 |        "4       1.0      0.25      0.630758      0.982600                   1.0   \n",
 282 |        "\n",
 283 |        "   sun_rise_set_rise  sun_rise_set_set  \n",
 284 |        "0                0.0               0.0  \n",
 285 |        "1                0.0               0.0  \n",
 286 |        "2                0.0               0.0  \n",
 287 |        "3                0.0               0.0  \n",
 288 |        "4                0.0               0.0  \n",
 289 |        "\n",
 290 |        "[5 rows x 21 columns]"
 291 |       ]
 292 |      },
 293 |      "execution_count": 2,
 294 |      "metadata": {},
 295 |      "output_type": "execute_result"
 296 |     }
 297 |    ],
 298 |    "source": [
 299 |     "# Read in example data\n",
 300 |     "df = pd.read_csv('../data/f-APS_weather.csv')\n",
 301 |     "\n",
 302 |     "# Preprocess for machine learning\n",
 303 |     "train, targets, test, test_targets = preprocess_data(df)\n",
 304 |     "\n",
 305 |     "train.head()"
 306 |    ]
 307 |   },
 308 |   {
 309 |    "cell_type": "markdown",
 310 |    "metadata": {},
 311 |    "source": [
 312 |     "The data is now ready for machine learning: all of the features are numeric, there are no missing values, and the features have been scaled between 0 and 1. The [gradient boosting machine can handle missing values](https://github.com/Microsoft/LightGBM/issues/122) and does not need scaled features, but because we are comparing performance to other models, we will use the same standard set of features. "
 313 |    ]
 314 |   },
 315 |   {
 316 |    "cell_type": "markdown",
 317 |    "metadata": {},
 318 |    "source": [
 319 |     "### Define Evaluation Metric\n",
 320 |     "\n",
 321 |     "The evaluation metric we selected is the Mean Absolute Percentage Error (MAPE). The definition of MAPE is the average of the absolute residuals divided by the true values: \n",
 322 |     "\n",
 323 |     "$$\\mbox{MAPE} = 100\\% * \\frac{1}{n}\\sum_{i=1}^n  \\left|\\frac{y_i-\\hat{y}_i}{y_i}\\right|$$\n",
 324 |     "\n",
 325 |     "In order to use this evaluation metric in LightGBM, we need to write a [custom evaluation metric function](https://github.com/Microsoft/LightGBM/issues/284) that takes in the true targets and the predictions. The output of this function must be three arguments: \n",
 326 |     "\n",
 327 |     "* A `string` representing the name of the metric\n",
 328 |     "* A `float` for the value of the metric\n",
 329 |     "* A `boolean` stating if a higher value is better (which is False because a lower MAPE is better)\n",
 330 |     "\n",
 331 |     "We can pass in this function to the model during training to be used as the evaluation metric for early stopping. "
 332 |    ]
 333 |   },
 334 |   {
 335 |    "cell_type": "code",
 336 |    "execution_count": 3,
 337 |    "metadata": {},
 338 |    "outputs": [],
 339 |    "source": [
 340 |     "# Evaluation metric\n",
 341 |     "def mape(true, predictions):\n",
 342 |     "    \"\"\"Calculates the mean absolute percentage error given the true\n",
 343 |     "    values and predictions. Return is formatted for a LightGBM custom evaluation metric.\"\"\"\n",
 344 |     "    \n",
 345 |     "    return 'mape', 100 * np.mean(abs(true - predictions) / true), False"
 346 |    ]
 347 |   },
 348 |   {
 349 |    "cell_type": "markdown",
 350 |    "metadata": {},
 351 |    "source": [
 352 |     "# LightGBM Regression Model\n",
 353 |     "\n",
 354 |     "Again these sources are extremely helpful ([Tree Boosting with XGBoost](https://brage.bibsys.no/xmlui/bitstream/handle/11250/2433761/16128_FULLTEXT.pdf?sequence=1&isAllowed=y), [Greedy Function Approximation: A Gradient Boosting Machine](http://statweb.stanford.edu/~jhf/ftp/trebst.pdf), and [Hands-On Machine Learning with Scikit-Learn and TensorFlow](http://shop.oreilly.com/product/0636920052289.do))\n",
 355 |     "\n",
 356 |     "A gradient boosting machine has many hyperparameters to tune. In this notebook, and for evaluation of all buildings, we will use a standard set of hyperparameters (for all but one of the settings) selected by analyzing best practices recommended by research papers and the documentation for the library. Moreover, we will use early stopping to determine one of the most important hyperparameters, the number of decision trees trained (this is also known as the number of boosting rounds or the number of iterations). \n",
 357 |     "\n",
 358 |     "## Gradient Boosting Machine Hyperparameters \n",
 359 |     "\n",
 360 |     "There are two sets of hyperparameters in a gradient boosting machine: those that apply to the overall ensemble, and those that apply to the individual learners in the ensemble. These hyperparameters are used to control the bias/variance tradeoff: higher variance leads to overfitting while high bias leads to underfitting. Both situations decrease generalization performance on the test set.\n",
 361 |     "\n",
 362 |     "\n",
 363 |     "### Ensemble Hyperparameters\n",
 364 |     "\n",
 365 |     "The two main hyperparameters that pertain to the entire ensemble are:\n",
 366 |     "\n",
 367 |     "* `n_estimators`: number of base learners used, equivalently the number of boosting rounds. More estimators decreases the bias but increases the variance which can lead to overfitting. This hyperparameter can be set using early stopping. \n",
 368 |     "* `learning_rate`: the contribution of each new learner to the ensemble. A larger learning rate will increase the rate of convergence but may lead the algorithm to jump around the optimum (lowest point) of the objective function. A lower learning rate will lead to longer training times but can improve generalization on the test set. \n",
 369 |     "\n",
 370 |     "There are a few other hyperparameters that deal with the entire model. However, we will only set one other different than the default:\n",
 371 |     "\n",
 372 |     "* `subsample`: the fraction of observations to use for training each base learner. By default the GBM uses all of the training examples, but by randomly samping a subset of the observations, the variance of the model can be reduced. \n",
 373 |     "\n",
 374 |     "We will use the following set of ensemble hyperparameters (all others are kept at the defaults):\n",
 375 |     "\n",
 376 |     "* `n_estimators=10000`: this number of base learners will likely not be reached because of early stopping\n",
 377 |     "* `learning_rate=0.01`: This is a decrease of three-quarters from the default of 1.0\n",
 378 |     "* `subsample=0.9`: sample 90% of the training examples for training\n",
 379 |     "\n",
 380 |     "The number of estimators will not actually be 10,000 for many of the training runs because we will use early stopping to determine the ideal value. \n",
 381 |     "\n",
 382 |     "\n",
 383 |     "### Early Stopping\n",
 384 |     "\n",
 385 |     "Early stopping continues adding learners to the ensemble until the error on a validation set has not decreased for a set number of iterations. For example, here we will use early stopping by adding more learners until the MAPE on the validation set has not decreased for 100 iterations. The model records the number of estimators that resulted in the lowest error and then this number of estimators is used to make predictions on the test set.\n",
 386 |     "\n",
 387 |     "The concept of early stopping is illustrated in the following image:\n",
 388 |     "\n",
 389 |     "![image](../images/early_stopping.png)\n",
 390 |     "\n",
 391 |     "\n",
 392 |     "Early stopping is an effective technique for choosing the ideal number of base learners and is commonly used for training gradient boosting machines. Early stopping greatly simplifies the process of finding the optimal number of iterations. Using early stopping does require a validation set though, which reduces the amount of training data. However, we can get around this limitation by selecting the number of iterations using the technique of cross validation.\n",
 393 |     "\n",
 394 |     "\n",
 395 |     "### Cross Validation\n",
 396 |     "\n",
 397 |     "Cross validation (cv) is another best practice in machine learning. Often, the ideal model hyperparameters require a validation set to determine. The validation set must be drawn from the __training data__ and not the testing data. Splitting the training data into two sets though reduces the amount of training data, which can have a detrimental effect on performance. Moreover, optimizing the hyperparameters for a single validation set can just lead to overfitting on the _validation data_. Therefore, a technique known as cross-validation is used to avoid the need for a separate validation set and to better predict the generalization error of the model. The most common implementation of cross validation is K-Fold Cross Validation. \n",
 398 |     "\n",
 399 |     "#### K-Fold Cross Validation\n",
 400 |     "\n",
 401 |     "In K-Fold cross validation, the training data is split into K folds and then training proceeds in an iterative manner. On each iteration, the model is trained on K-1 of the sets and tested on the Kth set. This continues for K iterations until eventually all of the training data has been used as a validation set. The final estimate of the performance of the model is the average performance across the K validation scores. This method eliminates the need to split the valuable training data and usually provides a better measure of the generalization performance than using one validation set. The final model is then trained on the entire training data before making predictions. \n",
 402 |     "\n",
 403 |     "Below is an image of using 5-fold cv:\n",
 404 |     "\n",
 405 |     "![image](../images/5-fold.png)\n",
 406 |     "\n",
 407 |     "[Source](https://tex.stackexchange.com/questions/429451/k-fold-cross-validation-figure-using-tikz-or-table)\n",
 408 |     "\n",
 409 |     "We will use K-Fold cross validation in order to determine the ideal number of base learners. We will split the data into 5 folds, and each fold, train until the MAPE on the validation fold does not decrease for 100 iterations. Each iteration, the best number of learners is recorded, and the final number of learners used is the average of the best number from the 5 validation runs. After completing 5-fold cv, the model is trained on the entire set of training data using the optimal number of iterations. Ideally this procedure will find a near-optimal number of iterations that will result in high generalization performance on the test set. \n",
 410 |     "\n",
 411 |     "While K-Fold cv can be used to select the number of estimators, the other hyperparameters will have to be set at a constant value for all of the evaluation. The other main hyperparameter governing the entire model is the learning rate which will be set at 0.01. Generally, a smaller learning rate is used to complement a larger number of estimators. The maximum number of estimators is 10000, and although this number likely will not be reached because of early stopping, the learning rate is set at 0.01 to prevent the model from \"jumping\" around the optimum value of the objective function. \n",
 412 |     "\n",
 413 |     "\n",
 414 |     "## Regularization and Individual Learner Hyperparameters\n",
 415 |     "\n",
 416 |     "The Gradient Boosting Machine can be regularized both on the ensemble level and the base learner level. To regularize the model on the ensemble level, the number of estimators can be reduced. On the individual learner level, there are a number of hyperparameters that can be adjusted. We can limit the complexity of each tree by limiting the maximum depth, establishing a minimum number of observations required in each leaf node, or limiting the maximum number of leaf nodes among other methods. We can also penalize the complexity of the trees by adding terms to the objective function proportional to the L1 and L2 norm of the weights of the tree. We will impose regularization on the individual learners by setting the following hyperparameters:\n",
 417 |     "\n",
 418 |     "* `reg_alpha=0.1`: Penalty on L1 norm of the tree weights\n",
 419 |     "* `reg_lambda=0.1`: Penality on L2 norm of the tree weights\n",
 420 |     "\n",
 421 |     "The final hyperparameters we set in the call to the model are:\n",
 422 |     "\n",
 423 |     "* `n_jobs=-1`: Use all available cores on the machine for training\n",
 424 |     "\n",
 425 |     "__All of the rest of the hyperparamters are set at the defaults.__ These defaults should be reviewed and can be [ound in the documentation](https://lightgbm.readthedocs.io/en/latest/Python-API.html#scikit-learn-api)"
 426 |    ]
 427 |   },
 428 |   {
 429 |    "cell_type": "markdown",
 430 |    "metadata": {},
 431 |    "source": [
 432 |     "### Model Set Up\n",
 433 |     "\n",
 434 |     "Here we build the model with the specified hyperparameters. In this run I also set the `random_state` to ensure consistent results across runs. "
 435 |    ]
 436 |   },
 437 |   {
 438 |    "cell_type": "code",
 439 |    "execution_count": 4,
 440 |    "metadata": {},
 441 |    "outputs": [
 442 |     {
 443 |      "data": {
 444 |       "text/plain": [
 445 |        "LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,\n",
 446 |        "       learning_rate=0.01, max_depth=-1, min_child_samples=20,\n",
 447 |        "       min_child_weight=0.001, min_split_gain=0.0, n_estimators=10000,\n",
 448 |        "       n_jobs=-1, num_leaves=31, objective=None, random_state=100,\n",
 449 |        "       reg_alpha=0.1, reg_lambda=0.1, silent=True, subsample=0.9,\n",
 450 |        "       subsample_for_bin=200000, subsample_freq=1)"
 451 |       ]
 452 |      },
 453 |      "execution_count": 4,
 454 |      "metadata": {},
 455 |      "output_type": "execute_result"
 456 |     }
 457 |    ],
 458 |    "source": [
 459 |     "# Create the model with specified hyperparameters\n",
 460 |     "model = lgb.LGBMRegressor(n_estimators=10000,\n",
 461 |     "                          learning_rate = 0.01, \n",
 462 |     "                          reg_alpha = 0.1, reg_lambda = 0.1, \n",
 463 |     "                          subsample = 0.9, n_jobs = -1,\n",
 464 |     "                          random_state=100)\n",
 465 |     "\n",
 466 |     "model"
 467 |    ]
 468 |   },
 469 |   {
 470 |    "cell_type": "markdown",
 471 |    "metadata": {},
 472 |    "source": [
 473 |     "### Cross Validation Applied\n",
 474 |     "\n",
 475 |     "We will use five-fold cross validation for early stopping. To actually split the data, we have to convert the features into numpy arrays. \n",
 476 |     "\n",
 477 |     "In the code below, we make the [`KFold` object](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html), and then show a single iteration of splitting the data. "
 478 |    ]
 479 |   },
 480 |   {
 481 |    "cell_type": "code",
 482 |    "execution_count": 5,
 483 |    "metadata": {},
 484 |    "outputs": [
 485 |     {
 486 |      "name": "stdout",
 487 |      "output_type": "stream",
 488 |      "text": [
 489 |       "Training Data Shape for Fold:  (71416, 21)\n",
 490 |       "Validation Data Shape for Fold:  (17855, 21)\n"
 491 |      ]
 492 |     }
 493 |    ],
 494 |    "source": [
 495 |     "# Kfold cross validation\n",
 496 |     "kfold = KFold(n_splits=5)\n",
 497 |     "\n",
 498 |     "# Split the training data\n",
 499 |     "for i, (train_indices, valid_indices) in enumerate(kfold.split(np.array(train))):\n",
 500 |     "    \n",
 501 |     "    if i > 0: \n",
 502 |     "        break\n",
 503 |     "    \n",
 504 |     "    # Training data for fold\n",
 505 |     "    train_features, train_targets = np.array(train)[train_indices], targets[train_indices]\n",
 506 |     "    \n",
 507 |     "    # Validation data for fold\n",
 508 |     "    valid_features, valid_targets = np.array(train)[valid_indices], targets[valid_indices]\n",
 509 |     "\n",
 510 |     "    print('Training Data Shape for Fold: ', train_features.shape)\n",
 511 |     "    print('Validation Data Shape for Fold: ', valid_features.shape)"
 512 |    ]
 513 |   },
 514 |   {
 515 |    "cell_type": "markdown",
 516 |    "metadata": {},
 517 |    "source": [
 518 |     "Each fold will be used for validation exactly once. The validation data is used to determine the number of estimators (rounds of boosting) used by the model on each fold. The number of estimators used in the final full round of training will be the average of the best number of estimators returned from each round of cross validation. "
 519 |    ]
 520 |   },
 521 |   {
 522 |    "cell_type": "markdown",
 523 |    "metadata": {},
 524 |    "source": [
 525 |     "# Training the Model\n",
 526 |     "\n",
 527 |     "To illustrate what will happen on each fold, we can train the model on a single fold using the validation data for early stopping. We pass the model:\n",
 528 |     "\n",
 529 |     "* `X = train_features`: training features\n",
 530 |     "* `y = train_targets`: training targets\n",
 531 |     "* `early_stopping_rounds = 100`: Stop training after the validation metric has not decreased for 100 rounds\n",
 532 |     "* `eval_metric = mape`: evaluation metric used for early stopping. The model will stop training when this metric does not decrease for the specified number of early stopping rounds.\n",
 533 |     "* `eval_set = [(valid_features, valid_labels),(train_features, train_labels)]`: evaluation data to be used for early stopping. The model will only use the validation data for early stopping but will still evaluate the metrics for the training data. We can use the difference between the metrics to assess the amount of overfitting.\n",
 534 |     "* `eval_names = ['valid', 'train']`: names of the evaluation sets of data\n",
 535 |     "* `verbose = 500`: print out statistics about training and validation scores every 500 rounds (every 500 estimators trained)\n",
 536 |     "\n",
 537 |     "Let's see what this looks like in action. "
 538 |    ]
 539 |   },
 540 |   {
 541 |    "cell_type": "code",
 542 |    "execution_count": 6,
 543 |    "metadata": {},
 544 |    "outputs": [
 545 |     {
 546 |      "name": "stdout",
 547 |      "output_type": "stream",
 548 |      "text": [
 549 |       "Training until validation scores don't improve for 100 rounds.\n",
 550 |       "[500]\tvalid's l2: 0.634723\tvalid's mape: 9.06663\ttrain's l2: 0.682509\ttrain's mape: 9.29991\n",
 551 |       "[1000]\tvalid's l2: 0.594519\tvalid's mape: 8.80087\ttrain's l2: 0.544038\ttrain's mape: 8.05561\n",
 552 |       "Early stopping, best iteration is:\n",
 553 |       "[1025]\tvalid's l2: 0.592757\tvalid's mape: 8.79411\ttrain's l2: 0.539732\ttrain's mape: 8.02122\n"
 554 |      ]
 555 |     }
 556 |    ],
 557 |    "source": [
 558 |     "# Train the model\n",
 559 |     "model.fit(X = train_features, y = train_targets, \n",
 560 |     "          early_stopping_rounds = 100,\n",
 561 |     "          eval_metric = mape,\n",
 562 |     "          eval_set = [(valid_features, valid_targets), \n",
 563 |     "                      (train_features, train_targets)],\n",
 564 |     "          eval_names = ['valid', 'train'], verbose = 500);"
 565 |    ]
 566 |   },
 567 |   {
 568 |    "cell_type": "markdown",
 569 |    "metadata": {},
 570 |    "source": [
 571 |     "Early stopping kicked in at round 1025 (it might be different depending on the run) because the error on the validation score had not improved for 100 rounds. We can use the best number of iterations found on this validation fold to make predictions on the test set. First, we rebuild the model and fit on the entire training set using the optimum number of iterations, and then we make predictions. "
 572 |    ]
 573 |   },
 574 |   {
 575 |    "cell_type": "code",
 576 |    "execution_count": 7,
 577 |    "metadata": {},
 578 |    "outputs": [
 579 |     {
 580 |      "name": "stdout",
 581 |      "output_type": "stream",
 582 |      "text": [
 583 |       "Best number of iterations:  1025\n"
 584 |      ]
 585 |     }
 586 |    ],
 587 |    "source": [
 588 |     "print('Best number of iterations: ', model.best_iteration_)\n",
 589 |     "\n",
 590 |     "\n",
 591 |     "# Create the model with specified hyperparameters\n",
 592 |     "model = lgb.LGBMRegressor(n_estimators=model.best_iteration_,\n",
 593 |     "                          learning_rate = 0.01, \n",
 594 |     "                          reg_alpha = 0.1, reg_lambda = 0.1, \n",
 595 |     "                          subsample = 0.9, n_jobs = -1,\n",
 596 |     "                          random_state=100)\n",
 597 |     "\n",
 598 |     "\n",
 599 |     "model.fit(train, targets)\n",
 600 |     "\n",
 601 |     "# Make predictions on the test set\n",
 602 |     "predictions = model.predict(np.array(test))"
 603 |    ]
 604 |   },
 605 |   {
 606 |    "cell_type": "markdown",
 607 |    "metadata": {},
 608 |    "source": [
 609 |     "Then we can score the model using the evaluation metric."
 610 |    ]
 611 |   },
 612 |   {
 613 |    "cell_type": "code",
 614 |    "execution_count": 8,
 615 |    "metadata": {},
 616 |    "outputs": [
 617 |     {
 618 |      "name": "stdout",
 619 |      "output_type": "stream",
 620 |      "text": [
 621 |       "MAPE on the test set: 16.23532\n"
 622 |      ]
 623 |     }
 624 |    ],
 625 |    "source": [
 626 |     "_, mape_score, _ = mape(test_targets, predictions)\n",
 627 |     "print('MAPE on the test set: %0.5f' % mape_score)"
 628 |    ]
 629 |   },
 630 |   {
 631 |    "cell_type": "markdown",
 632 |    "metadata": {},
 633 |    "source": [
 634 |     "If we used a differerent validation fold then the ideal number of iterations would likely have been different. This is the reason we used 5-fold cv: we want to get a better idea of the best number of iterations than a single validation would show. "
 635 |    ]
 636 |   },
 637 |   {
 638 |    "cell_type": "markdown",
 639 |    "metadata": {},
 640 |    "source": [
 641 |     "## Cross Validation\n",
 642 |     "\n",
 643 |     "To actually use cross validation, we need to repeat the same process 4 more times, using the other 4 folds as validation once each. To use this process, we can write a simple loop. "
 644 |    ]
 645 |   },
 646 |   {
 647 |    "cell_type": "code",
 648 |    "execution_count": 9,
 649 |    "metadata": {},
 650 |    "outputs": [
 651 |     {
 652 |      "name": "stdout",
 653 |      "output_type": "stream",
 654 |      "text": [
 655 |       "Training until validation scores don't improve for 100 rounds.\n",
 656 |       "[500]\tvalid's l2: 0.634723\tvalid's mape: 9.06663\ttrain's l2: 0.682509\ttrain's mape: 9.29991\n",
 657 |       "[1000]\tvalid's l2: 0.594519\tvalid's mape: 8.80087\ttrain's l2: 0.544038\ttrain's mape: 8.05561\n",
 658 |       "Did not meet early stopping. Best iteration is:\n",
 659 |       "[1025]\tvalid's l2: 0.592757\tvalid's mape: 8.79411\ttrain's l2: 0.539732\ttrain's mape: 8.02122\n",
 660 |       "\n",
 661 |       "Fold 0 \t Validation MAPE: 8.79411\n",
 662 |       "\n",
 663 |       "Training until validation scores don't improve for 100 rounds.\n",
 664 |       "[500]\tvalid's l2: 2.01266\tvalid's mape: 13.2204\ttrain's l2: 0.529106\ttrain's mape: 8.40169\n",
 665 |       "Early stopping, best iteration is:\n",
 666 |       "[629]\tvalid's l2: 1.99781\tvalid's mape: 13.1469\ttrain's l2: 0.483788\ttrain's mape: 7.92668\n",
 667 |       "\n",
 668 |       "Fold 1 \t Validation MAPE: 13.14687\n",
 669 |       "\n",
 670 |       "Training until validation scores don't improve for 100 rounds.\n",
 671 |       "[500]\tvalid's l2: 0.83468\tvalid's mape: 11.3004\ttrain's l2: 0.651426\ttrain's mape: 8.58868\n",
 672 |       "[1000]\tvalid's l2: 0.759987\tvalid's mape: 10.4665\ttrain's l2: 0.518724\ttrain's mape: 7.46587\n",
 673 |       "Did not meet early stopping. Best iteration is:\n",
 674 |       "[1021]\tvalid's l2: 0.759038\tvalid's mape: 10.4607\ttrain's l2: 0.515648\ttrain's mape: 7.44514\n",
 675 |       "\n",
 676 |       "Fold 2 \t Validation MAPE: 10.46074\n",
 677 |       "\n",
 678 |       "Training until validation scores don't improve for 100 rounds.\n",
 679 |       "[500]\tvalid's l2: 2.18618\tvalid's mape: 12.6163\ttrain's l2: 0.532689\ttrain's mape: 8.34055\n",
 680 |       "Early stopping, best iteration is:\n",
 681 |       "[628]\tvalid's l2: 2.13564\tvalid's mape: 12.5569\ttrain's l2: 0.486273\ttrain's mape: 7.88759\n",
 682 |       "\n",
 683 |       "Fold 3 \t Validation MAPE: 12.55694\n",
 684 |       "\n",
 685 |       "Training until validation scores don't improve for 100 rounds.\n",
 686 |       "[500]\tvalid's l2: 1.20289\tvalid's mape: 17.6847\ttrain's l2: 0.640655\ttrain's mape: 8.51282\n",
 687 |       "[1000]\tvalid's l2: 1.09902\tvalid's mape: 15.8451\ttrain's l2: 0.501854\ttrain's mape: 7.35708\n",
 688 |       "Did not meet early stopping. Best iteration is:\n",
 689 |       "[974]\tvalid's l2: 1.09834\tvalid's mape: 15.8501\ttrain's l2: 0.506062\ttrain's mape: 7.39222\n",
 690 |       "\n",
 691 |       "Fold 4 \t Validation MAPE: 15.85013\n",
 692 |       "\n",
 693 |       "\n",
 694 |       " Best Number of Iterations:  855\n"
 695 |      ]
 696 |     }
 697 |    ],
 698 |    "source": [
 699 |     "iterations = 0\n",
 700 |     "\n",
 701 |     "# Split the training data\n",
 702 |     "for i, (train_indices, valid_indices) in enumerate(kfold.split(np.array(train))):\n",
 703 |     "\n",
 704 |     "    # Training data for fold\n",
 705 |     "    train_features, train_targets = np.array(train)[train_indices], targets[train_indices]\n",
 706 |     "    \n",
 707 |     "    # Validation data for fold\n",
 708 |     "    valid_features, valid_targets = np.array(train)[valid_indices], targets[valid_indices]\n",
 709 |     "\n",
 710 |     "    # Train the model\n",
 711 |     "    model.fit(X = train_features, y = train_targets, \n",
 712 |     "              early_stopping_rounds = 100,\n",
 713 |     "              eval_metric = mape,\n",
 714 |     "              eval_set = [(valid_features, valid_targets), \n",
 715 |     "                          (train_features, train_targets)],\n",
 716 |     "              eval_names = ['valid', 'train'], verbose = 500)\n",
 717 |     "    \n",
 718 |     "    # Add the number of iterations to the total for averaging\n",
 719 |     "    iterations += model.best_iteration_\n",
 720 |     "    \n",
 721 |     "    # Evaluate the mape\n",
 722 |     "    _, valid_mape, _ = mape(valid_targets, model.predict(valid_features, \n",
 723 |     "                                                         num_iteration=model.best_iteration_))\n",
 724 |     "    \n",
 725 |     "    print('\\nFold %d \\t Validation MAPE: %0.5f\\n' % (i, valid_mape))\n",
 726 |     "    \n",
 727 |     "iterations = int(iterations / kfold.n_splits)\n",
 728 |     "print('\\n Best Number of Iterations: ', iterations)"
 729 |    ]
 730 |   },
 731 |   {
 732 |    "cell_type": "markdown",
 733 |    "metadata": {},
 734 |    "source": [
 735 |     "Looking at the results, we can see the model overfits because the validation error is significantly higher than the training error. This can be addressed by regularizing the overall model or the individual learners in the ensemble. In this case, since we are already using Early Stopping, we probably would want to consider regularization on an individual learner basis. For now, we will not add any more regularization. If the model performs well on all the buildings, then we can come back and address the overfitting issue. This would involve using random search with cross validation to find the regularization hyperparameters that perform the best. \n",
 736 |     "\n",
 737 |     "Once the cross validation has finished, we need to retrain the model on the entire training set. This time, we will use the average number of optimum iterations that were returned from the cross validation."
 738 |    ]
 739 |   },
 740 |   {
 741 |    "cell_type": "code",
 742 |    "execution_count": 10,
 743 |    "metadata": {},
 744 |    "outputs": [
 745 |     {
 746 |      "name": "stdout",
 747 |      "output_type": "stream",
 748 |      "text": [
 749 |       "MAPE on the test set: 16.44667\n"
 750 |      ]
 751 |     }
 752 |    ],
 753 |    "source": [
 754 |     "# Recreate the model with the optimal number of estimators\n",
 755 |     "model = lgb.LGBMRegressor(n_estimators=iterations,\n",
 756 |     "                          learning_rate = 0.01, \n",
 757 |     "                          reg_alpha = 0.1, reg_lambda = 0.1, \n",
 758 |     "                          subsample = 0.9, n_jobs = -1)\n",
 759 |     "\n",
 760 |     "# Refit on the entire training data\n",
 761 |     "model.fit(train, targets)\n",
 762 |     "\n",
 763 |     "\n",
 764 |     "_, mape_score, _ = mape(test_targets, model.predict(np.array(test), num_iteration=iterations))\n",
 765 |     "print('MAPE on the test set: %0.5f' % mape_score)"
 766 |    ]
 767 |   },
 768 |   {
 769 |    "cell_type": "markdown",
 770 |    "metadata": {},
 771 |    "source": [
 772 |     "For comparison, the best model on this dataset on the six Scikit-Learn models was the random forest with a test MAPE of 15.70. We will have to use more data in order to find out which model really is best! "
 773 |    ]
 774 |   },
 775 |   {
 776 |    "cell_type": "markdown",
 777 |    "metadata": {},
 778 |    "source": [
 779 |     "## Feature Importances\n",
 780 |     "\n",
 781 |     "Machine learning is often criticized as a black box: we put in some data and recieve answers - often extremely accurate answers - with no explanations. One method we can use to peer into the black box of gradient boosting machines are feature importances. For a gradient boosting model based on individual decision trees, the [feature importances represent the decrease in impurity from including the feature in the model](https://stackoverflow.com/questions/15810339/how-are-feature-importances-in-randomforestclassifier-determined). The absolute value of the feature importances can be [difficult to interpret](https://papers.nips.cc/paper/4928-understanding-variable-importances-in-forests-of-randomized-trees.pdf), but the relative magnitude of the importances can be used to determine which features the model considers \"most important\". \n",
 782 |     "We can use these importances for feature selection (which is more important when the number of features is large) or to try and understand how the model makes predictions.\n",
 783 |     "\n",
 784 |     "Feature importances can be extracted from a trained gradient boosting model as follows."
 785 |    ]
 786 |   },
 787 |   {
 788 |    "cell_type": "code",
 789 |    "execution_count": 11,
 790 |    "metadata": {},
 791 |    "outputs": [
 792 |     {
 793 |      "data": {
 794 |       "text/html": [
 795 |        "<div>\n",
 796 |        "<style scoped>\n",
 797 |        "    .dataframe tbody tr th:only-of-type {\n",
 798 |        "        vertical-align: middle;\n",
 799 |        "    }\n",
 800 |        "\n",
 801 |        "    .dataframe tbody tr th {\n",
 802 |        "        vertical-align: top;\n",
 803 |        "    }\n",
 804 |        "\n",
 805 |        "    .dataframe thead th {\n",
 806 |        "        text-align: right;\n",
 807 |        "    }\n",
 808 |        "</style>\n",
 809 |        "<table border=\"1\" class=\"dataframe\">\n",
 810 |        "  <thead>\n",
 811 |        "    <tr style=\"text-align: right;\">\n",
 812 |        "      <th></th>\n",
 813 |        "      <th>feature</th>\n",
 814 |        "      <th>importance</th>\n",
 815 |        "    </tr>\n",
 816 |        "  </thead>\n",
 817 |        "  <tbody>\n",
 818 |        "    <tr>\n",
 819 |        "      <th>0</th>\n",
 820 |        "      <td>timestamp</td>\n",
 821 |        "      <td>4577</td>\n",
 822 |        "    </tr>\n",
 823 |        "    <tr>\n",
 824 |        "      <th>16</th>\n",
 825 |        "      <td>num_time_sin</td>\n",
 826 |        "      <td>3869</td>\n",
 827 |        "    </tr>\n",
 828 |        "    <tr>\n",
 829 |        "      <th>17</th>\n",
 830 |        "      <td>num_time_cos</td>\n",
 831 |        "      <td>3013</td>\n",
 832 |        "    </tr>\n",
 833 |        "    <tr>\n",
 834 |        "      <th>6</th>\n",
 835 |        "      <td>temp</td>\n",
 836 |        "      <td>2503</td>\n",
 837 |        "    </tr>\n",
 838 |        "    <tr>\n",
 839 |        "      <th>10</th>\n",
 840 |        "      <td>yday_sin</td>\n",
 841 |        "      <td>1874</td>\n",
 842 |        "    </tr>\n",
 843 |        "    <tr>\n",
 844 |        "      <th>11</th>\n",
 845 |        "      <td>yday_cos</td>\n",
 846 |        "      <td>1782</td>\n",
 847 |        "    </tr>\n",
 848 |        "    <tr>\n",
 849 |        "      <th>4</th>\n",
 850 |        "      <td>dif</td>\n",
 851 |        "      <td>1645</td>\n",
 852 |        "    </tr>\n",
 853 |        "    <tr>\n",
 854 |        "      <th>1</th>\n",
 855 |        "      <td>biz_day</td>\n",
 856 |        "      <td>1197</td>\n",
 857 |        "    </tr>\n",
 858 |        "    <tr>\n",
 859 |        "      <th>3</th>\n",
 860 |        "      <td>ghi</td>\n",
 861 |        "      <td>1018</td>\n",
 862 |        "    </tr>\n",
 863 |        "    <tr>\n",
 864 |        "      <th>8</th>\n",
 865 |        "      <td>pwat</td>\n",
 866 |        "      <td>980</td>\n",
 867 |        "    </tr>\n",
 868 |        "    <tr>\n",
 869 |        "      <th>14</th>\n",
 870 |        "      <td>wday_sin</td>\n",
 871 |        "      <td>866</td>\n",
 872 |        "    </tr>\n",
 873 |        "    <tr>\n",
 874 |        "      <th>7</th>\n",
 875 |        "      <td>rh</td>\n",
 876 |        "      <td>615</td>\n",
 877 |        "    </tr>\n",
 878 |        "    <tr>\n",
 879 |        "      <th>15</th>\n",
 880 |        "      <td>wday_cos</td>\n",
 881 |        "      <td>377</td>\n",
 882 |        "    </tr>\n",
 883 |        "    <tr>\n",
 884 |        "      <th>9</th>\n",
 885 |        "      <td>ws</td>\n",
 886 |        "      <td>366</td>\n",
 887 |        "    </tr>\n",
 888 |        "    <tr>\n",
 889 |        "      <th>5</th>\n",
 890 |        "      <td>gti</td>\n",
 891 |        "      <td>320</td>\n",
 892 |        "    </tr>\n",
 893 |        "    <tr>\n",
 894 |        "      <th>18</th>\n",
 895 |        "      <td>sun_rise_set_neither</td>\n",
 896 |        "      <td>283</td>\n",
 897 |        "    </tr>\n",
 898 |        "    <tr>\n",
 899 |        "      <th>13</th>\n",
 900 |        "      <td>month_cos</td>\n",
 901 |        "      <td>202</td>\n",
 902 |        "    </tr>\n",
 903 |        "    <tr>\n",
 904 |        "      <th>12</th>\n",
 905 |        "      <td>month_sin</td>\n",
 906 |        "      <td>129</td>\n",
 907 |        "    </tr>\n",
 908 |        "    <tr>\n",
 909 |        "      <th>2</th>\n",
 910 |        "      <td>week_day_end</td>\n",
 911 |        "      <td>16</td>\n",
 912 |        "    </tr>\n",
 913 |        "    <tr>\n",
 914 |        "      <th>20</th>\n",
 915 |        "      <td>sun_rise_set_set</td>\n",
 916 |        "      <td>10</td>\n",
 917 |        "    </tr>\n",
 918 |        "    <tr>\n",
 919 |        "      <th>19</th>\n",
 920 |        "      <td>sun_rise_set_rise</td>\n",
 921 |        "      <td>8</td>\n",
 922 |        "    </tr>\n",
 923 |        "  </tbody>\n",
 924 |        "</table>\n",
 925 |        "</div>"
 926 |       ],
 927 |       "text/plain": [
 928 |        "                 feature  importance\n",
 929 |        "0              timestamp        4577\n",
 930 |        "16          num_time_sin        3869\n",
 931 |        "17          num_time_cos        3013\n",
 932 |        "6                   temp        2503\n",
 933 |        "10              yday_sin        1874\n",
 934 |        "11              yday_cos        1782\n",
 935 |        "4                    dif        1645\n",
 936 |        "1                biz_day        1197\n",
 937 |        "3                    ghi        1018\n",
 938 |        "8                   pwat         980\n",
 939 |        "14              wday_sin         866\n",
 940 |        "7                     rh         615\n",
 941 |        "15              wday_cos         377\n",
 942 |        "9                     ws         366\n",
 943 |        "5                    gti         320\n",
 944 |        "18  sun_rise_set_neither         283\n",
 945 |        "13             month_cos         202\n",
 946 |        "12             month_sin         129\n",
 947 |        "2           week_day_end          16\n",
 948 |        "20      sun_rise_set_set          10\n",
 949 |        "19     sun_rise_set_rise           8"
 950 |       ]
 951 |      },
 952 |      "execution_count": 11,
 953 |      "metadata": {},
 954 |      "output_type": "execute_result"
 955 |     }
 956 |    ],
 957 |    "source": [
 958 |     "feature_names = list(train.columns)\n",
 959 |     "importances = model.feature_importances_\n",
 960 |     "\n",
 961 |     "# Dataframe of feature importances\n",
 962 |     "feature_importances = pd.DataFrame({'feature': feature_names, \n",
 963 |     "                                    'importance': importances})\n",
 964 |     "\n",
 965 |     "# Sort the features by their importance\n",
 966 |     "feature_importances.sort_values('importance', ascending = False)"
 967 |    ]
 968 |   },
 969 |   {
 970 |    "cell_type": "markdown",
 971 |    "metadata": {},
 972 |    "source": [
 973 |     "The most important feature is the `timestamp` which we converted to a numeric to represent the number of seconds since the beginning of the data. The second most important feature is `num_time_sin`, one of the cyclical representations of the time of day, followed by the `temp` (the temperature at the building's location). These importances makes sense based on our domain knowledge because energy use is highly dependent on the time of day and the temperature. While it may be a mistake to read too much into the feature importances, it is reassuring that they agree with our domain knowledge. \n",
 974 |     "\n",
 975 |     "We can make a quick plot to show the normalized feature importances where each importance is divided by the sum of the importances. "
 976 |    ]
 977 |   },
 978 |   {
 979 |    "cell_type": "code",
 980 |    "execution_count": 12,
 981 |    "metadata": {},
 982 |    "outputs": [],
 983 |    "source": [
 984 |     "def plot_feature_importances(df):\n",
 985 |     "    \"\"\"\n",
 986 |     "    Plot importances returned by a model. This can work with any measure of\n",
 987 |     "    feature importance from a model (usually a tree-based model). \n",
 988 |     "    \n",
 989 |     "    Parameters\n",
 990 |     "    --------\n",
 991 |     "        df : dataframe\n",
 992 |     "            feature importances. Must have the features in a column\n",
 993 |     "            called `features` and the importances in a column called `importance\n",
 994 |     "        \n",
 995 |     "    Return\n",
 996 |     "    --------\n",
 997 |     "        shows a plot of the 15 most importance features\n",
 998 |     "        \n",
 999 |     "        df : dataframe\n",
1000 |     "            feature importances sorted by importance (highest to lowest) \n",
1001 |     "            with a column for normalized importance\n",
1002 |     "        \"\"\"\n",
1003 |     "    \n",
1004 |     "    # Sort features according to importance\n",
1005 |     "    df = df.sort_values('importance', ascending = False).reset_index()\n",
1006 |     "    \n",
1007 |     "    # Normalize the feature importances to add up to one\n",
1008 |     "    df['importance_normalized'] = df['importance'] / df['importance'].sum()\n",
1009 |     "\n",
1010 |     "    # Make a horizontal bar chart of feature importances\n",
1011 |     "    plt.figure(figsize = (10, 6))\n",
1012 |     "    ax = plt.subplot()\n",
1013 |     "    \n",
1014 |     "    # Need to reverse the index to plot most important on top\n",
1015 |     "    ax.barh(list(reversed(list(df.index[:15]))), \n",
1016 |     "            df['importance_normalized'].head(15), \n",
1017 |     "            align = 'center', edgecolor = 'k')\n",
1018 |     "    \n",
1019 |     "    # Set the yticks and labels\n",
1020 |     "    ax.set_yticks(list(reversed(list(df.index[:15]))))\n",
1021 |     "    ax.set_yticklabels(df['feature'].head(15))\n",
1022 |     "    \n",
1023 |     "    # Plot labeling\n",
1024 |     "    plt.xlabel('Normalized Importance'); plt.title('Feature Importances')\n",
1025 |     "    plt.show()\n",
1026 |     "    \n",
1027 |     "    return df"
1028 |    ]
1029 |   },
1030 |   {
1031 |    "cell_type": "code",
1032 |    "execution_count": 13,
1033 |    "metadata": {},
1034 |    "outputs": [
1035 |     {
1036 |      "data": {
1037 |       "image/png": "iVBORw0KGgoAAAANSUhEUgAAAssAAAGECAYAAADA7m3/AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAIABJREFUeJzs3Xm8XWV97/HPF0Q4CRgkDldqbTANeilokMEB8OLQasUBKw6VqkGl2taKtYBaNKYqVatVkzq1oYKKVRxAEVTQAoIRlSlMih4EvApeZyOYAzL87h97BTeHs0+mlbOH83m/Xvt11n7Ws571W7/sbH55eNY6qSokSZIk3d1W/Q5AkiRJGlQWy5IkSVIPFsuSJElSDxbLkiRJUg8Wy5IkSVIPFsuSJElSDxbLkiRJUg8Wy5JmvSQnJKkpXs9r+Ty3JVnS5pibGMc5SY7rdxzTSbJ/82ewoN+xSJrd7tHvACRpQJwHPGdS26/7EciGSHLPqvpdv+PYEpLcs98xSNI6zixLUsfvqur/TXrdvG5nkuclWZ3k5iTXJXlXkrld+/+0mbH9ZZI1Sb6aZN+u/dcBWwPHr5u5btqXJLmtO5AkD2z6HNi8P7B5f1CSryW5GfjrZt9eSc5MclOSnyU5OckfbcyFN3H/V5K3JPlpkl8nOTbJVkmWJvlJM/axk467rul3XJLfJPl5krcn2aqrzw5J/qM5/uYkFyb5s679C5prOzTJF5L8FvhvOv94Abi22X9O0/8RSb7YxHlTkguSPHmKuN6UZHnz5/GTJO9MsvWkfn+X5NtJbmnG+3TXvnskWZbk2ibuK5O8bNLxL03ynWb/L5Kcm+SBG5N7SYPPYlmS1qNZOvEB4N+A3YAXAk8EPtjVbXvgfcCjgMcA48CXksxv9u8D3A68CnhA89pY/wb8K/C/gc8m2Q34KnA+sDfw+OYcX06y3UaOfQiwDbA/8Grgn4DTmus6ADgS+Kckfz7puL8HbqBzff8AvILONa7zIeBJwF8BewKrgNOSPHTSOG+nUyTvAbwGeEbTvi+dXP1F8/5ewCeAA4FHAGcApybZdYq4fgw8EnhlE9ML1+1M8s/NOd/fnPPJwOqu449rzvkyOvl+E/D2JC9pjt+Lzp//W4GHNPF8BEmjp6p8+fLla1a/gBOA24Cbul7f79p/HfDyScc8Fijg3j3G3Ar4FXBoV9ttwJJJ/ZYAt01qe2Az9oHN+wOb9y+YIu5PTGrbFlgLHDzN9Z4DHDfp/epJfa4ELp/Udinwzkl5OW9Sn38BftRs/3ET91Mm9bkY+FCzvaDp84ZJffZv2hdswJ/fpcAxk+I6dVKfLwEfb7bnAhPAkT3G2wW4A3jopPal6/IEPBNYA9yr359fX758bdmXa5YlqeObwIu63t8GkOS+wB8B70ryzq79aX7+MXBBkl3ozD4+GrgfnWJ5TnNsW7416f0+wB8nuWlS+3bAoo0c+9JJ7/9f85rcdr9JbedPer8KeF2Se9GZhQc4d1Kfc+nkqdvka5tS8+fxz3Rm0f8XnXtvtuPueV496f31dIpggD9pjjmzx2n2pvPne2GS7vZ70Jm5B/gycA2dZSJfBs4CTq6qn2/IdUgaHhbLktQxUVVXT9G+brnaEcDZU+z/UfPzNODnwN8BPwR+B3wNWN/NandM0bZNj76/nSK2jwJvm6LvL9Zz3slunfS+erStb/le1rN/XZ+a1Db52no5AXgQcDRwLZ0Z4k9w9zxPvvlxqtgnx7DOun6PoTNLf7djquqmJHsD+9FZkvNy4F+TPKGqLtqgK5E0FCyWJWkaVfWTJD8EHlJVK6fq06xL3o3OcoMzmrYHcvdZ2N/Rucmv20+BrZPcv6p+0rQ9YgPDuxB4GJ0lI70Kvy3tUZPePxq4oap+k+TKpu2xwBe6+hwAXLKecdcVu5Pz9Vjg6Ko6FaC5yfLBwBUbEfO3gZvprKW+fIr964rdB1XVab0Gqarb6cySn5vkjc24z+86XtII8AY/SVq/Y4BXJnl9kt2TPCTJwUn+o9n/K+BnwOFJdk3yaODjdGY9u10LPC7Jzknu07R9C7gReFuSRc2THZZuYFz/QufmsxOT7JtklySPa54C8eDNuN6Nsbh5asSuSZ5PZwb+3QBV9X3gU8D7kzwpyUOTLAd2B96xnnF/QGfW/SlJ7pdkXtP+XeDQJHskWUwnz5ML6mlV1U10bpZc1jwRY9ckD0/yumb/1XRuTFyZ5AVJ/rjZ/+IkrwFI8owk/5DO00geBBwM/CGdglnSCLFYlqT1qKqP0nkG80F0itsLgGV01sFSVXcAzwYWApfRWSrwHjpPY+j2j8BedIrmnzXH/hL4SzoztJcBb6CzxGBD4voOnaUC29N5KsS3gZXAGDP3jOh/p7Ne+ELgvXSeGvLurv0vbWI7kc666P2Ap1bVVdMN2syyvw54LZ08fq7ZdRid/3Z9C/gsnRv3LtiEuN9A848gOrPSZ3LXGf2/bq7jGDp5/R86a9qvafb/Cnhac/7v0XlKyVvoFNmSRkj693/uJEnDLJ1nRx9XVW/pdyyStKU4syxJkiT1YLEsSZIk9eAyDEmSJKkHZ5YlSZKkHnzOcgvWrFnj9LwkSdIImDdv3l1+uZIzy5IkSVIPFsuSJElSDxbLGkjj4+P9DmFkmMt2mc92mc/2mMt2mc/2DHsuLZYlSZKkHiyWJUmSpB4sliVJkqQeLJYlSZKkHiyWJUmSpB4sliVJkqQeLJYlSZKkHiyWJUmSpB4sliVJkqQeLJYlSZKkHiyWJUmSpB4sliVJkqQe7tHvAEbNQUcs63cII2Fi7QRjc8b6HcZIMJftMp/tMp/tMZftMp/t2dhcLpw/lxVLj9qCEW0ci+WWrVp8eL9DkCRJGl6rV/Y7grtwGYYkSZLUg8WyJEmS1MOMFMtJdkzyt832zkk+vQXPtTjJU7bU+JIkSZo9ZmpmeUfgbwGq6oaqOmQLnmsxYLEsSZKkzTZTN/i9DViYZDUwDvzvqto9yRLgYGBrYHfg34B7Ai8AbgGeUlW/TLIQeB9wX2AtcHhVXZXk2cAbgduBNcATgTcBY0n2B94KXAu8BxgDJoDDquq7G3Huc4DVwL7AvYAXV9W3tlSiJEmSNDhmamb5tcD3q2oxMPlZILsDz6dTjB4LrK2qPYHzgRc2ff4T+Puq2gs4Enh/074UeFJVPRx4elX9rmk7qaoWV9VJwFXAY5sxlwL/spHnBphbVY+hMzv+oc1LhSRJkobFIDw67uyquhG4Mcka4PNN++XAw5JsDzwG+FSSdcds2/xcBZyQ5JPAyT3Gnwd8OMkioIBtNvTcXf0+DlBV5ya5V5Idq+rXm3i9kiRJ6mFi7QTj4+Mzes5Fixb13DcIxfItXdt3dL2/g058WwG/bmal76KqXp7kkcBBwOokd+sDvJlOUfzMJAuAczbi3HeeavKpp7keSZIkbaKxOWPTFq8zbaaWYdwI7LApB1bVb4Brm/XJpOPhzfbCqvpmVS0Ffg784RTnmgdc32wv2bTweW5zvv2BNVW1ZhPHkSRJ0hCZkWK5qn4BrEpyBfCOTRjiUOAlSS4FrgSe0bS/I8nlzbjnApcCZwO7JVmd5LnAvwJvTbKKzs18m+JXSb4OfBB4ySaOIUmSpCGTKlcUTKd5GsaRVXVhrz5r1qy5M4k7Hn99r26SJElaj/1Wr+T05cv6dv558+al+72/wU+SJEnqYRBu8BtoVXVgv2OQJElSfzizLEmSJPXgzHLL9lu9st8hjISJtROMzRnrdxgjwVy2y3y2y3y2x1y2y3y2Z2NzuXD+3C0YzcbzBr8WdN/gp3aMj48P1DMWh5m5bJf5bJf5bI+5bJf5bM+w5dIb/CRJkqQNZLEsSZIk9eCa5ZYddMSyfocwElwr1h5z2S7z2S7z2R5z2a7ufC6cP5cVS4/qc0TqF4vllq1afHi/Q5AkSW3y5v1ZzWUYkiRJUg8Wy5IkSVIPFsuSJElSDwNZLCdZkmTnrvfHJdlthmP4QpIdZ/KckiRJGiyDeoPfEuAK4AaAqnrpTAdQVU+Z6XNKkiRpsGzwzHKSBUm+k2RlkiuTnJlkLMk5SfZu+twnyXXN9pIkn03y+STXJnlFklcnuSTJN5Ls1OM8hwB7Ax9LsnqKc9yU5O1JLkrylST7NvuvSfL0ps/WSd6R5IIklyV52TTX9YAk5zbnuiLJAU37dc31THndG5o3SZIkDa+NnVleBPxlVR2e5JPAs9bTf3dgT2A74GrgNVW1Z5J3Ay8E3jP5gKr6dJJXAEdW1YUAyV1+6+Bc4Jyqek2SU4C3AH8K7AZ8GDgVeAmwpqr2SbItsCrJmVV17RQxPh84o6qOTbI1MGcDr/vE9Vy7JEkaARNrJxgfH+93GENt0PM33a/j3thi+dqqWt1sXwQsWE//s6vqRuDGJGuAzzftlwMP28hzr/M74Etd49xSVbcmubwrnj8DHtbMUgPMo1PwTlUsXwB8KMk2wGe7rq/bxl63JEkaEWNzxqYtpjS98fHxoc7fxt7gd0vX9u10iu3busbZbpr+d3S9v4NNXy99a1XV5DGrqnvMAH9fVYub1y5VdeZUg1XVucBjgeuBjyZ54RTdprpuSZIkjbg2noZxHbBXs33INP02xo3ADptx/BnA3zSzxSTZNcncqTom+SPgp1W1Evgv4BGbcV5JkiSNkDZmSN8JfDLJC4CzWhgP4ATgg0kmgEdvwvHH0VkqcXE6C55/Bhzco++BwFFJbgVuorOWWpIkSSK/X9GgTbVmzZo7k7jj8df3MxRJktSy/Vav5PTly/odxtAatjXL8+bNu8uTJQbyl5JIkiRJg6CvN6oleR+w36Tm5VV1/BY41x7ARyc131JVj2z7XJIkSRoNfS2Wq+rvZvBclwOLt/R59lu9ckufYlaYWDvB2Bx/90sbzGW7zGe7zGd7zGW7uvO5cP6UzwjQLOEj0FrmmqZ2DNv6pkFmLttlPttlPttjLttlPrWOa5YlSZKkHiyWJUmSpB5chtGyg45Y1u8QRoJr79pjLttlPts1W/K5cP5cViw9qt9hSNoEFsstW7X48H6HIEkaNN78LQ0tl2FIkiRJPVgsS5IkST1YLEuSJEk9DFSxnGRJkp273h+XZLd+xiRJkqTZa9Bu8FsCXAHcAFBVL+1rNJIkSZrV1juznGRBku8kWZnkyiRnJhlLck6SvZs+90lyXbO9JMlnk3w+ybVJXpHk1UkuSfKNJDv1OM8hwN7Ax5KsnuIcNyV5e5KLknwlyb7N/muSPL3ps3WSdyS5IMllSV62nms7OsnlSS5N8rambXET52VJTkly76b9lUm+3bR/YoMzLEmSpKG1oTPLi4C/rKrDk3wSeNZ6+u8O7AlsB1wNvKaq9kzybuCFwHsmH1BVn07yCuDIqroQIEl3l7nAOVX1miSnAG8B/hTYDfgwcCrwEmBNVe2TZFtgVZIzq+rayedL8ufAwcAjq2ptVxH/EeDvq+qrSd4EvBF4FfBaYJequiXJjuu5fkmS7jSxdoLx8fEtfp6ZOMdsYj7bM+i5nO5Xm29osXxtVa1uti8CFqyn/9lVdSNwY5I1wOeb9suBh23gOSf7HfClrnFuqapbk1zeFc+fAQ9rZqkB5tEp9O9WLANPBI6vqrUAVfXLJPOAHavqq02fDwOfarYvozPr/Vngs5t4DZKkWWhszti0/zFuw/j4+BY/x2xiPtsz7Lnc0GL5lq7t24Ex4DZ+v4xju2n639H1/o6NOOdkt1ZVTR6zqu5Ism7M0JkVPmMDxgtQ6+31ewcBjwWeDrwhyZ9U1W0bcbwkSZKGzOY8DeM6YK9m+5Bp+m2MG4EdNuP4M4C/SbINQJJdk8zt0fdM4MVJ5jR9d6qqNcCvkhzQ9HkB8NUkWwF/WFVnA0cDOwLbb0ackiRJGgKb8zSMdwKfTPIC4KyW4jkB+GCSCeDRm3D8cXSWZFyczoLnn9FZl3w3VfWlJIuBC5P8DvgC8E/Ai5oY5gDXAIcBWwMnNss0Ary7qn69CfFJkiRpiOT3Kxu0qdasWXNnEnc8/vp+hiJJGkD7rV7J6cuXbdFzDPu60EFjPtszbLmcN2/eXZ4wMVC/lESSJEkaJH35pSRJ3gfsN6l5eVUdvwXOtQfw0UnNt1TVI9s+lyRJkkZLX4rlqvq7GTzX5cDimTrffqtXztSpRtrE2gnG5oz1O4yRYC7bZT7bNVvyuXB+r3vNJQ26Qft110NvS69Jmy2GbX3TIDOX7TKf7TKfkgada5YlSZKkHiyWJUmSpB5chtGyg45Y1u8QRsJsWcc4E8xlu8znxlk4fy4rlh7V7zAkaZNZLLds1eLD+x2CJA0Ob3qWNORchiFJkiT1YLEsSZIk9TCSxXKSHZP8bb/jkCRJ0nAbyWIZ2BGwWJYkSdJmGdVi+W3AwiSrk7wjyVFJLkhyWZJ/BkiyIMlVSY5LckWSjyV5YpJVScaT7Nv0W5bko0nOatq9g0+SJGmWGNVi+bXA96tqMfBlYBGwL51fe71Xksc2/f4YWA48DHgo8Hxgf+BI4J+6xnsYcBDwaGBpkp1n4iIkSZLUX7Ph0XF/1rwuad5vT6d4/r/AtVV1OUCSK4H/qapKcjmwoGuMz1XVBDCR5Gw6hfdnZyh+SRpaE2snGB8fn7bP+vZrw5nLdpnP9gx6LhctWtRz32wolgO8tar+4y6NyQLglq6mO7re38Fdc1OTxpz8XpI0hbE5Y9P+R2h8fHza/dpw5rJd5rM9w57LUV2GcSOwQ7N9BvDiJNsDJPmDJPfbyPGekWS7JPOBA4ELWotUkiRJA2skZ5ar6hfNjXpXAF8E/hs4PwnATcBfAbdvxJDfAk4HHgS8uapuaDlkSZIkDaCRLJYBqur5k5qWT9Ft967+S7q2r+veB3yvqv66zfgkSZI0+EZ1GYYkSZK02UZ2ZrktVbWs3zFIkiSpP5xZliRJknpwZrll+61e2e8QRsLE2gnG5oz1O4yRYC7bZT43zsL5c/sdgiRtFovllp2+fFm/QxgJw/5MxkFiLttlPiVpdnEZhiRJktSDxbIkSZLUg8swWnbQEcv6HcJIcF1oe8zlplk4fy4rlh7V7zAkSX1msdyyVYsP73cIktrgzbqSJFyGIUmSJPVksSxJkiT1YLEsSZIk9TB0xXKSE5IcsoXP8fIkL9yS55AkSdLg8wa/KVTVB/sdgyRJkvpvoGaWk7w5yRFd749NckSS9yb5dpLTgft17V+a5IIkVyT5z3QsTHJxV59FSS6a5pxva8a+LMk7m7ZlSY5sts9J8vYk30ryvSQHbJGLlyRJ0sAZtJnl/wJOBpYn2Qp4HnA08FRgD+D+wLeBDzX931tVbwJI8lHgqVX1+SRrkiyuqtXAYcAJU50syU7AM4GHVlUl2bFHXPeoqn2TPAV4I/DEFq5V0gCbWDvB+Pj4lPt6tWvTmM/2mMt2mc/2DHouFy1a1HPfQBXLVXVdkl8k2ZNOYXwJcADw8aq6HbghyVldhzwuydHAHGAn4Erg88BxwGFJXg08F9i3xyl/A9wMHNfMWp/Wo9/Jzc+LgAWben2ShsfYnLEpvzzHx8en/VLVxjGf7TGX7TKf7Rn2XA7UMozGccASOjPC62aQa3KnJNsB7wcOqao9gJXAds3uzwB/TmdG+qKq+sVUJ6qq2+gU0p8BDga+1COmW5qftzNg/8CQJEnSljOIxfIpwJOBfYAzgHOB5yXZOskDgMc1/dYVxj9Psj1w5xMyqurm5tgPAMf3OlFz3Lyq+gLwKmBxy9ciSZKkITZws6RV9bskZwO/rqrbk5wCPB64HPge8NWm36+TrGzarwMumDTUx4C/AM6c5nQ7AJ9rZqkD/EOb1yJJkqThNnDFcnNj36OAZwNUVQGvmKpvVb0eeH2PofYHPtSsdZ5SVf2YKdYzV9Wyru0Du7Z/jmuWJUmSZo2BKpaT7EbnJrtTqmqTb5tsZqMX0pmRliRJkjbJQBXLVfVt4MEtjPPMyW1NAb3LpObXVNUZm3s+SZIkjaaBKpa3pKkK6C1hv9UrZ+I0I29i7QRjc8b6HcZIMJebZuH8uf0OQZI0AGZNsTxTTl++rN8hjIRhfybjIDGXkiRtukF8dJwkSZI0ECyWJUmSpB5chtGyg45Y1u8QRoLrbNtjLjsWzp/LiqVH9TsMSdKQsVhu2arFh/c7BElT8eZbSdImcBmGJEmS1IPFsiRJktSDxbIkSZLUw9AUy0lOSHJIv+OQJEnS7DE0xbIkSZI00waiWE7y5iRHdL0/NskRSd6b5NtJTgfu17V/aZILklyR5D/TsTDJxV19FiW5aJpz7pPk60kuTfKtJDsk2S7J8UkuT3JJksc1ff+k6bM6yWVJ/HVokiRJs8CgPDruv4CTgeVJtgKeBxwNPBXYA7g/8G3gQ03/91bVmwCSfBR4alV9PsmaJIurajVwGHDCVCdLck/gJOC5VXVBknsBE8ARAFW1R5KHAmcm2RV4ObC8qj7WHLt1+ymQtCVNrJ1gfHy8lbHaGkcd5rM95rJd5rM9g57LRYt6z4MORLFcVdcl+UWSPekUxpcABwAfr6rbgRuSnNV1yOOSHA3MAXYCrgQ+DxwHHJbk1cBzgX17nPIhwI+r6oLm/L8BSLI/8O9N21VJfgDsCpwPHJPkgcDJVTXYf+KS7mZszti0X4Ybanx8vJVx1GE+22Mu22U+2zPsuRyIZRiN44AldGaE180g1+ROSbYD3g8cUlV7ACuB7ZrdnwH+nM6M9EVV9Yse58pUYzftd1NV/w08nc7s8xlJHr8B1yNJkqQhN0jF8inAk4F9gDOAc4HnJdk6yQOAxzX91hXGP0+yPXDnEzKq6ubm2A8Ax09zrquAnZPsA9CsV75Hc85Dm7ZdgQcB303yYOCaqloBnAo8rIXrlSRJ0oAbiGUYAFX1uyRnA7+uqtuTnAI8Hrgc+B7w1abfr5OsbNqvAy6YNNTHgL8AzlzPuZ4L/HuSMTozxk+kM2P9wSSXA7cBS6rqlqbvXyW5Ffh/wJvaum5JkiQNroEplpsb+x4FPBugqgp4xVR9q+r1wOt7DLU/8KFmrXNPzXrlR02xa8kUfd8KvHW68SRJkjR6BqJYTrIbcBpwyubcPNfMRi+kMyMtSZIkbZaBKJar6tvAg1sY55mT25oCepdJza+pqjM293ySJEkabQNRLG9JUxXQW9J+q1fO5OlG1sTaCcbmjPU7jJFgLjsWzp/b7xAkSUNo5IvlmXb68mX9DmEkDPszGQeJuZQkadMN0qPjJEmSpIFisSxJkiT14DKMlh10xLJ+hzASXGfbnmHP5cL5c1mx9Kh+hyFJmqUsllu2avHh/Q5BGi3eNCtJ6iOXYUiSJEk9WCxLkiRJPbgMo4cky4CbgHsB51bVV5IcAHwQuBV4dFVN9DFESZIkbWEWy+tRVUu73h4KvLOqju9XPJIkSZo5FstdkhwDvBD4IfAz4KIkJwCnATsCzwGelOSJVXVo3wKVJEnSjLBYbiTZC3gesCedvFwMXLRuf1Udl2R/4LSq+nR/opQkSdJMslj+vQOAU6pqLUCSU/scjyQ6z4keHx/vdxh3MWjxDDvz2R5z2S7z2Z5Bz+WiRYt67rNYvqvqdwCS7mpszti0X2IzbXx8fKDiGXbmsz3msl3msz3DnksfHfd75wLPTDKWZAfgaf0OSJIkSf3lzHKjqi5OchKwGvgBcF6fQ5IkSVKfWSx3qapjgWOn2b9k5qKRJElSv7kMQ5IkSerBYlmSJEnqwWJZkiRJ6sE1yy3bb/XKfocwEibWTjA2Z6zfYYyEYc/lwvlz+x2CJGkWs1hu2enLl/U7hJEw7M9kHCTmUpKkTecyDEmSJKkHi2VJkiSpB5dhtOygI5b1O4SRMOzrbLe0hfPnsmLpUf0OQ5KkkWex3LJViw/vdwiaDbyRVJKkGeEyDEmSJKkHi2VJkiSpB4tlSZIkqYehKZaTLEhyxRTtxyXZbTPGPTDJaZsXnSRJkkbR0N/gV1Uv7XcMkiRJGk1DM7PcuEeSDye5LMmnk8xJck6SvZM8Pcnq5vXdJNf2GiTJk5NcleRrwF90te+b5OtJLml+PqRpPy/J4q5+q5I8bIteqSRJkvpu2GaWHwK8pKpWJfkQ8LfrdlTVqcCpAEk+CXx1qgGSbAesBB4PXA2c1LX7KuCxVXVbkicC/wI8CzgOWAK8KsmuwLZVdVnL1yZtsIm1E4yPj29w/43pq/Uzn+0yn+0xl+0yn+0Z9FwuWrSo575hK5Z/WFWrmu0TgVdO7pDkaGCiqt7XY4yHAtdW1XjT/0Tgr5t984APJ1kEFLBN0/4p4A1JjgJeDJzQwrVIm2xszti0f7G7jY+Pb3BfrZ/5bJf5bI+5bJf5bM+w53LYiuWa7n2SJwDPBh67keOs82bg7Kp6ZpIFwDkAVbU2yZeBZwDPAfbeqKglSZI0lIZtzfKDkjy62f5L4GvrdiT5I+D9wHOqamKaMa4CdkmysGucdeYB1zfbSyYddxywArigqn65aeFLkiRpmAxbsfwd4EVJLgN2Aj7QtW8JMB84pbnJ7wtTDVBVN9NZdnF6c4PfD7p2/yvw1iSrgK0nHXcR8Bvg+JauRZIkSQNuaJZhVNV1wFTPUz6w+Xkh8M8bONaX6Kxdntx+PrBrV9Mb1m0k2ZnOPy7O3KCAJUmSNPSGbWa5L5K8EPgmcExV3dHveCRJkjQzhmZmeVMkOQXYZVLza6rqjI0Zp6o+AnyktcAkSZI0FEa6WK6qZ870OfdbvXKmTzmSJtZOMDZnrN9hDKyF8+f2OwRJkmaFkS6W++H05cv6HcJIGPZnMkqSpNHgmmVJkiSpB4tlSZIkqQeXYbTsoCOW9TuEkTAKa5YXzp/LiqVH9TsMSZK0GSyWW7Zq8eH9DkGDwps9JUkaei7DkCRJknqwWJYkSZJ6sFjuIck5Sfaeon3vJCv6EZMkSZJmlmuWN1JVXQhc2O84JEmStOU5swwkeUOSq5J8OcnHkxzZ7Hp2km8l+V6SA5q+ByY5rY/hSpIkaYbM+mK5WWrxLGBP4C+A7qUX96iqfYFXAW/sQ3iSJEnqI5dhwP7A56pqAiDJ57v2ndz8vAhYMMNxachNrJ1gfHy832EADEwco8J8tstlJ0DaAAAbT0lEQVR8tsdctst8tmfQc7lo0aKe+yyWIdPsu6X5eTvmShtpbM7YtH/5Zsr4+PhAxDEqzGe7zGd7zGW7zGd7hj2Xs34ZBvA14GlJtkuyPXBQvwOSJEnSYJj1s6VVdUGSU4FLgR/QedLFmv5GJUmSpEHgzHLHO6vqIcDBwEOAi6rqwOYxcVTVz6tqQbN9TlU9tX+hSpIkaabM+pnlxn8m2Q3YDvhwVV3c74AkSZLUfxbLQFU9v98xSJIkafC4DEOSJEnqwZnllu23emW/QxgJE2snGJsz1u8wNsvC+XP7HYIkSdpMFsstO335sn6HMBKG/ZmMkiRpNLgMQ5IkSerBYlmSJEnqwWUYLTvoiGX9DmEkDNua5YXz57Ji6VH9DkOSJLXMYrllqxYf3u8Q1A/e2ClJ0khyGYYkSZLUg8WyJEmS1IPF8kZIcnDza7ElSZI0C1gsb5yDAYtlSZKkWWKki+UkC5JcleTDSS5L8ukkj01ycrP/GUkmktwzyXZJrmnaD09yQZJLk3wmyZwkjwGeDrwjyeokC/t5bZIkSdryRrpYbjwE+M+qehjwG2BfYM9m3wHAFcA+wCOBbzbtJ1fVPlX1cOA7wEuq6uvAqcBRVbW4qr4/kxchSZKkmTcbHh33w6pa1WyfCLwSuDrJ/6ZTOL8LeCywNXBe02/3JG8BdgS2B86Y2ZA1bCbWTjA+Pt7vMHoa5NiGkflsl/lsj7lsl/lsz6DnctGiRT33zYZiuaZ4fx7w58CtwFeAE+gUy0c2fU4ADq6qS5MsAQ6cgTg1xMbmjE37F62fxsfHBza2YWQ+22U+22Mu22U+2zPsuZwNyzAelOTRzfZfAl8DzgVeBZxfVT8D5gMPBa5s+u0A/DjJNsChXWPd2OyTJEnSLDAbiuXvAC9KchmwE/ABOmuT70+naAa4DLisqtbNQr+h6fNl4KqusT4BHJXkEm/wkyRJGn2zYRnGHVX18inat123UVV/3b2jqj5Ap6hmUvsqfHScJEnSrDEbZpYlSZKkTTLSM8tVdR2we7/jkCRJ0nByZlmSJEnqYaRnlvthv9Ur+x3CSJhYO8HYnLF+h7HBFs6f2+8QJEnSFmCx3LLTly/rdwgjYdifyShJkkaDyzAkSZKkHiyWJUmSpB5chtGyg45Y1u8QRkI/1iwvnD+XFUuPmtFzSpKkwWax3LJViw/vdwjaVN6cKUmSJnEZhiRJktSDxbIkSZLUg8WyJEmS1MPQFctJliR57xY+x95JVmzJc0iSJGnweYPfFKrqQuDCfschSZKk/hqImeUkRyd5ZbP97iRnNdtPSHJiksOSfC/JV4H9uo57WpJvJrkkyVeS3D/JVknGk9y36bNVkquT3KfHuZ+d5IoklyY5t2k7MMlpzfayJB9Kck6Sa9bFKUmSpNE3KDPL5wL/CKwA9ga2TbINsD8wDvwzsBewBjgbuKQ57mvAo6qqkrwUOLqq/jHJicChwHuAJwKXVtXPe5x7KfCkqro+yY49+jwUeBywA/DdJB+oqls375I1aCbWTjA+Pt7vMLaIUb2ufjGf7TKf7TGX7TKf7Rn0XC5atKjnvkEpli8C9kqyA3ALcDGdovkA4CzgnKr6GUCSk4Bdm+MeCJyU5AHAPYFrm/YPAZ+jUyy/GDh+mnOvAk5I8kng5B59Tq+qW4BbkvwUuD/wo025UA2usTlj0/5lGVbj4+MjeV39Yj7bZT7bYy7bZT7bM+y5HIhlGM0s7XXAYcDXgfPozOQuBL4DVI9D/x14b1XtAbwM2K4Z74fAT5I8Hngk8MVpzv1y4PXAHwKrk8yfotstXdu3Mzj/yJAkSdIWNBDFcuNc4Mjm53nAy4HVwDeAA5PMb5ZmPLvrmHnA9c32iyaNdxxwIvDJqrq910mTLKyqb1bVUuDndIpmSZIkaaCK5fOABwDnV9VPgJuB86rqx8Ay4HzgK3SWaKyzDPhUkvPoFLrdTgW2Z/olGADvSHJ5kivoFOqXbuZ1SJIkaUQMzHKCqvofYJuu97t2bR/PFEVvVX2OztrkqTyczo19V63nvH8xRfM5zYuqWjap/+7TjSdJkqTRMTDFcpuSvBb4GzpPxJAkSZI2yUgWy1X1NuBt3W1JjuGu650BPlVVx85YYJIkSRoqI1ksT6Upird4Ybzf6pVb+hSzwsTaCcbmjM3oORfOnzuj55MkSYNv1hTLM+X05cv6HcJIGPZnMkqSpNEwSE/DkCRJkgaKxbIkSZLUg8swWnbQEcv6HcJQWjh/LiuWHtXvMCRJku7CYrllqxYf3u8QhpM3RkqSpAHkMgxJkiSpB4tlSZIkqQeL5Q2Q5KZ+xyBJkqSZZ7G8HkmCeZIkSZqVLAKnkGRBku8keT9wMTCW5Ngklyb5RpL79ztGSZIkbXkWy709BPhIVe3ZvP9GVT0cOBfwkReSJEmzQKqq3zEMnCQLgLOrapfm/S3AdlVVSZ4L/GlVvXRd/zVr1tyZxB2Pv36Gox0Nj/j6Cv7j6Jeuv6MkSVLLFi1adOf2vHnz0r3P5yz39tuu7Vvr9/+quB3z1rqxOWN3+aCOj4/f5b02nblsl/lsl/lsj7lsl/lsz7Dn0mUYkiRJUg8Wy5IkSVIPLieYQlVdB+ze9X77ru1PA5/uQ1iSJEmaYc4sS5IkST1YLEuSJEk9WCxLkiRJPbhmuWX7rV7Z7xCG0sL5c/sdgiRJ0t1YLLfs9OXL+h2CJEmSWuIyDEmSJKkHi2VJkiSpB5dhtOygI5b1O4RNtnD+XFYsParfYUiSJA0Mi+WWrVp8eL9D2HTenChJknQXLsOQJEmSerBYliRJknqwWJYkSZJ6GJpiOcmSJO/tdxySJEmaPYamWJYkSZJmWl+L5SRHJ3lls/3uJGc1209IcmKSw5J8L8lXgf26jntakm8muSTJV5LcP8lWScaT3Lfps1WSq5Pcp8e575/klCSXNq/HNO2vTnJF83pV0zY3yelNvyuSPHcLp0aSJEkDoN+PjjsX+EdgBbA3sG2SbYD9gXHgn4G9gDXA2cAlzXFfAx5VVZXkpcDRVfWPSU4EDgXeAzwRuLSqft7j3CuAr1bVM5NsDWyfZC/gMOCRQIBvNoX6g4EbquoggCTzWs3CgJhYO8H4+Hi/w7jTIMUy7Mxlu8xnu8xne8xlu8xnewY9l4sWLeq5r9/F8kXAXkl2AG4BLqZTNB8AnAWcU1U/A0hyErBrc9wDgZOSPAC4J3Bt0/4h4HN0iuUXA8dPc+7HAy8EqKrbgTVJ9gdOqarfNuc8uYnlS8A7k7wdOK2qzmvh2gfO2JyxaT8sM2l8fHxgYhl25rJd5rNd5rM95rJd5rM9w57Lvi7DqKpbgevozOZ+HTgPeBywEPgOUD0O/XfgvVW1B/AyYLtmvB8CP0nyeDqzw1/cyJDSI87v0Znhvhx4a5KlGzmuJEmShtAg3OB3LnBk8/M84OXAauAbwIFJ5jdLM57ddcw84Ppm+0WTxjsOOBH4ZDNj3Mv/AH8DkGTrJPdqYjg4yZwkc4FnAucl2RlYW1UnAu8EHrHJVytJkqShMQjF8nnAA4Dzq+onwM3AeVX1Y2AZcD7wFTpLNNZZBnwqyXnA5DXJpwLbM/0SDIAjgMcluZzOcpA/qaqLgROAbwHfBI6rqkuAPYBvJVkNHAO8ZZOuVJIkSUOl32uWqar/Abbper9r1/bxTFH0VtXn6KxNnsrD6dzYd9V6zvsT4BlTtL8LeNektjOAM6YbT5IkSaOn78Vym5K8ls7SikP7HYskSZKG30gVy1X1NuBt3W1JjuGu650BPlVVx85YYJIkSRpKI1UsT6UpimesMN5v9cqZOlXrFs6f2+8QJEmSBsrIF8sz7fTly/odgiRJkloyCE/DkCRJkgaSxbIkSZLUg8swWnbQEcv6HcImWTh/LiuWHtXvMCRJkgaKxXLLVi0+vN8hbJohvjFRkiRpS3EZhiRJktSDxbIkSZLUg8WyJEmS1IPFsiRJktSDxXIjydFJXtlsvzvJWc32E5J8PMkJSa5IcnmSf+hvtJIkSZoJFsu/dy5wQLO9N7B9km2A/YHVwB9U1e5VtQdwfJ9ilCRJ0gxKVfU7hoHQFMbfBR4OnAJcCXwCeDPwRuAjwBeA04Ezq+qOdceuWbPmziTuePz1Mxh1ex7x9RX8x9Ev7XcYkiRJM27RokV3bs+bNy/d+3zOcqOqbk1yHXAY8HXgMuBxwMLm/cOBJwF/BzwHeHF/It0yxuaM3eWD0m/j4+MDFc8wM5ftMp/tMp/tMZftMp/tGfZcWizf1bnAkXQK4cuBdwEXAfOB31XVZ5J8HzihbxFKkiRpxlgs39V5wDHA+VX12yQ3N21/AByfZN0a79f1K0BJkiTNHIvlLlX1P8A2Xe937dr9iJmPSJIkSf3k0zAkSZKkHiyWJUmSpB4sliVJkqQeXLPcsv1Wr+x3CJtk4fy5/Q5BkiRp4Fgst+z05cv6HYIkSZJa4jIMSZIkqQeLZUmSJKkHl2G07KAjlvU7hGktnD+XFUuP6ncYkiRJQ8FiuWWrFh/e7xCmN6Q3IEqSJPWDyzAkSZKkHiyWJUmSpB4sliVJkqQeLJbXI8mSJDt3vT8uyW79jEmSJEkzwxv81m8JcAVwA0BVvbSv0UiSJGnGWCx3SfIG4FDgh8DPgYuAvYGPJZkAHg18ETiyqi7sW6CSJEmaERbLjSR7A88C9qSTl4vpFMsX0lUcJ+lbjG2YWDvB+Ph4v8PYIMMS5zAwl+0yn+0yn+0xl+0yn+0Z9FwuWrSo5z6L5d/bH/hcVU0AJPl8n+PZIsbmjE37gRgU4+PjQxHnMDCX7TKf7TKf7TGX7TKf7Rn2XHqD3+8N95SxJEmSWmex/HtfA56WZLsk2wMHNe03Ajv0LyxJkiT1i8swGlV1QZJTgUuBH9BZq7wGOAH4YNcNfpIkSZolLJbv6p1VtSzJHOBc4N+q6mLgM119DuxLZJIkSZpxFst39Z/NLxzZDvhwUyhLkiRplrJY7lJVz+93DJIkSRocFsst22/1yn6HMK2F8+f2OwRJkqShYbHcstOXL+t3CJIkSWqJj46TJEmSerBYliRJknqwWJYkSZJ6sFiWJEmSerBYliRJknqwWJYkSZJ6sFiWJEmSerBYliRJknqwWJYkSZJ6sFiWJEmSerBYliRJknqwWJYkSZJ6SFX1O4aht2bNGpMoSZI0AubNm5fu984sS5IkST1YLEuSJEk9uAxDkiRJ6sGZZUmSJKkHi+Uekjw5yXeTXJ3ktVPs3zbJSc3+byZZ0LXvdU37d5M8aUPHHFWbmsskf5rkoiSXNz8f33XMOc2Yq5vX/WbuivprM/K5IMlEV84+2HXMXk2er06yIkkmjzuKNiOXh3blcXWSO5Isbvb52eydz8cmuTjJbUkOmbTvRUnGm9eLutpn5WcTNj2fSRYnOT/JlUkuS/Lcrn0nJLm26/O5eKaup58287N5e1e+Tu1q36X5XhhvvifuORPXMgg247P5uEnfnTcnObjZN7ifzaryNekFbA18H3gwcE/gUmC3SX3+Fvhgs/084KRme7em/7bALs04W2/ImKP42sxc7gns3GzvDlzfdcw5wN79vr4hy+cC4Ioe434LeDQQ4IvAn/f7Wgc5l5P67AFc0/Xez2bvfC4AHgZ8BDikq30n4Jrm572b7Xs3+2bdZ7OFfO4KLGq2dwZ+DOzYvD+hu+9seG1OLpt9N/UY95PA85rtDwJ/0+9rHYZ8dvXZCfglMKd5P7CfTWeWp7YvcHVVXVNVvwM+ATxjUp9nAB9utj8NPKGZ8XgG8ImquqWqrgWubsbbkDFH0SbnsqouqaobmvYrge2SbDsjUQ+uzflsTinJA4B7VdX51fnG+ghwcPuhD5y2cvmXwMe3aKTDYb35rKrrquoy4I5Jxz4J+HJV/bKqfgV8GXjyLP5swmbks6q+V1XjzfYNwE+B+85M2ANpcz6bU2q+Bx5P53sBOt8TfjYbG5jPQ4AvVtXaLRdqOyyWp/YHwA+73v+oaZuyT1XdBqwB5k9z7IaMOYo2J5fdngVcUlW3dLUd3/yvmjfMov81u7n53CXJJUm+muSArv4/Ws+Yo6itz+ZzuXux7Gdz4z5H031vzsbPJrT034wk+9KZ/ft+V/OxzfKMd8+SCYjNzeV2SS5M8o11SwbofA/8uvle2JQxh1lb9czzuPt350B+Ni2WpzbVf9wmPzakV5+NbR91m5PLzs7kT4C3Ay/r2n9oVe0BHNC8XrCZcQ6Lzcnnj4EHVdWewKuB/05yrw0ccxS18dl8JLC2qq7o2u9n8/c29HPk9+bdbfa1NzPzHwUOq6p1M3yvAx4K7EPnf4O/ZnOCHBKbm8sHVdXewPOB9yRZ2MKYw6ytz+YewBldzQP72bRYntqPgD/sev9A4IZefZLcA5hHZ+1Nr2M3ZMxRtDm5JMkDgVOAF1bVnTMjVXV98/NG4L/p/G+h2WCT89ksDfoFQFVdRGemadem/wPXM+Yo2qzPZuNuMyN+Nu+0MZ+j6b43Z+NnEzbzvxnNP4RPB15fVd9Y115VP66OW4DjmR2fz83K5brlgFV1DZ17EvYEfg7s2HwvbPSYQ66NeuY5wClVdeu6hkH+bFosT+0CYFFzp+s96fwH8dRJfU4F1t2xfQhwVrOm7lTgeencRb8LsIjODSobMuYo2uRcJtmRzpf966pq1brOSe6R5D7N9jbAU4ErmB02J5/3TbI1QJIH0/lsXlNVPwZuTPKoZsnAC4HPzcTF9Nnm/D0nyVbAs+ms16Np87O5ad9xZwB/luTeSe4N/Blwxiz+bMJm5LPpfwrwkar61KR9D2h+hs4a29nw+dycXN573XKA5u/2fsC3m++Bs+l8L0Dne8LP5oa7270eA/3Z7PcdhoP6Ap4CfI/O7NsxTdubgKc329sBn6JzA9+3gAd3HXtMc9x36bpze6oxZ8NrU3MJvB74LbC663U/YC5wEXAZnRv/lgNb9/s6hyCfz2rydSlwMfC0rjH3pvPF9H3gvTS/sGjUX5v59/xA4BuTxvOzOX0+96EzK/Vb4BfAlV3HvrjJ89V0lg3M6s/m5uQT+Cvg1knfnYubfWcBlzc5PRHYvt/XOeC5fEyTr0ubny/pGvPBzffC1c33xLb9vs5Bz2ezbwFwPbDVpDEH9rPpb/CTJEmSenAZhiRJktSDxbIkSZLUg8WyJEmS1IPFsiRJktSDxbIkSZLUg8WyJA2ZJMuSnNhsPyjJTeueod3iOa5L8sQ2x5SkYWSxLEmTNIXiT5LM7Wp7aZJz+hjWlKrq/1bV9lV1+0ydM8kJSd4yU+ebTvc/HCRpS7BYlqSp3QM4YnMHSYfftVtA168alqQtxi9wSZraO4Ajm1+7fjdJHpPkgiRrmp+P6dp3TpJjk6wC1gIPbtrekuTrzbKJzyeZn+RjSX7TjLGga4zlSX7Y7LsoyQE94liQpJpftf3oZux1r5uTXNf02yrJa5N8P8kvknwyyU5d47wgyQ+afcdsaJK6zn9YE++vkrw8yT5JLkvy6yTv7eq/JMmqJP/e5O6qJE/o2r9zklOT/DLJ1UkO79q3LMmnk5yY5DfAy4F/Ap7bXO+lTb/DknwnyY1Jrknysq4xDkzyoyT/mOSnSX6c5LCu/WNJ/q3JxZokX0sy1ux7VPPn9+sklyY5cEPzJGl4WSxL0tQuBM4Bjpy8oykyTwdWAPOBdwGnJ5nf1e0FwF8DOwA/aNqe17T/AbAQOB84HtgJ+A7wxq7jLwAWN/v+G/hUku2mC7iqzm+WZGwP3Bv4BvDxZvcrgYOB/wPsDPwKeF9zPbsBH2hi27m5pgdOd64pPBJYBDwXeA9wDPBE4E+A5yT5P5P6XgPcp7nmk7sK94/T+TW5OwOHAP/SXUwDzwA+DewI/BfwL8BJzXU/vOnzU+CpwL2Aw4B3J3lE1xj/C5hH58/hJcD7kty72fdOYC86v+Z4J+Bo4I4kf0Dnz/wtTfuRwGeS3Hcj8yRpyFgsS1JvS4G/n6IgOggYr6qPVtVtVfVx4CrgaV19TqiqK5v9tzZtx1fV96tqDfBF4PtV9ZWqug34FLDnuoOr6sSq+kVz/L8B2wIP2YjYVwC/pVO0ArwMOKaqflRVtwDLgEOapQyHAKdV1bnNvjcAd2zEuQDeXFU3V9WZzXk/XlU/rarrgfO6r41OMfueqrq1qk4CvgsclOQP/3879w9aZxXGcfz7+KcNaqFKtVix6aTo4OqSQRCUSCWLVQwiIjoJdfBPnSyklOoiuDmIizpUKxXRC9bJpeoighah6FBiWkqSEqnagq2Pwzlv++Zy35vbkiCR7wdC8ubcnPeck+E+9/A7LzAB7Kl9/QC8SyniG99k5qeZ+U9mnhs0kMz8oq5zZubXwBGgvTP/NzBT798D/gDurnGZZ4EXM3MuMy9m5tG6Jk8Bvczs1Xt/RflA9cgVrpOkdcZiWZI6ZOZPwOfAa31N27i8W9w4QdmpbMwO6PJ06+dzA65vai5qTODnGgVYouyEbhll3DV28AAwnZlN0TsOHK4RgiXKTvZFYGudz6XxZuafwOIo97qauQFzmZmt6xN1DNuAM5l5tq9tpXVdJiImI+LbGuVYohS07bVbrB9QGn/V8W0BxoBfB3Q7Duxq1q/2OwHcvtJ4JK1vFsuSNNxe4HmWF2wnKcVT23ZgrnWdXKWaT94DPA7cnJmbgd+BGPFv9wFTdQe7MQtMZubm1tdY3fk9BdzZ6uMGShRjrdwREe25bKes6UnglojY1Nc2bF2XXUfERuATSpxia127HiOsHbAAnKdEZPrNAu/3rd+NmfnGCP1KWscsliVpiMz8BThIyfw2esBdETFdD9Y9AdxL2YVeDZuAC8A8cF1EvE7J3w5VYwwHgacz83hf8zvA/ogYr6+9NSKmatshYGdETETEBmCGtX1/uA3YHRHXR8Qu4B5KxGEWOAociIixiLiPkin+cEhfp4EdcfmJIxsokZV54EJETAIPjTKougv/HvBWPWh4bT00uRH4AHg0Ih6uvx+rhwWvNNstaZ2xWJaklc0Al565nJmLlANkL1HiCq8COzNzYZXu9yUl03ycEkM4zwjxA+BByuG1Q60nYhyrbW8DnwFHIuIs5fDf/XU+x4AXKAcJT1EO//22SnMZ5DvKYcAFYD/wWF1TgCeBHZRd5sPA3poP7vJx/b4YEd/XCMdu4CPKPKYp8x7Vy8CPlAOWZ4A3gWtqIT9FefrGPOX/8Qq+j0r/e7E8NiZJ0tqJiGeA5zJz4r8eiySNwk/EkiRJUgeLZUmSJKmDMQxJkiSpgzvLkiRJUgeLZUmSJKmDxbIkSZLUwWJZkiRJ6mCxLEmSJHWwWJYkSZI6/As4641Q0cnxYQAAAABJRU5ErkJggg==\n",
1038 |       "text/plain": [
1039 |        "<matplotlib.figure.Figure at 0x249f108ddd8>"
1040 |       ]
1041 |      },
1042 |      "metadata": {},
1043 |      "output_type": "display_data"
1044 |     }
1045 |    ],
1046 |    "source": [
1047 |     "_ = plot_feature_importances(feature_importances)"
1048 |    ]
1049 |   },
1050 |   {
1051 |    "cell_type": "markdown",
1052 |    "metadata": {},
1053 |    "source": [
1054 |     "# Gradient Boosting Machine Modeling Function\n",
1055 |     "\n",
1056 |     "The next step is to refactor all of the indvidual code we walked through into a single function. This function will create the model, perform cross validation to find the optimal number of iterations, train the model on the whole training dataset using the optimal number of iterations, make predictions on the test data, and evaluate the MAPE on the test set. In order to integrate with the other models implemented in the modeling notebook, this function needs to return a numpy array:\n",
1057 |     "\n",
1058 |     "`['model', train_time, test_time, test_mape]`\n",
1059 |     "\n",
1060 |     "It also must take in the same arguments: a set of training features, training targets, a set of testing features, testing targets. We will suppress the messages returned during training for this function."
1061 |    ]
1062 |   },
1063 |   {
1064 |    "cell_type": "code",
1065 |    "execution_count": 14,
1066 |    "metadata": {},
1067 |    "outputs": [],
1068 |    "source": [
1069 |     "def gbm_model(train, targets, test, test_targets, n_folds = 5):\n",
1070 |     "    \"\"\"Train and test a light gradient boosting model using\n",
1071 |     "    cross validation to select the optimal number of training iterations. \n",
1072 |     "    \n",
1073 |     "    Parameters\n",
1074 |     "    --------\n",
1075 |     "        train : dataframe, shape = [n_training_samples, n_features]\n",
1076 |     "            Set of training features for training a model\n",
1077 |     "    \n",
1078 |     "        targets : array, shape = [n_training_samples]\n",
1079 |     "            Array of training targets for training a model\n",
1080 |     "\n",
1081 |     "        test : dataframe, shape = [n_testing_samples, n_features]\n",
1082 |     "            Set of testing features for making predictions with a model\n",
1083 |     "\n",
1084 |     "        test_targets : array, shape = [n_testing_samples]\n",
1085 |     "            Array of testing targets for evaluating the model predictions\n",
1086 |     "        \n",
1087 |     "    Return\n",
1088 |     "    --------\n",
1089 |     "        results : array, shape = [4]\n",
1090 |     "            Numpy array of results. \n",
1091 |     "            First entry is the model, second is the training time,\n",
1092 |     "            third is the testing time, and fourth is the MAPE. All entries\n",
1093 |     "            are in strings and so will need to be converted to numbers.\n",
1094 |     "        \n",
1095 |     "    \"\"\"\n",
1096 |     "    \n",
1097 |     "    # KFold cross validation object\n",
1098 |     "    kfold = KFold(n_splits = n_folds)\n",
1099 |     "    \n",
1100 |     "    # Convert to numpy arrays\n",
1101 |     "    train = np.array(train)\n",
1102 |     "    test = np.array(test)\n",
1103 |     "    \n",
1104 |     "    best_iterations = 0\n",
1105 |     "    \n",
1106 |     "    # Create the model with specified hyperparaters\n",
1107 |     "    model = lgb.LGBMRegressor(n_estimators=10000,\n",
1108 |     "                              learning_rate = 0.01, \n",
1109 |     "                              reg_alpha = 0.1, reg_lambda = 0.1, \n",
1110 |     "                              subsample = 0.9, n_jobs = -1)\n",
1111 |     "    \n",
1112 |     "    # Cross validation to find optimal number of iterations\n",
1113 |     "    for train_indices, valid_indices in kfold.split(train):\n",
1114 |     "        \n",
1115 |     "        # Training data for fold\n",
1116 |     "        train_features, train_targets = np.array(train)[train_indices], targets[train_indices]\n",
1117 |     "    \n",
1118 |     "        # Validation data for fold\n",
1119 |     "        valid_features, valid_targets = np.array(train)[valid_indices], targets[valid_indices]\n",
1120 |     "\n",
1121 |     "        # Train the model\n",
1122 |     "        model.fit(X = train_features, y = train_targets, \n",
1123 |     "                  early_stopping_rounds = 100,\n",
1124 |     "                  eval_metric = mape,\n",
1125 |     "                  eval_set = [(valid_features, valid_targets)],\n",
1126 |     "                  eval_names = ['valid'], verbose = False)\n",
1127 |     "    \n",
1128 |     "        # Add the number of iterations to the total for averaging\n",
1129 |     "        best_iterations += model.best_iteration_\n",
1130 |     "        \n",
1131 |     "        \n",
1132 |     "    # Average the best iterations across folds\n",
1133 |     "    best_iterations = int(best_iterations / kfold.n_splits)\n",
1134 |     "    \n",
1135 |     "    # Create the model with optimal number of iterations\n",
1136 |     "    model = lgb.LGBMRegressor(n_estimators=iterations,\n",
1137 |     "                              learning_rate = 0.01, \n",
1138 |     "                              reg_alpha = 0.1, reg_lambda = 0.1, \n",
1139 |     "                              subsample = 0.9, n_jobs = -1)\n",
1140 |     "    \n",
1141 |     "    # Start the training time\n",
1142 |     "    start = timer()\n",
1143 |     "    \n",
1144 |     "    # Fit on the entire training set\n",
1145 |     "    model.fit(train, targets, verbose = False)\n",
1146 |     "    \n",
1147 |     "    # End the training time\n",
1148 |     "    end = timer()\n",
1149 |     "    train_time = end - start\n",
1150 |     "    \n",
1151 |     "    # Start the testing time\n",
1152 |     "    start = timer()\n",
1153 |     "    \n",
1154 |     "    # Make predictions on the testing data\n",
1155 |     "    predictions = model.predict(test)\n",
1156 |     "    \n",
1157 |     "    # End the testing time\n",
1158 |     "    end = timer()\n",
1159 |     "    test_time = end - start\n",
1160 |     "    \n",
1161 |     "    # Calculate the mape\n",
1162 |     "    _, test_mape, _ = mape(test_targets, predictions)\n",
1163 |     "    \n",
1164 |     "    \n",
1165 |     "    # Record the results and return\n",
1166 |     "    results = np.array(['gbm', train_time, test_time, test_mape])\n",
1167 |     "    \n",
1168 |     "    return results"
1169 |    ]
1170 |   },
1171 |   {
1172 |    "cell_type": "code",
1173 |    "execution_count": 15,
1174 |    "metadata": {},
1175 |    "outputs": [
1176 |     {
1177 |      "data": {
1178 |       "text/plain": [
1179 |        "array(['gbm', '6.21472920343108', '0.31054537055683884',\n",
1180 |        "       '16.44666645768823'], dtype='<U19')"
1181 |       ]
1182 |      },
1183 |      "execution_count": 15,
1184 |      "metadata": {},
1185 |      "output_type": "execute_result"
1186 |     }
1187 |    ],
1188 |    "source": [
1189 |     "# Use the model\n",
1190 |     "gbm_results = gbm_model(train, targets, test, test_targets)\n",
1191 |     "gbm_results"
1192 |    ]
1193 |   },
1194 |   {
1195 |    "cell_type": "markdown",
1196 |    "metadata": {},
1197 |    "source": [
1198 |     "### Interface with the other Models\n",
1199 |     "\n",
1200 |     "When we evaluate the models, it will be simplest to do all at once. Therefore, we need to fit this model into our existing functions. The return value is the same, so we can call the model function and add the results to the rest. \n",
1201 |     "\n",
1202 |     "To integrate this function with the six previous models already developed, we can simply add in the function call to the `evaluate_models` function we already wrote. This is the best part about writing functions with standard outputs and inputs: everything can work together! Writing functions increases efficiency and makes reproducing results much easier than ad hoc code scattered around a notebook (speaking from experience)."
1203 |    ]
1204 |   },
1205 |   {
1206 |    "cell_type": "code",
1207 |    "execution_count": 16,
1208 |    "metadata": {},
1209 |    "outputs": [],
1210 |    "source": [
1211 |     "def evaluate_models(df):\n",
1212 |     "    \"\"\"Evaluate machine learning models\n",
1213 |     "    on a building energy dataset. More models can be added\n",
1214 |     "    to the function as required. \n",
1215 |     "    \n",
1216 |     "    \n",
1217 |     "    Parameters\n",
1218 |     "    --------\n",
1219 |     "    df : dataframe\n",
1220 |     "        Building energy dataframe. Each row must have one observation\n",
1221 |     "        and the columns must contain the features. The dataframe\n",
1222 |     "        needs to have an \"elec_cons\" column to be used as targets. \n",
1223 |     "    \n",
1224 |     "    Return\n",
1225 |     "    --------\n",
1226 |     "    results : dataframe, shape = [n_models, 4]\n",
1227 |     "        Modeling metrics. A dataframe with columns:\n",
1228 |     "        model, train_time, test_time, mape. Used for comparing\n",
1229 |     "        models for a given building dataset\n",
1230 |     "        \n",
1231 |     "    \"\"\"\n",
1232 |     "    try:\n",
1233 |     "        # Preprocess the data for machine learning\n",
1234 |     "        train, train_targets, test, test_targets = preprocess_data(df, test_days = 183, scale = True)\n",
1235 |     "    except Exception as e:\n",
1236 |     "        print('Error processing data: ', e)\n",
1237 |     "        return\n",
1238 |     "        \n",
1239 |     "    # elasticnet\n",
1240 |     "    model = ElasticNet(alpha = 1.0, l1_ratio=0.5)\n",
1241 |     "    elasticnet_results = implement_model(model, train, train_targets, test, \n",
1242 |     "                                         test_targets, model_name = 'elasticnet')\n",
1243 |     "    \n",
1244 |     "    # knn\n",
1245 |     "    model = KNeighborsRegressor()\n",
1246 |     "    knn_results = implement_model(model, train, train_targets, test, \n",
1247 |     "                                  test_targets, model_name = 'knn')\n",
1248 |     "    \n",
1249 |     "    # svm\n",
1250 |     "    model = SVR()\n",
1251 |     "    svm_results = implement_model(model, train, train_targets, test, \n",
1252 |     "                                   test_targets, model_name = 'svm')\n",
1253 |     "    \n",
1254 |     "    # rf\n",
1255 |     "    model = RandomForestRegressor(n_estimators = 100, n_jobs = -1)\n",
1256 |     "    rf_results = implement_model(model, train, train_targets, test, \n",
1257 |     "                                  test_targets, model_name = 'rf')\n",
1258 |     "    \n",
1259 |     "    # et\n",
1260 |     "    model = ExtraTreesRegressor(n_estimators=100, n_jobs = -1)\n",
1261 |     "    et_results = implement_model(model, train, train_targets, test, \n",
1262 |     "                                  test_targets, model_name = 'et')\n",
1263 |     "    \n",
1264 |     "    # adaboost\n",
1265 |     "    model = AdaBoostRegressor(n_estimators = 1000, learning_rate = 0.05, \n",
1266 |     "                              loss = 'exponential')\n",
1267 |     "    adaboost_results = implement_model(model, train, train_targets, test, \n",
1268 |     "                                       test_targets, model_name = 'adaboost')\n",
1269 |     "    \n",
1270 |     "    # gbm\n",
1271 |     "    gbm_results = gbm_model(train, train_targets, test, test_targets)\n",
1272 |     "    \n",
1273 |     "    # Put the results into a single array (stack the rows)\n",
1274 |     "    results = np.vstack((elasticnet_results, knn_results, svm_results,\n",
1275 |     "                         rf_results, et_results, gbm_results, adaboost_results))\n",
1276 |     "    \n",
1277 |     "    # Convert the results to a dataframe\n",
1278 |     "    results = pd.DataFrame(results, columns = ['model', 'train_time', 'test_time', 'mape'])\n",
1279 |     "    \n",
1280 |     "    # Convert the numeric results to numbers\n",
1281 |     "    results.iloc[:, 1:] = results.iloc[:, 1:].astype(np.float32)\n",
1282 |     "    \n",
1283 |     "    return results"
1284 |    ]
1285 |   },
1286 |   {
1287 |    "cell_type": "code",
1288 |    "execution_count": 17,
1289 |    "metadata": {},
1290 |    "outputs": [
1291 |     {
1292 |      "data": {
1293 |       "text/html": [
1294 |        "<div>\n",
1295 |        "<style scoped>\n",
1296 |        "    .dataframe tbody tr th:only-of-type {\n",
1297 |        "        vertical-align: middle;\n",
1298 |        "    }\n",
1299 |        "\n",
1300 |        "    .dataframe tbody tr th {\n",
1301 |        "        vertical-align: top;\n",
1302 |        "    }\n",
1303 |        "\n",
1304 |        "    .dataframe thead th {\n",
1305 |        "        text-align: right;\n",
1306 |        "    }\n",
1307 |        "</style>\n",
1308 |        "<table border=\"1\" class=\"dataframe\">\n",
1309 |        "  <thead>\n",
1310 |        "    <tr style=\"text-align: right;\">\n",
1311 |        "      <th></th>\n",
1312 |        "      <th>model</th>\n",
1313 |        "      <th>train_time</th>\n",
1314 |        "      <th>test_time</th>\n",
1315 |        "      <th>mape</th>\n",
1316 |        "    </tr>\n",
1317 |        "  </thead>\n",
1318 |        "  <tbody>\n",
1319 |        "    <tr>\n",
1320 |        "      <th>0</th>\n",
1321 |        "      <td>elasticnet</td>\n",
1322 |        "      <td>0.0765513</td>\n",
1323 |        "      <td>0.00460913</td>\n",
1324 |        "      <td>56.9737</td>\n",
1325 |        "    </tr>\n",
1326 |        "    <tr>\n",
1327 |        "      <th>1</th>\n",
1328 |        "      <td>knn</td>\n",
1329 |        "      <td>54.2147</td>\n",
1330 |        "      <td>3.23389</td>\n",
1331 |        "      <td>23.7799</td>\n",
1332 |        "    </tr>\n",
1333 |        "    <tr>\n",
1334 |        "      <th>2</th>\n",
1335 |        "      <td>svm</td>\n",
1336 |        "      <td>674.615</td>\n",
1337 |        "      <td>55.4858</td>\n",
1338 |        "      <td>24.8263</td>\n",
1339 |        "    </tr>\n",
1340 |        "    <tr>\n",
1341 |        "      <th>3</th>\n",
1342 |        "      <td>rf</td>\n",
1343 |        "      <td>28.751</td>\n",
1344 |        "      <td>0.219195</td>\n",
1345 |        "      <td>16.1971</td>\n",
1346 |        "    </tr>\n",
1347 |        "    <tr>\n",
1348 |        "      <th>4</th>\n",
1349 |        "      <td>et</td>\n",
1350 |        "      <td>14.8552</td>\n",
1351 |        "      <td>0.208349</td>\n",
1352 |        "      <td>18.0844</td>\n",
1353 |        "    </tr>\n",
1354 |        "    <tr>\n",
1355 |        "      <th>5</th>\n",
1356 |        "      <td>gbm</td>\n",
1357 |        "      <td>6.95854</td>\n",
1358 |        "      <td>0.338123</td>\n",
1359 |        "      <td>16.4467</td>\n",
1360 |        "    </tr>\n",
1361 |        "    <tr>\n",
1362 |        "      <th>6</th>\n",
1363 |        "      <td>adaboost</td>\n",
1364 |        "      <td>280.147</td>\n",
1365 |        "      <td>2.07414</td>\n",
1366 |        "      <td>38.9557</td>\n",
1367 |        "    </tr>\n",
1368 |        "  </tbody>\n",
1369 |        "</table>\n",
1370 |        "</div>"
1371 |       ],
1372 |       "text/plain": [
1373 |        "        model train_time   test_time     mape\n",
1374 |        "0  elasticnet  0.0765513  0.00460913  56.9737\n",
1375 |        "1         knn    54.2147     3.23389  23.7799\n",
1376 |        "2         svm    674.615     55.4858  24.8263\n",
1377 |        "3          rf     28.751    0.219195  16.1971\n",
1378 |        "4          et    14.8552    0.208349  18.0844\n",
1379 |        "5         gbm    6.95854    0.338123  16.4467\n",
1380 |        "6    adaboost    280.147     2.07414  38.9557"
1381 |       ]
1382 |      },
1383 |      "execution_count": 17,
1384 |      "metadata": {},
1385 |      "output_type": "execute_result"
1386 |     }
1387 |    ],
1388 |    "source": [
1389 |     "df = pd.read_csv('../data/f-APS_weather.csv')\n",
1390 |     "model_results = evaluate_models(df)\n",
1391 |     "model_results"
1392 |    ]
1393 |   },
1394 |   {
1395 |    "cell_type": "code",
1396 |    "execution_count": 18,
1397 |    "metadata": {},
1398 |    "outputs": [
1399 |     {
1400 |      "data": {
1401 |       "text/html": [
1402 |        "<div>\n",
1403 |        "<style scoped>\n",
1404 |        "    .dataframe tbody tr th:only-of-type {\n",
1405 |        "        vertical-align: middle;\n",
1406 |        "    }\n",
1407 |        "\n",
1408 |        "    .dataframe tbody tr th {\n",
1409 |        "        vertical-align: top;\n",
1410 |        "    }\n",
1411 |        "\n",
1412 |        "    .dataframe thead th {\n",
1413 |        "        text-align: right;\n",
1414 |        "    }\n",
1415 |        "</style>\n",
1416 |        "<table border=\"1\" class=\"dataframe\">\n",
1417 |        "  <thead>\n",
1418 |        "    <tr style=\"text-align: right;\">\n",
1419 |        "      <th></th>\n",
1420 |        "      <th>model</th>\n",
1421 |        "      <th>train_time</th>\n",
1422 |        "      <th>test_time</th>\n",
1423 |        "      <th>mape</th>\n",
1424 |        "    </tr>\n",
1425 |        "  </thead>\n",
1426 |        "  <tbody>\n",
1427 |        "    <tr>\n",
1428 |        "      <th>0</th>\n",
1429 |        "      <td>elasticnet</td>\n",
1430 |        "      <td>0.00777841</td>\n",
1431 |        "      <td>0.00189042</td>\n",
1432 |        "      <td>52.3209</td>\n",
1433 |        "    </tr>\n",
1434 |        "    <tr>\n",
1435 |        "      <th>1</th>\n",
1436 |        "      <td>knn</td>\n",
1437 |        "      <td>0.0735118</td>\n",
1438 |        "      <td>4.14259</td>\n",
1439 |        "      <td>27.4117</td>\n",
1440 |        "    </tr>\n",
1441 |        "    <tr>\n",
1442 |        "      <th>2</th>\n",
1443 |        "      <td>svm</td>\n",
1444 |        "      <td>10.1074</td>\n",
1445 |        "      <td>8.33936</td>\n",
1446 |        "      <td>38.2993</td>\n",
1447 |        "    </tr>\n",
1448 |        "    <tr>\n",
1449 |        "      <th>3</th>\n",
1450 |        "      <td>rf</td>\n",
1451 |        "      <td>2.51328</td>\n",
1452 |        "      <td>0.104967</td>\n",
1453 |        "      <td>21.6745</td>\n",
1454 |        "    </tr>\n",
1455 |        "    <tr>\n",
1456 |        "      <th>4</th>\n",
1457 |        "      <td>et</td>\n",
1458 |        "      <td>1.40114</td>\n",
1459 |        "      <td>0.105561</td>\n",
1460 |        "      <td>24.4212</td>\n",
1461 |        "    </tr>\n",
1462 |        "    <tr>\n",
1463 |        "      <th>5</th>\n",
1464 |        "      <td>gbm</td>\n",
1465 |        "      <td>2.54485</td>\n",
1466 |        "      <td>0.37309</td>\n",
1467 |        "      <td>21.8775</td>\n",
1468 |        "    </tr>\n",
1469 |        "    <tr>\n",
1470 |        "      <th>6</th>\n",
1471 |        "      <td>adaboost</td>\n",
1472 |        "      <td>27.7201</td>\n",
1473 |        "      <td>1.90376</td>\n",
1474 |        "      <td>26.7783</td>\n",
1475 |        "    </tr>\n",
1476 |        "  </tbody>\n",
1477 |        "</table>\n",
1478 |        "</div>"
1479 |       ],
1480 |       "text/plain": [
1481 |        "        model  train_time   test_time     mape\n",
1482 |        "0  elasticnet  0.00777841  0.00189042  52.3209\n",
1483 |        "1         knn   0.0735118     4.14259  27.4117\n",
1484 |        "2         svm     10.1074     8.33936  38.2993\n",
1485 |        "3          rf     2.51328    0.104967  21.6745\n",
1486 |        "4          et     1.40114    0.105561  24.4212\n",
1487 |        "5         gbm     2.54485     0.37309  21.8775\n",
1488 |        "6    adaboost     27.7201     1.90376  26.7783"
1489 |       ]
1490 |      },
1491 |      "execution_count": 18,
1492 |      "metadata": {},
1493 |      "output_type": "execute_result"
1494 |     }
1495 |    ],
1496 |    "source": [
1497 |     "new_df = pd.read_csv('../data/f-Kansas_weather.csv')\n",
1498 |     "new_model_results = evaluate_models(new_df)\n",
1499 |     "new_model_results"
1500 |    ]
1501 |   },
1502 |   {
1503 |    "cell_type": "markdown",
1504 |    "metadata": {},
1505 |    "source": [
1506 |     "We want to avoid reading too much into the metrics at this point, but the Gradient Boosting Machine looks as if it may compete. The difficult part of determining the real \"best\" model is the influence of the hyperparameters. The best model/settings probably is different for every building and optimizing all the models would require an extensive search procedure. For now we will have to be content with selecting a set of hyperparameters and using those for the model on each building. "
1507 |    ]
1508 |   },
1509 |   {
1510 |    "cell_type": "markdown",
1511 |    "metadata": {},
1512 |    "source": [
1513 |     "# Conclusions\n",
1514 |     "\n",
1515 |     "The [gradient boosting machine](http://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/slides/gradient_boosting.pdf) is a powerful boosting ensemble machine learning model. While it has many hyperparameters to tune, the high performance more than makes up for the development time. In this notebook, we walked through am implementation of the gradient boosting machine for the building energy prediction task. The model uses cross validation to find the optimal number of base learners and is built using a sensible set of additional hyperparameters. We stepped through the process of building the model, tuning the number of iterations using cross validation, fitting the optimal model, making predictions with the model, and evaluating the model predictions. Preliminary results from two buildings suggest the Gradient Boosting Machine is competitive with the bagging, tree-based ensemble models in Scikit-Learn although more data is needed to draw significant conclusions.\n",
1516 |     "\n",
1517 |     "Furthermore, in this notebook we saw how we can use feature importances to try and understand how the model makes predictions. Model interpretation will be a focus of a later notebook, but this is one technique we may be able to employ ([Local-Interpretable Model Agnostic Explanations (LIME)](https://www.oreilly.com/learning/introduction-to-local-interpretable-model-agnostic-explanations-lime) is another tool). The next step is to evaluate all of the models developed on hundreds of buildings to gather enough data to make significant conclusions. After evaluation, the best model can be selected for optimization through hyperparameter tuning. Then, we can make efforts to try and explain the predictions of the model. It's one thing to know a machine learning model can make accurate predictions but another problem entirely to understand how the model makes predictions. I will see you in the next notebook!"
1518 |    ]
1519 |   },
1520 |   {
1521 |    "cell_type": "code",
1522 |    "execution_count": null,
1523 |    "metadata": {},
1524 |    "outputs": [],
1525 |    "source": []
1526 |   }
1527 |  ],
1528 |  "metadata": {
1529 |   "kernelspec": {
1530 |    "display_name": "Python 3",
1531 |    "language": "python",
1532 |    "name": "python3"
1533 |   },
1534 |   "language_info": {
1535 |    "codemirror_mode": {
1536 |     "name": "ipython",
1537 |     "version": 3
1538 |    },
1539 |    "file_extension": ".py",
1540 |    "mimetype": "text/x-python",
1541 |    "name": "python",
1542 |    "nbconvert_exporter": "python",
1543 |    "pygments_lexer": "ipython3",
1544 |    "version": "3.6.5"
1545 |   }
1546 |  },
1547 |  "nbformat": 4,
1548 |  "nbformat_minor": 2
1549 | }
1550 | 


--------------------------------------------------------------------------------
/Modeling.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |  "cells": [
   3 |   {
   4 |    "cell_type": "markdown",
   5 |    "metadata": {},
   6 |    "source": [
   7 |     "# Introduction: Model Evaluation\n",
   8 |     "\n",
   9 |     "According to the [No Free Lunch Theorem](https://en.wikipedia.org/wiki/No_free_lunch_theorem) ([source 1](https://www.mitpressjournals.org/doi/abs/10.1162/neco.1996.8.7.1341), [source 2](https://ti.arc.nasa.gov/m/profile/dhw/papers/78.pdf)), no algorithm will beat every other algorithm across all learning tasks. What this means for machine learning is there is no single superior algorithm for every dataset. Therefore, to find the best model for a given task, we have to evaluate multiple models and determine which one achieves the highest generalization performance on the test set. Machine learning is a largely empirical field, and finding the best model remains almost entirely a process of experimentation.\n",
  10 |     "\n",
  11 |     "With that in mind, in this notebook we will implement and evaluate several machine learning models for the building energy prediction task. This is a supervised regression problem where we are asked to build a model to predict energy consumption based on historical energy usage and weather conditions. The models will be tested on the final 6 months of data with the entire preceding data used for training. [Scikit-Learn](http://scikit-learn.org/stable/index.html) will be used to implement the models in this notebook which allows us to use the same syntax for building a diverse array of algorithms. We will show an example of building a machine learning model and then write a function that can train and evaluate any model. Here we will evaluate six models for two datasets, but the function will be applied to hundreds of buildings to determine which model performs the best on average for this task. The model with the best test set performance can then be further developed through the process of hyperparameter tuning. \n",
  12 |     "\n",
  13 |     "## Metric: Mean Absolute Percentage Error\n",
  14 |     "\n",
  15 |     "In order to compare models, we need a single metric to assess performance. There are many choices for regression, including the [root mean squared error](https://en.wikipedia.org/wiki/Root-mean-square_deviation) (RMSE). RMSE is conveniently measured in the same units as the target, but suffers from the fact that it depends on the magnitude of the targets and therefore cannot be compared between buildings that have different levels of energy use. To compare performance across buildings, we need a metric that normalizes the error by the magnitude of the target. The mean absolute percentage error (MAPE) is one such metric. This performance measure has been used in previous electricity prediction studies ([here](https://www.sciencedirect.com/science/article/pii/S0306261914003596) and [here](https://www.sciencedirect.com/science/article/pii/S0378778812001582)), and can be used to compare predictions between buildings because it measures the error normalized by the magnitude of the energy usage. The MAPE is calculated as\n",
  16 |     "\n",
  17 |     "$$\\mbox{MAPE} = 100\\% * \\frac{1}{n}\\sum_{i=1}^n  \\left|\\frac{y_i-\\hat{y}_i}{y_i}\\right|$$\n",
  18 |     "\n",
  19 |     "where $n$ is the number of observations, $y_i$ is the actual prediction for observation $i$, and $\\hat{y}_i$ is the predicted value. The MAPE represents the average percentage error for predictions. In addition to the MAPE, we can calculate the time it takes for the model to learn. Although time is not a significant consideration when massive computing power is available (through the [CWRU HPC](https://sites.google.com/a/case.edu/hpcc/), it is still an interesting comparison to make. \n",
  20 |     "\n",
  21 |     "## Models to Evaluate\n",
  22 |     "\n",
  23 |     "We will evaluate the following models in this notebook. \n",
  24 |     " \n",
  25 |     "1. Elastic Net Linear Regression with `l1_ratio = 0.5`\n",
  26 |     "2. K-Nearest Neighbors Regression with `k = 10`\n",
  27 |     "3. Support Vector Machine Regression with Radial Basis Function Kernel\n",
  28 |     "4. Random Forest Regression with 100 decision trees \n",
  29 |     "5. Extra Trees Regression with 100 decision trees\n",
  30 |     "6. AdaBoost Regression with 1000 decision trees as the base learner\n",
  31 |     "\n",
  32 |     "Two other models with be evaluated in additional notebooks:\n",
  33 |     "\n",
  34 |     "1. Gradient Boosting Machine\n",
  35 |     "2. Deep Fully Connected Neural Networks\n",
  36 |     "\n",
  37 |     "The six models developed here can all be implemented in Scikit-Learn. We will focus on using the models rather than the theory behind them, but additional resources will be linked to for those interested in learning more."
  38 |    ]
  39 |   },
  40 |   {
  41 |    "cell_type": "markdown",
  42 |    "metadata": {},
  43 |    "source": [
  44 |     "### Imports\n",
  45 |     "\n",
  46 |     "We will use a standard stack of data science libraries: `pandas`, `numpy`, `sklearn`, `matplotlib`. See the `requirements.txt` file for the correct version of these libraries to install. "
  47 |    ]
  48 |   },
  49 |   {
  50 |    "cell_type": "code",
  51 |    "execution_count": 1,
  52 |    "metadata": {},
  53 |    "outputs": [],
  54 |    "source": [
  55 |     "# numpy and pandas for data manipulation\n",
  56 |     "import pandas as pd\n",
  57 |     "import numpy as np\n",
  58 |     "\n",
  59 |     "# Sklearn preprocessing functionality\n",
  60 |     "from sklearn.preprocessing import LabelEncoder, MinMaxScaler\n",
  61 |     "\n",
  62 |     "# Matplotlib for visualizations\n",
  63 |     "import matplotlib.pyplot as plt\n",
  64 |     "\n",
  65 |     "# Adjust default font size \n",
  66 |     "plt.rcParams['font.size'] = 18"
  67 |    ]
  68 |   },
  69 |   {
  70 |    "cell_type": "markdown",
  71 |    "metadata": {},
  72 |    "source": [
  73 |     "### Read in Dataset and Apply Feature Preprocessing\n",
  74 |     "\n",
  75 |     "Let's read in an example dataset and preprocess the features using the function we developed earlier. This function takes in a building energy dataframe, a number of days to use for testing, and a boolean for whether or not to scale the features. We will use 183 days (6 months) for testing and will choose to scale the features because we are using models that depend on the distance between observations to make predictions. [Feature scaling](https://en.wikipedia.org/wiki/Feature_scaling) is a best practice when comparing multiple algorithms so that the range of the features does not impact model performance. "
  76 |    ]
  77 |   },
  78 |   {
  79 |    "cell_type": "code",
  80 |    "execution_count": 2,
  81 |    "metadata": {},
  82 |    "outputs": [
  83 |     {
  84 |      "data": {
  85 |       "text/html": [
  86 |        "<div>\n",
  87 |        "<style scoped>\n",
  88 |        "    .dataframe tbody tr th:only-of-type {\n",
  89 |        "        vertical-align: middle;\n",
  90 |        "    }\n",
  91 |        "\n",
  92 |        "    .dataframe tbody tr th {\n",
  93 |        "        vertical-align: top;\n",
  94 |        "    }\n",
  95 |        "\n",
  96 |        "    .dataframe thead th {\n",
  97 |        "        text-align: right;\n",
  98 |        "    }\n",
  99 |        "</style>\n",
 100 |        "<table border=\"1\" class=\"dataframe\">\n",
 101 |        "  <thead>\n",
 102 |        "    <tr style=\"text-align: right;\">\n",
 103 |        "      <th></th>\n",
 104 |        "      <th>timestamp</th>\n",
 105 |        "      <th>biz_day</th>\n",
 106 |        "      <th>week_day_end</th>\n",
 107 |        "      <th>ghi</th>\n",
 108 |        "      <th>dif</th>\n",
 109 |        "      <th>gti</th>\n",
 110 |        "      <th>temp</th>\n",
 111 |        "      <th>rh</th>\n",
 112 |        "      <th>pwat</th>\n",
 113 |        "      <th>ws</th>\n",
 114 |        "      <th>...</th>\n",
 115 |        "      <th>yday_cos</th>\n",
 116 |        "      <th>month_sin</th>\n",
 117 |        "      <th>month_cos</th>\n",
 118 |        "      <th>wday_sin</th>\n",
 119 |        "      <th>wday_cos</th>\n",
 120 |        "      <th>num_time_sin</th>\n",
 121 |        "      <th>num_time_cos</th>\n",
 122 |        "      <th>sun_rise_set_neither</th>\n",
 123 |        "      <th>sun_rise_set_rise</th>\n",
 124 |        "      <th>sun_rise_set_set</th>\n",
 125 |        "    </tr>\n",
 126 |        "  </thead>\n",
 127 |        "  <tbody>\n",
 128 |        "    <tr>\n",
 129 |        "      <th>0</th>\n",
 130 |        "      <td>0.000000</td>\n",
 131 |        "      <td>1.0</td>\n",
 132 |        "      <td>0.0</td>\n",
 133 |        "      <td>0.0</td>\n",
 134 |        "      <td>0.0</td>\n",
 135 |        "      <td>0.0</td>\n",
 136 |        "      <td>0.394397</td>\n",
 137 |        "      <td>0.245303</td>\n",
 138 |        "      <td>0.069231</td>\n",
 139 |        "      <td>0.295082</td>\n",
 140 |        "      <td>...</td>\n",
 141 |        "      <td>0.629749</td>\n",
 142 |        "      <td>0.066987</td>\n",
 143 |        "      <td>0.75</td>\n",
 144 |        "      <td>1.0</td>\n",
 145 |        "      <td>0.25</td>\n",
 146 |        "      <td>0.500000</td>\n",
 147 |        "      <td>1.000000</td>\n",
 148 |        "      <td>1.0</td>\n",
 149 |        "      <td>0.0</td>\n",
 150 |        "      <td>0.0</td>\n",
 151 |        "    </tr>\n",
 152 |        "    <tr>\n",
 153 |        "      <th>1</th>\n",
 154 |        "      <td>0.000011</td>\n",
 155 |        "      <td>1.0</td>\n",
 156 |        "      <td>0.0</td>\n",
 157 |        "      <td>0.0</td>\n",
 158 |        "      <td>0.0</td>\n",
 159 |        "      <td>0.0</td>\n",
 160 |        "      <td>0.390086</td>\n",
 161 |        "      <td>0.250522</td>\n",
 162 |        "      <td>0.067949</td>\n",
 163 |        "      <td>0.295082</td>\n",
 164 |        "      <td>...</td>\n",
 165 |        "      <td>0.629749</td>\n",
 166 |        "      <td>0.066987</td>\n",
 167 |        "      <td>0.75</td>\n",
 168 |        "      <td>1.0</td>\n",
 169 |        "      <td>0.25</td>\n",
 170 |        "      <td>0.533050</td>\n",
 171 |        "      <td>0.998907</td>\n",
 172 |        "      <td>1.0</td>\n",
 173 |        "      <td>0.0</td>\n",
 174 |        "      <td>0.0</td>\n",
 175 |        "    </tr>\n",
 176 |        "    <tr>\n",
 177 |        "      <th>2</th>\n",
 178 |        "      <td>0.000022</td>\n",
 179 |        "      <td>1.0</td>\n",
 180 |        "      <td>0.0</td>\n",
 181 |        "      <td>0.0</td>\n",
 182 |        "      <td>0.0</td>\n",
 183 |        "      <td>0.0</td>\n",
 184 |        "      <td>0.387931</td>\n",
 185 |        "      <td>0.256785</td>\n",
 186 |        "      <td>0.067949</td>\n",
 187 |        "      <td>0.295082</td>\n",
 188 |        "      <td>...</td>\n",
 189 |        "      <td>0.629749</td>\n",
 190 |        "      <td>0.066987</td>\n",
 191 |        "      <td>0.75</td>\n",
 192 |        "      <td>1.0</td>\n",
 193 |        "      <td>0.25</td>\n",
 194 |        "      <td>0.565955</td>\n",
 195 |        "      <td>0.995631</td>\n",
 196 |        "      <td>1.0</td>\n",
 197 |        "      <td>0.0</td>\n",
 198 |        "      <td>0.0</td>\n",
 199 |        "    </tr>\n",
 200 |        "    <tr>\n",
 201 |        "      <th>3</th>\n",
 202 |        "      <td>0.000034</td>\n",
 203 |        "      <td>1.0</td>\n",
 204 |        "      <td>0.0</td>\n",
 205 |        "      <td>0.0</td>\n",
 206 |        "      <td>0.0</td>\n",
 207 |        "      <td>0.0</td>\n",
 208 |        "      <td>0.383621</td>\n",
 209 |        "      <td>0.262004</td>\n",
 210 |        "      <td>0.067949</td>\n",
 211 |        "      <td>0.303279</td>\n",
 212 |        "      <td>...</td>\n",
 213 |        "      <td>0.629749</td>\n",
 214 |        "      <td>0.066987</td>\n",
 215 |        "      <td>0.75</td>\n",
 216 |        "      <td>1.0</td>\n",
 217 |        "      <td>0.25</td>\n",
 218 |        "      <td>0.598572</td>\n",
 219 |        "      <td>0.990187</td>\n",
 220 |        "      <td>1.0</td>\n",
 221 |        "      <td>0.0</td>\n",
 222 |        "      <td>0.0</td>\n",
 223 |        "    </tr>\n",
 224 |        "    <tr>\n",
 225 |        "      <th>4</th>\n",
 226 |        "      <td>0.000045</td>\n",
 227 |        "      <td>1.0</td>\n",
 228 |        "      <td>0.0</td>\n",
 229 |        "      <td>0.0</td>\n",
 230 |        "      <td>0.0</td>\n",
 231 |        "      <td>0.0</td>\n",
 232 |        "      <td>0.379310</td>\n",
 233 |        "      <td>0.268267</td>\n",
 234 |        "      <td>0.067949</td>\n",
 235 |        "      <td>0.303279</td>\n",
 236 |        "      <td>...</td>\n",
 237 |        "      <td>0.629749</td>\n",
 238 |        "      <td>0.066987</td>\n",
 239 |        "      <td>0.75</td>\n",
 240 |        "      <td>1.0</td>\n",
 241 |        "      <td>0.25</td>\n",
 242 |        "      <td>0.630758</td>\n",
 243 |        "      <td>0.982600</td>\n",
 244 |        "      <td>1.0</td>\n",
 245 |        "      <td>0.0</td>\n",
 246 |        "      <td>0.0</td>\n",
 247 |        "    </tr>\n",
 248 |        "  </tbody>\n",
 249 |        "</table>\n",
 250 |        "<p>5 rows × 21 columns</p>\n",
 251 |        "</div>"
 252 |       ],
 253 |       "text/plain": [
 254 |        "   timestamp  biz_day  week_day_end  ghi  dif  gti      temp        rh  \\\n",
 255 |        "0   0.000000      1.0           0.0  0.0  0.0  0.0  0.394397  0.245303   \n",
 256 |        "1   0.000011      1.0           0.0  0.0  0.0  0.0  0.390086  0.250522   \n",
 257 |        "2   0.000022      1.0           0.0  0.0  0.0  0.0  0.387931  0.256785   \n",
 258 |        "3   0.000034      1.0           0.0  0.0  0.0  0.0  0.383621  0.262004   \n",
 259 |        "4   0.000045      1.0           0.0  0.0  0.0  0.0  0.379310  0.268267   \n",
 260 |        "\n",
 261 |        "       pwat        ws        ...         yday_cos  month_sin  month_cos  \\\n",
 262 |        "0  0.069231  0.295082        ...         0.629749   0.066987       0.75   \n",
 263 |        "1  0.067949  0.295082        ...         0.629749   0.066987       0.75   \n",
 264 |        "2  0.067949  0.295082        ...         0.629749   0.066987       0.75   \n",
 265 |        "3  0.067949  0.303279        ...         0.629749   0.066987       0.75   \n",
 266 |        "4  0.067949  0.303279        ...         0.629749   0.066987       0.75   \n",
 267 |        "\n",
 268 |        "   wday_sin  wday_cos  num_time_sin  num_time_cos  sun_rise_set_neither  \\\n",
 269 |        "0       1.0      0.25      0.500000      1.000000                   1.0   \n",
 270 |        "1       1.0      0.25      0.533050      0.998907                   1.0   \n",
 271 |        "2       1.0      0.25      0.565955      0.995631                   1.0   \n",
 272 |        "3       1.0      0.25      0.598572      0.990187                   1.0   \n",
 273 |        "4       1.0      0.25      0.630758      0.982600                   1.0   \n",
 274 |        "\n",
 275 |        "   sun_rise_set_rise  sun_rise_set_set  \n",
 276 |        "0                0.0               0.0  \n",
 277 |        "1                0.0               0.0  \n",
 278 |        "2                0.0               0.0  \n",
 279 |        "3                0.0               0.0  \n",
 280 |        "4                0.0               0.0  \n",
 281 |        "\n",
 282 |        "[5 rows x 21 columns]"
 283 |       ]
 284 |      },
 285 |      "execution_count": 2,
 286 |      "metadata": {},
 287 |      "output_type": "execute_result"
 288 |     }
 289 |    ],
 290 |    "source": [
 291 |     "# Import the feature preprocessing\n",
 292 |     "from utilities import preprocess_data\n",
 293 |     "\n",
 294 |     "df = pd.read_csv('../data/f-APS_weather.csv')\n",
 295 |     "\n",
 296 |     "# Preprocess the data for machine learning\n",
 297 |     "train, train_targets, test, test_targets = preprocess_data(df, test_days = 183, scale = True)\n",
 298 |     "\n",
 299 |     "train.head()"
 300 |    ]
 301 |   },
 302 |   {
 303 |    "cell_type": "markdown",
 304 |    "metadata": {},
 305 |    "source": [
 306 |     "The data is ready for machine learning. All of the values are numeric, the features have been scaled to between 0 and 1, and there are no missing values. "
 307 |    ]
 308 |   },
 309 |   {
 310 |    "cell_type": "markdown",
 311 |    "metadata": {},
 312 |    "source": [
 313 |     "## Machine Learning in Scikit-Learn\n",
 314 |     "\n",
 315 |     "For each model, we will calculate three stats: the mean absolute percentage error, the time to train the model, and the time to make predictions. MAPE is calculated from the definition of the metric using the predicted values and the known test targets. The training and testing times are determined using the `default_timer` class from the [`timeit` module](https://docs.python.org/2/library/timeit.html). The `default_timer` [automatically adjusts for operating system](https://stackoverflow.com/questions/7370801/measure-time-elapsed-in-python) and provides the best timer accordingly. We will measure the wall-clock time, which is the total time to execute a program. This [differs from the CPU time](https://www.pythoncentral.io/measure-time-in-python-time-time-vs-time-clock/) (also called execution time) which measures the time a CPU spent executing a specific program. Since we are interested in the time out of curiosity rather than for model selection, the minor distinctions in time calculation are not a significant concern. \n",
 316 |     "\n",
 317 |     "All of the models in this notebook will be built using Scikit-Learn. This open-source Python library benefits from a consistent syntax for implementing models which makes it extremely simple to quickly implement many models. There are three steps to using a machine learning model in Scikit-Learn:\n",
 318 |     "\n",
 319 |     "1. Instantiate the model, specifying the hyperparameters\n",
 320 |     "2. `fit`: train the model on the training data\n",
 321 |     "3. `predict`: make predictions on the testing data\n",
 322 |     "\n",
 323 |     "These same three steps apply to every model in Scikit-Learn. \n",
 324 |     "\n",
 325 |     "## Model Hyperparameters\n",
 326 |     "\n",
 327 |     "Model hyperparameters can best be thought of as settings for a machine learnig model. [In contrast to model parameters](https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/), which are learned by the model during training, model hyperparameters are set by the developer before training. As an example, the number of trees in a random forest is a model hyperparameter while the weights learned during linear regression are model parameters. The choice of hyperparameters can have a significant effect both on the model's training and on the test performance. Much like choosing the best model is an experimental process, optimizing the hyperparameters of a selected model requires evaluating many different combinations to find the one that performs the best. This can be an extremely time-intensive and computationally expensive process. \n",
 328 |     "\n",
 329 |     "For this notebook, we will stick with the Scikit-Learn hyperparameters for the most part except where noted. [Scikit-Learn aims](https://arxiv.org/abs/1309.0238) to provide a set of reasonable default hyperparameters designed to get practicioners up and running with a decent model quickly. However, these hyperparameters are likely to be nowhere near optimal, and tuning them is recommended after a working system is built. If we had unlimited time, we would try out not only multiple models, but also work on optimizing the hyperparameters of each model for the problem. For now, we will stick with the default hyperparameters for comparing models and then optimize the hyperparameters of the model that performs the best. We will note that this might not be a fair comparison because of the model dependence on hyperparameters, but it is a limitation we have to work with. (Note that for models where it is applicable, we set `n_jobs=-1` to use all available cores on the machine.)\n",
 330 |     "\n",
 331 |     "### Approach \n",
 332 |     "\n",
 333 |     "In this notebook, we focus on implementing the algorithms rather than explaining how they work. Two extremely good resources for learning the theory of these models are [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/), and [Hands-On Machine Learning in Python with Scikit-Learn and TensorFlow](http://shop.oreilly.com/product/0636920052289.do). Both of these books provide the theory as well as how to implement the methods in R and Python respectively. Machine learning is only mastered through doing, and both these books place a large emphasis on practicing the techniques described. With those considerations in mind, let's get to modeling! \n",
 334 |     "\n",
 335 |     "For the first model, we will walk through the modeling steps individually, and then we will refactor the steps into a function that can be re-used for any model. "
 336 |    ]
 337 |   },
 338 |   {
 339 |    "cell_type": "markdown",
 340 |    "metadata": {},
 341 |    "source": [
 342 |     "## ElasticNet Linear Regression\n",
 343 |     "\n",
 344 |     "[Regularization and Variable Selection via the Elastic Net](https://web.stanford.edu/~hastie/Papers/elasticnet.pdf)\n",
 345 |     "\n",
 346 |     "ElasticNet Linear Regression is a regularized version of Linear Regression. It is a mix of two regularization methods: Lasso Regression and Ridge Regression. The blend of these two methods is controlled by the `l1_ratio` hyperparameter with the overall amount of regularization controlled by the `alpha` hyperparameter. We will use the default values of these hyperparameters which are `l1_ratio = 0.5` and `alpha = 1.0`. Unlike ordinary least squares (OLS) linear regression, which has an analytical solution, ElasticNet must be solved by coordinate descent. The equation for the parameter matrix $\\beta$ in ElasticNet is:\n",
 347 |     "\n",
 348 |     "$$\\hat{\\beta} = \\underset{\\beta}{\\operatorname{argmin}} (\\| y-X \\beta \\|^2 + \\alpha \\lambda_1 \\|\\beta\\|_1 + \\alpha (1 - \\lambda_1) \\|\\beta\\|_2^2)$$\n",
 349 |     "\n",
 350 |     "where $\\lambda_1$ multiplies the L1 norm of the parameter matrix and $1 - \\lambda_1$ multiplies the L2 norm of the parameter matrix. $\\alpha$ controls the overall amount of regularization, and if $\\alpha = 0$, then ElasticNet simplifies to ordinary least squares regression with an objective function of Mean Squared Error (MSE). "
 351 |    ]
 352 |   },
 353 |   {
 354 |    "cell_type": "code",
 355 |    "execution_count": 3,
 356 |    "metadata": {},
 357 |    "outputs": [],
 358 |    "source": [
 359 |     "from sklearn.linear_model import ElasticNet\n",
 360 |     "\n",
 361 |     "# Timing utilities\n",
 362 |     "from timeit import default_timer as timer"
 363 |    ]
 364 |   },
 365 |   {
 366 |    "cell_type": "code",
 367 |    "execution_count": 4,
 368 |    "metadata": {},
 369 |    "outputs": [
 370 |     {
 371 |      "data": {
 372 |       "text/plain": [
 373 |        "ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,\n",
 374 |        "      max_iter=1000, normalize=False, positive=False, precompute=False,\n",
 375 |        "      random_state=None, selection='cyclic', tol=0.0001, warm_start=False)"
 376 |       ]
 377 |      },
 378 |      "execution_count": 4,
 379 |      "metadata": {},
 380 |      "output_type": "execute_result"
 381 |     }
 382 |    ],
 383 |    "source": [
 384 |     "# Set up the model with default hyperparameters\n",
 385 |     "model = ElasticNet(alpha = 1.0, l1_ratio=0.5)\n",
 386 |     "model"
 387 |    ]
 388 |   },
 389 |   {
 390 |    "cell_type": "markdown",
 391 |    "metadata": {},
 392 |    "source": [
 393 |     "### Training \n",
 394 |     "\n",
 395 |     "Training in Scikit-Learn is done using the `fit` method (function) of a model. This takes in the training features and the targets. "
 396 |    ]
 397 |   },
 398 |   {
 399 |    "cell_type": "code",
 400 |    "execution_count": 5,
 401 |    "metadata": {},
 402 |    "outputs": [],
 403 |    "source": [
 404 |     "# Start the timer\n",
 405 |     "train_start = timer()\n",
 406 |     "\n",
 407 |     "# Train the model\n",
 408 |     "model.fit(train, train_targets)\n",
 409 |     "\n",
 410 |     "# Stop the timer\n",
 411 |     "train_end = timer()\n",
 412 |     "\n",
 413 |     "train_time = train_end - train_start"
 414 |    ]
 415 |   },
 416 |   {
 417 |    "cell_type": "markdown",
 418 |    "metadata": {},
 419 |    "source": [
 420 |     "### Testing \n",
 421 |     "\n",
 422 |     "Making predictions in Scikit-Learn using the `predict` method of a model. It takes in only the testing features."
 423 |    ]
 424 |   },
 425 |   {
 426 |    "cell_type": "code",
 427 |    "execution_count": 6,
 428 |    "metadata": {},
 429 |    "outputs": [],
 430 |    "source": [
 431 |     "# Start the time\n",
 432 |     "test_start = timer()\n",
 433 |     "\n",
 434 |     "# Make predictions on testing data\n",
 435 |     "predictions = model.predict(test)\n",
 436 |     "\n",
 437 |     "# Stop the time\n",
 438 |     "test_end = timer()\n",
 439 |     "\n",
 440 |     "test_time = test_end - test_start\n",
 441 |     "\n",
 442 |     "# Calculate the mape\n",
 443 |     "mape = 100 * np.mean( abs(predictions - test_targets) / test_targets)"
 444 |    ]
 445 |   },
 446 |   {
 447 |    "cell_type": "code",
 448 |    "execution_count": 7,
 449 |    "metadata": {},
 450 |    "outputs": [
 451 |     {
 452 |      "name": "stdout",
 453 |      "output_type": "stream",
 454 |      "text": [
 455 |       "Training time (seconds): 0.0936\n",
 456 |       "Prediction time (seconds): 0.0048\n",
 457 |       "Testing MAPE: 56.97\n"
 458 |      ]
 459 |     }
 460 |    ],
 461 |    "source": [
 462 |     "print(\"Training time (seconds): %0.4f\" % train_time)\n",
 463 |     "print(\"Prediction time (seconds): %0.4f\" % test_time)\n",
 464 |     "print(\"Testing MAPE: %0.2f\" % mape)"
 465 |    ]
 466 |   },
 467 |   {
 468 |    "cell_type": "markdown",
 469 |    "metadata": {},
 470 |    "source": [
 471 |     "The three results can be recorded in a numpy array along with the name of the model. When we get results from multiple models, we can stack the results and then record them in a dataframe. "
 472 |    ]
 473 |   },
 474 |   {
 475 |    "cell_type": "code",
 476 |    "execution_count": 8,
 477 |    "metadata": {},
 478 |    "outputs": [
 479 |     {
 480 |      "data": {
 481 |       "text/plain": [
 482 |        "array(['ElasticNet', '0.09360281483673485', '0.004777918515813356',\n",
 483 |        "       '56.973741837828975'], dtype='<U20')"
 484 |       ]
 485 |      },
 486 |      "execution_count": 8,
 487 |      "metadata": {},
 488 |      "output_type": "execute_result"
 489 |     }
 490 |    ],
 491 |    "source": [
 492 |     "# Record the results in a numpy array\n",
 493 |     "results = np.array(['ElasticNet', train_time, test_time, mape])\n",
 494 |     "\n",
 495 |     "results"
 496 |    ]
 497 |   },
 498 |   {
 499 |    "cell_type": "markdown",
 500 |    "metadata": {},
 501 |    "source": [
 502 |     "# Modeling Function\n",
 503 |     "\n",
 504 |     "Let's take the individual steps and put them into a single function. The function will operate the exact same for each model allowing for an efficient standardized workflow. This function takes in a model, training features, training labels, testing features, and testing labels, and returns the 3 numeric results along with the name of the model in a numpy array. The function will be able to train and evaluate any Scikit-Learn supervised regression model. "
 505 |    ]
 506 |   },
 507 |   {
 508 |    "cell_type": "code",
 509 |    "execution_count": 9,
 510 |    "metadata": {},
 511 |    "outputs": [],
 512 |    "source": [
 513 |     "def implement_model(model, train, training_targets, test, testing_targets, model_name):\n",
 514 |     "    \"\"\"Train a machine learning model and make predictions on a test set\n",
 515 |     "    \n",
 516 |     "    Parameters\n",
 517 |     "    --------\n",
 518 |     "    model : Scikit-Learn model object\n",
 519 |     "        Model to use for training and making predictions\n",
 520 |     "    \n",
 521 |     "    train : dataframe, shape = [n_training_samples, n_features]\n",
 522 |     "        Set of training features for training a model\n",
 523 |     "    \n",
 524 |     "    train_targets : array, shape = [n_training_samples]\n",
 525 |     "        Array of training targets for training a model\n",
 526 |     "        \n",
 527 |     "    test : dataframe, shape = [n_testing_samples, n_features]\n",
 528 |     "        Set of testing features for making predictions with a model\n",
 529 |     "    \n",
 530 |     "    test_targets : array, shape = [n_testing_samples]\n",
 531 |     "        Array of testing targets for evaluating the model predictions\n",
 532 |     "        \n",
 533 |     "    model_name : string\n",
 534 |     "        Name of the model used for returning results\n",
 535 |     "        \n",
 536 |     "    Returns\n",
 537 |     "    --------\n",
 538 |     "    \n",
 539 |     "    results : array, shape = [4]\n",
 540 |     "        Numpy array of results. \n",
 541 |     "        First entry is the model, second is the training time,\n",
 542 |     "        third is the testing time, and fourth is the MAPE. All entries\n",
 543 |     "        are in strings and so will need to be converted to numbers.\n",
 544 |     "    \n",
 545 |     "    \"\"\"\n",
 546 |     "    \n",
 547 |     "    # Preprocess the data for machine learning\n",
 548 |     "    train, train_targets, test, test_targets = preprocess_data(df, test_days = 183, scale = True)\n",
 549 |     "    \n",
 550 |     "    train_start = timer()\n",
 551 |     "    \n",
 552 |     "    # Start the timer\n",
 553 |     "    train_start = timer()\n",
 554 |     "\n",
 555 |     "    # Train the model\n",
 556 |     "    model.fit(train, train_targets)\n",
 557 |     "\n",
 558 |     "    # Calculate training time\n",
 559 |     "    train_end = timer()\n",
 560 |     "    train_time = train_end - train_start\n",
 561 |     "\n",
 562 |     "    # Start test timer\n",
 563 |     "    test_start = timer()\n",
 564 |     "\n",
 565 |     "    # Make predictions\n",
 566 |     "    predictions = model.predict(test)\n",
 567 |     "\n",
 568 |     "    # Calculate testing time\n",
 569 |     "    test_end = timer()\n",
 570 |     "    test_time = test_end - test_start\n",
 571 |     "\n",
 572 |     "    # Calculate the mape\n",
 573 |     "    mape = 100 * np.mean( abs(predictions - test_targets) / test_targets)\n",
 574 |     "\n",
 575 |     "    # Record the results\n",
 576 |     "    results = [model_name, train_time, test_time, mape]\n",
 577 |     "    \n",
 578 |     "    return results\n"
 579 |    ]
 580 |   },
 581 |   {
 582 |    "cell_type": "code",
 583 |    "execution_count": 10,
 584 |    "metadata": {},
 585 |    "outputs": [
 586 |     {
 587 |      "data": {
 588 |       "text/plain": [
 589 |        "['elasticnet', 0.08054083317867355, 0.0008706562336163737, 56.973741837828975]"
 590 |       ]
 591 |      },
 592 |      "execution_count": 10,
 593 |      "metadata": {},
 594 |      "output_type": "execute_result"
 595 |     }
 596 |    ],
 597 |    "source": [
 598 |     "elasticnet_results = implement_model(model, train, train_targets, test, \n",
 599 |     "                                     test_targets, model_name = 'elasticnet')\n",
 600 |     "elasticnet_results"
 601 |    ]
 602 |   },
 603 |   {
 604 |    "cell_type": "markdown",
 605 |    "metadata": {},
 606 |    "source": [
 607 |     "Now we can use this function for any Scikit-Learn supervised regression machine learning model. We will go through the rest of the models and use the function to evaluate each one."
 608 |    ]
 609 |   },
 610 |   {
 611 |    "cell_type": "markdown",
 612 |    "metadata": {},
 613 |    "source": [
 614 |     "## K-Nearest Neighbors Regression\n",
 615 |     "\n",
 616 |     "[K-Nearest Neighbors](https://link.springer.com/chapter/10.1007/978-3-642-38652-7_2)\n",
 617 |     "\n",
 618 |     "K-Nearest Neighbors is a non-parameteric method that makes predictions for a new observation based on the K nearest observations as determined by a distance measure. For this implementation, the distance measure is the L2 norm or [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) (the Minkowski measure with p = 2). There is no actual learning done during the training phase of K-Nearest Neighbors because the work of calculating the nearest neighbors and making an estimation is done at testing time (this is an example of a [\"lazy learner\"](https://en.wikipedia.org/wiki/Lazy_learning)). K-Nearest Neighbors is extremely sensitive to the `n_neighbors` hyperparameter which we have set at 10 (up from the default 5). "
 619 |    ]
 620 |   },
 621 |   {
 622 |    "cell_type": "code",
 623 |    "execution_count": 11,
 624 |    "metadata": {},
 625 |    "outputs": [
 626 |     {
 627 |      "data": {
 628 |       "text/plain": [
 629 |        "KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',\n",
 630 |        "          metric_params=None, n_jobs=1, n_neighbors=10, p=2,\n",
 631 |        "          weights='uniform')"
 632 |       ]
 633 |      },
 634 |      "execution_count": 11,
 635 |      "metadata": {},
 636 |      "output_type": "execute_result"
 637 |     }
 638 |    ],
 639 |    "source": [
 640 |     "from sklearn.neighbors import KNeighborsRegressor\n",
 641 |     "\n",
 642 |     "model = KNeighborsRegressor(n_neighbors = 10)\n",
 643 |     "model"
 644 |    ]
 645 |   },
 646 |   {
 647 |    "cell_type": "code",
 648 |    "execution_count": 12,
 649 |    "metadata": {},
 650 |    "outputs": [
 651 |     {
 652 |      "data": {
 653 |       "text/plain": [
 654 |        "['KNN', 23.473687023886598, 3.4355423598204986, 23.656239736972875]"
 655 |       ]
 656 |      },
 657 |      "execution_count": 12,
 658 |      "metadata": {},
 659 |      "output_type": "execute_result"
 660 |     }
 661 |    ],
 662 |    "source": [
 663 |     "# Call the model function\n",
 664 |     "knn_results = implement_model(model, train, train_targets, test, \n",
 665 |     "                              test_targets, model_name = 'KNN')\n",
 666 |     "knn_results"
 667 |    ]
 668 |   },
 669 |   {
 670 |    "cell_type": "markdown",
 671 |    "metadata": {},
 672 |    "source": [
 673 |     "## Support Vector Machine\n",
 674 |     "\n",
 675 |     "[Support Vector Machine](https://ieeexplore.ieee.org/abstract/document/708428/)\n",
 676 |     "\n",
 677 |     "The support vector machine works by transforming the data to a new high-dimensional feature space using a kernel. This procedure, called the kernel trick, is designed to make the original data, which may not be linearly separable, linearly separable in the high-dimensional feature space. The support vector machine is highly sensitive to a number of hyperparameters: the `kernel`, `gamma`, `C` (the error term), and `epsilon`. For this implementation, we will use the default hyperparameters in Scikit-Learn which are:\n",
 678 |     "\n",
 679 |     "* `kernel=rbf`: Gaussian Radial Basis Function kernel\n",
 680 |     "* `gamma=auto`: Gamma is equal to $\\frac{1}{\\text{n_features}}$\n",
 681 |     "* `C=1.0`: penalty parameter of the error term\n",
 682 |     "* `epsilon=0.1`: the epsilon tube within which no penalty is associated in the training loss"
 683 |    ]
 684 |   },
 685 |   {
 686 |    "cell_type": "code",
 687 |    "execution_count": 13,
 688 |    "metadata": {},
 689 |    "outputs": [],
 690 |    "source": [
 691 |     "from sklearn.svm import SVR\n",
 692 |     "\n",
 693 |     "model = SVR(C=)\n",
 694 |     "model"
 695 |    ]
 696 |   },
 697 |   {
 698 |    "cell_type": "code",
 699 |    "execution_count": 15,
 700 |    "metadata": {},
 701 |    "outputs": [
 702 |     {
 703 |      "data": {
 704 |       "text/plain": [
 705 |        "['svm', 531.2380162893111, 59.102470593970565, 24.826293891805097]"
 706 |       ]
 707 |      },
 708 |      "execution_count": 15,
 709 |      "metadata": {},
 710 |      "output_type": "execute_result"
 711 |     }
 712 |    ],
 713 |    "source": [
 714 |     "svm_results = implement_model(model, train, train_targets, test, \n",
 715 |     "                              test_targets, model_name = 'svm')\n",
 716 |     "svm_results"
 717 |    ]
 718 |   },
 719 |   {
 720 |    "cell_type": "markdown",
 721 |    "metadata": {},
 722 |    "source": [
 723 |     "## Random Forest\n",
 724 |     "\n",
 725 |     "[Random Forest](https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf)\n",
 726 |     "\n",
 727 |     "The random forest is an ensemble method which makes one powerful estimator out of a number of simpler models, in this case individual decision tree regressors. The random forest is an example of a [bagging (bootstrap aggregating) ensemble method](https://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/) in which all of the individual learners are trained independently and then predictions are made by averaging the predictions of the individuals. This allows for efficient training because the indidual decision tree regressors can be trained in parallel.\n",
 728 |     "\n",
 729 |     "The random forest is fast to train, can be interpreted via the model feature importances, is very accurate, and has only one main hyperparameter to adjust, the number of trees in the forst. Other hyperparameters can be used to control the amount of overfitting by modifying the individuals in the forest. A random forest has less variance than a single decision tree which is very prone to overfitting. The \"random\" in the name comes from the fact that the model only trains on a subsample of the training examples (chosen with replacement called \"bootstrapping\") and only on a subset of the features. Both of these behaviors can be controlled through the model hyperparameters.\n",
 730 |     "\n",
 731 |     "For this case, we will increase the number of trees to 100 (`n_estimators`) and leave the rest of the model hyperparameters at the default values."
 732 |    ]
 733 |   },
 734 |   {
 735 |    "cell_type": "code",
 736 |    "execution_count": 16,
 737 |    "metadata": {},
 738 |    "outputs": [
 739 |     {
 740 |      "data": {
 741 |       "text/plain": [
 742 |        "RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,\n",
 743 |        "           max_features='auto', max_leaf_nodes=None,\n",
 744 |        "           min_impurity_decrease=0.0, min_impurity_split=None,\n",
 745 |        "           min_samples_leaf=1, min_samples_split=2,\n",
 746 |        "           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,\n",
 747 |        "           oob_score=False, random_state=None, verbose=0, warm_start=False)"
 748 |       ]
 749 |      },
 750 |      "execution_count": 16,
 751 |      "metadata": {},
 752 |      "output_type": "execute_result"
 753 |     }
 754 |    ],
 755 |    "source": [
 756 |     "from sklearn.ensemble import RandomForestRegressor\n",
 757 |     "\n",
 758 |     "model = RandomForestRegressor(n_estimators = 100, n_jobs = -1)\n",
 759 |     "model"
 760 |    ]
 761 |   },
 762 |   {
 763 |    "cell_type": "code",
 764 |    "execution_count": 17,
 765 |    "metadata": {},
 766 |    "outputs": [
 767 |     {
 768 |      "data": {
 769 |       "text/plain": [
 770 |        "['rf', 18.59618313886108, 0.20558733802420193, 15.739369033234297]"
 771 |       ]
 772 |      },
 773 |      "execution_count": 17,
 774 |      "metadata": {},
 775 |      "output_type": "execute_result"
 776 |     }
 777 |    ],
 778 |    "source": [
 779 |     "rf_results = implement_model(model, train, train_targets, test, \n",
 780 |     "                             test_targets, model_name='rf')\n",
 781 |     "rf_results"
 782 |    ]
 783 |   },
 784 |   {
 785 |    "cell_type": "markdown",
 786 |    "metadata": {},
 787 |    "source": [
 788 |     "## ExtraTrees Regressor\n",
 789 |     "\n",
 790 |     "[Extremely Randomized Trees](https://link.springer.com/article/10.1007/s10994-006-6226-1)\n",
 791 |     "\n",
 792 |     "The extra trees regressor is another bagging ensemble method. As with the random forest, the individual predictors are decision tree regressors and the ensemble makes a prediction by averaging the predictions of all the individuals. The difference between the extra trees model and a random forest is that the node splits are made on a random value of the feature in the extra trees ensemble. In other words, the values for splits of nodes are chosen randomly rather than by trying out all possible values of a feature as in a random forest. The \"extra\" means \"extra random\" in reference to this behavior. An additional difference is that the training observations are not sampled for extra trees (although this behavior can be changed through the [`bootstrap` hyperparameter in the function call](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html)). \n",
 793 |     "\n",
 794 |     "The extra trees model can be slightly faster than the random forest for training because the splits are chosen randomly instead of evaluating all splits for the optimal split. The idea behind extra trees is that the model will have lower variance than the random forest which can improve generalization performance on the test set. We will use 100 trees (`n_estimators`) and keep the other hyperparameters at the defaults. "
 795 |    ]
 796 |   },
 797 |   {
 798 |    "cell_type": "code",
 799 |    "execution_count": 18,
 800 |    "metadata": {},
 801 |    "outputs": [
 802 |     {
 803 |      "data": {
 804 |       "text/plain": [
 805 |        "ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=None,\n",
 806 |        "          max_features='auto', max_leaf_nodes=None,\n",
 807 |        "          min_impurity_decrease=0.0, min_impurity_split=None,\n",
 808 |        "          min_samples_leaf=1, min_samples_split=2,\n",
 809 |        "          min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,\n",
 810 |        "          oob_score=False, random_state=None, verbose=0, warm_start=False)"
 811 |       ]
 812 |      },
 813 |      "execution_count": 18,
 814 |      "metadata": {},
 815 |      "output_type": "execute_result"
 816 |     }
 817 |    ],
 818 |    "source": [
 819 |     "from sklearn.ensemble import ExtraTreesRegressor\n",
 820 |     "\n",
 821 |     "model = ExtraTreesRegressor(n_estimators=100, n_jobs = -1)\n",
 822 |     "model"
 823 |    ]
 824 |   },
 825 |   {
 826 |    "cell_type": "code",
 827 |    "execution_count": 19,
 828 |    "metadata": {},
 829 |    "outputs": [
 830 |     {
 831 |      "data": {
 832 |       "text/plain": [
 833 |        "['et', 9.403137355856302, 0.09805060240341845, 18.385035628942195]"
 834 |       ]
 835 |      },
 836 |      "execution_count": 19,
 837 |      "metadata": {},
 838 |      "output_type": "execute_result"
 839 |     }
 840 |    ],
 841 |    "source": [
 842 |     "et_results = implement_model(model, train, train_targets, test, \n",
 843 |     "                             test_targets, model_name = 'et')\n",
 844 |     "et_results"
 845 |    ]
 846 |   },
 847 |   {
 848 |    "cell_type": "markdown",
 849 |    "metadata": {},
 850 |    "source": [
 851 |     "## AdaBoost\n",
 852 |     "\n",
 853 |     "[Explaining AdaBoost](https://link.springer.com/chapter/10.1007/978-3-642-41136-6_5)\n",
 854 |     "\n",
 855 |     "The AdaBoost - standing for Adaptive Boosting - algorithm is another ensemble method. However, unlike the random forest and extra trees, it is a [boosting ensemble](https://machinelearningmastery.com/boosting-and-adaboost-for-machine-learning/) and not a bagging ensemble. The difference is that the individual learners are not trained independently, but rather sequentially, with each learner learning from the mistakes of the previous. In AdaBoost, this takes the form of re-weighting the observations the previous learner got most wrong when training the next. AdaBoost trains on a random subset of the training examples, and the training examples with greater weight are more likely to be included in the training set. This makes the examples that are more difficult - missed more often - more likely to appear in the training set. Moreover, while bagging methods simply average the predictions of each individual, boosting methods weight the predictions of each individual based on error rates. Better learners are given exponentially more weight in the final prediction. \n",
 856 |     "\n",
 857 |     "Boosting methods can be built on any individual learner, but the most common choice is the decision tree. These decision trees are kept very small (sometimes decision trees with only one level are used which are called decision stumps) and on their own are weak learners. Weak learners perform better than random guessing, but are not very accurate. However, by adding weak learners to the ensemble sequentially, the overall model is a strong classifier. Boosting is a general method that can be applied to any weak learner and AdaBoost was one of the first successful implementations. In later work, we will look at another boosting method called Gradient Boosting which is now considered more capable than AdaBoost. We will set the following hyperparameters in the AdaBoost ensemble:\n",
 858 |     "\n",
 859 |     "* `n_estimators=1000`: number of decision trees. This is set high because the individual learners are weak\n",
 860 |     "* `learning_rate=0.05`: the contribution of each learner to the ensemble. There is a tradeoff between `n_estimators` and the `learning_rate` with a lower learning rate complementing a higher number of weak learners.\n",
 861 |     "\n",
 862 |     "The other hyperparameters will be set at the default values. The default base learner for AdaBoost is the `DecisionTreeRegressor`. AdaBoost cannot be trained in parallel which means we cannot set the `n_jobs` training argument."
 863 |    ]
 864 |   },
 865 |   {
 866 |    "cell_type": "code",
 867 |    "execution_count": 20,
 868 |    "metadata": {},
 869 |    "outputs": [
 870 |     {
 871 |      "data": {
 872 |       "text/plain": [
 873 |        "AdaBoostRegressor(base_estimator=None, learning_rate=0.05, loss='linear',\n",
 874 |        "         n_estimators=1000, random_state=None)"
 875 |       ]
 876 |      },
 877 |      "execution_count": 20,
 878 |      "metadata": {},
 879 |      "output_type": "execute_result"
 880 |     }
 881 |    ],
 882 |    "source": [
 883 |     "from sklearn.ensemble import AdaBoostRegressor\n",
 884 |     "\n",
 885 |     "model = AdaBoostRegressor(n_estimators = 1000, learning_rate = 0.05)\n",
 886 |     "model"
 887 |    ]
 888 |   },
 889 |   {
 890 |    "cell_type": "code",
 891 |    "execution_count": 21,
 892 |    "metadata": {},
 893 |    "outputs": [
 894 |     {
 895 |      "data": {
 896 |       "text/plain": [
 897 |        "['adaboost', 133.36641199829296, 0.9832505582719477, 36.654214548779]"
 898 |       ]
 899 |      },
 900 |      "execution_count": 21,
 901 |      "metadata": {},
 902 |      "output_type": "execute_result"
 903 |     }
 904 |    ],
 905 |    "source": [
 906 |     "adaboost_results = implement_model(model, train, train_targets, test, \n",
 907 |     "                                    test_targets, model_name = 'adaboost')\n",
 908 |     "adaboost_results"
 909 |    ]
 910 |   },
 911 |   {
 912 |    "cell_type": "markdown",
 913 |    "metadata": {},
 914 |    "source": [
 915 |     "# Implementing all Models\n",
 916 |     "\n",
 917 |     "We can use the function we developed to train and evaluate all of the machine learning models. This function will take in a building energy dataframe and return a dataframe of metrics for all the models developed here. The function will call the `implement_model` function once for each model.  "
 918 |    ]
 919 |   },
 920 |   {
 921 |    "cell_type": "code",
 922 |    "execution_count": 22,
 923 |    "metadata": {},
 924 |    "outputs": [],
 925 |    "source": [
 926 |     "def evaluate_models(df):\n",
 927 |     "    \"\"\"Evaluate scikit-learn machine learning models\n",
 928 |     "    on a building energy dataset. More models can be added\n",
 929 |     "    to the function as required. \n",
 930 |     "    \n",
 931 |     "    \n",
 932 |     "    Parameters\n",
 933 |     "    --------\n",
 934 |     "    df : dataframe\n",
 935 |     "        Building energy dataframe. Each row must have one observation\n",
 936 |     "        and the columns must contain the features. The dataframe\n",
 937 |     "        needs to have an \"elec_cons\" column to be used as targets. \n",
 938 |     "    \n",
 939 |     "    Return\n",
 940 |     "    --------\n",
 941 |     "    results : dataframe, shape = [n_models, 4]\n",
 942 |     "        Modeling metrics. A dataframe with columns:\n",
 943 |     "        model, train_time, test_time, mape. Used for comparing\n",
 944 |     "        models for a given building dataset\n",
 945 |     "        \n",
 946 |     "    \"\"\"\n",
 947 |     "    try:\n",
 948 |     "        # Preprocess the data for machine learning\n",
 949 |     "        train, train_targets, test, test_targets = preprocess_data(df, test_days = 183, scale = True)\n",
 950 |     "    except Exception as e:\n",
 951 |     "        print('Error processing data: ', e)\n",
 952 |     "        return\n",
 953 |     "        \n",
 954 |     "    # elasticnet\n",
 955 |     "    model = ElasticNet(alpha = 1.0, l1_ratio=0.5)\n",
 956 |     "    elasticnet_results = implement_model(model, train, train_targets, test, \n",
 957 |     "                                         test_targets, model_name = 'elasticnet')\n",
 958 |     "    \n",
 959 |     "    # knn\n",
 960 |     "    model = KNeighborsRegressor()\n",
 961 |     "    knn_results = implement_model(model, train, train_targets, test, \n",
 962 |     "                                  test_targets, model_name = 'knn')\n",
 963 |     "    \n",
 964 |     "    # svm\n",
 965 |     "    model = SVR()\n",
 966 |     "    svm_results = implement_model(model, train, train_targets, test, \n",
 967 |     "                                   test_targets, model_name = 'svm')\n",
 968 |     "    \n",
 969 |     "    # rf\n",
 970 |     "    model = RandomForestRegressor(n_estimators = 100, n_jobs = -1)\n",
 971 |     "    rf_results = implement_model(model, train, train_targets, test, \n",
 972 |     "                                  test_targets, model_name = 'rf')\n",
 973 |     "    \n",
 974 |     "    # et\n",
 975 |     "    model = ExtraTreesRegressor(n_estimators=100, n_jobs = -1)\n",
 976 |     "    et_results = implement_model(model, train, train_targets, test, \n",
 977 |     "                                  test_targets, model_name = 'et')\n",
 978 |     "    \n",
 979 |     "    # adaboost\n",
 980 |     "    model = AdaBoostRegressor(n_estimators = 1000, learning_rate = 0.05, \n",
 981 |     "                              loss = 'exponential')\n",
 982 |     "    adaboost_results = implement_model(model, train, train_targets, test, \n",
 983 |     "                                       test_targets, model_name = 'adaboost')\n",
 984 |     "    \n",
 985 |     "    # Put the results into a single array (stack the rows)\n",
 986 |     "    results = np.vstack((elasticnet_results, knn_results, svm_results,\n",
 987 |     "                         rf_results, et_results, adaboost_results))\n",
 988 |     "    \n",
 989 |     "    # Convert the results to a dataframe\n",
 990 |     "    results = pd.DataFrame(results, columns = ['model', 'train_time', 'test_time', 'mape'])\n",
 991 |     "    \n",
 992 |     "    # Convert the numeric results to numbers\n",
 993 |     "    results.iloc[:, 1:] = results.iloc[:, 1:].astype(np.float32)\n",
 994 |     "    \n",
 995 |     "    return results"
 996 |    ]
 997 |   },
 998 |   {
 999 |    "cell_type": "markdown",
1000 |    "metadata": {},
1001 |    "source": [
1002 |     "#### Test the Function\n",
1003 |     "\n",
1004 |     "Here we will test the function on two datasets. Later we will want to run the function on hundreds of buildings to get an accurate measure of the performance of the models."
1005 |    ]
1006 |   },
1007 |   {
1008 |    "cell_type": "code",
1009 |    "execution_count": 23,
1010 |    "metadata": {},
1011 |    "outputs": [
1012 |     {
1013 |      "data": {
1014 |       "text/html": [
1015 |        "<div>\n",
1016 |        "<style scoped>\n",
1017 |        "    .dataframe tbody tr th:only-of-type {\n",
1018 |        "        vertical-align: middle;\n",
1019 |        "    }\n",
1020 |        "\n",
1021 |        "    .dataframe tbody tr th {\n",
1022 |        "        vertical-align: top;\n",
1023 |        "    }\n",
1024 |        "\n",
1025 |        "    .dataframe thead th {\n",
1026 |        "        text-align: right;\n",
1027 |        "    }\n",
1028 |        "</style>\n",
1029 |        "<table border=\"1\" class=\"dataframe\">\n",
1030 |        "  <thead>\n",
1031 |        "    <tr style=\"text-align: right;\">\n",
1032 |        "      <th></th>\n",
1033 |        "      <th>model</th>\n",
1034 |        "      <th>train_time</th>\n",
1035 |        "      <th>test_time</th>\n",
1036 |        "      <th>mape</th>\n",
1037 |        "    </tr>\n",
1038 |        "  </thead>\n",
1039 |        "  <tbody>\n",
1040 |        "    <tr>\n",
1041 |        "      <th>0</th>\n",
1042 |        "      <td>elasticnet</td>\n",
1043 |        "      <td>0.0661613</td>\n",
1044 |        "      <td>0.0013945</td>\n",
1045 |        "      <td>56.9737</td>\n",
1046 |        "    </tr>\n",
1047 |        "    <tr>\n",
1048 |        "      <th>1</th>\n",
1049 |        "      <td>knn</td>\n",
1050 |        "      <td>20.9769</td>\n",
1051 |        "      <td>2.96457</td>\n",
1052 |        "      <td>23.7799</td>\n",
1053 |        "    </tr>\n",
1054 |        "    <tr>\n",
1055 |        "      <th>2</th>\n",
1056 |        "      <td>svm</td>\n",
1057 |        "      <td>525.235</td>\n",
1058 |        "      <td>57.757</td>\n",
1059 |        "      <td>24.8263</td>\n",
1060 |        "    </tr>\n",
1061 |        "    <tr>\n",
1062 |        "      <th>3</th>\n",
1063 |        "      <td>rf</td>\n",
1064 |        "      <td>18.9462</td>\n",
1065 |        "      <td>0.104385</td>\n",
1066 |        "      <td>15.7012</td>\n",
1067 |        "    </tr>\n",
1068 |        "    <tr>\n",
1069 |        "      <th>4</th>\n",
1070 |        "      <td>et</td>\n",
1071 |        "      <td>10.7384</td>\n",
1072 |        "      <td>0.211137</td>\n",
1073 |        "      <td>18.9027</td>\n",
1074 |        "    </tr>\n",
1075 |        "    <tr>\n",
1076 |        "      <th>5</th>\n",
1077 |        "      <td>adaboost</td>\n",
1078 |        "      <td>283.145</td>\n",
1079 |        "      <td>2.09613</td>\n",
1080 |        "      <td>39.5748</td>\n",
1081 |        "    </tr>\n",
1082 |        "  </tbody>\n",
1083 |        "</table>\n",
1084 |        "</div>"
1085 |       ],
1086 |       "text/plain": [
1087 |        "        model train_time  test_time     mape\n",
1088 |        "0  elasticnet  0.0661613  0.0013945  56.9737\n",
1089 |        "1         knn    20.9769    2.96457  23.7799\n",
1090 |        "2         svm    525.235     57.757  24.8263\n",
1091 |        "3          rf    18.9462   0.104385  15.7012\n",
1092 |        "4          et    10.7384   0.211137  18.9027\n",
1093 |        "5    adaboost    283.145    2.09613  39.5748"
1094 |       ]
1095 |      },
1096 |      "execution_count": 23,
1097 |      "metadata": {},
1098 |      "output_type": "execute_result"
1099 |     }
1100 |    ],
1101 |    "source": [
1102 |     "results = evaluate_models(df)\n",
1103 |     "results"
1104 |    ]
1105 |   },
1106 |   {
1107 |    "cell_type": "code",
1108 |    "execution_count": 24,
1109 |    "metadata": {},
1110 |    "outputs": [],
1111 |    "source": [
1112 |     "# Write results to disk\n",
1113 |     "results.to_csv('../data/APS_modeling_results.csv', index = False)"
1114 |    ]
1115 |   },
1116 |   {
1117 |    "cell_type": "code",
1118 |    "execution_count": 25,
1119 |    "metadata": {},
1120 |    "outputs": [
1121 |     {
1122 |      "data": {
1123 |       "text/html": [
1124 |        "<div>\n",
1125 |        "<style scoped>\n",
1126 |        "    .dataframe tbody tr th:only-of-type {\n",
1127 |        "        vertical-align: middle;\n",
1128 |        "    }\n",
1129 |        "\n",
1130 |        "    .dataframe tbody tr th {\n",
1131 |        "        vertical-align: top;\n",
1132 |        "    }\n",
1133 |        "\n",
1134 |        "    .dataframe thead th {\n",
1135 |        "        text-align: right;\n",
1136 |        "    }\n",
1137 |        "</style>\n",
1138 |        "<table border=\"1\" class=\"dataframe\">\n",
1139 |        "  <thead>\n",
1140 |        "    <tr style=\"text-align: right;\">\n",
1141 |        "      <th></th>\n",
1142 |        "      <th>model</th>\n",
1143 |        "      <th>train_time</th>\n",
1144 |        "      <th>test_time</th>\n",
1145 |        "      <th>mape</th>\n",
1146 |        "    </tr>\n",
1147 |        "  </thead>\n",
1148 |        "  <tbody>\n",
1149 |        "    <tr>\n",
1150 |        "      <th>0</th>\n",
1151 |        "      <td>elasticnet</td>\n",
1152 |        "      <td>0.0684406</td>\n",
1153 |        "      <td>0.00101819</td>\n",
1154 |        "      <td>56.9737</td>\n",
1155 |        "    </tr>\n",
1156 |        "    <tr>\n",
1157 |        "      <th>1</th>\n",
1158 |        "      <td>knn</td>\n",
1159 |        "      <td>21.5241</td>\n",
1160 |        "      <td>2.95038</td>\n",
1161 |        "      <td>23.7799</td>\n",
1162 |        "    </tr>\n",
1163 |        "    <tr>\n",
1164 |        "      <th>2</th>\n",
1165 |        "      <td>svm</td>\n",
1166 |        "      <td>528.563</td>\n",
1167 |        "      <td>58.6768</td>\n",
1168 |        "      <td>24.8263</td>\n",
1169 |        "    </tr>\n",
1170 |        "    <tr>\n",
1171 |        "      <th>3</th>\n",
1172 |        "      <td>rf</td>\n",
1173 |        "      <td>19.6616</td>\n",
1174 |        "      <td>0.104487</td>\n",
1175 |        "      <td>16.041</td>\n",
1176 |        "    </tr>\n",
1177 |        "    <tr>\n",
1178 |        "      <th>4</th>\n",
1179 |        "      <td>et</td>\n",
1180 |        "      <td>10.9921</td>\n",
1181 |        "      <td>0.104274</td>\n",
1182 |        "      <td>18.3991</td>\n",
1183 |        "    </tr>\n",
1184 |        "    <tr>\n",
1185 |        "      <th>5</th>\n",
1186 |        "      <td>adaboost</td>\n",
1187 |        "      <td>284.092</td>\n",
1188 |        "      <td>2.09143</td>\n",
1189 |        "      <td>39.2332</td>\n",
1190 |        "    </tr>\n",
1191 |        "  </tbody>\n",
1192 |        "</table>\n",
1193 |        "</div>"
1194 |       ],
1195 |       "text/plain": [
1196 |        "        model train_time   test_time     mape\n",
1197 |        "0  elasticnet  0.0684406  0.00101819  56.9737\n",
1198 |        "1         knn    21.5241     2.95038  23.7799\n",
1199 |        "2         svm    528.563     58.6768  24.8263\n",
1200 |        "3          rf    19.6616    0.104487   16.041\n",
1201 |        "4          et    10.9921    0.104274  18.3991\n",
1202 |        "5    adaboost    284.092     2.09143  39.2332"
1203 |       ]
1204 |      },
1205 |      "execution_count": 25,
1206 |      "metadata": {},
1207 |      "output_type": "execute_result"
1208 |     }
1209 |    ],
1210 |    "source": [
1211 |     "df_new = pd.read_csv('../data/f-Kansas_weather.csv')\n",
1212 |     "\n",
1213 |     "# Evaluate models on another dataset\n",
1214 |     "results_new = evaluate_models(df_new)\n",
1215 |     "results_new"
1216 |    ]
1217 |   },
1218 |   {
1219 |    "cell_type": "code",
1220 |    "execution_count": 26,
1221 |    "metadata": {},
1222 |    "outputs": [],
1223 |    "source": [
1224 |     "results_new.to_csv('../data/Kansas_modeling_results.csv', index = False)"
1225 |    ]
1226 |   },
1227 |   {
1228 |    "cell_type": "markdown",
1229 |    "metadata": {},
1230 |    "source": [
1231 |     "From these preliminary results, it appears the random forest and extra trees regressor significantly outperform the competition. At this point it is too early to have any takeaways, and we will have to evaluate hundreds of buildings to make meaningful comparisons."
1232 |    ]
1233 |   },
1234 |   {
1235 |    "cell_type": "markdown",
1236 |    "metadata": {},
1237 |    "source": [
1238 |     "# Conclusions\n",
1239 |     "\n",
1240 |     "\n",
1241 |     "We can now use the `evaluate_models` function to evaluate the six models across hundreds of buildings. This process cannot be done in a notebook, but we will look at the results in a future notebook. The next step after evaluating all the models is to select the best performer for further development in the process known as hyperparameter optimization. I will see you in the next notebook! "
1242 |    ]
1243 |   },
1244 |   {
1245 |    "cell_type": "code",
1246 |    "execution_count": null,
1247 |    "metadata": {},
1248 |    "outputs": [],
1249 |    "source": []
1250 |   }
1251 |  ],
1252 |  "metadata": {
1253 |   "kernelspec": {
1254 |    "display_name": "Python 3",
1255 |    "language": "python",
1256 |    "name": "python3"
1257 |   },
1258 |   "language_info": {
1259 |    "codemirror_mode": {
1260 |     "name": "ipython",
1261 |     "version": 3
1262 |    },
1263 |    "file_extension": ".py",
1264 |    "mimetype": "text/x-python",
1265 |    "name": "python",
1266 |    "nbconvert_exporter": "python",
1267 |    "pygments_lexer": "ipython3",
1268 |    "version": "3.6.5"
1269 |   }
1270 |  },
1271 |  "nbformat": 4,
1272 |  "nbformat_minor": 2
1273 | }
1274 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # prediction-documentation
2 | Work related to predicting building energy consumption
3 | 


--------------------------------------------------------------------------------
/utilities.py:
--------------------------------------------------------------------------------
  1 | # numpy and pandas for data manipulation
  2 | import pandas as pd
  3 | import numpy as np
  4 | 
  5 | # Sklearn preprocessing functionality
  6 | from sklearn.preprocessing import LabelEncoder, MinMaxScaler
  7 | 
  8 | # Models from Sklearn
  9 | from sklearn.linear_model import ElasticNet
 10 | from sklearn.neighbors import KNeighborsRegressor
 11 | from sklearn.svm import SVR
 12 | from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor
 13 | 
 14 | # Timing utilities
 15 | from timeit import default_timer as timer
 16 | 
 17 | 
 18 | def preprocess_data(df, test_days = 183, scale = True):
 19 |     """Preprocess a building energy dataframe for machine learning models. 
 20 |     
 21 |     Parameters:
 22 |     ------
 23 |     
 24 |     df : dataframe
 25 |         Building energy dataframe with each row containing one observation
 26 |         and the columns holding the features. The dataframe must contain the 
 27 |         "elec_cons" column to be used as the target.
 28 |     
 29 |     test_days : integer (default = 183)
 30 |         Number of testing days used for splitting into training and testing sets.
 31 |         The most recent test_days will be in the testing set while the rest of the data
 32 |         will be used for training.
 33 |         
 34 |     scale : boolean (default = True)
 35 |         Indicator for whether or not the features should be scaled. If True, 
 36 |         the features are scaled the range of 0 to 1.
 37 |         
 38 |     Return:
 39 |     ______
 40 |     
 41 |     train : dataframe, shape = [n_training_samples, n_features]
 42 |         Set of training features for training a model
 43 |     
 44 |     train_targets : array, shape = [n_training_samples]
 45 |         Array of training targets for training a model
 46 |         
 47 |     test : dataframe, shape = [n_testing_samples, n_features]
 48 |         Set of testing features for making predictions with a model
 49 |     
 50 |     test_targets : array, shape = [n_testing_samples]
 51 |         Array of testing targets for evaluating the model predictions
 52 |         
 53 |     """
 54 |     
 55 |     # Fill in NaN values 
 56 |     df['sun_rise_set'] = df['sun_rise_set'].fillna('neither')
 57 |     
 58 |     # Convert to a datetime index
 59 |     df['timestamp'] = pd.DatetimeIndex(df['timestamp'])
 60 | 
 61 |     # Create new time features
 62 |     df['yday'] = df['timestamp'].dt.dayofyear
 63 |     df['month'] = df['timestamp'].dt.month
 64 |     df['wday'] = df['timestamp'].dt.dayofweek
 65 |     
 66 |     cyc_features = ['yday', 'month', 'wday', 'num_time'] 
 67 | 
 68 |     # Iterate through the variables
 69 |     for feature in cyc_features:
 70 |         df['%s_sin' % feature] = np.sin(2 * np.pi * df[feature] / df[feature].max())
 71 |         df['%s_cos' % feature] = np.cos(2 * np.pi * df[feature] / df[feature].max())
 72 |         
 73 |     # Remove the ordered time features
 74 |     df = df.drop(columns = ['yday', 'month', 'wday', 'num_time', 'day_of_week'])
 75 |     
 76 |     # Convert the timestamp to total seconds since beginning of data
 77 |     df['timestamp'] = (df['timestamp'] - df['timestamp'].min()).dt.total_seconds()
 78 |     
 79 |     label_encoder = LabelEncoder()
 80 | 
 81 |     # Label encode
 82 |     df['week_day_end'] = label_encoder.fit_transform(df['week_day_end'])
 83 | 
 84 |     # One hot encode
 85 |     df = pd.get_dummies(df)
 86 |     
 87 |     # Remove observations 0 or less
 88 |     df = df[df['elec_cons'] > 0]
 89 | 
 90 |     # Select the targets 
 91 |     targets = np.array(df['elec_cons']).reshape((-1, ))
 92 | 
 93 |     columns_remove = ['elec_cons', 'elec_cons_imp', 'pow_dem', 'anom_flag', 'anom_missed_flag', 'cleaned_energy', 'forecast']
 94 | 
 95 |     # Remove the columns only if present in dataframe
 96 |     df = df.drop(columns = [x for x in columns_remove if x in df.columns])
 97 |     
 98 |     index = 0
 99 |     frequency = 0
100 | 
101 |     # Check to make sure that timestamps are not repeated
102 |     while frequency < 1:
103 |         frequency = df['timestamp'][index + 1] - df['timestamp'][index]
104 | 
105 |         # Make sure to increment index
106 |         index += 1
107 | 
108 |     # Observations per day
109 |     daily_observations = (60 * 60 * 24) / frequency
110 | 
111 |     # Start of test period
112 |     test_start = int(len(df) - test_days * daily_observations)
113 |     
114 |     # Select the training and testing features
115 |     train, test = df.iloc[:test_start], df.iloc[test_start:]
116 | 
117 |     # Select the training and testing targets
118 |     train_targets, test_targets = targets[:test_start], targets[test_start:]
119 | 
120 |     
121 |     if scale:
122 |         # Create the scaler object with a specified range
123 |         scaler = MinMaxScaler(feature_range = (0, 1))
124 | 
125 |         # Fit on the training data
126 |         scaler.fit(train)
127 | 
128 |         # Transform both the training and testing data
129 |         train = scaler.transform(train)
130 |         test = scaler.transform(test)
131 | 
132 |         features = list(df.columns)
133 | 
134 |         # Convert back to dataframes
135 |         train = pd.DataFrame(train, columns=features)
136 |         test = pd.DataFrame(test, columns=features)
137 |     
138 |     # Check for missing values
139 |     assert ~np.any(train.isnull()), "Training Data Contains Missing Values!"
140 |     assert ~np.any(test.isnull()), "Testing Data Contains Missing Values!"
141 |     
142 |     return train, train_targets, test, test_targets
143 | 
144 | 
145 | def evaluate_models(df):
146 |     """Evaluate scikit-learn machine learning models
147 |     on a building energy dataset. More models can be added
148 |     to the function as required. 
149 |     
150 |     
151 |     Parameters
152 |     --------
153 |     df : dataframe
154 |         Building energy dataframe. Each row must have one observation
155 |         and the columns must contain the features. The dataframe
156 |         needs to have an "elec_cons" column to be used as targets. 
157 |     
158 |     Return
159 |     --------
160 |     results : dataframe, shape = [n_models, 4]
161 |         Modeling metrics. A dataframe with columns:
162 |         model, train_time, test_time, mape. Used for comparing
163 |         models for a given building dataset
164 |         
165 |     """
166 |     try:
167 |         # Preprocess the data for machine learning
168 |         train, train_targets, test, test_targets = preprocess_data(df, test_days = 183, scale = True)
169 |     except Exception as e:
170 |         print('Error processing data: ', e)
171 |         return
172 |         
173 |     # elasticnet
174 |     model = ElasticNet(alpha = 1.0, l1_ratio=0.5)
175 |     elasticnet_results = implement_model(model, train, train_targets, test, 
176 |                                          test_targets, model_name = 'elasticnet')
177 |     
178 |     # knn
179 |     model = KNeighborsRegressor()
180 |     knn_results = implement_model(model, train, train_targets, test, 
181 |                                   test_targets, model_name = 'knn')
182 |     
183 |     # svm
184 |     model = SVR()
185 |     svm_results = implement_model(model, train, train_targets, test, 
186 |                                    test_targets, model_name = 'svm')
187 |     
188 |     # rf
189 |     model = RandomForestRegressor(n_estimators = 100, n_jobs = -1)
190 |     rf_results = implement_model(model, train, train_targets, test, 
191 |                                   test_targets, model_name = 'rf')
192 |     
193 |     # et
194 |     model = ExtraTreesRegressor(n_estimators=100, n_jobs = -1)
195 |     et_results = implement_model(model, train, train_targets, test, 
196 |                                   test_targets, model_name = 'et')
197 |     
198 |     # adaboost
199 |     model = AdaBoostRegressor(n_estimators = 1000, learning_rate = 0.05, 
200 |                               loss = 'exponential')
201 |     adaboost_results = implement_model(model, train, train_targets, test, 
202 |                                        test_targets, model_name = 'adaboost')
203 |     
204 |     # Put the results into a single array (stack the rows)
205 |     results = np.vstack((elasticnet_results, knn_results, svm_results,
206 |                          rf_results, et_results, adaboost_results))
207 |     
208 |     # Convert the results to a dataframe
209 |     results = pd.DataFrame(results, columns = ['model', 'train_time', 'test_time', 'mape'])
210 |     
211 |     # Convert the numeric results to numbers
212 |     results.iloc[:, 1:] = results.iloc[:, 1:].astype(np.float32)
213 |     
214 |     return results


--------------------------------------------------------------------------------