├── README.md
└── lstm_meetup.ipynb

/README.md:
--------------------------------------------------------------------------------
# Meetup

Contains the notebooks and supporting materials David and I used to present at various meetups.

--------------------------------------------------------------------------------
/lstm_meetup.ipynb:
--------------------------------------------------------------------------------
# Practical Introduction to LSTM Networks
#### Peter Schneider (peter.schneider@soma-analytics.com) & David Haber (david@cognitir.com) - Deep Learning Meetup, 24 June 2016 / Berlin Machine Learning Meetup, 1 August 2016

## Motivation

### Limitations of Feed-Forward Neural Networks

- Feed-forward neural networks can't make use of time information, but when modeling time series such as audio signals, stock prices or sensor readings, time information is crucial to successful modeling.
- Sequences are an integral part of intelligence. Human intelligence relies heavily on sequential pattern recognition. Can you say the alphabet backwards?
- Recurrent Neural Networks offer a way to model time information by allowing cyclical connections.

### Prominent Applications of Recurrent Neural Networks

- Speech Processing/Recognition
  - Google applies RNNs in their email auto-response technology
  - Apple uses RNNs for speech recognition in Siri
- Finance
  - Recurrent nets have been successfully applied in automatic trading systems
- Music Composition
  - http://www.hexahedria.com/2015/08/03/composing-music-with-recurrent-neural-networks/
- Motor Control
- Biological Sequence Analysis
- Machine Translation
- Reinforcement Learning
- Meta Learning

### Backpropagation - A method to compute the gradient

* Gradient descent can help us to optimize the weights in a neural network.
* To perform learning we must first compute the gradient before we can update the weights and biases $\rightarrow$ this is what backpropagation does.
* It applies the chain rule to compute the updates layer by layer from the top to the bottom.
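As a reminder, the plain gradient-descent update that these gradients feed into is, with learning rate $\eta$:

$$ w \rightarrow w - \eta \frac{\partial C}{\partial w}, \qquad b \rightarrow b - \eta \frac{\partial C}{\partial b} $$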
Let's assume the following cost function for a single training example $x$:

$$ C = \frac{1}{2} \sum_{j} (y_j - a_j^L)^2 $$

with

$$ a^l = \sigma(z^l) $$

and

$$ z^l \equiv w^l a^{l-1} + b^l $$

and define

$$ \delta_j^l \equiv \frac{\partial C}{\partial z_j^l} $$

#### Error at Output Layer: $\delta^L_j$

Goal: Compute the gradient at the output layer

\begin{align}
\delta^L_j & = \frac{\partial C}{\partial z^L_{j}} \\
& = (a_j^L - y_j) \, \sigma'(z_j^L)
\end{align}

#### Recursive Computation of Error at Hidden Layers: $\delta_j^l$

Goal: Make $\delta^l$ a function of $\delta^{l+1}$ to recursively compute gradients in lower layers

Note that

\begin{align}
z^{l+1}_k & = \sum_{m} w_{km}^{l+1} a^l_m + b_k^{l+1} \\
& = \sum_{m} w_{km}^{l+1} \sigma(z_m^l) + b_k^{l+1}
\end{align}

So to compute the full gradient of $z_j^l$ we need to gather the gradients of all higher layers:

\begin{align}
\frac{\partial C}{\partial z^l_{j}} & = \sum_k \frac{\partial C}{\partial z^{l+1}_{k}} \frac{\partial z^{l+1}_{k}}{\partial z^l_{j}} \\
& = \sum_k \delta_k^{l+1} w_{kj}^{l+1} \sigma'(z_j^l)
\end{align}

The result is: $\delta_j^l = \sum_k \delta_k^{l+1} w_{kj}^{l+1} \sigma'(z_j^l)$

#### Weight Update

\begin{align}
\frac{\partial C}{\partial w^l_{jk}} & = \frac{\partial C}{\partial z^l_{j}} \frac{\partial z_j^l}{\partial w^l_{jk}} \\
&= \delta_j^l a_k^{l-1}
\end{align}

Note that $z^l_j = \sum_{k} w^l_{jk} a_k^{l-1} + b^l_j$

#### Bias Update

\begin{align}
\frac{\partial C}{\partial b^l_{j}} & = \frac{\partial C}{\partial z^l_{j}} \frac{\partial z_j^l}{\partial b^l_{j}} \\
&= \delta_j^l
\end{align}
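To make the four equations above concrete, here is a minimal NumPy sketch (not part of the original notebook) that applies them to a tiny two-layer sigmoid network and checks one weight derivative against a finite-difference estimate; the layer sizes and data are arbitrary placeholders.

```python
import numpy as np

np.random.seed(0)

def sigmoid(x):
    return 1. / (1. + np.exp(-x))

# Tiny network: 3 inputs -> 4 hidden -> 2 outputs (sizes chosen arbitrarily)
x = np.random.randn(3)
y = np.random.randn(2)
W1, b1 = np.random.randn(4, 3), np.random.randn(4)
W2, b2 = np.random.randn(2, 4), np.random.randn(2)

def forward(W1, b1, W2, b2):
    z1 = W1.dot(x) + b1
    a1 = sigmoid(z1)
    z2 = W2.dot(a1) + b2
    a2 = sigmoid(z2)
    return z1, a1, z2, a2

z1, a1, z2, a2 = forward(W1, b1, W2, b2)
cost = 0.5 * np.sum((y - a2) ** 2)

# delta at the output layer: (a^L - y) * sigma'(z^L)
delta2 = (a2 - y) * a2 * (1. - a2)
# recursive step: delta^l = (W^{l+1}^T delta^{l+1}) * sigma'(z^l)
delta1 = W2.T.dot(delta2) * a1 * (1. - a1)

# weight and bias gradients
dW2 = np.outer(delta2, a1)
db2 = delta2
dW1 = np.outer(delta1, x)
db1 = delta1

# finite-difference check of a single weight gradient
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
_, _, _, a2p = forward(W1p, b1, W2, b2)
numeric = (0.5 * np.sum((y - a2p) ** 2) - cost) / eps
print("analytic: %.6f  numeric: %.6f" % (dW1[0, 0], numeric))
```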
### Vanishing Gradient Problem - A heuristic approach

Suppose we have a network with 4 layers, a single unit per layer, and we use the sigmoid activation function $\sigma(x) = \frac{1}{1 + e^{-x}}$

![alternate text](single_layer_single_unit_nn.png)

Furthermore, suppose we want to compute the gradient for the bias of the first layer, $\frac{\partial C}{\partial b^1}$.

First it is important to note that $\sigma'(x) = \sigma(x)(1-\sigma(x))$ is always $\leq 0.25$:

```python
import numpy as np

from bokeh.plotting import figure
from bokeh.io import show, gridplot, output_notebook

def sigmoid(x):
    return 1. / (1. + np.exp(-x))

p = figure(title="Activation Functions", plot_width=600, plot_height=400)
x = np.linspace(-5, 5, 100)
y_s = np.multiply(sigmoid(x), (1. - sigmoid(x)))
y_t = 1. - np.tanh(x)**2
p.line(x, y_s, color='blue', legend='first derivative of sigmoid')
p.line(x, y_t, color='green', legend='first derivative of tanh')
output_notebook()
show(p)
```

* Recall that $\frac{\partial C}{\partial b^l_{j}} = \delta_j^l$, so by recursive substitution $\delta^l = \frac{\partial C}{\partial a^L} \sigma'(z^L) \prod^{L-1}_{i=l} w^{i+1} \sigma'(z^i)$
* If the weights are initialized as $w \sim \mathcal{N}(0,1)$ then most likely $|w| < 4$ and hence $\left|w \sigma'(z)\right| < 1$
* It follows that $\prod_{i=l}^{L-1} w^{i+1} \sigma'(z^i)$ will converge to zero for the lower layers of a deep neural network $\rightarrow$ the gradients in the lower layers tend to **vanish** (keep in mind: $\sigma'(x) \leq 0.25$).

### Exploding gradient problem

* If $\left\vert w \sigma'(z) \right\vert > 1$ then the product will explode.

### Summary

1. In general it is hard to control the vanishing and the exploding gradient problem, as we would have to maintain a narrow range of values for the inputs and the weights to keep the gradient in check.

2. The problem is especially severe with the sigmoid activation function (Glorot, Bengio: "Understanding the difficulty of training deep feedforward neural networks", 2010).

3. One could argue that using a tanh activation might be a better option, but keep in mind that the derivative of tanh goes to zero faster than that of the sigmoid (for $|x| \gtrsim 1.6$).
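A rough numerical illustration of this effect (not part of the original notebook; weights and pre-activations are simply sampled at random): multiplying the factors $|w\,\sigma'(z)|$ across layers shows how quickly the product shrinks with depth.

```python
import numpy as np

np.random.seed(1)

def sigmoid(x):
    return 1. / (1. + np.exp(-x))

for L in [2, 5, 10, 20]:
    prods = []
    for _ in range(1000):
        w = np.random.standard_normal(L)   # one weight per layer
        z = np.random.standard_normal(L)   # pre-activations, purely illustrative
        factors = np.abs(w) * sigmoid(z) * (1. - sigmoid(z))
        prods.append(np.prod(factors))
    print("depth %2d: median gradient factor %.2e" % (L, np.median(prods)))
```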
### Recurrent Neural Networks

* Feedforward neural networks don't possess an internal state, so once an input is processed the network "forgets" about it.

* In time series processing this is a drawback, as information contained in the time structure is lost.

* Recurrent Neural Networks (RNNs) try to bridge the gap by allowing cyclical connections $$ z^{l,t}_i = w'^l_{i} a^{l-1, t} + u'^l_{i} a^{l, t-1} + b^l_i $$ and $$ a^{l,t}_i = \sigma(z^{l,t}_i) $$

* Through the term $u'^l_{i} a^{l, t-1}$, (non-)linear time dependencies are captured.

#### Backpropagation Through Time

* Conventional BPTT (Williams and Zipser, 1992) is similar to normal backpropagation but unrolls the neural network over time:

* The time-dependent derivative is as follows:

\begin{equation}
\delta^{t,l}_h = \sigma'(z^{l,t}_h) \left( \sum_{k=1}^{K} \delta^{t,l+1}_k w_{hk}^{l+1} + \sum_{h'=1}^{H} \delta_{h'}^{t+1,l} w_{hh'} \right)
\end{equation}

* As the weights are the same for every time step, we have to sum over time to get the weight and bias derivatives:

\begin{equation}
\frac{\partial C}{\partial w^{l}_{i,j}} = \sum_{t=1}^{T} \frac{\partial C}{\partial z^{t,l}_j} \frac{\partial z^{t,l}_j}{\partial w^{l}_{i,j}} = \sum_{t=1}^{T} \delta_j^{t,l} a^{l-1,t}_i
\end{equation}

\begin{equation}
\frac{\partial C}{\partial b^{l}_{j}} = \sum_{t=1}^{T} \frac{\partial C}{\partial z^{t,l}_j} \frac{\partial z^{t,l}_j}{\partial b^{l}_{j}} = \sum_{t=1}^{T} \delta_j^{t,l}
\end{equation}

* Backpropagation through time exhibits the *same issues* with vanishing or exploding gradients (Hochreiter, 1991), even for a small number of time steps (short-term memory); therefore it is hard to capture long-range time dependencies (long-term memory).

* To circumvent this problem, Hochreiter and Schmidhuber (1997) modified recurrent units into long short-term memory (LSTM) units.
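As a quick illustration (not part of the original notebook), the forward recurrence above can be written in a few lines of NumPy; the layer sizes, weights and input sequence below are arbitrary placeholders.

```python
import numpy as np

np.random.seed(2)

def sigmoid(x):
    return 1. / (1. + np.exp(-x))

n_in, n_hidden, T = 3, 5, 10             # arbitrary sizes
W = np.random.randn(n_hidden, n_in)      # input-to-hidden weights   (w' above)
U = np.random.randn(n_hidden, n_hidden)  # hidden-to-hidden weights  (u' above)
b = np.zeros(n_hidden)

x = np.random.randn(T, n_in)             # a toy input sequence
a = np.zeros(n_hidden)                   # hidden activation a^{l, t-1}

for t in range(T):
    z = W.dot(x[t]) + U.dot(a) + b       # z^{l,t} = w' a^{l-1,t} + u' a^{l,t-1} + b
    a = sigmoid(z)                       # a^{l,t} = sigma(z^{l,t})

print(a)  # hidden state after seeing the whole sequence
```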
# Long Short Term Memory Networks (LSTMs)

- Learning to store information over extended time intervals (long-range time dependencies) via recurrent backpropagation is difficult.
- LSTMs enforce *constant* (neither exploding nor vanishing) error flow through the internal states of the LSTM units.
- Local in space and time! O(1) update complexity per time step and weight.
- An LSTM unit uses a set of gates to control what is entering and leaving the unit (input gate and output gate). Essentially, the unit learns when to keep, override or access the information stored in what we refer to as the *cell state*.

### LSTM Unit Structure

#### Cell State (Latent State)

The LSTM is structured so that it can remove or add information to the *cell state*. *Gates* regulate the information flow within the LSTM.

The flow of the *cell state* is shown in the following diagram:

#### Forget Gate

The *forget gate* applies a sigmoid function to the previous output $h_{t-1}$ and the input $x_t$. It outputs a number between 0 and 1 to decide which part of the previous *cell state* it wants to keep. Zero means "don't let anything through", one means "let everything through".

#### Input Gate

The *input gate* controls the flow of new information into the new *cell state*. A *sigmoid layer* regulates which values we will update and a *tanh layer* proposes values that could be added to the state.

#### Cell State Update

The update step of the *cell state* is simple. We multiply the old state $C_{t-1}$ by the output of the *forget gate* $f_t$ and add $i_t * \tilde{C}_t$:

#### Output Gate

As a last step, we generate the LSTM unit's output:

#### LSTM Modules
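For reference, the gate operations described above are usually written as the following update equations (notation follows the colah.github.io post listed under Sources):

\begin{align}
f_t & = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
i_t & = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
\tilde{C}_t & = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \\
C_t & = f_t * C_{t-1} + i_t * \tilde{C}_t \\
o_t & = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\
h_t & = o_t * \tanh(C_t)
\end{align}

where $*$ denotes element-wise multiplication and $[h_{t-1}, x_t]$ is the concatenation of the previous output and the current input.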
### LSTM Structure Variants

* Peephole connections (Gers & Schmidhuber, 2000)
* Gated Recurrent Units (Cho, et al., 2014)
* Depth Gated RNNs (Yao, et al., 2015)
* Clockwork RNNs (Koutnik, et al., 2014)
* Highway Networks (Srivastava, Greff, Schmidhuber, 2015)

See "LSTM: A search space odyssey" (Greff, et al., 2015) for a useful comparison of popular variants.

### Keras

Keras is a neural networks library that can run on either Theano or TensorFlow. Keras has been built for:

- Modularity
- Minimalism
- Easy extensibility
- Work with Python

### Keras Implementation of an LSTM

#### Import Libraries

```python
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.preprocessing import sequence

from pandas import read_csv
from sklearn.cross_validation import train_test_split
from scipy import stats
from pandas_datareader.data import DataReader

import datetime

import numpy as np
np.random.seed(25)

import matplotlib.pyplot as plt
%matplotlib inline

import time
```

#### Load Data

```python
start = datetime.datetime(1962, 1, 1)
end = datetime.datetime(2016, 1, 1)

DataReader("GE", 'yahoo', start, end).to_csv("ge.csv")
```

```python
timesteps = 5
slide = 3
prediction_steps = 1

# Load data
df = read_csv('ge.csv')

# series[0] is the open price on 1.1.1962
series = df['Open'].as_matrix()

# Standardize data
series = (series - np.mean(series)) / np.std(series)

# Window starting indices
a = np.arange(0, len(series) - (timesteps + prediction_steps), slide)

# Within-window offsets
b = np.arange(0, timesteps + prediction_steps)

# Create index mask: one row of consecutive indices per window
start_tiles = np.tile(a, (timesteps + prediction_steps, 1)).T
sum_tiles = np.tile(b, (len(start_tiles), 1))
index_mask = start_tiles + sum_tiles
x_mask = index_mask[:, :-prediction_steps]
y_mask = index_mask[:, -prediction_steps:]

# Build data set
x, y = series[x_mask], series[y_mask]

# Split data
split_idx = int(len(x) * 0.75)

x_train = x[:split_idx]
y_train = y[:split_idx]

x_test = x[split_idx:]
y_test = y[split_idx:]

# Need to add one more (features) dim to make batches
x_train = np.expand_dims(x_train, 2)
x_test = np.expand_dims(x_test, 2)

print x_train.shape
print y_train.shape
```
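To see what the index-mask construction produces, here is a small toy example (not part of the original notebook) using the integers 0–19 as a stand-in series, with the same `timesteps`, `slide` and `prediction_steps` settings:

```python
import numpy as np

timesteps, slide, prediction_steps = 5, 3, 1
series = np.arange(20)  # toy stand-in for the standardized price series

a = np.arange(0, len(series) - (timesteps + prediction_steps), slide)
b = np.arange(0, timesteps + prediction_steps)
index_mask = np.tile(a, (timesteps + prediction_steps, 1)).T + np.tile(b, (len(a), 1))

x_mask = index_mask[:, :-prediction_steps]
y_mask = index_mask[:, -prediction_steps:]

print(series[x_mask])  # each row: 5 consecutive inputs, windows shifted by 3
print(series[y_mask])  # each row: the value to predict one step ahead
```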
"collapsed": false 431 | }, 432 | "outputs": [], 433 | "source": [ 434 | "model = Sequential()\n", 435 | "model.add(LSTM(25, input_shape=(timesteps, 1), activation='sigmoid'))#, return_sequences=True))\n", 436 | "#model.add(LSTM(25, input_shape=(timesteps, 1), activation='sigmoid'))\n", 437 | "model.add(Dense(prediction_steps, activation='linear'))" 438 | ] 439 | }, 440 | { 441 | "cell_type": "markdown", 442 | "metadata": {}, 443 | "source": [ 444 | "#### Configure Learning Process" 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": null, 450 | "metadata": { 451 | "collapsed": false 452 | }, 453 | "outputs": [], 454 | "source": [ 455 | "batch_size = 128\n", 456 | "epochs = 15\n", 457 | "\n", 458 | "model.compile(loss='mse', optimizer='rmsprop')\n", 459 | "\n", 460 | "print model.summary()" 461 | ] 462 | }, 463 | { 464 | "cell_type": "markdown", 465 | "metadata": {}, 466 | "source": [ 467 | "#### Perform Training in Batches" 468 | ] 469 | }, 470 | { 471 | "cell_type": "code", 472 | "execution_count": null, 473 | "metadata": { 474 | "collapsed": false 475 | }, 476 | "outputs": [], 477 | "source": [ 478 | "hist = model.fit(x_train, y_train, batch_size=batch_size, nb_epoch=epochs, verbose=0)\n", 479 | "plt.figure(figsize=(35,10))\n", 480 | "plt.plot(np.arange(0, epochs), hist.history['loss'])\n", 481 | "plt.show()" 482 | ] 483 | }, 484 | { 485 | "cell_type": "markdown", 486 | "metadata": {}, 487 | "source": [ 488 | "#### Evaluate Performance" 489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": null, 494 | "metadata": { 495 | "collapsed": false 496 | }, 497 | "outputs": [], 498 | "source": [ 499 | "# Evaluate model\n", 500 | "y_pred = model.predict(x_test, batch_size=batch_size)\n", 501 | "\n", 502 | "for i in range(prediction_steps):\n", 503 | " rmse = np.sqrt(np.sum((y_test[:, i] - y_pred[:, i])**2))\n", 504 | " print \"LSTM RMSE %i: %.4f\" % (i, rmse)" 505 | ] 506 | }, 507 | { 508 | "cell_type": "markdown", 509 | "metadata": { 510 | "collapsed": true 511 | }, 512 | "source": [ 513 | "#### Visualize Performance" 514 | ] 515 | }, 516 | { 517 | "cell_type": "code", 518 | "execution_count": null, 519 | "metadata": { 520 | "collapsed": false 521 | }, 522 | "outputs": [], 523 | "source": [ 524 | "def chow(X, Y, breakpoint, alpha = 0.05):\n", 525 | " \"\"\"\n", 526 | " Performs a chow test.\n", 527 | " Split input matrix and output vector into two\n", 528 | " using specified breakpoint.\n", 529 | " X - independent variables matrix\n", 530 | " Y - dependent variable vector\n", 531 | " breakpoint - index to split.\n", 532 | " alpha is significance level for hypothesis test\n", 533 | " \"\"\"\n", 534 | " k = len(X[0])\n", 535 | " n = len(X)\n", 536 | "\n", 537 | " # Split into two datasets.\n", 538 | " X1 = X[:breakpoint][:]\n", 539 | " Y1 = Y[:breakpoint][:]\n", 540 | "\n", 541 | " #print X1.shape\n", 542 | " #print Y1.shape\n", 543 | "\n", 544 | " X2 = X[breakpoint:][:]\n", 545 | " Y2 = Y[breakpoint:][:]\n", 546 | " \n", 547 | " #print X2.shape\n", 548 | " #print Y2.shape\n", 549 | "\n", 550 | " # Perform separate three least squares.\n", 551 | " #allfit = lm.ols(X,Y)\n", 552 | " allfit = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X)), X.T), Y)\n", 553 | " #lowerfit = lm.ols(X1, Y1)\n", 554 | " lowerfit = np.dot(np.dot(np.linalg.inv(np.dot(X1.T, X1)), X1.T), Y1)\n", 555 | " #upperfit = lm.ols(X2, Y2)\n", 556 | " upperfit = np.dot(np.dot(np.linalg.inv(np.dot(X2.T, X2)), X2.T), Y2)\n", 557 | "\n", 558 | " RSS = np.sum(np.square(np.dot(X, allfit) - Y)) 
```python
def chow(X, Y, breakpoint, alpha=0.05):
    """
    Performs a Chow test.
    Splits the input matrix and output vector into two
    using the specified breakpoint.
    X - independent variables matrix
    Y - dependent variable vector
    breakpoint - index at which to split
    alpha - significance level for the hypothesis test
    """
    k = len(X[0])
    n = len(X)

    # Split into two datasets.
    X1 = X[:breakpoint][:]
    Y1 = Y[:breakpoint][:]

    X2 = X[breakpoint:][:]
    Y2 = Y[breakpoint:][:]

    # Perform three separate least-squares fits (pooled, lower, upper).
    allfit = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X)), X.T), Y)
    lowerfit = np.dot(np.dot(np.linalg.inv(np.dot(X1.T, X1)), X1.T), Y1)
    upperfit = np.dot(np.dot(np.linalg.inv(np.dot(X2.T, X2)), X2.T), Y2)

    # Residual sums of squares
    RSS = np.sum(np.square(np.dot(X, allfit) - Y))
    RSS1 = np.sum(np.square(np.dot(X1, lowerfit) - Y1))
    RSS2 = np.sum(np.square(np.dot(X2, upperfit) - Y2))

    df1 = k
    df2 = n - 2 * k

    num = (RSS - (RSS1 + RSS2)) / float(df1)
    den = (RSS1 + RSS2) / float(df2)

    Ftest = num / den
    Fcrit = stats.f.ppf([1 - alpha], df1, df2)
    return (Ftest, Fcrit, df1, df2, RSS, RSS1, RSS2)

t = np.arange(0, len(y_pred[:1000, 0]))

plt.figure(figsize=(35, 15))
plt.plot(t, y_test.T.ravel()[:1000], 'b')
plt.plot(t, y_pred.T.ravel()[:1000], 'r')
plt.show()

lower = 5
upper = x_test.shape[0] - 2
idx = lower
Ftests = []
Fcrits = []
x_chow = x_test[:, :, 0]
y_chow = y_test[:, 0]

while idx <= upper:
    (Ftest, Fcrit, df1, df2, RSS, RSS1, RSS2) = chow(x_chow, y_chow, idx)
    Ftests.append(Ftest)
    Fcrits.append(Fcrit[0])
    idx += 1

plt.figure(figsize=(35, 15))
plt.plot(Ftests)
plt.plot(Fcrits)
plt.ylim([-5, 10])
plt.show()
```

### Impulse Response of Network
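The code cell below pushes a single impulse ($x_0 = 1$, zero afterwards) through one randomly initialized LSTM unit with peephole connections and plots how the gates, cell state and output respond. Transcribed from the loop below for readability, the scalar recursion it implements is:

\begin{align}
a_i^t & = f(w_i x^t + u_i a_b^{t-1} + c_i s_c^{t-1}) \\
a_f^t & = f(w_f x^t + u_f a_b^{t-1} + c_f s_c^{t-1}) \\
s_c^t & = a_f^t \, s_c^{t-1} + a_i^t \, g(w_c x^t + u_c a_b^{t-1}) \\
a_o^t & = f(w_o x^t + u_o a_b^{t-1} + c_o s_c^{t}) \\
a_b^t & = a_o^t \, h(s_c^t)
\end{align}

with $f$ the gate activation, $g$ the cell input activation and $h$ the unit output activation, as set at the top of the cell.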
```python
from collections import defaultdict

def random_color():
    rgb = []
    for i in range(3):
        c = np.random.randint(0, 256, 1)
        rgb += [c[0]]

    return '#%02x%02x%02x' % tuple(rgb)

'''
Set the activation functions
'''
# gate activation function (tanh)
f = np.tanh
# cell input activation function (sigmoid)
g = sigmoid
# unit output activation
h = np.tanh
'''
Initialize the plots
'''
width, height = 500, 300
p_data = figure(title="Input", plot_width=width, plot_height=height)
p_unit = figure(title="Unit output", plot_width=width, plot_height=height)
p_out = figure(title="Output Gate", plot_width=width, plot_height=height)
p_cell = figure(title="Cell State", plot_width=width, plot_height=height)
p_forget = figure(title="Forget Gate", plot_width=width, plot_height=height)
p_input = figure(title="Input Gate", plot_width=width, plot_height=height)
'''
Initialize loop
'''
repeat = 20
T = 30
time = np.arange(0, T)
weight_dict = defaultdict(list)
for _ in range(repeat):
    '''
    Initialize weights and states
    '''
    # unit input weights
    w_i, w_f, w_c, w_o = np.random.standard_normal(4)
    # recurrent connections
    u_i, u_f, u_c, u_o = np.random.standard_normal(4)
    # peephole connections
    c_i, c_f, c_o = np.random.standard_normal(3)
    # store input weights
    weight_dict['w_i'] += [w_i]
    weight_dict['w_f'] += [w_f]
    weight_dict['w_c'] += [w_c]
    weight_dict['w_o'] += [w_o]
    # store recurrent connections
    weight_dict['u_i'] += [u_i]
    weight_dict['u_f'] += [u_f]
    weight_dict['u_c'] += [u_c]
    weight_dict['u_o'] += [u_o]
    # store peephole connections
    weight_dict['c_i'] += [c_i]
    weight_dict['c_f'] += [c_f]
    weight_dict['c_o'] += [c_o]
    '''
    Simulate the cell over multiple time steps with a spike at t=0
    '''
    a_i, a_f, a_o = np.zeros(T), np.zeros(T), np.zeros(T)
    s_c, a_b = np.zeros(T+1), np.zeros(T+1)
    x = np.zeros(T)
    x[0] = 1
    for m, t in zip(range(T), range(1, T+1)):
        # input gate
        z_i = w_i * x[m] + u_i * a_b[t-1] + c_i * s_c[t-1]
        a_i[m] = f(z_i)
        # forget gate
        z_f = w_f * x[m] + u_f * a_b[t-1] + c_f * s_c[t-1]
        a_f[m] = f(z_f)
        # cell state
        z_c = w_c * x[m] + u_c * a_b[t-1]
        s_c[t] = a_f[m] * s_c[t-1] + a_i[m] * g(z_c)
        # output gate
        z_o = w_o * x[m] + u_o * a_b[t-1] + c_o * s_c[t]
        a_o[m] = f(z_o)
        # unit output
        a_b[t] = a_o[m] * h(s_c[t])

    # delete the initial values
    s_c = s_c[1:]
    a_b = a_b[1:]
    # store final values
    weight_dict['a_i'] += [a_i[-1]]
    weight_dict['a_f'] += [a_f[-1]]
    weight_dict['a_o'] += [a_o[-1]]
    weight_dict['a_b'] += [a_b[-1]]
    weight_dict['s_c'] += [s_c[-1]]
    # plot
    p_data.line(time, x, color=random_color())
    p_unit.line(time, a_b, color=random_color())
    p_out.line(time, a_o, color=random_color())
    p_cell.line(time, s_c, color=random_color())
    p_forget.line(time, a_f, color=random_color())
    p_input.line(time, a_i, color=random_color())

# show the plots
grid = [[p_data, p_input], [p_forget, p_cell], [p_out, p_unit]]
show(gridplot(grid))
```

### Sources

* RNN and LSTM images taken from http://colah.github.io/posts/2015-08-Understanding-LSTMs/
* Supervised Sequence Labelling with Recurrent Neural Networks (Graves): https://www.cs.toronto.edu/~graves/preprint.pdf
* Neural Networks and Deep Learning: http://www.neuralnetworksanddeeplearning.com

--------------------------------------------------------------------------------