├── CLML-Datasets.ipynb ├── CLML-Datasets.lisp ├── CLML-Time-Series-Part-1.ipynb ├── CLML-Time-Series-Part-1.lisp ├── CLML-Wine-pca-k-means-and-hierarchical-clustering.ipynb ├── CLML-Wine-pca-k-means-and-hierarchical-clustering.lisp └── README.md /CLML-Datasets.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# CLML Datasets Tutorial" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "> (C) 2015 Mike Maul -- CC-BY-SA 3.0" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "This document is part of a series of tutorials illustrating the use of CLML. " 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "## Datasets, what and why" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "CLML datasets are two-dimensional tabular data structures. In CLML, datasets are used for (not to sound recursive) storing datasets. Datasets may contain numerical and categorical data. Datasets also contain column metadata (`dimensions`) and provide facilities for extracting columns, dataset cleaning and splitting. Datasets in CLML are similar to data frames in R or pandas DataFrames in Python.\n", 36 | "\n", 37 | "Let's get started by loading the systems necessary for this tutorial and creating a namespace to work in." 
38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 1, 43 | "metadata": { 44 | "collapsed": false 45 | }, 46 | "outputs": [ 47 | { 48 | "name": "stdout", 49 | "output_type": "stream", 50 | "text": [ 51 | "To load \"clml.utility\":\n", 52 | " Load 1 ASDF system:\n", 53 | " clml.utility\n", 54 | "\n", 55 | "; Loading \"clml.utility\"\n", 56 | "....\n", 57 | "To load \"clml.hjs\":\n", 58 | " Load 1 ASDF system:\n", 59 | " clml.hjs\n", 60 | "\n", 61 | "; Loading \"clml.hjs\"\n", 62 | "\n", 63 | "To load \"iolib\":\n", 64 | " Load 1 ASDF system:\n", 65 | " iolib\n", 66 | "\n", 67 | "; Loading \"iolib\"\n", 68 | ".....\n", 69 | "To load \"clml.extras.eazy-gnuplot\":\n", 70 | " Load 1 ASDF system:\n", 71 | " clml.extras.eazy-gnuplot\n", 72 | "\n", 73 | "; Loading \"clml.extras.eazy-gnuplot\"\n", 74 | "\n", 75 | "To load \"eazy-gnuplot\":\n", 76 | " Load 1 ASDF system:\n", 77 | " eazy-gnuplot\n", 78 | "\n", 79 | "; Loading \"eazy-gnuplot\"\n", 80 | "\n" 81 | ] 82 | }, 83 | { 84 | "data": { 85 | "text/plain": [ 86 | "(:CLML.UTILITY :CLML.HJS :IOLIB :CLML.EXTRAS.EAZY-GNUPLOT :EAZY-GNUPLOT)" 87 | ] 88 | }, 89 | "execution_count": 1, 90 | "metadata": {}, 91 | "output_type": "execute_result" 92 | } 93 | ], 94 | "source": [ 95 | "(ql:quickload '(:clml.utility ; Need clml.utility.data to get data from the net\n", 96 | " :clml.hjs ; Need clml.hjs.read-data for dataset\n", 97 | " :iolib\n", 98 | " :clml.extras.eazy-gnuplot\n", 99 | " :eazy-gnuplot\n", 100 | " ))" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 2, 106 | "metadata": { 107 | "collapsed": false 108 | }, 109 | "outputs": [ 110 | { 111 | "data": { 112 | "text/plain": [ 113 | "#" 114 | ] 115 | }, 116 | "execution_count": 2, 117 | "metadata": {}, 118 | "output_type": "execute_result" 119 | } 120 | ], 121 | "source": [ 122 | "(defpackage #:datasets-tutorial\n", 123 | " (:use #:cl\n", 124 | " #:cl-jupyter-user ; Not needed unless using iPython notebook\n", 125 | " 
#:clml.hjs.read-data\n", 126 | " #:clml.hjs.meta ; util function\n", 127 | " #:clml.extras.eazy-gnuplot))\n" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 3, 133 | "metadata": { 134 | "collapsed": false 135 | }, 136 | "outputs": [ 137 | { 138 | "data": { 139 | "text/plain": [ 140 | "#" 141 | ] 142 | }, 143 | "execution_count": 3, 144 | "metadata": {}, 145 | "output_type": "execute_result" 146 | } 147 | ], 148 | "source": [ 149 | "(in-package :datasets-tutorial)" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "Let's load some data that we will use as we learn about datasets." 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 4, 162 | "metadata": { 163 | "collapsed": false 164 | }, 165 | "outputs": [ 166 | { 167 | "data": { 168 | "text/plain": [ 169 | "DATASET" 170 | ] 171 | }, 172 | "execution_count": 4, 173 | "metadata": {}, 174 | "output_type": "execute_result" 175 | } 176 | ], 177 | "source": [ 178 | "(defparameter dataset (read-data-from-file \n", 179 | " (clml.utility.data:fetch \"https://mmaul.github.io/clml.data/sample/cars.csv\") \n", 180 | " :type :csv :csv-type-spec '(integer integer)))" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "## Data and Datasets\n", 188 | "\n", 189 | "CLML has a number of different specializations of datasets, such as:\n", 190 | " - `unspecialized-dataset` untyped and unspecialized data\n", 191 | " - `numeric-dataset` dataset containing numeric (`double-float`) data\n", 192 | " - `category-dataset` dataset for categorical (`string`) data\n", 193 | " - `numeric-and-category-dataset` dataset containing a mixture of numeric and categorical data\n", 194 | " - `numeric-matrix-dataset` dataset where numeric values are stored as a matrix\n", 195 | " - `numeric-matrix-and-category-dataset` dataset where numeric values are stored as a matrix as well as having categorical 
data\n", 196 | "\n", 197 | "### Data representation\n", 198 | "All datasets except the matrix datasets represent data as a vector of vectors. Each inner vector holds the column values of one row. For datasets with categories, numeric and category data are stored in separate vectors.\n", 199 | "\n", 200 | "We can see below how the data is represented." 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 5, 206 | "metadata": { 207 | "collapsed": false 208 | }, 209 | "outputs": [ 210 | { 211 | "data": { 212 | "text/plain": [ 213 | "#(#(4 2) #(4 10) #(7 4) #(7 22) #(8 16) #(9 10) #(10 18) #(10 26) #(10 34)\n", 214 | " #(11 17) #(11 28) #(12 14) #(12 20) #(12 24) #(12 28) #(13 26) #(13 34)\n", 215 | " #(13 34) #(13 46) #(14 26) #(14 36) #(14 60) #(14 80) #(15 20) #(15 26)\n", 216 | " #(15 54) #(16 32) #(16 40) #(17 32) #(17 40) #(17 50) #(18 42) #(18 56)\n", 217 | " #(18 76) #(18 84) #(19 36) #(19 46) #(19 68) #(20 32) #(20 48) #(20 52)\n", 218 | " #(20 56) #(20 64) #(22 66) #(23 54) #(24 70) #(24 92) #(24 93) #(24 120)\n", 219 | " #(25 85))" 220 | ] 221 | }, 222 | "execution_count": 5, 223 | "metadata": {}, 224 | "output_type": "execute_result" 225 | } 226 | ], 227 | "source": [ 228 | "(dataset-points dataset)" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "It may not be convenient to display the whole dataset just to take a look at it. We could have used `subseq`, but there is a helper method called `head-points`." 
236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 6, 241 | "metadata": { 242 | "collapsed": false 243 | }, 244 | "outputs": [ 245 | { 246 | "data": { 247 | "text/plain": [ 248 | "#(#(4 2) #(4 10) #(7 4) #(7 22) #(8 16))" 249 | ] 250 | }, 251 | "execution_count": 6, 252 | "metadata": {}, 253 | "output_type": "execute_result" 254 | } 255 | ], 256 | "source": [ 257 | "(head-points dataset)" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "### Dimensions\n", 265 | "All datasets have a `dimensions` slot, which contains the column metadata. The dimensions slot holds a vector of `dimension` instances. Each dimension instance has the following slots (the accessor prefix is `dimension`):\n", 266 | " - `name` column name\n", 267 | " - `type` type of data in the column (e.g. :category :numeric :unknown)\n", 268 | " - `index` index of the column within the data vectors\n", 269 | " - `metadata` alist that can contain useful information, such as equality tests for category data" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 7, 275 | "metadata": { 276 | "collapsed": false 277 | }, 278 | "outputs": [ 279 | { 280 | "data": { 281 | "text/plain": [ 282 | "#(#\n", 283 | " #)" 284 | ] 285 | }, 286 | "execution_count": 7, 287 | "metadata": {}, 288 | "output_type": "execute_result" 289 | } 290 | ], 291 | "source": [ 292 | "(dataset-dimensions dataset)" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [ 299 | "### Creating datasets\n", 300 | "Datasets can be created directly or can be created by reading them from a file. Supported data formats are CSV and SEXP.\n", 301 | " Earlier we used the `read-data-from-file` function to read a dataset from a CSV file. In that case the file was obtained with the `fetch` function from the `clml.utility` system, which retrieves a file from a location on the local file system or from a URL and caches it. 
Datasets can also be created programmatically." 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": 11, 307 | "metadata": { 308 | "collapsed": false 309 | }, 310 | "outputs": [ 311 | { 312 | "data": { 313 | "text/plain": [ 314 | "#\n", 315 | "DIMENSIONS: cat 1 | num 1\n", 316 | "TYPES: CATEGORY | NUMERIC\n", 317 | "NUMBER OF DIMENSIONS: 2\n", 318 | "CATEGORY DATA POINTS: 2 POINTS\n", 319 | "NUMERIC DATA POINTS: 2 POINTS\n" 320 | ] 321 | }, 322 | "execution_count": 11, 323 | "metadata": {}, 324 | "output_type": "execute_result" 325 | } 326 | ], 327 | "source": [ 328 | "(make-numeric-and-category-dataset \n", 329 | " '(\"cat 1\" \"num 1\") ; <-- Column names \n", 330 | " (vector (v2dvec #(1.0d0)) (v2dvec #(2.0d0))) ; <-- Numeric data \n", 331 | " '(1) ; <-- Indexes of numeric column\n", 332 | " #(#(\"a\") #(\"b\")) ; <-- Category Data\n", 333 | " '(0) ; <-- Indexes of category data\n", 334 | ")" 335 | ] 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "metadata": {}, 340 | "source": [ 341 | "### Specializing datasets\n", 342 | "The dataset we loaded is currently unspecialized; we haven't told CLML much about it yet. We can use the `pick-and-specialize-data` method to fill in the details." 
343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": 8, 348 | "metadata": { 349 | "collapsed": false 350 | }, 351 | "outputs": [ 352 | { 353 | "data": { 354 | "text/plain": [ 355 | "#\n", 356 | "DIMENSIONS: speed | distance\n", 357 | "TYPES: NUMERIC | NUMERIC\n", 358 | "NUMBER OF DIMENSIONS: 2\n", 359 | "NUMERIC DATA POINTS: 50 POINTS\n" 360 | ] 361 | }, 362 | "execution_count": 8, 363 | "metadata": {}, 364 | "output_type": "execute_result" 365 | } 366 | ], 367 | "source": [ 368 | "(pick-and-specialize-data dataset :data-types '(:numeric :numeric))" 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "metadata": { 374 | "collapsed": false 375 | }, 376 | "source": [ 377 | "We can see that `pick-and-specialize-data` returned a numeric dataset based on the supplied `:data-types` specification. `pick-and-specialize-data` also accepts two parameters, `:range` and `:except`, both of which deal with column selection: `:range` specifies a range of columns (as a list) to use in the new dataset, while `:except` specifies a list of columns to exclude from it. We also mentioned the matrix datasets; `pick-and-specialize-data` can change the representation from a vector of vectors to a matrix." 
378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": 56, 383 | "metadata": { 384 | "collapsed": false 385 | }, 386 | "outputs": [ 387 | { 388 | "name": "stdout", 389 | "output_type": "stream", 390 | "text": [ 391 | "\n", 392 | "#\n", 393 | "DIMENSIONS: speed | distance\n", 394 | "\n", 395 | "TYPES: NUMERIC | NUMERIC\n", 396 | "\n", 397 | "NUMBER OF DIMENSIONS: 2\n", 398 | "\n", 399 | "NUMERIC-MATRIX DATA POINTS: 50 POINTS\n", 400 | " " 401 | ] 402 | }, 403 | { 404 | "data": { 405 | "text/plain": [ 406 | "#2A((4.0d0 2.0d0)\n", 407 | " (4.0d0 10.0d0)\n", 408 | " (7.0d0 4.0d0)\n", 409 | " (7.0d0 22.0d0)\n", 410 | " (8.0d0 16.0d0)\n", 411 | " (9.0d0 10.0d0)\n", 412 | " (10.0d0 18.0d0)\n", 413 | " (10.0d0 26.0d0)\n", 414 | " (10.0d0 34.0d0)\n", 415 | " (11.0d0 17.0d0)\n", 416 | " (11.0d0 28.0d0)\n", 417 | " (12.0d0 14.0d0)\n", 418 | " (12.0d0 20.0d0)\n", 419 | " (12.0d0 24.0d0)\n", 420 | " (12.0d0 28.0d0)\n", 421 | " (13.0d0 26.0d0)\n", 422 | " (13.0d0 34.0d0)\n", 423 | " (13.0d0 34.0d0)\n", 424 | " (13.0d0 46.0d0)\n", 425 | " (14.0d0 26.0d0)\n", 426 | " (14.0d0 36.0d0)\n", 427 | " (14.0d0 60.0d0)\n", 428 | " (14.0d0 80.0d0)\n", 429 | " (15.0d0 20.0d0)\n", 430 | " (15.0d0 26.0d0)\n", 431 | " (15.0d0 54.0d0)\n", 432 | " (16.0d0 32.0d0)\n", 433 | " (16.0d0 40.0d0)\n", 434 | " (17.0d0 32.0d0)\n", 435 | " (17.0d0 40.0d0)\n", 436 | " (17.0d0 50.0d0)\n", 437 | " (18.0d0 42.0d0)\n", 438 | " (18.0d0 56.0d0)\n", 439 | " (18.0d0 76.0d0)\n", 440 | " (18.0d0 84.0d0)\n", 441 | " (19.0d0 36.0d0)\n", 442 | " (19.0d0 46.0d0)\n", 443 | " (19.0d0 68.0d0)\n", 444 | " (20.0d0 32.0d0)\n", 445 | " (20.0d0 48.0d0)\n", 446 | " (20.0d0 52.0d0)\n", 447 | " (20.0d0 56.0d0)\n", 448 | " (20.0d0 64.0d0)\n", 449 | " (22.0d0 66.0d0)\n", 450 | " (23.0d0 54.0d0)\n", 451 | " (24.0d0 70.0d0)\n", 452 | " (24.0d0 92.0d0)\n", 453 | " (24.0d0 93.0d0)\n", 454 | " (24.0d0 120.0d0)\n", 455 | " (25.0d0 85.0d0))" 456 | ] 457 | }, 458 | "execution_count": 56, 459 | "metadata": {}, 460 | 
"output_type": "execute_result" 461 | } 462 | ], 463 | "source": [ 464 | "(let ((ds (pick-and-specialize-data dataset :data-types '(:numeric :numeric) \n", 465 | " :store-numeric-data-as-matrix t)))\n", 466 | " (print ds)\n", 467 | " (dataset-numeric-points ds))" 468 | ] 469 | }, 470 | { 471 | "cell_type": "markdown", 472 | "metadata": {}, 473 | "source": [ 474 | "We should also show an example of a dataset with categories." 475 | ] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "execution_count": 10, 480 | "metadata": { 481 | "collapsed": false 482 | }, 483 | "outputs": [ 484 | { 485 | "data": { 486 | "text/plain": [ 487 | "#\n", 488 | "DIMENSIONS: year season | UKgas\n", 489 | "TYPES: CATEGORY | NUMERIC\n", 490 | "NUMBER OF DIMENSIONS: 2\n", 491 | "CATEGORY DATA POINTS: 108 POINTS\n", 492 | "NUMERIC DATA POINTS: 108 POINTS\n" 493 | ] 494 | }, 495 | "execution_count": 10, 496 | "metadata": {}, 497 | "output_type": "execute_result" 498 | } 499 | ], 500 | "source": [ 501 | "(pick-and-specialize-data (read-data-from-file \n", 502 | " (clml.utility.data:fetch \"https://mmaul.github.io/clml.data/sample/UKgas.sexp\"))\n", 503 | " :data-types '(:category :numeric))" 504 | ] 505 | }, 506 | { 507 | "cell_type": "markdown", 508 | "metadata": {}, 509 | "source": [ 510 | "Datasets can be created and combined. Generally the dataset creation methods take the form of `make-` and either take vectors containing data or other datasets and create a new dataset." 511 | ] 512 | }, 513 | { 514 | "cell_type": "markdown", 515 | "metadata": {}, 516 | "source": [ 517 | "### Missing Values\n", 518 | "CLML datasets support missing values. 
Missing values are represented as follows in the dataset-points:\n", 519 | " - category CLML.HJS.MISSING-VALUE:+C-NAN+\n", 520 | " - numeric CLML.HJS.MISSING-VALUE:+NAN+\n", 521 | " - unspecialized :NA\n", 522 | " \n", 523 | "The following predicates are available to detect missing values:\n", 524 | " - `CLML.HJS.MISSING-VALUE:C-NAN-P`\n", 525 | " - `CLML.HJS.MISSING-VALUE:NAN-P`\n", 526 | " \n", 527 | "`read-data-from-file` also supports mapping the representations of missing values in data files to those used in datasets.\n", 528 | "The `:missing-values-list` keyword argument specifies the character sequences that will be recognized as missing values.\n", 529 | "\n", 530 | "To illustrate missing-value support, let's read in a CSV file containing the following:\n", 531 | "\n", 532 | " a, b, c\n", 533 | " 1.0, 2.0, x\n", 534 | " NA, 3.0, NA\n", 535 | " \n", 536 | "Here missing values are represented in the CSV file by *NA*. For the `read-data-from-file` function to recognize the missing values, we must set the `:missing-values-list` parameter as shown below:" 537 | ] 538 | }, 539 | { 540 | "cell_type": "code", 541 | "execution_count": 12, 542 | "metadata": { 543 | "collapsed": false 544 | }, 545 | "outputs": [ 546 | { 547 | "data": { 548 | "text/plain": [ 549 | "\"#\n", 550 | "DIMENSIONS: a | b | c\n", 551 | "TYPES: UNKNOWN | UNKNOWN | UNKNOWN\n", 552 | "NUMBER OF DIMENSIONS: 3\n", 553 | "DATA POINTS: 2 POINTS\n", 554 | "\n", 555 | "#(#(1.0d0 2.0d0 x) #(NA 3.0d0 NA))\n", 556 | "\"" 557 | ] 558 | }, 559 | "execution_count": 12, 560 | "metadata": {}, 561 | "output_type": "execute_result" 562 | } 563 | ], 564 | "source": [ 565 | "(let ((ds (read-data-from-file \n", 566 | " (clml.utility.data:fetch \"https://mmaul.github.io/clml.data/sample/simple1.csv\") \n", 567 | " :type :csv \n", 568 | " :csv-type-spec '(double-float double-float string) \n", 569 | " :missing-values-list '(\"NA\")\n", 570 | " )))\n", 571 | " (format nil \"~A~%~A~%\" ds (dataset-points ds)))" 572 | ] 573 | }, 574 
| { 575 | "cell_type": "markdown", 576 | "metadata": {}, 577 | "source": [ 578 | "We can also see how missing values are represented in a specialized dataset:" 579 | ] 580 | }, 581 | { 582 | "cell_type": "code", 583 | "execution_count": 13, 584 | "metadata": { 585 | "collapsed": false 586 | }, 587 | "outputs": [ 588 | { 589 | "data": { 590 | "text/plain": [ 591 | "\"#\n", 592 | "DIMENSIONS: a | b | c\n", 593 | "TYPES: NUMERIC | NUMERIC | CATEGORY\n", 594 | "NUMBER OF DIMENSIONS: 3\n", 595 | "CATEGORY DATA POINTS: 2 POINTS\n", 596 | "NUMERIC DATA POINTS: 2 POINTS\n", 597 | "\n", 598 | "#(#(1.0d0 2.0d0) #(# 3.0d0))\n", 599 | "#(#(x) #(0))\n", 600 | "\"" 601 | ] 602 | }, 603 | "execution_count": 13, 604 | "metadata": {}, 605 | "output_type": "execute_result" 606 | } 607 | ], 608 | "source": [ 609 | "(let ((ds (pick-and-specialize-data\n", 610 | " (read-data-from-file \n", 611 | " (clml.utility.data:fetch \"https://mmaul.github.io/clml.data/sample/simple1.csv\") \n", 612 | " :type :csv \n", 613 | " :csv-type-spec '(double-float double-float string) \n", 614 | " :missing-values-list '(\"NA\")\n", 615 | " )\n", 616 | " :data-types '(:numeric :numeric :category)\n", 617 | " )))\n", 618 | " (format nil \"~A~%~A~%~A~%\" ds (dataset-numeric-points ds) (dataset-category-points ds))\n", 619 | ")" 620 | ] 621 | }, 622 | { 623 | "cell_type": "markdown", 624 | "metadata": {}, 625 | "source": [ 626 | "#### Dataset manipulation\n", 627 | "The following operations can be performed on datasets:\n", 628 | " - copying\n", 629 | " - splitting and sampling\n", 630 | " - cleaning\n", 631 | " - shuffling\n", 632 | " - deduplication\n", 633 | " - storing\n", 634 | " \n", 635 | "We will use the UK Gas dataset to illustrate these operations." 
636 | ] 637 | }, 638 | { 639 | "cell_type": "code", 640 | "execution_count": 14, 641 | "metadata": { 642 | "collapsed": false 643 | }, 644 | "outputs": [ 645 | { 646 | "data": { 647 | "text/plain": [ 648 | "UKGAS" 649 | ] 650 | }, 651 | "execution_count": 14, 652 | "metadata": {}, 653 | "output_type": "execute_result" 654 | } 655 | ], 656 | "source": [ 657 | "(defparameter ukgas (pick-and-specialize-data (read-data-from-file \n", 658 | " (clml.utility.data:fetch \"https://mmaul.github.io/clml.data/sample/UKgas.sexp\"))\n", 659 | " :data-types '(:category :numeric)))" 660 | ] 661 | }, 662 | { 663 | "cell_type": "markdown", 664 | "metadata": {}, 665 | "source": [ 666 | "##### Copying\n", 667 | "The simplest operation is copying. `copy-dataset` makes a deep copy of the contents of a dataset." 668 | ] 669 | }, 670 | { 671 | "cell_type": "code", 672 | "execution_count": 15, 673 | "metadata": { 674 | "collapsed": false 675 | }, 676 | "outputs": [ 677 | { 678 | "data": { 679 | "text/plain": [ 680 | "#\n", 681 | "DIMENSIONS: year season | UKgas\n", 682 | "TYPES: CATEGORY | NUMERIC\n", 683 | "NUMBER OF DIMENSIONS: 2\n", 684 | "CATEGORY DATA POINTS: 108 POINTS\n", 685 | "NUMERIC DATA POINTS: 108 POINTS\n" 686 | ] 687 | }, 688 | "execution_count": 15, 689 | "metadata": {}, 690 | "output_type": "execute_result" 691 | } 692 | ], 693 | "source": [ 694 | "(copy-dataset ukgas)" 695 | ] 696 | }, 697 | { 698 | "cell_type": "markdown", 699 | "metadata": {}, 700 | "source": [ 701 | "##### Splitting and Sampling\n", 702 | "Datasets can be subdivided by two similar methods, `make-bootstrap-sample-datasets` and `divide-dataset`.\n", 703 | "\n", 704 | "`divide-dataset` returns a dataset split into two parts based upon the `:divide-ratio`. Like `pick-and-specialize-data`, `divide-dataset` can restrict which values are used with the `:range` and `:except` parameters. It can also pull values into the new datasets in a pseudo-random manner when `:random` is set." 
705 | ] 706 | }, 707 | { 708 | "cell_type": "code", 709 | "execution_count": 17, 710 | "metadata": { 711 | "collapsed": false 712 | }, 713 | "outputs": [ 714 | { 715 | "data": { 716 | "text/plain": [ 717 | "(#\n", 718 | "DIMENSIONS: year season | UKgas\n", 719 | "\n", 720 | "TYPES: CATEGORY | NUMERIC\n", 721 | "\n", 722 | "NUMBER OF DIMENSIONS: 2\n", 723 | "\n", 724 | "CATEGORY DATA POINTS: 81 POINTS\n", 725 | "\n", 726 | "NUMERIC DATA POINTS: 81 POINTS\n", 727 | "\n", 728 | " #\n", 729 | "DIMENSIONS: year season | UKgas\n", 730 | "\n", 731 | "TYPES: CATEGORY | NUMERIC\n", 732 | "\n", 733 | "NUMBER OF DIMENSIONS: 2\n", 734 | "\n", 735 | "CATEGORY DATA POINTS: 27 POINTS\n", 736 | "\n", 737 | "NUMERIC DATA POINTS: 27 POINTS\n", 738 | ")" 739 | ] 740 | }, 741 | "execution_count": 17, 742 | "metadata": {}, 743 | "output_type": "execute_result" 744 | } 745 | ], 746 | "source": [ 747 | " (multiple-value-list (divide-dataset ukgas :divide-ratio '(3 1) :random t))" 748 | ] 749 | }, 750 | { 751 | "cell_type": "markdown", 752 | "metadata": {}, 753 | "source": [ 754 | "`make-bootstrap-sample-datasets`, on the other hand, shuffles a dataset into a specified number of datasets, each of equal length to the original dataset. The `:number-of-datasets` parameter defaults to 10." 
755 | ] 756 | }, 757 | { 758 | "cell_type": "code", 759 | "execution_count": 18, 760 | "metadata": { 761 | "collapsed": false 762 | }, 763 | "outputs": [ 764 | { 765 | "data": { 766 | "text/plain": [ 767 | "(#\n", 768 | "DIMENSIONS: year season | UKgas\n", 769 | "\n", 770 | "TYPES: CATEGORY | NUMERIC\n", 771 | "\n", 772 | "NUMBER OF DIMENSIONS: 2\n", 773 | "\n", 774 | "CATEGORY DATA POINTS: 108 POINTS\n", 775 | "\n", 776 | "NUMERIC DATA POINTS: 108 POINTS\n", 777 | "\n", 778 | " #\n", 779 | "DIMENSIONS: year season | UKgas\n", 780 | "\n", 781 | "TYPES: CATEGORY | NUMERIC\n", 782 | "\n", 783 | "NUMBER OF DIMENSIONS: 2\n", 784 | "\n", 785 | "CATEGORY DATA POINTS: 108 POINTS\n", 786 | "\n", 787 | "NUMERIC DATA POINTS: 108 POINTS\n", 788 | "\n", 789 | " #\n", 790 | "DIMENSIONS: year season | UKgas\n", 791 | "\n", 792 | "TYPES: CATEGORY | NUMERIC\n", 793 | "\n", 794 | "NUMBER OF DIMENSIONS: 2\n", 795 | "\n", 796 | "CATEGORY DATA POINTS: 108 POINTS\n", 797 | "\n", 798 | "NUMERIC DATA POINTS: 108 POINTS\n", 799 | ")" 800 | ] 801 | }, 802 | "execution_count": 18, 803 | "metadata": {}, 804 | "output_type": "execute_result" 805 | } 806 | ], 807 | "source": [ 808 | "(make-bootstrap-sample-datasets ukgas :number-of-datasets 3)" 809 | ] 810 | }, 811 | { 812 | "cell_type": "markdown", 813 | "metadata": {}, 814 | "source": [ 815 | "##### Cleaning\n", 816 | "\n", 817 | "One nice feature of CLML is its dataset cleaning capabilities. 
The `dataset-cleaning` method provides the following:\n", 818 | " - outlier detection for numeric points\n", 819 | " - standard deviation \n", 820 | " - mean deviation \n", 821 | " - user-provided function\n", 822 | " - Smirnov-Grubbs \n", 823 | " - outlier detection for categorical points\n", 824 | " - frequency based\n", 825 | " - user-provided function\n", 826 | " - outlier and missing value interpolation using:\n", 827 | " - zero \n", 828 | " - min \n", 829 | " - max \n", 830 | " - mean \n", 831 | " - median \n", 832 | " - mode \n", 833 | " - spline\n", 834 | " \n", 835 | "To illustrate, we will perform dataset cleaning where outliers are points that exceed 1 standard deviation, and they will be replaced by zero." 836 | ] 837 | }, 838 | { 839 | "cell_type": "code", 840 | "execution_count": 19, 841 | "metadata": { 842 | "collapsed": false 843 | }, 844 | "outputs": [ 845 | { 846 | "data": { 847 | "text/plain": [ 848 | "#\n", 849 | "DIMENSIONS: year season | UKgas\n", 850 | "TYPES: CATEGORY | NUMERIC\n", 851 | "NUMBER OF DIMENSIONS: 2\n", 852 | "CATEGORY DATA POINTS: 108 POINTS\n", 853 | "NUMERIC DATA POINTS: 108 POINTS\n" 854 | ] 855 | }, 856 | "execution_count": 19, 857 | "metadata": {}, 858 | "output_type": "execute_result" 859 | } 860 | ], 861 | "source": [ 862 | " (dataset-cleaning ukgas :outlier-types-alist '((\"UKgas\" . :std-dev)) \n", 863 | " :outlier-values-alist '((:std-dev . 1)) \n", 864 | " :interp-types-alist '((\"UKgas\" . :zero)))" 865 | ] 866 | }, 867 | { 868 | "cell_type": "markdown", 869 | "metadata": {}, 870 | "source": [ 871 | "##### Adding dimensions and Concatenating Datasets\n", 872 | "\n", 873 | "###### Adding dimensions\n", 874 | "In some cases you may want to add a computed column, or add a column to a dataset to hold the result of a computation on that dataset. The `add-dim` method can accomplish this easily. 
It can add an existing column of points with the `:points` parameter, or create a column filled with an initial value with the `:initial-value` parameter. The three mandatory parameters are the dataset to add the dimension to, the name of the new dimension, and its type. If the dataset is a category-only or numeric-only dataset, `add-dim` will create a `numeric-and-category-dataset` when a column of a different type is added." 875 | ] 876 | }, 877 | { 878 | "cell_type": "code", 879 | "execution_count": 20, 880 | "metadata": { 881 | "collapsed": false 882 | }, 883 | "outputs": [ 884 | { 885 | "data": { 886 | "text/plain": [ 887 | "#\n", 888 | "DIMENSIONS: year season | UKgas | mpg\n", 889 | "TYPES: CATEGORY | NUMERIC | NUMERIC\n", 890 | "NUMBER OF DIMENSIONS: 3\n", 891 | "CATEGORY DATA POINTS: 108 POINTS\n", 892 | "NUMERIC DATA POINTS: 108 POINTS\n" 893 | ] 894 | }, 895 | "execution_count": 20, 896 | "metadata": {}, 897 | "output_type": "execute_result" 898 | } 899 | ], 900 | "source": [ 901 | "(add-dim ukgas \"mpg\" :numeric :initial-value 0.0d0)" 902 | ] 903 | }, 904 | { 905 | "cell_type": "markdown", 906 | "metadata": {}, 907 | "source": [ 908 | "###### Concatenating datasets\n", 909 | "Two datasets with matching columns can be concatenated, or glued together vertically. `concatenate-datasets` takes two datasets as parameters and returns a dataset with the points of the first dataset stacked on top of the points of the second dataset. The dimension names of the first dataset are retained in the new dataset." 
910 | ] 911 | }, 912 | { 913 | "cell_type": "code", 914 | "execution_count": 21, 915 | "metadata": { 916 | "collapsed": false 917 | }, 918 | "outputs": [ 919 | { 920 | "data": { 921 | "text/plain": [ 922 | "#\n", 923 | "DIMENSIONS: year season | UKgas\n", 924 | "TYPES: CATEGORY | NUMERIC\n", 925 | "NUMBER OF DIMENSIONS: 2\n", 926 | "CATEGORY DATA POINTS: 216 POINTS\n", 927 | "NUMERIC DATA POINTS: 216 POINTS\n" 928 | ] 929 | }, 930 | "execution_count": 21, 931 | "metadata": {}, 932 | "output_type": "execute_result" 933 | } 934 | ], 935 | "source": [ 936 | "(concatenate-datasets ukgas ukgas)" 937 | ] 938 | }, 939 | { 940 | "cell_type": "markdown", 941 | "metadata": {}, 942 | "source": [ 943 | "##### Deduplicating datasets\n", 944 | "Datasets can be deduplicated in place with the `dedup-dataset!` method. This functionality is currently only implemented for numeric, category and unspecialized datasets.\n", 945 | "\n", 946 | "##### Storing datasets\n", 947 | "Datasets can be saved to a file in CSV format with the `write-dataset` method." 948 | ] 949 | }, 950 | { 951 | "cell_type": "code", 952 | "execution_count": 24, 953 | "metadata": { 954 | "collapsed": false 955 | }, 956 | "outputs": [ 957 | { 958 | "data": { 959 | "text/plain": [ 960 | "(CLML.UTILITY.CSV::OK)" 961 | ] 962 | }, 963 | "execution_count": 24, 964 | "metadata": {}, 965 | "output_type": "execute_result" 966 | } 967 | ], 968 | "source": [ 969 | " (write-dataset ukgas \"gasgas.csv\")" 970 | ] 971 | }, 972 | { 973 | "cell_type": "markdown", 974 | "metadata": {}, 975 | "source": [ 976 | "#### Working with dataset points\n", 977 | "\n", 978 | "Columns and values can be accessed and extracted from datasets using the `!!` macro. Given a column name or a list of column names, it returns the column data: a vector of vectors if multiple column names are specified, or a single vector if a single column name is specified." 
979 | ] 980 | }, 981 | { 982 | "cell_type": "code", 983 | "execution_count": 23, 984 | "metadata": { 985 | "collapsed": false 986 | }, 987 | "outputs": [ 988 | { 989 | "data": { 990 | "text/plain": [ 991 | "#(160.1d0 129.7d0 84.8d0 120.1d0 160.1d0 124.9d0 84.8d0 116.9d0 169.7d0 140.9d0\n", 992 | " 89.7d0 123.3d0 187.3d0 144.1d0 92.9d0 120.1d0 176.1d0 147.3d0 89.7d0 123.3d0\n", 993 | " 185.7d0 155.3d0 99.3d0 131.3d0 200.1d0 161.7d0 102.5d0 136.1d0 204.9d0\n", 994 | " 176.1d0 112.1d0 140.9d0 227.3d0 195.3d0 115.3d0 142.5d0 244.9d0 214.5d0\n", 995 | " 118.5d0 153.7d0 244.9d0 216.1d0 188.9d0 142.5d0 301.0d0 196.9d0 136.1d0\n", 996 | " 267.3d0 317.0d0 230.5d0 152.1d0 336.2d0 371.4d0 240.1d0 158.5d0 355.4d0\n", 997 | " 449.9d0 286.6d0 179.3d0 403.4d0 491.5d0 321.8d0 177.7d0 409.8d0 593.9d0\n", 998 | " 329.8d0 176.1d0 483.5d0 584.3d0 395.4d0 187.3d0 485.1d0 669.2d0 421.0d0\n", 999 | " 216.1d0 509.1d0 827.7d0 467.5d0 209.7d0 542.7d0 840.5d0 414.6d0 217.7d0\n", 1000 | " 670.8d0 848.5d0 437.0d0 209.7d0 701.2d0 925.3d0 443.4d0 214.5d0 683.6d0\n", 1001 | " 917.3d0 515.5d0 224.1d0 694.8d0 989.4d0 477.1d0 233.7d0 730.0d0 1087.0d0\n", 1002 | " 534.7d0 281.8d0 787.6d0 1163.9d0 613.1d0 347.4d0 782.8d0)" 1003 | ] 1004 | }, 1005 | "execution_count": 23, 1006 | "metadata": {}, 1007 | "output_type": "execute_result" 1008 | } 1009 | ], 1010 | "source": [ 1011 | "(!! ukgas \"UKgas\")" 1012 | ] 1013 | }, 1014 | { 1015 | "cell_type": "markdown", 1016 | "metadata": {}, 1017 | "source": [ 1018 | "Dataset points can also be accessed with the slot accessor. 
Since category and numeric data are stored separately in heterogeneous datasets, separate accessors are used to access the points.\n", 1019 | "\n", 1020 | "The list below shows which accessors are applicable to each dataset type.\n", 1021 | "- `dataset-points`: `unspecialized-dataset`\n", 1022 | "- `dataset-numeric-points`: `numeric-dataset` `numeric-and-category-dataset` `numeric-matrix-dataset` `numeric-matrix-and-category-dataset`\n", 1023 | "- `dataset-category-points`: `category-dataset` `numeric-and-category-dataset` `numeric-matrix-dataset` `numeric-matrix-and-category-dataset`" 1024 | ] 1025 | }, 1026 | { 1027 | "cell_type": "code", 1028 | "execution_count": 25, 1029 | "metadata": { 1030 | "collapsed": false 1031 | }, 1032 | "outputs": [ 1033 | { 1034 | "data": { 1035 | "text/plain": [ 1036 | "#(#(160.1d0) #(129.7d0) #(84.8d0) #(120.1d0) #(160.1d0) #(124.9d0) #(84.8d0)\n", 1037 | " #(116.9d0) #(169.7d0) #(140.9d0) #(89.7d0) #(123.3d0) #(187.3d0) #(144.1d0)\n", 1038 | " #(92.9d0) #(120.1d0) #(176.1d0) #(147.3d0) #(89.7d0) #(123.3d0) #(185.7d0)\n", 1039 | " #(155.3d0) #(99.3d0) #(131.3d0) #(200.1d0) #(161.7d0) #(102.5d0) #(136.1d0)\n", 1040 | " #(204.9d0) #(176.1d0) #(112.1d0) #(140.9d0) #(227.3d0) #(195.3d0) #(115.3d0)\n", 1041 | " #(142.5d0) #(244.9d0) #(214.5d0) #(118.5d0) #(153.7d0) #(244.9d0) #(216.1d0)\n", 1042 | " #(188.9d0) #(142.5d0) #(301.0d0) #(196.9d0) #(136.1d0) #(267.3d0) #(317.0d0)\n", 1043 | " #(230.5d0) #(152.1d0) #(336.2d0) #(371.4d0) #(240.1d0) #(158.5d0) #(355.4d0)\n", 1044 | " #(449.9d0) #(286.6d0) #(179.3d0) #(403.4d0) #(491.5d0) #(321.8d0) #(177.7d0)\n", 1045 | " #(409.8d0) #(593.9d0) #(329.8d0) #(176.1d0) #(483.5d0) #(584.3d0) #(395.4d0)\n", 1046 | " #(187.3d0) #(485.1d0) #(669.2d0) #(421.0d0) #(216.1d0) #(509.1d0) #(827.7d0)\n", 1047 | " #(467.5d0) #(209.7d0) #(542.7d0) #(840.5d0) #(414.6d0) #(217.7d0) #(670.8d0)\n", 1048 | " #(848.5d0) #(437.0d0) #(209.7d0) #(701.2d0) #(925.3d0) #(443.4d0) #(214.5d0)\n", 1049 | " #(683.6d0) #(917.3d0) 
#(515.5d0) #(224.1d0) #(694.8d0) #(989.4d0) #(477.1d0)\n", 1050 | " #(233.7d0) #(730.0d0) #(1087.0d0) #(534.7d0) #(281.8d0) #(787.6d0)\n", 1051 | " #(1163.9d0) #(613.1d0) #(347.4d0) #(782.8d0))" 1052 | ] 1053 | }, 1054 | "execution_count": 25, 1055 | "metadata": {}, 1056 | "output_type": "execute_result" 1057 | } 1058 | ], 1059 | "source": [ 1060 | "(dataset-numeric-points ukgas)" 1061 | ] 1062 | }, 1063 | { 1064 | "cell_type": "markdown", 1065 | "metadata": {}, 1066 | "source": [ 1067 | "## A little something extra, `R-datasets`\n", 1068 | "\n", 1069 | "One thing that I've always found handy in R is its standard, curated, extensive and documented collection of datasets. Wouldn't it be nice to have access to these directly as datasets in CLML? The `R-datasets` system in `clml.extras` provides this capability. A particularly good use case for these datasets is following along in CLML with examples and tutorials written for R.\n", 1070 | "\n", 1071 | "#### Note\n", 1072 | "The `clml.extras` systems are not currently part of Quicklisp, so if you are following along with this tutorial and expect to just `(quickload :clml.extras.Rdatasets)`, you can't until you clone the clml.extras repository [http://github.com/mmaul/clml.extras.git](http://github.com/mmaul/clml.extras.git) into your `quicklisp/local-projects` directory.\n", 1073 | "\n", 1074 | "### Using Rdatasets\n", 1075 | "The Rdatasets package makes the datasets included with the R language distribution available as CLML datasets.\n", 1076 | "The R datasets are obtained from CSV files in Vincent Arel-Bundock's GitHub repository.\n", 1077 | "More information on these datasets can be found at \n", 1078 | "\n", 1079 | "Because type information is not included, it may be necessary to provide a `csv-type-spec`\n", 1080 | "for the columns in the CSV file." 
1081 | ] 1082 | }, 1083 | { 1084 | "cell_type": "code", 1085 | "execution_count": 49, 1086 | "metadata": { 1087 | "collapsed": false 1088 | }, 1089 | "outputs": [ 1090 | { 1091 | "name": "stdout", 1092 | "output_type": "stream", 1093 | "text": [ 1094 | "To load \"clml.r-datasets\":\n", 1095 | " Load 1 ASDF system:\n", 1096 | " clml.r-datasets\n", 1097 | "\n", 1098 | "; Loading \"clml.r-datasets\"\n", 1099 | "\n" 1100 | ] 1101 | }, 1102 | { 1103 | "data": { 1104 | "text/plain": [ 1105 | "(:CLML.R-DATASETS)" 1106 | ] 1107 | }, 1108 | "execution_count": 49, 1109 | "metadata": {}, 1110 | "output_type": "execute_result" 1111 | } 1112 | ], 1113 | "source": [ 1114 | "(ql:quickload :clml.r-datasets)" 1115 | ] 1116 | }, 1117 | { 1118 | "cell_type": "code", 1119 | "execution_count": 27, 1120 | "metadata": { 1121 | "collapsed": false 1122 | }, 1123 | "outputs": [ 1124 | { 1125 | "data": { 1126 | "text/plain": [ 1127 | "T" 1128 | ] 1129 | }, 1130 | "execution_count": 27, 1131 | "metadata": {}, 1132 | "output_type": "execute_result" 1133 | } 1134 | ], 1135 | "source": [ 1136 | "(use-package :clml.r-datasets)" 1137 | ] 1138 | }, 1139 | { 1140 | "cell_type": "code", 1141 | "execution_count": 28, 1142 | "metadata": { 1143 | "collapsed": false 1144 | }, 1145 | "outputs": [ 1146 | { 1147 | "data": { 1148 | "text/plain": [ 1149 | "DD" 1150 | ] 1151 | }, 1152 | "execution_count": 28, 1153 | "metadata": {}, 1154 | "output_type": "execute_result" 1155 | } 1156 | ], 1157 | "source": [ 1158 | "(defparameter dd (get-r-dataset-directory))" 1159 | ] 1160 | }, 1161 | { 1162 | "cell_type": "code", 1163 | "execution_count": 54, 1164 | "metadata": { 1165 | "collapsed": false 1166 | }, 1167 | "outputs": [ 1168 | { 1169 | "data": { 1170 | "text/plain": [ 1171 | "\"Package Item Title \n", 1172 | "------------------------- ------------------------- ------------------------- \n", 1173 | "datasets AirPassengers Monthly Airline Passenger Numbers 1949-1960 \n", 1174 | "\n", 1175 | "datasets BJsales 
Sales Data with Leading Indicator \n", 1176 | "\n", 1177 | "datasets BOD Biochemical Oxygen Demand \n", 1178 | "\n", 1179 | "datasets Formaldehyde Determination of Formaldehyde\"" 1180 | ] 1181 | }, 1182 | "execution_count": 54, 1183 | "metadata": {}, 1184 | "output_type": "execute_result" 1185 | } 1186 | ], 1187 | "source": [ 1188 | "(subseq (inventory dd :stream nil) 0 505)" 1189 | ] 1190 | }, 1191 | { 1192 | "cell_type": "code", 1193 | "execution_count": 57, 1194 | "metadata": { 1195 | "collapsed": false 1196 | }, 1197 | "outputs": [ 1198 | { 1199 | "data": { 1200 | "text/plain": [ 1201 | "\"\\\"\n", 1202 | "R: Biochemical Oxygen Demand\n", 1203 | "\n", 1204 | "\n", 1205 | "\n", 1206 | "\n", 1207 | "BODR Documentation\n", 1208 | "\n", 1209 | " Biochemical Oxygen Demand \n", 1210 | "\n", 1211 | "Description\n", 1212 | "\n", 1213 | "The BOD data frame has 6 rows and 2 columns giving the\n", 1214 | "biochemical oxygen demand versus time in an eval\"" 1215 | ] 1216 | }, 1217 | "execution_count": 57, 1218 | "metadata": {}, 1219 | "output_type": "execute_result" 1220 | } 1221 | ], 1222 | "source": [ 1223 | "(subseq (dataset-documentation dd \"datasets\" \"BOD\" :stream nil) 0 200)" 1224 | ] 1225 | }, 1226 | { 1227 | "cell_type": "code", 1228 | "execution_count": 38, 1229 | "metadata": { 1230 | "collapsed": false 1231 | }, 1232 | "outputs": [ 1233 | { 1234 | "data": { 1235 | "text/plain": [ 1236 | "BOD" 1237 | ] 1238 | }, 1239 | "execution_count": 38, 1240 | "metadata": {}, 1241 | "output_type": "execute_result" 1242 | } 1243 | ], 1244 | "source": [ 1245 | "(defparameter bod (get-dataset dd \"datasets\" \"BOD\" :csv-type-spec '(double-float double-float double-float)))" 1246 | ] 1247 | }, 1248 | { 1249 | "cell_type": "code", 1250 | "execution_count": 39, 1251 | "metadata": { 1252 | "collapsed": false 1253 | }, 1254 | "outputs": [ 1255 | { 1256 | "data": { 1257 | "text/plain": [ 1258 | "#\n", 1259 | "DIMENSIONS: | Time | demand\n", 1260 | "TYPES: UNKNOWN | UNKNOWN | 
UNKNOWN\n", 1261 | "NUMBER OF DIMENSIONS: 3\n", 1262 | "DATA POINTS: 6 POINTS\n" 1263 | ] 1264 | }, 1265 | "execution_count": 39, 1266 | "metadata": {}, 1267 | "output_type": "execute_result" 1268 | } 1269 | ], 1270 | "source": [ 1271 | "bod" 1272 | ] 1273 | }, 1274 | { 1275 | "cell_type": "code", 1276 | "execution_count": 40, 1277 | "metadata": { 1278 | "collapsed": false 1279 | }, 1280 | "outputs": [ 1281 | { 1282 | "data": { 1283 | "text/plain": [ 1284 | "#\n", 1285 | "DIMENSIONS: | Time | demand\n", 1286 | "TYPES: NUMERIC | NUMERIC | NUMERIC\n", 1287 | "NUMBER OF DIMENSIONS: 3\n", 1288 | "NUMERIC DATA POINTS: 6 POINTS\n" 1289 | ] 1290 | }, 1291 | "execution_count": 40, 1292 | "metadata": {}, 1293 | "output_type": "execute_result" 1294 | } 1295 | ], 1296 | "source": [ 1297 | " (pick-and-specialize-data bod :data-types '(:numeric :numeric :numeric))" 1298 | ] 1299 | }, 1300 | { 1301 | "cell_type": "markdown", 1302 | "metadata": {}, 1303 | "source": [ 1304 | "## Conclusion\n", 1305 | "\n", 1306 | "The IPython notebook and source for this tutorial can be found in the [clml.tutorials https://github.com/mmaul/clml.tutorials.git](https://github.com/mmaul/clml.tutorials.git) GitHub repository.\n", 1307 | "\n", 1308 | "### Stay tuned to the [clml.tutorials](https://mmaul.github.io/clml.tutorials/) blog or [RSS feed](https://mmaul.github.io/clml.tutorials/feed.xml) for more CLML tutorials.\n", 1309 | "\n" 1310 | ] 1311 | } 1312 | ], 1313 | "metadata": { 1314 | "kernelspec": { 1315 | "display_name": "SBCL Lisp", 1316 | "language": "lisp", 1317 | "name": "lisp" 1318 | }, 1319 | "language_info": { 1320 | "name": "common-lisp", 1321 | "version": "1.2.7" 1322 | } 1323 | }, 1324 | "nbformat": 4, 1325 | "nbformat_minor": 0 1326 | } 1327 | -------------------------------------------------------------------------------- /CLML-Datasets.lisp: -------------------------------------------------------------------------------- 1 | 2 | (ql:quickload '(:clml.utility ; Need clml.utility.data to 
get data from the net 3 | :clml.hjs ; Need clml.hjs.read-data for dataset 4 | :iolib 5 | :clml.extras.eazy-gnuplot 6 | :eazy-gnuplot 7 | )) 8 | 9 | (defpackage #:datasets-tutorial 10 | (:use #:cl 11 | #:cl-jupyter-user ; Not needed unless using iPython notebook 12 | #:clml.hjs.read-data 13 | #:clml.hjs.meta ; util function 14 | #:clml.extras.eazy-gnuplot)) 15 | 16 | 17 | (in-package :datasets-tutorial) 18 | 19 | (defparameter dataset (read-data-from-file 20 | (clml.utility.data:fetch "https://mmaul.github.io/clml.data/sample/cars.csv") 21 | :type :csv :csv-type-spec '(integer integer))) 22 | 23 | (dataset-points dataset) 24 | 25 | (head-points dataset) 26 | 27 | (dataset-dimensions dataset) 28 | 29 | (make-numeric-and-category-dataset 30 | '("cat 1" "num 1") ; <-- Column names 31 | (vector (v2dvec #(1.0d0)) (v2dvec #(2.0d0))) ; <-- Numeric data 32 | '(1) ; <-- Indexes of numeric column 33 | #(#("a") #("b")) ; <-- Category Data 34 | '(0) ; <-- Indexes of category data 35 | ) 36 | 37 | (pick-and-specialize-data dataset :data-types '(:numeric :numeric)) 38 | 39 | (let ((ds (pick-and-specialize-data dataset :data-types '(:numeric :numeric) 40 | :store-numeric-data-as-matrix t))) 41 | (print ds) 42 | (dataset-numeric-points ds)) 43 | 44 | (pick-and-specialize-data (read-data-from-file 45 | (clml.utility.data:fetch "https://mmaul.github.io/clml.data/sample/UKgas.sexp")) 46 | :data-types '(:category :numeric)) 47 | 48 | (let ((ds (read-data-from-file 49 | (clml.utility.data:fetch "https://mmaul.github.io/clml.data/sample/simple1.csv") 50 | :type :csv 51 | :csv-type-spec '(double-float double-float string) 52 | :missing-values-list '("NA") 53 | ))) 54 | (format nil "~A~%~A~%" ds (dataset-points ds))) 55 | 56 | (let ((ds (pick-and-specialize-data 57 | (read-data-from-file 58 | (clml.utility.data:fetch "https://mmaul.github.io/clml.data/sample/simple1.csv") 59 | :type :csv 60 | :csv-type-spec '(double-float double-float string) 61 | :missing-values-list '("NA") 62 | ) 63 | 
:data-types '(:numeric :numeric :category) 64 | ))) 65 | (format nil "~A~%~A~%~A~%" ds (dataset-numeric-points ds) (dataset-category-points ds)) 66 | ) 67 | 68 | (defparameter ukgas (pick-and-specialize-data (read-data-from-file 69 | (clml.utility.data:fetch "https://mmaul.github.io/clml.data/sample/UKgas.sexp")) 70 | :data-types '(:category :numeric))) 71 | 72 | (copy-dataset ukgas) 73 | 74 | (multiple-value-list (divide-dataset ukgas :divide-ratio '(3 1) :random t)) 75 | 76 | (make-bootstrap-sample-datasets ukgas :number-of-datasets 3) 77 | 78 | (dataset-cleaning ukgas :outlier-types-alist '(("UKgas" . :std-dev)) 79 | :outlier-values-alist '((:std-dev . 1)) 80 | :interp-types-alist '(("UKgas" . :zero))) 81 | 82 | (add-dim ukgas "mpg" :numeric :initial-value 0.0d0) 83 | 84 | (concatenate-datasets ukgas ukgas) 85 | 86 | (write-dataset ukgas "gasgas.csv") 87 | 88 | (!! ukgas "UKgas") 89 | 90 | (dataset-numeric-points ukgas) 91 | 92 | (ql:quickload :clml.r-datasets) 93 | 94 | (use-package :clml.r-datasets) 95 | 96 | (defparameter dd (get-r-dataset-directory)) 97 | 98 | (subseq (inventory dd :stream nil) 0 505) 99 | 100 | (subseq (dataset-documentation dd "datasets" "BOD" :stream nil) 0 200) 101 | 102 | (defparameter bod (get-dataset dd "datasets" "BOD" :csv-type-spec '(double-float double-float double-float))) 103 | 104 | bod 105 | 106 | (pick-and-specialize-data bod :data-types '(:numeric :numeric :numeric)) 107 | -------------------------------------------------------------------------------- /CLML-Time-Series-Part-1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# CLML Time Series Tutorial Part 1" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "> (C) 2015 Mike Maul -- CC-BY-SA 3.0" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "This 
document is the first in a series of tutorials illustrating the use of the CLML.time-series package. In fact, it is the first in a series of series of tutorials illustrating the use of CLML. " 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "### Caveat\n", 29 | "Anyone wishing to run this notebook or the code contained therein must take note of the following:\n", 30 | " - The time series cleaning section of this tutorial relies on the GitHub version of [`CLML` https://github.com/mmaul/clml.git](https://github.com/mmaul/clml.git) or a Quicklisp dist version of `CLML` newer than 20150805\n", 31 | " - The plotting portion of this code requires the system [`clml.extras` https://github.com/mmaul/clml.extras.git](https://github.com/mmaul/clml.extras.git), which is not currently in Quicklisp.\n", 32 | " - While the above git repositories are not in Quicklisp, they can be loaded by `quickload` after placing the repositories in `$HOME/quicklisp/local-projects`" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "## Time Series, what and why \n", 40 | "\n", 41 | "A time series is a set of data points collected over a given period of time. Examples of time series are stock ticker data, sensor data and netflow data. Generally one wants to perform some sort of analysis on a time series, use previous performance to forecast future performance, or use past performance to detect anomalies in new data. \n", 42 | "\n", 43 | "The CLML.time-series system contains functionality to manipulate and analyze time series data. CLML.time-series has a definite opinion on what a time-series is. 
We will see that after we load some data.\n", 44 | "\n", 45 | "Let's get started by loading the systems necessary for this tutorial and creating a namespace to work in.\n" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 1, 51 | "metadata": { 52 | "collapsed": false 53 | }, 54 | "outputs": [ 55 | { 56 | "name": "stdout", 57 | "output_type": "stream", 58 | "text": [ 59 | "To load \"clml.utility\":\n", 60 | " Load 1 ASDF system:\n", 61 | " clml.utility\n", 62 | "\n", 63 | "; Loading \"clml.utility\"\n", 64 | "....\n", 65 | "To load \"clml.hjs\":\n", 66 | " Load 1 ASDF system:\n", 67 | " clml.hjs\n", 68 | "\n", 69 | "; Loading \"clml.hjs\"\n", 70 | "\n", 71 | "To load \"clml.time-series\":\n", 72 | " Load 1 ASDF system:\n", 73 | " clml.time-series\n", 74 | "\n", 75 | "; Loading \"clml.time-series\"\n", 76 | "\n", 77 | "To load \"iolib\":\n", 78 | " Load 1 ASDF system:\n", 79 | " iolib\n", 80 | "\n", 81 | "; Loading \"iolib\"\n", 82 | ".....\n", 83 | "To load \"clml.extras.eazy-gnuplot\":\n", 84 | " Load 1 ASDF system:\n", 85 | " clml.extras.eazy-gnuplot\n", 86 | "\n", 87 | "; Loading \"clml.extras.eazy-gnuplot\"\n", 88 | "\n", 89 | "To load \"eazy-gnuplot\":\n", 90 | " Load 1 ASDF system:\n", 91 | " eazy-gnuplot\n", 92 | "\n", 93 | "; Loading \"eazy-gnuplot\"\n", 94 | "\n" 95 | ] 96 | }, 97 | { 98 | "data": { 99 | "text/plain": [ 100 | "(:CLML.UTILITY :CLML.HJS :CLML.TIME-SERIES :IOLIB :CLML.EXTRAS.EAZY-GNUPLOT\n", 101 | " :EAZY-GNUPLOT)" 102 | ] 103 | }, 104 | "execution_count": 1, 105 | "metadata": {}, 106 | "output_type": "execute_result" 107 | } 108 | ], 109 | "source": [ 110 | "(ql:quickload '(:clml.utility ; Need clml.utility.data to get data from the net\n", 111 | " :clml.hjs ; Need clml.hjs.read-data to poke around the raw dataset\n", 112 | " :clml.time-series ; Need Time Series package obviously\n", 113 | " :iolib\n", 114 | " :clml.extras.eazy-gnuplot\n", 115 | " :eazy-gnuplot\n", 116 | " ))" 117 | ] 118 | }, 119 | { 120 | "cell_type": 
"code", 121 | "execution_count": 2, 122 | "metadata": { 123 | "collapsed": false 124 | }, 125 | "outputs": [ 126 | { 127 | "data": { 128 | "text/plain": [ 129 | "#" 130 | ] 131 | }, 132 | "execution_count": 2, 133 | "metadata": {}, 134 | "output_type": "execute_result" 135 | } 136 | ], 137 | "source": [ 138 | "(defpackage #:time-series-part-2\n", 139 | " (:use #:cl\n", 140 | " #:cl-jupyter-user ; Not needed unless using iPython notebook\n", 141 | " #:clml.time-series.read-data\n", 142 | " #:clml.time-series.anomaly-detection\n", 143 | " #:clml.time-series.exponential-smoothing\n", 144 | " #:clml.extras.eazy-gnuplot)\n", 145 | " (:import-from #:clml.hjs.read-data #:head-points #:!! #:dataset-dimensions)\n", 146 | " (:import-from #:clml.time-series.util #:predict)\n", 147 | " (:import-from #:clml.hjs.read-data #:read-data-from-file)\n", 148 | " )\n" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": 3, 154 | "metadata": { 155 | "collapsed": false 156 | }, 157 | "outputs": [ 158 | { 159 | "data": { 160 | "text/plain": [ 161 | "#" 162 | ] 163 | }, 164 | "execution_count": 3, 165 | "metadata": {}, 166 | "output_type": "execute_result" 167 | } 168 | ], 169 | "source": [ 170 | "(in-package :time-series-part-2)" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "##Data and Datasets\n", 178 | "\n", 179 | "We are going to look at how time series data can be used by CLML.\n", 180 | "First lets get some data...\n" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 4, 186 | "metadata": { 187 | "collapsed": false 188 | }, 189 | "outputs": [ 190 | { 191 | "data": { 192 | "text/plain": [ 193 | "DATASET" 194 | ] 195 | }, 196 | "execution_count": 4, 197 | "metadata": {}, 198 | "output_type": "execute_result" 199 | } 200 | ], 201 | "source": [ 202 | "(defparameter dataset (read-data-from-file \n", 203 | " (clml.utility.data:fetch \n", 204 | " 
\"https://mmaul.github.io/clml.data/sample/msi-access-stat/access-log-stat.sexp\")))" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "CLML's main unit of currency in working with data is the dataset. Datasets are a hierarchy of classes that contain datapoints and metadata. They are similar to dataframes in R or pandas DataFrames in Python." 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 5, 217 | "metadata": { 218 | "collapsed": false 219 | }, 220 | "outputs": [ 221 | { 222 | "data": { 223 | "text/plain": [ 224 | "#\n", 225 | "DIMENSIONS: date/time | hits\n", 226 | "TYPES: UNKNOWN | UNKNOWN\n", 227 | "NUMBER OF DIMENSIONS: 2\n", 228 | "DATA POINTS: 9068 POINTS\n" 229 | ] 230 | }, 231 | "execution_count": 5, 232 | "metadata": {}, 233 | "output_type": "execute_result" 234 | } 235 | ], 236 | "source": [ 237 | "dataset" 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "metadata": {}, 243 | "source": [ 244 | "CLML has a number of different specializations of dataset, such as\n", 245 | " - `unspecialized-dataset` untyped and unspecialized data\n", 246 | " - `numeric-dataset` dataset containing numeric (`double-float`) data\n", 247 | " - `category-dataset` dataset for categorical (`string`) data\n", 248 | " - `numeric-and-category-dataset` dataset containing a mixture of numeric and categorical data\n", 249 | "\n", 250 | "Most relevant to this tutorial\n", 251 | " - `time-series-dataset` dataset containing time-series data\n", 252 | " \n", 253 | "Datasets can be created directly or by reading them from a file. Supported data formats are CSV and SEXP.\n", 254 | "In this case the `read-data-from-file` function is reading a dataset from a file. 
The file in this case is obtained with the `fetch` function, which downloads and caches a file from a location on the local file system or from a URL.\n", 255 | "\n", 256 | "Let's take a look at the data; it is apparently from a hit counter.\n", 257 | "`head-points` gives us the first 5 rows of a dataset (if we wanted all the rows in a dataset we would have used `dataset-points`)." 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": 6, 263 | "metadata": { 264 | "collapsed": false 265 | }, 266 | "outputs": [ 267 | { 268 | "data": { 269 | "text/plain": [ 270 | "#(#(\"12/May/2008 03:00-03:59\" 210) #(\"12/May/2008 04:00-04:59\" 265)\n", 271 | " #(\"12/May/2008 05:00-05:59\" 219) #(\"12/May/2008 06:00-06:59\" 284)\n", 272 | " #(\"12/May/2008 07:00-07:59\" 287))" 273 | ] 274 | }, 275 | "execution_count": 6, 276 | "metadata": {}, 277 | "output_type": "execute_result" 278 | } 279 | ], 280 | "source": [ 281 | "(head-points dataset)" 282 | ] 283 | }, 284 | { 285 | "cell_type": "markdown", 286 | "metadata": { 287 | "collapsed": false 288 | }, 289 | "source": [ 290 | "Examining the data, it looks like the hits were collected hourly. A `time-series-dataset` can be created with the `time-series-data` function. \n", 291 | "\n", 292 | "Now let's turn this into a `time-series-dataset` so we can do stuff with it." 
293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": 7, 298 | "metadata": { 299 | "collapsed": false 300 | }, 301 | "outputs": [ 302 | { 303 | "data": { 304 | "text/plain": [ 305 | "MSI-ACCESS" 306 | ] 307 | }, 308 | "execution_count": 7, 309 | "metadata": {}, 310 | "output_type": "execute_result" 311 | } 312 | ], 313 | "source": [ 314 | "(defparameter msi-access (time-series-data dataset :range '(1) :time-label 0 :frequency 24 :start '(18 3)))" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": 8, 320 | "metadata": { 321 | "collapsed": false 322 | }, 323 | "outputs": [ 324 | { 325 | "data": { 326 | "text/plain": [ 327 | "#\n", 328 | "DIMENSIONS: hits\n", 329 | "TYPES: NUMERIC\n", 330 | "NUMBER OF DIMENSIONS: 1\n", 331 | "FREQUENCY: 24\n", 332 | "START: (18 3)\n", 333 | "END: (395 22)\n", 334 | "POINTS: 9068\n", 335 | "TIME-LABEL: date/time\n" 336 | ] 337 | }, 338 | "execution_count": 8, 339 | "metadata": {}, 340 | "output_type": "execute_result" 341 | } 342 | ], 343 | "source": [ 344 | "msi-access" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "metadata": {}, 350 | "source": [ 351 | "This is the point where we will talk about CLML.time-series's definite opinions about time series. Time series in CLML.time-series are [discrete](https://en.wikipedia.org/wiki/Discrete_time_and_continuous_time). In `CLML.time-series`'s opinion, time series have a regular frequency. (This implies that time series data must have a reading at each period. However, `CLML.time-series` does support missing values, which will be covered in a later part of this series.) The representation of frequency is important, especially when comparing time-series points at regular intervals. The `FREQUENCY` slot specifies the number of datapoints per cycle. The `START` slot indicates the starting time index and frequency interval. The measurements are contained in the `points` slot and are represented as a vector of `ts-point` objects. 
Another useful thing to know is that the slot accessor prefix is `ts-`.\n", 352 | "\n", 353 | "In fact, if you look at the raw dataset behind the time series we just created, you will see there are no time specifiers in the data (there are labels, but they are not used in computations). This can actually be very important if your time series has literally astronomical ranges. Some time-series libraries and databases encode the index as seconds or milliseconds from some fixed point in time. Doing that restricts the times the series can represent to the range of the datatype used to encode the time index. To be fair, CLML.time-series is in effect doing the same thing; however, the time index is relative, and time indices can range from 0 to `most-positive-fixnum` (about 4.6e18 in 64-bit SBCL). Given that a datapoint is defined by the time and the frequency interval (which also ranges from 0 to `most-positive-fixnum`), the number of theoretically possible datapoints in a time series is `most-positive-fixnum` squared (in 64-bit SBCL, about 2.1e37).\n", 354 | "\n", 355 | "Let's look at the points in the dataset to see how they are represented." 
356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": 9, 361 | "metadata": { 362 | "collapsed": false 363 | }, 364 | "outputs": [ 365 | { 366 | "data": { 367 | "text/plain": [ 368 | "#(#S(CLML.TIME-SERIES.READ-DATA::TS-POINT\n", 369 | " :TIME 18\n", 370 | " :FREQ 3\n", 371 | " :LABEL \"12/May/2008 03:00-03:59\"\n", 372 | " :POS #(210.0d0))\n", 373 | " #S(CLML.TIME-SERIES.READ-DATA::TS-POINT\n", 374 | " :TIME 18\n", 375 | " :FREQ 4\n", 376 | " :LABEL \"12/May/2008 04:00-04:59\"\n", 377 | " :POS #(265.0d0))\n", 378 | " #S(CLML.TIME-SERIES.READ-DATA::TS-POINT\n", 379 | " :TIME 18\n", 380 | " :FREQ 5\n", 381 | " :LABEL \"12/May/2008 05:00-05:59\"\n", 382 | " :POS #(219.0d0))\n", 383 | " #S(CLML.TIME-SERIES.READ-DATA::TS-POINT\n", 384 | " :TIME 18\n", 385 | " :FREQ 6\n", 386 | " :LABEL \"12/May/2008 06:00-06:59\"\n", 387 | " :POS #(284.0d0))\n", 388 | " #S(CLML.TIME-SERIES.READ-DATA::TS-POINT\n", 389 | " :TIME 18\n", 390 | " :FREQ 7\n", 391 | " :LABEL \"12/May/2008 07:00-07:59\"\n", 392 | " :POS #(287.0d0)))" 393 | ] 394 | }, 395 | "execution_count": 9, 396 | "metadata": {}, 397 | "output_type": "execute_result" 398 | } 399 | ], 400 | "source": [ 401 | "(subseq (ts-points msi-access) 0 5)" 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "metadata": {}, 407 | "source": [ 408 | "The `ts-point` structure encodes each measurement, maintaining the time and frequency interval, a label (which is just a string), and the actual measurements. The measurements in the `pos` slot are stored in a vector of arbitrary length. Looking back to **IN[8]**, where the series has a start time of 18 and a start frequency interval of 3, we can see by examining the `ts-point`s how this is actually represented. 
Another useful thing to know is that the accessor prefix of `ts-point` is `ts-s-`\n", 409 | "\n", 410 | "A `time-series-dataset` can also be created programmatically.\n", 411 | "Some examples are:\n", 412 | "\n", 413 | " (make-constant-time-series-data '(\"a\") (vector (clml.hjs.meta:make-dvec 1)))\n", 414 | " (make-constant-time-series-data '(\"price\") (vector (v2dvec #(43.2d0)) (v2dvec #(44.0d0)) (v2dvec #(1049.0d0))))\n", 415 | "\n", 416 | "## Plotting Datasets \n", 417 | "\n", 418 | "Let's plot our data with `clml.extras.eazy-gnuplot:plot-dataset`. \n", 419 | "\n", 420 | "Some quick things to note about the `plot-dataset` method:\n", 421 | " - The only required arguments are `dataset` and `y-col`\n", 422 | " - The terminal defaults to `:wxt :persist`, so for use in a notebook we specify a **PNG** terminal\n", 423 | " - The `svg` function is used to render the plot in the notebook\n", 424 | " - `eazy-gnuplot` is used as the plotting library, so all plotting arguments follow gnuplot and `eazy-gnuplot`'s conventions\n", 425 | " - The `:range` argument specifies the start and end of the points to display\n", 426 | " - The `:frequencies` argument is a list of frequencies to plot, handy for observing behavior over specific intervals.\n" 427 | ] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "execution_count": 10, 432 | "metadata": { 433 | "collapsed": false 434 | }, 435 | "outputs": [ 436 | { 437 | "data": { 438 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAMAAAACDyzWAAABR1BMVEX///8AAACgoKD/AAAAwAAAgP/AAP8A7u7AQADIyABBaeH/wCAAgEDAgP8wYICLAABAgAD/gP9//9SlKir//wBA4NAAAAAaGhozMzNNTU1mZmZ/f3+ZmZmzs7PAwMDMzMzl5eX////wMjKQ7pCt2ObwVfDg///u3YL/tsGv7u7/1wAA/wAAZAAA/38iiyIui1cAAP8AAIsZGXAAAIAAAM2HzusA////AP8AztH/FJP/f1DwgID/RQD6gHLplnrw5oy9t2u4hgv19dyggCD/pQDugu6UANPdoN2QUEBVay+AFACAFBSAQBSAQICAYMCAYP+AgAD/gED/oED/oGD/oHD/wMD//4D//8DNt57w//Cgts3B/8HNwLB8/0Cg/yC+vr7f398/Pz9fX18fHx+/v7+fn58nJydvb29TU1MvLy/Dw8M3NzcAALvZ4hOXAAAACXBIWXMAAA7EAAAOxAGVKw4bAAAX+UlEQVR4nO2d7aKrqpJF9Tl8ORTte3u37/+7xY/EFUGRKBRmjHN2kmVRWFRmVFC0KAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABgB6XLoii/r+eCKiAFZTd/6MxX2Na6LOvu7/epQ6r1LtjaltbrP8bIik6XurOVPfAG2ZSLvOrha+7qQQ6qG77olYA6XQVU+13BtebHbWTR6nb850PILwYSUTbTZqVrhq9Zq/Gzata60O38hQ7boFr9fR+3SaouxwXL+1jty/215RrdPvTWlgNT6WE15VLFsLB5lalbYx7D7JalZVu/ax0/lFMNZv2T9zoakEs5y0u341f4Wvwq0NVFPX7V7fB1Dn+835dtkjGbBcv7H/9uKKFMKfOh05sNXrm8lI16V7EqVk2bYz1Yh4oWr9qsu1tVb2Q8BFbVS53raEAuZTFuQDptvrZWN52aFy+MMjMf6lmdr/dlm1S+qlpXW6xKt/X8odsRYFvYqmpnOZZ/lo9Hrq9a5zKmBlWu6wT5lEZ7RTEf9qmu0WbrstqCmY3IqDWtpiXLe2nezTapmVW7vE/VFutSy+79cxe8FuC6ilcxZdwsAlzW/aq+dFUFsilH8RkRLl/u307IuH0cN4GfW6dyYvhUzQeFy3uxFuDyWn46r+xvhc1VvNybqrAKcFPrRoDraEAu5bj71X9GXpR+/dHNMuu2W8A/PdW2/vO+ErAq5m3VaD/YAi5VvPU7UX0eAy5/rau3VwWyGQ/wq3kLNx/erQRYz33k2nIM2H7Ws3p3HANWxwJcbdjWZcZecPXuBReWY0BrVSCbcRdazju6Yeerhu3Gexe8jMBMXZGl9/vuBauq9ugFT/1V03ne6wWbD0sVeiNuNUcwL6nnHvir+rUAR296wXkwfbnLp88zIfXrPEltHQccx07aecStXY8Dvg4Qd8cBPwX4qkrrzzLtUM97lMg2DviqavRuGQeENaN41EUnKdi1wlnqati7v7aoX4IA4Syq0WXjdy73GAQIAAAAAAAAAAAAAAAAAAAAAAAAAAAA8DTMrU/q8eZMqjZTsqZXgEjUyz1QiqZ5/w8Qh2q8xdn4USszBXF6BYjCdF/HctztTlOt/94nAuBWmtaorWrNXRKdAizh57lLgGPlavzk3gWzQfx57pRAOd5OZdgCNtXYA6k2nRAE+PPcLMDpTtnOYRgE+POklQAC/HkQICQFAUJSECAkBQFCUhAgJAUBQlIQICQFAUJSECAkBQFCUhAgJAUBQlIQICQFAUJSECAk5UYJmHnB60tRuSD1d+mdlhslYOYFr2cEW+YFI8DfoE8hwGle8Go6kmZS0q/SuzeBt0lgnhdcvCdk2qZl3rV2kES/sw++TQLTvOAjAd45MRSE0BcOAUaYF6zZBYNTgIabp2WuZwQzL/g36VevW24WIMMwkE6AxyDA59P/eduAAOFW+s2HvyBAuBUECCnpLZ/+gADhRnrrxzU
IEG6kd3x+gwDhPnrnHy8QINxGv/PXAgKE2+h3/5xAgHAXn4JDgBAVBAgp2egNAUJELHKzKRABwj0gQEiJTWwIEKJhPeCzLESAcAf2UeeYAmzrUndHD6xGgE8lvQAH9bX66IHVCPCppBdgUYwCHD9oZsX9GK4L8LfLb5RAqdXRA6sR4EMRIcCiq70eWH1jBHAJ7ju7nPX4u/zWr18Vi9x4YHXunBag0yHiFnDohAxbQB5Y/QSuE+DWcuswTMsDq5/Azr2tnB7eFgai4QgECEk5LcC94p82BAhHIEBICgKElOzd39Tl4G9EgHDA7v0lXQ7eVgQIByBASMpZAR6URYBwjosF+Hk++Gw4l4IAMwABQkr618uJ8v4FECDss3+PcVd57wIIEPa5XIB/SyBA2OekAD0KIkA4AQKElBw85sNV3rvMzfOCeVJS7pwUoFe5KAKc5gXzwOrc6T/ePYt7F7p7XjBPy8ycjAU4zgsueGB13pwT4HmZ3jwvOLcHVp+fAftwjp605SruVSzCvGCd1y74/PSbp3OrAA03zwvO7IHVCPCT3vLJq7hvuZvnBec1DHP60svnc06AAaOFDES/OXvdxy+AAONxZsj1Z2R6kwDfJRHgizMDDr8iwN760aO0d1EEuHCqv4cAD0p7F0WAM6dS/TO95d7x+bi0b1EEONE7/7CXRoD7hb1rRoATCNDKGQH2CDCcfvdPS/HfEOCJn+U5+SHAv3wmDwFO+AqwPyu/d3UI0LBJ30E+EeB6eYD6CgS4xpLA3Zz+zCm7fuevZWFoIhDgC1sKEaDhUIDB8ntVhwAdUtrL7K8I8Lhr9k0WEOCMI4l7x9z75sdwKMCvkoAAJ5zH1vsuCNC65HTtCNCZxAPDDyjwXgFO3jdfkJrB84IRoJOj4dEvU3CvAM0l+Tk8L/hgT+te/nwBHo6OihbggMrhecEI0MmRAL/OgKnAUwJVPbw0pa7O1N9UGTwv+Gi4xbkYAUYUYGembzZmFmfnX30zlJX/vOBgAT5fgfcL0Pvrr00fotRKNbV37fWiVdnPCz4vwH7X+iAOBPh9+/23gKUyW8HOu3wxbf8K+c8LPjzl5l76ewI8edWk1xo8BTj8q40Klbdkxr2rEv+84IOL3PYWPl2AR2feYgpw6Hy05bD3VbX/LtgHBCiYAwFe0XxvAVZz/0Of6YR4gAAFE0GAQyWeEphHYJpTwzDH5CjAftf6IEQJ8CaSC/Aoi7tfwu8J8PLG+wrwXUrvlDoPApTL7uHHZY3vTwqwvVYyCFAu+wK8qu1eAixXsAVEgE5ryFp8JFDVb/21F614IrUAj9O4Oxj7aAUKEuB9QslQgJd3BKUSaQjKTwLXbvfOrv0+EKCTWBdi/PYwDAJ0IkeA5oKZVS/ksjX7rf1ePPL4WQQBuo0h/LQAffK4J8AnK1COAO8kPwFefUmcWBBgBBCgmz0BXtlsBHiu0K8I0Nmyfs8YAAI8VwgB7hkD8L0cS5/thGTwpCS/PCLAT8OlrfaTQHO+FzxNTBf9wOrvBfhcBQoTYNmogLqV8AdWeyZyb+TvFwV4bZvvPBdsJqYXQh5YvX91pbfrrwjQ3a40AqxDTgabiZlHAow1Md16J08E6GSnXd/cFPUD/69f1Z06W/s4MV3L2AVbD1x8E4kAfU0h+JyK+4NvxdPEdBEPrO7tQwfemdy7AvChCnyAAKeJ6RKGYVwXcCBAJxFb9fyBaOc1lAEC/GIzmhUI8Dr6zQfngsMqfkWAMRv1dAG6d58nsuwWMQL8locLcKcHiwBdRG3TowX4MWQVfB53R4BPVCACvIj9QbsAAX4xmp0RcVv0YAEe9DrO5Nk1knOymjxAgNdwsLs8leZfEmDkBiHAE3UhwOt5rACtedw7p+FR2VGdTyB2cxDgidoQ4PUgQP/a7C7PEmD01jxVgI5E7o3oHVaHAG8AAfpX9wMCjN+YhwrQlchQARb2SwqDqpIMArwIZyL3TmocVfh8ASZoyn0CVHVZJHt
g9ZEAT2e699iq5s+jBFibG5prZT5Gnxd8eEU5ArSRoiV37oLLItUDq68XYOGejogAv+FmASZ6YDUCDCFJQ24WYJoHVu9mcmdEZd/tsGOTO/HHAG+eFl4ubzryLvhQgAGpfr4A0zTjZgEmeWD1QSZ3+hNhtSLAL7hPgOP2NckDqxFgCIla8cSB6NgCfIYCEeBVHGby6hs8IcAv+EkBXp3rJwgwVRt+UYDXJ/sBCkSAV+GRSQS4BQFeRZJMZq/AZA14nADTZBIBhoIAM17rhSDAq0iUydwViAAvIlUiMxdguvARYN7rvQgEeBXJMpm3AhHgRaRL5EVr7vvLz9P4rDX+KhcQoJQ19/0ivviNQIAXkXI/+M26/273EGC2a89UgJ+usZuRMm03zwuO/KCapD2B4JVvHBHgFYzzguM+LzhtT9TjOkTf5ZEb8kwBTvOCoz6sULYAXdfBWpYiwKvqjvq84NRDcfvrd83Gs3pFbUrSvCUW4JUTQ1Prz2M6svfe9jcEGGFesI62C04xfvsZgofNt8MRszVP3gJGe15wevkVfjM3e5fBt6rLSZu6m+cFxxqGEaE/v3sn9C6DZ12X81QBxly7DP15bs78znkgwIzWLkR/ngd0nifdYjUqcfIeIEAB3Y8Fvy5t7zJ41HU9CPBL5MjPe0yv3yl9UNn1IMDvkKQ/77MaXlddIcAc1i5Lf6cGmoPPHV9K6gRmLsDU6dvge2WVzw2SEKD8tadO3wbPcWa/yGO0LnUG8xZg6uxtWY+xfD29I0LzkmcQAV7MeHTXXzS36P72Jc9g1gJMnj0Ll05rQ4Cy1548e7dzdwvTZzBnAabP3u0gQMlrT5+9+7m5jelTmLEA0ycvAghQ7trTJy8Gt7ZSQArzFaCA5MUAAX7FjQ+sFpC8KNzZTgE5vFmAWpnXOyamC8hdHAIa6usiIYd3C3B6vWFWnITkxeF0S72fBCUhhzcL8LYHVkvIXSzOtrX3dBGRw5sFeNsDq0UkLxIn2+p1tev5am/g7onpy2pu2AWnT15MTrXW73p/MSm8uxd80wOrhWQvFmea23+8X1HnjdwswJseWC0kedE40d7e8unLKm8lz4FoKdmLhneD1wX3rgsTk8EsBSgme/EIG9pzesnJIALMBL82e86JEpTAHAUoKH3xCBzZ85wqnw4EmAuBI3ueU+WTkaEAReUvIoETiTcLZeUvBwH2H9wclFRCh5Y/MyYrf2IEaNfWLwtuw0Ei9gZdVlkUlk4RAnznZy1DtPfJV0PLczqlpTS5AJ3bPWmZEsBuSrzu9SEvq4kFKC8hovk8Gl7fCCRdVF+RfAsI35DyGa/XgAAfQM4HLAgQkoIAISkIEJISTwKW61ERIMSTgGVaMAKEeBLQ2zlJCBDiScAyKxMBQmIBws8TTYDasgsGiIZlWjBAPGzDMAAAAAAAAJFRAZZ4ThJiyDbwoOqiU2nX0IzbEs9JQgzZBh5UXXxarU5b4jlJiCHbwIOqi02r29OWeE4SYsg28KDqYmMiaZvqlCWek4QYJlPtdDppiecUVF1UVK3GX4LqmtrXEs9JQgzZBh5UXXwqvWyJP7fIbks8JwkxZBt4UHXxqbQa35UlSpclnpOEGIxpsFXN9qgpxBLPKai6+FT/Uxv0cDhQ+VriOUmIoajq0lzH1nRbpwBLPKeg6uLzH63G91Z3na8lnpOEGApVThdSbk0hlnhOQdXFZ9r7NI2qSuVrieckIYbBNPzncDpviecUVF18hsy3dWtC8rfEc5IQQ9E2rcMUYonnFFRdfDoz/FXpbntE6rbEc5IQw3g1ucMUYonnFFRddNT482/mj36WeE4SYsg28KDqEjBEYo5G26b8PCZ1W+I5SYgh28CDqotOp4paVU2nN5G4LfGcJMSQbeBB1SWg03WrrJG4LfGcJMSQbeBB1UVn2BabSNS2U+S2xHOSEEO2gQdVl4Dpl7CclPKzxHOSEEO2gQdVF53h1zBskVW1vUTCbYnnJCGGbAM
Pqi4Ftep0W9p65W5LPCcJMWQbeFB1sVFaFV1ZddvpAm5LPCcJMWQbeFB18enGM4S17fyB0xLPSUIM2QYeVF0Cht9Baw4H1OY6HbclnpOEGLINPKi66HS1+Sl0utkMDrkt8ZwkxJBt4EHVRcdcJlbp5r+N+pyh47bEc5IQQ7aBB1UXG6WbWlfmOp3POStuSzwnCTFkG3hQdfFpu7Z4/Qj+TBlwW+I5SYgh28CDqkuCWt43obgt8ZwkxJBt4EHVJaCbZ+ioansk67LEc5IQQ7aBB1UXnfnUYKeruvG1xHOSEEO2gQdVF58xlvHFOnfHaonnJCGGbAMPqi4+ZoaOVkWxfa6D2xLPSUIM2QYeVF18uuncjO42RwRuSzwnCTFkG3hQdQnQqlB1ZTsicFviOUmIIdvAg6qLjjkzU1mPCNyWeE4SYsg28KDq4tN2riMCtyWek4QYsg08qLoUzEcE3fZehm5LPCcJMWQbeFB10dHjEYHS7eYModsSz0lCDNkGHlRddMYjgrart2cI3ZZ4ThJiyDbwoOriMxwRmAtmLfcydFviOUmIIdvAg6pLwdAbmu5leMISz0lCDNkGHlRddMZQTlriOUmIIdvAg6qLznQvw4HN6Ljb8qXT69qgbyxfOm2uDXZbglqbTfLS005bYsvouNvylZNpvt3plOU7p+21wW5LUGtzSZ4U3KPjO+PmQU4GCQ/H2ukD2i0hrc0geTJwj44vFst3spi2v6WDwfbQB1M5bMZk/z3v9fPc1wbbLccpclsEJ08I8+i45flOi6XcjlrOpmI7oLkMtrsfjrU9LFksOw+mssSwVGcdVB2VuY1huTZ4G8NiscWwtHYbeM7Jk4IeR8dtxz96vn5CuYbUC8u3vzi5nyS1PSw5fDCVLYZXdZsYXk7bGF43i9rEsFhss8f03Nrt8ZTOOHmfYaViHB0fP9mH1M1dXpvPxL+dPts2WUxV5eeaXk+S2hyWHDyYyh7D2+kzhsniiGFetSWGZYH9BILVKefkfa4hGW2nVGF9vtNoaU2ylMVp+tFtNvyjxeyvrNfljiXOPGNqtNhjGKuzxjBarothapPVKePkbdaQEPfo+BCm2vwaZypdKdtyVQ+Ntt4MYrxZnfWAefWMqY3FGcNQnSOG6v+cMYyLrDF0B2cJHEf6uSbPEVgS9ofU7TlUzuHM6XFRVuu/14Gb3clpscfwzxnDP3cMyhnDTh6CnGQnTxLL6LjZlP+lc1penfutqZp/whan5YDF5iQ5hmwD34lBEm0ztqzbzllpK5dlwWJSrXI5LXNULWsSHUO2gR/HIAn3nJWd2SznnOabFoesKW0M2QZ+GIMcopxYUqFrSh1DkNO+RUQMQhi7X8p2JshtieckIYZsA9+pThLKOXfZbYnnJCGGbAPfqU4UWj1u0rUEJ+ExCOKBk64lOAmPQRIPnHQtwUl4DLJYrhU6Y4nnJCGGbAPfqU4QWs3XCp2wxHOSEEO2ge9UJ4jpYMFcX7a5+ZfT8qXTNLj1peVLp818JLclqLXCkycJc0VQq1vbvBmn5Ssn25SaAMt3TpaZSk5LUGuFJ08a7f9qZe8wuS2BTiYV1okzIZYvnKzzkdyWoNYKT54gsp5sw0wlH4uYS/Kt5DzZhplKb8te8mw+YtAq38k2zFTyS95nBKLIebLNXJiZSvbA38kTTcaTbVwxMFNpZWnF3B/GzTjPIMPJNsxUmi0ByRPFPNHF/FwsltJqWabTbE3LZBuL0zLZxuXktNhj+OeM4Z87BuWMYScPQU6CkieceaJLt+20dU7L0lSLaZ5sY3OaD1isTpJjyDbwTtAt2nZYJrpsO23LRBdbd27CYlom21iclsk2ljWJjiHbwOcYssDeadu3nHaaDmUC1pQ6hmwDz4Ph4NfRaXNbwpz+BThJiCHbwJXoO2a9UEunbTMO77bEc5IQQ7aBqyIPpg6TbRzebYnnJCGGfAPPgqnDZBuHd1viOUmIId/A82DsMFnH4d2WeE4SYsg
38Exwj8O7LfGcJMSQb+BZ4L5j2M69xKI5SYgh38DzoHL+fNyWeE4SYsg38DyYx+FPWeI5SYgh38ABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIlCWh0W09i4KcJJjVbVl5VsU4Cyfqqo2KqvK1rYY4Ao+BVhvlDYu2S4G+I62LstajQLsaj18bs3u1lAUqtGlbpQppspmXjwWLbUa/QbvpcC7LIA3qtRt0Wqjqm4QVDHoqiuWLaIePnfl2PnoXosnARqhNmWtzWvzURbAm2YUVmVUVZeqGBVZzCqrxn7H9DoZVwLsTNHpVX+UBfBGz6p7H9zNAiuM6Ibd8bDjrc2CerHNdrV6/SgL4M2svPGtHQ7jlmO8cmXTr0GYlQCLv6+rsgD+rAQ4HMF1aq2nsnz1OppxW7crwHcPBcCblQD1uBdVmy2gYT4NcrgFBDjH6hhw0lC7PgZUc6l26unuCPBdFuAEq17wpMX6LbCpT9vqZjoNUuwJ8F0W4ASq1GoeB6yGzZxqGjOgN++OTde2NX8vZ0DMYrsA32UBzjCdCdFGROZcRmVUpKeXQZPTqRE174HHxQ4BLmUBAAAAAAAAAAAAAADgefw/GyXbJDjgXDgAAAAASUVORK5CYII=", 439 | "text/plain": [ 440 | "#" 441 | ] 442 | }, 443 | "execution_count": 10, 444 | "metadata": {}, 445 | "output_type": "execute_result" 446 | } 447 | ], 448 | "source": [ 449 | "(progn\n", 450 | " (plot-dataset msi-access \"hits\" :terminal '(:png)\n", 451 | " :range '(0 40) :title \"MSI Access Log - first 40 points\" :ytics-font \",8\" :xtics-font \",8\"\n", 452 | " :xlabel-font \",15\" :ylabel-font \",15\" :output \"msi_access_log_40.png\")\n", 453 | " (display-png (png-from-file \"msi_access_log_40.png\")))" 454 | ] 455 | }, 456 | { 457 | "cell_type": "markdown", 458 | "metadata": {}, 459 | "source": [ 460 | "Now lets look at the whole dataset. Since each `ts-point` has a label our x axis would get overwhelmed with labels, we use the `:xtic-interval` to specify that we only want labels displayed every 500 points." 
461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": 13, 466 | "metadata": { 467 | "collapsed": false 468 | }, 469 | "outputs": [ 470 | { 471 | "data": { 472 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAMAAAACDyzWAAABZVBMVEX///8AAACgoKD/AAAAwAAAgP/AAP8A7u7AQADIyABBaeH/wCAAgEDAgP8wYICLAABAgAD/gP9//9SlKir//wBA4NAAAAAaGhozMzNNTU1mZmZ/f3+ZmZmzs7PAwMDMzMzl5eX////wMjKQ7pCt2ObwVfDg///u3YL/tsGv7u7/1wAA/wAAZAAA/38iiyIui1cAAP8AAIsZGXAAAIAAAM2HzusA////AP8AztH/FJP/f1DwgID/RQD6gHLplnrw5oy9t2u4hgv19dyggCD/pQDugu6UANPdoN2QUEBVay+AFACAFBSAQBSAQICAYMCAYP+AgAD/gED/oED/oGD/oHD/wMD//4D//8DNt57w//Cgts3B/8HNwLB8/0Cg/yC+vr7f398/Pz9fX18fHx+/v7+fn58nJydvb29TU1MvLy8XFxfDw8M3Nzenp6c7OzsHBwcTExNHR0cPDw9PT08LCwtjY2MAALuS6J12AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAgAElEQVR4nO2diYK0KpKF8Tny5XCb29N/9yw9Mz7/uIGoEQck0yWrzrn3r8oyWILgE9EUNYaiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIr6sIp2/tAW/Y+mrIui7DcVQZI6p9j3PaN+hQqHV9kz05aNMbat2xCgtn5lFPsR56ifr6KahsC26pmp7fjZViFAdTMz2tZFade/6yGzLYtxg/s9Fuuzu1RTNoJJrVXMeNXNAGDjN/sEbWnKEaCmh6v/Y/ndpx7+jeZhg/u9yt/2KeyQavjQM3hNq6ivUWHG8amtB2aaumrtvNlpxGz4UM50+t9jvmpJG8LlP4+pm3L+0BJAaq1iYM+Yedpn26o/UDbhCDYMaSNrtZ22uN/F8Nv2mauZWvd7KtaEqdzhnYdgaqNihG+A0LGxPgkZx8dxCNwOdMWk/tNrnhS63yYE0P0s1tspalQxHn7r1ZUXW/s/2hmzdj8ChldnbFOufgcA26nAaQtHQGqjgYj6NY9w8/QuALCcz5FLYQ7YbMsJfitzwBcBpNYaD6HFy8xjYT+Ls81yCHZXYKZTEXf2u5wF21eZcBY8FNhbLM+CqZ3GU4/afdp+E1L670lK8TpgUdkh03T9rwmvA/oJIq8DUo/QeMC2OV/rUdQHVL76o7sfUSnqYtmqLqomno6iKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIr6ZjV1MSw5LIc1XuFPirpE0/N4qmr7P0VdognA2g6rCMOfFHWJbFHUdlybXax+UtQlKtvhSQARAAvq1+ssACfcanwI5oD463UaAvU4Alav8dwj+HlN7dS36DQEmnJ4SmjkMgwB/PW6FwEC+OtFAKlbRQCpW0UAqVtFAKlbRQCpW0UAqVtFAKlbRQCpW0UAqVtFAKlbRQCpW0UAqVtFAKlbRQCpW0UAqVtFAKlbRQCpW/UTAexOKZU6ReetihuX3N2yJoQAfpHOHAGb3XM5rlkVRwC/SCcCaEtzz7pgAvhFOhHAqnXL0/UnI5xSMQH8Ip0H4DAAJj2a4+M1E8Av0amP5jCvfgDkIZiK6DwAa9v/uOXRHATwi3QegGPJvAxDYfFCNHWrCCB1qwggdasIIHWrCCB1qwggdasIIHWrCCB1qwggdasIIHWrCCB1qwggdasIIHWrCCB1qwggdasIIHWrCCB1qwggdasIIHWrCCB1q84DsCrqlouSqI
hOA/D1ss2Lz4ahIjoNwNrOP7kwnQI67/Fsr6Ju+GwYKqITATRtzWfDUECnPhumtgNuNQ/BFNRpAFb9CFjy2TBUROddhtldgOFlGGqvn3shmhh+hQjgWbqz/rvbfkAE8CwRwCQRwLNEAJNEAM8SAUwSATxLXwHg3UEigOeJACaJAJ4lApgkAniWCGCSCOBZIoBJIoBniQAmiQCepQ/Vn1XMcwHcVUgAzxIBTKrwpwHYGQJIAJNFAM8phgAmigCeUwwBTBQBPKcYApgoAnhOMTBTp3y+RATwMon1H3aqI4CZKouiqK5fE0IAo9X8EgBrO/y8/NEcBDBazW8BcPp59bpgAhit5pcAWIwH3MsfzXEygAcKJYCxyked93SsxrzK6x/N8d0A7s0/GsBTH80xVvDjDsEE8E1deBbcDiPg5Y/mGAHs3KdTin8nKQHcbjgNwKa+5dEcnwRQKIEAvqnfcSGaAKYazwUwJYBfB2BCF/5GADvwF0hKAA/rCwHspI0iWwTwYhHAWPGHAYy3nQC+VfuXA6h8lovv0gBcLj0ZAnhMjwBQTUgA39SXAHgsCPcB2O03pRULAVyVuKuxexqAhzIQwPdKPAIgGs2iAHZCyunPHYCSA+8AeJRAAvgBABP6VEv3GwBMzhwVAUwq4jsA7HRnfwOAr9KMrx98Hak/rgcC2MF03wIgitI3AtgOd81Uw80z7REHoiKABDCppHK4l6Cora3KIw5EJQGYQBjYMgEIWSGAsj+KEweMCYkzASzsMAq2yelT9dMB7HZFPh3A7qEAmmGZmzX9MHjEgaRynwGgxAABxJmj+hyA/clHU/RHX1t+zSEY9MFnAFQHzmVLFwew29MTFkEAJ73m84/6gpOQHw5gF3z2P9QaCeCs+QpMdcFlmEjsCKBU048H8CQRwMlXtcZUALt1ET8OwCVVDVJt1BYmaU2IBqDs/vMBXFB5AoA4Xl8HYHNgxCyLtEdzEECfjACKKgKlj4CvAdY6ti5YirrmaxgejYR8ACMFTyV+EMCOAA5KAfBVLvw1qZXbaig7+miOtwDc9a8MoBDk3Z/XAdgtH72vPsPWsxQA1639FgDdh6NzwGRVTQqARdf/H3FxbXsKgJuueyiA3cawLeJjAIrbhRCtvEh/NEfyuLdoLNzWpx6CcwEUA/MoAN2WGwCUmFT+iG2PADh+PvMyTGHij+boto5pALpUEMDuUQB2BHD3VwaAwyAZnIVE0wc5Ey7DfDmAkrMpAHbHAdyVtfKLAGbpMQB2PwfALnQ1uC9DdHVPm7BJ/SPYdBRA7+Hd34SsAfTRMkZzXwAwRO1rAOzuAzBsxOcBVALYLelWHhLA9wDs9mk7V6QHcNXjEoCbxmUD6D51y4dsAOV+IIAYwG2nCgXtov4mgJuhkADKiQngJtG65K8CUPJrB2B3EMDOb3kkgFV9yUnIewB2ZwG41HMDgJ6Mxdlte8LZpgDgdrr6hQBWF5wFO5duBbAT43cugN3XASj3zD755wAsKpuU7qjeAdB/FgHsrgZwVV0cwE4GcGmKb9AhAH2pjwXQR+cYgEmpjutBAPqRaVVbWMCXArgbo08CsNsn/xyAZcaXwSk6DuBysDkdwG3nRAAMUy8ArgtKA3Bp2UcA9E0LPF9w86FbOygDexuAtmxtUsKDwgCGCAVTJ7flJAB9PD8L4EKI+70DcDcZfDCAW8w2AC7OLqlyASzWiqY/okQAu6cCGNITpF76/qcD2C2bkgDsvgtA36bvAjDo+0wAl0Z+EkCf5lwAg9auMuYCeKY+BOASRQTgUtqbADpcTgWwcw1aAHTFEMCPKQ9A1zefB7B7LIBLux8KYNhZBNB3yfIjB8DOhDllAAOilvLfArBbAPTFfwbABZUgmi50K+fX/bAJVhxA584q41cDGARRBbATAQyDtQHWlW4QgCE8AYBh/65cXAPY+WA/EcCQthMA3GT8GgD3I9XSXT8BwFXGswDcOREFcL2HZAHosYMALjEZtj8BQA
/Uet9IANCn2+72AMAlxNPGDwK46sMQwFX/vgvgZv9JBLA7HUAf398BYNhxQYi/EUBfhApguOO9A6D3/DIAg1IuBrApi7pNW5R0KYBB1KeNkwsHAdx29PaT3/IdAO72Ht9o87UA9vQ1dfzZMEs/PxVAX8DnAezCVLcDuIDkIuPLX8XrSwA0w0vTU54NcwhA304PYBcFcInotwA4NzIo4QcC2J0OYFHblGfDLJH6IIA+nhjATgKw8/nSAVxxpAO4T38AwO6JAIZlPQ1A05ZRALuuEAAMY7GP5ycADDeJAHbfBOAaJNmvNACXTe8DuFRk3F8bAD9/c0Egawbc6sgheAVNAOB202UArpi5GsBNI88EcE1bBoBBTJb+2G66cwTsT0L6ETD2bJgYbRDATQ9GAOwOAriUeghAedO7AC6f3gOwuxfAzodiSt6dfBmmiV6GuRTATRHmlwG41JgL4CoAS0yeCWCSCGDo69ZDgR4dwH0RFwEYdNa3A9idAGCnA9ilA7hE1pi7AfSEOO2LEDMGNRJApy8HUO/o7aZrAJRiItW4AlAJMAFM2fQ1AC7u7HFYGUMPxeJdOxbtK9L9ygAwqC4CYBCmLwJwz8WBTckAhhdrfBGHAewIIAHUe1ABsFMB7AhgJoCra1u5AHYEMA3AoKxfBKCvJKguEcD5oNMRwEQAN0YI4ATBLwbQR3o2vgNgRwAJoB7gpXj/KQZgF4TJBT4splti6NL/XgC7hwLYhR7uy3oSgB0BJIAE8B0VYsvDD98BYKe5/90A+ma7ikIAu0wAN+kJYA6A27o7zf1bAZSL6B4AYPf7ANyX2j0MwPWWZwLoj58LgFJrFwAXV1fuEEAHYPc7AdyEIgpgZ9YB+MkAbkJGAD8LYLcFUHHaRSYLQF/1DwAw7JMkAFebIgA6GwGU/CKAm21G7JJw0+kAdgSQAIL0uQCCjjgXwE5I5QwfBbBb1ZgGYLjJVfRFAKauCYm0POyT3wSgGIqgjYG2GZ4FYLd4eS2Aw6q4+KM5oi1fBTYG4G4TAXTVfAhAt+kLAOxl44/miLZ8FVgCCAEU3PGmDwO4LlUsy1fmvbwcwOqV8GSESMuDbUIzCaBY1r6IXwlg1ZqnAvgmPT8BQKF8EFYjeY8A7NIAPPHRHKbs+XvqITiHnmATAVyXKvaMr8x7ee0IOIx/JuHRHJGWB9vEZv4QALudqx8BcPnw+wAcn3xkEy/D7INn9j0hNvNGAMUh5ysBFCtabTKi948GMEnfA+DqYHkMQMGd7waw80JF+Mq8l08GEESdAJ4EYKSi1SYjBFOq0oTW7wFw4/KmJRgaFLxuGzM5FTIK3Yt6/AkAih8yAXSfTOh95/XLAITneQQQfOgIoKwnALhUfR2AS53bhhwGEA98QkXuDwJoCOCeldAd0a9MAFc1agDqATai9wTwJAAxbRcBKNT45QB2UkUEcKn6KgDDOndldTubeS6A+yIIIAGMA7j5IIVi88kY81EAl4w/G8CUVEEwzgVQpGxXlrRpB2BAoFDYDQDuM/rtzs9OauS3AriPlHkqgKHTEhD7VPtNYkWzUKGoImEYPhNAtaJvADAlUiJtwiYAoDjU5gKIh6G9IICCg17nALit0giREwuFAErF/1YAxR7PBDCM/6kAriqadRGAUlhFv3bbZzfF5N8PIJ4WppR6FEAh1Q0Ajgf0Ud8BoJL80QBK7upNxwAuNgigUGoSgPvyPwGgUMBtAOKwEkB5XMoHEKSH7ogzy10qSUdTnQGgWM/7AIZR3Bf/WABFd/WmpwKYVtGSPhNAwcOPa+Ivu/wbAQwS/BwAo8HYbLoMwNP0IwDsfgqA8WBsNiUCKBSblv58nQhgWIn6QSrr5wCINyEjAhDOvOFIkHZScSV/rmffrzOpRdL0VkguWu8B0JaFSXw2TGIw5BYcskEA00p/DIBzx14EYGLAHgRg2fRlpz0b5iS9D2BiqUm2E0QAsYrUhekn6XoAL5bxP86tYvXpowB2ZwOY9m
iOk3QOgA/SBb4TwDd07KT5C3UpgEmVPg7A+mGH4J+lWxqIK5UBBBlOBjDt2TC/T59B54EAHtZ5AE4Phzl6GeaXiAA6Pe1C9C/Rj58dJIsA3iIC6EQAbxEBdCKAt+g6AJ+O+oMAfHqoPqkHA3hxNxDAW0QAnX48gM/EmgA6EcBbFPHqg04TwCiA594WcFk8D1X0aQD122oJ4OcBPHQXHAHML+oKPQFAE2u5ENN3ADwH3tzE6Xddgz1VLs+oljTXrtAPAfDQCJALIHbwHQC3eRUA9UBhAM36z2TXsDuf0TMBxMR9OYCR1uhluSVIm3jp41wcwLTo/F4A5cjFAcTHnXcA1A+OHwZQ8jsZwE3cHgKgWMxDAFSoOQRgDq3JEdv0pZDQbDfsSNkVZaS8AYDbjJcCiHdiPV9SiuXjMwDcLu9CkTNSd4+lrTp02z1Cz0WjtvNo95KkbqnMWYxH6yCA68YKxB8A0IQWY/YJY659BECd9qWCpwHoI2dcIH3UTWfcL+MRm3M6AI3vgMMA6iOHqycOoAkBNCEHq/Tevqp3B6DZEmfM1sslQ7CD+uCFAK6KUl1bUoXP+RIAXtcvBAyUvkr8KADNEiAPoGPQTFHz/7nuCwA0C6K75yr7dEt3SyPHPp6+813tq07xGZbdYu+k/2eC3Gb5MwDQ70BjMSZov9/qQdQAXKI1f9oCuKlcQEQB0KwyYAAlHp8IoItRGBcjAjgj4P+5jy7YkxYAfX7HhdkMtUGUV3X6kceVvQUwGGt9b60B9D6Fw7txrnSLx75bXAGrZi0FLI3sgiYtAAQtC4tYCgxC4atf4OkW35cylr15DWAIvgDgCrFgx1h60yzWKwEUloQMAM5hCmK8CvXs8lbd6lcIYBDtTS+YIABLL5gww+LNCsClSwPazDLArCpfbXKjVuCMAODixa6AlftBiR6RoIAwa7f5FZDoC5v3ts5l3Pg0A+jj5oK3ps3tip1PsQNwsTovl/31QgCFJ3OEy4S7bdCCQEXLVlKAjJ3yOdgUdt4uVbfftM0Y3Tpatq2MN3ZVrJb8UERA1VIn+Ii4w4XxA9uC6rKrLR9Xdc278HUA1vtlwWc/nTARwLcLS0p1fCf5cu0PT1Ki6wAUHoxwI4BnFXa4yp8PINbNAFK/XpcBWAuHYIq6TMKTOSjqOkmXYSiKoiiKoiiKoi6WBaZXqxub/qQ6p1BY6vWF5rb/lEJhqTfE5hK9au3STFu/Sv2yjW2rMqNQXOr1hea2/5RCYak3xOYaNbUVt7+G7YVsm1Q3hwuNl3p1ofntP6VQWOrlsblCjeLA2CL4zYnVXdcKjZd6VqGwVMmYUqh6aBsKlYyuUP2YiEq9OuCXaIy/NBEoB7/qVpxdtOWg+qXNIMZCS8k2lyoZXaHIU1nOVeRpoRxpRlcl41yobfflek+1w9foqmR0nko2VKqzadO5oULZBrox1ovny5Z2DJU4EaiHh5q/5NnFOKwbOSMs1JUqGV2hfTh3AXGFSrbFVcE4F9rWVvBmLlU2ToUOEyjgqV6obPSeCjZUqm+GMJ3zrRCnerXejagXL9KrdqPKfnRp62oajaTZxRKsfUZU6FKqmNGO0ajK/ZRlLFSxTYWqGftNTdEa6XvwsVTFOBb6GrwsZU9HiSC5pu2NS/NzCFSmc2OF2lQPdSPqxYvkdwLBg6YdbfKRD2Xsbb3xVYk5+1JVY5+x7WkQ8yHbUCjKaHvLS7xUgYx9ob2lt71ET6eRQ5pnIOMY1JyMvU2dziEb7EbUixfp9W9+5qVP5yRb2+oZX8NIZOuqlQvVjcNQVLZyja9Stw1Sja3p+8WOd70J/VoC4zD22XFUkTzt2bRaocCIMqJS22WqK8RGt01SuhH14kX6q7bj76bez7bddKaRJvhu75GMtphuPRQzGmg0ZaHUaIENZhwrLBSbXYx7a21Huxib+VAq5vuzGPcVoozQWE/TOdkbb9tlC7txZw16EVwfP1WTC1VlX8rsYrDBjKJtOHapGY
GxqcuorZWOXz5js8vb+gr3tsX4KoR5sLftZmVTYtm2GKVWJGUUjNN0Tu4pbxMqXLpR7GJrNNtFGmYQ5TCLlWclzTQHb8VJs2psqkbPuBj3F8RewwUcbOt35/2ZqTO2db0/32tejWpzRvG7BG+TzpONQTajtz8ho2gcZqVaT822scadMaGLRdtFaqvxNL0VTzb+PvvVSGdYFTAi22y0/a65j1bU1pTiKe1obGsrGpHNTO234vdVs008UUQ22P4xY1HJ1wqREfQUjNvYjUpG1P0XyS7xt1vbn2mbfPZpgRHZRosdrp0I520Rm1UXt9hpLmSkyxHINrffDv2wv8rmj1L7EQLZYrHpMw4TCWnKBowW9NSwQYvb0I0oo2q7SsNFjnY4NhbSxHi+0i5lREaYsSmL2pqXOO4j2wDQcFSTjTNAtVQltA31NcGFsY2tVylN0pHNt190dMgoflcUMaKegnFDGZHtGrXWlPZVteJJVO/dcKVTnK8gI8z4avtJznjydsw2Hu7Uy1av+ZrrUdtwrayS+Zu3Kq0ANtd+KwExXZwbr4keM6KegnFDGZHtKrV12Vjl+9S6tErnYCPM2KvqZzpK5+m2YSY3vPtYLHW+HivyiWxjjX3JVr7o3jZVb5NqRLap/QZE4FWX+jIxzQh6ysxxk+++QhlxoVeoH4IHB8Td1bgg2uNGZ5PaZgurdo5u6zumVnunP5mrhmYctQ3Hr2HAqqU62+GrCaVGZBvtPYQvxdhX1wywKLu9YsQ9NcStqGRyQcZI91+iaQeQe336PqB8STZodDZ5dlXp6AKbbXzv7NW8ej96s9QFyDaMrMPgqoweoEZsK4fDcCEb66ap21IJOTCinpripk4KQBcD2zXqd4Jpd5WMtR0xUgIJjLOt/yV8ZfVnxlPIh2y+d6xoHI/Sr/+QL+3rNjtOLRul1LlG5I2kYUbWFoV8hbi3NUWr9Dkwop56/TXErZ/pSoWijLD7r9K0u8rf2vf/vZRAIuNkG74tl3io7Th0ivOyiG3oHWXUGUH6z/0XGzHb+B1h22o7Ul+jfPhGtrZ+/etfypDU25rhKw9lRAZGvadedTW2Ao/I8k6GbJdo2l1fcg/Mw5i8gyDjaOvhlO57HPFsi7rPfMg294481xnRff39JX41j2zTN2/yacpYo3L4RjZj/ud/7fB9rJHuBahaPW7IOPeUcNlygGgcDgZyhUuXtd7FsPuv0dQD+/vfBjXTzUJiIKGxt/2zVO9QHe75tP0ZqBWqBDbcdXVbDf5Ioy6yDd+8jd+zGKHv+hrVOSKymf/7x9gU5ZaO+YApXtgERkRZW43d0YoVoi5GtqvU46/2gDV6IJGx/ZsrS7rrcPxqohVvCEI22HXN3/7Ljt+wC6Musg0avzP+b3nC0OiwINsYGbnGdjpgKkHFRp2ycTjQmgi6GNquUT+XHnugUu6V0gIJjdb9FgC0061bIrrI5uY6yh1I9XQvgLhmANiGk5C+8X+T21gPDWnkGpGtj4xao7/hRqoQGftDvtUo64cDvYmoiyPdf4HGDld7AAQSG/VFNWV/KK2UKAObm+toXWcrfdRFtqHP1QlDO35jr/Gg24JVX8LSh/nWIDluwAgpQ80HXYy7/wLZuhqWSWntQoGERn1RTTNcw1BqRDYT6TrrWySMusAG+66pWhAb3YZrREGFRkiZViHq4kj3X6Gm34+z2hUx6otqmrp4aTUi2yDYdWiZJ7I5/w7DkusNrBAZYUb1oIO6ONL9F8m634faFTHChTOoxlzbclFf9ka1wWac4k12UCMBH1OIqzWzm3GRctuFjH5RzSejDG3Im+xmPMubqE1brZndjGuU3S5o7DfPS70/HGXYdepiWuBpNOOzvNFt+kEnuxnX6J12qcZ5qbdY6DtR1ruu9TdhCd4AG2zGKd5kBxUGHB10sptxjbLbhYzjUm+l0Owow64z8Lif0owPegprzA4qzFjjg05uM65QdruwcS5UWliZG2U8uUTeJDRDPG
m4zZvDGeFBJ60Ze1+uUp3bLmScCu1PsPQnFx3vHmRD3sSboU2E7vEmIyM86CQ1Q8p3jbLbhYxjoc1w85z25KKMKMOuQ97EmqFOhG7xJi+jP+iIX/WlNOM2pbULPQZQMA7PERq+7dGeXJQVZewpdBU2Qz9p+IA34KGMoi03Yz0PZdJXGynNuFWge+r4wVQxjncYVMolplxYkA15Az090RsZCGTLzTjNJYb9XXwtYRTrW1Xr3RM/mGrGEcCMGrNtyBvo6XneKEBAWHIzjnOJYcURnAdqWEt5rhPqntjBVDVODyfKqDEbJOQNbMZZ3qhAQFiyM9qyn0wot1jFsJbyXCjYPfhgqhsbv9UeqzEbJOwqsJ3jDQIC2XIzuvsqRUWxvluge+DBFBpnya+3yIQlbsvy9ARvEBAQltyML3iLKcRazXWZQPf4g6m64Gj8ZPXCxfOsXFiQDbmKPT3Hm7m/xds9dNtiFJuh26x/MqLdmiJYyw24Ur57pC/X5j1cvH3RH2n11/jIL+lAQCBvkA26Cj1N8uagzQEhP/Ye2DxJ4mIDYAvckproxj/B1fs1d0/k5gh095h2PUl7SQcCAnmT5unRFw4leHPY5gxW/b4/fi8AivjBJvoBErzy625FAoJuX5SfkYde0hFklWt8p+tyXjiEvcm0ofcmvfWiqpOaeKv0gPgX7miPuxs5k6YstfqSjjDrMW+gzd8Xb/XqpFcjRbxxNs2q5UPvTQJvf8pebOB8yWzivdLfseW+L5RX47hxTn41kjXaGzxGqW+gAu/R8p6K50XWfdjPvJYXDikjsvpWM2dTXgc2vThKMNRWf2+Ss0m7tW8G6A0pbmlNfOY80AUEv2NLvOrsnii5v84ZvKRjL/gGqtrfxqF2nbg/LF3nfV57qrwaCb6ADE0m/Iu7pMkVem+Ss8FmgN5QbnVOaOJD54GRd2xZo70aaQyIcp3TvaRjyC5krNU3UKH3aCFPXfdonqqvRjLoBWRoMuHziac34L1JyzVgvRmKNxbfzR5t4jPngegdWy266vr6h36dc1pcblrtpnF8q4YSSdd18qq+MYU80xlfOGTRqY1SKppMIOQn+0uu0IylKbt1C4zw1v+EJj53Hhj/TkQcyv7M1zklYzMBpNw0XqKpOIgkXPbrDqZCVlvbaVSWig3eaiZ4qk8mxgcoaZ3qrv+KNeZ+1YlOYXATkasPUPw7EW0oG8c/2TglsPL7E+p5Prc3TXM9JZJ+tJLyzUca0Rs4Iv8FSkWTiT8AeXfmLNeIdmtorPW4wSb+8e+afaTctX1xKKvAUAav+5foEs98X/xRWNxMXF8Ppl3KhiMyKBV7oyMf5JdOGuBurRtR3CJNnPiTmb9djrLDQxkyWrSMZ57PHYYFr9AtI67CtaOf9gbVCHdrZERxQxXiXfcxOj6UQWPtL6p87iurFuX7A7zBI3KmNxB5WGOswhxvYIXzMz/vXJUeVc5QhoxwGQ+KVq4NeoObkVkjQh57k7tbQxvuKXP3qnSk/KEMGOEynlxYYJShq8h2uTfZuzXMCJsYG5HvVfZQBo3wu886ExZky15Yebk32bs1suGDDtxZ7ledOZRB41SoXOGz1gtf7s3DVqzfr/yhDBinQgezUOa564WP2i735twV65Ip6GLR15uVPZQh43DhYLwIKn8XnglLwqH9sKc3eJO7W0MbqjCys9yvlA0TwBQAAAVBSURBVKHsqNEMl7aM8l14nQkLsiFvsKdXe5O9WyMbPOhEmL9ddcJQJj4v3hnlRo9JxFs7c2GB3YpchZ6e6416eifbFqP49H7dhg86NTwG3C0YEDMNZeramHGc0+46U9eJACBQ1yFb1FXV0yRvDtpcjepXG31G5TuRuVRxwRWwzRXKB53Y0epmwYB4HVypFF0ngtDVui5mi7uqrqnC3mTZ4DXg5t9121Sq0gxkM9oNMFMXK1meIRgQY46vVEpZJ6IDwZVKeSuV0DqRx46Ao85YqWTVWzvDrMe8gbbvW6
kkBhVFHL13IrpO5Mk6Y6USWicyiiuVrlsn8nTVekCyVyq5WztFcaVSdLmLaMxeJ2KfPSyeslKpXW57VdD91SuVLAqqaoSPxIfrRMStj9EpK5XcrZ2tdPj69SuVQFBRxNF7J2xt1RY+XyAgy6qFfZyDlUra1EO+GnP2SqWdN6575BH55JVK+wQoqNBYW7dORDg+gIPO0wXa7FctyEPZvFJJnaYrV2NGIJRIjlFWYAlWKmmzUtmbFjXjL1BqsFJpZ/sDkF9WKgk1RoKqG/06EcEGDzoPF2qzX7UgDWV+pZI4TTf61RgUSQSLX+6gHNpVb/xyD3FEBqVib3Tkl/ySNyioyOjXieiPg4o8KuqZggEZE6ALy4rRmvnwJT29HUUSwYIgmw+KwFXNlosu9AbWGLNlZ4x+BfBgaSChoQwap8OX8sYCUCOytSjfH+ANHpEzvYHIoxqzg5pd6MMF24WGMmScDl/ytcJzooy8wc3IrBEhn+aNJGRMssmvyXm0oO9wKAPG8fClvnggExYcZeQqst3lDcyIeiOjpx4u4DscymLjnP7GglxYkA15Az29yRuYEfVGRk89XMh3OJQhI3p6ezYsMMrIG9iMe7yBGVFv5PTUw4UPmJkvX4Gvt8iFBUc59x0y13szCMKCjFk99XDlvqkHvqXCv97iYI35NuQNasbl3lz/TqWHC/kOhzL8Gp/MGrNBQt5ATy/3Jn+3zu2phwsGCw1leJzLrDEfJOAN9PRyb7J36+yeeriu9z0XlnM8vdyb7N36mymjKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKOoHqyiiSeo6OSlFHVScqqZ4pSalqKPaUvXaUfYqGmkzRX1CWwDLHWnjlv1minpPTVkUpR0BbMu6/9wMh9tBxtiqLurKDslsUc2bx6RFbcd8fW6XYElLUcmyRd2Yph6oanugTM9Va9yIWPef22I8+Wj95gnAAdSqKOvhZ7VJS1HJqkawXgNVZWHNSKSZKXuN5x3Tz8kYANgOSaef9SYtRSWrnqlbJnczYGaArj8c9wfecthQOttst8HPTVqKStZM3vir6adxbo5XBLbaX4QJADTrn0FaikpXAGA/g2ttyFNR+LOOahzrIIDLGQpFJSsAsB6PonY3Ag6avwaJjoAUdUzBHHBiqAnngHZO1UxnugDAJS1FHVBwFjyxWC6ATee0TV1NX4MYBOCSlqIOyBa1na8DvvphzlbVcEFvPhwPp7bN8Lf7BmTYLAO4pKWoI5q+CakHiIbvMl4DRfX0o2dy+mrEzkfgcbMCoEtLURRFURRFURRFURRFURRFURRF/Tz9P4AM57XOpRvOAAAAAElFTkSuQmCC", 473 | "text/plain": [ 474 | "#" 475 | ] 476 | }, 477 | "execution_count": 13, 478 | "metadata": {}, 479 | "output_type": "execute_result" 480 | } 481 | ], 482 | "source": [ 483 | "(progn\n", 484 | " (plot-dataset msi-access \"hits\" :terminal '(:png )\n", 485 | " :title \"MSI Access Log\" :ytics-font \",8\" :xtics-font \",8\"\n", 486 | " :xlabel-font \",15\" :ylabel-font \",15\" :xtic-interval 500 :output \"msi_access_log.png\")\n", 487 | " (display-png 
(png-from-file \"msi_access_log.png\")))" 488 | ] 489 | }, 490 | { 491 | "cell_type": "markdown", 492 | "metadata": {}, 493 | "source": [ 494 | "## Cleaning Datasets\n", 495 | "\n", 496 | "Sometimes data may have missing values or outliers. It is not unusual to have a broken or malfunctioning sensor generating your data. CLML provides a way of dealing with that.\n", 497 | "\n", 498 | "The `time-series-dataset` class has a `ts-cleaning` method which can clean missing values and outliers. Let's look at the documentation:\n", 499 | "\n", 500 | " TS-CLEANING names a generic function:\n", 501 | " Lambda-list: (D &KEY)\n", 502 | " Derived type: (FUNCTION (T &KEY) *)\n", 503 | " Documentation:\n", 504 | " - return: \n", 505 | " - arguments:\n", 506 | " - d : \n", 507 | " - interp-types-alist: \n", 508 | " a-list (key: column name, datum: interpolation(:zero :min :max :mean :median :mode :spline)) | nil\n", 509 | " - outlier-types-alist: \n", 510 | " a-list (key: column name, datum: outlier-verification(:std-dev :mean-dev :user :smirnov-grubbs \n", 511 | " :freq)) | nil\n", 512 | " - outlier-values-alist : \n", 513 | " a-list (key: outlier-verification datum: the value according to outlier-verification) | nil\n", 514 | " - comment:\n", 515 | " Same as /dataset-cleaning/ in read-data package.\n", 516 | "\n", 517 | "Let's give it a try. In particular, let's set the threshold for outliers in the `hits` column to 5 standard deviations and set the interpolation method to the mean." 518 | ] 519 | }, 520 | { 521 | "cell_type": "code", 522 | "execution_count": 15, 523 | "metadata": { 524 | "collapsed": false 525 | }, 526 | "outputs": [ 527 | { 528 | "data": { 529 | "text/plain": [ 530 | "C-MSI-ACCESS" 531 | ] 532 | }, 533 | "execution_count": 15, 534 | "metadata": {}, 535 | "output_type": "execute_result" 536 | } 537 | ], 538 | "source": [ 539 | "(defparameter c-msi-access \n", 540 | " (ts-cleaning msi-access :outlier-types-alist '((\"hits\" . :std-dev)) \n", 541 | " :outlier-values-alist '((:std-dev . 
5)) \n", 542 | " :interp-types-alist '((\"hits\" . :mean))))" 543 | ] 544 | }, 545 | { 546 | "cell_type": "code", 547 | "execution_count": 16, 548 | "metadata": { 549 | "collapsed": false 550 | }, 551 | "outputs": [ 552 | { 553 | "data": { 554 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAMAAAACDyzWAAABZVBMVEX///8AAACgoKD/AAAAwAAAgP/AAP8A7u7AQADIyABBaeH/wCAAgEDAgP8wYICLAABAgAD/gP9//9SlKir//wBA4NAAAAAaGhozMzNNTU1mZmZ/f3+ZmZmzs7PAwMDMzMzl5eX////wMjKQ7pCt2ObwVfDg///u3YL/tsGv7u7/1wAA/wAAZAAA/38iiyIui1cAAP8AAIsZGXAAAIAAAM2HzusA////AP8AztH/FJP/f1DwgID/RQD6gHLplnrw5oy9t2u4hgv19dyggCD/pQDugu6UANPdoN2QUEBVay+AFACAFBSAQBSAQICAYMCAYP+AgAD/gED/oED/oGD/oHD/wMD//4D//8DNt57w//Cgts3B/8HNwLB8/0Cg/yC+vr7f398/Pz9fX18fHx+/v7+fn58nJydvb29TU1MvLy8XFxfDw8M3Nzenp6c7OzsHBwcTExNHR0cPDw9PT08LCwtjY2MAALuS6J12AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAgAElEQVR4nO2di4KrOpOdxXPwcsJA/smcmVwmCc8fBEgIqFoC2VzsXuuc3e2mdCmVPoTACIyhKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiKIqiqC9XcWJqitLUVkVRv2wmgEXri3Ebmqouiqo1y8LqDK/I91/Rq2qMbcq6yQTQ01UVjuXGGNvW7QKgti6Pu0UA/4jaCaCyHPu8rQvHj7Gv/sOr/1A3dVHZ2OTGzKrxAL7GIbB99RvqIV2fdwFQX8JU1VhQ9Husqh+C3Qb/eyw38tBX6/IRzB9T1YaPwzG0Hwndv3677UFyJPVMlFVsamrbH2s9gBNdtSPSWaPCRrXVWEnTF9R/nn/PVQ2Jwu9l/rZPYl0y96Fn8KQ4UDeptuGj69sBlfY1bbG1I6z/vTC546wJI6AZhqd+IC0cU6/WzoVNNTjOzJQt/u3L80kXbIU/xtqq6UNLAH9MxfJjYc3IXdhUbEyD1QYAh4N4T6HbYNv+yD2QMo9gbkxzsNV23OB/h/JeE7WvQG+cf0zmj+88BP+aahs+DpSN6j+VdT18KDamIqQef/XwOQg9GquTkGGAbOrtQBdXNc79/O841VxbsdxO/YZWc8Aw9r1Ka5cjYDwsxiOgo2953msj3NqJs3YzAsYXZ2xTLX7HANu4RI6AvyZ3RuFUvsY5oD+NGDq6iQEMpsVkbECzrMdPU5IYwAlwdyayngPOZyxGGB+9YTkHLAngr6kcTmz9dUB3WmrdSW8/ptm2qu0MYDC1/XGyqSMAy6I001jozp2baDxs/DhXNw718ex3PgseykufBQ8lump5FvyDWn4T0ndx4f5oKrexH3ACgMG0vA5o/EmL+E1IOML3ZAnXAX1Vw4ZmcR0wzBB5HZB6hIYjtq1TySjqHFVlf3iPzpko6lK5LwdfTTodRVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEU9WA1deGWFFZuDVf8k6Iu0fi4nddr/T9FXaIRwNq6RYLxT4q6RLYo6vGRAcXiJ0Vdoqp1C/0TABbUn9dZAI641fgQzAHxz+s0BOphBHyVw7lH9POa2qlv0WkINO7hPKnLMATwz+teBAjgnxcBpG4VAaRuFQGkbhUBpG4VAaRuFQGkbhUBpG4VAaRuFQGkbhUBpG4VAaRu
FQGkbhUBpG4VAaRuFQGkbhUBpG4VAaRuFQGkbtV5q+KGJXdcE0JhnYlAs3kuB1fFUSudiICtDNcFUwmdiMCr9cvT9ScjnFc79R06DwE3AO56NMdpHlAP17ndX/YDIA/BVELnIVDb/gcfzUFhnYfAUDIvw1BYvBBN3SoCSN0qAkjdKgJI3SoCSN0qAkjdKgJI3SoCSN0qAkjdKgJI3SoCSN0qAkjdKgJI3SoCSN0qAkjdKgJI3SoCSN0qAkjdKgJI3SoCSN2q8xB4FXXLRUlUQqchUJa2KflsGCqh0xCo7fSTC9MpoPMez1YWdcNnw1AJnQigaWs+G4YCOrX7a+twq3kIpqBOQ+DVj4AVnw1DJXQeApsLMLwMQ231uxeiuxPLpj4mAviL9d/d9gMigL9Y/91tPyAC+Iv13932AyKAv1j/3W0/IAL4i/XvrvvuIBHA36yfAD6g9rtjSwB3iQD+Yv0E8AG13x1bArhLBPAX6yeAD6j97th+qP6sYp4L4KZCAvjw+gngmfp87Z3w6R4RwF0VEsCzRAB3VUgAzxIB3FUhATxLBHBXhQTwLD0WwE75fIkI4GUS6z/sVEcAM1UVRfG6fk0IAUxW80cArK37efmjOQhgspq/AuD48+p1wQQwWc0fAbAYDriXP5rjZAAPFEoAU5UPOu/pWI0pq+sfzfHdAG7NPw3g6U9mueHRHAQwWc1zAHQ67yy4dSPg5Y/mIIDJav4IgE19y6M5PgmgUAIBfFO8EJ1VVk6hBHBXAL8OwN1d+LcA7MBfICkBPKwvBLCTNopsEcCLRQBTxR8GsJPLUSohgIf15QAqn+Xiu30AdoYAZuoRAKoJCeCb+hIAjwXhPgC77abDnmwBXJS4qbF7GoCHMhDA90o8AiAazZIAdkLK8c8NgJID7wB4lEAC+AEAd/Splu4vALg7c1IEcFcR3wFgpzv7FwAsKzO8frA8Un9aDwSwg+m+BUAUpW8EsHV3zbzczTPtEQeSIoAEcFdJlbuXoKitfVVHHEhKAnA3YeKWLvwkgDtc/RYAC+tGwXZ3+r36dQC7te3xAHYPBdC4ZW7W9MPgEQd2lfsMACUGCCDOnNTnAOxPPpqiP/ra6msOwaAPPgOgOnDOW7o0gB0B3KNyOv+oLzgJ+XEAu+hz+KHWSAAnTVdgXhdchknEjgBKNf08gCeJAA4b3gewWxbxcwDOqWqQaqW2MLvWhGgAyu4/H8AZlScAiOP1dQA2B0bMqtj3aA4CODu9rpEADmki7R8BSwdrnVoXLEVd8zUOj0ZCPoCJgscSPwhgRwCd9gBYVjN/zd7K7cuVnXw0x1sAbvpXBlAI8ubP6wDs5o/+53sAdl8JoP9wdA64W69mD4BF1/+fcHFpewqAq657KIDdyrAu4mMAituFEC282P9ojt3j3qyhcFufegjOBVAMzKMA9FtuAFBiUvkjtT0B4PD5zMswhUk/mqNbO6YB6FNBALuoL+8HsCOAm78yAHSDZHQWkkwf5dxxGebLAZSc3QNgdxzATVkLvwhglh4DYPc7AHaxq938TYvo6pY2YZP6R7TpKIDBw7u/CVkCGKJljOa+AGCM2tcA2N0HYNyIzwOoBLCb0y08JIDvAdht03a+yADgosclAFeNywbQf+rmD9kAyv1AADGAq46QCtpE/U0AV0MhAZQTE8BNotj8VQBKfm0A7A4C2IUtjwTwVV9yEvIegN1ZAM713ABgIGN2dt2eTbuXAK6nq18I4OuCs2Dv0q0AdmL8zgWw+zoA5Z7ZJv8cgMXL7kp3VO8AGD6LAHZXA7ioLg1gJwM4NyU04xCAodTHAhiicwzAXamO60EAhpFpUVtcwJcCuBmjTwKw2yb/HIBVxpfBe3QcwAixswFcd04CwDj1DOCyoH0Azi37CIChaZHnM24hdEsHZWBvA9BWrd2V8KAwgDFC0dTJbzkJwBDPzwI4E+J/bwDcTAYfDOAasxWAs7NzqlwAi6WS6Y9oJ4DdUwGM6YlSz33/
6wB286ZdAHbfBWBo03cBGPV9JoBxMz4HYEhzLoBRaxcZcwE8Ux8CcI4iAnAu7U0APS6nAtj5ZswA+mII4MeUB6Dvm88D2D0WwLndDwUw7iwCGD7NP3IA7EycUwYwImou/y0AuxnAUPxnAJxRiaLpA7ZwftkPq2ClAQzdEGf8agCjIKoAdiKAcbBWwPrSDQIwhicCMO7fhYtLALsQ7CcCGNN2AoCrjF8D4HakmrvrFwBcZDwLwI0TSQCXe0gWgAE7COAcE7f9CQAGoJb7xg4AQ7rFbo8BnEM8bvwggIs+jAFc9O+7AK72n50AdqcDGOL7NwCMOy4K8TcCGIpQAYx3vHcADJ5fBmBUysUANlVRt/sWJV0KYBT1cePowkEA1x29/hS2fAeAm70nNNp8LYA9fU2dfjbM3M9PBTAU8HkAuzjV7QDOIPnIhPIX8foSAI17afqeZ8McAjC0MwDYJQGcI/otAE6NjEr4QQC70wEsarvn2TBzpD4IYIgnBrCTAOxCvv0ALjjSAdymPwBg90QA47KeBqBpqySAXVcIAMax2MbzEwDGm0QAu28CcAmS7Nc+AOdN7wM4V2T8XysAP39zQSRrHG514hAchyUGcL3pMgAXzFwN4KqRZwK4pC0DwCgmc3+sN905AvYnIf0ImHo2TIo2COCqBxMAdgcBnEs9BKC86V0A50/vAdjdC2AXQjEm706+DNMkL8NcCuCqCPPHAJxrzAVwEYA5Js8EcJdOAjDq1D8D4LaIiwCMOuvbAexOALDTAez2AzhH1pi7AQyEeG2LEDNGNRJAry8HUO/o9aZrAJRiItW4AFAJMAHcs+lrAFw7Jrh/AMAlgduKdL8yAIyqSwAYhemLANxycWDTbgDjizWhiMMAdgSQAOo9qADYqQB2BDATwMW1rVwAOwK4D8CorD8EYKgkqm4ngNNBpyOAOwFcGSGAIwR/GMAQ6cn4DoDdjwK47lQCuKoxAaCPZig+fEoB2EVh8oGPi+nmGPr0fxfA7qEAdrGH27KeBGBHAAkgAXxHhdjy+MN3ANhp7n83gKHZvqIYwC4TwFV6ApgD4LruTnP/VgDlIroHANj9PQC3pXYPA3C55ZkAhuPnDKDU2hnA2dWFOwTQA9j9TQBXoUgC2JllAH4ZwFXICOBnAezWACpO+8hkARiq/gEA4z7ZBeBiUwJAbyOAkl8EcLXNiF0SbzodwI4AEkCQPhdA0BHnAtgJqbzhowB2ixr3ARhv8hV9EYB714QkWh73yV8CUAxF1MZI6wzPArCbvbwWQLcqLv1ojmTLF4FNAbjZRAB9NR8C0G/6AgB72fSjOZItXwSWAEIABXeC6cMALksVywqVBS8vB/BV7ngyQqLl0TahmQRQLGtbxJ8E8NWapwL4Jj2/AKBQPgirkbxHAHb7ADzx0Rym6vl76iE4h55oEwFclir2TKgseHntCOjGP7Pj0RyJlkfbxGb+CIDdxtWPADh/+HsADk8+sjsvw2yDZ7Y9ITbzRgDFIecrARQrWmwyovePBnCXvgfAxcHyGICCO98NYBeEigiVBS+fDCCIOgE8CcBERYtNRgimVKWJrd8D4MrlVUswNCh43TpmcipkFLoX9fgTABQ/ZALoP5nY+y7ojwEIz/MIIPjQEUBZTwBwrvo6AOc61w05DCAe+ISK/B8E0BDALSuxO6JfmQAuatQA1ANsRO8J4EkAYtouAlCo8csB7KSKCOBc9VUAxnVuyuo2NvNcALdFEEACmAZw9UEKxeqTMeajAM4ZfxvAPamiYJwLoEjZpixp0wbAiEChsBsA3GYM272fndTIbwVwGynzVABjpyUgtqm2m8SKJqFCUUXCMHwmgGpF3wDgnkiJtAmbAIDiUJsLIB6GtoIACg4GnQPgukojRE4sFAIoFf9XARR7PBPAOP6nArioaNJFAEphFf3abJ/cFJN/P4B4Wrin1KMACqluAHA4oA/6DgCV5I8GUHJXbzoGcLZBAIVSdwG4Lf8TAAoF3AYg
DisBlMelfABBeuiOOLPcpJJ0NNUZAIr1vA9gHMVt8Y8FUHRXb/peAPdVNKfPBFDw8OMa+csu/0YAowS/A2AyGKtNlwF4mn4CwO5XAEwHY7VpJ4BCsfvSn68TAYwrUT9IZf0OgHgTMiIA4cwbjgT7Tiqu5M/37Pt17mqRNL0VkovWewC0VWF2PhtmZzDkFhyyQQD3lf4YAKeOvQjAnQF7EIBV05e979kwJ+l9AHeWust2ggggVrF3YfpJuh7Ai2XCj3OrWHz6KIDd2QDuezTHSToHwAfpAt8J4Bs6dtL8hboUwF2VPg7A+mGH4N/SLQ3ElcoAggwnA7jv2TB/T59B54EAHtZ5AI4Phzl6GeaPiAB6Pe1C9B/Rz88OdosA3iIC6EUAbxEB9CKAt+g6AJ+O+oMAfHqoPqkHA3hxNxDAW0QAvX4ewGdiTQC9COAtSnj1QacJYBLAc28LuCyehyr6NID6bbUE8PMAHroLjgDmF3WFngCgSbVciOk7AJ4Db27i/Xddgz1VLs+oln2uXaEfAfDQCJALIHbwHQDXeRUA9UBhAM3yz92uYXc+o2cCiIn7cgATrdHL8kuQVvHSx7k0gPui83cBlCOXBhAfd94BUD84fhhAye/dAK7i9hAAxWIeAqBCzSEAc2jdHbFVXwoJzXrDhpRNUUbKGwG4zngpgHgn1vPtSjF/fAaA6+VdKHJG6u6htEWHrrtH6Llk1DYebV6S1M2VeYsJaB0EcNlYgfgDAJrYYsw2Ycq1jwCo0z5X8DQAQ+SMD2SIuumM/2UCYlNOD6AJHXAYQH3k8PWkATQxgCbmYJE+2Bf1bgA0a+KMWXs5Z4h20BC8GMBFUaprc6r4OV8CwMv6hYCB0heJHwWgmQMUAPQMmjFq4T/ffRGAZkZ081zlkG7ubmnk2MYzdL6vfdEpIcO8W2ydDP9MlNvMf0YAhh1oKMZE7Q9bA4gagHO0pk9rAFeVC4goAJpFBgygxOMTAfQxiuNiRAAnBMI//9EHe9QMYMjvuTCroTaK8qLOMPL4stcARmNt6K0lgMGneHg33pVu9jh0iy9g0ay5gLmRXdSkGYCoZXERc4FRKEL1Mzzd7Ptcxrw3LwGMwRcAXCAW7Rhzb5rZeiWAwpIQB+AUpijGi1BPLq/VLX7FAEbRXvWCiQIw94KJM8zeLACcuzSizcwDzKLyxSY/akXOCADOXmwKWLgflRgQiQqIs3arXxGJobBpb+t8xpVPE4Ahbj54S9r8rtiFFBsAZ6v3ct5fLwRQeDJHvEy4WwctClSybCUFyNgpn6NNcedtUnXbTeuMya2DZd3KdGMXxWrJD0UEVC11QoiIP1yYMLDNqM672vxxUde0C18HYL1dFnz20wl3Avh2YbtSHd9Jvlzbw5OU6DoAhQcj3AjgWYUdrvL3AcS6GUDqz+syAGvhEExRl0l4MgdFXSfpMgxFURRFURRFUdTFssBUtrqx6U+qcwqFpV5faG77TykUlnpDbC5RWWuXZtq6rPTLNrZ9VRmF4lKvLzS3/acUCku9ITbXqKmtuL102wvZNqpuDheaLvXqQvPbf0qhsNTLY3OFGsWBoUXwmxOru64Vmi71rEJhqZJxT6Hqoc0VKhl9ofoxEZV6dcAv0RB/aSJQOb/qVpxdtJVTXWoziKHQSrJNpUpGXyjyVJZ3FXlaKEeawVXJOBVq2225wVPt8DW4Khm9p5INlept2nTOVSjbQDemevF82coOoRInArV7qHkpzy6GYd3IGWGhvlTJ6Avtw7kJiC9Uss2uCsap0La2gjdTqbJxLNRNoICneqGyMXgq2FCpoRnCdC60Qpzq1Xo3ol68SGXtR5Xt6NLWr3E0kmYXc7C2GVGhc6liRjtE41VtpyxDoYptLFTN2G9qitZI34MPpSrGodDSeVnJng4SQfJN2xrn5ucQqEznhgq1qR7qRtSLFynsBIIHTTvY5CMfytjbemP5EnP2parGPmPb0yDmQzZXKMpoe0spXqpAxr7Q
3tLbStHTceSQ5hnIOAQ1J2NvU6dzyAa7EfXiRSr/W5h56dM5yda2esbSjUS2frVyobrRDUVVK9dYVrrNSTW2pu8XO9z1JvRrBYxu7LPDqCJ52rNptUKBEWVEpbbzVFeIjW4bpXQj6sWL9K/aDr+bejvb9tOZRprg+71HMtpivPVQzGig0VSFUqMFNphxqLBQbHY2bq21HexibKZDqZjvn9m4rRBlhMZ6nM7J3gTbJlvcjRtr1Ivg+vipGl14vWypzC6cDWYUbe7YpWYExqaukrZWOn6FjM0mbxsq3NpmY1kI8+Bg28zKxsSybTZKrdiVUTCO0zm5p4JNqHDuRrGLrdFsF8nNICo3i5VnJc04B2/FSbNqbF6NnnE2bi+Ile4CDrb1u/P2zNQb27renu81ZaPavFH8LiHYpPNkY5DN6O3fkVE0ulmp1lOTbahxY9zRxaLtIrWv4TS9FU82/n3yq5HOsF7AiGyT0fa75jZaSVtTiae0g7GtrWhENjO234rfV0028UQR2WD7h4zFS75WiIygp2Dchm5UMqLuv0h2jr9d2/4Zt8lnnxYYkW2wWHftRDhvS9isurjFjnMhI12OQLap/db1w/YqWzhKbUcIZEvFps/oJhLSlA0YLegpt0GLm+tGlFG1XSV3kaN1x8ZCmhhPV9qljMgIMzZVUVtTiuM+sjmA3FFNNk4A1VKV0Obqa6ILYytbr0qapCNbaL/oqMsofleUMKKegnFDGZHtGrXWVLZ8teJJVO+du9IpzleQEWYs236SM5y8HbMNhzv1slU5XXM9anPXyl4yf9NWpRXA5ttvJSDGi3PDNdFjRtRTMG4oI7JdpbauGqt8n1pXVukcbIQZe736mY7SebrNzeTcu4/FUqfrsSKfyDbU2Jds5YvubfPqbVKNyDa234AIlHWlLxPTjKCnzBQ3+e4rlBEXeoX6Idg5IO6uxgfRHjd6m9Q2W1i1c3Rb3zG12jv9ydzLNeOozR2/3IBVS3W27qsJpUZkG+w9hKVi7KtrHCzKbq8YcU+5uBUvmVyQMdH9l2jcAeReH78PqErJBo3eJs+uXjq6wGab0DtbNWXvR2+WugDZ3MjqBldl9AA1YlvlDsOFbKybpm4rJeTAiHpqjJs6KQBdDGzXqN8Jxt1VMtZ2wEgJJDBOtv6X8JXVPxOeQj5kC71jReNwlC7/h3xpX7fZYWrZKKVONSJvJLkZWVsU8hXi3tYUrdLnwIh6qvyXi1s/05UKRRlh91+lcXeVv7Xv/yuVQCLjaHPflks81HYYOsV5WcLmekcZdQaQ/uf2i42UbfiOsG21HamvUT58I1tbl//1X8qQ1Nsa95WHMiIDo95TZf0aWoFHZHknQ7ZLNO6updwD0zAm7yDIONh6OKX7Hgc826LuMx+yTb0jz3UGdMt/L8Wv5pFt/OZNPk0ZalQO38hmzP/5v9Z9H2ukewFerR43ZJx6Srhs6SAahgNHrnDpsta7GHb/NRp7YHv/m1Mz3iwkBhIae9t/Vuodqu6eT9ufgVqhSmDDXVe3L+ePNOoim/vmbfiexQh919eozhGRzfy//xiaotzSMR0wxQubwIgoa19Dd7RihaiLke0q9firPWCNHkhkbP/NlyXddTh8NdGKNwQhG+y65t/+lx2+YRdGXWRzGr4z/t/yhKHRYUG2ITJyje14wFSCio06ZcNwoDURdDG0XaN+Lj30wEu5V0oLJDRa/1sA0I63bonoIpuf6yh3INXjvQDimgFgcychfeP/TW5j7RrSyDUiWx8ZtcZww41UITL2h3yrUdYPB3oTURcnuv8CDR2u9gAIJDbqi2qq/lD6UqIMbH6uo3WdfemjLrK5PlcnDO3wjb3Gg26LVn0JSx+mW4PkuAEjpAw1H3Qx7v4LZOuXWyaltQsFEhr1RTWNu4ah1IhsJtF1NrRIGHWBDfZd82pBbHQbrhEFFRohZVqFqIsT3X+Fmn4/zmpXwqgvqmnqotRqRDYn2HVomSeyef8Ow5Lr
DawQGWFG9aCDujjR/RfJ+t+H2pUwwoUzqMZc23xRX/ZGtcFmnOJNdlATAR9SiKs1s5txkXLbhYxhUc0nowxtyJvsZjzLm6RNW62Z3YxrlN0uaOw3T0u9Pxxl2HXqYlrgaTLjs7zRbfpBJ7sZ1+iddqnGaam3WOg7Uda7rg03YQneABtsxineZAcVBhwddLKbcY2y24WMw1JvpdDsKMOuM/C4v6cZH/QU1pgdVJixxged3GZcoex2YeNUqLSwMjfKeHKJvNnRDPGk4TZvDmeEB519zdj6cpXq3HYh41hof4KlP7noePcgG/Im3QxtInSPNxkZ4UFnVzOkfNcou13IOBTauJvntCcXZUQZdh3yJtUMdSJ0izd5GcNBR/yqb08zbtO+dqHHAApG9xwh922P9uSirChjT6GrsBn6ScMHvAEPZRRtuRnraSiTvtrY04xbBbqnTh9MFeNwh8FLucSUCwuyIW+gpyd6IwOBbLkZx7mE29/F1xImsb5Vtd496YOpZhwAzKgx24a8gZ6e540CBIQlN+Mwl3ArjuA8UMNaynOdUPekDqaqcXw4UUaN2SAhb2AzzvJGBQLCkp3RVv1kQrnFKoW1lOdCwe7BB1Pd2ISt9liN2SBhV4HtHG8QEMiWm9HfVykqifXdAt0DD6bQOEl+vUUmLGlblqcneIOAgLDkZizhLaYQazXXZQLdEw6m6oKj4ZPVCxfPs3JhQTbkKvb0HG+m/hZv99Bts1Fshm6z4cmIdm1KYC034EqF7pG+XJv2cPH2xXCk1V/jI7+kAwGBvEE26Cr0dJc3B20eCPmx98AWSBIXGwBb5JbURD/+Ca7er6l7EjdHoLvHtOtJ2ks6EBDIm32eHn3h0A5vDtu8warf96fvBUARP9jEMECCV37drURA0O2L8jPy0Es6oqxyje90Xc4Lh7A3mTb03qS3XlR1UhNvlR6Q8MId7XF3A2fSlKVWX9IRZz3mDbSF++KtXp30aqSEN96mWbV86L1J4O1P2YsNvC+ZTbxX+ju2/PeF8mocP87Jr0ayRnuDxyD1DVTgPVrBU/G8yPoP25nX/MIhZURW32rmbcrrwMYXRwmG2urvTfI2abcOzQC9IcVtXxOfOQ/0AcHv2BKvOvsnSm6vc0Yv6dgKvoGqDrdxqF0n7g9z1wWfl54qr0aCLyBDk4nw4i5pcoXem+RtsBmgN5RbnXc08aHzwMQ7tqzRXo00BES5zulf0uGyCxlr9Q1U6D1ayFPfPZqn6quRDHoBGZpMhHzi6Q14b9J8DVhvhuKNxXezJ5v4zHkgesdWi666lv+hX+ccF5ebVrtpHN+qoUTSd528qm9IIc90hhcOWXRqo5SKJhMI+dFeyhWaoTRlt26BEd76v6OJz50Hpr8TEYeyf6brnJKxGQFSbhqv0FQcRBIu+/UHUyGrre04KkvFRm81EzzVJxPDA5S0TvXXf8Uac7/qRKcwuInI1Qco/Z2INpQN459sHBNY+f0J9TSf25rGuZ4SyTBaSfmmI43oDRyR/wVKRZOJfwDy/sxZrhHt1tBY63GDTfwnvGv2kfLX9sWh7AWGMnjdv0KXeKb74o/C4mfi+now7VI2HJFBqdgbHfkov3TSAHdr3YjilmjiyJ/M/O3ylB0eypDRomU803zuMCx4hW6VcBWuHf20N6hGuFsjI4obqhDvuo/R8aEMGutwUeVzX1m1KN8/wBs8Imd6A5GHNaYqzPEGVjg98/POVelJ5QxlyAiX8aBo5dqgN7gZmar420gAAAXdSURBVDUi5LE3ubs1tOGeMnevSkfKH8qAES7jyYUFRhm6imyXe5O9W8OMsImpEfleZQ9l0Ai/+6wzYUG27IWVl3uTvVsjGz7owJ3lftWZQxk0joXKFT5rvfDl3jxsxfr9yh/KgHEs1JmFMs9dL3zUdrk3565Yl0xRF4u+3qzsoQwZ3YWD4SKo/F14Jiw7Du2HPb3Bm9zdGtpQhYmd5X7tGcqOGo27tGWU78LrTFiQ
DXmDPb3am+zdGtngQSfB/O2qdwxl4vPivVFu9JBEvLUzFxbYrchV6Om53qind7JtNopP79dt+KBTw2PA3YIBMeNQpq6NGcY57a4zdZ0IAAJ1HbIlXVU93eXNQZuvUf1qo8+ofCcylSouuAK2qUL5oJM6Wt0sGJCggyuVkutEELpa16VsaVfVNVXYmywbvAbc/HfdNpaqNAPZjHYDzNjFSpZnCAbEmOMrlfasE9GB4EqlvJVKaJ3IY0fAQWesVLLqrZ1x1mPeQNv3rVQSg4oijt47kVwn8mSdsVIJrRMZxJVK160TebpqPSDZK5X8rZ2iuFIpudxFNGavE7HPHhZPWanUzre9Kuj+6ZVKFgVVNcJH4sN1IuLWx+iUlUr+1s5WOnz9+ZVKIKgo4ui9E7a2agufLxCQedXCNs7RSiVt6iFfjTl7pdLGG9898oh88kqlbQIUVGisrV8nIhwfwEHn6QJtDqsW5KFsWqmkTtOVqzEDEEokhygrsEQrlbRZqexNi5rxL1BqtFJpY/sHID+vVBJqTARVN4Z1IoINHnQeLtTmsGpBGsrCSiVxmm70qzEokgiWsNxBObSr3oTlHuKIDErF3ujIz/klb1BQkTGsE9EfB5V4VNQzBQMyJEAXlhWjNdPhS3p6O4okggVBNh0UgauaLRdd6A2sMWXLzpj8CuDB0kBCQxk0jocv5Y0FoEZka1G+f4A3eETO9AYij2rMDmp2oQ8XbBcaypBxPHzJ1wrPiTLyBjcjs0aE/D5vJCHjLpv8mpxHC/oOhzJgHA5f6osHMmHBUUauIttd3sCMqDcyeurhAr7DoSw1zulvLMiFBdmQN9DTm7yBGVFvZPTUw4V8h0MZMqKnt2fDAqOMvIHNuMcbmBH1Rk5PPVz4gJn58hX4eotcWHCUc98hc703ThAWZMzqqYcr90098C0V4fUWB2vMtyFvUDMu9+b6dyo9XMh3OJTh1/hk1pgNEvIGenq5N/m7dW5PPVwwWGgow+NcZo35IAFvoKeXe5O9W2f31MN1ve+5sJzj6eXeZO/W30wZRVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVEURVE/rKJIJqnr3Ukp6qDSVDVFuTcpRR3VmqpyQ1lZNNJmivqE1gBWG9KGLdvNFPWemqooKjsA2FZ1/7lxh1snY+yrLuqXdcls8Zo2D0mL2g75+tw+wZyWonbLFnVjmtpR1fZAmZ6r1vgRse4/t8Vw8tGGzSOADtRXUdXu52uVlqJ26zWAVTqqqsKagUgzUVYO5x3jz9EYAdi6pOPPepWWonarnqibJ3cTYMZB1x+O+wNv5TZU3jbZbfRzlZaidmsib/jV9NM4P8crIlsdLsJEAJrlzygtRe1XBGA/g2ttzFNRhLOO1zDWQQDnMxSK2q0IwHo4itrNCOg0fQ2SHAEp6piiOeDIUBPPAe2UqhnPdAGAc1qKOqDoLHhksZoBG89pm/o1fg1iEIBzWoo6IFvUdroOWPbDnH293AW96XDsTm0b97f/BsRtlgGc01LUEY3fhNQOIvddRukoqscfPZPjVyN2OgIPmxUAfVqKoiiKoiiKoiiKoiiKoiiKoijq9/T/AXs80tNFciChAAAAAElFTkSuQmCC", 555 | "text/plain": [ 556 | "#" 557 | ] 558 | }, 559 | "execution_count": 16, 560 | "metadata": {}, 561 | "output_type": "execute_result" 562 | } 563 | ], 564 | "source": [ 565 | "(let ((png-file 
\"clean-msi-access-log\")) \n", 566 | " (plot-dataset c-msi-access \"hits\" :terminal '(:png)\n", 567 | " :title \"Cleaned MSI Access Log\" :ytics-font \",8\" :xtics-font \",8\"\n", 568 | " :xlabel-font \",15\" :ylabel-font \",15\" :xtic-interval 500\n", 569 | " :yrange '(0 8000) \n", 570 | " :output png-file)\n", 571 | " (display-png (png-from-file png-file)))" 572 | ] 573 | }, 574 | { 575 | "cell_type": "markdown", 576 | "metadata": {}, 577 | "source": [ 578 | "Notice that the datapoint near 13 July 2008, 15:00 to 15:59, which previously spiked to over 7000, is now more reasonable.\n", 579 | "\n", 580 | "## Conclusion\n", 581 | "I would like to thank Frederic Peschanski, the creator of [`fishbowl`](https://github.com/fredokun/fishbowl-repl), which provides Common Lisp support for iPython. I would also like to thank Masataro Asai, the creator of [`eazy-gnuplot`](https://github.com/guicho271828/eazy-gnuplot/). I would like to thank the creators of iPython and project [Jupyter](http://jupyter.org/), a truly cross-platform mechanism for the presentation of code and content. Finally, I would like to thank github for [providing the ability to view notebooks inside github repositories](http://blog.jupyter.org/2015/05/07/rendering-notebooks-on-github/).\n", 582 | "\n", 583 | "The iPython notebook and source for this tutorial can be found in the [clml.tutorials https://github.com/mmaul/clml.tutorials.git](https://github.com/mmaul/clml.tutorials.git) github repository.\n", 584 | "\n", 585 | "### Stay tuned to the [clml.tutorials](https://mmaul.github.io/clml.tutorials/) blog or [RSS feed](https://mmaul.github.io/clml.tutorials/feed.xml) for Part II, which will cover prediction and anomaly detection." 
586 | ] 587 | } 588 | ], 589 | "metadata": { 590 | "kernelspec": { 591 | "display_name": "SBCL Lisp", 592 | "language": "lisp", 593 | "name": "lisp" 594 | }, 595 | "language_info": { 596 | "name": "common-lisp", 597 | "version": "1.2.7" 598 | } 599 | }, 600 | "nbformat": 4, 601 | "nbformat_minor": 0 602 | } 603 | -------------------------------------------------------------------------------- /CLML-Time-Series-Part-1.lisp: -------------------------------------------------------------------------------- 1 | 2 | (ql:quickload '(:clml.utility ; Need clml.utility.data to get data from the net 3 | :clml.hjs ; Need clml.hjs.read-data to poke around the raw dataset 4 | :clml.time-series ; Need Time Series package obviously 5 | :iolib 6 | :clml.extras.eazy-gnuplot 7 | :eazy-gnuplot 8 | )) 9 | 10 | (defpackage #:time-series-part-2 11 | (:use #:cl 12 | #:cl-jupyter-user ; Not needed unless using iPython notebook 13 | #:clml.time-series.read-data 14 | #:clml.time-series.anomaly-detection 15 | #:clml.time-series.exponential-smoothing 16 | #:clml.extras.eazy-gnuplot) 17 | (:import-from #:clml.hjs.read-data #:head-points #:!! 
#:dataset-dimensions) 18 | (:import-from #:clml.time-series.util #:predict) 19 | (:import-from #:clml.hjs.read-data #:read-data-from-file) 20 | ) 21 | 22 | 23 | (in-package :time-series-part-2) 24 | 25 | (defparameter dataset (read-data-from-file 26 | (clml.utility.data:fetch 27 | "https://mmaul.github.io/clml.data/sample/msi-access-stat/access-log-stat.sexp"))) 28 | 29 | dataset 30 | 31 | (head-points dataset) 32 | 33 | (defparameter msi-access (time-series-data dataset :range '(1) :time-label 0 :frequency 24 :start '(18 3))) 34 | 35 | msi-access 36 | 37 | (subseq(ts-points msi-access) 0 5) 38 | 39 | (progn 40 | (plot-dataset msi-access "hits" :terminal '(:png) 41 | :range '(0 40) :title "MSI Access Log - first 40 points" :ytics-font ",8" :xtics-font ",8" 42 | :xlabel-font ",15" :ylabel-font ",15" :output "msi_access_log_40.png") 43 | (display-png (png-from-file "msi_access_log_40.png"))) 44 | 45 | (progn 46 | (plot-dataset msi-access "hits" :terminal '(:png ) 47 | :title "MSI Access Log" :ytics-font ",8" :xtics-font ",8" 48 | :xlabel-font ",15" :ylabel-font ",15" :xtic-interval 500 :output "msi_access_log.png") 49 | (display-png (png-from-file "msi_access_log.png"))) 50 | 51 | (defparameter c-msi-access 52 | (ts-cleaning msi-access :outlier-types-alist '(("hits" . :std-dev)) 53 | :outlier-values-alist '((:std-dev . 5)) 54 | :interp-types-alist '(("hits" . 
:mean)))) 55 | 56 | (let ((png-file "clean-msi-access-log")) 57 | (plot-dataset c-msi-access "hits" :terminal '(:png) 58 | :title "Cleaned MSI Access Log" :ytics-font ",8" :xtics-font ",8" 59 | :xlabel-font ",15" :ylabel-font ",15" :xtic-interval 500 60 | :yrange '(0 8000) 61 | :output png-file) 62 | (display-png (png-from-file png-file))) 63 | -------------------------------------------------------------------------------- /CLML-Wine-pca-k-means-and-hierarchical-clustering.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# CLML, Wine, PCA, K-Means and Hierarchical Clustering" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "> (C) 2015 Mike Maul -- CC-BY-SA 3.0" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "This document is part of a series of tutorials illustrating the use of CLML." 
22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "### Caveat\n", 29 | "Anyone wishing to run this notebook or the code contained therein must take note of the following:\n", 30 | " - This tutorial relies on the github version of [`CLML` https://github.com/mmaul/clml.git](https://github.com/mmaul/clml.git) or a quicklisp dist of `CLML` newer than 20150805 \n", 31 | " - The plotting portion of this code requires the system [`clml.extras` https://github.com/mmaul/clml.extras.git](https://github.com/mmaul/clml.extras.git) cloned after 2015-09-10, which is not currently in quicklisp.\n", 32 | " - While the above git repositories are not in quicklisp, they can be loaded by `quickload` by placing the repositories in $HOME/quicklisp/local-projects" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "## Introduction\n", 40 | "This article will discuss two clustering techniques, k-means and Hierarchical Clustering.\n", 41 | "Clustering is an unsupervised learning technique, with the goal being to group samples into a given number of partitions.\n", 42 | "Clustering uses the similarity between examples and groups them based on their mutual similarities.\n", 43 | "\n", 44 | "## Wine\n", 45 | "\n", 46 | "Wine is such a multidimensional beverage. A feast for the senses: taste, smell, even for the eyes. It also pairs well with statistical analysis techniques and even Lisp. What do wine and statistical analysis have to do with one another, you may ask? Well, the Wine dataset is the 'go-to' dataset used in just about every introduction to cluster analysis. The Wine dataset is a chemical analysis of three types of wines grown in a region of Italy. The dataset contains an analysis of 178 samples, with 13 results of chemical assays for each sample. It is small enough yet contains enough complexity to be interesting. 
\n", 47 | "\n", 48 | "While we're on the subject of wine, if you find yourself in eastern Washington, I highly recommend stopping by the [Paradisos del Sol](www.paradisosdelsol.com) winery: good wine, a down-to-earth atmosphere and a really interesting and knowledgeable proprietor. \n", 49 | "\n", 50 | "But back to the cluster analysis...\n", 51 | "\n", 52 | "Let's get started by loading the systems necessary for this tutorial and creating a namespace to work in." 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 1, 58 | "metadata": { 59 | "collapsed": false, 60 | "scrolled": true 61 | }, 62 | "outputs": [ 63 | { 64 | "name": "stdout", 65 | "output_type": "stream", 66 | "text": [ 67 | "To load \"clml.utility\":\n", 68 | " Load 1 ASDF system:\n", 69 | " clml.utility\n", 70 | "\n", 71 | "; Loading \"clml.utility\"\n", 72 | "....\n", 73 | "To load \"clml.hjs\":\n", 74 | " Load 1 ASDF system:\n", 75 | " clml.hjs\n", 76 | "\n", 77 | "; Loading \"clml.hjs\"\n", 78 | "\n", 79 | "To load \"clml.pca\":\n", 80 | " Load 1 ASDF system:\n", 81 | " clml.pca\n", 82 | "\n", 83 | "; Loading \"clml.pca\"\n", 84 | "\n", 85 | "To load \"clml.clustering\":\n", 86 | " Load 1 ASDF system:\n", 87 | " clml.clustering\n", 88 | "\n", 89 | "; Loading \"clml.clustering\"\n", 90 | "\n", 91 | "To load \"iolib\":\n", 92 | " Load 1 ASDF system:\n", 93 | " iolib\n", 94 | "\n", 95 | "; Loading \"iolib\"\n", 96 | ".....\n", 97 | "To load \"clml.extras.eazy-gnuplot\":\n", 98 | " Load 1 ASDF system:\n", 99 | " clml.extras.eazy-gnuplot\n", 100 | "\n", 101 | "; Loading \"clml.extras.eazy-gnuplot\"\n", 102 | "\n", 103 | "To load \"eazy-gnuplot\":\n", 104 | " Load 1 ASDF system:\n", 105 | " eazy-gnuplot\n", 106 | "\n", 107 | "; Loading \"eazy-gnuplot\"\n", 108 | "\n" 109 | ] 110 | }, 111 | { 112 | "data": { 113 | "text/plain": [ 114 | "#" 115 | ] 116 | }, 117 | "execution_count": 1, 118 | "metadata": {}, 119 | "output_type": "execute_result" 120 | } 121 | ], 122 | 
"source": [ 123 | 
"(progn\n", 124 | " (ql:quickload '(:clml.utility ; Need clml.utility.data to get data from the net\n", 125 | " :clml.hjs ; Need clml.hjs.read-data to poke around the raw dataset\n", 126 | " :clml.pca\n", 127 | " :clml.clustering\n", 128 | " :iolib\n", 129 | " :clml.extras.eazy-gnuplot\n", 130 | " :eazy-gnuplot\n", 131 | " ))\n", 132 | " (defpackage #:wine\n", 133 | " (:use #:cl\n", 134 | " #:cl-jupyter-user\n", 135 | " #:clml.hjs.read-data\n", 136 | " #:clml.utility.data\n", 137 | " #:clml.hjs.vector\n", 138 | " #:clml.hjs.matrix\n", 139 | " #:clml.hjs.k-means\n", 140 | " #:clml.pca\n", 141 | " #:clml.clustering.hc\n", 142 | " #:eazy-gnuplot\n", 143 | " ))\n", 144 | ")" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 2, 150 | "metadata": { 151 | "collapsed": false 152 | }, 153 | "outputs": [ 154 | { 155 | "data": { 156 | "text/plain": [ 157 | "#" 158 | ] 159 | }, 160 | "execution_count": 2, 161 | "metadata": {}, 162 | "output_type": "execute_result" 163 | } 164 | ], 165 | "source": [ 166 | "(in-package :wine)" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": {}, 172 | "source": [ 173 | "##Take time to get to know your wine\n", 174 | "\n", 175 | "This tutorial illustrates how to use clustering, as well as how to use Principal Component Analysis in support of k-means and also hierarchical clustering in CLML. Just like swilling a bottle of wine is not a good thing neither is turning loose analysis techniques willy nilly on data. It is necessary to understand your data first, as the saying goes, garbage in garbage out...\n", 176 | "\n", 177 | "Another take away from this tutorial is working with diverse data. 
Our dataset comes from the UCI Machine Learning archive; the convention there is a headerless CSV file with a separate file ending in `.names` containing a description of the dataset and the column names.\n", 178 | "\n", 179 | "So armed with some information about the origin and meaning of the data, its organization and its location, we can begin.\n", 180 | "The dataset is located at http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine\n", 181 | "\n", 182 | "Well, almost ready. Another thing that is necessary is to assess the data so we can ingest it properly. This is something that generally needs to be done manually or can be skipped with prior knowledge of the dataset. \n", 183 | "\n", 184 | "The `wine.names` file is rather verbose so we will just list the column names:\n", 185 | " 1. Alcohol\n", 186 | " 2. Malic acid\n", 187 | " 3. Ash\n", 188 | " 4. Alcalinity of ash \n", 189 | " 5. Magnesium\n", 190 | " 6. Total phenols\n", 191 | " 7. Flavanoids\n", 192 | " 8. Nonflavanoid phenols\n", 193 | " 9. Proanthocyanins\n", 194 | " 10. Color intensity\n", 195 | " 11. Hue\n", 196 | " 12. OD280/OD315 of diluted wines\n", 197 | " 13. Proline \n", 198 | "\n", 199 | "So let's take a peek at the data." 
200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 3, 205 | "metadata": { 206 | "collapsed": false 207 | }, 208 | "outputs": [ 209 | { 210 | "data": { 211 | "text/plain": [ 212 | "UCI-WINE" 213 | ] 214 | }, 215 | "execution_count": 3, 216 | "metadata": {}, 217 | "output_type": "execute_result" 218 | } 219 | ], 220 | "source": [ 221 | "(defparameter uci-wine \n", 222 | " (read-data-from-file\n", 223 | " (fetch \"http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data\")\n", 224 | " :type :csv))" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": 4, 230 | "metadata": { 231 | "collapsed": false 232 | }, 233 | "outputs": [ 234 | { 235 | "data": { 236 | "text/plain": [ 237 | "\"#(#(1 13.2 1.78 2.14 11.2 100 2.65 2.76 .26 1.28 4.38 1.05 3.4 1050)\n", 238 | " #(1 13.16 2.36 2.67 18.6 101 2.8 3.24 .3 2.81 5.68 1.03 3.17 1185)\n", 239 | " #(1 14.37 1.95 2.5 16.8 113 3.85 3.49 .24 2.18 7.8 .86 3.45 1480)\n", 240 | " #(1 13.24 2.59 2.87 21 118 2.8 2.69 .39 1.82 4.32 1.04 2.93 735)\n", 241 | " #(1 14.2 1.76 2.45 15.2 112 3.27 3.39 .34 1.97 6.75 1.05 2.85 1450))\"" 242 | ] 243 | }, 244 | "execution_count": 4, 245 | "metadata": {}, 246 | "output_type": "execute_result" 247 | } 248 | ], 249 | "source": [ 250 | "(format nil \"~A\" (head-points uci-wine))" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": { 256 | "collapsed": false 257 | }, 258 | "source": [ 259 | "So it is all numeric data. The first column, per the dataset definition, is the class, so we don't want to give that to our\n", 260 | "clustering algorithm as that would be cheating. 
The numeric columns should be of type double-float.\n", 261 | "\n", 262 | "The k-means and hierarchical clustering implementations require a numeric-dataset, which we create by using `pick-and-specialize-data`" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 18, 268 | "metadata": { 269 | "collapsed": false 270 | }, 271 | "outputs": [ 272 | { 273 | "data": { 274 | "text/plain": [ 275 | "WINE-CLASSIFICATIONS" 276 | ] 277 | }, 278 | "execution_count": 18, 279 | "metadata": {}, 280 | "output_type": "execute_result" 281 | } 282 | ], 283 | "source": [ 284 | "(let ((wine-unspecialized (read-data-from-file\n", 285 | " (fetch \"http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data\")\n", 286 | " :type :csv\n", 287 | " :csv-type-spec '(integer\n", 288 | " double-float double-float double-float\n", 289 | " double-float double-float double-float\n", 290 | " double-float double-float double-float\n", 291 | " double-float double-float double-float\n", 292 | " double-float)\n", 293 | " :csv-header-p ( list \"Class\"\n", 294 | " \"Alcohol\" \"Malic acid\" \"Ash\"\n", 295 | " \"Alcalinity of ash\" \"Magnesium\" \"Total phenols\"\n", 296 | " \"Flavanoids\" \"Nonflavanoid phenols\" \"Proanthocyanins\"\n", 297 | " \"Color intensity\" \"Hue\" \"OD280/OD315 of diluted wines\"\n", 298 | " \"Proline\")\n", 299 | " )))\n", 300 | " (defparameter wine\n", 301 | " (pick-and-specialize-data \n", 302 | " wine-unspecialized \n", 303 | " :range '(1 2 3 4 5 6 7 8 9 10 11 12 13) :data-types (make-list 13 :initial-element :numeric )))\n", 304 | " (defparameter wine-with-classifications\n", 305 | " (pick-and-specialize-data \n", 306 | " wine-unspecialized \n", 307 | " :data-types (make-list 14 :initial-element :numeric )))\n", 308 | " \n", 309 | " (defparameter wine-classifications (loop for r across (dataset-points wine-with-classifications) \n", 310 | " when (= (elt r 0) 1d0) count r into one \n", 311 | " when (= (elt r 0) 2d0) count r into two \n", 312 | 
" when (= (elt r 0) 3d0) count r into three \n", 313 | " finally (return (list (list 1 one) (list 2 two) (list 3 three)))))\n", 314 | "\n", 315 | ")" 316 | ] 317 | }, 318 | { 319 | "cell_type": "markdown", 320 | "metadata": { 321 | "collapsed": false 322 | }, 323 | "source": [ 324 | "## Letting the wine breathe\n", 325 | "Letting wine breathe means exposing it to the surrounding air, allowing the wine to mix and mingle with the air and the surrounding temperature. Similarly, our data needs to \"mix and mingle\" so that the scales become more uniform, which is helpful to clustering algorithms.\n", 326 | "\n", 327 | "So we need to consider whether the data needs to be standardized. Generally the answer will be yes, unless the \n", 328 | "columns are all in the same units and have low variance. With this dataset specifically, since the columns use different units, scaling is necessary. Our implementation of `k-means` can do the standardization for us; however, we will need to do it ourselves for the principal component analysis. The `standardize` function used here computes the z-score, $(c - \\mu) / \\sigma$, where $\\mu$ and $\\sigma$ are the mean and standard deviation of the column containing the value $c$.\n", 329 | "\n", 330 | "## Savoring the wine\n", 331 | "Savoring is the process of detecting the essential properties, generally using taste; we will, however, use principal component analysis instead. \n", 332 | "\n", 333 | "So what columns should be evaluated? We could evaluate all of them; that might be okay, and would work in this case. However, in the case of large datasets it may be necessary to reduce the dimensionality to decrease execution time. It could also be the case that some relatively unimportant columns could be throwing off the results.\n", 334 | "\n", 335 | "#### Principal Component Analysis\n", 336 | "Well, one technique we can use to select the relevant columns for analysis is principal component analysis. 
Principal Component Analysis applies an orthogonal transformation that can be used to reduce dimensionality in a dataset. 
One of the things that PCA can tell us is the magnitude of the contribution of each column to the variation in the data. \n", 337 | "\n", 338 | "PCA is one of those algorithms that needs data on a comparable scale, so we are going to standardize the dataset first. \n", 339 | "\n", 340 | "Let's compute a few things we will be needing.\n", 341 | " - `standardized-wine`: The standardized copy of the wine dataset\n", 342 | " - `pca-result`: The result of the PCA analysis of the standardized wine dataset" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": 6, 348 | "metadata": { 349 | "collapsed": false 350 | }, 351 | "outputs": [ 352 | { 353 | "data": { 354 | "text/plain": [ 355 | "PCA-RESULT" 356 | ] 357 | }, 358 | "execution_count": 6, 359 | "metadata": {}, 360 | "output_type": "execute_result" 361 | } 362 | ], 363 | "source": [ 364 | "(progn \n", 365 | " (defparameter standardized-wine (copy-dataset wine))\n", 366 | " (setf (dataset-numeric-points standardized-wine) (standardize (dataset-numeric-points standardized-wine)))\n", 367 | " \n", 368 | " (defparameter pca-result (princomp standardized-wine))\n", 369 | ")" 370 | ] 371 | }, 372 | { 373 | "cell_type": "markdown", 374 | "metadata": {}, 375 | "source": [ 376 | "One of the methods of evaluating the choice of principal components is the 'elbow' method. The elbow method involves choosing the number of components at which the graph bends (the elbow). In the graph below three or four would be a reasonable choice. \n", 377 | "\n", 378 | "Specifically, `contributions` returns the variances of the components." 
379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": 7, 384 | "metadata": { 385 | "collapsed": false 386 | }, 387 | "outputs": [ 388 | { 389 | "data": { 390 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAMAAAACDyzWAAABMlBMVEX///8AAACgoKD/AAAAwAAAgP/AAP8A7u7AQADIyABBaeH/wCAAgEDAgP8wYICLAABAgAD/gP9//9SlKir//wBA4NAAAAAaGhozMzNNTU1mZmZ/f3+ZmZmzs7PAwMDMzMzl5eX////wMjKQ7pCt2ObwVfDg///u3YL/tsGv7u7/1wAA/wAAZAAA/38iiyIui1cAAP8AAIsZGXAAAIAAAM2HzusA////AP8AztH/FJP/f1DwgID/RQD6gHLplnrw5oy9t2u4hgv19dyggCD/pQDugu6UANPdoN2QUEBVay+AFACAFBSAQBSAQICAYMCAYP+AgAD/gED/oED/oGD/oHD/wMD//4D//8DNt57w//Cgts3B/8HNwLB8/0Cg/yC+vr6fn58fHx+/v79fX1/f398/Pz+/PMttAAAACXBIWXMAAA7EAAAOxAGVKw4bAAAPu0lEQVR4nO3dCXKjOhRAUSgvg81JDPvfQsw8xgbzxNNwT1Xnd9qxlPa/jQFjJcsAAAAAAAAAAAAAAAAAAAAAAAAAAF7Jh1/HtwBH8k5TtL83TZ7bqh5usYuvWt+y0qyH+3I7sNYXUxfvAqumzOqysGX3R8YW0xetb1mzq8/2AdrdnwCzfPqvGVIp+vBsOaWzvMXY3Jr+5rypuy1o1X/WPwWXTX97Pgze3Z73o6zvmdXvzWpzuFVFOsYAbdaY5Z+bZv6DxS3mvRWs2y1hm07RjAPk7dNzF2B7e/v1Y4DTr/092y8zPEMnbn4KtquN0TuVaRO4uKVpn4XLpt3Uve81HXa0n/WpmfH2cfDx1z/3ROKGgxCz6cGMm6juaxZf3j135pvAFp/Vw+92Ae7vWVWGJ+DkzXGttoDd3tq4CVzcko8f/wvw8PbjW7LCsg+YvDnA5T6g6TeMudne0rVY289bwN3t+eE9289L9gETNwdYDhu6opqSGw4RFrfMe3LjnTcBtncsq77E8ngfcDUtO4KJWwTQnefrzvbNhx/l5hbT3WoWGdnx+KNPrRmOdZuiPc0y3D4eBa/vyVEw1lug6fWO6Tl37GN+JWQ8mzdlVNrVE+54HvBdX38msb19dR5wvifnAQEAAAAAAAAAAAAAAAAAAAAgBVX3JjHWOYES3rsAVQQIVZYAocmObzcENDRF3b+tGlBTbo6Cc2DyRIH5x08FR2aolIc60h2DbJdy8vPbZ6jQhzpyuA/o57fPUKEPdaSu7P4o2M9vn6FCH8rrOeEpAoQqAoQqAoQqAoQqAoQqAoQqAoQqAoQqAoQqAoQqAoQqAoQqAoQqAoQqfwN8Of4u4AVvA3y9KDAF3gbIFjANBAhV/gZIgUkgQKgiQKgiQKjyOEAKTAEBQpWTAM046vES0QSIiYsAazuOerxCLwFi4iLAppQJkAIT4CDAwkyjHi8RTYCYyAfYrgY4BXi4RDQBYiIeYN1u9aan4MMlogkQE/EAq2I76naJ6Oz00sAEGDcna0QPS08Xyz/afsnpwSgwem5ORI+jHi4RTYCYOQwwv70PSIDxcxvg4RLRBIiZz68FZxQYPwKEKgKEKgKEKs8DpMDYESBUESBUESBUESBU+R4gBUaOAKGKAKGKAKHK+wApMG4ECFUECFUECFX+B0iBUSNAqCJAqCJAqCJAqAogQAqMGQFCFQFC1QMBmu0cBIiJ+wDnFaN/npMC4+U+wHnF6J/nJMB4OQ9wsWL0z3MSYLxcB7hcMfrnOQkwXo4DXK0Y/fucFBgtxwHuV4zOflkamAC
j5GSN6IMp1itGswXEwhMnonkKxr8IEKrCCJACoxXCa8EZAcaLAKGKAKEqkAApMFYECFUECFUECFWhBEiBkSJAqCJAqCJAqCJAqAomQAqMEwFCFQFCFQFCVTgBUmCUCBCqCBCqCBCqAgqQAmNEgFBFgFBFgFAVUoAUGCEChCr5AAubWzP8vurWxrJScxJgfMQDrIo6qxvTf9LUonMSYHzEAyzbD/Ww0SNAfOFmH3AM0MoGSIHxcRJgXZn+N7Zs5h1CgTkJMDouAszzavhd894hLHcFEiAmbreAnXJ3FPzz2tQEGBV3i5TXq+YEVkgdUWBsxAM07YcxwO4YpP1ZNVJzEmBsHJ4HzOX3AQkwOvJPwZXNbTkMXVdW9CiYAKMT1GvBGQVGhwChigChigChigChKrQAKTAyBAhVBAhVBAhVwQVIgXEhQKgiQKgiQKgKL0AKjAoBQhUBQhUBQhUBQlWAAVJgTAgQqggQqggQqkIMkAIjQoBQRYBQ5TrA4mBpBALExHGAqxWj5eakwGg4DnC1YrTcnAQYjSf2AQkQ/3ogwPV6qSJzEmA03Ac4rxgtNycBRkNlC3h/aWAKjIG7NaLX5PcBCTAajgM07QcCxL/CPA9IgNFw/RQ8rRgtPCcFRiLI14IzAowGAULV+RjK97Ppfz+A1dWc/yPASJyOwVhTv7+42J5VdjnnJxQYh9MxtIcS7Y+eEaiHADE5HUO++PXUnJ8QYByubgF3P3vV5ZyfEGAcLu4D7n/yoMs5PyHAOFw4Cm7y3Dbbk8pu5/yEAqMQ6nlAAowEAULV6Rjq/uee+3IimgAjcTqG4ZIW0zw452cUGINL5wEv3UFgzs8IMAbnzwP2z73bi0udzvkZAcbg/HnApqyz2p/zgAQYhyvnAa1X5wEpMArhnoYhwCgQIFQRIFRdOBGd50JvJCZATM6fiC4EXgO5OOc3FBi+yyein5zzGwIM39UT0Y/O+Q0Bhu90DKXIKcBrc35DgOE7vwXMTx6EFIu1EKruHttX7+SezSkweOKnYVarwRxfvEWAmFyN4evbMk33VcNGjwDxxfkYqu5J2BZnvnYM8PjIhQAxOR1DVdXvLzbnroguhkxt2ex/TIjk0z4Fhu7SaZj3F5szS3NMl023Z6/3V3ARICaXTkTbcxekbtaP2b2XXW5p4NeLAgN2KYR2CzhsBT+rq+1u4vYukltAAgzc+X1A0+3afX9T0vKEdbfDWG7vInnqhwADd+1tmVX+9SCkMouhne8DUmDoxE9ED6+X9E/WdXXwwzIJELOQL0jtUWDQCBCqzsWQT0+tPl0RPaDAkIW/BSTAoAV9RfSAAgMW9BXRAwIM2Pkroiv/rogeUWC45K+IFpzzLAIMVwQHIRkFBkz8imgHc35HgMFyc0W01JynUWCo3FwRLTTneQQYKidXREvNeQEFBsrFFdFic15AgIGSvyJacM4rKDBM8ldEC855BQGGSfyKaMk5L6HAIJ2MQW5lovNzXkSAQToZQ2HH5V6em/MqCgzRhZ8XLPEjQq7NeQ0BhuhCDGIJunr9mQIDdCmGspF4JY4AMbsYw3sr+Pic51FgeC5vAb09DZMRYIiu7QNKPAG7vAaRAoPj+ih4uWL05TkvI8DgnIzB/HgecLVi9MU5f0GBoXH8SohpP2yvoCFATJ54T8iDAVJgaJ4IsNgcuxAgJg8EuLuCy+mcFBgW9wEWu4v45daIPkCA4XAawmC/YrTr6CkwKK4DPDp8JkBMHAdYmefnpMCQOI5hXjH6uTkJMCRxrA2zRoEBIUCoijFACgwIAUJVlAFSYDgIEKriDJACg0GAUBVpgBQYCgKEqlgDpMBAECBURRsgBYaBAKEq3gApMAgECFURB0iBISBAqIo5QAoMAAFCVdQBUqD/CBCq4g6QAr1HgFAVeYAU6Dv5GOpqHrPq1kXY/mQHAsREPAZjzTzm8c/WfHSrS4F+E4+hqjMCxGkuYpjHtPoBUqDfHAdYNvn+x4sQICZuA2y
KOit3BT585E2BPnMbYKfcHQU/sDTwAgH6ylUI2zG/fe4aBXrMbYDdMUj56I9pOECAHnMYYO7JPiAF+kw8hmFV6H7ourLqR8Hv/l4U6K3YXwvuEKC/kgiQJ2F/JRIgBfoqlQAp0FPJBEiBfkonQAr0UkIBUqCPUgqQAj2UVIAU6J+0AqRA7yQWIAX6JrUAKdAzyQVIgX5JL0AK9EqCAVKgT1IMkAI9kmSAFOiPNAOkQG8kGiAF+oIAoSrVACnQE8kGSIF+SDdACvRCwgFSoA9SDpACPZB0gBSoz3UMyyXLn5rzAgrU5jiG1ZLlD815CQUqcxzDasnyh+a8hgJ1uY/B8wApUBcBUqAqAqRAVSoBPrtI+XcUqOOZEPzfAlKgIgLsUKAWAuxRoBLHMcxLlj83529YR1pH2q8FL7xeLKavgQBHbX2vgfb3khACPEKHjyHAD9ggukeAJyxCJEZhBHgBm0N5BHjNi/M1sgjwOhIURIC/4JlYDAH+iARlEODPSFACAd7AM/F9BHgPCd5EgHeR4C0EeB8J3kCAEtgZ/BkBCiHB3xCgGBL8BQEK4pn4OgKURYIXEaC0d4JEeB4ByuOywQsI0IH5/U3a34n/CNApMvyGAB/Ae5v+Jx+Dsbk1w++rbl0E63zOMJDhAfEYSlt2vzpN/cicQeENdiviMVTm/cFU/ScE+B82hgPxGGzbXG0Xn7ifM0y8wa4lHkO+HNaWzbxD6G7OcLEZdBtgU9TvHULjes6gJZ6g2wA75e4o2Lc1opUluxl0EsJqH3CYZzuv9JzhSzVBV0fBxXAU3B2DlI3rOWOQ6mZQPIa6Ow9Y90OzD3hFkgnKx1DavD8Pnbc/K9NyFHxBggnyWrBfknsmJkDvpJUgAXoopc0gAfopmQQJ0FeJbAYJ0GMpvL+JAL0W/zWsBOi3Lr7XRPvbkUeAQYmvRAIM03qjGHCPBBi6V9hP0AQYgyG9V4AxEmCk9jH62SQBJuC1of39LBFgMubutkEum3y6TgJES20TSYBYWZ/5dj8fAeID9xtFAsQprk58EyCuEd4kEiCum5eAvR0iAeKemx0SIET8ukEkQIhahHgqRgKEE2c3h65jMAdLIxBgGnzYAq5WjH5oToTEcQyVyeYVox3MyVAM9dHBaoG+fvsMFfpQ/w+fH/2h4PgMxVAfhydAhnpgqP+H3wYITNwGeLgPCDylMtm8YjTwtHnFaEBDOa4YDQAAAAAAAADAM44u0v9RIXqW2wi9Ll43uRV68bFs8ub+X7Cuhr/Z/Yd+Gur+Qz8Nlck99GccXqT/m6qo3/+3jcRQWfuqocyj0H5HtczL36Z6/wUrc3cUO/z/vf/QT0Pdf+inoTK5h/6UymT7i/R/044kd6lNU8o8CkXRfjQSQ8lcS/SueKjGZDcf+mmodqR739k0VCb30J8ifYGW1FCFEbo2UuA5cyT2WOVywy0eJKmhxB76C7PKTdhvcG4rG6lvyrY7bjLX/5hG4ik4Wz/oN/+Wi7vffeiHoeQe+guzik1oGpFh6nbrIPNNtYcNpVCBVZ7nEnsrTgK8/dD3Qwk+9BdmlZpQ6lrXqv3HLBRg+xRcivy76LaAEkdZLgK8/9CPxzPZowFK7gPWlczz7/Q2FYnh+sNMkUfU431AiYd++Ach99CfUplMbMMluL/fkvln2P2fKUX+hQkHWJns/kM/PEgSD31++FvXBC/Sr4zEKDOh84DWvJ+BjcRQwgchIg/9ouW7dAIUvEh/2HiLveNE6FGQefmiH0risZrf/nh7uGmo+w/9+k2ZLBcEAAAAAAAAAAAAAAAAAABwHtde4kGmyXNbLa8cJkA8p2rKrC6L5YXwBIjHmOEdbe0aAuPSVPnwq/tgu7dpvD+072S3w38BIYs3x5n3VrB9W9o6wDa4wrYfmuETmZUegNbiPZDNuGDCOsD3n9b9h/ETne8UUVq+87V7k3m+DTBbfMg3dwFuWmwBp4VZCBCPWewDTstsECAeMy6EUVSbfcD
3H5cECOe6M4DdeUDTLcxi+sKaol3KPCNAuDa/ErI8D9gupN8toEWAAAAAAAAAAABo+gO+7fcsDNfsvAAAAABJRU5ErkJggg==", 391 | "text/plain": [ 392 | "#" 393 | ] 394 | }, 395 | "execution_count": 7, 396 | "metadata": {}, 397 | "output_type": "execute_result" 398 | } 399 | ], 400 | "source": [ 401 | "(let ((png \"wine-pca-contributions.png\"))\n", 402 | " (clml.extras.eazy-gnuplot::plot-series (contributions pca-result) \n", 403 | " :term '(:png) :output png :plot-title \"PCA Contributions\" :ylabel \"Variance\" :xlabel \"Column\" :series-title \"\")\n", 404 | " (display-png (png-from-file png)))" 405 | ] 406 | }, 407 | { 408 | "cell_type": "markdown", 409 | "metadata": {}, 410 | "source": [ 411 | "One of the primary uses of PCA is to reduce the dimensionality of data. In this case it allows us to make the decision that the first four columns are what drive the majority of the variation in the data. So, knowing this, we will be truncating the dataset to four columns. While this is not that critical in the current dataset, which has only thirteen dimensions, it can be critical when dealing with datasets with hundreds or thousands of columns. \n", 412 | "\n", 413 | "So let's partition off the dataset so it only contains the principal columns, as we will need that later." 
414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": 8, 419 | "metadata": { 420 | "collapsed": false 421 | }, 422 | "outputs": [ 423 | { 424 | "data": { 425 | "text/plain": [ 426 | "TRUNCATED-STANDARDIZED-WINE" 427 | ] 428 | }, 429 | "execution_count": 8, 430 | "metadata": {}, 431 | "output_type": "execute_result" 432 | } 433 | ], 434 | "source": [ 435 | "(defparameter truncated-standardized-wine (make-numeric-dataset \n", 436 | " (map 'list #'dimension-name (subseq (dataset-dimensions wine) 0 4)) \n", 437 | " (map 'vector (lambda (r) (subseq r 0 4)) (dataset-points standardized-wine))))" 438 | ] 439 | }, 440 | { 441 | "cell_type": "markdown", 442 | "metadata": {}, 443 | "source": [ 444 | "## On to the wine tasting, a blind tasting\n", 445 | "So for wine to be judged impartially it should be served without context and evaluated solely on its own merits. This is really what we are doing: we have already stripped the classifications from the dataset, and we aren't even tasting the wines ourselves. Our goal is to try to match the wine varietal type with the preexisting classifications.\n", 446 | "\n", 447 | "#### Deciding on the clusters\n", 448 | "We need to have some idea of how the data will be grouped. In some cases this will be known, in others it will not. In this case we know there are three different classifications.\n", 449 | "\n", 450 | "But what if we didn't know? \n", 451 | "\n", 452 | "One method is to look at the within-group sum of squares among the contributions of the principal components of the dataset. We calculate the first value, storing it in `wss1`; the remaining values can be calculated by taking the sum of squares of the distances returned from running k-means (as the `distance-between-point-and-owner` slot in the `pca-result` instance returned from k-means) on the selected principal components from the previous PCA analysis. 
" 453 | ] 454 | }, 455 | { 456 | "cell_type": "code", 457 | "execution_count": 12, 458 | "metadata": { 459 | "collapsed": false 460 | }, 461 | "outputs": [ 462 | { 463 | "data": { 464 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAMAAAACDyzWAAABMlBMVEX///8AAACgoKD/AAAAwAAAgP/AAP8A7u7AQADIyABBaeH/wCAAgEDAgP8wYICLAABAgAD/gP9//9SlKir//wBA4NAAAAAaGhozMzNNTU1mZmZ/f3+ZmZmzs7PAwMDMzMzl5eX////wMjKQ7pCt2ObwVfDg///u3YL/tsGv7u7/1wAA/wAAZAAA/38iiyIui1cAAP8AAIsZGXAAAIAAAM2HzusA////AP8AztH/FJP/f1DwgID/RQD6gHLplnrw5oy9t2u4hgv19dyggCD/pQDugu6UANPdoN2QUEBVay+AFACAFBSAQBSAQICAYMCAYP+AgAD/gED/oED/oGD/oHD/wMD//4D//8DNt57w//Cgts3B/8HNwLB8/0Cg/yC+vr6fn5/f398fHx8/Pz+/v79fX18J+Mu0AAAACXBIWXMAAA7EAAAOxAGVKw4bAAASn0lEQVR4nO2dC5ajKhQA5fQy3Bwg7n8LI2iM5jOBBORXdc7L8DrKzXRqQFC4wwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGWitMj9EaBw5KSFno338WY5XkzS72Cx1xt0GvTDPC2OKCl8DZTr8Vp6Hb23f2GnQTfIaf1znIdBG9thyqVBlMPmjn0RZhJ3cbRyf6j5foA9T5vlZVJ7ta4SIxYeT1vaQjEZcQigZtsGD48fQE3iUCO0ySTvZTEr+/Wb5crNHAVcGi+zG6j3pvJwgBpGbV82mx8qOZ9m31scPJw/SbVIOG8fwNj37eH2o93+fUCr6EMT47ph21Uughz9kttPHEbPcj3n0EQuzdX6sh10ruR82vreUUCH0rcPYMPJ+fAWNIx4LAuriTr6sf5E345Scukxp1MTOQwPOp0reXGaehJwEKcPsISbb8pCw6wt4HqtdjBCPPp1bo/caOJ/Ap4qeXHaKcDSf+vDB1hxP+YasHkmuRVu37++NUDvWsCV5f//I+C5kofTnprYeVTq8AGOhxuuARvH3C4Cb9///fJNrZdq65XZbsK0jiY2k8xrAZ+vAe+nuZIUw8P55vQBdrgQbJ1R2zGonG62uEGqHfNOo50IGW6j4JsW0h1vj7gfMAwPAu6V7P7cT5OL8m66ZT9/+eHyAZYfrybaEbQdT0+SUXAPyEmIZYg63Gw5TMNpY/vD8zzg/ZbG/YBheBDwPJl4Ps0GdPOA+/m2/lmN4v4B1vkY5gHBkqQXpGsFXxAQsoKAAAAAAAAAAAAAAAAAAAAAAAAA0ABq3p5dc1tK2MLtefN7ASAVUsttqcJsN5CQh70j9gJAMuZtsdZ9ratbwGP3jtgLACl5EPC5AJCSrQueti74Yd8JVjdAYjbFZiHEPCAgXM2hBZzkJwEFdE8aAT2vAeNFj1YTH+nSiuL3ia9GweN8KKSJ3vKvlo8UXuE+CFFu+k8dCmmit/yr5SP517Z3624r7uFlIUn0ln+1fKRkMCjuHgSErCAgZAUBISsICFlBQMgKAkJWEBCygoCQFQSErCAgZAUBISsICFlBQMgKAkJWEBCygoCQFQSErCAgZAUBISsICFlBQMgKAkJWsirwJ/5yhocCoAWErGQWkBawdxAQspK7C8bAzkFAyEpuATGwcxAQspJdQAzsGwSErCAgZCW/gBjYNQgIWSlAQAzsGQSErJQgIAZ2DAJCVooQEAP7JbKAat4qVJPQLjGc1ELLU+FFdATslrgCSi3XCm2qVmVTExqXIs4cCq+iI2C3xBVwVluF42hf5ZYk
", 465 | "text/plain": [ 466 | "#" 467 | ] 468 | }, 469 | "execution_count": 12, 470 | "metadata": {}, 471 | "output_type": "execute_result" 472 | } 473 | ], 474 | "source": [ 475 | "(let ((wss1 (* (- (length (components pca-result)) 1) \n", 476 | " (loop for v across (subseq (contributions pca-result) 0 4) sum v)))\n", 477 | " (comp-ds (make-numeric-dataset '(\"pc1\" \"pc2\" \"pc3\" \"pc4\")\n", 478 | " (map 'vector (lambda (r) (subseq r 0 4)) (components pca-result))))\n", 479 | " (png \"group-sum-of-squares.png\"))\n", 480 | " (clml.extras.eazy-gnuplot::plot-series \n", 481 | " (coerce (cons wss1 \n", 482 | " (loop for n from 2 upto 8 ; could go up to the number of dimensions, but the far end is generally irrelevant\n", 483 | " for k-means-n = (k-means n comp-ds :standardization nil \n", 484 | " :random-state (make-random-state-with-seed 100))\n", 485 | " collect (loop for x across 
(clml.hjs.k-means::pw-distance-between-point-and-owner k-means-n) \n", 486 | " sum (* x x)))) \n", 487 | " 'vector)\n", 488 | " :term '(:png) :output png :plot-title \"Group Sum of Squares\" :ylabel \"In group sum of squares\" \n", 489 | " :xlabel \"Clusters\" :series-title \"\")\n", 490 | " \n", 491 | " (display-png (png-from-file png)))" 492 | ] 493 | }, 494 | { 495 | "cell_type": "markdown", 496 | "metadata": {}, 497 | "source": [ 498 | "Again we use the \"elbow\" method to make our selection (additionally, when selecting either the number of clusters or the principal components, we can use the 85% percentile as the choice point).\n", 499 | "\n", 500 | "But what is really cool is that the elbow lies right on three clusters, which matches the number of clusters that had been manually classified in the wine dataset. \n", 501 | "\n", 502 | "#### K-means\n", 503 | "So we kind of jumped into using `k-means` before talking about it...\n", 504 | "Here is a simplified description of the k-means algorithm:\n", 505 | "\n", 506 | "###### How it works\n", 507 | "1. Initial clusters are randomly chosen.\n", 508 | "2. The squared distance from each object to each cluster is computed, then each object is assigned to the nearest cluster.\n", 509 | "3. New centroids are computed for each cluster, and each cluster is represented by its respective centroid.\n", 510 | "4. The squared distance from each object to each cluster centroid is computed, and each object is assigned to the cluster with the smallest squared distance.\n", 511 | "5. Based upon the new membership assignment, cluster centroids are recalculated.\n", 512 | "6. Steps 4 and 5 are repeated until stability is achieved.\n", 513 | "\n", 514 | "###### CLML bits\n", 515 | "The k in k-means is the number of clusters that k-means will be classifying. K is a required parameter in k-means, but how should we decide what to set it to? 
Well, we already know from the original dataset that there are three clusters.\n", 516 | "\n", 517 | "So for testing or teaching you will want to set a fixed `:random-state`, which will allow you to have repeatable results (it influences the choice of the seed clusters). \n", 518 | "But keep in mind that if you set it all the time you might get caught on a local optimum that you might have avoided if \n", 519 | "you had allowed the state to be random.\n", 520 | "\n", 521 | "The CLML implementation of k-means accepts a number of parameters for running k-means.\n", 522 | "\n", 523 | "* k-means arguments *\n", 524 | " - k: [integer], number of clusters\n", 525 | " - dataset: [numeric-dataset] | [category-dataset] | [numeric-or-category-dataset]\n", 526 | " - `:distance-fn`: #'euclid-distance | #'manhattan-distance | #'cosine-distance\n", 527 | " - `:standardization`: t | nil, whether to standardize the inputs\n", 528 | " - `:max-iteration`: maximum number of iterations of one trial (default 1000)\n", 529 | " - `:num-of-trials`: number of trials; every trial changes the initial position of the clusters (default 10)\n", 530 | " - `:random-state`: (for testing) specify the random-state of the random number generator\n", 531 | " - `:debug`: t | nil, (for debugging) print out some debugging information\n", 532 | "\n", 533 | "* k-means results *\n", 534 | " - workspace: points, clusters, distance information, etc.\n", 535 | " - table: lookup table for normalized vecs and original vecs, might be removed later.\n", 536 | "\n", 537 | "###### Running k-means\n", 538 | "Okay, so now that we know the number of clusters (well, we've always known, but we also arrived at the same choice on our own), let's run k-means on our dataset."
539 | ] 540 | }, 541 | { 542 | "cell_type": "code", 543 | "execution_count": 17, 544 | "metadata": { 545 | "collapsed": false 546 | }, 547 | "outputs": [ 548 | { 549 | "data": { 550 | "text/plain": [ 551 | "#" 553 | ] 554 | }, 555 | "execution_count": 17, 556 | "metadata": {}, 557 | "output_type": "execute_result" 558 | } 559 | ], 560 | "source": [ 561 | " (progn (defparameter workspace nil) \n", 562 | " (defparameter table nil)\n", 563 | " (multiple-value-setq (workspace table)\n", 564 | " (k-means 3 truncated-standardized-wine :standardization nil \n", 565 | " :random-state (make-random-state-with-seed 1234))))" 566 | ] 567 | }, 568 | { 569 | "cell_type": "markdown", 570 | "metadata": {}, 571 | "source": [ 572 | "And we see our cluster assignment in the pretty print of the object. Let's compare roughly with the counts of the classifications that came with the dataset. \n" 573 | ] 574 | }, 575 | { 576 | "cell_type": "code", 577 | "execution_count": 11, 578 | "metadata": { 579 | "collapsed": false 580 | }, 581 | "outputs": [ 582 | { 583 | "data": { 584 | "text/plain": [ 585 | "((1 59) (2 71) (3 48))" 586 | ] 587 | }, 588 | "execution_count": 11, 589 | "metadata": {}, 590 | "output_type": "execute_result" 591 | } 592 | ], 593 | "source": [ 594 | "(loop for c across (!! 
wine-with-classifications \"Class\") \n", 595 | " when (= c 1.0) count c into one when (= c 2.0) count c into two when (= c 3.0) count c into three \n", 596 | " finally (return (list (list 1 one) (list 2 two) (list 3 three))))" 597 | ] 598 | }, 599 | { 600 | "cell_type": "markdown", 601 | "metadata": {}, 602 | "source": [ 603 | "In comparison with the original classifications, we see our results are not too bad. (Mind you, the cluster numbers assigned are not important; the grouping is.)" 604 | ] 605 | }, 606 | { 607 | "cell_type": "code", 608 | "execution_count": 20, 609 | "metadata": { 610 | "collapsed": false 611 | }, 612 | "outputs": [ 613 | { 614 | "data": { 615 | "text/plain": [ 616 | "((1 59) (2 71) (3 48))" 617 | ] 618 | }, 619 | "execution_count": 20, 620 | "metadata": {}, 621 | "output_type": "execute_result" 622 | } 623 | ], 624 | "source": [ 625 | "wine-classifications" 626 | ] 627 | }, 628 | { 629 | "cell_type": "markdown", 630 | "metadata": {}, 631 | "source": [ 632 | "#### Hierarchical Clustering\n", 633 | "\n", 634 | "###### How it works\n", 635 | "The CLML implementation of hierarchical clustering uses agglomerative clustering (a bottom-up approach).\n", 636 | "A nice concise description of the process of agglomerative clustering from [improvedoutcomes.com](http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Agglomerative_Hierarchical_Clustering_Overview.htm) is as follows:\n", 637 | " - Assign each object to a separate cluster.\n", 638 | " - Evaluate all pair-wise distances between clusters (distance metrics are described in Distance Metrics Overview).\n", 639 | " - Construct a distance matrix using the distance values.\n", 640 | " - Look for the pair of clusters with the shortest distance.\n", 641 | " - Remove the pair from the matrix and merge them.\n", 642 | " - Evaluate all distances from this new cluster to all other clusters, and update the matrix.\n", 643 | " - Repeat until the distance matrix is reduced to a single 
element.\n", 644 | "\n", 645 | "###### CLML Bits\n", 646 | "The `distance-matrix` function takes an optional argument `:distance-fn` which specifies the distance function used in the distance computations. The distance functions available are:\n", 647 | " - euclidean (default)\n", 648 | " - manhattan\n", 649 | " - pearson\n", 650 | " - cosine\n", 651 | " - tanimoto\n", 652 | " - canberra\n", 653 | "\n", 654 | "The `cophenetic-matrix` function evaluates the distance matrix and returns the merge and cophenetic matrices. `cophenetic-matrix` takes an optional method argument that specifies how the distance matrix is evaluated (the example below passes it positionally as `#'hc-ward`). The evaluation methods are as follows.\n", 655 | " - `hc-single`\n", 656 | " - `hc-complete`\n", 657 | " - `hc-average` (default)\n", 658 | " - `hc-centroid`\n", 659 | " - `hc-median`\n", 660 | " - `hc-ward`\n", 661 | "\n", 662 | "`cutree` cuts the tree defined in the merge matrix into the specified number of pieces.\n", 663 | "\n", 664 | "###### Running Hierarchical Clustering" 665 | ] 666 | }, 667 | { 668 | "cell_type": "code", 669 | "execution_count": 16, 670 | "metadata": { 671 | "collapsed": false 672 | }, 673 | "outputs": [ 674 | { 675 | "name": "stdout", 676 | "output_type": "stream", 677 | "text": [ 678 | "Cut tree: #(1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n", 679 | " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 3 2 2 3 3 3 1 3 3\n", 680 | " 2 1 3 1 3 1 3 3 3 3 1 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 1 2 3 3 3 3 3\n", 681 | " 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 1 3 3 3 3 3 3 3 3 2 2 2 2 2 2\n", 682 | " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n", 683 | " 2 2 2 2 2 2 2 2) \n", 684 | "Class counts:((1 65) (2 54) (3 59)) \n" 685 | ] 686 | }, 687 | { 688 | "data": { 689 | "text/plain": [ 690 | "NIL" 691 | ] 692 | }, 693 | "execution_count": 16, 694 | "metadata": {}, 695 | "output_type": "execute_result" 696 | } 697 | ], 698 | "source": [ 699 | "(progn\n", 700 | " (defparameter 
distance-matrix (distance-matrix (numeric-matrix standardized-wine)))\n", 701 | " (defparameter u nil) (defparameter v nil)\n", 702 | " (multiple-value-setq (u v) (cophenetic-matrix distance-matrix #'hc-ward))\n", 703 | " (defparameter ctree (cutree 3 v))\n", 704 | " (format t \"Cut tree: ~A ~%Class counts:~A ~%\" ctree\n", 705 | " (loop for x across ctree \n", 706 | " when (= x 1) counting x into one \n", 707 | " when (= x 2) counting x into two \n", 708 | " when (= x 3) counting x into three \n", 709 | " finally (return (list (list 1 one) (list 2 two) (list 3 three)))))\n", 710 | ")" 711 | ] 712 | }, 713 | { 714 | "cell_type": "markdown", 715 | "metadata": {}, 716 | "source": [ 717 | "In comparison with the original classifications, we see our results are not too bad. (Mind you, the cluster numbers assigned are not important; the grouping is.)" 718 | ] 719 | }, 720 | { 721 | "cell_type": "code", 722 | "execution_count": 21, 723 | "metadata": { 724 | "collapsed": false 725 | }, 726 | "outputs": [ 727 | { 728 | "data": { 729 | "text/plain": [ 730 | "((1 59) (2 71) (3 48))" 731 | ] 732 | }, 733 | "execution_count": 21, 734 | "metadata": {}, 735 | "output_type": "execute_result" 736 | } 737 | ], 738 | "source": [ 739 | "wine-classifications" 740 | ] 741 | }, 742 | { 743 | "cell_type": "markdown", 744 | "metadata": {}, 745 | "source": [ 746 | "## Conclusion\n", 747 | "In general, k-means is computationally more efficient. The quality of the classification results of k-means vs. hierarchical clustering is up for debate. 
In the second part of this tutorial we will explore cluster validation, which will let us evaluate our results more thoroughly.\n", 748 | "\n", 749 | "The iPython notebook and source for this tutorial can be found in the [clml.tutorials](https://github.com/mmaul/clml.tutorials.git) GitHub repository.\n", 750 | "\n", 751 | "### Stay tuned to the [clml.tutorials](https://mmaul.github.io/clml.tutorials/) blog or [RSS feed](https://mmaul.github.io/clml.tutorials/feed.xml) for part 2 of this tutorial, 'CLML Wine and cluster validation', and for more CLML tutorials.\n" 752 | ] 753 | } 754 | ], 755 | "metadata": { 756 | "kernelspec": { 757 | "display_name": "SBCL Lisp", 758 | "language": "lisp", 759 | "name": "lisp" 760 | }, 761 | "language_info": { 762 | "name": "common-lisp", 763 | "version": "1.2.7" 764 | } 765 | }, 766 | "nbformat": 4, 767 | "nbformat_minor": 0 768 | } 769 | -------------------------------------------------------------------------------- /CLML-Wine-pca-k-means-and-hierarchical-clustering.lisp: -------------------------------------------------------------------------------- 1 | 2 | (progn 3 | (ql:quickload '(:clml.utility ; Need clml.utility.data to get data from the net 4 | :clml.hjs ; Need clml.hjs.read-data to poke around the raw dataset 5 | :clml.pca 6 | :clml.clustering 7 | :iolib 8 | :clml.extras.eazy-gnuplot 9 | :eazy-gnuplot 10 | )) 11 | (defpackage #:wine 12 | (:use #:cl 13 | #:cl-jupyter-user 14 | #:clml.hjs.read-data 15 | #:clml.utility.data 16 | #:clml.hjs.vector 17 | #:clml.hjs.matrix 18 | #:clml.hjs.k-means 19 | #:clml.pca 20 | #:clml.clustering.hc 21 | #:eazy-gnuplot 22 | )) 23 | ) 24 | 25 | (in-package :wine) 26 | 27 | (defparameter uci-wine 28 | (read-data-from-file 29 | (fetch "http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data") 30 | :type :csv)) 31 | 32 | (format nil "~A" (head-points uci-wine)) 33 | 34 | (let ((wine-unspecialized (read-data-from-file 35 | (fetch 
"http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data") 36 | :type :csv 37 | :csv-type-spec '(integer 38 | double-float double-float double-float 39 | double-float double-float double-float 40 | double-float double-float double-float 41 | double-float double-float double-float 42 | double-float) 43 | :csv-header-p ( list "Class" 44 | "Alcohol" "Malic acid" "Ash" 45 | "Alcalinity of ash" "Magnesium" "Total phenols" 46 | "Flavanoids" "Nonflavanoid phenols" "Proanthocyanins" 47 | "Color intensity" "Hue" "OD280/OD315 of diluted wines" 48 | "Proline") 49 | ))) 50 | (defparameter wine 51 | (pick-and-specialize-data 52 | wine-unspecialized 53 | :range '(1 2 3 4 5 6 7 8 9 10 11 12 13) :data-types (make-list 13 :initial-element :numeric ))) 54 | (defparameter wine-with-classifications 55 | (pick-and-specialize-data 56 | wine-unspecialized 57 | :data-types (make-list 14 :initial-element :numeric ))) 58 | 59 | (defparameter wine-classifications (loop for r across (dataset-points wine-with-classifications) 60 | when (= (elt r 0) 1d0) count r into one 61 | when (= (elt r 0) 2d0) count r into two 62 | when (= (elt r 0) 3d0) count r into three 63 | finally (return (list (list 1 one) (list 2 two) (list 3 three))))) 64 | 65 | ) 66 | 67 | (progn 68 | (defparameter standardized-wine (copy-dataset wine)) 69 | (setf (dataset-numeric-points standardized-wine) (standardize (dataset-numeric-points standardized-wine))) 70 | 71 | (defparameter pca-result (princomp standardized-wine)) 72 | ) 73 | 74 | (let ((png "wine-pca-contributions.png")) 75 | (clml.extras.eazy-gnuplot::plot-series (contributions pca-result) 76 | :term '(:png) :output png :plot-title "PCA Contributions" :ylabel "Variance" :xlabel "Column" :series-title "") 77 | (display-png (png-from-file png))) 78 | 79 | (defparameter truncated-standardized-wine (make-numeric-dataset 80 | (map 'list #'dimension-name (subseq (dataset-dimensions wine) 0 4)) 81 | (map 'vector (lambda (r) (subseq r 0 4)) (dataset-points 
standardized-wine)))) 82 | 83 | (let ((wss1 (* (- (length (components pca-result)) 1) 84 | (loop for v across (subseq (contributions pca-result) 0 4) sum v))) 85 | (comp-ds (make-numeric-dataset '("pc1" "pc2" "pc3" "pc4") 86 | (map 'vector (lambda (r) (subseq r 0 4)) (components pca-result)))) 87 | (png "group-sum-of-squares.png")) 88 | (clml.extras.eazy-gnuplot::plot-series 89 | (coerce (cons wss1 90 | (loop for n from 2 upto 8 ; could go up to the number of dimensions, but the far end is generally irrelevant 91 | for k-means-n = (k-means n comp-ds :standardization nil 92 | :random-state (make-random-state-with-seed 100)) 93 | collect (loop for x across (clml.hjs.k-means::pw-distance-between-point-and-owner k-means-n) 94 | sum (* x x)))) 95 | 'vector) 96 | :term '(:png) :output png :plot-title "Group Sum of Squares" :ylabel "In group sum of squares" 97 | :xlabel "Clusters" :series-title "") 98 | 99 | (display-png (png-from-file png))) 100 | 101 | (progn (defparameter workspace nil) 102 | (defparameter table nil) 103 | (multiple-value-setq (workspace table) 104 | (k-means 3 truncated-standardized-wine :standardization nil 105 | :random-state (make-random-state-with-seed 1234)))) 106 | 107 | (loop for c across (!! 
wine-with-classifications "Class") 108 | when (= c 1.0) count c into one when (= c 2.0) count c into two when (= c 3.0) count c into three 109 | finally (return (list (list 1 one) (list 2 two) (list 3 three)))) 110 | 111 | wine-classifications 112 | 113 | (progn 114 | (defparameter distance-matrix (distance-matrix (numeric-matrix standardized-wine))) 115 | (defparameter u nil) (defparameter v nil) 116 | (multiple-value-setq (u v) (cophenetic-matrix distance-matrix #'hc-ward)) 117 | (defparameter ctree (cutree 3 v)) 118 | (format t "Cut tree: ~A ~%Class counts:~A ~%" ctree 119 | (loop for x across ctree 120 | when (= x 1) counting x into one 121 | when (= x 2) counting x into two 122 | when (= x 3) counting x into three 123 | finally (return (list (list 1 one) (list 2 two) (list 3 three))))) 124 | ) 125 | 126 | wine-classifications 127 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # clml.tutorials 2 | Tutorials for CLML - Common Lisp Machine Learning and Statistics Systems 3 | 4 | ## Tutorials Blog 5 | All tutorials are available on the web at the [clml.tutorials blog](https://mmaul.github.io/clml.tutorials/) 6 | 7 | ## Available Tutorials 8 | Tutorials are available in HTML and ipynb (iPython Notebook) formats 9 | 10 | - CLML Time Series Tutorial Part 1 [(HTML)](https://mmaul.github.io/clml.tutorials//2015/08/08/CLML-Time-Series-Part-1.html) [(ipynb)](https://github.com/mmaul/clml.tutorials/blob/master/CLML-Time-Series-Part-1.ipynb) 11 | - CLML Dataset Tutorial [(HTML)](https://mmaul.github.io/clml.tutorials//2015/08/29/CLML-Datasets.html) [(ipynb)](https://github.com/mmaul/clml.tutorials/blob/master/CLML-Datasets.ipynb) 12 | 13 | ## Using the ipynb notebooks 14 | The fishbowl interface to iPython is necessary if the tutorial notebooks are to be executed locally. 
Fishbowl can be found at https://github.com/fredokun/fishbowl-repl 15 | --------------------------------------------------------------------------------