├── .gitignore
├── 00-intro.ipynb
├── 01-dask.delayed.ipynb
├── 02-dask-arrays.ipynb
├── 03-dask-dataframes.ipynb
├── 04-schedulers.ipynb
├── 05-distributed-dataframes-and-efficiency.ipynb
├── 06-distributed-advanced.ipynb
├── 07-machine-learning.ipynb
├── README.md
├── environment.yml
├── prep_data.py
├── pycon_utils.py
├── requirements.txt
├── solutions
├── 00-hello-world.py
├── 01-delayed-control-flow.py
├── 01-delayed-groupby.py
├── 01-delayed-loop.py
├── 02-dask-arrays-blocked-mean.py
├── 02-dask-arrays-make-arrays.py
├── 02-dask-arrays-stacked.py
├── 02-dask-arrays-store.py
├── 02-dask-arrays-weather-difference.py
├── 02-dask-arrays-weather-mean.py
├── 03-dask-dataframe-delay-per-airport.py
├── 03-dask-dataframe-delay-per-day.py
├── 03-dask-dataframe-map-partitions.py
├── 03-dask-dataframe-non-cancelled-per-airport.py
├── 03-dask-dataframe-non-cancelled.py
├── 03-dask-dataframe-rows.py
├── 05-distributed-dataframes-memory-usage.ipynb
└── client_submit.py
└── static
├── fail-case.gif
├── ml-dimensions-color.png
├── ml-dimensions.png
├── sklearn-parallel-dask.png
└── sklearn-parallel.png
/.gitignore:
--------------------------------------------------------------------------------
1 | data/nycflights/*
2 | data/flightjson/*
3 | .ipynb_checkpoints/*
4 | .idea/
5 | __pycache__
6 | .env/
7 | data/
8 | mydask.png
9 | profile.html
--------------------------------------------------------------------------------
/00-intro.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {},
16 | "source": [
17 | "# Introduction\n",
18 | "\n",
19 | "Welcome to the Dask Tutorial.\n",
20 | "\n",
21 | "Dask is a parallel computing library that scales the existing Python ecosystem. This tutorial will introduce Dask and parallel data analysis more generally.\n",
22 | "\n",
23 | "Dask can scale down to your laptop and up to a cluster. Accordingly, the tutorial comes in two pieces. In the first part, we'll use the environment you set up on your laptop to analyze medium-sized datasets in parallel.\n",
24 | "\n",
25 | "For the second half, you'll log into a [Pangeo](https://pangeo-data.github.io/) [Jupyterhub](https://jupyterhub.readthedocs.io/en/stable/) deployment that will provide you with your own Dask cluster to solve even larger problems using a cluster of machines."
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "## Tutorial Structure\n",
33 | "\n",
34 | "Each section is a Jupyter notebook. There's a mixture of text, code, and exercises.\n",
35 | "\n",
36 | "If you haven't used JupyterLab, it's similar to the Jupyter Notebook. If you haven't used the Notebook, the quick intro is:\n",
37 | "\n",
38 | "1. There are two modes: command and edit\n",
39 | "2. From command mode, press `Enter` to edit a cell (like this markdown cell)\n",
40 | "3. From edit mode, press `Esc` to change to command mode\n",
41 | "4. Press `shift+enter` to execute a cell and move to the next cell.\n",
42 | "\n",
43 | "The toolbar has commands for executing, converting, and creating cells.\n",
44 | "\n",
45 | "Each notebook will have exercises for you to solve. You'll be given a blank or partially completed cell, followed by a \"magic\" cell that will load the solution. For example\n",
46 | "\n",
47 | "## Exercise: Print `Hello, world!`\n",
48 | "\n",
49 | "Print the text \"Hello, world!\"."
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": null,
55 | "metadata": {},
56 | "outputs": [],
57 | "source": [
58 | "# Your code here\n"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": null,
64 | "metadata": {},
65 | "outputs": [],
66 | "source": [
67 | "%load solutions/00-hello-world.py"
68 | ]
69 | },
70 | {
71 | "cell_type": "markdown",
72 | "metadata": {},
73 | "source": [
74 | "You'll need to run the solution cell twice: once to load the solution, and a second time to execute it."
75 | ]
76 | },
77 | {
78 | "cell_type": "markdown",
79 | "metadata": {},
80 | "source": [
81 | "## Contents\n",
82 | "\n",
83 | "Now, let's officially start.\n",
84 | "\n",
85 | "- [Dask Delayed](01-dask.delayed.ipynb)\n",
86 | "- [Dask Arrays](02-dask-arrays.ipynb)\n",
87 | "- [Dask DataFrames](03-dask-dataframes.ipynb)\n",
88 | "- [Schedulers](04-schedulers.ipynb)\n",
89 | "- [Distributed DataFrames](05-distributed-dataframes-and-efficiency.ipynb)\n",
90 | "- [Advanced Distributed Techniques](06-distributed-advanced.ipynb)\n",
91 | "- [Scalable Machine Learning](07-machine-learning.ipynb)"
92 | ]
93 | }
94 | ],
95 | "metadata": {
96 | "kernelspec": {
97 | "display_name": "Python 3",
98 | "language": "python",
99 | "name": "python3"
100 | },
101 | "language_info": {
102 | "codemirror_mode": {
103 | "name": "ipython",
104 | "version": 3
105 | },
106 | "file_extension": ".py",
107 | "mimetype": "text/x-python",
108 | "name": "python",
109 | "nbconvert_exporter": "python",
110 | "pygments_lexer": "ipython3",
111 | "version": "3.6.5"
112 | }
113 | },
114 | "nbformat": 4,
115 | "nbformat_minor": 2
116 | }
117 |
--------------------------------------------------------------------------------
/01-dask.delayed.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
11 | "\n",
12 | "# Parallelize code with `dask.delayed`\n",
13 | "\n",
14 | "In this section we parallelize simple for-loop style code with Dask and `dask.delayed`.\n",
15 | "\n",
16 | "This is a simple way to use `dask` to parallelize existing codebases or build [complex systems](http://matthewrocklin.com/blog/work/2018/02/09/credit-models-with-dask). This will also help us to develop an understanding for later sections."
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "## Basics\n",
24 | "\n",
25 | "First let's make some toy functions, `inc` and `add`, that sleep for a while to simulate work. We'll then time running these functions normally.\n",
26 | "\n",
27 | "In the next section we'll parallelize this code."
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": null,
33 | "metadata": {},
34 | "outputs": [],
35 | "source": [
36 | "from time import sleep\n",
37 | "\n",
38 | "def inc(x):\n",
39 | " sleep(1)\n",
40 | " return x + 1\n",
41 | "\n",
42 | "def add(x, y):\n",
43 | " sleep(1)\n",
44 | " return x + y"
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": null,
50 | "metadata": {},
51 | "outputs": [],
52 | "source": [
53 | "%%time\n",
54 | "# This takes three seconds to run because we call each\n",
55 | "# function sequentially, one after the other\n",
56 | "\n",
57 | "x = inc(1)\n",
58 | "y = inc(2)\n",
59 | "z = add(x, y)"
60 | ]
61 | },
62 | {
63 | "cell_type": "markdown",
64 | "metadata": {},
65 | "source": [
66 | "### Parallelize with the `dask.delayed` decorator\n",
67 | "\n",
68 | "Those two increment calls *could* be called in parallel.\n",
69 | "\n",
70 | "We'll wrap the `inc` and `add` functions in the `dask.delayed` decorator. When we call the delayed version by passing the arguments, the original function isn't actually called yet.\n",
71 | "Instead, a *task graph* is built up, representing the *delayed* function call."
72 | ]
73 | },
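{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note, `delayed` can also be applied as a decorator when a function is defined. The sketch below shows one equivalent way to write it (the `lazy_inc` name is purely illustrative); the cells that follow use the call-site form `delayed(inc)(1)` instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch of the decorator form (illustrative only).\n",
"from dask import delayed\n",
"\n",
"@delayed\n",
"def lazy_inc(x):  # hypothetical helper, not used later in this notebook\n",
"    sleep(1)\n",
"    return x + 1\n",
"\n",
"lazy_inc(1)  # returns a Delayed object; nothing runs until .compute()"
]
},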
74 | {
75 | "cell_type": "code",
76 | "execution_count": null,
77 | "metadata": {},
78 | "outputs": [],
79 | "source": [
80 | "from dask import delayed"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": null,
86 | "metadata": {},
87 | "outputs": [],
88 | "source": [
89 | "%%time\n",
90 | "# This runs immediately, all it does is build a graph\n",
91 | "\n",
92 | "x = delayed(inc)(1)\n",
93 | "y = delayed(inc)(2)\n",
94 | "z = delayed(add)(x, y)"
95 | ]
96 | },
97 | {
98 | "cell_type": "markdown",
99 | "metadata": {},
100 | "source": [
101 | "This ran immediately, since nothing has really happened yet.\n",
102 | "\n",
103 | "To get the result, call `compute`."
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": null,
109 | "metadata": {},
110 | "outputs": [],
111 | "source": [
112 | "%%time\n",
113 | "# This actually runs our computation using a local thread pool\n",
114 | "\n",
115 | "z.compute()"
116 | ]
117 | },
118 | {
119 | "cell_type": "markdown",
120 | "metadata": {},
121 | "source": [
122 | "## What just happened?\n",
123 | "\n",
124 | "The `z` object is a lazy `Delayed` object. This object holds everything we need to compute the final result. We can compute the result with `.compute()` as above or we can visualize the task graph for this value with `.visualize()`."
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": null,
130 | "metadata": {},
131 | "outputs": [],
132 | "source": [
133 | "z"
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "execution_count": null,
139 | "metadata": {},
140 | "outputs": [],
141 | "source": [
142 | "# Look at the task graph for `z`\n",
143 | "z.visualize()"
144 | ]
145 | },
146 | {
147 | "cell_type": "markdown",
148 | "metadata": {},
149 | "source": [
150 | "### Some questions to consider:\n",
151 | "\n",
152 | "- Why did we go from 3s to 2s? Why weren't we able to parallelize down to 1s?\n",
153 | "- What would have happened if the inc and add functions didn't include the `sleep(1)`? Would Dask still be able to speed up this code?\n",
154 | "- What if we have multiple outputs or also want to get access to x or y?"
155 | ]
156 | },
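{
"cell_type": "markdown",
"metadata": {},
"source": [
"On the last question: one possible approach, sketched below, is `dask.compute`, which evaluates several delayed objects in a single pass and shares any common intermediate work. The sketch assumes the `x`, `y`, and `z` objects defined above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: evaluate several Delayed objects at once.\n",
"import dask\n",
"\n",
"x_result, y_result, z_result = dask.compute(x, y, z)\n",
"x_result, y_result, z_result"
]
},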
157 | {
158 | "cell_type": "markdown",
159 | "metadata": {},
160 | "source": [
161 | "## Exercise: Parallelize a for loop\n",
162 | "\n",
163 | "`for` loops are one of the most common things that we want to parallelize. Use `dask.delayed` on `inc` and `sum` to parallelize the computation below:"
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": null,
169 | "metadata": {},
170 | "outputs": [],
171 | "source": [
172 | "data = [1, 2, 3, 4, 5, 6, 7, 8]"
173 | ]
174 | },
175 | {
176 | "cell_type": "code",
177 | "execution_count": null,
178 | "metadata": {},
179 | "outputs": [],
180 | "source": [
181 | "%%time\n",
182 | "# Sequential code\n",
183 | "\n",
184 | "results = []\n",
185 | "for x in data:\n",
186 | " y = inc(x)\n",
187 | " results.append(y)\n",
188 | " \n",
189 | "total = sum(results)"
190 | ]
191 | },
192 | {
193 | "cell_type": "code",
194 | "execution_count": null,
195 | "metadata": {},
196 | "outputs": [],
197 | "source": [
198 | "total"
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": null,
204 | "metadata": {},
205 | "outputs": [],
206 | "source": [
207 | "%%time\n",
208 | "# Your parallel code here..."
209 | ]
210 | },
211 | {
212 | "cell_type": "code",
213 | "execution_count": null,
214 | "metadata": {},
215 | "outputs": [],
216 | "source": [
217 | "%load solutions/01-delayed-loop.py"
218 | ]
219 | },
220 | {
221 | "cell_type": "markdown",
222 | "metadata": {},
223 | "source": [
224 | "## Exercise: Parallelizing a for-loop code with control flow\n",
225 | "\n",
226 | "Often we want to delay only *some* functions, running a few of them immediately. This is especially helpful when those functions are fast and help us to determine what other slower functions we should call. This decision, to delay or not to delay, is usually where we need to be thoughtful when using `dask.delayed`.\n",
227 | "\n",
228 | "In the example below we iterate through a list of inputs. If that input is even then we want to call `inc`. If the input is odd then we want to call `double`. This `is_even` decision to call `inc` or `double` has to be made immediately (not lazily) in order for our graph-building Python code to proceed."
229 | ]
230 | },
231 | {
232 | "cell_type": "code",
233 | "execution_count": null,
234 | "metadata": {},
235 | "outputs": [],
236 | "source": [
237 | "def double(x):\n",
238 | " sleep(1)\n",
239 | " return 2 * x\n",
240 | "\n",
241 | "def is_even(x):\n",
242 | " return not x % 2\n",
243 | "\n",
244 | "data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]"
245 | ]
246 | },
247 | {
248 | "cell_type": "code",
249 | "execution_count": null,
250 | "metadata": {},
251 | "outputs": [],
252 | "source": [
253 | "%%time\n",
254 | "# Sequential code\n",
255 | "\n",
256 | "results = []\n",
257 | "for x in data:\n",
258 | " if is_even(x):\n",
259 | " y = double(x)\n",
260 | " else:\n",
261 | " y = inc(x)\n",
262 | " results.append(y)\n",
263 | " \n",
264 | "total = sum(results)\n",
265 | "print(total)"
266 | ]
267 | },
268 | {
269 | "cell_type": "code",
270 | "execution_count": null,
271 | "metadata": {},
272 | "outputs": [],
273 | "source": [
274 | "%%time\n",
275 | "# Your parallel code here...\n",
276 | "# TODO: parallelize the sequential code above using dask.delayed\n",
277 | "# You will need to delay some functions, but not all"
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "execution_count": null,
283 | "metadata": {},
284 | "outputs": [],
285 | "source": [
286 | "%load solutions/01-delayed-control-flow.py"
287 | ]
288 | },
289 | {
290 | "cell_type": "code",
291 | "execution_count": null,
292 | "metadata": {},
293 | "outputs": [],
294 | "source": [
295 | "%time total.compute()"
296 | ]
297 | },
298 | {
299 | "cell_type": "code",
300 | "execution_count": null,
301 | "metadata": {},
302 | "outputs": [],
303 | "source": [
304 | "total.visualize()"
305 | ]
306 | },
307 | {
308 | "cell_type": "markdown",
309 | "metadata": {},
310 | "source": [
311 | "### Some questions to consider:\n",
312 | "\n",
313 | "- What are other examples of control flow where we can't use delayed?\n",
314 | "- What would have happened if we had delayed the evaluation of `is_even(x)` in the example above?\n",
315 | "- What are your thoughts on delaying `sum`? This function is both computational and fast to run."
316 | ]
317 | },
318 | {
319 | "cell_type": "markdown",
320 | "metadata": {},
321 | "source": [
322 | "## Exercise: Parallelizing a Pandas Groupby Reduction\n",
323 | "\n",
324 | "In this exercise we read several CSV files and perform a groupby operation in parallel. We are given sequential code to do this and parallelize it with `dask.delayed`.\n",
325 | "\n",
326 | "The computation we will parallelize is to compute the mean departure delay per airport from some historical flight data. We will do this by using `dask.delayed` together with `pandas`. In a future section we will do this same exercise with `dask.dataframe`."
327 | ]
328 | },
329 | {
330 | "cell_type": "markdown",
331 | "metadata": {},
332 | "source": [
333 | "### Prep data\n",
334 | "\n",
335 | "First, run this code to prep some data.\n",
336 | "\n",
337 | "This downloads and extracts some historical flight data for flights out of NYC between 1990 and 2000. The data is originally from [here](http://stat-computing.org/dataexpo/2009/the-data.html)."
338 | ]
339 | },
340 | {
341 | "cell_type": "code",
342 | "execution_count": null,
343 | "metadata": {},
344 | "outputs": [],
345 | "source": [
346 | "%run prep_data.py"
347 | ]
348 | },
349 | {
350 | "cell_type": "markdown",
351 | "metadata": {},
352 | "source": [
353 | "### Inspect data"
354 | ]
355 | },
356 | {
357 | "cell_type": "code",
358 | "execution_count": null,
359 | "metadata": {},
360 | "outputs": [],
361 | "source": [
362 | "import os\n",
363 | "sorted(os.listdir(os.path.join('data', 'nycflights')))"
364 | ]
365 | },
366 | {
367 | "cell_type": "markdown",
368 | "metadata": {},
369 | "source": [
370 | "### Read one file with `pandas.read_csv` and compute mean departure delay"
371 | ]
372 | },
373 | {
374 | "cell_type": "code",
375 | "execution_count": null,
376 | "metadata": {},
377 | "outputs": [],
378 | "source": [
379 | "import pandas as pd\n",
380 | "df = pd.read_csv(os.path.join('data', 'nycflights', '1990.csv'))\n",
381 | "df.head()"
382 | ]
383 | },
384 | {
385 | "cell_type": "code",
386 | "execution_count": null,
387 | "metadata": {},
388 | "outputs": [],
389 | "source": [
390 | "# What is the schema?\n",
391 | "df.dtypes"
392 | ]
393 | },
394 | {
395 | "cell_type": "code",
396 | "execution_count": null,
397 | "metadata": {},
398 | "outputs": [],
399 | "source": [
400 | "# What originating airports are in the data?\n",
401 | "df.Origin.unique()"
402 | ]
403 | },
404 | {
405 | "cell_type": "code",
406 | "execution_count": null,
407 | "metadata": {},
408 | "outputs": [],
409 | "source": [
410 | "# Mean departure delay per-airport for one year\n",
411 | "df.groupby('Origin').DepDelay.mean()"
412 | ]
413 | },
414 | {
415 | "cell_type": "markdown",
416 | "metadata": {},
417 | "source": [
418 | "### Sequential code: Mean Departure Delay Per Airport\n",
419 | "\n",
420 | "The above cell computes the mean departure delay per-airport for one year. Here we expand that to all years using a sequential for loop."
421 | ]
422 | },
423 | {
424 | "cell_type": "code",
425 | "execution_count": null,
426 | "metadata": {},
427 | "outputs": [],
428 | "source": [
429 | "from glob import glob\n",
430 | "filenames = sorted(glob(os.path.join('data', 'nycflights', '*.csv')))"
431 | ]
432 | },
433 | {
434 | "cell_type": "code",
435 | "execution_count": null,
436 | "metadata": {},
437 | "outputs": [],
438 | "source": [
439 | "%%time\n",
440 | "\n",
441 | "sums = []\n",
442 | "counts = []\n",
443 | "for fn in filenames:\n",
444 | " # Read in file\n",
445 | " df = pd.read_csv(fn)\n",
446 | " \n",
447 | " # Groupby origin airport\n",
448 | " by_origin = df.groupby('Origin')\n",
449 | " \n",
450 | " # Sum of all departure delays by origin\n",
451 | " total = by_origin.DepDelay.sum()\n",
452 | " \n",
453 | " # Number of flights by origin\n",
454 | " count = by_origin.DepDelay.count()\n",
455 | " \n",
456 | " # Save the intermediates\n",
457 | " sums.append(total)\n",
458 | " counts.append(count)\n",
459 | "\n",
460 | "# Combine intermediates to get total mean-delay-per-origin\n",
461 | "total_delays = sum(sums)\n",
462 | "n_flights = sum(counts)\n",
463 | "mean = total_delays / n_flights"
464 | ]
465 | },
466 | {
467 | "cell_type": "code",
468 | "execution_count": null,
469 | "metadata": {},
470 | "outputs": [],
471 | "source": [
472 | "mean"
473 | ]
474 | },
475 | {
476 | "cell_type": "markdown",
477 | "metadata": {},
478 | "source": [
479 | "### Parallelize the code above\n",
480 | "\n",
481 | "Use `dask.delayed` to parallelize the code above. Some extra things you will need to know.\n",
482 | "\n",
483 | "1. Methods and attribute access on delayed objects work automatically, so if you have a delayed object you can perform normal arithmetic, slicing, and method calls on it and it will produce the correct delayed calls.\n",
484 | "\n",
485 | " ```python\n",
486 | " x = delayed(np.arange)(10)\n",
487 | " y = (x + 1)[::2].sum() # everything here was delayed\n",
488 | " ```\n",
489 | "2. Calling the `.compute()` method works well when you have a single output. When you have multiple outputs you might want to use the `dask.compute` function:\n",
490 | "\n",
491 | " ```python\n",
492 | " >>> x = delayed(np.arange)(10)\n",
493 | " >>> y = x ** 2\n",
494 | " >>> min_, max_ = compute(y.min(), y.max())\n",
495 | " >>> min_, max_\n",
496 | " (0, 81)\n",
497 | " ```\n",
498 | " \n",
499 | " This way Dask can share the intermediate values (like `y = x**2`)\n",
500 | " \n",
501 | "So your goal is to parallelize the code above (which has been copied below) using `dask.delayed`. You may also want to visualize a bit of the computation to see if you're doing it correctly."
502 | ]
503 | },
504 | {
505 | "cell_type": "code",
506 | "execution_count": null,
507 | "metadata": {},
508 | "outputs": [],
509 | "source": [
510 | "from dask import compute"
511 | ]
512 | },
513 | {
514 | "cell_type": "code",
515 | "execution_count": null,
516 | "metadata": {},
517 | "outputs": [],
518 | "source": [
519 | "%%time\n",
520 | "\n",
521 | "sums = []\n",
522 | "counts = []\n",
523 | "for fn in filenames:\n",
524 | " # Read in file\n",
525 | " df = pd.read_csv(fn)\n",
526 | " \n",
527 | " # Groupby origin airport\n",
528 | " by_origin = df.groupby('Origin')\n",
529 | " \n",
530 | " # Sum of all departure delays by origin\n",
531 | " total = by_origin.DepDelay.sum()\n",
532 | " \n",
533 | " # Number of flights by origin\n",
534 | " count = by_origin.DepDelay.count()\n",
535 | " \n",
536 | " # Save the intermediates\n",
537 | " sums.append(total)\n",
538 | " counts.append(count)\n",
539 | "\n",
540 | "# Combine intermediates to get total mean-delay-per-origin\n",
541 | "total_delays = sum(sums)\n",
542 | "n_flights = sum(counts)\n",
543 | "mean = total_delays / n_flights"
544 | ]
545 | },
546 | {
547 | "cell_type": "code",
548 | "execution_count": null,
549 | "metadata": {},
550 | "outputs": [],
551 | "source": [
552 | "mean"
553 | ]
554 | },
555 | {
556 | "cell_type": "code",
557 | "execution_count": null,
558 | "metadata": {},
559 | "outputs": [],
560 | "source": [
561 | "%load solutions/01-delayed-groupby.py"
562 | ]
563 | },
564 | {
565 | "cell_type": "markdown",
566 | "metadata": {},
567 | "source": [
568 | "### Some questions to consider:\n",
569 | "\n",
570 | "- How much speedup did you get? Is this how much speedup you'd expect?\n",
571 | "- Experiment with where to call `compute`. What happens when you call it on `sums` and `counts`? What happens if you wait and call it on `mean`?\n",
572 | "- Experiment with delaying the call to `sum`. What does the graph look like if `sum` is delayed? What does the graph look like if it isn't?\n",
573 | "- Can you think of any reason why you'd want to do the reduction one way over the other?"
574 | ]
575 | }
576 | ],
577 | "metadata": {
578 | "kernelspec": {
579 | "display_name": "Python 3",
580 | "language": "python",
581 | "name": "python3"
582 | },
583 | "language_info": {
584 | "codemirror_mode": {
585 | "name": "ipython",
586 | "version": 3
587 | },
588 | "file_extension": ".py",
589 | "mimetype": "text/x-python",
590 | "name": "python",
591 | "nbconvert_exporter": "python",
592 | "pygments_lexer": "ipython3",
593 | "version": "3.6.5"
594 | }
595 | },
596 | "nbformat": 4,
597 | "nbformat_minor": 2
598 | }
599 |
--------------------------------------------------------------------------------
/02-dask-arrays.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {},
16 | "source": [
17 | "# Table of Contents\n",
18 | "* [Arrays](#Arrays)\n",
19 | "\t* [Blocked Algorithms](#Blocked-Algorithms)\n",
20 | "\t\t* [Exercise: Compute the mean using a blocked algorithm](#Exercise:--Compute-the-mean-using-a-blocked-algorithm)\n",
21 | "\t\t* [Exercise: Compute the mean](#Exercise:--Compute-the-mean)\n",
22 | "\t\t* [Example](#Example)\n",
23 | "\t\t* [Exercise: Meteorological data](#Exercise:--Meteorological-data)\n",
24 | "\t\t* [Exercise: Subsample and store](#Exercise:--Subsample-and-store)\n",
25 | "\t* [Example: Lennard-Jones potential](#Example:-Lennard-Jones-potential)\n",
26 | "\t\t* [Dask version](#Dask-version)\n",
27 | "\t* [Profiling](#Profiling)\n"
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "# Arrays"
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {},
40 | "source": [
42 | "\n",
43 | "Dask array provides a parallel, larger-than-memory, n-dimensional array using blocked algorithms. Simply put: distributed Numpy.\n",
44 | "\n",
45 | "* **Parallel**: Uses all of the cores on your computer\n",
46 | "* **Larger-than-memory**: Lets you work on datasets that are larger than your available memory by breaking up your array into many small pieces, operating on those pieces in an order that minimizes the memory footprint of your computation, and effectively streaming data from disk.\n",
47 | "* **Blocked Algorithms**: Perform large computations by performing many smaller computations\n",
48 | "\n",
49 | "**Related Documentation**\n",
50 | "\n",
51 | "* http://dask.readthedocs.io/en/latest/array.html\n",
52 | "* http://dask.readthedocs.io/en/latest/array-api.html"
53 | ]
54 | },
55 | {
56 | "cell_type": "markdown",
57 | "metadata": {},
58 | "source": [
59 | "## Blocked Algorithms"
60 | ]
61 | },
62 | {
63 | "cell_type": "markdown",
64 | "metadata": {},
65 | "source": [
66 | "A *blocked algorithm* executes on a large dataset by breaking it up into many small blocks.\n",
67 | "\n",
68 | "For example, consider taking the sum of a billion numbers. We might instead break up the array into 1,000 chunks, each of size 1,000,000, take the sum of each chunk, and then take the sum of the intermediate sums.\n",
69 | "\n",
70 | "We achieve the intended result (one sum of one billion numbers) by computing many smaller results (one thousand sums of one million numbers each, followed by another sum of a thousand numbers).\n",
71 | "\n",
72 | "We do exactly this with Python and NumPy in the following example:"
73 | ]
74 | },
75 | {
76 | "cell_type": "markdown",
77 | "metadata": {},
78 | "source": [
79 | "**Create random dataset**"
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": null,
85 | "metadata": {},
86 | "outputs": [],
87 | "source": [
88 | "# create data if it doesn't already exist\n",
89 | "from prep_data import random_array\n",
90 | "random_array() \n",
91 | "\n",
92 | "# Load data with h5py\n",
93 | "# this gives the load prescription, but does no real work.\n",
94 | "import h5py\n",
95 | "import os\n",
96 | "f = h5py.File(os.path.join('data', 'random.hdf5'), mode='r')\n",
97 | "dset = f['/x']"
98 | ]
99 | },
100 | {
101 | "cell_type": "markdown",
102 | "metadata": {},
103 | "source": [
104 | "**Compute sum using blocked algorithm**"
105 | ]
106 | },
107 | {
108 | "cell_type": "markdown",
109 | "metadata": {},
110 | "source": [
111 | "Here we compute the sum of this large array on disk by \n",
112 | "\n",
113 | "1. Computing the sum of each 1,000,000 sized chunk of the array\n",
114 | "2. Computing the sum of the 1,000 intermediate sums\n",
115 | "\n",
116 | "Note that we are fetching every partial result and summing them here, in the notebook kernel."
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": null,
122 | "metadata": {},
123 | "outputs": [],
124 | "source": [
125 | "# Compute sum of large array, one million numbers at a time\n",
126 | "sums = []\n",
127 | "for i in range(0, 1000000000, 1000000):\n",
128 | " chunk = dset[i: i + 1000000] # pull out numpy array\n",
129 | " sums.append(chunk.sum())\n",
130 | "\n",
131 | "total = sum(sums)\n",
132 | "print(total)"
133 | ]
134 | },
135 | {
136 | "cell_type": "markdown",
137 | "metadata": {},
138 | "source": [
139 | "### Exercise: Compute the mean using a blocked algorithm"
140 | ]
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {},
145 | "source": [
146 | "Now that we've seen the simple example above try doing a slightly more complicated problem, compute the mean of the array. You can do this by changing the code above with the following alterations:\n",
147 | "\n",
148 | "1. Compute the sum of each block\n",
149 | "2. Compute the length of each block\n",
150 | "3. Compute the sum of the 1,000 intermediate sums and the sum of the 1,000 intermediate lengths and divide one by the other\n",
151 | "\n",
152 | "This approach is overkill for our case but does nicely generalize if we don't know the size of the array or individual blocks beforehand."
153 | ]
154 | },
155 | {
156 | "cell_type": "code",
157 | "execution_count": null,
158 | "metadata": {},
159 | "outputs": [],
160 | "source": [
161 | "# Compute the mean of the array"
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": null,
167 | "metadata": {},
168 | "outputs": [],
169 | "source": [
170 | "%load solutions/02-dask-arrays-blocked-mean.py"
171 | ]
172 | },
173 | {
174 | "cell_type": "markdown",
175 | "metadata": {},
176 | "source": [
177 | "`dask.array` contains these algorithms\n",
178 | "--------------------------------------------\n",
179 | "\n",
180 | "Dask.array is a NumPy-like library that does these kinds of tricks to operate on large datasets that don't fit into memory. It extends beyond the linear problems discussed above to full N-Dimensional algorithms and a decent subset of the NumPy interface."
181 | ]
182 | },
183 | {
184 | "cell_type": "markdown",
185 | "metadata": {},
186 | "source": [
187 | "**Create `dask.array` object**"
188 | ]
189 | },
190 | {
191 | "cell_type": "markdown",
192 | "metadata": {},
193 | "source": [
194 | "You can create a `dask.array` `Array` object with the `da.from_array` function. This function accepts\n",
195 | "\n",
196 | "1. `data`: Any object that supports NumPy slicing, like `dset`\n",
197 | "2. `chunks`: A chunk size to tell us how to block up our array, like `(1000000,)`"
198 | ]
199 | },
200 | {
201 | "cell_type": "code",
202 | "execution_count": null,
203 | "metadata": {},
204 | "outputs": [],
205 | "source": [
206 | "import dask.array as da\n",
207 | "x = da.from_array(dset, chunks=(1000000,))"
208 | ]
209 | },
210 | {
211 | "cell_type": "markdown",
212 | "metadata": {},
213 | "source": [
214 | "**Manipulate a `dask.array` object as you would a numpy array**"
215 | ]
216 | },
217 | {
218 | "cell_type": "markdown",
219 | "metadata": {},
220 | "source": [
221 | "Now that we have an `Array` we can perform standard numpy-style computations like arithmetic, mathematics, slicing, reductions, etc.\n",
222 | "\n",
223 | "The interface is familiar, but the actual work is different: `dask_array.sum()` does not do the same thing as `numpy_array.sum()`."
224 | ]
225 | },
226 | {
227 | "cell_type": "markdown",
228 | "metadata": {},
229 | "source": [
230 | "**What's the difference?**"
231 | ]
232 | },
233 | {
234 | "cell_type": "markdown",
235 | "metadata": {},
236 | "source": [
237 | "`dask_array.sum()` builds an expression of the computation. It does not do the computation yet. `numpy_array.sum()` computes the sum immediately."
238 | ]
239 | },
240 | {
241 | "cell_type": "markdown",
242 | "metadata": {},
243 | "source": [
244 | "*Why the difference?*"
245 | ]
246 | },
247 | {
248 | "cell_type": "markdown",
249 | "metadata": {},
250 | "source": [
251 | "Dask arrays are split into chunks. Each chunk must have computations run on it explicitly. If the desired answer comes from a small slice of the entire dataset, running the computation over all data would be wasteful of CPU and memory."
252 | ]
253 | },
254 | {
255 | "cell_type": "code",
256 | "execution_count": null,
257 | "metadata": {},
258 | "outputs": [],
259 | "source": [
260 | "result = x.sum()\n",
261 | "result"
262 | ]
263 | },
264 | {
265 | "cell_type": "markdown",
266 | "metadata": {},
267 | "source": [
268 | "**Compute result**"
269 | ]
270 | },
271 | {
272 | "cell_type": "markdown",
273 | "metadata": {},
274 | "source": [
275 | "Dask.array objects are lazily evaluated. Operations like `.sum` build up a graph of blocked tasks to execute. \n",
276 | "\n",
277 | "We ask for the final result with a call to `.compute()`. This triggers the actual computation."
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "execution_count": null,
283 | "metadata": {},
284 | "outputs": [],
285 | "source": [
286 | "result.compute()"
287 | ]
288 | },
289 | {
290 | "cell_type": "markdown",
291 | "metadata": {},
292 | "source": [
293 | "### Exercise: Compute the mean"
294 | ]
295 | },
296 | {
297 | "cell_type": "markdown",
298 | "metadata": {},
299 | "source": [
300 | "And the variance, std, etc. This should be a trivial change to the example above.\n",
301 | "\n",
302 | "Look at what other operations you can do with the Jupyter notebook's tab-completion."
303 | ]
304 | },
305 | {
306 | "cell_type": "code",
307 | "execution_count": null,
308 | "metadata": {},
309 | "outputs": [],
310 | "source": []
311 | },
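{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible solution sketch, using the `x` array created earlier, in case you want to compare against your own attempt:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# dask.array reductions mirror the NumPy API; each builds a lazy result.\n",
"print(x.mean().compute())\n",
"print(x.std().compute())  # variance would be x.var().compute()"
]
},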
312 | {
313 | "cell_type": "markdown",
314 | "metadata": {},
315 | "source": [
316 | "Does this match your result from before?"
317 | ]
318 | },
319 | {
320 | "cell_type": "markdown",
321 | "metadata": {},
322 | "source": [
323 | "Performance and Parallelism\n",
324 | "-------------------------------\n",
325 | "\n",
327 | "\n",
328 | "In our first examples we used `for` loops to walk through the array one block at a time. For simple operations like `sum` this is optimal. However for complex operations we may want to traverse through the array differently. In particular we may want the following:\n",
329 | "\n",
330 | "1. Use multiple cores in parallel\n",
331 | "2. Chain operations on a single block before moving on to the next one\n",
332 | "\n",
333 | "Dask.array translates your array operations into a graph of inter-related tasks with data dependencies between them. Dask then executes this graph in parallel with multiple threads. We'll discuss more about this in the next section.\n",
334 | "\n"
335 | ]
336 | },
337 | {
338 | "cell_type": "markdown",
339 | "metadata": {},
340 | "source": [
341 | "### Example"
342 | ]
343 | },
344 | {
345 | "cell_type": "markdown",
346 | "metadata": {},
347 | "source": [
348 | "1. Construct a 20000x20000 array of normally distributed random values broken up into 1000x1000 sized chunks\n",
349 | "2. Take the mean along one axis\n",
350 | "3. Take every 100th element"
351 | ]
352 | },
353 | {
354 | "cell_type": "code",
355 | "execution_count": null,
356 | "metadata": {},
357 | "outputs": [],
358 | "source": [
359 | "import numpy as np\n",
360 | "import dask.array as da\n",
361 | "\n",
362 | "x = da.random.normal(10, 0.1, size=(20000, 20000), # 400 million element array \n",
363 | " chunks=(1000, 1000)) # Cut into 1000x1000 sized chunks\n",
364 | "y = x.mean(axis=0)[::100] # Perform NumPy-style operations"
365 | ]
366 | },
367 | {
368 | "cell_type": "code",
369 | "execution_count": null,
370 | "metadata": {},
371 | "outputs": [],
372 | "source": [
373 | "x.nbytes / 1e9 # Gigabytes of the input processed lazily"
374 | ]
375 | },
376 | {
377 | "cell_type": "code",
378 | "execution_count": null,
379 | "metadata": {},
380 | "outputs": [],
381 | "source": [
382 | "%%time\n",
383 | "y.compute() # Time to compute the result"
384 | ]
385 | },
386 | {
387 | "cell_type": "markdown",
388 | "metadata": {},
389 | "source": [
390 | "Performance comparison\n",
391 | "---------------------------\n",
392 | "\n",
393 | "The following experiment was performed on a heavy personal laptop. Your performance may vary. If you attempt the NumPy version then please ensure that you have more than 4GB of main memory."
394 | ]
395 | },
396 | {
397 | "cell_type": "markdown",
398 | "metadata": {},
399 | "source": [
400 | "**NumPy: 19s, Needs gigabytes of memory**"
401 | ]
402 | },
403 | {
404 | "cell_type": "markdown",
405 | "metadata": {},
406 | "source": [
407 | "```python\n",
408 | "import numpy as np\n",
409 | "\n",
410 | "%%time \n",
411 | "x = np.random.normal(10, 0.1, size=(20000, 20000)) \n",
412 | "y = x.mean(axis=0)[::100] \n",
413 | "y\n",
414 | "\n",
415 | "CPU times: user 19.6 s, sys: 160 ms, total: 19.8 s\n",
416 | "Wall time: 19.7 s\n",
417 | "```"
418 | ]
419 | },
420 | {
421 | "cell_type": "markdown",
422 | "metadata": {},
423 | "source": [
424 | "**Dask Array: 4s, Needs megabytes of memory**"
425 | ]
426 | },
427 | {
428 | "cell_type": "markdown",
429 | "metadata": {},
430 | "source": [
431 | "```python\n",
432 | "import dask.array as da\n",
433 | "\n",
434 | "%%time\n",
435 | "x = da.random.normal(10, 0.1, size=(20000, 20000), chunks=(1000, 1000))\n",
436 | "y = x.mean(axis=0)[::100] \n",
437 | "y.compute() \n",
438 | "\n",
439 | "CPU times: user 29.4 s, sys: 1.07 s, total: 30.5 s\n",
440 | "Wall time: 4.01 s\n",
441 | "```"
442 | ]
443 | },
444 | {
445 | "cell_type": "markdown",
446 | "metadata": {},
447 | "source": [
448 | "**Discussion**"
449 | ]
450 | },
451 | {
452 | "cell_type": "markdown",
453 | "metadata": {},
454 | "source": [
455 | "Notice that the dask array computation ran in 4 seconds, but used 29.4 seconds of user CPU time. The numpy computation ran in 19.7 seconds and used 19.6 seconds of user CPU time.\n",
456 | "\n",
457 | "Dask finished faster, but used more total CPU time because the chunked layout allowed it to transparently parallelize the computation across multiple cores."
458 | ]
459 | },
460 | {
461 | "cell_type": "markdown",
462 | "metadata": {},
463 | "source": [
464 | "*Questions*"
465 | ]
466 | },
467 | {
468 | "cell_type": "markdown",
469 | "metadata": {},
470 | "source": [
471 | "* What happens if the dask chunks=(20000,20000)?\n",
472 | " * Will the computation run in 4 seconds?\n",
473 | " * How much memory will be used?\n",
474 | "* What happens if the dask chunks=(25,25)?\n",
475 | " * What happens to CPU and memory?"
476 | ]
477 | },
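{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to explore these questions is to rebuild the same computation with a different chunk size and compare the timings yourself. A rough sketch follows (it assumes the imports above; the single-chunk version needs several gigabytes of memory, so the compute call is left commented out):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Same computation, but with one giant chunk (roughly the NumPy case).\n",
"x_one_chunk = da.random.normal(10, 0.1, size=(20000, 20000),\n",
"                               chunks=(20000, 20000))\n",
"y_one_chunk = x_one_chunk.mean(axis=0)[::100]\n",
"# %time y_one_chunk.compute()  # no parallelism across chunks, much more memory"
]
},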
478 | {
479 | "cell_type": "markdown",
480 | "metadata": {},
481 | "source": [
482 | "### Exercise: Meteorological data"
483 | ]
484 | },
485 | {
486 | "cell_type": "markdown",
487 | "metadata": {},
488 | "source": [
489 | "There is 2GB of somewhat artificial weather data in HDF5 files in `data/weather-big/*.hdf5`. We'll use the `h5py` library to interact with this data and `dask.array` to compute on it.\n",
490 | "\n",
491 | "Our goal is to visualize the average temperature on the surface of the Earth for this month. This will require a mean over all of this data. We'll do this in the following steps\n",
492 | "\n",
493 | "1. Create `h5py.Dataset` objects for each of the days of data on disk (`dsets`)\n",
494 | "2. Wrap these with `da.from_array` calls \n",
495 | "3. Stack these datasets along time with a call to `da.stack`\n",
496 | "4. Compute the mean along the newly stacked time axis with the `.mean()` method\n",
497 | "5. Visualize the result with `matplotlib.pyplot.imshow`"
498 | ]
499 | },
500 | {
501 | "cell_type": "code",
502 | "execution_count": null,
503 | "metadata": {},
504 | "outputs": [],
505 | "source": [
506 | "from prep_data import weather # Prep data if it doesn't exist\n",
507 | "weather()"
508 | ]
509 | },
510 | {
511 | "cell_type": "code",
512 | "execution_count": null,
513 | "metadata": {},
514 | "outputs": [],
515 | "source": [
516 | "import h5py\n",
517 | "from glob import glob\n",
518 | "import os\n",
519 | "\n",
520 | "filenames = sorted(glob(os.path.join('data', 'weather-big', '*.hdf5')))\n",
521 | "dsets = [h5py.File(filename, mode='r')['/t2m'] for filename in filenames]\n",
522 | "dsets[0]"
523 | ]
524 | },
525 | {
526 | "cell_type": "code",
527 | "execution_count": null,
528 | "metadata": {},
529 | "outputs": [],
530 | "source": [
531 | "dsets[0][:5, :5] # Slicing into h5py.Dataset object gives a numpy array"
532 | ]
533 | },
534 | {
535 | "cell_type": "code",
536 | "execution_count": null,
537 | "metadata": {},
538 | "outputs": [],
539 | "source": [
540 | "%matplotlib inline\n",
541 | "import matplotlib.pyplot as plt\n",
542 | "\n",
543 | "fig = plt.figure(figsize=(16, 8))\n",
544 | "plt.imshow(dsets[0][::4, ::4], cmap='RdBu_r')"
545 | ]
546 | },
547 | {
548 | "cell_type": "markdown",
549 | "metadata": {},
550 | "source": [
551 | "**Integrate with `dask.array`**"
552 | ]
553 | },
554 | {
555 | "cell_type": "markdown",
556 | "metadata": {},
557 | "source": [
558 | "Make a list of `dask.array` objects out of your list of `h5py.Dataset` objects using the `da.from_array` function with a chunk size of `(500, 500)`."
559 | ]
560 | },
561 | {
562 | "cell_type": "code",
563 | "execution_count": null,
564 | "metadata": {},
565 | "outputs": [],
566 | "source": []
567 | },
568 | {
569 | "cell_type": "code",
570 | "execution_count": null,
571 | "metadata": {},
572 | "outputs": [],
573 | "source": [
574 | "%load solutions/02-dask-arrays-make-arrays.py"
575 | ]
576 | },
577 | {
578 | "cell_type": "markdown",
579 | "metadata": {},
580 | "source": [
581 | "**Stack this list of `dask.array` objects into a single `dask.array` object with `da.stack`**"
582 | ]
583 | },
584 | {
585 | "cell_type": "markdown",
586 | "metadata": {},
587 | "source": [
588 | "Stack these along the first axis so that the shape of the resulting array is `(31, 5760, 11520)`."
589 | ]
590 | },
591 | {
592 | "cell_type": "code",
593 | "execution_count": null,
594 | "metadata": {},
595 | "outputs": [],
596 | "source": []
597 | },
598 | {
599 | "cell_type": "code",
600 | "execution_count": null,
601 | "metadata": {},
602 | "outputs": [],
603 | "source": [
604 | "%load solutions/02-dask-arrays-stacked.py"
605 | ]
606 | },
607 | {
608 | "cell_type": "markdown",
609 | "metadata": {},
610 | "source": [
611 | "**Plot the mean of this array along the time (`0th`) axis**"
612 | ]
613 | },
614 | {
615 | "cell_type": "code",
616 | "execution_count": null,
617 | "metadata": {},
618 | "outputs": [],
619 | "source": []
620 | },
621 | {
622 | "cell_type": "code",
623 | "execution_count": null,
624 | "metadata": {},
625 | "outputs": [],
626 | "source": [
627 | "%load solutions/02-dask-arrays-weather-mean.py"
628 | ]
629 | },
630 | {
631 | "cell_type": "markdown",
632 | "metadata": {},
633 | "source": [
634 | "**Plot the difference of the first day from the mean**"
635 | ]
636 | },
637 | {
638 | "cell_type": "code",
639 | "execution_count": null,
640 | "metadata": {},
641 | "outputs": [],
642 | "source": []
643 | },
644 | {
645 | "cell_type": "code",
646 | "execution_count": null,
647 | "metadata": {},
648 | "outputs": [],
649 | "source": [
650 | "%load solutions/02-dask-arrays-weather-difference.py"
651 | ]
652 | },
653 | {
654 | "cell_type": "markdown",
655 | "metadata": {},
656 | "source": [
657 | "### Exercise: Subsample and store"
658 | ]
659 | },
660 | {
661 | "cell_type": "markdown",
662 | "metadata": {},
663 | "source": [
664 | "In the above exercise the result of our computation is small, so we can call `compute` safely. Sometimes our result is still too large to fit into memory and we want to save it to disk. In these cases you can use one of the following two functions\n",
665 | "\n",
666 | "1. `da.store`: Store dask.array into any object that supports numpy setitem syntax, e.g.\n",
667 | "\n",
668 | " f = h5py.File('myfile.hdf5')\n",
669 | " output = f.create_dataset(shape=..., dtype=...)\n",
670 | " \n",
671 | " da.store(my_dask_array, output)\n",
672 | " \n",
673 | "2. `da.to_hdf5`: A specialized function that creates and stores a `dask.array` object into an `HDF5` file.\n",
674 | "\n",
675 | " da.to_hdf5('data/myfile.hdf5', '/output', my_dask_array)\n",
676 | " \n",
677 | "The task in this exercise is to **use numpy step slicing to subsample the full dataset by a factor of two in both the latitude and longitude direction and then store this result to disk** using one of the functions listed above.\n",
678 | "\n",
679 | "As a reminder, Python slicing takes three elements\n",
680 | "\n",
681 | " start:stop:step\n",
682 | "\n",
683 | " >>> L = [1, 2, 3, 4, 5, 6, 7]\n",
684 | " >>> L[::3]\n",
685 | " [1, 4, 7]"
686 | ]
687 | },
688 | {
689 | "cell_type": "code",
690 | "execution_count": null,
691 | "metadata": {},
692 | "outputs": [],
693 | "source": []
694 | },
695 | {
696 | "cell_type": "code",
697 | "execution_count": null,
698 | "metadata": {},
699 | "outputs": [],
700 | "source": [
701 | "%load solutions/02-dask-arrays-store.py"
702 | ]
703 | },
704 | {
705 | "cell_type": "markdown",
706 | "metadata": {},
707 | "source": [
708 | "## Example: Lennard-Jones potential"
709 | ]
710 | },
711 | {
712 | "cell_type": "markdown",
713 | "metadata": {},
714 | "source": [
715 | "The [Lennard-Jones potential](https://en.wikipedia.org/wiki/Lennard-Jones_potential) is used in particle simulations in physics, chemistry, and engineering. It is highly parallelizable.\n",
716 | "\n",
717 | "First, we'll run and profile the Numpy version on 7,000 particles."
718 | ]
719 | },
720 | {
721 | "cell_type": "code",
722 | "execution_count": null,
723 | "metadata": {},
724 | "outputs": [],
725 | "source": [
726 | "import numpy as np\n",
727 | "\n",
728 | "# make a random collection of particles\n",
729 | "def make_cluster(natoms, radius=40, seed=1981):\n",
730 | " np.random.seed(seed)\n",
731 | " cluster = np.random.normal(0, radius, (natoms,3))-0.5\n",
732 | " return cluster\n",
733 | "\n",
734 | "def lj(r2):\n",
735 | " sr6 = (1./r2)**3\n",
736 | " pot = 4.*(sr6*sr6 - sr6)\n",
737 | " return pot\n",
738 | "\n",
739 | "# build the matrix of distances\n",
740 | "def distances(cluster):\n",
741 | " diff = cluster[:, np.newaxis, :] - cluster[np.newaxis, :, :]\n",
742 | " mat = (diff*diff).sum(-1)\n",
743 | " return mat\n",
744 | "\n",
745 | "# the lj function is evaluated over the upper triangle\n",
746 | "# after removing distances near zero\n",
747 | "def potential(cluster):\n",
748 | " d2 = distances(cluster)\n",
749 | " dtri = np.triu(d2)\n",
750 | " energy = lj(dtri[dtri > 1e-6]).sum()\n",
751 | " return energy"
752 | ]
753 | },
754 | {
755 | "cell_type": "code",
756 | "execution_count": null,
757 | "metadata": {},
758 | "outputs": [],
759 | "source": [
760 | "cluster = make_cluster(int(7e3), radius=500)"
761 | ]
762 | },
763 | {
764 | "cell_type": "code",
765 | "execution_count": null,
766 | "metadata": {},
767 | "outputs": [],
768 | "source": [
769 | "%time potential(cluster)"
770 | ]
771 | },
772 | {
773 | "cell_type": "markdown",
774 | "metadata": {},
775 | "source": [
776 | "Notice that the most time consuming function is `distances`."
777 | ]
778 | },
779 | {
780 | "cell_type": "code",
781 | "execution_count": null,
782 | "metadata": {},
783 | "outputs": [],
784 | "source": [
785 | "%prun -s cumulative potential(cluster)"
786 | ]
787 | },
788 | {
789 | "cell_type": "markdown",
790 | "metadata": {},
791 | "source": [
792 | "### Dask version"
793 | ]
794 | },
795 | {
796 | "cell_type": "markdown",
797 | "metadata": {},
798 | "source": [
799 | "Here's the dask version. Only the `potential` function needs to be rewritten to best utilize Dask.\n",
800 | "\n",
801 | "Note that `da.nansum` has been used over the full $N \\times N$ distance matrix to improve parallel efficiency."
802 | ]
803 | },
804 | {
805 | "cell_type": "code",
806 | "execution_count": null,
807 | "metadata": {},
808 | "outputs": [],
809 | "source": [
810 | "import dask.array as da\n",
811 | "\n",
812 | "# compute the potential on the entire\n",
813 | "# matrix of distances and ignore division by zero\n",
814 | "def potential_dask(cluster):\n",
815 | " d2 = distances(cluster)\n",
816 | " energy = da.nansum(lj(d2))/2.\n",
817 | " return energy"
818 | ]
819 | },
820 | {
821 | "cell_type": "markdown",
822 | "metadata": {},
823 | "source": [
824 | "Let's convert the NumPy array to a Dask array. Since the entire NumPy array fits in memory, it is more computationally efficient to chunk the array by the number of CPU cores."
825 | ]
826 | },
827 | {
828 | "cell_type": "code",
829 | "execution_count": null,
830 | "metadata": {},
831 | "outputs": [],
832 | "source": [
833 | "from os import cpu_count\n",
834 | "\n",
835 | "dcluster = da.from_array(cluster, chunks=cluster.shape[0]//cpu_count())"
836 | ]
837 | },
838 | {
839 | "cell_type": "markdown",
840 | "metadata": {},
841 | "source": [
842 | "This step should scale quite well with the number of cores. The warnings are complaining about dividing by zero, which is why we used `da.nansum` in `potential_dask`."
843 | ]
844 | },
845 | {
846 | "cell_type": "code",
847 | "execution_count": null,
848 | "metadata": {},
849 | "outputs": [],
850 | "source": [
851 | "e = potential_dask(dcluster)\n",
852 | "%time e.compute()"
853 | ]
854 | },
855 | {
856 | "cell_type": "markdown",
857 | "metadata": {},
858 | "source": [
859 | "The distributed [dashboard](http://127.0.0.1:8787/tasks) shows the execution of the tasks, allowing a visualization of which is taking the most time."
860 | ]
861 | },
862 | {
863 | "cell_type": "markdown",
864 | "metadata": {},
865 | "source": [
866 | "Limitations\n",
867 | "-----------\n",
868 | "\n",
869 | "Dask.array does not implement the entire numpy interface. Users expecting this\n",
870 | "will be disappointed. Notably dask.array has the following failings:\n",
871 | "\n",
872 | "1. Dask does not implement all of ``np.linalg``. This has been done by a\n",
873 | " number of excellent BLAS/LAPACK implementations and is the focus of\n",
874 | " numerous ongoing academic research projects.\n",
875 | "2. Dask.array does not support some operations where the resulting shape\n",
876 | " depends on the values of the array. In order to form the dask graph we\n",
877 | " must be able to infer the shape of the array before actually executing the\n",
878 | " operation. Some operations that result in arrays with unknown shape are\n",
879 | " supported: e.g. indexing one dask array with another or operations like ``da.where``.\n",
880 | "3. Dask.array does not attempt operations like ``sort`` which are notoriously\n",
881 | " difficult to do in parallel and are of somewhat diminished value on very\n",
882 | " large data (you rarely actually need a full sort).\n",
883 | " Often we include parallel-friendly alternatives like ``topk``.\n",
884 | "4. Dask development is driven by immediate need, and so many lesser used\n",
885 | " functions, like ``np.full_like`` have not been implemented purely out of\n",
886 | " laziness. These would make excellent community contributions."
887 | ]
888 | }
889 | ],
890 | "metadata": {
891 | "anaconda-cloud": {},
892 | "kernelspec": {
893 | "display_name": "Python 3",
894 | "language": "python",
895 | "name": "python3"
896 | },
897 | "language_info": {
898 | "codemirror_mode": {
899 | "name": "ipython",
900 | "version": 3
901 | },
902 | "file_extension": ".py",
903 | "mimetype": "text/x-python",
904 | "name": "python",
905 | "nbconvert_exporter": "python",
906 | "pygments_lexer": "ipython3",
907 | "version": "3.6.4"
908 | }
909 | },
910 | "nbformat": 4,
911 | "nbformat_minor": 2
912 | }
913 |
--------------------------------------------------------------------------------
/03-dask-dataframes.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
11 | "\n",
12 | "\n",
13 | "# Dask DataFrames\n",
14 | "\n",
15 | "We finished the last section by building a parallel dataframe computation over a directory of CSV files using `dask.delayed`. In this section we use `dask.dataframe` to build computations for us in the common case of tabular computations. Dask dataframes look and feel like Pandas dataframes but they run on the same infrastructure that powers `dask.delayed`.\n",
16 | "\n",
17 | "In this notebook we use the same airline data as in notebook 01, but now rather than write for loops we let `dask.dataframe` construct our computations for us. The `dask.dataframe.read_csv` function can take a globstring like `\"data/nycflights/*.csv\"` and build parallel computations on all of our data at once.\n",
18 | "\n",
19 | "## When to use `dask.dataframe`\n",
20 | "\n",
21 | "Pandas is great for tabular datasets that fit in memory. Dask becomes useful when the dataset you want to analyze is larger than your machine's RAM. We didn't want to overwhelm the conference WiFi downloading large datasets, so the demo dataset we're working with is only about 200MB. But `dask.dataframe` will scale to larger-than-memory datasets."
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": null,
27 | "metadata": {},
28 | "outputs": [],
29 | "source": [
30 | "import os\n",
31 | "\n",
32 | "import dask\n",
33 | "import dask.dataframe as dd\n",
34 | "import pandas as pd\n",
35 | "\n",
36 | "pd.options.display.max_rows = 10\n",
37 | "\n",
38 | "df = dd.read_csv(os.path.join('data', 'nycflights', '*.csv'),\n",
39 | " parse_dates={'Date': [0, 1, 2]})"
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": null,
45 | "metadata": {},
46 | "outputs": [],
47 | "source": [
48 | "df"
49 | ]
50 | },
51 | {
52 | "cell_type": "code",
53 | "execution_count": null,
54 | "metadata": {},
55 | "outputs": [],
56 | "source": [
57 | "# Get the first 5 rows\n",
58 | "df.head()"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": null,
64 | "metadata": {},
65 | "outputs": [],
66 | "source": [
67 | "# Get the last 5 rows\n",
68 | "df.tail()"
69 | ]
70 | },
71 | {
72 | "cell_type": "markdown",
73 | "metadata": {},
74 | "source": [
75 | "### What just happened?\n",
76 | "\n",
77 | "Unlike `pandas.read_csv` which reads in the entire file before inferring datatypes, `dask.dataframe.read_csv` only reads in a sample from the beginning of the file (or first file if using a glob). These inferred datatypes are then enforced when reading all partitions.\n",
78 | "\n",
79 | "In this case, the datatypes inferred in the sample are incorrect. The first `n` rows have no value for `CRSElapsedTime` (which pandas infers as a `float`), and the values later on turn out to be strings (`object` dtype). When this happens you have a few options:\n",
80 | "\n",
81 | "- Specify dtypes directly using the `dtype` keyword. This is the recommended solution, as it's the least error prone (better to be explicit than implicit) and also the most performant.\n",
82 | "- Increase the size of the `sample` keyword (in bytes)\n",
83 | "- Use `assume_missing` to make `dask` assume that columns inferred to be `int` (which don't allow missing values) are actually floats (which do allow missing values). In our particular case this doesn't apply.\n",
84 | "\n",
85 | "In our case we'll use the first option and directly specify the `dtypes` of the offending columns. "
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": null,
91 | "metadata": {},
92 | "outputs": [],
93 | "source": [
94 | "df = dd.read_csv(os.path.join('data', 'nycflights', '*.csv'),\n",
95 | " parse_dates={'Date': [0, 1, 2]},\n",
96 | " dtype={'TailNum': str,\n",
97 | " 'CRSElapsedTime': float,\n",
98 | " 'Cancelled': bool})"
99 | ]
100 | },
101 | {
102 | "cell_type": "code",
103 | "execution_count": null,
104 | "metadata": {},
105 | "outputs": [],
106 | "source": [
107 | "df.tail()"
108 | ]
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "## Computations with `dask.dataframe`\n",
115 | "\n",
116 | "We compute the maximum of the `DepDelay` column. With just pandas, we would loop over each file to find the individual maximums, then find the final maximum over all the individual maximums\n",
117 | "\n",
118 | "```python\n",
119 | "maxes = []\n",
120 | "for fn in filenames:\n",
121 | " df = pd.read_csv(fn)\n",
122 | " maxes.append(df.DepDelay.max())\n",
123 | " \n",
124 | "final_max = max(maxes)\n",
125 | "```\n",
126 | "\n",
127 | "We could wrap that `pd.read_csv` with `dask.delayed` so that it runs in parallel. Regardless, we're still having to think about loops, intermediate results (one per file) and the final reduction (`max` of the intermediate maxes). This is just noise around the real task, which pandas solves with\n",
128 | "\n",
129 | "```python\n",
130 | "df = pd.read_csv(filename, dtype=dtype)\n",
131 | "df.DepDelay.max()\n",
132 | "```\n",
133 | "\n",
134 | "`dask.dataframe` lets us write pandas-like code that operates on larger-than-memory datasets in parallel."
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": null,
140 | "metadata": {},
141 | "outputs": [],
142 | "source": [
143 | "%time df.DepDelay.max().compute()"
144 | ]
145 | },
146 | {
147 | "cell_type": "markdown",
148 | "metadata": {},
149 | "source": [
150 | "This writes the delayed computation for us and then runs it. \n",
151 | "\n",
152 | "Some things to note:\n",
153 | "\n",
154 | "1. As with `dask.delayed`, we need to call `.compute()` when we're done. Up until this point everything is lazy.\n",
155 | "2. Dask will delete intermediate results (like the full pandas dataframe for each file) as soon as possible.\n",
156 | " - This lets us handle datasets that are larger than memory\n",
157 | " - This means that repeated computations will have to load all of the data in each time (run the code above again, is it faster or slower than you would expect?)\n",
158 | " \n",
159 | "As with `Delayed` objects, you can view the underlying task graph using the `.visualize` method:"
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "execution_count": null,
165 | "metadata": {},
166 | "outputs": [],
167 | "source": [
168 | "df.DepDelay.max().visualize()"
169 | ]
170 | },
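{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside: if the dataset comfortably fits in memory and you plan to run several computations against it, one option (sketched below, not required for this tutorial) is to persist the parsed dataframe, so repeated computations don't re-read and re-parse the CSV files each time."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: keep the parsed dataframe in memory for repeated queries.\n",
"# df = df.persist()\n",
"# %time df.DepDelay.max().compute()  # subsequent computations reuse the in-memory data"
]
},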
171 | {
172 | "cell_type": "markdown",
173 | "metadata": {},
174 | "source": [
175 | "## Exercises\n",
176 | "\n",
177 | "In this section we do a few `dask.dataframe` computations. If you are comfortable with Pandas then these should be familiar. You will have to think about when to call `compute`."
178 | ]
179 | },
180 | {
181 | "cell_type": "markdown",
182 | "metadata": {},
183 | "source": [
184 | "### 1.) How many rows are in our dataset?\n",
185 | "\n",
186 | "If you aren't familiar with pandas, how would you check how many records are in a list of tuples?"
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": null,
192 | "metadata": {},
193 | "outputs": [],
194 | "source": [
195 | "# Your code here...\n"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": null,
201 | "metadata": {},
202 | "outputs": [],
203 | "source": [
204 | "%load solutions/03-dask-dataframe-rows.py"
205 | ]
206 | },
207 | {
208 | "cell_type": "markdown",
209 | "metadata": {},
210 | "source": [
211 | "### 2.) In total, how many non-canceled flights were taken?\n",
212 | "\n",
213 | "With pandas, you would use [boolean indexing](https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing)."
214 | ]
215 | },
216 | {
217 | "cell_type": "code",
218 | "execution_count": null,
219 | "metadata": {},
220 | "outputs": [],
221 | "source": [
222 | "# Your code here...\n"
223 | ]
224 | },
225 | {
226 | "cell_type": "code",
227 | "execution_count": null,
228 | "metadata": {},
229 | "outputs": [],
230 | "source": [
231 | "%load solutions/03-dask-dataframe-non-cancelled.py"
232 | ]
233 | },
234 | {
235 | "cell_type": "markdown",
236 | "metadata": {},
237 | "source": [
238 | "### 3.) In total, how many non-cancelled flights were taken from each airport?\n",
239 | "\n",
240 | "*Hint*: use [`df.groupby`](https://pandas.pydata.org/pandas-docs/stable/groupby.html)."
241 | ]
242 | },
243 | {
244 | "cell_type": "code",
245 | "execution_count": null,
246 | "metadata": {},
247 | "outputs": [],
248 | "source": []
249 | },
250 | {
251 | "cell_type": "code",
252 | "execution_count": null,
253 | "metadata": {},
254 | "outputs": [],
255 | "source": [
256 | "# Your code here...\n"
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": null,
262 | "metadata": {},
263 | "outputs": [],
264 | "source": [
265 | "%load solutions/03-dask-dataframe-non-cancelled-per-airport.py"
266 | ]
267 | },
268 | {
269 | "cell_type": "markdown",
270 | "metadata": {},
271 | "source": [
272 | "### 4.) What was the average departure delay from each airport?\n",
273 | "\n",
274 | "Note, this is the same computation you did in the previous notebook (is this approach faster or slower?)"
275 | ]
276 | },
277 | {
278 | "cell_type": "code",
279 | "execution_count": null,
280 | "metadata": {},
281 | "outputs": [],
282 | "source": [
283 | "# Your code here...\n"
284 | ]
285 | },
286 | {
287 | "cell_type": "code",
288 | "execution_count": null,
289 | "metadata": {},
290 | "outputs": [],
291 | "source": [
292 | "df.columns"
293 | ]
294 | },
295 | {
296 | "cell_type": "code",
297 | "execution_count": null,
298 | "metadata": {},
299 | "outputs": [],
300 | "source": [
301 | "%load solutions/03-dask-dataframe-delay-per-airport.py"
302 | ]
303 | },
304 | {
305 | "cell_type": "markdown",
306 | "metadata": {},
307 | "source": [
308 | "### 5.) What day of the week has the worst average departure delay?"
309 | ]
310 | },
311 | {
312 | "cell_type": "code",
313 | "execution_count": null,
314 | "metadata": {},
315 | "outputs": [],
316 | "source": [
317 | "# Your code here...\n"
318 | ]
319 | },
320 | {
321 | "cell_type": "code",
322 | "execution_count": null,
323 | "metadata": {},
324 | "outputs": [],
325 | "source": [
326 | "%load solutions/03-dask-dataframe-delay-per-day.py"
327 | ]
328 | },
329 | {
330 | "cell_type": "markdown",
331 | "metadata": {},
332 | "source": [
333 | "## Sharing Intermediate Results\n",
334 | "\n",
335 | "When computing all of the above, we sometimes did the same operation more than once. For most operations, `dask.dataframe` hashes the arguments, allowing duplicate computations to be shared and computed only once.\n",
336 | "\n",
337 | "For example, let's compute the mean and standard deviation of the departure delay for all non-cancelled flights:"
338 | ]
339 | },
340 | {
341 | "cell_type": "code",
342 | "execution_count": null,
343 | "metadata": {},
344 | "outputs": [],
345 | "source": [
346 | "non_cancelled = df[~df.Cancelled]\n",
347 | "mean_delay = non_cancelled.DepDelay.mean()\n",
348 | "std_delay = non_cancelled.DepDelay.std()"
349 | ]
350 | },
351 | {
352 | "cell_type": "markdown",
353 | "metadata": {},
354 | "source": [
355 | "Since dask operations are lazy, those values aren't the final results yet. They're just the recipe required to get the result."
356 | ]
357 | },
358 | {
359 | "cell_type": "code",
360 | "execution_count": null,
361 | "metadata": {},
362 | "outputs": [],
363 | "source": [
364 | "non_cancelled"
365 | ]
366 | },
367 | {
368 | "cell_type": "markdown",
369 | "metadata": {},
370 | "source": [
371 | "#### Using two calls to `.compute`:"
372 | ]
373 | },
374 | {
375 | "cell_type": "code",
376 | "execution_count": null,
377 | "metadata": {},
378 | "outputs": [],
379 | "source": [
380 | "%%time\n",
381 | "\n",
382 | "mean_delay_res = mean_delay.compute()\n",
383 | "std_delay_res = std_delay.compute()"
384 | ]
385 | },
386 | {
387 | "cell_type": "markdown",
388 | "metadata": {},
389 | "source": [
390 | "#### Using one call to `dask.compute`:"
391 | ]
392 | },
393 | {
394 | "cell_type": "code",
395 | "execution_count": null,
396 | "metadata": {},
397 | "outputs": [],
398 | "source": [
399 | "%%time\n",
400 | "\n",
401 | "mean_delay_res, std_delay_res = dask.compute(mean_delay, std_delay)"
402 | ]
403 | },
404 | {
405 | "cell_type": "markdown",
406 | "metadata": {},
407 | "source": [
408 | "Using `dask.compute` takes roughly 1/2 the time. This is because the task graphs for both results are merged when calling `dask.compute`, allowing shared operations to only be done once instead of twice. In particular, using `dask.compute` only does the following once:\n",
409 | "\n",
410 | "- the calls to `read_csv`\n",
411 | "- the filter (`df[~df.Cancelled]`)\n",
412 | "- some of the necessary reductions (`sum`, `count`)\n",
413 | "\n",
414 | "To see what the merged task graphs between multiple results look like (and what's shared), you can use the `dask.visualize` function (we might want to use `filename='graph.pdf'` to zoom in on the graph better):"
415 | ]
416 | },
417 | {
418 | "cell_type": "code",
419 | "execution_count": null,
420 | "metadata": {},
421 | "outputs": [],
422 | "source": [
423 | "dask.visualize(mean_delay, std_delay)"
424 | ]
425 | },
426 | {
427 | "cell_type": "markdown",
428 | "metadata": {},
429 | "source": [
430 | "## Dask DataFrame Data Model\n",
431 | "\n",
432 | "For the most part, a Dask DataFrame feels like a pandas DataFrame.\n",
433 | "So far, the biggest difference we've seen is that Dask operations are lazy; they build up a task graph instead of executing immediately (more details coming in [Schedulers](04-schedulers.ipynb)).\n",
434 | "This lets Dask do operations in parallel and out of core.\n",
435 | "\n",
436 | "In [Dask Arrays](02-dask-arrays.ipynb), we saw that a `dask.array` was composed of many NumPy arrays, chunked along one or more dimensions.\n",
437 | "It's similar for `dask.dataframe`: a Dask DataFrame is composed of many pandas DataFrames. For `dask.dataframe` the chunking happens only along the index.\n",
438 | "\n",
439 | "
\n",
440 | "\n",
441 | "We call each chunk a *partition*, and the upper / lower bounds are *divisions*.\n",
442 | "Dask *can* store information about the divisions. We'll cover this in more detail in [Distributed DataFrames](05-distributed-dataframes-and-efficiency.ipynb).\n",
443 | "For now, partitions come up when you write custom functions to apply to Dask DataFrames."
444 | ]
445 | },
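  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sketch (using the `df` defined above), you can inspect how a Dask DataFrame is partitioned:\n",
    "\n",
    "```python\n",
    "df.npartitions                     # how many pandas DataFrames back this Dask DataFrame\n",
    "df.divisions                       # index boundaries between partitions (all None if unknown)\n",
    "df.map_partitions(len).compute()   # number of rows in each partition\n",
    "```"
   ]
  },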
446 | {
447 | "cell_type": "markdown",
448 | "metadata": {},
449 | "source": [
450 | "## Converting `CRSDepTime` to a timestamp\n",
451 | "\n",
452 | "This dataset stores timestamps as `HHMM`, which are read in as integers in `read_csv`:"
453 | ]
454 | },
455 | {
456 | "cell_type": "code",
457 | "execution_count": null,
458 | "metadata": {},
459 | "outputs": [],
460 | "source": [
461 | "crs_dep_time = df.CRSDepTime.head(10)\n",
462 | "crs_dep_time"
463 | ]
464 | },
465 | {
466 | "cell_type": "markdown",
467 | "metadata": {},
468 | "source": [
469 | "To build timestamps for the scheduled departure time, we need to convert these integers into `pd.Timedelta` objects and then combine them with the `Date` column.\n",
470 | "\n",
471 | "In pandas we'd do this using the `pd.to_timedelta` function, and a bit of arithmetic:"
472 | ]
473 | },
474 | {
475 | "cell_type": "code",
476 | "execution_count": null,
477 | "metadata": {},
478 | "outputs": [],
479 | "source": [
480 | "import pandas as pd\n",
481 | "\n",
482 | "# Get the first 10 dates to complement our `crs_dep_time`\n",
483 | "date = df.Date.head(10)\n",
484 | "\n",
485 | "# Get hours as an integer, convert to a timedelta\n",
486 | "hours = crs_dep_time // 100\n",
487 | "hours_timedelta = pd.to_timedelta(hours, unit='h')\n",
488 | "\n",
489 | "# Get minutes as an integer, convert to a timedelta\n",
490 | "minutes = crs_dep_time % 100\n",
491 | "minutes_timedelta = pd.to_timedelta(minutes, unit='m')\n",
492 | "\n",
493 | "# Apply the timedeltas to offset the dates by the departure time\n",
494 | "departure_timestamp = date + hours_timedelta + minutes_timedelta\n",
495 | "departure_timestamp"
496 | ]
497 | },
498 | {
499 | "cell_type": "code",
500 | "execution_count": null,
501 | "metadata": {},
502 | "outputs": [],
503 | "source": [
504 | "df"
505 | ]
506 | },
507 | {
508 | "cell_type": "code",
509 | "execution_count": null,
510 | "metadata": {},
511 | "outputs": [],
512 | "source": [
513 | "df.UniqueCarrier.str.slice(2)"
514 | ]
515 | },
516 | {
517 | "cell_type": "markdown",
518 | "metadata": {},
519 | "source": [
520 | "### Custom code and Dask Dataframe\n",
521 | "\n",
522 | "We could swap out `pd.to_timedelta` for `dd.to_timedelta` and do the same operations on the entire dask DataFrame. But let's say that Dask hadn't implemented a `dd.to_timedelta` that works on Dask DataFrames. What would you do then?\n",
523 | "\n",
524 | "`dask.dataframe` provides a few methods to make applying custom functions to Dask DataFrames easier:\n",
525 | "\n",
526 | "- [`map_partitions`](http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_partitions)\n",
527 | "- [`map_overlap`](http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_overlap)\n",
528 | "- [`reduction`](http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.reduction)\n",
529 | "\n",
530 | "Here we'll just be discussing `map_partitions`, which we can use to implement `to_timedelta` on our own:"
531 | ]
532 | },
533 | {
534 | "cell_type": "code",
535 | "execution_count": null,
536 | "metadata": {},
537 | "outputs": [],
538 | "source": [
539 | "# Look at the docs for `map_partitions`\n",
540 | "\n",
541 | "help(df.CRSDepTime.map_partitions)"
542 | ]
543 | },
544 | {
545 | "cell_type": "markdown",
546 | "metadata": {},
547 | "source": [
548 | "The basic idea is to apply a function that operates on a DataFrame to each partition.\n",
549 | "In this case, we'll apply `pd.to_timedelta`."
550 | ]
551 | },
552 | {
553 | "cell_type": "code",
554 | "execution_count": null,
555 | "metadata": {},
556 | "outputs": [],
557 | "source": [
558 | "hours = df.CRSDepTime // 100\n",
559 | "# hours_timedelta = pd.to_timedelta(hours, unit='h')\n",
560 | "hours_timedelta = hours.map_partitions(pd.to_timedelta, unit='h')\n",
561 | "\n",
562 | "minutes = df.CRSDepTime % 100\n",
563 | "# minutes_timedelta = pd.to_timedelta(minutes, unit='m')\n",
564 | "minutes_timedelta = minutes.map_partitions(pd.to_timedelta, unit='m')\n",
565 | "\n",
566 | "departure_timestamp = df.Date + hours_timedelta + minutes_timedelta"
567 | ]
568 | },
569 | {
570 | "cell_type": "code",
571 | "execution_count": null,
572 | "metadata": {},
573 | "outputs": [],
574 | "source": [
575 | "departure_timestamp"
576 | ]
577 | },
578 | {
579 | "cell_type": "code",
580 | "execution_count": null,
581 | "metadata": {},
582 | "outputs": [],
583 | "source": [
584 | "departure_timestamp.head()"
585 | ]
586 | },
587 | {
588 | "cell_type": "markdown",
589 | "metadata": {},
590 | "source": [
591 | "### Exercise: Rewrite above to use a single call to `map_partitions`\n",
592 | "\n",
593 | "This will be slightly more efficient than two separate calls, as it reduces the number of tasks in the graph."
594 | ]
595 | },
596 | {
597 | "cell_type": "code",
598 | "execution_count": null,
599 | "metadata": {},
600 | "outputs": [],
601 | "source": [
602 | "def compute_departure_timestamp(df):\n",
603 | "    pass  # TODO: combine df.Date with hours and minutes from df.CRSDepTime"
604 | ]
605 | },
606 | {
607 | "cell_type": "code",
608 | "execution_count": null,
609 | "metadata": {},
610 | "outputs": [],
611 | "source": [
612 | "departure_timestamp = df.map_partitions(compute_departure_timestamp)"
613 | ]
614 | },
615 | {
616 | "cell_type": "code",
617 | "execution_count": null,
618 | "metadata": {},
619 | "outputs": [],
620 | "source": [
621 | "departure_timestamp.head()"
622 | ]
623 | },
624 | {
625 | "cell_type": "code",
626 | "execution_count": null,
627 | "metadata": {},
628 | "outputs": [],
629 | "source": [
630 | "%load solutions/03-dask-dataframe-map-partitions.py"
631 | ]
632 | }
633 | ],
634 | "metadata": {
635 | "kernelspec": {
636 | "display_name": "Python 3",
637 | "language": "python",
638 | "name": "python3"
639 | },
640 | "language_info": {
641 | "codemirror_mode": {
642 | "name": "ipython",
643 | "version": 3
644 | },
645 | "file_extension": ".py",
646 | "mimetype": "text/x-python",
647 | "name": "python",
648 | "nbconvert_exporter": "python",
649 | "pygments_lexer": "ipython3",
650 | "version": "3.6.5"
651 | }
652 | },
653 | "nbformat": 4,
654 | "nbformat_minor": 2
655 | }
656 |
--------------------------------------------------------------------------------
/04-schedulers.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "
\n",
11 | "\n",
12 | "\n",
13 | "## Schedulers\n",
14 | "\n",
15 | "In the previous notebooks, we used `dask.delayed` and `dask.dataframe` to parallelize computations.\n",
16 | "These work by building a *task graph* instead of executing immediately.\n",
17 | "Each *task* represents some function to call on some data, and the full *graph* is the relationship between all the tasks.\n",
18 | "\n",
19 | "When we wanted the actual result, we called `compute`, which handed the task graph off to a *scheduler*.\n",
20 | "\n",
21 | "**Schedulers are responsible for running a task graph and producing a result**.\n",
22 | "\n",
23 | "\n",
24 | "\n",
25 | "First, there are the single machine schedulers that execute things in parallel using threads or processes (or synchronously for debugging). These are what we've used up until now. Second, there's the `dask.distributed` scheduler, which is newer and has more features than the single machine schedulers.\n",
26 | "\n",
27 | "In this notebook we'll first talk about the different schedulers. Then we'll use the `dask.distributed` scheduler in more depth."
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "### Local Schedulers\n",
35 | "\n",
36 | "Dask separates computation description (task graphs) from execution (schedulers). This allows you to write code once, and run it locally or scale it out across a cluster.\n",
37 | "\n",
38 | "Here we discuss the *local* schedulers - schedulers that run only on a single machine. The three options here are:\n",
39 | "\n",
40 | "- `dask.threaded.get # uses a local thread pool`\n",
41 | "- `dask.multiprocessing.get # uses a local process pool`\n",
42 | "- `dask.get # uses only the main thread (useful for debugging)`\n",
43 | "\n",
44 | "In each case we can change the scheduler being used in a few different ways:\n",
45 | "\n",
46 | "- By providing a `get=` keyword argument to `compute`:\n",
47 | "\n",
48 | "```python\n",
49 | "total.compute(get=dask.multiprocessing.get)\n",
50 | "# or \n",
51 | "dask.compute(a, b, get=dask.multiprocessing.get)\n",
52 | "```\n",
53 | "\n",
54 | "- Using `dask.set_options`:\n",
55 | "\n",
56 | "```python\n",
57 | "# Use multiprocessing in this block\n",
58 | "with dask.set_options(get=dask.multiprocessing.get):\n",
59 | " total.compute()\n",
60 | "# Use multiprocessing globally\n",
61 | "dask.set_options(get=dask.multiprocessing.get)\n",
62 | "```\n",
63 | "\n",
64 | "---\n",
65 | "\n",
66 | "*Note: on master, we've also added a `scheduler=` keyword to `compute` that takes in scheduler names instead of scheduler functions. In future releases `dask.compute(..., scheduler='threads')` or `dask.set_options(scheduler='threads')` will be the preferred methods.*\n",
67 | "\n",
68 | "---\n",
69 | "\n",
70 | "Here we repeat a simple dataframe computation from the previous section using the different schedulers:"
71 | ]
72 | },
73 | {
74 | "cell_type": "code",
75 | "execution_count": null,
76 | "metadata": {},
77 | "outputs": [],
78 | "source": [
79 | "import os\n",
80 | "import dask.dataframe as dd"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": null,
86 | "metadata": {},
87 | "outputs": [],
88 | "source": [
89 | "df = dd.read_csv(os.path.join('data', 'nycflights', '*.csv'),\n",
90 | " parse_dates={'Date': [0, 1, 2]},\n",
91 | " dtype={'TailNum': object,\n",
92 | " 'CRSElapsedTime': float,\n",
93 | " 'Cancelled': bool})\n",
94 | "\n",
95 | "# Maximum non-cancelled delay\n",
96 | "largest_delay = df[~df.Cancelled].DepDelay.max()"
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": null,
102 | "metadata": {},
103 | "outputs": [],
104 | "source": [
105 | "largest_delay"
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": null,
111 | "metadata": {},
112 | "outputs": [],
113 | "source": [
114 | "%time _ = largest_delay.compute() # this uses threads by default"
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": null,
120 | "metadata": {},
121 | "outputs": [],
122 | "source": [
123 | "import dask.multiprocessing\n",
124 | "%time _ = largest_delay.compute(get=dask.multiprocessing.get) # this uses processes"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": null,
130 | "metadata": {},
131 | "outputs": [],
132 | "source": [
133 | "%time _ = largest_delay.compute(get=dask.get) # This uses a single thread"
134 | ]
135 | },
136 | {
137 | "cell_type": "markdown",
138 | "metadata": {},
139 | "source": [
140 | "By default the threaded and multiprocessing schedulers use the same number of workers as cores. You can change this using the `num_workers` keyword in the same way that you specified `get` above:\n",
141 | "\n",
142 | "```\n",
143 | "largest_delay.compute(get=dask.multiprocessing.get, num_workers=2)\n",
144 | "```\n",
145 | "\n",
146 | "To see how many cores you have on your computer, you can use `multiprocessing.cpu_count`"
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "execution_count": null,
152 | "metadata": {},
153 | "outputs": [],
154 | "source": [
155 | "from multiprocessing import cpu_count\n",
156 | "cpu_count()"
157 | ]
158 | },
159 | {
160 | "cell_type": "markdown",
161 | "metadata": {},
162 | "source": [
163 | "### Some Questions to Consider:\n",
164 | "\n",
165 | "- How much speedup is possible for this task (hint: look at the graph)?\n",
166 | "- Given how many cores are on this machine, how much faster could the parallel schedulers be than the single-threaded scheduler?\n",
167 | "- How much faster was using threads over a single thread? Why does this differ from the optimal speedup?\n",
168 | "- Why is the multiprocessing scheduler so much slower here?"
169 | ]
170 | },
171 | {
172 | "cell_type": "markdown",
173 | "metadata": {},
174 | "source": [
175 | "---"
176 | ]
177 | },
178 | {
179 | "cell_type": "markdown",
180 | "metadata": {},
181 | "source": [
182 | "## In what cases would you want to use one scheduler over another?\n",
183 | "\n",
184 | "http://dask.pydata.org/en/latest/setup/single-machine.html"
185 | ]
186 | },
187 | {
188 | "cell_type": "markdown",
189 | "metadata": {},
190 | "source": [
191 | "---"
192 | ]
193 | },
194 | {
195 | "cell_type": "markdown",
196 | "metadata": {},
197 | "source": [
198 | "## Distributed Scheduler\n",
199 | "\n",
200 | "The `dask.distributed` system is composed of a single centralized scheduler and many worker processes. [Deploying](http://dask.pydata.org/en/latest/setup.html) a remote Dask cluster involves some additional effort. But doing things locally just involves creating a `Client` object, which lets you interact with the \"cluster\" (local threads or processes on your machine). For more information see [here](http://dask.pydata.org/en/latest/setup/single-distributed.html)."
201 | ]
202 | },
203 | {
204 | "cell_type": "code",
205 | "execution_count": null,
206 | "metadata": {},
207 | "outputs": [],
208 | "source": [
209 | "from dask.distributed import Client\n",
210 | "\n",
211 | "# Setup a local cluster.\n",
212 | "# By default this sets up 1 worker per core\n",
213 | "client = Client()\n",
214 | "client.cluster"
215 | ]
216 | },
217 | {
218 | "cell_type": "markdown",
219 | "metadata": {},
220 | "source": [
221 | "Be sure to click the `Dashboard` link to open up the diagnostics dashboard.\n",
222 | "\n",
223 | "By default, creating a `Client` makes it the default scheduler. Any calls to `.compute` will use the cluster your `client` is attached to (See http://dask.pydata.org/en/latest/scheduling.html for how to specify which scheduler to use)."
224 | ]
225 | },
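  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch (using the `largest_delay` defined above): even with a `Client` active you can still force a local scheduler for a single call via the `get=` keyword, which is handy for comparisons like the one below.\n",
    "\n",
    "```python\n",
    "import dask.threaded\n",
    "\n",
    "largest_delay.compute()                       # uses the distributed scheduler (the Client)\n",
    "largest_delay.compute(get=dask.threaded.get)  # force the local thread pool for this call\n",
    "```"
   ]
  },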
226 | {
227 | "cell_type": "code",
228 | "execution_count": null,
229 | "metadata": {},
230 | "outputs": [],
231 | "source": [
232 | "%time largest_delay.compute()"
233 | ]
234 | },
235 | {
236 | "cell_type": "markdown",
237 | "metadata": {},
238 | "source": [
239 | "#### Some Questions to Consider\n",
240 | "\n",
241 | "- How does this compare to the optimal parallel speedup?\n",
242 | "- Why is this faster than the threaded scheduler?"
243 | ]
244 | },
245 | {
246 | "cell_type": "markdown",
247 | "metadata": {},
248 | "source": [
249 | "---"
250 | ]
251 | },
252 | {
253 | "cell_type": "markdown",
254 | "metadata": {},
255 | "source": [
256 | "### Exercise\n",
257 | "\n",
258 | "Run the following computations while looking at the diagnostics page. In each case what is taking the most time?"
259 | ]
260 | },
261 | {
262 | "cell_type": "code",
263 | "execution_count": null,
264 | "metadata": {},
265 | "outputs": [],
266 | "source": [
267 | "# Number of flights\n",
268 | "_ = len(df)"
269 | ]
270 | },
271 | {
272 | "cell_type": "code",
273 | "execution_count": null,
274 | "metadata": {},
275 | "outputs": [],
276 | "source": [
277 | "# Number of non-cancelled flights\n",
278 | "_ = len(df[~df.Cancelled])"
279 | ]
280 | },
281 | {
282 | "cell_type": "code",
283 | "execution_count": null,
284 | "metadata": {},
285 | "outputs": [],
286 | "source": [
287 | "# Number of non-cancelled flights per-airport\n",
288 | "_ = df[~df.Cancelled].groupby('Origin').Origin.count().compute()"
289 | ]
290 | },
291 | {
292 | "cell_type": "code",
293 | "execution_count": null,
294 | "metadata": {},
295 | "outputs": [],
296 | "source": [
297 | "# Average departure delay from each airport?\n",
298 | "_ = df[~df.Cancelled].groupby('Origin').DepDelay.mean().compute()"
299 | ]
300 | },
301 | {
302 | "cell_type": "code",
303 | "execution_count": null,
304 | "metadata": {},
305 | "outputs": [],
306 | "source": [
307 | "# Average departure delay per day-of-week\n",
308 | "_ = df.groupby(df.Date.dt.dayofweek).DepDelay.mean().compute()"
309 | ]
310 | },
311 | {
312 | "cell_type": "markdown",
313 | "metadata": {},
314 | "source": [
315 | "---"
316 | ]
317 | },
318 | {
319 | "cell_type": "markdown",
320 | "metadata": {},
321 | "source": [
322 | "### New API\n",
323 | "\n",
324 | "The distributed scheduler is more sophisticated than the single machine schedulers. It can compute asynchronously, and also provides an API similar to that of `concurrent.futures`. This will be discussed more [in part 6](06-distributed-advanced.ipynb). For further information you can also see the docs http://distributed.readthedocs.io/en/latest/."
325 | ]
326 | }
327 | ],
328 | "metadata": {
329 | "kernelspec": {
330 | "display_name": "Python 3",
331 | "language": "python",
332 | "name": "python3"
333 | },
334 | "language_info": {
335 | "codemirror_mode": {
336 | "name": "ipython",
337 | "version": 3
338 | },
339 | "file_extension": ".py",
340 | "mimetype": "text/x-python",
341 | "name": "python",
342 | "nbconvert_exporter": "python",
343 | "pygments_lexer": "ipython3",
344 | "version": "3.5.4"
345 | }
346 | },
347 | "nbformat": 4,
348 | "nbformat_minor": 2
349 | }
350 |
--------------------------------------------------------------------------------
/05-distributed-dataframes-and-efficiency.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "
\n",
11 | "\n",
12 | "# Personal Dask Cluster\n",
13 | "\n",
14 | "Go to http://35.192.17.108/hub/login for your own Dask cluster. Use your first and last name provided to PyCon for the username and \"dask\" for the password.\n",
15 | "(We don't actually do any authentication).\n",
16 | "\n",
17 | "# Distributed DataFrames and Efficiency\n",
18 | "\n",
19 | "In the previous notebooks we discussed `dask.dataframe` and `dask.distributed`. Here we combine them on a larger dataset, and discuss efficiency and performance tips.\n",
20 | "\n",
21 | "We will cover the following topics:\n",
22 | "\n",
23 | "1. Persist common intermediate results in memory with `persist`\n",
24 | "2. Partitions and partition size\n",
25 | "3. Using indices to improve efficiency"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": null,
31 | "metadata": {},
32 | "outputs": [],
33 | "source": [
34 | "from pycon_utils import make_cluster\n",
35 | "from dask.distributed import Client"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
42 | "The next cell will start up some workers for you. This may take a few minutes, but the widget will update automatically when the workers are ready. You don't need to do anything with the manual or adaptive scaling."
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": null,
48 | "metadata": {},
49 | "outputs": [],
50 | "source": [
51 | "cluster = make_cluster()\n",
52 | "cluster"
53 | ]
54 | },
55 | {
56 | "cell_type": "markdown",
57 | "metadata": {},
58 | "source": [
59 | "**Be sure to open the diagnostics UI.**"
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "execution_count": null,
65 | "metadata": {},
66 | "outputs": [],
67 | "source": [
68 | "client = Client(cluster)\n",
69 | "client"
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "## Moving to distributed\n",
77 | "\n",
78 | "A few things change when moving from local to distributed computing.\n",
79 | "\n",
80 | "1. Environment: Each worker is a separate machine, and needs to have the required libraries installed. This cluster was setup using [Kubernetes](http://dask.pydata.org/en/latest/setup/kubernetes.html#).\n",
81 | "2. File system: Previously, every worker (threads, processes, or even the distributed scheduler in local mode) had access to your laptop's file system. In a distributed environment, you'll need some kind of shared file system to read data (cloud storage like S3 or GCS, or a network file system).\n",
82 | "3. Communication: Moving data between machines is relatively expensive. When possible, the distributed scheduler will ensure that tasks are scheduled to be run on workers that already have the required data. But some tasks will require data from multiple machines."
83 | ]
84 | },
85 | {
86 | "cell_type": "markdown",
87 | "metadata": {},
88 | "source": [
89 | "## The full airline dataset\n",
90 | "\n",
91 | "We have the full airline dataset stored on `GCS`. This is the same as the one you've been working with, but includes all originating airports and a few extra columns. We change the `read_csv` call slightly to avoid the extra columns.\n",
92 | "\n",
93 | "`dask.dataframe` has support for reading directly from `GCS`, so we can use our `read_csv` call from before."
94 | ]
95 | },
96 | {
97 | "cell_type": "code",
98 | "execution_count": null,
99 | "metadata": {},
100 | "outputs": [],
101 | "source": [
102 | "import dask.dataframe as dd\n",
103 | "\n",
104 | "columns = ['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'CRSDepTime',\n",
105 | " 'ArrTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'TailNum',\n",
106 | " 'ActualElapsedTime', 'CRSElapsedTime', 'AirTime', 'ArrDelay',\n",
107 | " 'DepDelay', 'Origin', 'Dest', 'Distance', 'TaxiIn', 'TaxiOut',\n",
108 | " 'Cancelled']\n",
109 | "\n",
110 | "df = dd.read_csv('gcs://anaconda-public-data/airline/(199)|(200)*.csv',\n",
111 | " parse_dates={'Date': [0, 1, 2]},\n",
112 | " dtype={'TailNum': object,\n",
113 | " 'CRSElapsedTime': float,\n",
114 | " 'Distance': float,\n",
115 | " 'Cancelled': bool},\n",
116 | " usecols=columns)"
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": null,
122 | "metadata": {},
123 | "outputs": [],
124 | "source": [
125 | "df.head()"
126 | ]
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "metadata": {},
131 | "source": [
132 | "### Persist data in distributed memory\n",
133 | "\n",
134 | "Every time we run an operation like `df[~df.Cancelled].DepDelay.max().compute()` we read through our dataset from disk. This can be slow, especially because we're reading data from CSV. We usually have two options to make this faster:\n",
135 | "\n",
136 | "1. Persist relevant data in memory, either on our computer or on a cluster\n",
137 | "2. Use a faster on-disk format, like HDF5 or Parquet\n",
138 | "\n",
139 | "In this section we persist our data in memory. On a single machine this is often done by doing a bit of pre-processing and data reduction with dask dataframe, then `compute`-ing to a Pandas dataframe, and continuing with Pandas from there.\n",
140 | "\n",
141 | "```python\n",
142 | "df = dd.read_csv(...)\n",
143 | "df = df[df.Origin == 'LGA'] # filter down to smaller dataset\n",
144 | "pdf = df.compute() # convert to pandas\n",
145 | "pdf ... # continue with familiar Pandas workflows\n",
146 | "```\n",
147 | "\n",
148 | "However, on a distributed cluster, when even our cleaned data is too large, we still can't use Pandas. In this case we ask Dask to persist data in memory with the `dask.persist` function. This is what we'll do today. It will also help us understand when data is lazy and when it is computed.\n",
149 | "\n",
150 | "You can trigger computations using the persist method:\n",
151 | "\n",
152 | " x = x.persist()\n",
153 | "\n",
154 | "or the dask.persist function for multiple inputs:\n",
155 | "\n",
156 | " x, y = dask.persist(x, y)"
157 | ]
158 | },
159 | {
160 | "cell_type": "markdown",
161 | "metadata": {},
162 | "source": [
163 | "### Exercise\n",
164 | "\n",
165 | "Persist the dataframe into memory.\n",
166 | "\n",
167 | "- How long does the cell take to execute (look at the \"busy\" indicator in the top-right)?\n",
168 | "- After it has persisted how long does it take to compute `df[~df.Cancelled].DepDelay.count().compute()`?\n",
169 | "- Looking at the plots in the diagnostic web page (the link was printed above), what is taking up most of the time? (You can hover over rectangles to see what function they represent)"
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "execution_count": null,
175 | "metadata": {},
176 | "outputs": [],
177 | "source": [
178 | "df = # TODO: persist dataframe in memory"
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": null,
184 | "metadata": {},
185 | "outputs": [],
186 | "source": [
187 | "%time _ = df.Cancelled[~df.Cancelled].count().compute()"
188 | ]
189 | },
190 | {
191 | "cell_type": "markdown",
192 | "metadata": {},
193 | "source": [
194 | "### Exercise\n",
195 | "\n",
196 | "Repeat the groupby computation from the previous notebooks. What is taking all of the time now?"
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": null,
202 | "metadata": {},
203 | "outputs": [],
204 | "source": [
205 | "# What was the average departure delay from each airport?\n",
206 | "df[~df.Cancelled].groupby('Origin').DepDelay.mean().nlargest(10).compute()"
207 | ]
208 | },
209 | {
210 | "cell_type": "markdown",
211 | "metadata": {},
212 | "source": [
213 | "## Partitions\n",
214 | "\n",
215 | "One `dask.dataframe` is composed of several Pandas dataframes. The organization of these dataframes can significantly impact performance. In this section we discuss two factors that commonly impact performance:\n",
216 | "\n",
217 | "1. The number of Pandas dataframes can affect overhead. If the dataframes are too small then Dask might spend more time deciding what to do than Pandas spends actually doing it. Ideally, computations on each partition should take hundreds of milliseconds.\n",
218 | "\n",
219 | "2. If we know how the dataframes are sorted then certain operations become much faster."
220 | ]
221 | },
222 | {
223 | "cell_type": "markdown",
224 | "metadata": {},
225 | "source": [
226 | "### Number of partitions and partition size\n",
227 | "\n",
228 | "When we read in our data from CSV files we get potentially multiple Pandas dataframes per file. Look at the metadata below to determine a few things about the current partitioning:\n",
229 | "- How many partitions are there?\n",
230 | "- Are the splits along the index between partitions known? If so, what are they?"
231 | ]
232 | },
233 | {
234 | "cell_type": "code",
235 | "execution_count": null,
236 | "metadata": {},
237 | "outputs": [],
238 | "source": [
239 | "# Number of partitions\n",
240 | "df.npartitions"
241 | ]
242 | },
243 | {
244 | "cell_type": "code",
245 | "execution_count": null,
246 | "metadata": {},
247 | "outputs": [],
248 | "source": [
249 | "# Are the splits between partitions known?\n",
250 | "df.known_divisions"
251 | ]
252 | },
253 | {
254 | "cell_type": "code",
255 | "execution_count": null,
256 | "metadata": {},
257 | "outputs": [],
258 | "source": [
259 | "# The splits between partitions. If unknown these are all `None`\n",
260 | "df.divisions"
261 | ]
262 | },
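  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a hedged aside (not an exercise): if profiling shows the partitions are too small and scheduler overhead dominates, you can consolidate them with `repartition`. The target below is illustrative, not a recommendation for this dataset.\n",
    "\n",
    "```python\n",
    "# Sketch: merge partitions to reduce per-task overhead\n",
    "df_coarser = df.repartition(npartitions=max(1, df.npartitions // 4))\n",
    "df_coarser.npartitions\n",
    "```"
   ]
  },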
263 | {
264 | "cell_type": "markdown",
265 | "metadata": {},
266 | "source": [
267 | "### Exercise: How large is the DataFrame?\n",
268 | "\n",
269 | "- How would you compute the memory usage of a single pandas DataFrame?\n",
270 | "- Given your knowledge of Dask, how would you do it for a Dask DataFrame?"
271 | ]
272 | },
273 | {
274 | "cell_type": "code",
275 | "execution_count": null,
276 | "metadata": {},
277 | "outputs": [],
278 | "source": [
279 | "# Your code here...\n"
280 | ]
281 | },
282 | {
283 | "cell_type": "code",
284 | "execution_count": null,
285 | "metadata": {},
286 | "outputs": [],
287 | "source": [
288 | "%load solutions/05-distributed-dataframes-memory-usage.ipynb"
289 | ]
290 | },
291 | {
292 | "cell_type": "markdown",
293 | "metadata": {},
294 | "source": [
295 | "## Sorted Index column\n",
296 | "\n",
297 | "*This section doesn't have any exercises. Just follow along.*\n",
298 | "\n",
299 | "Many dataframe operations like loc-indexing, groupby-apply, and joins are *much* faster on a sorted index. For example, if we want to get the data for a particular day, it *really* helps to know where that day is; otherwise we need to search over all of our data.\n",
300 | "\n",
301 | "The Pandas model gives us a sorted index column. Dask.dataframe copies this model, and it remembers the min and max values of every partition's index.\n",
302 | "\n",
303 | "By default, our data doesn't have an index."
304 | ]
305 | },
306 | {
307 | "cell_type": "code",
308 | "execution_count": null,
309 | "metadata": {},
310 | "outputs": [],
311 | "source": [
312 | "df.head()"
313 | ]
314 | },
315 | {
316 | "cell_type": "markdown",
317 | "metadata": {},
318 | "source": [
319 | "So if we search for a particular day it takes a while because it has to pass through all of the data."
320 | ]
321 | },
322 | {
323 | "cell_type": "code",
324 | "execution_count": null,
325 | "metadata": {},
326 | "outputs": [],
327 | "source": [
328 | "%time df[df.Date == '1992-05-05'].compute()"
329 | ]
330 | },
331 | {
332 | "cell_type": "code",
333 | "execution_count": null,
334 | "metadata": {},
335 | "outputs": [],
336 | "source": [
337 | "df[df.Date == '1992-05-05'].visualize(optimize_graph=True)"
338 | ]
339 | },
340 | {
341 | "cell_type": "markdown",
342 | "metadata": {},
343 | "source": [
344 | "However if we set the `Date` column as the index then this operation can be much much faster.\n",
345 | "\n",
346 | "Calling `set_index` followed by `persist` results in a new set of dataframe partitions stored in memory, sorted along the index column. To do this dask has to\n",
347 | "\n",
348 | "- Shuffle the data by date, resulting in the same number of output partitions\n",
349 | "- Set the index for each partition\n",
350 | "- Store the resulting partitions in distributed memory\n",
351 | "\n",
352 | "This can be a (relatively) expensive operation, but allows certain queries to be more optimized. \n",
353 | "\n",
354 | "Watch the diagnostics page while the next line is running to see how the shuffle and index operation progresses."
355 | ]
356 | },
357 | {
358 | "cell_type": "code",
359 | "execution_count": null,
360 | "metadata": {},
361 | "outputs": [],
362 | "source": [
363 | "%%time\n",
364 | "df = df.set_index('Date').persist()"
365 | ]
366 | },
367 | {
368 | "cell_type": "markdown",
369 | "metadata": {},
370 | "source": [
371 | "After the index is set, we now have known divisions:"
372 | ]
373 | },
374 | {
375 | "cell_type": "code",
376 | "execution_count": null,
377 | "metadata": {},
378 | "outputs": [],
379 | "source": [
380 | "# Number of partitions\n",
381 | "df.npartitions"
382 | ]
383 | },
384 | {
385 | "cell_type": "code",
386 | "execution_count": null,
387 | "metadata": {},
388 | "outputs": [],
389 | "source": [
390 | "# Are the splits between partitions known?\n",
391 | "df.known_divisions"
392 | ]
393 | },
394 | {
395 | "cell_type": "code",
396 | "execution_count": null,
397 | "metadata": {},
398 | "outputs": [],
399 | "source": [
400 | "# The splits between partitions.\n",
401 | "df.divisions"
402 | ]
403 | },
404 | {
405 | "cell_type": "code",
406 | "execution_count": null,
407 | "metadata": {},
408 | "outputs": [],
409 | "source": [
410 | "# The repr for a dask dataframe can also be useful for\n",
411 | "# seeing partition information\n",
412 | "df"
413 | ]
414 | },
415 | {
416 | "cell_type": "markdown",
417 | "metadata": {},
418 | "source": [
419 | "Repeating the same query for all flights on a specific date, we can see that we're much faster after setting the index:"
420 | ]
421 | },
422 | {
423 | "cell_type": "code",
424 | "execution_count": null,
425 | "metadata": {},
426 | "outputs": [],
427 | "source": [
428 | "%time df.loc['1992-05-05'].compute()"
429 | ]
430 | },
431 | {
432 | "cell_type": "markdown",
433 | "metadata": {},
434 | "source": [
435 | "If you look at the resulting graph, you can see that dask was able to optimize the computation to only look at a single partition:"
436 | ]
437 | },
438 | {
439 | "cell_type": "code",
440 | "execution_count": null,
441 | "metadata": {},
442 | "outputs": [],
443 | "source": [
444 | "df.loc['1992-05-05'].visualize(optimize_graph=True)"
445 | ]
446 | },
447 | {
448 | "cell_type": "markdown",
449 | "metadata": {},
450 | "source": [
451 | "### Timeseries operations\n",
452 | "\n",
453 | "When the index of a dask dataframe is a known `DatetimeIndex`, traditional pandas timeseries operations are supported. For example, now that we have a sorted index we can resample the `DepDelay` column into 1-month bins."
454 | ]
455 | },
456 | {
457 | "cell_type": "code",
458 | "execution_count": null,
459 | "metadata": {},
460 | "outputs": [],
461 | "source": [
462 | "%matplotlib inline"
463 | ]
464 | },
465 | {
466 | "cell_type": "code",
467 | "execution_count": null,
468 | "metadata": {},
469 | "outputs": [],
470 | "source": [
471 | "%%time \n",
472 | "(df.DepDelay\n",
473 | " .resample('1M')\n",
474 | " .mean()\n",
475 | " .fillna(method='ffill')\n",
476 | " .compute()\n",
477 | " .plot(figsize=(10, 5)));"
478 | ]
479 | },
480 | {
481 | "cell_type": "code",
482 | "execution_count": null,
483 | "metadata": {},
484 | "outputs": [],
485 | "source": []
486 | },
487 | {
488 | "cell_type": "code",
489 | "execution_count": null,
490 | "metadata": {},
491 | "outputs": [],
492 | "source": [
493 | "# When you're done with the `airlines` dataset\n",
494 | "client.restart()"
495 | ]
496 | },
497 | {
498 | "cell_type": "markdown",
499 | "metadata": {},
500 | "source": [
501 | "## Exercise: Explore the NYC Taxi dataset\n",
502 | "\n",
503 | "We have some of the NYC Taxi ride dataset in parquet format stored in GCS."
504 | ]
505 | },
506 | {
507 | "cell_type": "code",
508 | "execution_count": null,
509 | "metadata": {},
510 | "outputs": [],
511 | "source": [
512 | "taxi = dd.read_parquet(\"gcs://anaconda-public-data/nyc-taxi/nyc.parquet\")\n",
513 | "taxi.head()"
514 | ]
515 | },
516 | {
517 | "cell_type": "markdown",
518 | "metadata": {},
519 | "source": [
520 | "Some questions:\n",
521 | "\n",
522 | "- How large is the dataset? Will it fit in your cluster's RAM if you persist it?\n",
523 | "- What's the average tip percent by hour?"
524 | ]
525 | },
526 | {
527 | "cell_type": "code",
528 | "execution_count": null,
529 | "metadata": {},
530 | "outputs": [],
531 | "source": []
532 | },
533 | {
534 | "cell_type": "code",
535 | "execution_count": null,
536 | "metadata": {},
537 | "outputs": [],
538 | "source": [
539 | "# clean up, when finished with the notebook\n",
540 | "client.close()\n",
541 | "cluster.close()"
542 | ]
543 | }
544 | ],
545 | "metadata": {
546 | "kernelspec": {
547 | "display_name": "Python 3",
548 | "language": "python",
549 | "name": "python3"
550 | },
551 | "language_info": {
552 | "codemirror_mode": {
553 | "name": "ipython",
554 | "version": 3
555 | },
556 | "file_extension": ".py",
557 | "mimetype": "text/x-python",
558 | "name": "python",
559 | "nbconvert_exporter": "python",
560 | "pygments_lexer": "ipython3",
561 | "version": "3.6.5"
562 | }
563 | },
564 | "nbformat": 4,
565 | "nbformat_minor": 2
566 | }
567 |
--------------------------------------------------------------------------------
/06-distributed-advanced.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "
\n",
11 | "\n",
12 | "\n",
13 | "# The async futures interface\n",
14 | "\n",
15 | "In previous notebooks we showed that the distributed scheduler can be used to execute dask graphs, and so compute collections across multiple machines. However, the distributed scheduler also offers additional APIs that allow for finer control and asynchronous computation. In this notebook we delve into these features.\n",
16 | "\n",
17 | "To begin, the `distributed` scheduler implements a superset of the standard library [`concurrent.futures`](https://docs.python.org/3/library/concurrent.futures.html) API, allowing for asynchronous map-reduce-like functionality. We can submit individual functions for evaluation with one set of inputs using `submit()`, or have them evaluated over a sequence of inputs with `map()`. Notice that the call returns immediately, giving one or more *futures*, whose status begins as \"pending\" and later becomes \"finished\". There is no blocking of the local python session.\n",
18 | "\n",
19 | "Before you start, you should open the diagnostics dashboard so you can watch what the scheduler is doing."
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": null,
25 | "metadata": {},
26 | "outputs": [],
27 | "source": [
28 | "from pycon_utils import make_cluster\n",
29 | "from dask.distributed import Client"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": null,
35 | "metadata": {},
36 | "outputs": [],
37 | "source": [
38 | "cluster = make_cluster()\n",
39 | "cluster"
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": null,
45 | "metadata": {},
46 | "outputs": [],
47 | "source": [
48 | "c = Client(cluster)\n",
49 | "c"
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": null,
55 | "metadata": {},
56 | "outputs": [],
57 | "source": [
58 | "# Define some toy functions for simulating work\n",
59 | "import time\n",
60 | "\n",
61 | "def inc(x):\n",
62 | " time.sleep(5)\n",
63 | " return x + 1\n",
64 | "\n",
65 | "def dec(x):\n",
66 | " time.sleep(3)\n",
67 | " return x - 1\n",
68 | "\n",
69 | "def add(x, y):\n",
70 | " time.sleep(7)\n",
71 | " return x + y"
72 | ]
73 | },
74 | {
75 | "cell_type": "markdown",
76 | "metadata": {},
77 | "source": [
78 | "### `Client.submit`\n",
79 | "\n",
80 | "`submit` takes a function and arguments, pushes these to the cluster, returning a *Future* representing the result to be computed. The function is passed to a worker process for evaluation. Note that this cell returns immediately, while computation may still be ongoing on the cluster."
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": null,
86 | "metadata": {},
87 | "outputs": [],
88 | "source": [
89 | "fut = c.submit(inc, 1)\n",
90 | "fut"
91 | ]
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "metadata": {},
96 | "source": [
97 | "Looking at the `repr` of the future, `fut`, you can see its status is \"pending\". This means that the computation the future represents hasn't finished yet. Since this is asynchronous, you can do other things while it computes. However, if you want to wait until the future has completed you can use the `wait` function. At that point the future's status will become \"finished\"."
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": null,
103 | "metadata": {},
104 | "outputs": [],
105 | "source": [
106 | "from distributed import wait"
107 | ]
108 | },
109 | {
110 | "cell_type": "code",
111 | "execution_count": null,
112 | "metadata": {},
113 | "outputs": [],
114 | "source": [
115 | "# Block until `fut` has completed\n",
116 | "wait(fut);"
117 | ]
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "metadata": {},
122 | "source": [
123 | "To retrieve a result you can use the `.result` method (which fetches for the one future), or `Client.gather` (which fetches for one or more futures). In both cases these methods will block until all requested futures have completed."
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": null,
129 | "metadata": {},
130 | "outputs": [],
131 | "source": [
132 | "# Retrieve the data using `.result()`\n",
133 | "fut.result()"
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "execution_count": null,
139 | "metadata": {},
140 | "outputs": [],
141 | "source": [
142 | "# Retrieve the data using `Client.gather` - result is now already available\n",
143 | "c.gather(fut)"
144 | ]
145 | },
146 | {
147 | "cell_type": "markdown",
148 | "metadata": {},
149 | "source": [
150 | "Here we see an alternative way to execute work on the cluster: when you submit or map with the inputs as futures, the *computation moves to the data* rather than the other way around, and the client, in the local python session, need never see the intermediate values. This is similar to building the graph using delayed, and indeed, delayed can be used in conjunction with futures. Below we recreate the delayed object `total` from the earlier notebooks."
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": null,
156 | "metadata": {},
157 | "outputs": [],
158 | "source": [
159 | "from dask import delayed\n",
160 | "\n",
161 | "x = delayed(inc)(1)\n",
162 | "y = delayed(dec)(2)\n",
163 | "total = delayed(add)(x, y)"
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": null,
169 | "metadata": {},
170 | "outputs": [],
171 | "source": [
172 | "# notice the difference from total.compute()\n",
173 | "# notice that this cell completes immediately\n",
174 | "fut = c.compute(total)\n",
175 | "fut"
176 | ]
177 | },
178 | {
179 | "cell_type": "code",
180 | "execution_count": null,
181 | "metadata": {},
182 | "outputs": [],
183 | "source": [
184 | "c.gather(fut)"
185 | ]
186 | },
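  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The introduction also mentioned `map()`. A minimal sketch (using the toy `inc` above): `Client.map` applies a function across a sequence of inputs, returning one future per input.\n",
    "\n",
    "```python\n",
    "futs = c.map(inc, range(8))   # eight futures, computed in parallel on the workers\n",
    "c.gather(futs)                # [1, 2, 3, 4, 5, 6, 7, 8]\n",
    "```"
   ]
  },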
187 | {
188 | "cell_type": "markdown",
189 | "metadata": {},
190 | "source": [
191 | "### Exercise: Rebuild the above delayed computation using `Client.submit` instead"
192 | ]
193 | },
194 | {
195 | "cell_type": "code",
196 | "execution_count": null,
197 | "metadata": {},
198 | "outputs": [],
199 | "source": []
200 | },
201 | {
202 | "cell_type": "markdown",
203 | "metadata": {},
204 | "source": [
205 | "Solution:"
206 | ]
207 | },
208 | {
209 | "cell_type": "code",
210 | "execution_count": null,
211 | "metadata": {},
212 | "outputs": [],
213 | "source": [
214 | "%load solutions/client_submit.py"
215 | ]
216 | },
217 | {
218 | "cell_type": "markdown",
219 | "metadata": {},
220 | "source": [
221 | "The futures API offers a work-submission style that can easily emulate the map/reduce paradigm (see `c.map()`) you may already be familiar with. The intermediate results, represented by futures, can be passed to new tasks without having to bring the data back to the local process, and new work can be assigned to run on the output of previous jobs that haven't even begun yet.\n",
222 | "\n",
223 | "Generally, any dask operation that is executed using `.compute()` can be submitted for asynchronous execution using `c.compute()` instead, and this applies to all collections. Here we create a `Bag`, do some operations on it, and call `Client.compute` on it. Since this is asynchronous we could continue to submit more work (perhaps based on the result of the calculation), or, as in the next cell, follow the progress of the computation. A similar progress-bar appears in the monitoring UI page."
224 | ]
225 | },
226 | {
227 | "cell_type": "code",
228 | "execution_count": null,
229 | "metadata": {},
230 | "outputs": [],
231 | "source": [
232 | "import dask.bag as db\n",
233 | "\n",
234 | "res = (db.from_sequence(range(10))\n",
235 | " .map(inc)\n",
236 | " .filter(lambda x: x % 2 == 0)\n",
237 | " .sum())\n",
238 | "\n",
239 | "res"
240 | ]
241 | },
242 | {
243 | "cell_type": "code",
244 | "execution_count": null,
245 | "metadata": {},
246 | "outputs": [],
247 | "source": [
248 | "f = c.compute(res)\n",
249 | "f"
250 | ]
251 | },
252 | {
253 | "cell_type": "code",
254 | "execution_count": null,
255 | "metadata": {},
256 | "outputs": [],
257 | "source": [
258 | "from distributed.diagnostics import progress"
259 | ]
260 | },
261 | {
262 | "cell_type": "code",
263 | "execution_count": null,
264 | "metadata": {},
265 | "outputs": [],
266 | "source": [
267 | "# note that progress must be the last line of a cell\n",
268 | "# in order to show up\n",
269 | "progress(f)"
270 | ]
271 | },
272 | {
273 | "cell_type": "code",
274 | "execution_count": null,
275 | "metadata": {},
276 | "outputs": [],
277 | "source": [
278 | "c.gather(f)"
279 | ]
280 | },
281 | {
282 | "cell_type": "markdown",
283 | "metadata": {},
284 | "source": [
285 | "**Note well**: a future points to data being computed on the cluster or held in memory. To release the memory, all references to a future need to be deleted. Look at the scheduler dashboard to see the effect of the following."
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": null,
291 | "metadata": {},
292 | "outputs": [],
293 | "source": [
294 | "del f, fut"
295 | ]
296 | },
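  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(A small aside, as a sketch: besides dropping all references, you can also explicitly release a future's result on the cluster with `Client.cancel`.)\n",
    "\n",
    "```python\n",
    "fut = c.submit(inc, 10)\n",
    "c.cancel(fut)    # the scheduler drops the task/result; the future's status becomes 'cancelled'\n",
    "```"
   ]
  },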
297 | {
298 | "cell_type": "markdown",
299 | "metadata": {},
300 | "source": [
301 | "## Asynchronous computation\n",
302 | "
\n",
303 | "\n",
304 | "One benefit of using the futures API is that you can have dynamic computations that adjust as things progress. Here we implement a simple naive search by looping through results as they come in, submitting new points to compute while others are still running.\n",
305 | "\n",
306 | "Watching the [diagnostics dashboard](../../9002/status) as this runs, you can see computations running concurrently while more are being submitted. This flexibility can be useful for parallel algorithms that require some level of synchronization."
307 | ]
308 | },
309 | {
310 | "cell_type": "code",
311 | "execution_count": null,
312 | "metadata": {},
313 | "outputs": [],
314 | "source": [
315 | "# a simple function with interesting minima\n",
316 | "\n",
317 | "def rosenbrock(point):\n",
318 | " \"\"\"Compute the rosenbrock function and return the point and result\"\"\"\n",
319 | " time.sleep(0.1)\n",
320 | " score = (1 - point[0])**2 + 2 * (point[1] - point[0]**2)**2\n",
321 | " return point, score"
322 | ]
323 | },
324 | {
325 | "cell_type": "code",
326 | "execution_count": null,
327 | "metadata": {},
328 | "outputs": [],
329 | "source": [
330 | "from bokeh.io import output_notebook, push_notebook\n",
331 | "from bokeh.models.sources import ColumnDataSource\n",
332 | "from bokeh.plotting import figure, show\n",
333 | "import numpy as np\n",
334 | "output_notebook()\n",
335 | "\n",
336 | "# set up plot background\n",
337 | "N = 500\n",
338 | "x = np.linspace(-5, 5, N)\n",
339 | "y = np.linspace(-5, 5, N)\n",
340 | "xx, yy = np.meshgrid(x, y)\n",
341 | "d = (1 - xx)**2 + 2 * (yy - xx**2)**2\n",
342 | "d = np.log(d)\n",
343 | "\n",
344 | "p = figure(x_range=(-5, 5), y_range=(-5, 5))\n",
345 | "p.image(image=[d], x=-5, y=-5, dw=10, dh=10, palette=\"Spectral11\");"
346 | ]
347 | },
348 | {
349 | "cell_type": "code",
350 | "execution_count": null,
351 | "metadata": {},
352 | "outputs": [],
353 | "source": [
354 | "from dask.distributed import as_completed\n",
355 | "from random import uniform\n",
356 | "\n",
357 | "scale = 5                   # Initial random perturbation scale\n",
358 | "best_point = (0, 0) # Initial guess\n",
359 | "best_score = float('inf') # Best score so far\n",
360 | "startx = [uniform(-scale, scale) for _ in range(10)]\n",
361 | "starty = [uniform(-scale, scale) for _ in range(10)]\n",
362 | "\n",
363 | "# set up plot\n",
364 | "source = ColumnDataSource({'x': startx, 'y': starty, 'c': ['grey'] * 10})\n",
365 | "p.circle(source=source, x='x', y='y', color='c')\n",
366 | "t = show(p, notebook_handle=True)\n",
367 | "\n",
368 | "# initial 10 random points\n",
369 | "futures = [c.submit(rosenbrock, (x, y)) for x, y in zip(startx, starty)]\n",
370 | "iterator = as_completed(futures)\n",
371 | "\n",
372 | "for res in iterator:\n",
373 | " # take a completed point, is it an improvement?\n",
374 | " point, score = res.result()\n",
375 | " if score < best_score:\n",
376 | " best_score, best_point = score, point\n",
377 | " print(score, point)\n",
378 | "\n",
379 | " x, y = best_point\n",
380 | " newx, newy = (x + uniform(-scale, scale), y + uniform(-scale, scale))\n",
381 | " \n",
382 | " # update plot\n",
383 | " source.stream({'x': [newx], 'y': [newy], 'c': ['grey']}, rollover=20)\n",
384 | " push_notebook(t)\n",
385 | " \n",
386 | " # add new point, dynamically, to work on the cluster\n",
387 | " new_point = c.submit(rosenbrock, (newx, newy))\n",
388 | " iterator.add(new_point) # Start tracking new task as well\n",
389 | "\n",
390 | " # Narrow search and consider stopping\n",
391 | " scale *= 0.99\n",
392 | " if scale < 0.001:\n",
393 | " break\n",
394 | "point"
395 | ]
396 | },
397 | {
398 | "cell_type": "raw",
399 | "metadata": {},
400 | "source": []
401 | },
402 | {
403 | "cell_type": "code",
404 | "execution_count": null,
405 | "metadata": {},
406 | "outputs": [],
407 | "source": [
408 | "# clean up\n",
409 | "del futures[:], new_point, iterator, res\n",
410 | "c.close()\n",
411 | "cluster.close()"
412 | ]
413 | }
414 | ],
415 | "metadata": {
416 | "anaconda-cloud": {},
417 | "kernelspec": {
418 | "display_name": "Python 3",
419 | "language": "python",
420 | "name": "python3"
421 | },
422 | "language_info": {
423 | "codemirror_mode": {
424 | "name": "ipython",
425 | "version": 3
426 | },
427 | "file_extension": ".py",
428 | "mimetype": "text/x-python",
429 | "name": "python",
430 | "nbconvert_exporter": "python",
431 | "pygments_lexer": "ipython3",
432 | "version": "3.6.4"
433 | }
434 | },
435 | "nbformat": 4,
436 | "nbformat_minor": 2
437 | }
438 |
--------------------------------------------------------------------------------
/07-machine-learning.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Parallel and Distributed Machine Learning\n",
8 | "\n",
9 | "[Dask-ML](https://dask-ml.readthedocs.io) has resources for parallel and distributed machine learning."
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "## Types of Scaling\n",
17 | "\n",
18 | "There are a couple of distinct scaling problems you might face.\n",
19 | "The scaling strategy depends on which problem you're facing.\n",
20 | "\n",
21 | "1. Large Models: Data fits in RAM, but training takes too long. Many hyperparameter combinations, a large ensemble of many models, etc.\n",
22 | "2. Large Datasets: Data is larger than RAM, and sampling isn't an option.\n",
23 | "\n",
24 | ""
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "* For in-memory problems, just use scikit-learn (or your favorite ML library).\n",
32 | "* For large models, use `dask_ml.joblib` and your favorite scikit-learn estimator\n",
33 | "* For large datasets, use `dask_ml` estimators"
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "## Scikit-Learn in 5 Minutes\n",
41 | "\n",
42 | "Scikit-Learn has a nice, consistent API.\n",
43 | "\n",
44 | "1. You instantiate an `Estimator` (e.g. `LinearRegression`, `RandomForestClassifier`, etc.). All of the models *hyperparameters* (user-specified parameters, not the ones learned by the estimator) are passed to the estimator when it's created.\n",
45 | "2. You call `estimator.fit(X, y)` to train the estimator.\n",
46 | "3. Use `estimator` to inspect attributes, make predictions, etc. "
47 | ]
48 | },
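{
"cell_type": "markdown",
"metadata": {},
"source": [
"For orientation, here is the whole pattern in one minimal, illustrative sketch (it uses `LogisticRegression` purely as an example; the rest of this notebook uses an SVM, and `X`, `y` stand for your training data):\n",
"\n",
"```python\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"model = LogisticRegression(C=1.0)   # 1. instantiate, passing hyperparameters\n",
"model.fit(X, y)                     # 2. train on data X with labels y\n",
"model.score(X, y)                   # 3. inspect, predict, or score\n",
"```"
]
},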
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "Let's generate some random data."
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": null,
59 | "metadata": {},
60 | "outputs": [],
61 | "source": [
62 | "from sklearn.datasets import make_classification\n",
63 | "\n",
64 | "X, y = make_classification(n_samples=10000, n_features=4, random_state=0)\n",
65 | "X[:8]"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": null,
71 | "metadata": {},
72 | "outputs": [],
73 | "source": [
74 | "y[:8]"
75 | ]
76 | },
77 | {
78 | "cell_type": "markdown",
79 | "metadata": {},
80 | "source": [
81 | "We'll fit a LogisitcRegression."
82 | ]
83 | },
84 | {
85 | "cell_type": "code",
86 | "execution_count": null,
87 | "metadata": {},
88 | "outputs": [],
89 | "source": [
90 | "from sklearn.svm import SVC"
91 | ]
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "metadata": {},
96 | "source": [
97 | "Create the estimator and fit it."
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": null,
103 | "metadata": {},
104 | "outputs": [],
105 | "source": [
106 | "estimator = SVC(random_state=0)\n",
107 | "estimator.fit(X, y)"
108 | ]
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "Inspect the learned attributes."
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": null,
120 | "metadata": {},
121 | "outputs": [],
122 | "source": [
123 | "estimator.support_vectors_[:4]"
124 | ]
125 | },
126 | {
127 | "cell_type": "markdown",
128 | "metadata": {},
129 | "source": [
130 | "Check the accuracy."
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": null,
136 | "metadata": {},
137 | "outputs": [],
138 | "source": [
139 | "estimator.score(X, y)"
140 | ]
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {},
145 | "source": [
146 | "## Hyperparameters"
147 | ]
148 | },
149 | {
150 | "cell_type": "markdown",
151 | "metadata": {},
152 | "source": [
153 | "Most models have *hyperparameters*. They affect the fit, but are specified up front instead of learned during training."
154 | ]
155 | },
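{
"cell_type": "markdown",
"metadata": {},
"source": [
"Aside: every scikit-learn estimator exposes its hyperparameters through `get_params`, which is a handy way to see what can be tuned:\n",
"\n",
"```python\n",
"SVC().get_params()  # dict mapping hyperparameter names to their current values\n",
"```"
]
},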
156 | {
157 | "cell_type": "code",
158 | "execution_count": null,
159 | "metadata": {},
160 | "outputs": [],
161 | "source": [
162 | "estimator = SVC(C=0.00001, shrinking=False, random_state=0)\n",
163 | "estimator.fit(X, y)\n",
164 | "estimator.support_vectors_[:4]"
165 | ]
166 | },
167 | {
168 | "cell_type": "code",
169 | "execution_count": null,
170 | "metadata": {},
171 | "outputs": [],
172 | "source": [
173 | "estimator.score(X, y)"
174 | ]
175 | },
176 | {
177 | "cell_type": "markdown",
178 | "metadata": {},
179 | "source": [
180 | "## Hyperparameter Optimization\n",
181 | "\n",
182 | "There are a few ways to learn the best *hyper*parameters while training. One is `GridSearchCV`.\n",
183 | "As the name implies, this does a brute-force search over a grid of hyperparameter combinations."
184 | ]
185 | },
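{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick cost estimate for the search defined below: 2 values of `C` × 2 kernels × 2 cross-validation folds = 8 model fits, plus one final refit of the best estimator on the full dataset."
]
},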
186 | {
187 | "cell_type": "code",
188 | "execution_count": null,
189 | "metadata": {},
190 | "outputs": [],
191 | "source": [
192 | "from sklearn.model_selection import GridSearchCV"
193 | ]
194 | },
195 | {
196 | "cell_type": "code",
197 | "execution_count": null,
198 | "metadata": {},
199 | "outputs": [],
200 | "source": [
201 | "%%time\n",
202 | "estimator = SVC(gamma='auto', random_state=0, probability=True)\n",
203 | "param_grid = {\n",
204 | " 'C': [0.001, 10.0],\n",
205 | " 'kernel': ['rbf', 'poly'],\n",
206 | "}\n",
207 | "\n",
208 | "grid_search = GridSearchCV(estimator, param_grid, verbose=2, cv=2)\n",
209 | "grid_search.fit(X, y)"
210 | ]
211 | },
212 | {
213 | "cell_type": "markdown",
214 | "metadata": {},
215 | "source": [
216 | "## Single-machine parallelism with scikit-learn\n",
217 | "\n",
218 | "\n",
219 | "\n",
220 | "Scikit-Learn has nice *single-machine* parallelism, via Joblib.\n",
221 | "Any scikit-learn estimator that can operate in parallel exposes an `n_jobs` keyword.\n",
222 | "This controls the number of CPU cores that will be used."
223 | ]
224 | },
225 | {
226 | "cell_type": "code",
227 | "execution_count": null,
228 | "metadata": {},
229 | "outputs": [],
230 | "source": [
231 | "%%time\n",
232 | "grid_search = GridSearchCV(estimator, param_grid, verbose=2, cv=2, n_jobs=-1)\n",
233 | "grid_search.fit(X, y)"
234 | ]
235 | },
236 | {
237 | "cell_type": "markdown",
238 | "metadata": {},
239 | "source": [
240 | "## Multi-machine parallelism with Dask\n",
241 | "\n",
242 | "\n",
243 | "\n",
244 | "Dask can talk to scikit-learn (via joblib) so that your *cluster* is used to train a model. "
245 | ]
246 | },
247 | {
248 | "cell_type": "code",
249 | "execution_count": null,
250 | "metadata": {},
251 | "outputs": [],
252 | "source": [
253 | "from pycon_utils import make_cluster\n",
254 | "import dask_ml.joblib\n",
255 | "from dask.distributed import Client\n",
256 | "\n",
257 | "from sklearn.externals import joblib"
258 | ]
259 | },
260 | {
261 | "cell_type": "code",
262 | "execution_count": null,
263 | "metadata": {},
264 | "outputs": [],
265 | "source": [
266 | "cluster = make_cluster()\n",
267 | "cluster"
268 | ]
269 | },
270 | {
271 | "cell_type": "code",
272 | "execution_count": null,
273 | "metadata": {},
274 | "outputs": [],
275 | "source": [
276 | "client = Client(cluster)\n",
277 | "client"
278 | ]
279 | },
280 | {
281 | "cell_type": "markdown",
282 | "metadata": {},
283 | "source": [
284 | "Let's try it on a larger problem (more hyperparameters)."
285 | ]
286 | },
287 | {
288 | "cell_type": "code",
289 | "execution_count": null,
290 | "metadata": {},
291 | "outputs": [],
292 | "source": [
293 | "param_grid = {\n",
294 | " 'C': [0.001, 0.1, 1.0, 2.5, 5, 10.0],\n",
295 | " 'kernel': ['rbf', 'poly', 'linear'],\n",
296 | " 'shrinking': [True, False],\n",
297 | "}\n",
298 | "\n",
299 | "grid_search = GridSearchCV(estimator, param_grid, verbose=2, cv=5, n_jobs=-1)\n",
300 | "\n",
301 | "with joblib.parallel_backend(\"dask\", scatter=[X, y]):\n",
302 | " grid_search.fit(X, y)"
303 | ]
304 | },
305 | {
306 | "cell_type": "markdown",
307 | "metadata": {},
308 | "source": [
309 | "# Training on Large Datasets\n",
310 | "\n",
311 | "Sometimes you'll want to train on a larger than memory dataset. `dask-ml` has implemented estimators that work well on dask arrays and dataframes that may be larger than your machine's RAM."
312 | ]
313 | },
314 | {
315 | "cell_type": "code",
316 | "execution_count": null,
317 | "metadata": {},
318 | "outputs": [],
319 | "source": [
320 | "import dask.array as da\n",
321 | "import dask.delayed\n",
322 | "from sklearn.datasets import make_blobs\n",
323 | "import numpy as np"
324 | ]
325 | },
326 | {
327 | "cell_type": "markdown",
328 | "metadata": {},
329 | "source": [
330 | "We'll make a small (random) dataset locally using scikit-learn."
331 | ]
332 | },
333 | {
334 | "cell_type": "code",
335 | "execution_count": null,
336 | "metadata": {},
337 | "outputs": [],
338 | "source": [
339 | "n_centers = 12\n",
340 | "n_features = 20\n",
341 | "\n",
342 | "X_small, y_small = make_blobs(n_samples=1000, centers=n_centers, n_features=n_features, random_state=0)\n",
343 | "\n",
344 | "centers = np.zeros((n_centers, n_features))\n",
345 | "\n",
346 | "for i in range(n_centers):\n",
347 | " centers[i] = X_small[y_small == i].mean(0)\n",
348 | " \n",
349 | "centers[:4]"
350 | ]
351 | },
352 | {
353 | "cell_type": "markdown",
354 | "metadata": {},
355 | "source": [
356 | "The small dataset will be the template for our large random dataset.\n",
357 | "We'll use `dask.delayed` to adapt `sklearn.datasets.make_blobs`, so that the actual dataset is being generated on our workers. "
358 | ]
359 | },
360 | {
361 | "cell_type": "code",
362 | "execution_count": null,
363 | "metadata": {},
364 | "outputs": [],
365 | "source": [
366 | "n_samples_per_block = 200000\n",
367 | "n_blocks = 500\n",
368 | "\n",
369 | "delayeds = [dask.delayed(make_blobs)(n_samples=n_samples_per_block,\n",
370 | " centers=centers,\n",
371 | " n_features=n_features,\n",
372 | " random_state=i)[0]\n",
373 | " for i in range(n_blocks)]\n",
374 | "arrays = [da.from_delayed(obj, shape=(n_samples_per_block, n_features), dtype=X.dtype)\n",
375 | " for obj in delayeds]\n",
376 | "X = da.concatenate(arrays)\n",
377 | "X"
378 | ]
379 | },
380 | {
381 | "cell_type": "code",
382 | "execution_count": null,
383 | "metadata": {},
384 | "outputs": [],
385 | "source": [
386 | "X.nbytes / 1e9"
387 | ]
388 | },
389 | {
390 | "cell_type": "code",
391 | "execution_count": null,
392 | "metadata": {},
393 | "outputs": [],
394 | "source": [
395 | "X = X.persist() # Only run this on the cluster."
396 | ]
397 | },
398 | {
399 | "cell_type": "markdown",
400 | "metadata": {},
401 | "source": [
402 | "The algorithms implemented in Dask-ML are scalable. They handle larger-than-memory datasets just fine.\n",
403 | "\n",
404 | "They follow the scikit-learn API, so if you're familiar with scikit-learn, you'll feel at home with Dask-ML."
405 | ]
406 | },
407 | {
408 | "cell_type": "code",
409 | "execution_count": null,
410 | "metadata": {},
411 | "outputs": [],
412 | "source": [
413 | "from dask_ml.cluster import KMeans"
414 | ]
415 | },
416 | {
417 | "cell_type": "code",
418 | "execution_count": null,
419 | "metadata": {},
420 | "outputs": [],
421 | "source": [
422 | "clf = KMeans(init_max_iter=3, oversampling_factor=10)"
423 | ]
424 | },
425 | {
426 | "cell_type": "code",
427 | "execution_count": null,
428 | "metadata": {},
429 | "outputs": [],
430 | "source": [
431 | "%time clf.fit(X)"
432 | ]
433 | },
434 | {
435 | "cell_type": "code",
436 | "execution_count": null,
437 | "metadata": {},
438 | "outputs": [],
439 | "source": [
440 | "clf.labels_"
441 | ]
442 | },
443 | {
444 | "cell_type": "code",
445 | "execution_count": null,
446 | "metadata": {},
447 | "outputs": [],
448 | "source": [
449 | "clf.labels_[:10].compute()"
450 | ]
451 | }
452 | ],
453 | "metadata": {
454 | "kernelspec": {
455 | "display_name": "Python 3",
456 | "language": "python",
457 | "name": "python3"
458 | },
459 | "language_info": {
460 | "codemirror_mode": {
461 | "name": "ipython",
462 | "version": 3
463 | },
464 | "file_extension": ".py",
465 | "mimetype": "text/x-python",
466 | "name": "python",
467 | "nbconvert_exporter": "python",
468 | "pygments_lexer": "ipython3",
469 | "version": "3.6.5"
470 | }
471 | },
472 | "nbformat": 4,
473 | "nbformat_minor": 2
474 | }
475 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Parallel Data Analysis with Dask
2 |
3 | Materials for the [Dask tutorial at PyCon 2018](https://us.pycon.org/2018/schedule/presentation/47/).
4 |
5 | ## First Time Setup
6 |
7 | If you don't have `git` installed, you can download a ZIP copy of the repository using the green button above.
8 | Note that the file will be called `dask-tutorial-pycon-2018-master`, instead of `dask-tutorial-pycon-2018`.
9 | Adjust the commands below accordingly.
10 |
11 |
12 | [Install Miniconda](https://conda.io/miniconda.html) or ensure you have Python 3.6 installed on your system.
13 |
14 | ```
15 | # Update conda
16 | conda update conda
17 |
18 | # Clone the repository. Or download the ZIP and add `-master` to the name.
19 | git clone https://github.com/TomAugspurger/dask-tutorial-pycon-2018
20 |
21 | # Enter the repository
22 | cd dask-tutorial-pycon-2018
23 |
24 | # Create the environment
25 | conda env create
26 |
27 | # Activate the environment
28 | conda activate dask-pycon
29 |
30 | # Download data
31 | python prep_data.py
32 |
33 | # Start jupyterlab
34 | jupyter lab
35 | ```
36 |
37 | If you aren't using conda:
38 |
39 | ```
40 | # Clone the repository. Or download the ZIP and add `-master` to the name.
41 | git clone https://github.com/TomAugspurger/dask-tutorial-pycon-2018
42 |
43 | # Enter the repository
44 | cd dask-tutorial-pycon-2018
45 |
46 | # Create a virtualenv
47 | python3 -m venv .env
48 |
49 | # Activate the env
50 | # See https://docs.python.org/3/library/venv.html#creating-virtual-environments
51 | # For bash it's
52 | source .env/bin/activate
53 |
54 | # Install the dependencies
55 | python -m pip install -r requirements.txt
56 |
57 | # Download data
58 | python prep_data.py
59 |
60 | # Start jupyterlab
61 | jupyter lab
62 | ```
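
As an optional check, make sure Dask imports in the environment you just created:

```
python -c "import dask, distributed; print(dask.__version__)"
```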
63 |
64 | ## Connect to the Cluster
65 |
66 | We have a [pangeo](https://github.com/pangeo-data/pangeo) deployment running that'll provide everyone with their own cluster to try out Dask on some larger problems.
67 | You can log into the cluster by going to:
68 |
--------------------------------------------------------------------------------
/environment.yml:
--------------------------------------------------------------------------------
1 | name: dask-pycon
2 | dependencies:
3 | - dask
4 | - dask-ml
5 | - distributed
6 | - graphviz
7 | - h5py
8 | - ipywidgets
9 | - jupyter
10 | - jupyterlab
11 | - matplotlib
12 | - memory_profiler
13 | - numba
14 | - numpy
15 | - pandas
16 | - pillow
17 | - pytables
18 | - python=3.6
19 | - s3fs
20 | - scikit-image
21 | - scikit-learn
22 | - scipy
23 | - pip:
24 | - gcsfs
25 | - graphviz
26 | - cachey
27 | - snakeviz
28 |
--------------------------------------------------------------------------------
/prep_data.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import os
4 | import numpy as np
5 | import pandas as pd
6 | import tarfile
7 | import urllib.request
8 | import zipfile
9 | from glob import glob
10 |
11 | data_dir = 'data'
12 |
13 |
14 | def flights():
15 | flights_raw = os.path.join(data_dir, 'nycflights.tar.gz')
16 | flightdir = os.path.join(data_dir, 'nycflights')
17 | jsondir = os.path.join(data_dir, 'flightjson')
18 |
19 | if not os.path.exists(data_dir):
20 | os.mkdir(data_dir)
21 |
22 | if not os.path.exists(flights_raw):
23 | print("- Downloading NYC Flights dataset... ", end='', flush=True)
24 | url = "https://storage.googleapis.com/dask-tutorial-data/nycflights.tar.gz"
25 | urllib.request.urlretrieve(url, flights_raw)
26 | print("done", flush=True)
27 |
28 | if not os.path.exists(flightdir):
29 | print("- Extracting flight data... ", end='', flush=True)
30 | tar_path = os.path.join('data', 'nycflights.tar.gz')
31 | with tarfile.open(tar_path, mode='r:gz') as flights:
32 | flights.extractall('data/')
33 | print("done", flush=True)
34 |
35 | if not os.path.exists(jsondir):
36 | print("- Creating json data... ", end='', flush=True)
37 | os.mkdir(jsondir)
38 | for path in glob(os.path.join('data', 'nycflights', '*.csv')):
39 | prefix = os.path.splitext(os.path.basename(path))[0]
40 | # Just take the first 10000 rows for the demo
41 | df = pd.read_csv(path).iloc[:10000]
42 | df.to_json(os.path.join('data', 'flightjson', prefix + '.json'),
43 | orient='records', lines=True)
44 | print("done", flush=True)
45 |
46 | print("** Finished! **")
47 |
48 |
49 | def random_array():
50 | if os.path.exists(os.path.join('data', 'random.hdf5')):
51 | return
52 |
53 | print("Create random data for array exercise")
54 | import h5py
55 |
56 | with h5py.File(os.path.join('data', 'random.hdf5')) as f:
57 | dset = f.create_dataset('/x', shape=(1000000000,), dtype='f4')
58 | for i in range(0, 1000000000, 1000000):
59 | dset[i: i + 1000000] = np.random.exponential(size=1000000)
60 |
61 |
62 | def weather(growth=3200):
63 | url = 'https://storage.googleapis.com/dask-tutorial-data/weather-small.zip'
64 | weather_zip = os.path.join('data', 'weather-small.zip')
65 | weather_small = os.path.join('data', 'weather-small')
66 |
67 | if not os.path.exists(weather_zip):
68 | print("Downloading weather data.")
69 | urllib.request.urlretrieve(url, weather_zip)
70 |
71 | if not os.path.exists(weather_small):
72 | print("Extracting to {}".format(weather_small))
73 | zf = zipfile.ZipFile(weather_zip)
74 | zf.extractall(data_dir)
75 |
76 | filenames = sorted(glob(os.path.join('data', 'weather-small', '*.hdf5')))
77 |
78 | if not os.path.exists(os.path.join('data', 'weather-big')):
79 | os.mkdir(os.path.join('data', 'weather-big'))
80 |
81 | if all(os.path.exists(fn.replace('small', 'big')) for fn in filenames):
82 | return
83 |
84 | from skimage.transform import resize
85 | import h5py
86 |
87 | for fn in filenames:
88 | with h5py.File(fn, mode='r') as f:
89 | x = f['/t2m'][:]
90 |
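        # `growth` is a percentage: the default of 3200 scales each axis to 32x its original size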
91 | new_shape = tuple(s * growth // 100 for s in x.shape)
92 |
93 | y = resize(x, new_shape, mode='constant')
94 |
95 | out_fn = os.path.join('data', 'weather-big', os.path.split(fn)[-1])
96 |
97 | try:
98 | with h5py.File(out_fn) as f:
99 | f.create_dataset('/t2m', data=y, chunks=(500, 500))
100 |             except Exception:  # the '/t2m' dataset probably already exists
101 |                 pass
102 |
103 |
104 | def main():
105 | print("Setting up data directory")
106 | print("-------------------------")
107 |
108 | flights()
109 | random_array()
110 | weather()
111 |
112 | print('Finished!')
113 |
114 |
115 | if __name__ == '__main__':
116 | main()
117 |
--------------------------------------------------------------------------------
/pycon_utils.py:
--------------------------------------------------------------------------------
1 | def make_cluster(**kwargs):
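    # Prefer a Kubernetes-backed cluster (as used on the Pangeo deployment);
    # fall back to a single-machine LocalCluster if dask_kubernetes isn't installed.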
2 | try:
3 | from dask_kubernetes import KubeCluster
4 | kwargs.setdefault('n_workers', 9)
5 | cluster = KubeCluster(**kwargs)
6 | except ImportError:
7 | from distributed.deploy.local import LocalCluster
8 | cluster = LocalCluster()
9 | return cluster
10 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | appnope==0.1.0
2 | backcall==0.1.0
3 | bleach==2.1.3
4 | bokeh==0.12.15
5 | boto3==1.7.4
6 | botocore==1.10.4
7 | cachetools==2.0.1
8 | cachey==0.1.1
9 | certifi==2018.4.16
10 | chardet==3.0.4
11 | click==6.7
12 | cloudpickle==0.5.2
13 | cycler==0.10.0
14 | dask==0.17.3
15 | dask-glm==0.1.0
16 | dask-ml==0.4.1
17 | dask-searchcv==0.2.0
18 | decorator==4.3.0
19 | distributed==1.21.6
20 | docutils==0.14
21 | entrypoints==0.2.3
22 | gcsfs==0.1.0
23 | google-auth==1.4.1
24 | google-auth-oauthlib==0.2.0
25 | graphviz==0.8.3
26 | h5py==2.7.1
27 | heapdict==1.0.0
28 | html5lib==1.0.1
29 | idna==2.6
30 | ipykernel==4.8.2
31 | ipython==6.3.1
32 | ipython-genutils==0.2.0
33 | ipywidgets==7.2.1
34 | jedi==0.12.0
35 | Jinja2==2.10
36 | jmespath==0.9.3
37 | jsonschema==2.6.0
38 | jupyter==1.0.0
39 | jupyter-client==5.2.3
40 | jupyter-console==5.2.0
41 | jupyter-core==4.4.0
42 | jupyterlab==0.32.1
43 | jupyterlab-launcher==0.10.5
44 | kiwisolver==1.0.1
45 | llvmlite==0.22.0
46 | locket==0.2.0
47 | MarkupSafe==1.0
48 | matplotlib==2.2.2
49 | memory-profiler==0.52.0
50 | mistune==0.8.3
51 | msgpack-python==0.5.6
52 | multipledispatch==0.5.0
53 | nbconvert==5.3.1
54 | nbformat==4.4.0
55 | notebook==5.4.1
56 | numba==0.37.0
57 | numexpr==2.6.4
58 | numpy==1.14.2
59 | oauthlib==2.0.7
60 | olefile==0.45.1
61 | packaging==17.1
62 | pandas==0.22.0
63 | pandocfilters==1.4.2
64 | parso==0.2.0
65 | partd==0.3.8
66 | pexpect==4.5.0
67 | pickleshare==0.7.4
68 | Pillow==5.1.0
69 | prompt-toolkit==1.0.15
70 | psutil==5.4.5
71 | ptyprocess==0.5.2
72 | pyasn1==0.4.2
73 | pyasn1-modules==0.2.1
74 | Pygments==2.2.0
75 | pyparsing==2.2.0
76 | python-dateutil==2.6.1
77 | pytz==2018.4
78 | PyYAML==3.12
79 | pyzmq==17.0.0
80 | qtconsole==4.3.1
81 | requests==2.18.4
82 | requests-oauthlib==0.8.0
83 | rsa==3.4.2
84 | s3fs==0.1.4
85 | s3transfer==0.1.13
86 | scikit-image
87 | scikit-learn==0.19.1
88 | scipy==1.0.1
89 | Send2Trash==1.5.0
90 | simplegeneric==0.8.1
91 | six==1.11.0
92 | snakeviz==0.4.2
93 | sortedcontainers==1.5.10
94 | tables==3.4.3
95 | tblib==1.3.2
96 | terminado==0.8.1
97 | testpath==0.3.1
98 | toolz==0.9.0
99 | tornado==5.0.2
100 | traitlets==4.3.2
101 | urllib3==1.22
102 | wcwidth==0.1.7
103 | webencodings==0.5.1
104 | widgetsnbextension==3.2.1
105 | zict==0.1.3
106 |
--------------------------------------------------------------------------------
/solutions/00-hello-world.py:
--------------------------------------------------------------------------------
1 | print("Hello, world!")
--------------------------------------------------------------------------------
/solutions/01-delayed-control-flow.py:
--------------------------------------------------------------------------------
1 | results = []
2 | for x in data:
3 | if is_even(x): # even
4 | y = delayed(double)(x)
5 | else: # odd
6 | y = delayed(inc)(x)
7 | results.append(y)
8 |
9 | total = delayed(sum)(results)
--------------------------------------------------------------------------------
/solutions/01-delayed-groupby.py:
--------------------------------------------------------------------------------
1 | # This is just one possible solution, there are
2 | # several ways to do this using `delayed`
3 |
4 | sums = []
5 | counts = []
6 | for fn in filenames:
7 | # Read in file
8 | df = delayed(pd.read_csv)(fn)
9 |
10 | # Groupby origin airport
11 | by_origin = df.groupby('Origin')
12 |
13 | # Sum of all departure delays by origin
14 | total = by_origin.DepDelay.sum()
15 |
16 | # Number of flights by origin
17 | count = by_origin.DepDelay.count()
18 |
19 | # Save the intermediates
20 | sums.append(total)
21 | counts.append(count)
22 |
23 | # Compute the intermediates
24 | sums, counts = compute(sums, counts)
25 |
26 | # Combine intermediates to get total mean-delay-per-origin
27 | total_delays = sum(sums)
28 | n_flights = sum(counts)
29 | mean = total_delays / n_flights
--------------------------------------------------------------------------------
/solutions/01-delayed-loop.py:
--------------------------------------------------------------------------------
1 | results = []
2 |
3 | for x in data:
4 | y = delayed(inc)(x)
5 | results.append(y)
6 |
7 | total = delayed(sum)(results)
8 | print("Before computing:", total) # Let's see what type of thing total is
9 | result = total.compute()
10 | print("After computing :", result) # After it's computed
--------------------------------------------------------------------------------
/solutions/02-dask-arrays-blocked-mean.py:
--------------------------------------------------------------------------------
1 | sums = []
2 | lengths = []
3 | for i in range(0, 1000000000, 1000000):
4 | chunk = dset[i: i + 1000000] # pull out numpy array
5 | sums.append(chunk.sum())
6 | lengths.append(len(chunk))
7 |
8 | total = sum(sums)
9 | length = sum(lengths)
10 | print(total / length)
--------------------------------------------------------------------------------
/solutions/02-dask-arrays-make-arrays.py:
--------------------------------------------------------------------------------
1 | arrays = [da.from_array(dset, chunks=(500, 500)) for dset in dsets]
2 | arrays
--------------------------------------------------------------------------------
/solutions/02-dask-arrays-stacked.py:
--------------------------------------------------------------------------------
1 | x = da.stack(arrays, axis=0)
2 | x
--------------------------------------------------------------------------------
/solutions/02-dask-arrays-store.py:
--------------------------------------------------------------------------------
1 | result = x[:, ::2, ::2]
2 | da.to_hdf5(os.path.join('data', 'myfile.hdf5'), '/output', result)
--------------------------------------------------------------------------------
/solutions/02-dask-arrays-weather-difference.py:
--------------------------------------------------------------------------------
1 | result = x[0] - x.mean(axis=0)
2 | fig = plt.figure(figsize=(16, 8))
3 | plt.imshow(result, cmap='RdBu_r')
--------------------------------------------------------------------------------
/solutions/02-dask-arrays-weather-mean.py:
--------------------------------------------------------------------------------
1 | result = x.mean(axis=0)
2 | fig = plt.figure(figsize=(16, 8))
3 | plt.imshow(result, cmap='RdBu_r')
--------------------------------------------------------------------------------
/solutions/03-dask-dataframe-delay-per-airport.py:
--------------------------------------------------------------------------------
1 | df.groupby("Origin").DepDelay.mean().compute()
--------------------------------------------------------------------------------
/solutions/03-dask-dataframe-delay-per-day.py:
--------------------------------------------------------------------------------
1 | df.groupby("DayOfWeek").DepDelay.mean().compute()
--------------------------------------------------------------------------------
/solutions/03-dask-dataframe-map-partitions.py:
--------------------------------------------------------------------------------
1 | def compute_departure_timestamp(df):
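    # CRSDepTime is an integer encoded as HHMM (e.g. 1430 means 14:30 local time)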
2 | hours = df.CRSDepTime // 100
3 | hours_timedelta = pd.to_timedelta(hours, unit='h')
4 |
5 | minutes = df.CRSDepTime % 100
6 | minutes_timedelta = pd.to_timedelta(minutes, unit='m')
7 |
8 | return df.Date + hours_timedelta + minutes_timedelta
9 |
10 | departure_timestamp = df.map_partitions(compute_departure_timestamp)
--------------------------------------------------------------------------------
/solutions/03-dask-dataframe-non-cancelled-per-airport.py:
--------------------------------------------------------------------------------
1 | df[~df.Cancelled].groupby('Origin').Origin.count().compute()
--------------------------------------------------------------------------------
/solutions/03-dask-dataframe-non-cancelled.py:
--------------------------------------------------------------------------------
1 | len(df[~df.Cancelled])
--------------------------------------------------------------------------------
/solutions/03-dask-dataframe-rows.py:
--------------------------------------------------------------------------------
1 | len(df)
--------------------------------------------------------------------------------
/solutions/05-distributed-dataframes-memory-usage.ipynb:
--------------------------------------------------------------------------------
1 | print("{:0.2f} GB".format(df.memory_usage().sum().compute() / 1e9))
--------------------------------------------------------------------------------
/solutions/client_submit.py:
--------------------------------------------------------------------------------
1 | x = c.submit(inc, 1)
2 | y = c.submit(dec, 2)
3 | total = c.submit(add, x, y)
4 |
5 | print(total) # This is still a future
6 | c.gather(total) # This blocks until the computation has finished
7 |
--------------------------------------------------------------------------------
/static/fail-case.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TomAugspurger/dask-tutorial-pycon-2018/972fd9968a945c7e97a88214907758d5c98dd024/static/fail-case.gif
--------------------------------------------------------------------------------
/static/ml-dimensions-color.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TomAugspurger/dask-tutorial-pycon-2018/972fd9968a945c7e97a88214907758d5c98dd024/static/ml-dimensions-color.png
--------------------------------------------------------------------------------
/static/ml-dimensions.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TomAugspurger/dask-tutorial-pycon-2018/972fd9968a945c7e97a88214907758d5c98dd024/static/ml-dimensions.png
--------------------------------------------------------------------------------
/static/sklearn-parallel-dask.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TomAugspurger/dask-tutorial-pycon-2018/972fd9968a945c7e97a88214907758d5c98dd024/static/sklearn-parallel-dask.png
--------------------------------------------------------------------------------
/static/sklearn-parallel.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TomAugspurger/dask-tutorial-pycon-2018/972fd9968a945c7e97a88214907758d5c98dd024/static/sklearn-parallel.png
--------------------------------------------------------------------------------