├── .dask
│   └── config.yaml
├── .github
│   └── workflows
│       └── build.yml
├── .gitignore
├── 0-welcome.ipynb
├── 1-overview.ipynb
├── 2-custom-operations.ipynb
├── 3-graph-optimization.ipynb
├── 4-distributed-scheduler.ipynb
├── LICENSE
├── README.md
├── binder
│   ├── environment.yml
│   ├── jupyterlab-workspace.json
│   └── start
├── images
│   ├── animation.gif
│   ├── custom_operations_blockwise.png
│   ├── custom_operations_groupby_aggregation.png
│   ├── custom_operations_map_blocks.png
│   ├── custom_operations_reduction.png
│   ├── dask-array.png
│   ├── dask-cluster.png
│   ├── dask-dataframe.png
│   ├── dask-overview.png
│   ├── dask_horizontal.svg
│   └── inc-add.png
└── puzzle
    ├── bicycle.png
    ├── bicycle_0_0.png
    ├── bicycle_0_1.png
    ├── bicycle_1_0.png
    └── bicycle_1_1.png

/.dask/config.yaml:
--------------------------------------------------------------------------------
1 | distributed:
2 |   dashboard:
3 |     link: "{JUPYTERHUB_BASE_URL}user/{JUPYTERHUB_USER}/proxy/{port}/status"
4 | 
--------------------------------------------------------------------------------
/.github/workflows/build.yml:
--------------------------------------------------------------------------------
 1 | name: Build
 2 | on: [push]
 3 | 
 4 | jobs:
 5 |   binder:
 6 |     runs-on: ubuntu-latest
 7 |     steps:
 8 | 
 9 |       - name: Build and cache on mybinder.org
10 |         uses: jupyterhub/repo2docker-action@master
11 |         with:
12 |           NO_PUSH: true
13 |           MYBINDERORG_TAG: ${{ github.event.ref }}
14 | 
15 |   conda-solve:
16 |     runs-on: ${{ matrix.os }}
17 |     strategy:
18 |       matrix:
19 |         os: [windows-latest, ubuntu-latest, macos-latest]
20 | 
21 |     steps:
22 |       - name: Checkout source
23 |         uses: actions/checkout@v2
24 | 
25 |       - name: Setup Conda Environment
26 |         uses: conda-incubator/setup-miniconda@v2
27 |         with:
28 |           environment-file: binder/environment.yml
29 |           activate-environment: hacking-dask
30 |           auto-activate-base: false
/.gitignore:
-------------------------------------------------------------------------------- 1 | dask-worker-space/ 2 | my_report.html 3 | example_data/ 4 | mydask.png 5 | .DS_Store 6 | 7 | # Byte-compiled / optimized / DLL files 8 | __pycache__/ 9 | *.py[cod] 10 | *$py.class 11 | 12 | # C extensions 13 | *.so 14 | 15 | # Distribution / packaging 16 | .Python 17 | build/ 18 | develop-eggs/ 19 | dist/ 20 | downloads/ 21 | eggs/ 22 | .eggs/ 23 | lib/ 24 | lib64/ 25 | parts/ 26 | sdist/ 27 | var/ 28 | wheels/ 29 | share/python-wheels/ 30 | *.egg-info/ 31 | .installed.cfg 32 | *.egg 33 | MANIFEST 34 | 35 | # PyInstaller 36 | # Usually these files are written by a python script from a template 37 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 38 | *.manifest 39 | *.spec 40 | 41 | # Installer logs 42 | pip-log.txt 43 | pip-delete-this-directory.txt 44 | 45 | # Unit test / coverage reports 46 | htmlcov/ 47 | .tox/ 48 | .nox/ 49 | .coverage 50 | .coverage.* 51 | .cache 52 | nosetests.xml 53 | coverage.xml 54 | *.cover 55 | *.py,cover 56 | .hypothesis/ 57 | .pytest_cache/ 58 | cover/ 59 | 60 | # Translations 61 | *.mo 62 | *.pot 63 | 64 | # Django stuff: 65 | *.log 66 | local_settings.py 67 | db.sqlite3 68 | db.sqlite3-journal 69 | 70 | # Flask stuff: 71 | instance/ 72 | .webassets-cache 73 | 74 | # Scrapy stuff: 75 | .scrapy 76 | 77 | # Sphinx documentation 78 | docs/_build/ 79 | 80 | # PyBuilder 81 | .pybuilder/ 82 | target/ 83 | 84 | # Jupyter Notebook 85 | .ipynb_checkpoints 86 | 87 | # IPython 88 | profile_default/ 89 | ipython_config.py 90 | 91 | # pyenv 92 | # For a library or package, you might want to ignore these files since the code is 93 | # intended to run in multiple environments; otherwise, check them in: 94 | # .python-version 95 | 96 | # pipenv 97 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 
98 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 99 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 100 | # install all needed dependencies. 101 | #Pipfile.lock 102 | 103 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 104 | __pypackages__/ 105 | 106 | # Celery stuff 107 | celerybeat-schedule 108 | celerybeat.pid 109 | 110 | # SageMath parsed files 111 | *.sage.py 112 | 113 | # Environments 114 | .env 115 | .venv 116 | env/ 117 | venv/ 118 | ENV/ 119 | env.bak/ 120 | venv.bak/ 121 | 122 | # Spyder project settings 123 | .spyderproject 124 | .spyproject 125 | 126 | # Rope project settings 127 | .ropeproject 128 | 129 | # mkdocs documentation 130 | /site 131 | 132 | # mypy 133 | .mypy_cache/ 134 | .dmypy.json 135 | dmypy.json 136 | 137 | # Pyre type checker 138 | .pyre/ 139 | 140 | # pytype static type analyzer 141 | .pytype/ 142 | 143 | # Cython debug symbols 144 | cython_debug/ 145 | -------------------------------------------------------------------------------- /0-welcome.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "slideshow": { 7 | "slide_type": "slide" 8 | } 9 | }, 10 | "source": [ 11 | "\"Dask\n", 14 | " \n", 15 | "# Hacking Dask: Diving into Dask's internals\n", 16 | "\n", 17 | "## Materials and setup\n", 18 | "\n", 19 | "The materials for this tutorial are available at https://github.com/jrbourbeau/hacking-dask.\n", 20 | "\n", 21 | "There are two ways to run through the tutorial:\n", 22 | "\n", 23 | "- Run locally on your laptop\n", 24 | "- Run using Binder (no setup required)\n", 25 | "\n", 26 | "## About the instructors\n", 27 | "\n", 28 | "#### [Julia Signell](https://jsignell.github.io) — Head of Open Source, [Saturn Cloud](https://www.saturncloud.io)\n", 29 | "#### [James Bourbeau](https://www.jamesbourbeau.com) — Lead Open 
Source Software Engineer, [Coiled](https://coiled.io/)\n", 30 | "\n", 31 | "## Tutorial goals\n", 32 | "\n", 33 | "The goal of this tutorial is to cover more advanced features of Dask like task graph optimization, the worker and scheduler plugin system, how to inspect the internal state of a cluster, and more.\n", 34 | "\n", 35 | "Attendees should walk away with an introduction to more advanced features, ideas of how they can apply these features effectively to their own data-intensive workloads, and a deeper understanding of Dask’s internals.\n", 36 | "\n", 37 | "> ℹ️ NOTE: While there is a brief overview notebook, this tutorial largely assumes some prior knowledge of Dask. If you are new to Dask, we recommend going through the Dask tutorial (https://tutorial.dask.org) to get an introduction to Dask prior to going through this tutorial.\n", 38 | "\n", 39 | "## Outline\n", 40 | "\n", 41 | "The tutorial consists of several Jupyter notebooks which we will cover in the order listed below:\n", 42 | "\n", 43 | "- [0-welcome.ipynb](0-welcome.ipynb)\n", 44 | "- [1-overview.ipynb](1-overview.ipynb)\n", 45 | "- [2-custom-operations.ipynb](2-custom-operations.ipynb)\n", 46 | "- [3-graph-optimization.ipynb](3-graph-optimization.ipynb)\n", 47 | "- [4-distributed-scheduler.ipynb](4-distributed-scheduler.ipynb)\n", 48 | "\n", 49 | "Each notebook also contains hands-on exercises to illustrate the concepts being presented. Let's look at our first example to get a sense for how they work.\n", 50 | "\n", 51 | "### Exercise: Print \"Hello world!\"\n", 52 | "\n", 53 | "Use Python to print the string \"Hello world!\" to the screen."
54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": null, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "# Your solution here" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": { 69 | "jupyter": { 70 | "source_hidden": true 71 | }, 72 | "tags": [] 73 | }, 74 | "outputs": [], 75 | "source": [ 76 | "# A solution\n", 77 | "print(\"Hello world!\")" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "### Next steps\n", 85 | "\n", 86 | "Let's start by going through a brief overview of Dask's basics over in [1-overview.ipynb](1-overview.ipynb)." 87 | ] 88 | } 89 | ], 90 | "metadata": { 91 | "kernelspec": { 92 | "display_name": "Python 3", 93 | "language": "python", 94 | "name": "python3" 95 | }, 96 | "language_info": { 97 | "codemirror_mode": { 98 | "name": "ipython", 99 | "version": 3 100 | }, 101 | "file_extension": ".py", 102 | "mimetype": "text/x-python", 103 | "name": "python", 104 | "nbconvert_exporter": "python", 105 | "pygments_lexer": "ipython3", 106 | "version": "3.9.2" 107 | } 108 | }, 109 | "nbformat": 4, 110 | "nbformat_minor": 4 111 | } 112 | -------------------------------------------------------------------------------- /1-overview.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "slideshow": { 7 | "slide_type": "slide" 8 | } 9 | }, 10 | "source": [ 11 | "\"Dask\n", 14 | " \n", 15 | "# Parallel Computing in Python with Dask\n", 16 | "\n", 17 | "This notebook provides a high-level overview of Dask. We discuss why you might want to use Dask, high-level and low-level APIs for generating computational graphs, and Dask's schedulers which enable the parallel execution of these graphs." 
18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": { 23 | "slideshow": { 24 | "slide_type": "subslide" 25 | } 26 | }, 27 | "source": [ 28 | "# Overview\n", 29 | "\n", 30 | "[Dask](https://docs.dask.org) is a flexible, [open source](https://github.com/dask/dask) library for parallel and distributed computing in Python. Dask is designed to scale the existing Python ecosystem.\n", 31 | "\n", 32 | "You might want to use Dask because it:\n", 33 | "\n", 34 | "- Enables parallel and larger-than-memory computations\n", 35 | "\n", 36 | "- Uses familiar APIs you're used to from projects like NumPy, pandas, and scikit-learn\n", 37 | "\n", 38 | "- Allows you to scale existing workflows with minimal code changes\n", 39 | "\n", 40 | "- Works on your laptop, but also scales out to large clusters\n", 41 | "\n", 42 | "- Offers great built-in diagnostic tools" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "### Components of Dask\n", 50 | "\n", 51 | "From a high level, Dask is composed of two main components:\n", 52 | "\n", 53 | "1. **Dask collections** which extend common interfaces like NumPy, pandas, and Python iterators to larger-than-memory or distributed environments by creating *task graphs*\n", 54 | "2. 
**Schedulers** which compute task graphs produced by Dask collections in parallel\n", 55 | "\n", 56 | "\"Dask" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "### Task Graphs" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": null, 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "def inc(i):\n", 75 | " return i + 1\n", 76 | "\n", 77 | "def add(a, b):\n", 78 | " return a + b\n", 79 | "\n", 80 | "a, b = 1, 12\n", 81 | "c = inc(a)\n", 82 | "d = inc(b)\n", 83 | "output = add(c, d)\n", 84 | "\n", 85 | "print(f'output = {output}')" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "This computation can be encoded in the following task graph:" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "![](images/inc-add.png)" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "\n", 107 | "- Graph of inter-related tasks with dependencies between them\n", 108 | "\n", 109 | "- Circular nodes in the graph are Python function calls\n", 110 | "\n", 111 | "- Square nodes are Python objects that are created by one task as output and can be used as inputs in another task" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": { 117 | "slideshow": { 118 | "slide_type": "subslide" 119 | } 120 | }, 121 | "source": [ 122 | "# Dask Collections\n", 123 | "\n", 124 | "Let's look at two Dask user interfaces: Dask Array and Dask Delayed.\n", 125 | "\n", 126 | "## Dask Arrays\n", 127 | "\n", 128 | "- Dask arrays are chunked, n-dimensional arrays\n", 129 | "\n", 130 | "- Can think of a Dask array as a collection of NumPy `ndarray` arrays\n", 131 | "\n", 132 | "- Dask arrays implement a large subset of the NumPy API using blocked algorithms\n", 133 | "\n", 134 | "- For many purposes Dask arrays can serve as drop-in replacements for NumPy arrays" 135 | ] 136 | 
}, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": { 140 | "slideshow": { 141 | "slide_type": "subslide" 142 | } 143 | }, 144 | "source": [ 145 | "" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": { 152 | "slideshow": { 153 | "slide_type": "subslide" 154 | } 155 | }, 156 | "outputs": [], 157 | "source": [ 158 | "import numpy as np\n", 159 | "import dask.array as da" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "metadata": {}, 166 | "outputs": [], 167 | "source": [ 168 | "x_np = np.random.random(size=(1_000, 1_000))\n", 169 | "x_np" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "We can create a Dask array in a similar manner, but need to specify a `chunks` argument to tell Dask how to break up the underlying array into chunks." 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "x = da.random.random(size=(1_000, 1_000), chunks=(250, 500))" 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": null, 191 | "metadata": {}, 192 | "outputs": [], 193 | "source": [ 194 | "x # Dask arrays have nice HTML output in Jupyter notebooks" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "Dask arrays look and feel like NumPy arrays. For example, they have `dtype` and `shape` attributes" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "print(x.dtype)\n", 211 | "print(x.shape)" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": { 217 | "slideshow": { 218 | "slide_type": "subslide" 219 | } 220 | }, 221 | "source": [ 222 | "Dask arrays are _lazily_ evaluated. The result from a computation isn't computed until you ask for it. 
Instead, a Dask task graph for the computation is produced. You can visualize the task graph using the `visualize()` method." 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": null, 228 | "metadata": { 229 | "slideshow": { 230 | "slide_type": "-" 231 | } 232 | }, 233 | "outputs": [], 234 | "source": [ 235 | "x.visualize()" 236 | ] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": { 241 | "slideshow": { 242 | "slide_type": "subslide" 243 | } 244 | }, 245 | "source": [ 246 | "To compute a task graph call the `compute()` method" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": null, 252 | "metadata": {}, 253 | "outputs": [], 254 | "source": [ 255 | "result = x.compute() # We'll go into more detail about .compute() later on\n", 256 | "result" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "The result of this computation is a familiar NumPy `ndarray`" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": null, 269 | "metadata": {}, 270 | "outputs": [], 271 | "source": [ 272 | "type(result)" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "Dask arrays support a large portion of the NumPy interface:\n", 280 | "\n", 281 | "- Arithmetic and scalar mathematics: `+`, `*`, `exp`, `log`, ...\n", 282 | "\n", 283 | "- Reductions along axes: `sum()`, `mean()`, `std()`, `sum(axis=0)`, ...\n", 284 | "\n", 285 | "- Tensor contractions / dot products / matrix multiply: `tensordot`\n", 286 | "\n", 287 | "- Axis reordering / transpose: `transpose`\n", 288 | "\n", 289 | "- Slicing: `x[:100, 500:100:-2]`\n", 290 | "\n", 291 | "- Fancy indexing along single axes with lists or NumPy arrays: `x[:, [10, 1, 5]]`\n", 292 | "\n", 293 | "- Array protocols like `__array__` and `__array_ufunc__`\n", 294 | "\n", 295 | "- Some linear algebra: `svd`, `qr`, `solve`, `solve_triangular`, `lstsq`, ...\n", 
296 | "\n", 297 | "- ...\n", 298 | "\n", 299 | "See the [Dask array API docs](http://docs.dask.org/en/latest/array-api.html) for full details about what portion of the NumPy API is implemented for Dask arrays." 300 | ] 301 | }, 302 | { 303 | "cell_type": "markdown", 304 | "metadata": {}, 305 | "source": [ 306 | "We can build more complex computations using the familiar NumPy operations we're used to." 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": null, 312 | "metadata": { 313 | "slideshow": { 314 | "slide_type": "subslide" 315 | } 316 | }, 317 | "outputs": [], 318 | "source": [ 319 | "result = (x + x.T).sum(axis=0).mean()" 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": null, 325 | "metadata": {}, 326 | "outputs": [], 327 | "source": [ 328 | "result.visualize()" 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": null, 334 | "metadata": {}, 335 | "outputs": [], 336 | "source": [ 337 | "result.compute()" 338 | ] 339 | }, 340 | { 341 | "cell_type": "markdown", 342 | "metadata": {}, 343 | "source": [ 344 | "**Note**: Dask can be used to scale other array-like libraries that support the NumPy `ndarray` interface. For example, [pydata/sparse](https://sparse.pydata.org/en/latest/) for sparse arrays or [CuPy](https://cupy.chainer.org/) for GPU-accelerated arrays." 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "metadata": {}, 350 | "source": [ 351 | "## Dask Delayed\n", 352 | "\n", 353 | "Sometimes problems don’t fit nicely into one of the high-level collections like Dask arrays or Dask DataFrames. In these cases, you can parallelize custom algorithms using the lower-level Dask `delayed` interface. This allows one to manually create task graphs with a light annotation of normal Python code." 
354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "execution_count": null, 359 | "metadata": {}, 360 | "outputs": [], 361 | "source": [ 362 | "import time\n", 363 | "import random\n", 364 | "\n", 365 | "def inc(x):\n", 366 | " time.sleep(random.random())\n", 367 | " return x + 1\n", 368 | "\n", 369 | "def double(x):\n", 370 | " time.sleep(random.random())\n", 371 | " return 2 * x\n", 372 | " \n", 373 | "def add(x, y):\n", 374 | " time.sleep(random.random())\n", 375 | " return x + y " 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": null, 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [ 384 | "%%time\n", 385 | "\n", 386 | "data = [1, 2, 3, 4]\n", 387 | "\n", 388 | "output = []\n", 389 | "for i in data:\n", 390 | " a = inc(i)\n", 391 | " b = double(i)\n", 392 | " c = add(a, b)\n", 393 | " output.append(c)\n", 394 | "\n", 395 | "total = sum(output)" 396 | ] 397 | }, 398 | { 399 | "cell_type": "markdown", 400 | "metadata": {}, 401 | "source": [ 402 | "Dask `delayed` wraps function calls and delays their execution. `delayed` functions record what we want to compute (a function and input parameters) as a task in a graph that we’ll run later on parallel hardware by calling `compute`." 
403 | ] 404 | }, 405 | { 406 | "cell_type": "code", 407 | "execution_count": null, 408 | "metadata": {}, 409 | "outputs": [], 410 | "source": [ 411 | "from dask import delayed" 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": null, 417 | "metadata": {}, 418 | "outputs": [], 419 | "source": [ 420 | "@delayed\n", 421 | "def lazy_inc(x):\n", 422 | " time.sleep(random.random())\n", 423 | " return x + 1" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": null, 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [ 432 | "lazy_inc" 433 | ] 434 | }, 435 | { 436 | "cell_type": "code", 437 | "execution_count": null, 438 | "metadata": {}, 439 | "outputs": [], 440 | "source": [ 441 | "inc_output = lazy_inc(3) # lazily evaluate inc(3)\n", 442 | "inc_output" 443 | ] 444 | }, 445 | { 446 | "cell_type": "code", 447 | "execution_count": null, 448 | "metadata": {}, 449 | "outputs": [], 450 | "source": [ 451 | "inc_output.compute()" 452 | ] 453 | }, 454 | { 455 | "cell_type": "markdown", 456 | "metadata": {}, 457 | "source": [ 458 | "Using `delayed` functions, we can build up a task graph for the particular computation we want to perform" 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": null, 464 | "metadata": {}, 465 | "outputs": [], 466 | "source": [ 467 | "double_inc_output = lazy_inc(inc_output)\n", 468 | "double_inc_output" 469 | ] 470 | }, 471 | { 472 | "cell_type": "code", 473 | "execution_count": null, 474 | "metadata": {}, 475 | "outputs": [], 476 | "source": [ 477 | "double_inc_output.visualize()" 478 | ] 479 | }, 480 | { 481 | "cell_type": "code", 482 | "execution_count": null, 483 | "metadata": {}, 484 | "outputs": [], 485 | "source": [ 486 | "double_inc_output.compute()" 487 | ] 488 | }, 489 | { 490 | "cell_type": "markdown", 491 | "metadata": {}, 492 | "source": [ 493 | "We can use `delayed` to make our previous example computation lazy by wrapping all the function calls with 
delayed" 494 | ] 495 | }, 496 | { 497 | "cell_type": "code", 498 | "execution_count": null, 499 | "metadata": {}, 500 | "outputs": [], 501 | "source": [ 502 | "import time\n", 503 | "import random\n", 504 | "\n", 505 | "@delayed\n", 506 | "def inc(x):\n", 507 | " time.sleep(random.random())\n", 508 | " return x + 1\n", 509 | "\n", 510 | "@delayed\n", 511 | "def double(x):\n", 512 | " time.sleep(random.random())\n", 513 | " return 2 * x\n", 514 | "\n", 515 | "@delayed\n", 516 | "def add(x, y):\n", 517 | " time.sleep(random.random())\n", 518 | " return x + y" 519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "execution_count": null, 524 | "metadata": {}, 525 | "outputs": [], 526 | "source": [ 527 | "%%time\n", 528 | "\n", 529 | "data = [1, 2, 3, 4]\n", 530 | "\n", 531 | "output = []\n", 532 | "for i in data:\n", 533 | " a = inc(i)\n", 534 | " b = double(i)\n", 535 | " c = add(a, b)\n", 536 | " output.append(c)\n", 537 | "\n", 538 | "total = delayed(sum)(output)\n", 539 | "total" 540 | ] 541 | }, 542 | { 543 | "cell_type": "code", 544 | "execution_count": null, 545 | "metadata": {}, 546 | "outputs": [], 547 | "source": [ 548 | "total.visualize()" 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": null, 554 | "metadata": {}, 555 | "outputs": [], 556 | "source": [ 557 | "%%time\n", 558 | "\n", 559 | "total.compute()" 560 | ] 561 | }, 562 | { 563 | "cell_type": "markdown", 564 | "metadata": {}, 565 | "source": [ 566 | "We highly recommend checking out the [Dask delayed best practices](http://docs.dask.org/en/latest/delayed-best-practices.html) page to avoid some common pitfalls when using `delayed`. " 567 | ] 568 | }, 569 | { 570 | "cell_type": "markdown", 571 | "metadata": {}, 572 | "source": [ 573 | "# Schedulers\n", 574 | "\n", 575 | "High-level collections like Dask arrays and Dask DataFrames, as well as the low-level `dask.delayed` interface build up task graphs for a computation. 
After these graphs are generated, they need to be executed (potentially in parallel). This is the job of a task scheduler. Different task schedulers exist within Dask. Each will consume a task graph and compute the same result, but with different performance characteristics. " 576 | ] 577 | }, 578 | { 579 | "cell_type": "markdown", 580 | "metadata": {}, 581 | "source": [ 582 | "![grid-search](images/animation.gif \"grid-search\")\n" 583 | ] 584 | }, 585 | { 586 | "cell_type": "markdown", 587 | "metadata": {}, 588 | "source": [ 589 | "Dask has two different classes of schedulers: single-machine schedulers and a distributed scheduler." 590 | ] 591 | }, 592 | { 593 | "cell_type": "markdown", 594 | "metadata": {}, 595 | "source": [ 596 | "## Single Machine Schedulers\n", 597 | "\n", 598 | "Single machine schedulers provide basic features on a local process or thread pool and require no setup (only use the Python standard library). The different single machine schedulers Dask provides are:\n", 599 | "\n", 600 | "- `'threads'`: The threaded scheduler executes computations with a local `concurrent.futures.ThreadPoolExecutor`. The threaded scheduler is the default choice for Dask arrays, Dask DataFrames, and Dask delayed. \n", 601 | "\n", 602 | "- `'processes'`: The multiprocessing scheduler executes computations with a local `concurrent.futures.ProcessPoolExecutor`.\n", 603 | "\n", 604 | "- `'single-threaded'`: The single-threaded synchronous scheduler executes all computations in the local thread, with no parallelism at all. This is particularly valuable for debugging and profiling, which are more difficult when using threads or processes." 605 | ] 606 | }, 607 | { 608 | "cell_type": "markdown", 609 | "metadata": {}, 610 | "source": [ 611 | "You can configure which scheduler is used in a few different ways. 
You can set the scheduler globally by using the `dask.config.set(scheduler=)` command" 612 | ] 613 | }, 614 | { 615 | "cell_type": "code", 616 | "execution_count": null, 617 | "metadata": {}, 618 | "outputs": [], 619 | "source": [ 620 | "import dask\n", 621 | "\n", 622 | "dask.config.set(scheduler='threads')\n", 623 | "x.compute(); # Will use the multi-threading scheduler" 624 | ] 625 | }, 626 | { 627 | "cell_type": "markdown", 628 | "metadata": {}, 629 | "source": [ 630 | "or use it as a context manager to set the scheduler for a block of code" 631 | ] 632 | }, 633 | { 634 | "cell_type": "code", 635 | "execution_count": null, 636 | "metadata": {}, 637 | "outputs": [], 638 | "source": [ 639 | "with dask.config.set(scheduler='processes'):\n", 640 | " x.compute() # Will use the multi-processing scheduler" 641 | ] 642 | }, 643 | { 644 | "cell_type": "markdown", 645 | "metadata": {}, 646 | "source": [ 647 | "or even within a single compute call" 648 | ] 649 | }, 650 | { 651 | "cell_type": "code", 652 | "execution_count": null, 653 | "metadata": {}, 654 | "outputs": [], 655 | "source": [ 656 | "x.compute(scheduler='threads'); # Will use the multi-threading scheduler" 657 | ] 658 | }, 659 | { 660 | "cell_type": "markdown", 661 | "metadata": {}, 662 | "source": [ 663 | "The `num_workers` argument is used to specify the number of threads or processes to use" 664 | ] 665 | }, 666 | { 667 | "cell_type": "code", 668 | "execution_count": null, 669 | "metadata": {}, 670 | "outputs": [], 671 | "source": [ 672 | "x.compute(scheduler='threads', num_workers=4);" 673 | ] 674 | }, 675 | { 676 | "cell_type": "markdown", 677 | "metadata": {}, 678 | "source": [ 679 | "## Distributed Scheduler\n", 680 | "\n", 681 | "Despite having \"distributed\" in its name, the distributed scheduler works well on both single and multiple machines. 
Think of it as the \"advanced scheduler\".\n", 682 | "\n", 683 | "A Dask distributed cluster is composed of a single centralized scheduler and one or more worker processes. A `Client` object is used as the user-facing entry point to interact with the cluster. We will talk about the components of Dask clusters in more detail later on in [4-distributed-scheduler.ipynb](4-distributed-scheduler.ipynb).\n", 684 | "\n", 685 | "\"Dask" 688 | ] 689 | }, 690 | { 691 | "cell_type": "markdown", 692 | "metadata": {}, 693 | "source": [ 694 | "The distributed scheduler has many features:\n", 695 | "\n", 696 | "- [Real-time, `concurrent.futures`-like interface](https://docs.dask.org/en/latest/futures.html)\n", 697 | "\n", 698 | "- [Sophisticated memory management](https://distributed.dask.org/en/latest/memory.html)\n", 699 | "\n", 700 | "- [Data locality](https://distributed.dask.org/en/latest/locality.html)\n", 701 | "\n", 702 | "- [Adaptive deployments](https://distributed.dask.org/en/latest/adaptive.html)\n", 703 | "\n", 704 | "- [Cluster resilience](https://distributed.dask.org/en/latest/resilience.html)\n", 705 | "\n", 706 | "- ...\n", 707 | "\n", 708 | "See the [Dask distributed documentation](https://distributed.dask.org) for full details about all the distributed scheduler features."
709 | ] 710 | }, 711 | { 712 | "cell_type": "code", 713 | "execution_count": null, 714 | "metadata": {}, 715 | "outputs": [], 716 | "source": [ 717 | "from dask.distributed import Client\n", 718 | "\n", 719 | "# Creates a local Dask cluster\n", 720 | "client = Client()\n", 721 | "client" 722 | ] 723 | }, 724 | { 725 | "cell_type": "code", 726 | "execution_count": null, 727 | "metadata": {}, 728 | "outputs": [], 729 | "source": [ 730 | "x = da.ones((20_000, 20_000), chunks=(400, 400))\n", 731 | "result = (x + x.T).sum(axis=0).mean()" 732 | ] 733 | }, 734 | { 735 | "cell_type": "code", 736 | "execution_count": null, 737 | "metadata": {}, 738 | "outputs": [], 739 | "source": [ 740 | "result.compute()" 741 | ] 742 | }, 743 | { 744 | "cell_type": "code", 745 | "execution_count": null, 746 | "metadata": {}, 747 | "outputs": [], 748 | "source": [ 749 | "client.close()" 750 | ] 751 | }, 752 | { 753 | "cell_type": "markdown", 754 | "metadata": {}, 755 | "source": [ 756 | "# Next steps\n", 757 | "\n", 758 | "Next, let's learn more about performing custom operations on Dask collections in the [2-custom-operations.ipynb](2-custom-operations.ipynb) notebook." 
759 | ] 760 | } 761 | ], 762 | "metadata": { 763 | "kernelspec": { 764 | "display_name": "Python 3", 765 | "language": "python", 766 | "name": "python3" 767 | }, 768 | "language_info": { 769 | "codemirror_mode": { 770 | "name": "ipython", 771 | "version": 3 772 | }, 773 | "file_extension": ".py", 774 | "mimetype": "text/x-python", 775 | "name": "python", 776 | "nbconvert_exporter": "python", 777 | "pygments_lexer": "ipython3", 778 | "version": "3.9.4" 779 | } 780 | }, 781 | "nbformat": 4, 782 | "nbformat_minor": 4 783 | } 784 | -------------------------------------------------------------------------------- /2-custom-operations.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f3e6c3f1-9673-4eb0-81b1-3fe9ac7e84cf", 6 | "metadata": {}, 7 | "source": [ 8 | "# Advanced Collections: _Custom Operations_\n", 9 | "\n", 10 | "In the overview notebook we discussed some of the many algorithms that are pre-defined for different types of Dask collections\n", 11 | "(such as Arrays and DataFrames). These include operations like `mean`, `max`, `value_counts`, and many other standard operations.\n", 12 | "\n", 13 | "In this notebook we'll explore how those operations are implemented and learn how to construct our own custom operations to use with Dask Arrays and Dask DataFrames.\n", 14 | "\n", 15 | "\n", 16 | "**Related Documentation**\n", 17 | "\n", 18 | " - [Array Tutorial](https://tutorial.dask.org/03_array.html)\n", 19 | " - [Best Practices](https://docs.dask.org/en/latest/best-practices.html#learn-techniques-for-customization)" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "id": "7c0fe241-96cf-414d-b5ee-d700b51651df", 25 | "metadata": {}, 26 | "source": [ 27 | "## Blocked Algorithms\n", 28 | "\n", 29 | "Dask computations are implemented using _blocked algorithms_. 
These algorithms break up a computation on a large array into many computations on smaller pieces of the array. This minimizes the memory load (amount of RAM) of computations and allows for working with larger-than-memory datasets in parallel." 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "id": "fca494a4-362a-4503-9725-e79ebeee84c4", 36 | "metadata": { 37 | "slideshow": { 38 | "slide_type": "subslide" 39 | } 40 | }, 41 | "outputs": [], 42 | "source": [ 43 | "import dask.array as da\n", 44 | "\n", 45 | "x = da.random.random(size=(1_000, 1_000), chunks=(250, 500))\n", 46 | "x" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "id": "849c5b30-fab9-4cd8-9e7f-a2dd4bbf1f4c", 52 | "metadata": {}, 53 | "source": [ 54 | "In the overview notebook we looked at the task graph for the following computation:" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "id": "776c7645-3c80-4390-a8ed-dde0d4463712", 61 | "metadata": { 62 | "slideshow": { 63 | "slide_type": "subslide" 64 | } 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "result = (x + x.T).sum(axis=0).mean()" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "id": "b9f181e4-0205-4f9a-89db-4f8f3a54d801", 74 | "metadata": {}, 75 | "source": [ 76 | "Now let's break that down a bit and look at the task graph for just one part of that computation." 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "id": "ad44e637-1782-4eb1-89c3-0025230e1fce", 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "x.T.visualize()" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "id": "4087e539-2d92-4ed4-98ab-03471de15c8d", 92 | "metadata": {}, 93 | "source": [ 94 | "This graph demonstrates how blocked algorithms work. In the perfectly parallelizable situation, Dask can operate on each block in isolation and then reassemble the results from the outputs. 
Dask makes it easy to construct graphs like this using a numpy-like API. " 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "id": "negative-minority", 100 | "metadata": {}, 101 | "source": [ 102 | "## Custom Block Computations\n", 103 | "Block computations operate on a per-block basis. So each block gets the function applied to it, and the output has the same chunk location as the input.\n", 104 | "\n", 105 | "Some examples include the following:\n", 106 | "- custom IO operations\n", 107 | "- applying embarrassingly parallel functions for which there is no existing Dask implementation\n", 108 | "\n", 109 | "![map_blocks](images/custom_operations_map_blocks.png)\n", 110 | "\n", 111 | "**Related Documentation**\n", 112 | "\n", 113 | " - [`dask.array.map_blocks`](https://docs.dask.org/en/latest/array-api.html?highlight=map_blocks#dask.array.Array.map_blocks)\n", 114 | " - [`dask.dataframe.map_partitions`](http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_partitions)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "id": "wired-albert", 120 | "metadata": {}, 121 | "source": [ 122 | "### `map_blocks`\n", 123 | "\n", 124 | "Let's imagine that there was no `da.random.random` method. We can create our own version using `map_blocks`. 
" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "id": "8262fd43-b432-4e40-aced-4f9ad633d0c7", 131 | "metadata": {}, 132 | "outputs": [], 133 | "source": [ 134 | "import numpy as np\n", 135 | "\n", 136 | "def random_sample():\n", 137 | " return np.random.random(size=(250, 500))\n", 138 | "\n", 139 | "x = da.map_blocks(random_sample, chunks=((250, 250, 250, 250), (500, 500)), dtype=float)\n", 140 | "x" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "id": "8660b3ed-494e-4c41-af0f-59b0e1626194", 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "x.visualize()" 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "id": "requested-congo", 156 | "metadata": {}, 157 | "source": [ 158 | "> #### Understanding `chunks` argument\n", 159 | ">\n", 160 | "> In the example above we explicitly declare what the size of the output chunks will be ``chunks=((250, 250, 250, 250), (500, 500))`` this means 8 chunks each with shape `(250, 500)` you'll also see the chunks argument written in the short version where only the shape of one chunk is defined ``chunks=(250, 500)``. These mean the same thing.\n", 161 | ">\n", 162 | "> Specifying the output chunks is very useful when doing more involved operations with ``map_blocks``. By specifying ``chunks``, you can guarantee that the output will have the right shape which lets you properly chain together other operations. \n", 163 | ">\n", 164 | "> When in doubt, you can always find shape and chunk information in the array representation." 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "id": "555f0551-bf38-4e95-a8a4-3aa9ac5b26fd", 170 | "metadata": {}, 171 | "source": [ 172 | "In that example we created an array from scratch by passing in `dtype` and `chunks`. Next we'll consider the case of applying `map_blocks` to existing arrays." 
173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "id": "nervous-devon", 178 | "metadata": {}, 179 | "source": [ 180 | "#### Multiple arrays\n", 181 | "\n", 182 | "``map_blocks`` can be used on single arrays or to combine several arrays. When multiple arrays are passed, ``map_blocks``\n", 183 | "aligns blocks by block location without regard to shape.\n", 184 | "\n", 185 | "In the following example we have two arrays with the same number of blocks\n", 186 | "but with different shape and chunk sizes." 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "id": "effective-correspondence", 193 | "metadata": {}, 194 | "outputs": [], 195 | "source": [ 196 | "a = da.arange(1000, chunks=(100,))\n", 197 | "b = da.arange(100, chunks=(10,))" 198 | ] 199 | }, 200 | { 201 | "cell_type": "markdown", 202 | "id": "biblical-maldives", 203 | "metadata": {}, 204 | "source": [ 205 | "Let's take a look at these arrays:" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": null, 211 | "id": "representative-twins", 212 | "metadata": {}, 213 | "outputs": [], 214 | "source": [ 215 | "a" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": null, 221 | "id": "sound-johns", 222 | "metadata": {}, 223 | "outputs": [], 224 | "source": [ 225 | "b" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "id": "polish-excess", 231 | "metadata": {}, 232 | "source": [ 233 | "We can pass these arrays into ``map_blocks`` using a function that takes two inputs, calculates the max of each, and then returns a numpy array of the outputs. 
" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "id": "exceptional-aberdeen", 240 | "metadata": {}, 241 | "outputs": [], 242 | "source": [ 243 | "def func(a, b):\n", 244 | " return np.array([a.max(), b.max()])\n", 245 | "\n", 246 | "result = da.map_blocks(func, a, b, chunks=(2,))\n", 247 | "result.visualize()" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "id": "adapted-output", 253 | "metadata": {}, 254 | "source": [ 255 | "#### Special arguments\n", 256 | "\n", 257 | "There are special arguments (``block_info`` and ``block_id``) that you can use within ``map_blocks`` functions. ``block_id`` gives the index of the block within the chunks, so for a 1D array it will be something like `(i,)`. ``block_info`` is a dictionary where there is an integer key for each input dask array and a `None` key for the output array.\n", 258 | "\n", 259 | "Let's use the example above and print ``block_info`` for the first block so that we can get a sense of what information is contained in it:" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": null, 265 | "id": "editorial-commodity", 266 | "metadata": {}, 267 | "outputs": [], 268 | "source": [ 269 | "from pprint import pprint\n", 270 | "\n", 271 | "def func(a, b, block_id=None, block_info=None):\n", 272 | " if block_id == (0,):\n", 273 | " pprint(block_info)\n", 274 | " return np.array([a.max(), b.max()])\n", 275 | "\n", 276 | "da.map_blocks(func, a, b, chunks=(2,)).compute()" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "id": "higher-kentucky", 282 | "metadata": {}, 283 | "source": [ 284 | "One of the use cases for the ``block_info`` and ``block_id`` arguments is to create an array from scratch by reading in specific files for each block." 
285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "id": "confident-lover", 290 | "metadata": {}, 291 | "source": [ 292 | "**Exercise**\n", 293 | "\n", 294 | "Say you have a set of images that each represent a particular portion of a scene. How can you use the\n", 295 | "technique we just learned to patch them together? \n", 296 | "\n", 297 | "Let's look at what is in the puzzle directory:" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "id": "mighty-dakota", 304 | "metadata": {}, 305 | "outputs": [], 306 | "source": [ 307 | "!ls puzzle" 308 | ] 309 | }, 310 | { 311 | "cell_type": "markdown", 312 | "id": "sixth-expansion", 313 | "metadata": {}, 314 | "source": [ 315 | "The following cell displays the completed puzzle." 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": null, 321 | "id": "urban-bulletin", 322 | "metadata": {}, 323 | "outputs": [], 324 | "source": [ 325 | "from imageio import imread\n", 326 | "import matplotlib.pyplot as plt\n", 327 | "\n", 328 | "image = imread(\"puzzle/bicycle.png\")\n", 329 | "plt.imshow(image)" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "id": "relevant-stake", 335 | "metadata": {}, 336 | "source": [ 337 | "Now use ``map_blocks`` to read in the puzzle pieces from \"bicycle_i_j.png\". Note that each image piece has 3 dimensions: x, y, and RGBA." 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": null, 343 | "id": "789f531c-8d58-4987-bbda-8127addfe835", 344 | "metadata": {}, 345 | "outputs": [], 346 | "source": [ 347 | "# define a function that reads one file\n", 348 | "def reader(block_id=None):\n", 349 | " ... = block_id # unpack block_id to get chunk location\n", 350 | " filename = ...
# use chunk location to get the correct file\n", 351 | " return imread(filename)" 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": null, 357 | "id": "coated-daughter", 358 | "metadata": { 359 | "jupyter": { 360 | "source_hidden": true 361 | }, 362 | "tags": [] 363 | }, 364 | "outputs": [], 365 | "source": [ 366 | "# solution\n", 367 | "# define a function that reads one file\n", 368 | "def reader(block_id=None):\n", 369 | " ii, jj, _ = block_id # unpack block_id to get chunk location\n", 370 | " filename = f\"puzzle/bicycle_{ii}_{jj}.png\" # use chunk location to get the correct file\n", 371 | " return imread(filename)" 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": null, 377 | "id": "c5149566-5665-496c-b6f8-b935f8d3be0d", 378 | "metadata": {}, 379 | "outputs": [], 380 | "source": [ 381 | "# map that function to every block and set the expected dtype and chunk pattern of the output\n", 382 | "result = da.map_blocks(reader, dtype=int, chunks=((24, 24), (24, 24), (4,)))\n", 383 | "plt.imshow(result)" 384 | ] 385 | }, 386 | { 387 | "cell_type": "markdown", 388 | "id": "atlantic-interim", 389 | "metadata": {}, 390 | "source": [ 391 | "### ``map_partitions``\n", 392 | "\n", 393 | "In Dask dataframe there is a similar method to ``map_blocks`` but it is called ``map_partitions``.\n", 394 | "\n", 395 | "Here is an example of using it to check if the sum of two columns is greater than some arbitrary threshold." 
396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": null, 401 | "id": "d01aeac1-a313-4ab4-b940-719ee5fbd116", 402 | "metadata": { 403 | "tags": [] 404 | }, 405 | "outputs": [], 406 | "source": [ 407 | "import dask\n", 408 | "import dask.dataframe as dd\n", 409 | "\n", 410 | "ddf = dask.datasets.timeseries()\n", 411 | "\n", 412 | "result = ddf.map_partitions(lambda df, threshold: (df.x + df.y) > threshold, threshold=0)\n", 413 | "result.compute()" 414 | ] 415 | }, 416 | { 417 | "cell_type": "markdown", 418 | "id": "collected-preparation", 419 | "metadata": {}, 420 | "source": [ 421 | "#### Internal uses\n", 422 | "In practice ``map_partitions`` is used to implement many of the helper dataframe methods\n", 423 | "that let Dask dataframe mimic Pandas. Here is the implementation of `ddf.index` for instance:\n", 424 | "\n", 425 | "```python\n", 426 | "@property\n", 427 | "def index(self):\n", 428 | " \"\"\"Return dask Index instance\"\"\"\n", 429 | " return self.map_partitions(\n", 430 | " getattr,\n", 431 | " \"index\",\n", 432 | " token=self._name + \"-index\",\n", 433 | " meta=self._meta.index,\n", 434 | " enforce_metadata=False,\n", 435 | " )\n", 436 | "```\n", 437 | "\n", 438 | "[source](https://github.com/dask/dask/blob/09862ed99a02bf3a617ac53b116f9ecf81eea338/dask/dataframe/core.py#L458-L467)\n" 439 | ] 440 | }, 441 | { 442 | "cell_type": "markdown", 443 | "id": "optical-norwegian", 444 | "metadata": {}, 445 | "source": [ 446 | "#### Understanding `meta` argument\n", 447 | "\n", 448 | "Dask dataframes and dask arrays have a special attribute called `_meta` that allows them to know metadata about the type of dataframe/array that they represent. 
This metadata includes:\n", 449 | " - dtype (int, float)\n", 450 | " - column names and order\n", 451 | " - name\n", 452 | " - type (pandas dataframe, cudf dataframe)\n", 453 | " \n", 454 | "**Related documentation**\n", 455 | "\n", 456 | "- [Dataframe metadata](https://docs.dask.org/en/latest/dataframe-design.html#metadata)\n", 457 | "\n", 458 | "This information is stored in an empty object of the proper type." 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": null, 464 | "id": "external-treaty", 465 | "metadata": {}, 466 | "outputs": [], 467 | "source": [ 468 | "print(ddf._meta)" 469 | ] 470 | }, 471 | { 472 | "cell_type": "markdown", 473 | "id": "tender-architect", 474 | "metadata": {}, 475 | "source": [ 476 | "That's how dask knows what to render when you display a dask object:" 477 | ] 478 | }, 479 | { 480 | "cell_type": "code", 481 | "execution_count": null, 482 | "id": "operating-estimate", 483 | "metadata": {}, 484 | "outputs": [], 485 | "source": [ 486 | "print(ddf)" 487 | ] 488 | }, 489 | { 490 | "cell_type": "markdown", 491 | "id": "conscious-prairie", 492 | "metadata": {}, 493 | "source": [ 494 | "When you add an item to the task graph, Dask tries to run the function on the meta before you call compute. \n", 495 | "\n", 496 | "This approach has several benefits:\n", 497 | "\n", 498 | "- it gives Dask a sense of what the output will look like. \n", 499 | "- if there are fundamental issues, Dask will fail fast\n", 500 | "\n", 501 | "Here are a few examples. 
" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": null, 507 | "id": "bd3fb018-de20-4f75-9d7d-441a6ec3a97b", 508 | "metadata": {}, 509 | "outputs": [], 510 | "source": [ 511 | "ddf.sum()" 512 | ] 513 | }, 514 | { 515 | "cell_type": "code", 516 | "execution_count": null, 517 | "id": "essential-excitement", 518 | "metadata": {}, 519 | "outputs": [], 520 | "source": [ 521 | "ddf.name.str.startswith(\"A\")" 522 | ] 523 | }, 524 | { 525 | "cell_type": "markdown", 526 | "id": "dcfa14a6-fd76-4dff-89de-def4039b0103", 527 | "metadata": {}, 528 | "source": [ 529 | "See how the output looks right? The dtypes are correct, the type is a `Series` rather than a `DataFrame` like the input.\n", 530 | "\n", 531 | "**Exercise**\n", 532 | "\n", 533 | "Try using `startswith` on a different column and see what you get :)" 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": null, 539 | "id": "7344e1d3-e417-4324-82e0-57e8efab5fd4", 540 | "metadata": { 541 | "jupyter": { 542 | "source_hidden": true 543 | }, 544 | "tags": [] 545 | }, 546 | "outputs": [], 547 | "source": [ 548 | "# solution\n", 549 | "ddf.x.str.startswith(\"A\")" 550 | ] 551 | }, 552 | { 553 | "cell_type": "markdown", 554 | "id": "congressional-memory", 555 | "metadata": {}, 556 | "source": [ 557 | "### Declaring meta\n", 558 | "\n", 559 | "Sometimes running the function on a miniature version of the data doesn't produce a result that is similar enough to your expected output. \n", 560 | "\n", 561 | "In those cases you can provide a `meta` manually." 
562 | ] 563 | }, 564 | { 565 | "cell_type": "code", 566 | "execution_count": null, 567 | "id": "royal-carroll", 568 | "metadata": {}, 569 | "outputs": [], 570 | "source": [ 571 | "result = ddf.map_partitions(lambda df, threshold: (df.x + df.y) > threshold, threshold=0, meta=bool)\n", 572 | "result.compute()" 573 | ] 574 | }, 575 | { 576 | "cell_type": "markdown", 577 | "id": "expressed-sharing", 578 | "metadata": {}, 579 | "source": [ 580 | "### `map_overlap`\n", 581 | "Sometimes you want to operate on a per-block basis, but you need some information from neighboring blocks. \n", 582 | "\n", 583 | "Example operations include the following:\n", 584 | "\n", 585 | "- Convolve a filter across an image\n", 586 | "- Rolling sum/mean/max, …\n", 587 | "- Search for image motifs like a Gaussian blob that might span the border of a block\n", 588 | "- Evaluate a partial derivative\n", 589 | "\n", 590 | "Dask Array supports these operations by creating a new array where each block is slightly expanded by the borders of its neighbors. \n", 591 | "\n", 592 | "![](https://docs.dask.org/en/latest/_images/overlapping-neighbors.png)\n", 593 | "\n", 594 | "This costs an excess copy and the communication of many small chunks, but allows localized functions to evaluate in an embarrassingly parallel manner.\n", 595 | "\n", 596 | "**Related Documentation**\n", 597 | " - [Array Overlap](https://docs.dask.org/en/latest/array-overlap.html)\n", 598 | "\n", 599 | "The main API for these computations is the ``map_overlap`` method. 
``map_overlap`` is very similar to ``map_blocks`` but has the additional arguments: ``depth``, ``boundary``, and ``trim``.\n", 600 | "\n", 601 | "Here is an example of calculating the derivative:" 602 | ] 603 | }, 604 | { 605 | "cell_type": "code", 606 | "execution_count": null, 607 | "id": "8e78c226-a318-4772-959d-f099f8564185", 608 | "metadata": {}, 609 | "outputs": [], 610 | "source": [ 611 | "import numpy as np\n", 612 | "import dask.array as da\n", 613 | "\n", 614 | "a = np.array([1, 1, 2, 3, 3, 3, 2, 1, 1])\n", 615 | "a = da.from_array(a, chunks=5)\n", 616 | "\n", 617 | "plt.plot(a)" 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": null, 623 | "id": "cde80b97-1781-4104-8ca2-b43e4544fc4c", 624 | "metadata": {}, 625 | "outputs": [], 626 | "source": [ 627 | "def derivative(a):\n", 628 | " return a - np.roll(a, 1)\n", 629 | "\n", 630 | "b = a.map_overlap(derivative, depth=1, boundary=0)\n", 631 | "b.compute()" 632 | ] 633 | }, 634 | { 635 | "cell_type": "markdown", 636 | "id": "restricted-lying", 637 | "metadata": {}, 638 | "source": [ 639 | "In this case each block shares ``depth=1`` values with its neighboring blocks. And since we set ``boundary=0``, the first and last blocks are padded with the integer 0 on the outer edges of the array. Since we haven't specified ``trim``, it defaults to true, meaning that the overlap is removed before the results are returned.\n", 640 | "\n", 641 | "If you inspect the task graph you'll see two mostly independent towers of tasks, with just some value sharing at the edges."
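What ``map_overlap`` does can be mimicked by hand in plain NumPy. Here is a hedged sketch of the same derivative computation, assuming the boundary pads each end with 0 as described above: expand each block with ``depth`` values from its neighbors, apply the function, then trim the overlap.

```python
import numpy as np

def derivative(block):
    return block - np.roll(block, 1)

a = np.array([1, 1, 2, 3, 3, 3, 2, 1, 1])
chunks = [a[:5], a[5:]]   # the two blocks produced by chunks=5
depth = 1

# Expand each block with `depth` values from its neighbors,
# padding the outer edges of the array with 0 (boundary=0)
expanded = [
    np.concatenate(([0], chunks[0], chunks[1][:depth])),
    np.concatenate((chunks[0][-depth:], chunks[1], [0])),
]

# Apply the function to each expanded block, then trim the overlap
result = np.concatenate([derivative(b)[depth:-depth] for b in expanded])

# Note the first value is a[0] - 0, not a[0] - a[-1] as np.roll alone would give
print(result)  # [ 1  0  1  1  0  0 -1 -1  0]
```

The two list entries in `expanded` correspond to the two towers of tasks in the graph, and the trimming step is the extra work that `trim=True` does for you.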
642 | ] 643 | }, 644 | { 645 | "cell_type": "code", 646 | "execution_count": null, 647 | "id": "eight-mirror", 648 | "metadata": {}, 649 | "outputs": [], 650 | "source": [ 651 | "b.visualize(collapse_outputs=True)" 652 | ] 653 | }, 654 | { 655 | "cell_type": "markdown", 656 | "id": "great-crowd", 657 | "metadata": {}, 658 | "source": [ 659 | "**Exercise**\n", 660 | "\n", 661 | "Let's apply a gaussian filter to an image following the example from the [scipy docs](https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.gaussian_filter.html).\n", 662 | "\n", 663 | "First create a dask array from the numpy array:" 664 | ] 665 | }, 666 | { 667 | "cell_type": "code", 668 | "execution_count": null, 669 | "id": "handmade-infrared", 670 | "metadata": {}, 671 | "outputs": [], 672 | "source": [ 673 | "from scipy import misc\n", 674 | "import dask.array as da\n", 675 | "\n", 676 | "a = da.from_array(misc.ascent(), chunks=(128, 128))\n", 677 | "a" 678 | ] 679 | }, 680 | { 681 | "cell_type": "markdown", 682 | "id": "disabled-editing", 683 | "metadata": {}, 684 | "source": [ 685 | "Now use ``map_overlap`` to apply ``gaussian_filter`` to each block." 
686 | ] 687 | }, 688 | { 689 | "cell_type": "code", 690 | "execution_count": null, 691 | "id": "graphic-bedroom", 692 | "metadata": {}, 693 | "outputs": [], 694 | "source": [ 695 | "from scipy.ndimage import gaussian_filter\n", 696 | "\n", 697 | "b = a.map_overlap(gaussian_filter, sigma=5, ...)" 698 | ] 699 | }, 700 | { 701 | "cell_type": "code", 702 | "execution_count": null, 703 | "id": "informative-worcester", 704 | "metadata": { 705 | "jupyter": { 706 | "source_hidden": true 707 | }, 708 | "tags": [] 709 | }, 710 | "outputs": [], 711 | "source": [ 712 | "# solution\n", 713 | "from scipy.ndimage import gaussian_filter\n", 714 | "\n", 715 | "b = a.map_overlap(gaussian_filter, sigma=5, depth=10, boundary=\"periodic\")" 716 | ] 717 | }, 718 | { 719 | "cell_type": "markdown", 720 | "id": "printable-anaheim", 721 | "metadata": {}, 722 | "source": [ 723 | "Check what you've come up with by plotting the results:" 724 | ] 725 | }, 726 | { 727 | "cell_type": "code", 728 | "execution_count": null, 729 | "id": "satisfactory-basket", 730 | "metadata": {}, 731 | "outputs": [], 732 | "source": [ 733 | "import matplotlib.pyplot as plt\n", 734 | "\n", 735 | "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))\n", 736 | "ax1.imshow(a)\n", 737 | "ax2.imshow(b)\n", 738 | "plt.show()" 739 | ] 740 | }, 741 | { 742 | "cell_type": "markdown", 743 | "id": "exposed-mistake", 744 | "metadata": {}, 745 | "source": [ 746 | "> Notice that if you set the depth to a smaller value, you can see the edges of the blocks in the output image.\n", 747 | "\n", 748 | "\n", 749 | "## Blockwise Computations\n", 750 | "\n", 751 | "Blockwise computations provide the infrastructure for implementing ``map_blocks``, ``reduction`` and many\n", 752 | "of the elementwise methods that make up the Array API. 
\n", 753 | "\n", 754 | "Example operations include the following:\n", 755 | "\n", 756 | "- simple mapping operations like adding two arrays\n", 757 | "- operations that flip the dataset like transpose\n", 758 | "- reductions like tensordot and inner product\n", 759 | "\n", 760 | "\n", 761 | "\n", 762 | "\n", 763 | "**Related Documentation**\n", 764 | "\n", 765 | " - [API Documentation](https://docs.dask.org/en/latest/array-api.html#dask.array.blockwise)" 766 | ] 767 | }, 768 | { 769 | "cell_type": "markdown", 770 | "id": "a276edc9-d092-4620-a542-80abd3d0e81a", 771 | "metadata": {}, 772 | "source": [ 773 | "### Internal uses\n", 774 | "\n", 775 | "This is the internal definition of transpose on dask.Array. In it you can see that there is a\n", 776 | "regular ``np.transpose`` applied within each block and then the blocks are themselves transposed.\n", 777 | "\n", 778 | "```python\n", 779 | "def transpose(a, axes=None):\n", 780 | " if axes:\n", 781 | " if len(axes) != a.ndim:\n", 782 | " raise ValueError(\"axes don't match array\")\n", 783 | " else:\n", 784 | " axes = tuple(range(a.ndim))[::-1]\n", 785 | " axes = tuple(d + a.ndim if d < 0 else d for d in axes)\n", 786 | " return blockwise(\n", 787 | " np.transpose, axes, a, tuple(range(a.ndim)), dtype=a.dtype, axes=axes\n", 788 | " )\n", 789 | "```\n", 790 | "\n", 791 | "[source](https://github.com/dask/dask/blob/4569b150db36af0aa9d9a8d318b4239a78e2eaec/dask/array/routines.py#L161:L170)" 792 | ] 793 | }, 794 | { 795 | "cell_type": "markdown", 796 | "id": "1f437fff-688f-4fa7-859e-1139750f4a76", 797 | "metadata": {}, 798 | "source": [ 799 | "We can implement our own version of transpose using `blockise` directly. If you are comfortable with matrix math then the notation will look very familiar." 
800 | ] 801 | }, 802 | { 803 | "cell_type": "code", 804 | "execution_count": null, 805 | "id": "489b4fe2-95c7-4434-ae41-5e124cbefd98", 806 | "metadata": {}, 807 | "outputs": [], 808 | "source": [ 809 | "result = da.blockwise(np.transpose, \"ji\", x, \"ij\")\n", 810 | "result.visualize()" 811 | ] 812 | }, 813 | { 814 | "cell_type": "markdown", 815 | "id": "80cd7d6d-9b90-4df6-82a5-ab90795a57f1", 816 | "metadata": {}, 817 | "source": [ 818 | "Using `blockwise` we can operate on multiple arrays by just passing more arguments." 819 | ] 820 | }, 821 | { 822 | "cell_type": "code", 823 | "execution_count": null, 824 | "id": "2a54ad3d-34fd-48ca-b41f-e28b35e925a5", 825 | "metadata": {}, 826 | "outputs": [], 827 | "source": [ 828 | "result = da.blockwise(np.add, \"ij\", x, \"ij\", x.T, \"ij\", dtype=x.dtype)\n", 829 | "result.visualize()" 830 | ] 831 | }, 832 | { 833 | "cell_type": "markdown", 834 | "id": "a6a8f5a5-a08e-488b-a309-1cf2a1ca0674", 835 | "metadata": {}, 836 | "source": [ 837 | "This is equivalent to `x + x.T`" 838 | ] 839 | }, 840 | { 841 | "cell_type": "code", 842 | "execution_count": null, 843 | "id": "63b1be59-cff9-4fb9-b54b-72918cdd7845", 844 | "metadata": {}, 845 | "outputs": [], 846 | "source": [ 847 | "(x + x.T).visualize()" 848 | ] 849 | }, 850 | { 851 | "cell_type": "markdown", 852 | "id": "5599d4de-2f1d-43db-86fc-31148c897010", 853 | "metadata": {}, 854 | "source": [ 855 | "We can increase or decrease the dimensionality of the output by changing the out index. There are lots more examples of how to use `blockwise` in the documentation." 856 | ] 857 | }, 858 | { 859 | "cell_type": "code", 860 | "execution_count": null, 861 | "id": "5fc7eba5-eff9-4a63-9b22-49e2f44e386b", 862 | "metadata": {}, 863 | "outputs": [], 864 | "source": [ 865 | "da.blockwise?" 
866 | ] 867 | }, 868 | { 869 | "cell_type": "markdown", 870 | "id": "22cb0c3f-1aca-42db-9ec3-565a45c68a45", 871 | "metadata": {}, 872 | "source": [ 873 | "## Reduction\n", 874 | "Each dask collection has a `reduction` method. This is the generalized method that supports operations that reduce the dimensionality of the inputs.\n", 875 | "\n", 876 | "The difference between `blockwise` and `reduction` is that with `reduction` you have finer grained control over the behavior of the tree-reduce.\n", 877 | "\n", 878 | "![Custom operations: reduction](images/custom_operations_reduction.png)\n", 879 | "\n", 880 | "**Related Documentation**\n", 881 | " - [`dask.array.reduction`](http://dask.pydata.org/en/latest/array-api.html#dask.dataframe.Array.reduction)\n", 882 | " - [`dask.dataframe.reduction`](http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.reduction)" 883 | ] 884 | }, 885 | { 886 | "cell_type": "markdown", 887 | "id": "unexpected-representation", 888 | "metadata": {}, 889 | "source": [ 890 | "### Internal uses\n", 891 | "\n", 892 | "This is the internal definition of sum on dask.Array. 
In it you can see that there is a\n", 893 | "regular ``np.sum`` applied across each block and then tree-reduced with ``np.sum`` again.\n", 894 | "\n", 895 | "```python\n", 896 | "def sum(a, axis=None, dtype=None, keepdims=False, split_every=None, out=None):\n", 897 | " if dtype is None:\n", 898 | " dtype = getattr(np.zeros(1, dtype=a.dtype).sum(), \"dtype\", object)\n", 899 | " result = reduction(\n", 900 | " a,\n", 901 | " chunk.sum, # this is just `np.sum`\n", 902 | " chunk.sum, # this is just `np.sum`\n", 903 | " axis=axis,\n", 904 | " keepdims=keepdims,\n", 905 | " dtype=dtype,\n", 906 | " split_every=split_every,\n", 907 | " out=out,\n", 908 | " )\n", 909 | " return result\n", 910 | "```\n", 911 | "[source](https://github.com/dask/dask/blob/ac1bd05cfd40207d68f6eb8603178d7ac0ded922/dask/array/reductions.py#L344-L357)\n", 912 | "\n", 913 | "Here is `da.sum` reimplemented as a custom reduction." 914 | ] 915 | }, 916 | { 917 | "cell_type": "code", 918 | "execution_count": null, 919 | "id": "0eab1d5a-c839-4b7c-8fed-9bf88bfa12d5", 920 | "metadata": {}, 921 | "outputs": [], 922 | "source": [ 923 | "da.reduction(x, np.sum, np.sum, dtype=x.dtype).visualize()" 924 | ] 925 | }, 926 | { 927 | "cell_type": "markdown", 928 | "id": "negative-display", 929 | "metadata": {}, 930 | "source": [ 931 | "By visualizing this reduction we can see how the tree reduction works. First ``sum`` is applied to each block, then every 4 chunk results are combined using ``sum-partial``. This keeps going until there are fewer than 4 results left, then ``sum-aggregate`` is used to finish up."
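The tree pattern itself can be sketched in a few lines of plain Python. This is a simplified model of the graph that `reduction` builds, not dask's actual implementation:

```python
def tree_reduce(partials, split_every=4):
    """Combine `split_every` partial results at a time until one remains,
    mirroring the sum-partial / sum-aggregate tasks in the graph."""
    while len(partials) > 1:
        partials = [sum(partials[i:i + split_every])
                    for i in range(0, len(partials), split_every)]
    return partials[0]

# 16 per-block sums combined 4 at a time: 16 -> 4 -> 1
blocks = [list(range(i, i + 10)) for i in range(0, 160, 10)]
partials = [sum(block) for block in blocks]   # one sum per block
total = tree_reduce(partials, split_every=4)

assert total == sum(range(160))
```

Each pass of the `while` loop corresponds to one level of the tree in the visualized graph.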
932 | ] 933 | }, 934 | { 935 | "cell_type": "markdown", 936 | "id": "greek-heritage", 937 | "metadata": {}, 938 | "source": [ 939 | "**Exercise**\n", 940 | "\n", 941 | "See how the graph changes when you set the chunks - maybe to `(100, 250)` or `(250, 250)`" 942 | ] 943 | }, 944 | { 945 | "cell_type": "code", 946 | "execution_count": null, 947 | "id": "4b7a74a2-4e02-4807-b4d1-2221818ef95d", 948 | "metadata": { 949 | "jupyter": { 950 | "source_hidden": true 951 | }, 952 | "tags": [] 953 | }, 954 | "outputs": [], 955 | "source": [ 956 | "# solution\n", 957 | "x = da.random.random(size=(1_000, 1_000), chunks=(100, 250))\n", 958 | "x.sum().visualize()" 959 | ] 960 | }, 961 | { 962 | "cell_type": "markdown", 963 | "id": "cb0f3153-ac83-46c3-abaa-3e487a81c0b4", 964 | "metadata": {}, 965 | "source": [ 966 | "### Understanding ``split_every``\n", 967 | "\n", 968 | "``split_every`` controls the number of chunk outputs that are used as input to each ``partial`` call. \n", 969 | "\n", 970 | "Here is an example of doing partial aggregation on every 5 blocks along the 0 axis and every 2 blocks along the 1 axis (so 10 blocks go into each `partial-sum`)" 971 | ] 972 | }, 973 | { 974 | "cell_type": "code", 975 | "execution_count": null, 976 | "id": "e57621ca-58c4-4506-964f-9cf53746e962", 977 | "metadata": {}, 978 | "outputs": [], 979 | "source": [ 980 | "x.sum(split_every={0: 5, 1: 2}).visualize()" 981 | ] 982 | }, 983 | { 984 | "cell_type": "markdown", 985 | "id": "902b2b79-af4d-4d79-bd2e-daa90e80bcd5", 986 | "metadata": {}, 987 | "source": [ 988 | "**Exercise**\n", 989 | "\n", 990 | "Try setting different values for `split_every` and visualizing the task graph to see the impact." 
991 | ] 992 | }, 993 | { 994 | "cell_type": "code", 995 | "execution_count": null, 996 | "id": "37b1dcaa-83fa-474f-9351-dc443500595b", 997 | "metadata": { 998 | "jupyter": { 999 | "source_hidden": true 1000 | }, 1001 | "tags": [] 1002 | }, 1003 | "outputs": [], 1004 | "source": [ 1005 | "# solution\n", 1006 | "x.sum(split_every={0: 10, 1: 2}).visualize()" 1007 | ] 1008 | }, 1009 | { 1010 | "cell_type": "markdown", 1011 | "id": "8e5bd9c4-5aaf-4e09-b655-210bf0cc3169", 1012 | "metadata": {}, 1013 | "source": [ 1014 | "> **Side note**\n", 1015 | ">\n", 1016 | "> You can use `reduction` to calculate per-block aggregations even if you don't want to combine and aggregate the results of those blocks:\n", 1017 | ">\n", 1018 | "> ```python\n", 1019 | "> da.reduction(x, np.sum, lambda x, **kwargs: x, dtype=int).compute()\n", 1020 | "> ```" 1021 | ] 1022 | }, 1023 | { 1024 | "cell_type": "markdown", 1025 | "id": "single-oasis", 1026 | "metadata": { 1027 | "tags": [] 1028 | }, 1029 | "source": [ 1030 | "### Groupby Aggregation\n", 1031 | "\n", 1032 | "A different but powerful form of reduction is the groupby aggregation.\n", 1033 | "\n", 1034 | "There are many standard reductions supported by default on dataframe groupbys. These include methods like `mean, max, min, sum, nunique`. These are easily scaled and parallelized.\n", 1035 | "\n", 1036 | "**Related Documentation**\n", 1037 | "\n", 1038 | " - [DataFrame Groupby](https://docs.dask.org/en/latest/dataframe-groupby.html#aggregate)\n", 1039 | " - [Examples](https://examples.dask.org/dataframes/02-groupby.html)\n", 1040 | "\n", 1041 | "\n", 1042 | "![groupby aggregation](images/custom_operations_groupby_aggregation.png)\n", 1043 | "\n", 1044 | "To define a custom groupby aggregation you need to write three functions. 
These are analogous to the functions that we use for general-purpose `reduction`\n", 1045 | "\n", 1046 | "- `chunk`: operates on the series groupby on each individual partition (`ddf.partitions[0].groupby(\"name\")[\"x\"]`)\n", 1047 | "- `agg`: operates on the concatenated output from calling chunk on every partition\n", 1048 | "- `finalize`: operates on the output from calling aggregate - returns one column. This one is actually optional.\n", 1049 | "\n", 1050 | "Here's an example of a custom aggregation for calculating the mean." 1051 | ] 1052 | }, 1053 | { 1054 | "cell_type": "code", 1055 | "execution_count": null, 1056 | "id": "colonial-uganda", 1057 | "metadata": {}, 1058 | "outputs": [], 1059 | "source": [ 1060 | "import dask\n", 1061 | "import dask.dataframe as dd\n", 1062 | "\n", 1063 | "ddf = dask.datasets.timeseries()\n", 1064 | "\n", 1065 | "custom_mean = dd.Aggregation(\n", 1066 | " 'custom_mean',\n", 1067 | " lambda s: (s.count(), s.sum()),\n", 1068 | " lambda count, sum: (count.sum(), sum.sum()),\n", 1069 | " lambda count, sum: sum / count,\n", 1070 | ")\n", 1071 | "custom_result = ddf.groupby('name').agg(custom_mean)\n", 1072 | "custom_result.head()" 1073 | ] 1074 | }, 1075 | { 1076 | "cell_type": "markdown", 1077 | "id": "demonstrated-defense", 1078 | "metadata": {}, 1079 | "source": [ 1080 | "Here is how it works:\n", 1081 | "\n", 1082 | "- for every partition (one per day) group by ``name``\n", 1083 | "- on each of those pandas series groupby objects calculate the `count` and the `sum`\n", 1084 | "- concatenate every 8 (this is configurable) outputs together\n", 1085 | "- sum each of these\n", 1086 | "- finally: divide the `sum` by the `count`\n", 1087 | "\n", 1088 | "This is equivalent to:" 1089 | ] 1090 | }, 1091 | { 1092 | "cell_type": "code", 1093 | "execution_count": null, 1094 | "id": "statutory-attention", 1095 | "metadata": {}, 1096 | "outputs": [], 1097 | "source": [ 1098 | "simple_result = ddf.groupby('name').mean()\n", 1099 | 
"simple_result.head()" 1100 | ] 1101 | }, 1102 | { 1103 | "cell_type": "markdown", 1104 | "id": "settled-lawrence", 1105 | "metadata": {}, 1106 | "source": [ 1107 | "**NOTE**: If you look at the task graph you'll see that the structure of the computation is actually pretty different. That's because `.mean` computes the `sum` and the `count` independently and only combines the values at the end." 1108 | ] 1109 | }, 1110 | { 1111 | "cell_type": "code", 1112 | "execution_count": null, 1113 | "id": "cognitive-macro", 1114 | "metadata": {}, 1115 | "outputs": [], 1116 | "source": [ 1117 | "custom_result.visualize()" 1118 | ] 1119 | }, 1120 | { 1121 | "cell_type": "code", 1122 | "execution_count": null, 1123 | "id": "facial-stanford", 1124 | "metadata": {}, 1125 | "outputs": [], 1126 | "source": [ 1127 | "simple_result.visualize()" 1128 | ] 1129 | }, 1130 | { 1131 | "cell_type": "markdown", 1132 | "id": "encouraging-shirt", 1133 | "metadata": {}, 1134 | "source": [ 1135 | "#### Why not use apply?\n", 1136 | "\n", 1137 | "If you are trying to run a custom function on the groups in a groupby it can be tempting to use `.apply` but this is often a poor choice because it requires that the data be shuffled. Instead you should try writing a custom aggregation. Custom aggregation is preferable beacause, each partition is grouped (without being repartitioned) and then the results from each group on each partition are aggregated.\n", 1138 | "\n", 1139 | "DON'T DO:\n", 1140 | "```python\n", 1141 | "ddf.groupby(\"name\").apply(lambda x: x.mean())\n", 1142 | "```\n", 1143 | "\n", 1144 | "This will shuffle the data so that all the data for a particular name is in the same partition. If you call `.compute()` on it you'll notice that it's much slower (about 50x on my computer)." 
1145 | ] 1146 | }, 1147 | { 1148 | "cell_type": "markdown", 1149 | "id": "dea37d15-f26e-4361-be1a-33e71d3d682e", 1150 | "metadata": {}, 1151 | "source": [ 1152 | "## dask.delayed\n", 1153 | "\n", 1154 | "You can wrap arbitrary functions in dask.delayed to parallelize them, but when you are operating on a Dask collection or several Dask collections, dask.delayed won't understand the organization of your blocks. You first need to convert your collections into individual blocks to set up the computation, and afterwards arrange those blocks as you like. Note that this is often the slowest approach.\n", 1155 | "\n", 1156 | "Let's look back at some of our original computation and implement `da.random.random` and `transpose` using `dask.delayed`." 1157 | ] 1158 | }, 1159 | { 1160 | "cell_type": "markdown", 1161 | "id": "03d8d50b-fdf4-444a-bcfb-6fa91c8a0932", 1162 | "metadata": {}, 1163 | "source": [ 1164 | "### Construct an array" 1165 | ] 1166 | }, 1167 | { 1168 | "cell_type": "code", 1169 | "execution_count": null, 1170 | "id": "aace26e6-ed88-4c86-b149-5974782e0047", 1171 | "metadata": {}, 1172 | "outputs": [], 1173 | "source": [ 1174 | "x = da.random.random(size=(1_000, 1_000), chunks=(250, 500))\n", 1175 | "x" 1176 | ] 1177 | }, 1178 | { 1179 | "cell_type": "markdown", 1180 | "id": "28dfbea0-4b24-46b3-992e-74a673b995d9", 1181 | "metadata": {}, 1182 | "source": [ 1183 | "We looked at how you can implement this using `map_blocks` at the top of this notebook. Now let's implement random using `dask.delayed`."
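Before diving in, here is a minimal, self-contained sketch of how `dask.delayed` defers work: decorated calls build a task graph immediately but run nothing until `.compute()`. The `inc` and `add` functions are toy examples, not part of the notebook's dataset:

```python
import dask

@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(a, b):
    return a + b

# calling the functions builds a graph of Delayed objects; nothing runs yet
total = add(inc(1), inc(2))

# compute() walks the graph, running inc(1) and inc(2) in parallel
print(total.compute())  # 5
```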
1184 | ] 1185 | }, 1186 | { 1187 | "cell_type": "code", 1188 | "execution_count": null, 1189 | "id": "0271311f-6359-4a49-8420-3af3c841ffa3", 1190 | "metadata": {}, 1191 | "outputs": [], 1192 | "source": [ 1193 | "@dask.delayed\n", 1194 | "def random_sample():\n", 1195 | " return np.random.random(size=(250, 500))\n", 1196 | "\n", 1197 | "da.concatenate(\n", 1198 | " (da.concatenate(\n", 1199 | " (da.from_delayed(random_sample(), shape=(250, 500), dtype=float) \n", 1200 | " for i in range(4)), axis=0\n", 1201 | " ) \n", 1202 | " for i in range(2)), axis=1\n", 1203 | ")" 1204 | ] 1205 | }, 1206 | { 1207 | "cell_type": "markdown", 1208 | "id": "86eb3c4e-6f94-497d-95f4-010c0c8a2059", 1209 | "metadata": {}, 1210 | "source": [ 1211 | "### Operate on an existing array\n", 1212 | "We just saw how to create an array from scratch; now let's look at how to operate on an existing array. Let's use delayed to compute the transpose of `x`." 1213 | ] 1214 | }, 1215 | { 1216 | "cell_type": "code", 1217 | "execution_count": null, 1218 | "id": "ebbeb1f7-2e56-43c8-90ea-aba70de93ff4", 1219 | "metadata": {}, 1220 | "outputs": [], 1221 | "source": [ 1222 | "x.T" 1223 | ] 1224 | }, 1225 | { 1226 | "cell_type": "markdown", 1227 | "id": "091deb99-1c80-412d-94bd-379ef2bd023b", 1228 | "metadata": {}, 1229 | "source": [ 1230 | "First we'll set up the delayed function:" 1231 | ] 1232 | }, 1233 | { 1234 | "cell_type": "code", 1235 | "execution_count": null, 1236 | "id": "66458052-eff0-48cb-8b9a-92b67aae4a2d", 1237 | "metadata": {}, 1238 | "outputs": [], 1239 | "source": [ 1240 | "@dask.delayed\n", 1241 | "def transpose(block):\n", 1242 | " return np.transpose(block)" 1243 | ] 1244 | }, 1245 | { 1246 | "cell_type": "markdown", 1247 | "id": "fe601479-aced-49eb-8a61-9545ccc099a7", 1248 | "metadata": {}, 1249 | "source": [ 1250 | "**Naive approach** - You can just pass the array right into this function, but when you do, the array is first merged into one large array and then the delayed function is 
applied to the whole array all at once. This can be very nice if your array isn't too large, or if your delayed function can't be coerced to work per block. \n", 1251 | "\n", 1252 | "Let's compare the task graphs for calling `x.T` and `transpose(x)`." 1253 | ] 1254 | }, 1255 | { 1256 | "cell_type": "code", 1257 | "execution_count": null, 1258 | "id": "d67d5044-bbb5-47d0-90aa-186fa91756dd", 1259 | "metadata": {}, 1260 | "outputs": [], 1261 | "source": [ 1262 | "x.T.visualize()" 1263 | ] 1264 | }, 1265 | { 1266 | "cell_type": "code", 1267 | "execution_count": null, 1268 | "id": "9c69d321-b9fe-4c52-87bb-e8dded94feb8", 1269 | "metadata": {}, 1270 | "outputs": [], 1271 | "source": [ 1272 | "transpose(x).visualize()" 1273 | ] 1274 | }, 1275 | { 1276 | "cell_type": "markdown", 1277 | "id": "a0257e40-089c-4790-a6b0-66baaec4ebd6", 1278 | "metadata": {}, 1279 | "source": [ 1280 | "The better approach is to first convert the array to a nested list of delayed objects - there will be one delayed object to represent each block:" 1281 | ] 1282 | }, 1283 | { 1284 | "cell_type": "code", 1285 | "execution_count": null, 1286 | "id": "1a8b4437-4352-4155-b8f5-27f645785551", 1287 | "metadata": {}, 1288 | "outputs": [], 1289 | "source": [ 1290 | "delayed_array = x.to_delayed()\n", 1291 | "delayed_array" 1292 | ] 1293 | }, 1294 | { 1295 | "cell_type": "markdown", 1296 | "id": "19ac1718-67f7-4361-8447-325c9d8eab02", 1297 | "metadata": {}, 1298 | "source": [ 1299 | "Then you can iterate over that nested list of delayed objects, convert the outputs to arrays, and concatenate them."
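You can also reassemble such a nested list in a single call with `da.block` instead of nested `da.concatenate`s. A self-contained sketch (same 1000 × 1000 array with a 4 × 2 chunk grid as above); note that the grid itself must be transposed, since block `[j][i]` of the result is the transpose of block `[i][j]` of `x`:

```python
import numpy as np
import dask
import dask.array as da

x = da.random.random(size=(1_000, 1_000), chunks=(250, 500))

@dask.delayed
def transpose(block):
    return np.transpose(block)

delayed_array = x.to_delayed()  # a 4 x 2 grid of Delayed objects

# da.block concatenates a nested list of blocks in one call;
# result block [j][i] is the transposed block [i][j] of x
result = da.block([
    [da.from_delayed(transpose(delayed_array[i][j]), shape=(500, 250), dtype=float)
     for i in range(4)]
    for j in range(2)
])
```

Because the per-block random seeds are baked into the task graph when `x` is created, `result.compute()` matches `x.T.compute()` exactly.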
1300 | ] 1301 | }, 1302 | { 1303 | "cell_type": "code", 1304 | "execution_count": null, 1305 | "id": "7d7a819a-c359-4eed-acb7-2e7990bc911e", 1306 | "metadata": {}, 1307 | "outputs": [], 1308 | "source": [ 1309 | "result = da.concatenate(\n", 1310 | " (da.concatenate(\n", 1311 | " (da.from_delayed(transpose(block), shape=(500, 250), dtype=float) \n", 1312 | " for block in row), \n", 1313 | " axis=0)\n", 1314 | " for row in delayed_array),\n", 1315 | " axis=1\n", 1316 | ")\n", 1317 | "result" 1318 | ] 1319 | }, 1320 | { 1321 | "cell_type": "code", 1322 | "execution_count": null, 1323 | "id": "00423d90-14ef-4c25-a057-948ec1ba767e", 1324 | "metadata": {}, 1325 | "outputs": [], 1326 | "source": [ 1327 | "result.visualize()" 1328 | ] 1329 | }, 1330 | { 1331 | "cell_type": "markdown", 1332 | "id": "disturbed-snapshot", 1333 | "metadata": {}, 1334 | "source": [ 1335 | "## When to use which method\n", 1336 | "\n", 1337 | "In this notebook we've covered several different mechanisms for applying arbitrary functions to the blocks in arrays or dataframes. Here's a brief summary of when you should use these various methods:\n", 1338 | "\n", 1339 | "- `map_blocks`, `map_partitions` - block organization of the input matches the block organization of the output and the function is fully parallelizable. \n", 1340 | "- `map_overlap` - block organizations of input and output match, but the function is not fully parallelizable (requires input from neighboring chunks).\n", 1341 | "- `blockwise` - same function can be applied to the blocks as to the partial and aggregated versions. 
Also output blocks can be in different orientations.\n", 1342 | "- `reduction` - dimensionality of output does not necessarily match that of input and function is fully parallelizable.\n", 1343 | "- `groupby().agg` - data needs to be aggregated per group (the index of the output will be the group keys).\n", 1344 | "- `dask.delayed` - data doesn't have a complex block organization or the data is small and the computation is pretty fast." 1345 | ] 1346 | } 1347 | ], 1348 | "metadata": { 1349 | "jupytext": { 1350 | "cell_metadata_filter": "-all", 1351 | "main_language": "python", 1352 | "notebook_metadata_filter": "-all", 1353 | "text_representation": { 1354 | "extension": ".md", 1355 | "format_name": "markdown" 1356 | } 1357 | }, 1358 | "kernelspec": { 1359 | "display_name": "Python 3", 1360 | "language": "python", 1361 | "name": "python3" 1362 | }, 1363 | "language_info": { 1364 | "codemirror_mode": { 1365 | "name": "ipython", 1366 | "version": 3 1367 | }, 1368 | "file_extension": ".py", 1369 | "mimetype": "text/x-python", 1370 | "name": "python", 1371 | "nbconvert_exporter": "python", 1372 | "pygments_lexer": "ipython3", 1373 | "version": "3.9.4" 1374 | } 1375 | }, 1376 | "nbformat": 4, 1377 | "nbformat_minor": 5 1378 | } 1379 | -------------------------------------------------------------------------------- /3-graph-optimization.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "recovered-trouble", 6 | "metadata": {}, 7 | "source": [ 8 | "# Advanced Collections: _Graph Optimizations_\n", 9 | "\n", 10 | "In general, there are two goals when doing graph optimizations:\n", 11 | "\n", 12 | "1. Simplify computation\n", 13 | "2. 
Improve parallelism\n", 14 | "\n", 15 | "Simplifying computation can be done on a graph level by removing unnecessary\n", 16 | "tasks (``cull``).\n", 17 | "\n", 18 | "Parallelism can be improved by reducing\n", 19 | "inter-task communication, whether by fusing many tasks into one (``fuse``), or\n", 20 | "by inlining cheap operations (``inline``, ``inline_functions``).\n", 21 | "\n", 22 | "\n", 23 | "**Related Documentation**\n", 24 | "\n", 25 | " - [Graph Optimization](https://docs.dask.org/en/latest/optimization.html)\n", 26 | " \n", 27 | "The ``dask.optimization`` module contains several functions to transform graphs\n", 28 | "in a variety of useful ways. In most cases, users won't need to interact with\n", 29 | "these functions directly, as specialized subsets of these transforms are done\n", 30 | "automatically in the Dask collections (``dask.array``, ``dask.bag``, and\n", 31 | "``dask.dataframe``). However, users working with custom graphs or computations\n", 32 | "may find that applying these methods results in substantial speedups.\n", 33 | "\n", 34 | "## Example\n", 35 | "\n", 36 | "Suppose you had a custom Dask graph for doing a word counting task.\n", 37 | "\n", 38 | "> **NOTE**: To gain intuition about optimization we'll be looking at just the task graph first. We will talk about how this relates to collections at the end.\n", 39 | "\n", 40 | "In this example we are:\n", 41 | "\n", 42 | "1. counting the frequency of the words ``'orange'``, ``'apple'``, and ``'pear'`` in the list of words\n", 43 | "2. formatting an output string reporting the results\n", 44 | "3. 
printing the output and returning the output string" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "id": "disabled-greek", 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "import dask\n", 55 | "\n", 56 | "def print_and_return(string):\n", 57 | " print(string)\n", 58 | " return string\n", 59 | "\n", 60 | "def format_str(count, val, nwords):\n", 61 | " return (f'word list has {count} occurrences of '\n", 62 | " f'{val}, out of {nwords} words')\n", 63 | "\n", 64 | "dsk = {'words': 'apple orange apple pear orange pear pear',\n", 65 | " 'nwords': (len, (str.split, 'words')),\n", 66 | " 'val1': 'orange',\n", 67 | " 'val2': 'apple',\n", 68 | " 'val3': 'pear',\n", 69 | " 'count1': (str.count, 'words', 'val1'),\n", 70 | " 'count2': (str.count, 'words', 'val2'),\n", 71 | " 'count3': (str.count, 'words', 'val3'),\n", 72 | " 'format1': (format_str, 'count1', 'val1', 'nwords'),\n", 73 | " 'format2': (format_str, 'count2', 'val2', 'nwords'),\n", 74 | " 'format3': (format_str, 'count3', 'val3', 'nwords'),\n", 75 | " 'print1': (print_and_return, 'format1'),\n", 76 | " 'print2': (print_and_return, 'format2'),\n", 77 | " 'print3': (print_and_return, 'format3'),\n", 78 | "}\n", 79 | "\n", 80 | "dask.visualize(dsk, verbose=True, collapse_outputs=True)" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "id": "circular-driving", 86 | "metadata": {}, 87 | "source": [ 88 | "### Cull\n", 89 | "\n", 90 | "To perform the computation, we first remove unnecessary components from the\n", 91 | "graph using the ``cull`` function and then pass the Dask graph and the desired\n", 92 | "output keys to a scheduler ``get`` function:" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "id": "meaningful-knife", 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "from dask.threaded import get\n", 103 | "from dask.optimization import cull\n", 104 | "\n", 105 | "outputs = ['print1', 'print2']\n", 106 | 
"dsk1, dependencies = cull(dsk, outputs) # remove unnecessary tasks from the graph\n", 107 | "\n", 108 | "results = get(dsk1, outputs)\n", 109 | "dask.visualize(dsk1, verbose=True, collapse_outputs=True)" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "id": "afraid-asbestos", 115 | "metadata": {}, 116 | "source": [ 117 | "As can be seen above, the scheduler computed only the requested outputs\n", 118 | "(``'print3'`` was never computed). This is because we called the\n", 119 | "``dask.optimization.cull`` function, which removes the unnecessary tasks from\n", 120 | "the graph.\n", 121 | "\n", 122 | "Culling is part of the default optimization pass of almost all collections.\n", 123 | "Often you want to call it somewhat early to reduce the amount of work done in\n", 124 | "later steps." 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "id": "utility-document", 130 | "metadata": {}, 131 | "source": [ 132 | "#### In practice\n", 133 | "\n", 134 | "Cull is very useful for operating on partitioned data. For instance consider the\n", 135 | "case of a timeseries stored in a parquet file where each parition represents one\n", 136 | "day. 
Let's generate such a timeseries:" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "id": "thermal-symphony", 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "import dask\n", 147 | "import dask.dataframe as dd\n", 148 | "\n", 149 | "dask.datasets.timeseries().to_parquet(\"timeseries\")" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "id": "sophisticated-asset", 155 | "metadata": {}, 156 | "source": [ 157 | "Now read in the data and look at the task graph:" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": null, 163 | "id": "amber-locator", 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "ddf = dd.read_parquet(\"timeseries\")\n", 168 | "ddf.visualize(optimize_graph=True)" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "id": "laden-trout", 174 | "metadata": {}, 175 | "source": [ 176 | "If you select a subset of the days and optimize the graph, you can see that there is now only one task at the bottom of the task graph. That is because Dask knows that all the other partitions don't have data that you need, so those tasks have been culled." 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "id": "aerial-choice", 183 | "metadata": {}, 184 | "outputs": [], 185 | "source": [ 186 | "ddf[\"2000-01-13\"].visualize(optimize_graph=True)" 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "id": "integral-handling", 192 | "metadata": {}, 193 | "source": [ 194 | "### Inline\n", 195 | "\n", 196 | "Back to the fruits!\n", 197 | "\n", 198 | "Looking at the word counting task graph, there are multiple accesses to constants such\n", 199 | "as ``'val1'`` or ``'val2'``. These can be inlined into the\n", 200 | "tasks to improve efficiency using the ``inline`` function. 
For example:" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": null, 206 | "id": "lightweight-gamma", 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "from dask.optimization import inline\n", 211 | "\n", 212 | "dsk2 = inline(dsk1, dependencies=dependencies)\n", 213 | "results = get(dsk2, outputs)\n", 214 | "dask.visualize(dsk2, verbose=True, collapse_outputs=True)" 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "id": "promising-retro", 220 | "metadata": {}, 221 | "source": [ 222 | "Now we have two sets of *almost* linear task chains. The only link between them\n", 223 | "is the word counting function. For cheap operations like this, the\n", 224 | "serialization cost may be larger than the actual computation, so it may be\n", 225 | "faster to do the computation more than once, rather than passing the results to\n", 226 | "all nodes. To perform this function inlining, the ``inline_functions`` function\n", 227 | "can be used:" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": null, 233 | "id": "stretch-doctor", 234 | "metadata": {}, 235 | "outputs": [], 236 | "source": [ 237 | "from dask.optimization import inline_functions\n", 238 | "\n", 239 | "dsk3 = inline_functions(dsk2, outputs, [len, str.split],\n", 240 | " dependencies=dependencies)\n", 241 | "results = get(dsk3, outputs)\n", 242 | "dask.visualize(dsk3, verbose=True, collapse_outputs=True)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "id": "numerical-contrast", 248 | "metadata": {}, 249 | "source": [ 250 | "### Fuse\n", 251 | "\n", 252 | "Now we have a set of purely linear tasks. 
We'd like to have the scheduler run\n", 253 | "all of these on the same worker to reduce data serialization between workers.\n", 254 | "One option is just to merge these linear chains into one big task using the\n", 255 | "``fuse`` function:" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": null, 261 | "id": "cleared-shoulder", 262 | "metadata": {}, 263 | "outputs": [], 264 | "source": [ 265 | "from dask.optimization import fuse\n", 266 | "\n", 267 | "dsk4, dependencies = fuse(dsk3)\n", 268 | "results = get(dsk4, outputs)\n", 269 | "dask.visualize(dsk4, verbose=True, collapse_outputs=True)" 270 | ] 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "id": "weird-blade", 275 | "metadata": {}, 276 | "source": [ 277 | "### Result\n", 278 | "\n", 279 | "Putting it all together:" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": null, 285 | "id": "experimental-special", 286 | "metadata": {}, 287 | "outputs": [], 288 | "source": [ 289 | "def optimize_and_get(dsk, keys):\n", 290 | " dsk1, deps = cull(dsk, keys)\n", 291 | " dsk2 = inline(dsk1, dependencies=deps)\n", 292 | " dsk3 = inline_functions(dsk2, keys, [len, str.split],\n", 293 | " dependencies=deps)\n", 294 | " dsk4, deps = fuse(dsk3)\n", 295 | " return get(dsk4, keys)\n", 296 | "\n", 297 | "optimize_and_get(dsk, outputs)" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "id": "hearing-rochester", 303 | "metadata": {}, 304 | "source": [ 305 | "In summary, the above operations accomplish the following:\n", 306 | "\n", 307 | "1. Removed tasks unnecessary for the desired output using ``cull``\n", 308 | "2. Inlined constants using ``inline``\n", 309 | "3. Inlined cheap computations using ``inline_functions``, improving parallelism\n", 310 | "4. Fused linear tasks together to ensure they run on the same worker using ``fuse``\n", 311 | "\n", 312 | "These optimizations are already performed automatically in the Dask collections." 
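The idea behind ``cull`` in step 1 can be sketched in pure Python. This is a toy re-implementation for intuition only; the real ``dask.optimization.cull`` handles more task formats and also returns the dependency mapping:

```python
def toy_cull(dsk, keys):
    """Keep only the tasks reachable from the requested keys."""

    def deps(task):
        # a task tuple's arguments may reference other graph keys
        if isinstance(task, tuple):
            return [d for arg in task[1:] for d in deps(arg)]
        return [task] if isinstance(task, str) and task in dsk else []

    keep, stack = {}, list(keys)
    while stack:
        key = stack.pop()
        if key not in keep:
            keep[key] = dsk[key]
            stack.extend(deps(dsk[key]))
    return keep

def inc(x):
    return x + 1

dsk = {
    "a": 1,
    "b": (inc, "a"),
    "c": (inc, "b"),
    "unused": (inc, "a"),
}
print(sorted(toy_cull(dsk, ["c"])))  # ['a', 'b', 'c'] -- 'unused' is dropped
```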
313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "id": "stone-surfing", 318 | "metadata": {}, 319 | "source": [ 320 | "## Customizing Optimization\n", 321 | "\n", 322 | "Dask defines a default optimization strategy for each collection type (Array,\n", 323 | "Bag, DataFrame, Delayed). However, different applications may have different\n", 324 | "needs. To address this variability of needs, you can construct your own custom\n", 325 | "optimization function and use it instead of the default. An optimization\n", 326 | "function takes in a task graph and list of desired keys and returns a new\n", 327 | "task graph:\n", 328 | "\n", 329 | "```python\n", 330 | "def my_optimize_function(dsk, keys):\n", 331 | " new_dsk = {...}\n", 332 | " return new_dsk\n", 333 | "```\n", 334 | "\n", 335 | "You can then register this optimization function against whichever collection type\n", 336 | "you prefer and it will be used instead of the default scheme:\n", 337 | "\n", 338 | "```python\n", 339 | "with dask.config.set(array_optimize=my_optimize_function):\n", 340 | " x, y = dask.compute(x, y)\n", 341 | "```\n", 342 | "\n", 343 | "You can register separate optimization functions for different collections, or\n", 344 | "you can register ``None`` if you do not want particular types of collections to\n", 345 | "be optimized:\n", 346 | "\n", 347 | "```python\n", 348 | "with dask.config.set(array_optimize=my_optimize_function,\n", 349 | " dataframe_optimize=None,\n", 350 | " delayed_optimize=my_other_optimize_function):\n", 351 | " ...\n", 352 | "```\n", 353 | "\n", 354 | "You do not need to specify all collections. Collections will default to their\n", 355 | "standard optimization scheme (which is usually a good choice).\n", 356 | "\n", 357 | "When creating your own collection, you can specify the optimization function\n", 358 | "by implementing a ``__dask_optimize__`` method."
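One way to build such a function is to chain the passes from the previous section. The sketch below uses a small toy graph (`inc` and `double` are made-up functions for illustration) and culls then fuses before handing the graph to a scheduler:

```python
from dask.optimization import cull, fuse
from dask.threaded import get

def inc(x):
    return x + 1

def double(x):
    return 2 * x

dsk = {
    "a": 1,
    "b": (inc, "a"),
    "c": (double, "b"),
    "unused": (inc, "c"),
}

def my_optimize_function(dsk, keys):
    dsk1, deps = cull(dsk, keys)                      # drop tasks the keys don't need
    dsk2, deps = fuse(dsk1, keys, dependencies=deps)  # merge linear chains
    return dsk2

optimized = my_optimize_function(dsk, ["c"])
print(get(optimized, "c"))  # 4
```

The same function could then be registered via `dask.config.set(array_optimize=my_optimize_function)` as shown above.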
359 | ] 360 | }, 361 | { 362 | "cell_type": "markdown", 363 | "id": "identified-brain", 364 | "metadata": {}, 365 | "source": [ 366 | "## Exercise\n", 367 | "\n", 368 | "This exercise is based off of [graphchain](https://github.com/radix-ai/graphchain)\n", 369 | "which implements a custom optimization function that reads data from a\n", 370 | "cache if it has already been computed.\n", 371 | "\n", 372 | "Take a look at the following task graph." 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "id": "massive-pasta", 379 | "metadata": {}, 380 | "outputs": [], 381 | "source": [ 382 | "import dask\n", 383 | "import pandas as pd\n", 384 | "\n", 385 | "def create_dataframe(num_rows, num_cols):\n", 386 | " print('Creating DataFrame...')\n", 387 | " return pd.DataFrame(data=[range(num_cols)]*num_rows)\n", 388 | "\n", 389 | "def complicated_computation(df, num_quantiles):\n", 390 | " print('Running complicated computation on DataFrame...')\n", 391 | " return df.quantile(q=[i / num_quantiles for i in range(num_quantiles)])\n", 392 | "\n", 393 | "def summarise_dataframes(*dfs):\n", 394 | " print('Summing DataFrames...')\n", 395 | " return sum(df.sum().sum() for df in dfs)\n", 396 | "\n", 397 | "dsk = {\n", 398 | " 'df_a': (create_dataframe, 10_000, 1000),\n", 399 | " 'df_b': (create_dataframe, 10_000, 1000),\n", 400 | " 'df_c': (complicated_computation, 'df_a', 2048),\n", 401 | " 'df_d': (complicated_computation, 'df_b', 2048),\n", 402 | " 'result': (summarise_dataframes, 'df_c', 'df_d')\n", 403 | "}\n", 404 | "\n", 405 | "%time dask.get(dsk, 'result')" 406 | ] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "id": "earned-sight", 411 | "metadata": {}, 412 | "source": [ 413 | "Notice how `df_a` and `df_b` have the same definition?\n", 414 | "\n", 415 | "Write an optimization function that replaces one of the matching tasks with a key\n", 416 | "pointing to the other task.\n", 417 | "\n", 418 | "> NOTE: This is a much more 
simplistic approach than graphchain uses, and it's less\n", 419 | "> useful in a distributed context." 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": null, 425 | "id": "expired-mandate", 426 | "metadata": {}, 427 | "outputs": [], 428 | "source": [ 429 | "def optimize(dsk):\n", 430 | " new_dsk = {} # make sure that you don't alter the task graph inplace\n", 431 | " \n", 432 | " for k, v in dsk.items():\n", 433 | " ... # if any two tasks match, replace one of the matching tasks with a key pointing to the other task.\n", 434 | " return new_dsk\n", 435 | "\n", 436 | "\n", 437 | "%time dask.get(optimize(dsk), 'result')" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": null, 443 | "id": "bound-specific", 444 | "metadata": { 445 | "jupyter": { 446 | "source_hidden": true 447 | }, 448 | "tags": [] 449 | }, 450 | "outputs": [], 451 | "source": [ 452 | "#solution\n", 453 | "def optimize(dsk):\n", 454 | " new_dsk = {}\n", 455 | " \n", 456 | " for k, v in dsk.items():\n", 457 | " if v not in new_dsk.values():\n", 458 | " new_dsk[k] = v\n", 459 | " else:\n", 460 | " new_dsk[k] = {v: k for k, v in new_dsk.items()}[v]\n", 461 | " return new_dsk\n", 462 | "\n", 463 | "%time dask.get(optimize(dsk), 'result')" 464 | ] 465 | }, 466 | { 467 | "cell_type": "code", 468 | "execution_count": null, 469 | "id": "relevant-punch", 470 | "metadata": { 471 | "jupyter": { 472 | "source_hidden": true 473 | } 474 | }, 475 | "outputs": [], 476 | "source": [ 477 | "# alternate solution\n", 478 | "def optimize(dsk):\n", 479 | " new_dsk = {}\n", 480 | "\n", 481 | " for k, v in dsk.items():\n", 482 | " first_key = list(dsk.keys())[list(dsk.values()).index(v)]\n", 483 | " if first_key not in new_dsk.keys():\n", 484 | " new_dsk[k] = v\n", 485 | " else:\n", 486 | " new_dsk[k] = first_key\n", 487 | " return new_dsk\n", 488 | "\n", 489 | "%time dask.get(optimize(dsk), 'result')" 490 | ] 491 | }, 492 | { 493 | "cell_type": "markdown", 494 | "id": 
"b222c529-d522-47ec-853d-d23d2744b7b1", 495 | "metadata": {}, 496 | "source": [ 497 | "## Conclusion\n", 498 | "\n", 499 | "Optimizations in Dask let's you simplify computation and improve parallelism. There are some great ones included by default (`cull`, `inline`, `fuse`), but sometimes it can be really powerful to write custom optimizations and either use them on existing collections or on custom collections. We'll touch on this a bit more in the next section about custom collections." 500 | ] 501 | } 502 | ], 503 | "metadata": { 504 | "jupytext": { 505 | "cell_metadata_filter": "-all", 506 | "main_language": "python", 507 | "notebook_metadata_filter": "-all", 508 | "text_representation": { 509 | "extension": ".md", 510 | "format_name": "markdown" 511 | } 512 | }, 513 | "kernelspec": { 514 | "display_name": "Python 3", 515 | "language": "python", 516 | "name": "python3" 517 | }, 518 | "language_info": { 519 | "codemirror_mode": { 520 | "name": "ipython", 521 | "version": 3 522 | }, 523 | "file_extension": ".py", 524 | "mimetype": "text/x-python", 525 | "name": "python", 526 | "nbconvert_exporter": "python", 527 | "pygments_lexer": "ipython3", 528 | "version": "3.9.4" 529 | }, 530 | "toc-autonumbering": false, 531 | "toc-showcode": false, 532 | "toc-showmarkdowntxt": false, 533 | "toc-showtags": false 534 | }, 535 | "nbformat": 4, 536 | "nbformat_minor": 5 537 | } 538 | -------------------------------------------------------------------------------- /4-distributed-scheduler.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "9d19f489-59e6-48c7-aade-25530dcf253c", 6 | "metadata": {}, 7 | "source": [ 8 | "# Hacking Dask clusters\n", 9 | "\n", 10 | "This notebook covers Dask's distributed clusters in more detail. 
We provide a more in-depth look at the components of a cluster, illustrate how to inspect the internal state of a cluster, and show how you can extend the functionality of your cluster using Dask's plugin system." 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "id": "c45762cb-a5a2-4f7c-ae29-02f05c78ba89", 16 | "metadata": {}, 17 | "source": [ 18 | "# Cluster overview\n", 19 | "\n", 20 | "In this section we'll discuss:\n", 21 | "\n", 22 | "1. The different components which make up a Dask cluster\n", 23 | "2. Different ways to launch a cluster\n", 24 | "\n", 25 | "## Components of a cluster\n", 26 | "\n", 27 | "A Dask cluster is composed of three different types of objects:\n", 28 | "\n", 29 | "1. **Scheduler**: A single, centralized scheduler process which responds to requests for computations, maintains relevant state about tasks and workers, and sends tasks to workers to be computed.\n", 30 | "2. **Workers**: One or more worker processes which compute tasks and store/serve their results.\n", 31 | "3. **Clients**: One or more client objects which are the user-facing entry point to interact with the cluster.\n", 32 | "\n", 33 | "![Dask cluster](images/dask-cluster.png)\n", 36 | "\n", 37 | "A couple of notes about workers:\n", 38 | "\n", 39 | "- Each worker runs in its own Python process. Each worker Python process has its own `concurrent.futures.ThreadPoolExecutor` which it uses to compute tasks in parallel. The same threads vs. processes considerations we discussed earlier also apply to Dask workers.\n", 40 | "- There's actually a fourth cluster object which is often not discussed: the **Nanny**. By default Dask workers are launched and managed by a separate nanny process. 
This separate process allows workers to restart themselves if you want to use the `Client.restart` method, or to restart workers automatically if they exceed a certain memory limit threshold.\n", 41 | "\n", 42 | "#### Related Documentation\n", 43 | "\n", 44 | "- [Cluster architecture](https://distributed.dask.org/en/latest/#architecture)\n", 45 | "- [Journey of a task](https://distributed.dask.org/en/latest/journey.html)\n", 46 | "\n", 47 | "## Deploying Dask clusters\n", 48 | "\n", 49 | "Deploying a Dask cluster means launching scheduler, worker, and client processes and setting up the appropriate network connections so these processes can communicate with one another. Dask clusters can be launched in a few different ways, which we highlight in the following sections.\n", 50 | "\n", 51 | "### Manual setup\n", 52 | "\n", 53 | "Launch a scheduler process using the `dask-scheduler` command line utility:\n", 54 | "\n", 55 | "```terminal\n", 56 | "$ dask-scheduler\n", 57 | "Scheduler at: tcp://192.0.0.100:8786\n", 58 | "```\n", 59 | "\n", 60 | "and then launch several workers by using the `dask-worker` command and providing them the address of the scheduler they should connect to:\n", 61 | "\n", 62 | "```terminal\n", 63 | "$ dask-worker tcp://192.0.0.100:8786\n", 64 | "Start worker at: tcp://192.0.0.1:12345\n", 65 | "Registered to: tcp://192.0.0.100:8786\n", 66 | "\n", 67 | "$ dask-worker tcp://192.0.0.100:8786\n", 68 | "Start worker at: tcp://192.0.0.2:40483\n", 69 | "Registered to: tcp://192.0.0.100:8786\n", 70 | "\n", 71 | "$ dask-worker tcp://192.0.0.100:8786\n", 72 | "Start worker at: tcp://192.0.0.3:27372\n", 73 | "Registered to: tcp://192.0.0.100:8786\n", 74 | "```\n", 75 | "\n", 76 | "### Python API (advanced)\n", 77 | "\n", 78 | "⚠️ **Warning**: Creating `Scheduler` / `Worker` objects explicitly in Python is rarely needed in practice and is intended for more advanced users ⚠️" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | 
"id": "ac63e3e5-e75f-4ebd-b1ff-4619a7e6f5a0", 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "from dask.distributed import Scheduler, Worker, Client\n", 89 | "\n", 90 | "# Launch a scheduler\n", 91 | "async with Scheduler() as scheduler: # Launch a scheduler\n", 92 | " # Launch a worker which connects to the scheduler\n", 93 | " async with Worker(scheduler.address) as worker:\n", 94 | " # Launch a client which connects to the scheduler\n", 95 | " async with Client(scheduler.address, asynchronous=True) as client:\n", 96 | " result = await client.submit(sum, range(100))\n", 97 | " print(f\"{result = }\")" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "id": "ec8031ac-58a0-40bb-a0b1-07847d54b964", 103 | "metadata": {}, 104 | "source": [ 105 | "### Cluster managers (recommended)\n", 106 | "\n", 107 | "Dask has the notion of cluster manager objects. Cluster managers offer a consistent interface for common activities like adding/removing workers to a cluster, retrieving logs, etc." 
108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "id": "a77fbd0f-720a-48e2-aa07-fd9d9a01f381", 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "from dask.distributed import LocalCluster\n", 118 | "\n", 119 | "# Launch a scheduler and 4 workers on my local machine\n", 120 | "cluster = LocalCluster(n_workers=4, threads_per_worker=2)\n", 121 | "cluster" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": null, 127 | "id": "41daa9b9-51c5-4996-85a5-144f19714389", 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [ 131 | "# Scale up to 10 workers\n", 132 | "cluster.scale(10)" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "id": "c8937b5b-2559-4eb1-96ae-d76b385526c2", 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [ 142 | "# Scale down to 2 workers\n", 143 | "cluster.scale(2)" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "id": "3b6098c3-1f8b-4e42-8d0a-996f2b44b1e3", 150 | "metadata": {}, 151 | "outputs": [], 152 | "source": [ 153 | "# Retrieve cluster logs\n", 154 | "cluster.get_logs()" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "id": "e350ead3-229a-4974-bd89-7174d99edbef", 161 | "metadata": {}, 162 | "outputs": [], 163 | "source": [ 164 | "# Shut down cluster\n", 165 | "cluster.close()" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "id": "94169465-79c0-4063-b7bf-dbc9264cdfe4", 171 | "metadata": {}, 172 | "source": [ 173 | "There are several projects in the Dask ecosystem for easily deploying clusters on commonly used computing resources:\n", 174 | "\n", 175 | "- [Dask-Kubernetes](https://kubernetes.dask.org/en/latest/) for deploying Dask using native Kubernetes APIs\n", 176 | "- [Dask-Cloudprovider](https://cloudprovider.dask.org/en/latest/) for deploying Dask clusters on various cloud platforms (e.g. 
AWS, GCP, Azure, etc.)\n", 177 | "- [Dask-Yarn](https://yarn.dask.org/en/latest/) for deploying Dask on YARN clusters\n", 178 | "- [Dask-MPI](http://mpi.dask.org/en/latest/) for deploying Dask on existing MPI environments\n", 179 | "- [Dask-Jobqueue](https://jobqueue.dask.org/en/latest/) for deploying Dask on job queuing systems (e.g. PBS, Slurm, etc.)\n", 180 | "\n", 181 | "Launching clusters with any of these projects follows a similar pattern to using Dask's built-in `LocalCluster`:\n", 182 | "\n", 183 | "```python\n", 184 | "# Launch a Dask cluster on a Kubernetes cluster\n", 185 | "from dask_kubernetes import KubeCluster\n", 186 | "cluster = KubeCluster(...)\n", 187 | "\n", 188 | "# Launch a Dask cluster on AWS Fargate\n", 189 | "from dask_cloudprovider.aws import FargateCluster\n", 190 | "cluster = FargateCluster(...)\n", 191 | "\n", 192 | "# Launch a Dask cluster on a PBS job queueing system\n", 193 | "from dask_jobqueue import PBSCluster\n", 194 | "cluster = PBSCluster(...)\n", 195 | "```\n", 196 | "\n", 197 | "Additionally, there are companies like [Coiled](https://coiled.io) and [Saturn Cloud](https://www.saturncloud.io) which have Dask deployment-as-a-service offerings. *Disclaimer*: the instructors for this tutorial are employed by both of these companies.\n", 198 | "\n", 199 | "#### Related Documentation\n", 200 | "\n", 201 | "- [Cluster setup](https://docs.dask.org/en/latest/setup.html)" 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "id": "20ba2911-0ac4-43bd-bdb9-54a2806e5ea9", 207 | "metadata": {}, 208 | "source": [ 209 | "# Inspecting a cluster's state\n", 210 | "\n", 211 | "In this section we'll:\n", 212 | "\n", 213 | "1. Familiarize ourselves with Dask's scheduler and worker processes\n", 214 | "2. Explore the various state that's tracked throughout the cluster\n", 215 | "3. 
Learn how to inspect remote scheduler and worker processes\n", 216 | "\n", 217 | "Dask has a variety of ways to provide users insight into what's going on during their computations. For example, Dask's [diagnostic dashboard](https://docs.dask.org/en/latest/diagnostics-distributed.html) displays real-time information about what tasks are currently running, overall progress on a computation, worker CPU and memory load, statistical profiling information, and much more. Additionally, Dask's [performance reports](https://distributed.dask.org/en/latest/diagnosing-performance.html#performance-reports) allow you to save the diagnostic dashboards as static HTML plots. Performance reports are particularly useful when benchmarking/profiling workloads or when sharing workload performance with colleagues." 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": null, 223 | "id": "ee7a1ae0-f029-41a8-90bc-7d6a5e6651e8", 224 | "metadata": {}, 225 | "outputs": [], 226 | "source": [ 227 | "from dask.distributed import LocalCluster, Client, Worker\n", 228 | "\n", 229 | "cluster = LocalCluster(worker_class=Worker)\n", 230 | "client = Client(cluster)\n", 231 | "client" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": null, 237 | "id": "33ce7759-2113-4dcc-9581-2a473eead63b", 238 | "metadata": {}, 239 | "outputs": [], 240 | "source": [ 241 | "import dask.array as da\n", 242 | "from dask.distributed import performance_report\n", 243 | "\n", 244 | "with performance_report(\"my_report.html\"):\n", 245 | " x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))\n", 246 | " result = (x + x.T).mean(axis=0).mean()\n", 247 | " result.compute()" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "id": "bed8743b-a34e-488a-bc51-abe923914cf0", 253 | "metadata": {}, 254 | "source": [ 255 | "These are invaluable tools and we highly recommend utilizing them. 
Oftentimes Dask's dashboard is entirely sufficient to understand the performance of your computations.\n", 256 | "\n", 257 | "However, sometimes it can be useful to dive more deeply into the internals of your cluster and directly inspect the state of your scheduler and workers. Let's start by submitting some tasks to the cluster to be computed." 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": null, 263 | "id": "98cd7868-ceb2-4066-bb8f-094ad3fc0053", 264 | "metadata": {}, 265 | "outputs": [], 266 | "source": [ 267 | "import random\n", 268 | "\n", 269 | "def double(x):\n", 270 | " random.seed(x)\n", 271 | " # Simulate some random task failures\n", 272 | " if random.random() < 0.1:\n", 273 | " raise ValueError(\"Oh no!\")\n", 274 | " return 2 * x\n", 275 | "\n", 276 | "futures = client.map(double, range(50))" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "id": "3b8a3c19-97d2-47a9-8216-3f73abe6189a", 282 | "metadata": {}, 283 | "source": [ 284 | "One of the nice things about `LocalCluster` is that it gives us direct access to the `Scheduler` Python object. This allows us to easily inspect the scheduler directly." 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": null, 290 | "id": "5b842c4f-64c6-4a0b-b068-45332552d6a9", 291 | "metadata": {}, 292 | "outputs": [], 293 | "source": [ 294 | "scheduler = cluster.scheduler\n", 295 | "scheduler" 296 | ] 297 | }, 298 | { 299 | "cell_type": "markdown", 300 | "id": "6b8e8155-68a2-4cfc-8cfb-424180e98bdb", 301 | "metadata": {}, 302 | "source": [ 303 | "ℹ️ Note that oftentimes you won't have direct access to the `Scheduler` Python object (e.g. when the scheduler is running on a separate machine). In these cases it's still possible to inspect the scheduler and we will discuss how to do this later on.\n", 304 | "\n", 305 | "The scheduler tracks **a lot** of state. Let's start to explore the scheduler to get a sense for what information it keeps track of." 
306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": null, 311 | "id": "a4a43b6c-210c-41a0-bfff-38c38bfb6316", 312 | "metadata": {}, 313 | "outputs": [], 314 | "source": [ 315 | "scheduler.address # Scheduler's address" 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": null, 321 | "id": "eb25dc76-df8c-418e-84ea-ba3d33fcd373", 322 | "metadata": {}, 323 | "outputs": [], 324 | "source": [ 325 | "scheduler.time_started # Time the scheduler was started" 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": null, 331 | "id": "59c8b1be-3adb-4cdd-893c-f83d71a75a92", 332 | "metadata": {}, 333 | "outputs": [], 334 | "source": [ 335 | "dict(scheduler.workers)" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": null, 341 | "id": "f9f68f77-69c6-4b17-90ce-cf179af7d287", 342 | "metadata": {}, 343 | "outputs": [], 344 | "source": [ 345 | "worker_state = next(iter(scheduler.workers.values()))\n", 346 | "worker_state" 347 | ] 348 | }, 349 | { 350 | "cell_type": "markdown", 351 | "id": "f5ba8180-8ef7-4bc7-ae0a-cf7b9708c9e6", 352 | "metadata": {}, 353 | "source": [ 354 | "Let's take a look at the `WorkerState` attributes" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": null, 360 | "id": "a73d206f-7a65-49f0-8361-3f9e37f5a73a", 361 | "metadata": {}, 362 | "outputs": [], 363 | "source": [ 364 | "worker_state.address # Worker's address" 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": null, 370 | "id": "c0d58de4-ec72-4d16-a77b-f35a05d74563", 371 | "metadata": {}, 372 | "outputs": [], 373 | "source": [ 374 | "worker_state.status # Current status of the worker (e.g. 
\"running\", \"closed\")" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": null, 380 | "id": "9b0add25-1edc-4bad-92dd-c2b18947d451", 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [ 384 | "worker_state.nthreads # Number of threads in the worker's `ThreadPoolExecutor`" 385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "execution_count": null, 390 | "id": "24be2aab-a7c7-454d-91f9-78897e32eadb", 391 | "metadata": {}, 392 | "outputs": [], 393 | "source": [ 394 | "worker_state.executing # Dictionary of all tasks which are currently being processed, along with the current duration of the task" 395 | ] 396 | }, 397 | { 398 | "cell_type": "code", 399 | "execution_count": null, 400 | "id": "8e414215-dc91-4468-af2f-4119945bca3a", 401 | "metadata": {}, 402 | "outputs": [], 403 | "source": [ 404 | "worker_state.metrics # Various metrics describing the current state of the worker" 405 | ] 406 | }, 407 | { 408 | "cell_type": "markdown", 409 | "id": "e2bbacd5-a368-4566-912c-4a4db4b8a211", 410 | "metadata": {}, 411 | "source": [ 412 | "Workers check in with the scheduler inform it when certain event occur (e.g. when a worker has completed a task) so the scheduler can update it's internal state." 
413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": null, 418 | "id": "abf2af29-be3e-4d78-a5c2-ca595d380357", 419 | "metadata": {}, 420 | "outputs": [], 421 | "source": [ 422 | "worker_state.last_seen" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": null, 428 | "id": "be601b2d-7047-44b2-81c2-5dabd344053c", 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [ 432 | "import time\n", 433 | "\n", 434 | "for _ in range(10):\n", 435 | " print(f\"{worker_state.last_seen = }\")\n", 436 | " time.sleep(1)" 437 | ] 438 | }, 439 | { 440 | "cell_type": "markdown", 441 | "id": "13b53603-8d32-40c8-b71d-b3ed1af57848", 442 | "metadata": {}, 443 | "source": [ 444 | "In addition to the state of each worker, the scheduler also tracks information for each task it has been asked to run." 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": null, 450 | "id": "ca9ddfa1-7ab7-4585-961d-01522c343b42", 451 | "metadata": {}, 452 | "outputs": [], 453 | "source": [ 454 | "scheduler.tasks" 455 | ] 456 | }, 457 | { 458 | "cell_type": "code", 459 | "execution_count": null, 460 | "id": "ac277222-ac14-4743-be25-d2460f5d9d9f", 461 | "metadata": {}, 462 | "outputs": [], 463 | "source": [ 464 | "task_state = next(iter(scheduler.tasks.values()))" 465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": null, 470 | "id": "9d8093d0-9542-42d7-a417-1e9b59a6970d", 471 | "metadata": {}, 472 | "outputs": [], 473 | "source": [ 474 | "task_state" 475 | ] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "execution_count": null, 480 | "id": "1bc9a48c-6c3a-4f39-be74-fae5e26ea645", 481 | "metadata": {}, 482 | "outputs": [], 483 | "source": [ 484 | "task_state.key # Task's name (unique identifier)" 485 | ] 486 | }, 487 | { 488 | "cell_type": "code", 489 | "execution_count": null, 490 | "id": "cfe7c942-172b-4e54-9548-8e94c17da4c0", 491 | "metadata": {}, 492 | "outputs": [], 493 | "source": [ 494 | 
"task_state.state # Task's state (e.g. \"memory\", \"waiting\", \"processing\", \"erred\", etc.)" 495 | ] 496 | }, 497 | { 498 | "cell_type": "code", 499 | "execution_count": null, 500 | "id": "b9cd9140-19ce-45dd-adc0-d74c7a671ba1", 501 | "metadata": {}, 502 | "outputs": [], 503 | "source": [ 504 | "task_state.who_has # Set of workers (`WorkerState`s) who have this task's result in memory" 505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": null, 510 | "id": "fce9b8e3-e8c7-49d0-8343-00d899c35a8a", 511 | "metadata": {}, 512 | "outputs": [], 513 | "source": [ 514 | "task_state.nbytes # The number of bytes of the result of this finished task" 515 | ] 516 | }, 517 | { 518 | "cell_type": "code", 519 | "execution_count": null, 520 | "id": "4b2ec17e-8bb3-47a7-b078-869b0019f6ae", 521 | "metadata": {}, 522 | "outputs": [], 523 | "source": [ 524 | "task_state.type # The type of the the task's result (as a string)" 525 | ] 526 | }, 527 | { 528 | "cell_type": "code", 529 | "execution_count": null, 530 | "id": "1ee16f2c-3f95-41f6-ad80-8530b1c33802", 531 | "metadata": {}, 532 | "outputs": [], 533 | "source": [ 534 | "task_state.retries # The number of times this task can automatically be retried in case of failure" 535 | ] 536 | }, 537 | { 538 | "cell_type": "markdown", 539 | "id": "f498bef9-988d-4991-981f-d62158dec477", 540 | "metadata": {}, 541 | "source": [ 542 | "## Exercise 1\n", 543 | "\n", 544 | "Spend the next 5 minutes continuing to explore the attributes the scheduler keeps track of. Try to answer the following questions:\n", 545 | "\n", 546 | "1. What are the keys for the tasks which failed?\n", 547 | "2. How many tasks successfully ran on each worker?" 
548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": null, 553 | "id": "0f35a9e7-05e6-4395-9551-bf6ba39d6aad", 554 | "metadata": {}, 555 | "outputs": [], 556 | "source": [ 557 | "# What are the keys for the tasks which failed?\n", 558 | "# Your solution goes here" 559 | ] 560 | }, 561 | { 562 | "cell_type": "code", 563 | "execution_count": null, 564 | "id": "4b4e6c15-bf52-4bf5-b841-9646770ac99e", 565 | "metadata": { 566 | "jupyter": { 567 | "source_hidden": true 568 | }, 569 | "tags": [] 570 | }, 571 | "outputs": [], 572 | "source": [ 573 | "# Solution to \"What are the keys for the tasks which failed?\"\n", 574 | "erred_tasks = [key for key, ts in scheduler.tasks.items() if ts.state == \"erred\"]\n", 575 | "erred_tasks" 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": null, 581 | "id": "5d22af01-a1e8-400e-b049-9889c24142ae", 582 | "metadata": {}, 583 | "outputs": [], 584 | "source": [ 585 | "# How many tasks successfully ran on each worker?\n", 586 | "# Your solution goes here" 587 | ] 588 | }, 589 | { 590 | "cell_type": "code", 591 | "execution_count": null, 592 | "id": "b698f481-b9d6-4e9c-ac0c-325d1ef725d8", 593 | "metadata": { 594 | "jupyter": { 595 | "source_hidden": true 596 | }, 597 | "tags": [] 598 | }, 599 | "outputs": [], 600 | "source": [ 601 | "# Solution to \"How many tasks successfully ran on each worker?\"\n", 602 | "from collections import defaultdict\n", 603 | "\n", 604 | "# Each completed task's `who_has` lists the workers holding its result\n", 605 | "counter = defaultdict(int)\n", 606 | "for key, ts in scheduler.tasks.items():\n", 607 | " if ts.state == \"erred\":\n", 608 | " continue\n", 609 | " for ws in ts.who_has:\n", 610 | " counter[ws.address] += 1\n", 611 | "print(dict(counter))\n", 612 | "\n", 613 | "# # Alternative solution to \"How many tasks successfully ran on each worker?\"\n", 614 | "# counter = {address: worker_state.metrics[\"in_memory\"]\n", 615 | "# for address, worker_state 
in scheduler.workers.items()}\n", 616 | "# print(counter)" 617 | ] 618 | }, 619 | { 620 | "cell_type": "markdown", 621 | "id": "5e07c7c2-f157-4fbb-82cb-81b59fff7da6", 622 | "metadata": {}, 623 | "source": [ 624 | "In addition to inspecting the scheduler, we can also investigate the state of each of our workers." 625 | ] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "execution_count": null, 630 | "id": "fbc95933-bff8-4a8e-9b28-430c4a042034", 631 | "metadata": {}, 632 | "outputs": [], 633 | "source": [ 634 | "cluster.workers" 635 | ] 636 | }, 637 | { 638 | "cell_type": "code", 639 | "execution_count": null, 640 | "id": "42800bd2-8ef2-456a-87d7-0354cf7c9626", 641 | "metadata": {}, 642 | "outputs": [], 643 | "source": [ 644 | "worker = next(iter(cluster.workers.values()))\n", 645 | "worker" 646 | ] 647 | }, 648 | { 649 | "cell_type": "code", 650 | "execution_count": null, 651 | "id": "b3f5ccb3-4495-4290-9209-25b2687c951e", 652 | "metadata": {}, 653 | "outputs": [], 654 | "source": [ 655 | "worker.address # Worker's address" 656 | ] 657 | }, 658 | { 659 | "cell_type": "code", 660 | "execution_count": null, 661 | "id": "e89df365-8547-4157-823c-0b33c6ef83e9", 662 | "metadata": {}, 663 | "outputs": [], 664 | "source": [ 665 | "worker.executing_count # Number of tasks the worker is currently computing" 666 | ] 667 | }, 668 | { 669 | "cell_type": "code", 670 | "execution_count": null, 671 | "id": "969a7bed-a6e3-4708-9f95-1900d48b27f5", 672 | "metadata": {}, 673 | "outputs": [], 674 | "source": [ 675 | "worker.executed_count # Running total of all tasks processed on this worker" 676 | ] 677 | }, 678 | { 679 | "cell_type": "code", 680 | "execution_count": null, 681 | "id": "aaeebb27-e9c3-456b-9267-c5180559e82e", 682 | "metadata": {}, 683 | "outputs": [], 684 | "source": [ 685 | "worker.nthreads # Number of threads in the worker's ThreadPoolExecutor" 686 | ] 687 | }, 688 | { 689 | "cell_type": "code", 690 | "execution_count": null, 691 | "id": 
"ab866a75-7e95-4c5f-b3ff-e54306492fa2", 692 | "metadata": {}, 693 | "outputs": [], 694 | "source": [ 695 | "worker.executor # Worker's ThreadPoolExecutor where it computes tasks" 696 | ] 697 | }, 698 | { 699 | "cell_type": "code", 700 | "execution_count": null, 701 | "id": "257fd911-8d37-421a-87ad-1f87e4f68f3e", 702 | "metadata": {}, 703 | "outputs": [], 704 | "source": [ 705 | "worker.keys() # Keys the worker currently has in memory" 706 | ] 707 | }, 708 | { 709 | "cell_type": "code", 710 | "execution_count": null, 711 | "id": "c1b9afab-edad-4a99-8471-71a943f81d35", 712 | "metadata": {}, 713 | "outputs": [], 714 | "source": [ 715 | "worker.data # Where the worker stores task results" 716 | ] 717 | }, 718 | { 719 | "cell_type": "code", 720 | "execution_count": null, 721 | "id": "5d45c97e-d8c5-4a50-882c-689ec62378b9", 722 | "metadata": {}, 723 | "outputs": [], 724 | "source": [ 725 | "{key: worker.data[key] for key in worker.keys()}" 726 | ] 727 | }, 728 | { 729 | "cell_type": "markdown", 730 | "id": "c1e20d0f-38d7-4694-a65e-e223557d5d3a", 731 | "metadata": {}, 732 | "source": [ 733 | "## Accessing remote scheduler and workers\n", 734 | "\n", 735 | "As we noted earlier, often times you won't have direct access to the `Scheduler` or `Worker` Python objects for your cluster. However, in these cases it's still possible to examine the state of the scheduler and workers in your cluster using the `Client.run` ([docs](https://distributed.dask.org/en/latest/api.html#distributed.Client.run)) and `Client.run_on_scheduler`([docs](https://distributed.dask.org/en/latest/api.html#distributed.Client.run_on_scheduler)) methods.\n", 736 | "\n", 737 | "`Client.run` allows you to run a function on worker processes in your cluster. If the function has a `dask_worker` parameter, then that variable will be populated with the `Worker` instance when the function is run. Likewise, `Client.run_on_scheduler` allows you to run a function on the scheduler processes in your cluster. 
If the function has a `dask_scheduler` parameter, then that variable will be populated with the `Scheduler` instance when the function is run.\n", 738 | "\n", 739 | "Let's look at some examples." 740 | ] 741 | }, 742 | { 743 | "cell_type": "code", 744 | "execution_count": null, 745 | "id": "63168d7a-0db6-41be-81d7-9ea51a3bf2dd", 746 | "metadata": {}, 747 | "outputs": [], 748 | "source": [ 749 | "import os\n", 750 | "\n", 751 | "result = client.run(os.getpid)\n", 752 | "result" 753 | ] 754 | }, 755 | { 756 | "cell_type": "markdown", 757 | "id": "11856721-488a-4a92-ab32-131ec137be32", 758 | "metadata": {}, 759 | "source": [ 760 | "`Client.run` also accepts a `workers=` keyword argument which is the list of workers you want to run the specified function on (by default it will run on all workers in the cluster)." 761 | ] 762 | }, 763 | { 764 | "cell_type": "code", 765 | "execution_count": null, 766 | "id": "5d53735e-7133-4a54-866c-8156070e873f", 767 | "metadata": {}, 768 | "outputs": [], 769 | "source": [ 770 | "workers = list(result.keys())[:2]\n", 771 | "workers" 772 | ] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "execution_count": null, 777 | "id": "8bd655cf-be15-4dbb-907a-0dd1bc691be3", 778 | "metadata": {}, 779 | "outputs": [], 780 | "source": [ 781 | "import os\n", 782 | "\n", 783 | "client.run(os.getpid, workers=workers)" 784 | ] 785 | }, 786 | { 787 | "cell_type": "markdown", 788 | "id": "1454872f-b7e3-4cdd-9bb2-20735d778e5f", 789 | "metadata": {}, 790 | "source": [ 791 | "You can even run custom functions you've written yourself! If the function has a `dask_worker` parameter, it will be passed the `Worker` instance when it runs:" 
792 | ] 793 | }, 794 | { 795 | "cell_type": "code", 796 | "execution_count": null, 797 | "id": "f2567d89-6409-42ec-b147-c24ab84ac643", 798 | "metadata": {}, 799 | "outputs": [], 800 | "source": [ 801 | "def get_worker_name(dask_worker):\n", 802 | " return dask_worker.name\n", 803 | "\n", 804 | "client.run(get_worker_name)" 805 | ] 806 | }, 807 | { 808 | "cell_type": "markdown", 809 | "id": "09ac4f6b-c0ae-4c7e-a66b-1f39a3ceb83b", 810 | "metadata": {}, 811 | "source": [ 812 | "Similarly, we can do the same thing on the scheduler by using `Client.run_on_scheduler`" 813 | ] 814 | }, 815 | { 816 | "cell_type": "code", 817 | "execution_count": null, 818 | "id": "6a89cf9b-0f78-456b-94df-7a7fb87b22e0", 819 | "metadata": {}, 820 | "outputs": [], 821 | "source": [ 822 | "client.run_on_scheduler(os.getpid)" 823 | ] 824 | }, 825 | { 826 | "cell_type": "code", 827 | "execution_count": null, 828 | "id": "0daf0c15-c20d-421e-a21c-3ee27df6e723", 829 | "metadata": { 830 | "tags": [] 831 | }, 832 | "outputs": [], 833 | "source": [ 834 | "def get_erred_tasks(dask_scheduler):\n", 835 | " return [key for key, ts in dask_scheduler.tasks.items() if ts.state == \"erred\"]\n", 836 | "\n", 837 | "client.run_on_scheduler(get_erred_tasks)" 838 | ] 839 | }, 840 | { 841 | "cell_type": "code", 842 | "execution_count": null, 843 | "id": "780f7e4f-8ad2-4e93-ac86-e2d8c5331902", 844 | "metadata": {}, 845 | "outputs": [], 846 | "source": [ 847 | "client.close()\n", 848 | "cluster.close()" 849 | ] 850 | }, 851 | { 852 | "cell_type": "markdown", 853 | "id": "aec78fc7-3125-4d40-a03e-3595c384382d", 854 | "metadata": {}, 855 | "source": [ 856 | "#### Related Documentation\n", 857 | "\n", 858 | "- [Dask worker](https://distributed.dask.org/en/latest/worker.html)\n", 859 | "- [Scheduling state](https://distributed.dask.org/en/latest/scheduling-state.html)" 860 | ] 861 | }, 862 | { 863 | "cell_type": "markdown", 864 | "id": "71b05ace-1e10-4716-ae83-f7c659c46d28", 865 | "metadata": {}, 866 | "source": [ 867 | 
"# Extending the scheduler and workers: Dask's plugin system\n", 868 | "\n", 869 | "In this section we'll siscuss Dask's scheduler and worker plugin systems and write our own plugin to extend the scheduler's functionality.\n", 870 | "\n", 871 | "So far we've primarily focused on inspecting the state of a cluster. However, there are times when it's useful to extend the functionality of the scheduler and/or workers in a cluster. To help facilitate this, Dask has scheduler and worker plugin systems which enable you to hook into different events that happen throughout a cluster's lifecycle. This allows you to run custom code when a specific type of event occurs on the cluster.\n", 872 | "\n", 873 | "Specifically, the [scheduler plugin system](https://distributed.dask.org/en/latest/plugins.html#scheduler-plugins) enables you run custom code when the following events occur:\n", 874 | "\n", 875 | "1. Scheduler starts, stops, or is restarted\n", 876 | "2. Client connects or disconnects to the scheduler\n", 877 | "3. Workers enters or leaves the cluster\n", 878 | "4. When a new task enters the scheduler\n", 879 | "5. When a task changes state (e.g. from \"processing\" to \"memory\")\n", 880 | "\n", 881 | "While the [worker plugin system](https://distributed.dask.org/en/latest/plugins.html#worker-plugins) enables you run custom code when the following events occur:\n", 882 | "\n", 883 | "1. Worker starts or stops\n", 884 | "2. When a worker releases a task\n", 885 | "3. When a task changes state (e.g. \"processing\" to \"memory\")\n", 886 | "\n", 887 | "Implementing your own custom plugin consists of creating a Python class with certain methods (each method corresponds to a particular lifecycle event)." 
888 | ] 889 | }, 890 | { 891 | "cell_type": "code", 892 | "execution_count": null, 893 | "id": "de6f767a-52e6-4c9f-a26a-7a642b4bb953", 894 | "metadata": {}, 895 | "outputs": [], 896 | "source": [ 897 | "from distributed import SchedulerPlugin, WorkerPlugin" 898 | ] 899 | }, 900 | { 901 | "cell_type": "code", 902 | "execution_count": null, 903 | "id": "6f35b157-5502-4809-a1e6-5991799c8d31", 904 | "metadata": {}, 905 | "outputs": [], 906 | "source": [ 907 | "# Lifecycle SchedulerPlugin methods\n", 908 | "[attr for attr in dir(SchedulerPlugin) if not attr.startswith(\"_\")]" 909 | ] 910 | }, 911 | { 912 | "cell_type": "code", 913 | "execution_count": null, 914 | "id": "69ac1db2-1170-49dd-a043-a646df6cedc6", 915 | "metadata": {}, 916 | "outputs": [], 917 | "source": [ 918 | "# Lifecycle WorkerPlugin methods\n", 919 | "[attr for attr in dir(WorkerPlugin) if not attr.startswith(\"_\")]" 920 | ] 921 | }, 922 | { 923 | "cell_type": "markdown", 924 | "id": "131294ca-74ee-4754-bd3e-15eba1de758b", 925 | "metadata": {}, 926 | "source": [ 927 | "For the exact signature of each method, please refer to the [`SchedulerPlugin`](https://distributed.dask.org/en/latest/plugins.html#scheduler-plugins) and [`WorkerPlugin`](https://distributed.dask.org/en/latest/plugins.html#worker-plugins) documentation.\n", 928 | "\n", 929 | "Let's look at an example scheduler plugin." 
930 | ] 931 | }, 932 | { 933 | "cell_type": "code", 934 | "execution_count": null, 935 | "id": "4fc97c90-a858-40ea-9f31-e7f90aba3e85", 936 | "metadata": {}, 937 | "outputs": [], 938 | "source": [ 939 | "class Counter(SchedulerPlugin):\n", 940 | " \"\"\"Keeps a running count of the total number of completed tasks\"\"\"\n", 941 | " def __init__(self):\n", 942 | " self.n_tasks = 0\n", 943 | "\n", 944 | " def transition(self, key, start, finish, *args, **kwargs):\n", 945 | " if start == \"processing\" and finish == \"memory\":\n", 946 | " self.n_tasks += 1\n", 947 | "\n", 948 | " def restart(self, scheduler):\n", 949 | " self.n_tasks = 0" 950 | ] 951 | }, 952 | { 953 | "cell_type": "markdown", 954 | "id": "e2586c29-e5e3-495d-84aa-d30f02639ff7", 955 | "metadata": {}, 956 | "source": [ 957 | "To add a custom scheduler plugin to your cluster, use the `Scheduler.add_plugin` method:" 958 | ] 959 | }, 960 | { 961 | "cell_type": "code", 962 | "execution_count": null, 963 | "id": "56d0b325-2524-4c33-809c-5b853c94594b", 964 | "metadata": {}, 965 | "outputs": [], 966 | "source": [ 967 | "# Create LocalCluster and Client\n", 968 | "cluster = LocalCluster()\n", 969 | "client = Client(cluster)\n", 970 | "\n", 971 | "# Instantiate and add the Counter to our cluster\n", 972 | "counter = Counter()\n", 973 | "cluster.scheduler.add_plugin(counter)" 974 | ] 975 | }, 976 | { 977 | "cell_type": "code", 978 | "execution_count": null, 979 | "id": "485d1388-4cec-4a16-a2d1-2f2e5d3d2448", 980 | "metadata": {}, 981 | "outputs": [], 982 | "source": [ 983 | "counter.n_tasks" 984 | ] 985 | }, 986 | { 987 | "cell_type": "code", 988 | "execution_count": null, 989 | "id": "220a76d3-25d4-4a3b-800e-fea64b055cb7", 990 | "metadata": {}, 991 | "outputs": [], 992 | "source": [ 993 | "from distributed import wait\n", 994 | "futures = client.map(lambda x: x + 1, range(27))\n", 995 | "wait(futures);" 996 | ] 997 | }, 998 | { 999 | "cell_type": "code", 1000 | "execution_count": null, 1001 | "id": 
"5a92ad14-2976-4552-b654-8a7ad8645d8a", 1002 | "metadata": {}, 1003 | "outputs": [], 1004 | "source": [ 1005 | "counter.n_tasks" 1006 | ] 1007 | }, 1008 | { 1009 | "cell_type": "code", 1010 | "execution_count": null, 1011 | "id": "c4a05fc9-8292-48c7-a67d-21ac536e4d3c", 1012 | "metadata": {}, 1013 | "outputs": [], 1014 | "source": [ 1015 | "client.close()\n", 1016 | "cluster.close()" 1017 | ] 1018 | }, 1019 | { 1020 | "cell_type": "markdown", 1021 | "id": "2137bfc5-255d-4047-821d-54a10791b451", 1022 | "metadata": {}, 1023 | "source": [ 1024 | "This is a relatively straightforward plugin one could write. Let's look at the `distributed`s built-in `PipInstall` worker plugin to see a more real-world example." 1025 | ] 1026 | }, 1027 | { 1028 | "cell_type": "code", 1029 | "execution_count": null, 1030 | "id": "d3ed51ef-9342-4f3b-a5f4-459191be8db7", 1031 | "metadata": { 1032 | "tags": [] 1033 | }, 1034 | "outputs": [], 1035 | "source": [ 1036 | "from distributed import PipInstall\n", 1037 | "\n", 1038 | "PipInstall??" 1039 | ] 1040 | }, 1041 | { 1042 | "cell_type": "markdown", 1043 | "id": "530d27a2-8f93-4735-bc9f-b4e508b7ad53", 1044 | "metadata": {}, 1045 | "source": [ 1046 | "To add a custom worker plugin to your cluster, use the `Client.register_worker_plugin` method." 
1047 | ] 1048 | }, 1049 | { 1050 | "cell_type": "markdown", 1051 | "id": "efde3b89-47c8-4d75-b015-be7641284ea6", 1052 | "metadata": {}, 1053 | "source": [ 1054 | "## Exercise 2\n", 1055 | "\n", 1056 | "Over the next 10 minutes, create a `TaskTimerPlugin` scheduler plugin which keeps track of how long each task takes to run.\n", 1057 | "\n", 1058 | "```python\n", 1059 | "\n", 1060 | "class TaskTimerPlugin(SchedulerPlugin):\n", 1061 | " ...\n", 1062 | "\n", 1063 | "# Create LocalCluster and Client\n", 1064 | "cluster = LocalCluster()\n", 1065 | "client = Client(cluster)\n", 1066 | "\n", 1067 | "# Instantiate and add the TaskTimerPlugin to our cluster\n", 1068 | "plugin = TaskTimerPlugin()\n", 1069 | "cluster.scheduler.add_plugin(plugin)\n", 1070 | "\n", 1071 | "import dask.array as da\n", 1072 | "\n", 1073 | "x = da.random.random((20_000, 20_000), chunks=(5_000, 1_000))\n", 1074 | "result = (x + x.T).mean(axis=0).sum()\n", 1075 | "result.compute()\n", 1076 | "```" 1077 | ] 1078 | }, 1079 | { 1080 | "cell_type": "code", 1081 | "execution_count": null, 1082 | "id": "46fd60d2-87c0-47df-a857-a37cf6ee26f7", 1083 | "metadata": {}, 1084 | "outputs": [], 1085 | "source": [ 1086 | "# Your solution to Exercise 2 here" 1087 | ] 1088 | }, 1089 | { 1090 | "cell_type": "code", 1091 | "execution_count": null, 1092 | "id": "628ceef7-8602-4e13-8191-02d6078cdcaf", 1093 | "metadata": { 1094 | "jupyter": { 1095 | "source_hidden": true 1096 | }, 1097 | "scrolled": true, 1098 | "tags": [] 1099 | }, 1100 | "outputs": [], 1101 | "source": [ 1102 | "# Solution to Exercise 2\n", 1103 | "import time\n", 1104 | "\n", 1105 | "class TaskTimerPlugin(SchedulerPlugin):\n", 1106 | " def __init__(self):\n", 1107 | " self.start_times = {}\n", 1108 | " self.stop_times = {}\n", 1109 | " self.task_durations = {}\n", 1110 | "\n", 1111 | " def transition(self, key, start, finish, *args, **kwargs):\n", 1112 | " if finish == \"processing\":\n", 1113 | " self.start_times[key] = time.time()\n", 1114 | " elif 
finish == \"memory\":\n", 1115 | " self.stop_times[key] = time.time()\n", 1116 | " self.task_durations[key] = self.stop_times[key] - self.start_times[key]\n", 1117 | "\n", 1118 | "# Create LocalCluster and Client\n", 1119 | "cluster = LocalCluster()\n", 1120 | "client = Client(cluster)\n", 1121 | "\n", 1122 | "# Instantiate and add the TaskTimerPlugin to our cluster\n", 1123 | "plugin = TaskTimerPlugin()\n", 1124 | "cluster.scheduler.add_plugin(plugin)\n", 1125 | "\n", 1126 | "import dask.array as da\n", 1127 | "\n", 1128 | "x = da.random.random((20_000, 20_000), chunks=(5_000, 1_000))\n", 1129 | "result = (x + x.T).mean(axis=0).sum()\n", 1130 | "result.compute()\n", 1131 | "\n", 1132 | "plugin.task_durations" 1133 | ] 1134 | }, 1135 | { 1136 | "cell_type": "markdown", 1137 | "id": "9e230a95-353e-4d7a-a5b1-16f8b6e97b9f", 1138 | "metadata": {}, 1139 | "source": [ 1140 | "**Bonus**: If you have extra time, make a plot of the task duration distribution (hint: `pandas` and `matplotlib` are installed)" 1141 | ] 1142 | }, 1143 | { 1144 | "cell_type": "code", 1145 | "execution_count": null, 1146 | "id": "3ab78fb2-ef32-4587-a0fe-0b4c31930926", 1147 | "metadata": {}, 1148 | "outputs": [], 1149 | "source": [ 1150 | "# Your plotting code here" 1151 | ] 1152 | }, 1153 | { 1154 | "cell_type": "code", 1155 | "execution_count": null, 1156 | "id": "d5369156-5524-49c0-bb00-37adffa1d175", 1157 | "metadata": { 1158 | "jupyter": { 1159 | "source_hidden": true 1160 | }, 1161 | "tags": [] 1162 | }, 1163 | "outputs": [], 1164 | "source": [ 1165 | "import pandas as pd\n", 1166 | "\n", 1167 | "df = pd.DataFrame([(key, 1_000 * value) for key, value in plugin.task_durations.items()],\n", 1168 | " columns=[\"key\", \"duration\"])\n", 1169 | "ax = df.duration.plot(kind=\"hist\", bins=50, logy=True)\n", 1170 | "ax.set_xlabel(\"Task duration [ms]\")\n", 1171 | "ax.set_ylabel(\"Counts\");" 1172 | ] 1173 | }, 1174 | { 1175 | "cell_type": "code", 1176 | "execution_count": null, 1177 | "id": 
"5bf57a9e-8fb6-4355-9aff-3f426efab9ce", 1178 | "metadata": {}, 1179 | "outputs": [], 1180 | "source": [ 1181 | "client.close()\n", 1182 | "cluster.close()" 1183 | ] 1184 | }, 1185 | { 1186 | "cell_type": "markdown", 1187 | "id": "f09c76dd-7523-42f5-a483-30d8ba69f2ec", 1188 | "metadata": {}, 1189 | "source": [ 1190 | "# Summary\n", 1191 | "\n", 1192 | "In this notebook we took a detailed look at the components of a Dask cluster, illustrated how to inspect the internal state of a cluster (both the scheduler and workers), and showed how you can use Dask's plugin system to execute custom code during a cluster's lifecycle.\n", 1193 | "\n", 1194 | "# Additional Resources\n", 1195 | "\n", 1196 | "- Repositories on GitHub:\n", 1197 | " - Dask https://github.com/dask/dask\n", 1198 | " - Distributed https://github.com/dask/distributed\n", 1199 | " \n", 1200 | "- Documentation:\n", 1201 | " - Dask documentation https://docs.dask.org\n", 1202 | " - Distributed documentation https://distributed.dask.org\n", 1203 | "\n", 1204 | "- If you have a Dask usage question, please ask it on the [Dask GitHub discussions board](https://github.com/dask/dask/discussions).\n", 1205 | "\n", 1206 | "- If you run into a bug, feel free to file a report on the [Dask GitHub issue tracker](https://github.com/dask/dask/issues).\n", 1207 | "\n", 1208 | "- If you're interested in getting involved and contributing to Dask, please check out our [contributing guide](https://docs.dask.org/en/latest/develop.html).\n", 1209 | "\n", 1210 | "# Thank you!"
1211 | ] 1212 | } 1213 | ], 1214 | "metadata": { 1215 | "kernelspec": { 1216 | "display_name": "Python 3", 1217 | "language": "python", 1218 | "name": "python3" 1219 | }, 1220 | "language_info": { 1221 | "codemirror_mode": { 1222 | "name": "ipython", 1223 | "version": 3 1224 | }, 1225 | "file_extension": ".py", 1226 | "mimetype": "text/x-python", 1227 | "name": "python", 1228 | "nbconvert_exporter": "python", 1229 | "pygments_lexer": "ipython3", 1230 | "version": "3.9.4" 1231 | } 1232 | }, 1233 | "nbformat": 4, 1234 | "nbformat_minor": 5 1235 | } 1236 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 James Bourbeau 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Hacking Dask: Diving into Dask's Internals 2 | 3 | [![Build](https://github.com/jrbourbeau/hacking-dask/actions/workflows/build.yml/badge.svg)](https://github.com/jrbourbeau/hacking-dask/actions/workflows/build.yml) 4 | [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/jrbourbeau/hacking-dask/main?urlpath=lab) 5 | 6 | This repository contains materials for the "Hacking Dask: Diving Into Dask’s Internals" tutorial. 7 | 8 | ## Running the tutorial 9 | 10 | There are two different ways in which you can set up and go through the tutorial materials, both of which are outlined in the table below. 11 | 12 | | Method | Setup | Description | 13 | | :-----------: | :-----------: | ----------- | 14 | | Binder | [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/jrbourbeau/hacking-dask/main?urlpath=lab) | Run the tutorial notebooks on mybinder.org without installing anything locally. | 15 | | Local install | [Instructions](#Local-installation-instructions) | Download the tutorial notebooks and install the necessary packages (via `conda`) locally. Setting things up locally can take a few minutes, so we recommend going through the installation steps prior to the tutorial. | 16 | 17 | 18 | ## Local installation instructions 19 | 20 | ### 1. Clone the repository 21 | 22 | First clone this repository to your local machine via: 23 | 24 | ``` 25 | git clone https://github.com/jrbourbeau/hacking-dask 26 | ``` 27 | 28 | ### 2. Download conda (if you haven't already) 29 | 30 | If you do not already have the conda package manager installed, please follow the instructions [here](https://docs.conda.io/en/latest/miniconda.html). 31 | 32 | ### 3.
Create a conda environment 33 | 34 | Navigate to the `hacking-dask/` directory and create a new conda environment with the required 35 | packages via: 36 | 37 | ```terminal 38 | cd hacking-dask 39 | conda env create --file binder/environment.yml 40 | ``` 41 | 42 | This will create a new conda environment named "hacking-dask". 43 | 44 | ### 4. Activate the environment 45 | 46 | Next, activate the environment: 47 | 48 | ``` 49 | conda activate hacking-dask 50 | ``` 51 | 52 | ### 5. Launch JupyterLab 53 | 54 | Finally, launch JupyterLab with: 55 | 56 | ``` 57 | jupyter lab 58 | ``` -------------------------------------------------------------------------------- /binder/environment.yml: -------------------------------------------------------------------------------- 1 | name: hacking-dask 2 | channels: 3 | - conda-forge 4 | dependencies: 5 | - python=3.9 6 | - dask=2021.05.0 7 | - python-graphviz 8 | - imageio 9 | - scipy 10 | - matplotlib 11 | - pip 12 | - pyarrow 13 | # JupyterLab + extensions 14 | - jupyterlab>=3 15 | - dask-labextension 16 | - ipywidgets -------------------------------------------------------------------------------- /binder/jupyterlab-workspace.json: -------------------------------------------------------------------------------- 1 | { 2 | "data": { 3 | "file-browser-filebrowser:cwd": { 4 | "path": "" 5 | }, 6 | "dask-dashboard-launcher": { 7 | "url": "DASK_DASHBOARD_URL" 8 | } 9 | }, 10 | "metadata": { 11 | "id": "/lab" 12 | } 13 | } -------------------------------------------------------------------------------- /binder/start: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Replace DASK_DASHBOARD_URL with the proxy location 4 | sed -i -e "s|DASK_DASHBOARD_URL|${JUPYTERHUB_BASE_URL}user/${JUPYTERHUB_USER}/proxy/8787|g" binder/jupyterlab-workspace.json 5 | 6 | # Import the workspace 7 | jupyter lab workspaces import binder/jupyterlab-workspace.json 8 | 9 | exec "$@" 
-------------------------------------------------------------------------------- /images/animation.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrbourbeau/hacking-dask/4b2d9e02f783e5cf93e268870b62ea8b2a0be8a1/images/animation.gif -------------------------------------------------------------------------------- /images/custom_operations_blockwise.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrbourbeau/hacking-dask/4b2d9e02f783e5cf93e268870b62ea8b2a0be8a1/images/custom_operations_blockwise.png -------------------------------------------------------------------------------- /images/custom_operations_groupby_aggregation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrbourbeau/hacking-dask/4b2d9e02f783e5cf93e268870b62ea8b2a0be8a1/images/custom_operations_groupby_aggregation.png -------------------------------------------------------------------------------- /images/custom_operations_map_blocks.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrbourbeau/hacking-dask/4b2d9e02f783e5cf93e268870b62ea8b2a0be8a1/images/custom_operations_map_blocks.png -------------------------------------------------------------------------------- /images/custom_operations_reduction.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrbourbeau/hacking-dask/4b2d9e02f783e5cf93e268870b62ea8b2a0be8a1/images/custom_operations_reduction.png -------------------------------------------------------------------------------- /images/dask-array.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrbourbeau/hacking-dask/4b2d9e02f783e5cf93e268870b62ea8b2a0be8a1/images/dask-array.png 
-------------------------------------------------------------------------------- /images/dask-cluster.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrbourbeau/hacking-dask/4b2d9e02f783e5cf93e268870b62ea8b2a0be8a1/images/dask-cluster.png -------------------------------------------------------------------------------- /images/dask-dataframe.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrbourbeau/hacking-dask/4b2d9e02f783e5cf93e268870b62ea8b2a0be8a1/images/dask-dataframe.png -------------------------------------------------------------------------------- /images/dask-overview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrbourbeau/hacking-dask/4b2d9e02f783e5cf93e268870b62ea8b2a0be8a1/images/dask-overview.png -------------------------------------------------------------------------------- /images/dask_horizontal.svg: -------------------------------------------------------------------------------- 1 | dask -------------------------------------------------------------------------------- /images/inc-add.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrbourbeau/hacking-dask/4b2d9e02f783e5cf93e268870b62ea8b2a0be8a1/images/inc-add.png -------------------------------------------------------------------------------- /puzzle/bicycle.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrbourbeau/hacking-dask/4b2d9e02f783e5cf93e268870b62ea8b2a0be8a1/puzzle/bicycle.png -------------------------------------------------------------------------------- /puzzle/bicycle_0_0.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jrbourbeau/hacking-dask/4b2d9e02f783e5cf93e268870b62ea8b2a0be8a1/puzzle/bicycle_0_0.png -------------------------------------------------------------------------------- /puzzle/bicycle_0_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrbourbeau/hacking-dask/4b2d9e02f783e5cf93e268870b62ea8b2a0be8a1/puzzle/bicycle_0_1.png -------------------------------------------------------------------------------- /puzzle/bicycle_1_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrbourbeau/hacking-dask/4b2d9e02f783e5cf93e268870b62ea8b2a0be8a1/puzzle/bicycle_1_0.png -------------------------------------------------------------------------------- /puzzle/bicycle_1_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jrbourbeau/hacking-dask/4b2d9e02f783e5cf93e268870b62ea8b2a0be8a1/puzzle/bicycle_1_1.png --------------------------------------------------------------------------------