├── images ├── inc-add.png ├── animation.gif ├── dask-array.png ├── custom_operations_reduction.png ├── custom_operations_map_blocks.png └── dask_horizontal.svg ├── requirements.txt ├── environment.yml ├── LICENSE ├── .github └── workflows │ └── build.yml ├── README.md ├── .gitignore ├── 0-welcome.ipynb ├── scripts └── create_data.ipynb ├── appendix-graph-optimization.ipynb ├── 1-overview.ipynb ├── 4-performance-optimization.ipynb ├── 2-custom-operations.ipynb └── 3-distributed-scheduler.ipynb /images/inc-add.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dask-contrib/dask-tutorial-advanced/HEAD/images/inc-add.png -------------------------------------------------------------------------------- /images/animation.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dask-contrib/dask-tutorial-advanced/HEAD/images/animation.gif -------------------------------------------------------------------------------- /images/dask-array.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dask-contrib/dask-tutorial-advanced/HEAD/images/dask-array.png -------------------------------------------------------------------------------- /images/custom_operations_reduction.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dask-contrib/dask-tutorial-advanced/HEAD/images/custom_operations_reduction.png -------------------------------------------------------------------------------- /images/custom_operations_map_blocks.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dask-contrib/dask-tutorial-advanced/HEAD/images/custom_operations_map_blocks.png -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | dask==2023.7.0 2 | ipycytoscape # Not using `python-graphviz` here 3 | imageio 4 | scipy 5 | matplotlib 6 | pip 7 | pyarrow==12 8 | s3fs==2023.6.0 9 | coiled==0.8.3 10 | gilknocker 11 | # JupyterLab + extensions 12 | jupyterlab>=3,<4 13 | dask-labextension 14 | ipywidgets -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: dask-tutorial-advanced 2 | channels: 3 | - conda-forge 4 | dependencies: 5 | - python=3.11 6 | - dask=2023.7.0 7 | - python-graphviz 8 | - imageio 9 | - scipy 10 | - matplotlib 11 | - pip 12 | - pyarrow=12 13 | - s3fs=2023.6.0 14 | - coiled=0.8.3 15 | - gilknocker 16 | # JupyterLab + extensions 17 | - jupyterlab>=3,<4 18 | - dask-labextension 19 | - ipywidgets -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 Dask Contributors 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished 
to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /.github/workflows/build.yml: -------------------------------------------------------------------------------- 1 | name: Build 2 | on: 3 | push: 4 | pull_request: 5 | workflow_dispatch: 6 | 7 | jobs: 8 | conda: 9 | runs-on: ${{ matrix.os }} 10 | strategy: 11 | matrix: 12 | os: [windows-latest, ubuntu-latest, macos-latest] 13 | 14 | steps: 15 | - name: Checkout source 16 | uses: actions/checkout@v3 17 | 18 | - name: Setup Conda Environment 19 | uses: conda-incubator/setup-miniconda@v2.2.0 20 | with: 21 | miniforge-variant: Mambaforge 22 | miniforge-version: latest 23 | use-mamba: true 24 | channel-priority: strict 25 | environment-file: environment.yml 26 | activate-environment: dask-tutorial-advanced 27 | auto-activate-base: false 28 | 29 | pip: 30 | runs-on: ${{ matrix.os }} 31 | strategy: 32 | matrix: 33 | os: [windows-latest, ubuntu-latest, macos-latest] 34 | 35 | steps: 36 | - name: Checkout source 37 | uses: actions/checkout@v3 38 | 39 | - uses: actions/setup-python@v4 40 | with: 41 | python-version: "3.11" 42 | 43 | - run: pip install -r requirements.txt -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Dask Tutorial - Advanced 2 | 3 | [![Build](https://github.com/dask-contrib/dask-tutorial-advanced/actions/workflows/build.yml/badge.svg)](https://github.com/dask-contrib/dask-tutorial-advanced/actions/workflows/build.yml) 4 | 5 | This repository contains materials for the "Advanced Dask Tutorial" that will be presented at SciPy 2023. 6 | 7 | ## Running the tutorial 8 | 9 | Download the tutorial notebooks and install the necessary packages (via `conda`) locally. Setting things up locally can take a few minutes, so we recommend going through the installation steps prior to the tutorial. 10 | 11 | ### 1. Clone the repository 12 | 13 | First clone this repository to your local machine via: 14 | 15 | ``` 16 | git clone https://github.com/dask-contrib/dask-tutorial-advanced 17 | ``` 18 | 19 | ### 2. Download conda (if you haven't already) 20 | 21 | If you do not already have the conda package manager installed, please follow the instructions [here](https://docs.conda.io/en/latest/miniconda.html). 22 | 23 | ### 3. Create a conda environment 24 | 25 | Navigate to the `dask-tutorial-advanced/` directory and create a new conda environment with the required 26 | packages via: 27 | 28 | ```terminal 29 | cd dask-tutorial-advanced 30 | conda env create --file environment.yml 31 | ``` 32 | 33 | This will create a new conda environment named "dask-tutorial-advanced". 34 | 35 | ### 4. 
Activate the environment 36 | 37 | Next, activate the environment: 38 | 39 | ``` 40 | conda activate dask-tutorial-advanced 41 | ``` 42 | 43 | ### 5. Launch JupyterLab 44 | 45 | Finally, launch JupyterLab with: 46 | 47 | ``` 48 | jupyter lab 49 | ``` 50 | 51 | ### Instructions for Notebook 4 52 | 53 | We will be launching the last notebook on the cloud and creating bigger clusters. From a terminal where you have the `dask-tutorial-advanced` environment activated: 54 | 55 | ``` 56 | coiled login --token f5924259c2b04a54a8f25a1e07941177-bed7217e66f325cdd64c69d4c63e2a893dc02b86 --account dask-tutorials 57 | coiled notebook start --software dask-tutorial-advanced 58 | ``` 59 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | dask-worker-space/ 2 | my_report.html 3 | example_data/ 4 | mydask.png 5 | .DS_Store 6 | 7 | # Byte-compiled / optimized / DLL files 8 | __pycache__/ 9 | *.py[cod] 10 | *$py.class 11 | 12 | # C extensions 13 | *.so 14 | 15 | # Distribution / packaging 16 | .Python 17 | build/ 18 | develop-eggs/ 19 | dist/ 20 | downloads/ 21 | eggs/ 22 | .eggs/ 23 | lib/ 24 | lib64/ 25 | parts/ 26 | sdist/ 27 | var/ 28 | wheels/ 29 | share/python-wheels/ 30 | *.egg-info/ 31 | .installed.cfg 32 | *.egg 33 | MANIFEST 34 | 35 | # PyInstaller 36 | # Usually these files are written by a python script from a template 37 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 38 | *.manifest 39 | *.spec 40 | 41 | # Installer logs 42 | pip-log.txt 43 | pip-delete-this-directory.txt 44 | 45 | # Unit test / coverage reports 46 | htmlcov/ 47 | .tox/ 48 | .nox/ 49 | .coverage 50 | .coverage.* 51 | .cache 52 | nosetests.xml 53 | coverage.xml 54 | *.cover 55 | *.py,cover 56 | .hypothesis/ 57 | .pytest_cache/ 58 | cover/ 59 | 60 | # Translations 61 | *.mo 62 | *.pot 63 | 64 | # Django stuff: 65 | *.log 66 | local_settings.py 67 | db.sqlite3 68 | db.sqlite3-journal 69 | 70 | # Flask stuff: 71 | instance/ 72 | .webassets-cache 73 | 74 | # Scrapy stuff: 75 | .scrapy 76 | 77 | # Sphinx documentation 78 | docs/_build/ 79 | 80 | # PyBuilder 81 | .pybuilder/ 82 | target/ 83 | 84 | # Jupyter Notebook 85 | .ipynb_checkpoints 86 | 87 | # IPython 88 | profile_default/ 89 | ipython_config.py 90 | 91 | # pyenv 92 | # For a library or package, you might want to ignore these files since the code is 93 | # intended to run in multiple environments; otherwise, check them in: 94 | # .python-version 95 | 96 | # pipenv 97 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 98 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 99 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 100 | # install all needed dependencies. 101 | #Pipfile.lock 102 | 103 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 104 | __pypackages__/ 105 | 106 | # Celery stuff 107 | celerybeat-schedule 108 | celerybeat.pid 109 | 110 | # SageMath parsed files 111 | *.sage.py 112 | 113 | # Environments 114 | .env 115 | .venv 116 | env/ 117 | venv/ 118 | ENV/ 119 | env.bak/ 120 | venv.bak/ 121 | 122 | # Spyder project settings 123 | .spyderproject 124 | .spyproject 125 | 126 | # Rope project settings 127 | .ropeproject 128 | 129 | # mkdocs documentation 130 | /site 131 | 132 | # mypy 133 | .mypy_cache/ 134 | .dmypy.json 135 | dmypy.json 136 | 137 | # Pyre type checker 138 | .pyre/ 139 | 140 | # pytype static type analyzer 141 | .pytype/ 142 | 143 | # Cython debug symbols 144 | cython_debug/ 145 | -------------------------------------------------------------------------------- /images/dask_horizontal.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | -------------------------------------------------------------------------------- /0-welcome.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "attachments": {}, 5 | "cell_type": "markdown", 6 | "metadata": { 7 | "slideshow": { 8 | "slide_type": "slide" 9 | } 10 | }, 11 | "source": [ 12 | "\"Dask\n", 15 | " \n", 16 | "# Dask Tutorial - Advanced\n", 17 | "\n", 18 | "## Materials and setup\n", 19 | "\n", 20 | "The materials for this tutorial are available at https://github.com/dask-contrib/dask-tutorial-advanced\n", 21 | "\n", 22 | "## About the instructors\n", 23 | "\n", 24 | "#### [Charles Blackmon-Luca](https://github.com/charlesbluca) — Software Engineer, [RAPIDS](https://rapids.ai/)\n", 25 | "#### [James Bourbeau](https://www.jamesbourbeau.com) — Lead Open Source Software Engineer, [Coiled](https://coiled.io/)\n", 26 | "#### [Naty Clementi](https://github.com/ncclementi) — Software Engineer, [Coiled](https://coiled.io/)\n", 27 | "#### [Julia Signell](https://jsignell.github.io) — Software Engineer, [Element 84](https://www.element84.com/)\n", 28 | "\n", 29 | "## Tutorial goals\n", 30 | "\n", 31 | "The goal of this tutorial is to cover more advanced features of Dask like task graph optimization, the worker and scheduler plugin system, how to inspect the internal state of a cluster, debugging distributed computations, diagnosing performance issues, and more.\n", 32 | "\n", 33 | "Attendees should walk away with an introduction to more advanced features, ideas of how they can apply these features effectively to their own data intensive workloads, and a deeper understanding of Dask’s internals.\n", 34 | "\n", 35 | "> ℹ️ NOTE: While there is a brief overview notebook, this tutorial largely assumes some prior knowledge of Dask. If you are new to Dask, we recommend going through the [Dask tutorial](https://tutorial.dask.org) to get an introduction to Dask prior to going through this tutorial.\n", 36 | "\n", 37 | "## Outline\n", 38 | "\n", 39 | "The tutorial consists of several Jupyter notebooks which we will cover in the order listed below:\n", 40 | "\n", 41 | "- [1-overview.ipynb](1-overview.ipynb)\n", 42 | "- [2-custom-operations.ipynb](2-custom-operations.ipynb)\n", 43 | "- [3-distributed-scheduler.ipynb](3-distributed-scheduler.ipynb)\n", 44 | "- [4-performance-optimization.ipynb](4-performance-optimization.ipynb)\n", 45 | "\n", 46 | "Each notebook also contains hands-on exercises to illustrate the concepts being presented. 
Let's look at our first example to get a sense for how they work.\n", 47 | "\n", 48 | "### Exercise: Print \"Hello world!\"\n", 49 | "\n", 50 | "Use Python to print the string \"Hello world!\" to the screen." 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "# Your solution here" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": { 66 | "jupyter": { 67 | "source_hidden": true 68 | }, 69 | "tags": [] 70 | }, 71 | "outputs": [], 72 | "source": [ 73 | "# A solution\n", 74 | "print(\"Hello world!\")" 75 | ] 76 | }, 77 | { 78 | "attachments": {}, 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "### Next steps\n", 83 | "\n", 84 | "Let's start by going through a brief overview of Dask's basics over in [1-overview.ipynb](1-overview.ipynb)." 85 | ] 86 | } 87 | ], 88 | "metadata": { 89 | "kernelspec": { 90 | "display_name": "Python 3 (ipykernel)", 91 | "language": "python", 92 | "name": "python3" 93 | }, 94 | "language_info": { 95 | "codemirror_mode": { 96 | "name": "ipython", 97 | "version": 3 98 | }, 99 | "file_extension": ".py", 100 | "mimetype": "text/x-python", 101 | "name": "python", 102 | "nbconvert_exporter": "python", 103 | "pygments_lexer": "ipython3", 104 | "version": "3.10.12" 105 | } 106 | }, 107 | "nbformat": 4, 108 | "nbformat_minor": 4 109 | } 110 | -------------------------------------------------------------------------------- /scripts/create_data.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "id": "736f0576-2486-4551-9b48-3d4288af366e", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import coiled\n", 11 | "from dask.distributed import Client\n", 12 | "import dask" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "id": "6ae7d14d-7822-4c67-b316-1440689355b4", 19 | "metadata": {}, 20 | "outputs": [], 21 | "source": [ 22 | "import dask.dataframe as dd" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "id": "b7256c61-35a8-4d69-9fab-650220e47ece", 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "cluster = coiled.Cluster(n_workers=100)\n", 33 | "client = cluster.get_client()" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "id": "99af056e-fa71-487a-8ffa-0e4c0ea81512", 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "client" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": null, 49 | "id": "a62cced4-7e5a-4db4-a2d2-abde5c259936", 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "dask.config.set({\"dataframe.convert-string\": True})" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": null, 59 | "id": "11a35129-451c-41a0-9930-a86dccbecab1", 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "df = dd.read_parquet(\n", 64 | " \"s3://coiled-datasets/uber-lyft-tlc/\",\n", 65 | " storage_options={'anon': True}\n", 66 | ")" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "id": "ce0d1d63-70c4-4bdc-b208-d84e3c8e81db", 73 | "metadata": { 74 | "tags": [] 75 | }, 76 | "outputs": [], 77 | "source": [ 78 | "df_full = df" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "id": "a4087889-a37a-4179-97bb-a695a0839ba7", 85 | "metadata": {}, 86 | 
"outputs": [], 87 | "source": [ 88 | "df.dtypes" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "id": "b44160f5-b4ba-4f9b-9d51-c20bfc963011", 95 | "metadata": {}, 96 | "outputs": [], 97 | "source": [ 98 | "df.head()" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "id": "21560272-24c0-4c65-b43e-7c712b570797", 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "dask.utils.format_bytes(\n", 109 | " df.memory_usage(deep=True).sum().compute()\n", 110 | ")" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "id": "d4bb9737-8e38-4aee-8b85-df43d941a03c", 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "df_sample = df.sample(frac=0.2)" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "id": "bbdaedf0-71f2-44be-97ee-0d251b38ec51", 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "df_sample = df_sample.persist()" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "id": "882dd628-920e-4621-9f9b-531b4a309889", 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "dask.utils.format_bytes(\n", 141 | " df_sample.memory_usage(deep=True).sum().compute()\n", 142 | ")" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": null, 148 | "id": "339e5b8b-14f3-4c4c-ba74-43823488e3d9", 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "from dask.sizeof import sizeof" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "id": "c18d100f-e842-4ace-aad6-3a179792be53", 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [ 162 | "partitions_mem_stats = df_sample.map_partitions(sizeof).compute()" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "id": "eec0e812-7070-4715-be6b-588a7d8b743e", 169 | "metadata": {}, 170 | "outputs": [], 171 | "source": [ 172 | "(partitions_mem_stats / 1024**2).describe() #in MiB" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "id": "135514a0-cfc0-4d2d-9187-758ff5e2d2f4", 179 | "metadata": {}, 180 | "outputs": [], 181 | "source": [ 182 | "dask.utils.format_bytes(df_sample.partitions[0].memory_usage(deep=True).compute().sum())" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "id": "71493360-320b-461c-99c8-33cbbcc4cc27", 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "%%time\n", 193 | "#passanger fare\n", 194 | "df_sample.base_passenger_fare.sum().compute() / 1e9" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": null, 200 | "id": "cde07a38-b3f4-42cb-aadd-ba91360170f1", 201 | "metadata": {}, 202 | "outputs": [], 203 | "source": [ 204 | "%%time\n", 205 | "#tip\n", 206 | "df_sample.tips.sum().compute() / 1e6" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "id": "08182a38-04c2-49f7-bfc4-68b474824d47", 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "%%time\n", 217 | "df_sample.loc[lambda x: x.tips > 0].groupby(\"hvfhs_license_num\").tips.agg([\"sum\", \"mean\"]).compute()" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "id": "231ab6a3-edfe-4b8e-8c21-0dbf5577327b", 223 | "metadata": {}, 224 | "source": [ 225 | "## Partition size 1MB \n", 226 | "\n", 227 | "Runs are ~11X slower compared to 13MB partitions" 228 | ] 229 | }, 
230 | { 231 | "cell_type": "code", 232 | "execution_count": null, 233 | "id": "b3754d13-31c4-49db-b8db-de9bff627b23", 234 | "metadata": {}, 235 | "outputs": [], 236 | "source": [ 237 | "df_sample = df_sample.repartition(partition_size=\"10MB\").persist()" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": null, 243 | "id": "8a0dc205-9369-467f-97bc-0c122dff3fa3", 244 | "metadata": {}, 245 | "outputs": [], 246 | "source": [ 247 | "dask.utils.format_bytes(df_sample.partitions[0].memory_usage(deep=True).compute().sum())" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": null, 253 | "id": "6d943324-99b0-4fd0-b8ec-7d0cca68488a", 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [ 257 | "%%time\n", 258 | "#passanger fare\n", 259 | "df_sample.base_passenger_fare.sum().compute() / 1e9" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": null, 265 | "id": "f323d74a-7968-4243-9b7d-02737dbee77d", 266 | "metadata": {}, 267 | "outputs": [], 268 | "source": [ 269 | "%%time\n", 270 | "#tip\n", 271 | "df_sample.tips.sum().compute() / 1e6" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": null, 277 | "id": "a44878c6-c38b-42a8-9d29-e5e2f2a47eaf", 278 | "metadata": {}, 279 | "outputs": [], 280 | "source": [ 281 | "%%time\n", 282 | "df_sample.loc[lambda x: x.tips > 0].groupby(\"hvfhs_license_num\").tips.agg([\"sum\", \"mean\"]).compute()" 283 | ] 284 | }, 285 | { 286 | "cell_type": "markdown", 287 | "id": "640a6505-9b86-4c3d-88a8-4ac845de9c4b", 288 | "metadata": {}, 289 | "source": [ 290 | "## Write 1MB partition data to parquet and csv" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": null, 296 | "id": "419db173-6634-4e04-8b46-29b778caebb4", 297 | "metadata": { 298 | "tags": [] 299 | }, 300 | "outputs": [], 301 | "source": [ 302 | "df_sample.to_parquet(\"s3://coiled-datasets/uber-lyft-tlc-sample/parquet-0.2-10/\");" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": null, 308 | "id": "fa488722-b8e0-4c5f-a909-305f1dc8f5c0", 309 | "metadata": {}, 310 | "outputs": [], 311 | "source": [ 312 | "df_sample.to_csv(\"s3://coiled-datasets/uber-lyft-tlc-sample/csv-0.2-10/\");" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": null, 318 | "id": "72a11c39-bcbb-435d-a348-76901598b338", 319 | "metadata": { 320 | "tags": [] 321 | }, 322 | "outputs": [], 323 | "source": [ 324 | "ddf = dd.read_csv(\n", 325 | " \"s3://coiled-datasets/uber-lyft-tlc-sample/csv-0.2-10/*\", \n", 326 | " dtype={\"wav_match_flag\": \"category\"},\n", 327 | ")" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": null, 333 | "id": "5e861ff6-d380-4fcc-bc63-9b633b9e8ca3", 334 | "metadata": { 335 | "tags": [] 336 | }, 337 | "outputs": [], 338 | "source": [ 339 | "dask.utils.format_bytes(\n", 340 | " ddf.memory_usage(deep=True).sum().compute()\n", 341 | ")" 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": null, 347 | "id": "84ab159d-613d-4d04-95f5-104c3e17b75a", 348 | "metadata": {}, 349 | "outputs": [], 350 | "source": [] 351 | } 352 | ], 353 | "metadata": { 354 | "kernelspec": { 355 | "display_name": "Python 3 (ipykernel)", 356 | "language": "python", 357 | "name": "python3" 358 | }, 359 | "language_info": { 360 | "codemirror_mode": { 361 | "name": "ipython", 362 | "version": 3 363 | }, 364 | "file_extension": ".py", 365 | "mimetype": "text/x-python", 366 | "name": "python", 367 | "nbconvert_exporter": 
"python", 368 | "pygments_lexer": "ipython3", 369 | "version": "3.10.12" 370 | } 371 | }, 372 | "nbformat": 4, 373 | "nbformat_minor": 5 374 | } 375 | -------------------------------------------------------------------------------- /appendix-graph-optimization.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "attachments": {}, 5 | "cell_type": "markdown", 6 | "id": "recovered-trouble", 7 | "metadata": {}, 8 | "source": [ 9 | "\"Dask\n", 10 | "\n", 11 | "# Graph Optimizations\n", 12 | "\n", 13 | "In general, there are two goals when doing graph optimizations:\n", 14 | "\n", 15 | "1. Simplify computation\n", 16 | "2. Improve parallelism\n", 17 | "\n", 18 | "Simplifying computation can be done on a graph level by removing unnecessary tasks (``cull``).\n", 19 | "\n", 20 | "Parallelism can be improved by reducing\n", 21 | "inter-task communication, whether by fusing many tasks into one (``fuse``), or\n", 22 | "by inlining cheap operations (``inline``, ``inline_functions``).\n", 23 | "\n", 24 | "\n", 25 | "**Related Documentation**\n", 26 | "\n", 27 | " - [Optimization](https://docs.dask.org/en/latest/optimize.html)\n", 28 | "\n", 29 | "## Example\n", 30 | "\n", 31 | "Suppose you had a custom Dask graph for doing a word counting task:" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "id": "disabled-greek", 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "import dask\n", 42 | "\n", 43 | "def print_and_return(string):\n", 44 | " print(string)\n", 45 | " return string\n", 46 | "\n", 47 | "def format_str(count, val, nwords):\n", 48 | " return (f'word list has {count} occurrences of '\n", 49 | " f'{val}, out of {nwords} words')\n", 50 | "\n", 51 | "dsk = {'words': 'apple orange apple pear orange pear pear',\n", 52 | " 'nwords': (len, (str.split, 'words')),\n", 53 | " 'val1': 'orange',\n", 54 | " 'val2': 'apple',\n", 55 | " 'val3': 'pear',\n", 56 | " 'count1': (str.count, 'words', 'val1'),\n", 57 | " 'count2': (str.count, 'words', 'val2'),\n", 58 | " 'count3': (str.count, 'words', 'val3'),\n", 59 | " 'format1': (format_str, 'count1', 'val1', 'nwords'),\n", 60 | " 'format2': (format_str, 'count2', 'val2', 'nwords'),\n", 61 | " 'format3': (format_str, 'count3', 'val3', 'nwords'),\n", 62 | " 'print1': (print_and_return, 'format1'),\n", 63 | " 'print2': (print_and_return, 'format2'),\n", 64 | " 'print3': (print_and_return, 'format3'),\n", 65 | "}\n", 66 | "\n", 67 | "dask.visualize(dsk, verbose=True, collapse_outputs=True)" 68 | ] 69 | }, 70 | { 71 | "attachments": {}, 72 | "cell_type": "markdown", 73 | "id": "da3a382a", 74 | "metadata": {}, 75 | "source": [ 76 | "In this example we are:\n", 77 | "\n", 78 | "1. counting the frequency of the words ``'orange'``, ``'apple'``, and ``'pear'`` in the list of words\n", 79 | "2. formatting an output string reporting the results\n", 80 | "3. 
printing the output and returning the output string" 81 | ] 82 | }, 83 | { 84 | "attachments": {}, 85 | "cell_type": "markdown", 86 | "id": "circular-driving", 87 | "metadata": {}, 88 | "source": [ 89 | "### Cull\n", 90 | "\n", 91 | "To perform the computation, we first remove unnecessary components from the\n", 92 | "graph using the ``cull`` function and then pass the Dask graph and the desired\n", 93 | "output keys to a scheduler ``get`` function:" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "id": "meaningful-knife", 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "from dask.threaded import get\n", 104 | "from dask.optimization import cull\n", 105 | "\n", 106 | "outputs = ['print1', 'print2']\n", 107 | "dsk1, dependencies = cull(dsk, outputs) # remove unnecessary tasks from the graph\n", 108 | "\n", 109 | "results = get(dsk1, outputs)\n", 110 | "dask.visualize(dsk1, verbose=True, collapse_outputs=True)" 111 | ] 112 | }, 113 | { 114 | "attachments": {}, 115 | "cell_type": "markdown", 116 | "id": "afraid-asbestos", 117 | "metadata": {}, 118 | "source": [ 119 | "As can be seen above, the scheduler computed only the requested outputs\n", 120 | "(``'print3'`` was never computed). This is because we called the\n", 121 | "``dask.optimization.cull`` function, which removes the unnecessary tasks from\n", 122 | "the graph.\n", 123 | "\n", 124 | "Culling is part of the default optimization pass of almost all collections.\n", 125 | "Often you want to call it somewhat early to reduce the amount of work done in\n", 126 | "later steps." 127 | ] 128 | }, 129 | { 130 | "attachments": {}, 131 | "cell_type": "markdown", 132 | "id": "integral-handling", 133 | "metadata": {}, 134 | "source": [ 135 | "### Inline\n", 136 | "\n", 137 | "Looking at the word counting task graph, there are multiple accesses to constants such\n", 138 | "as ``'val1'`` or ``'val2'``. These can be inlined into the\n", 139 | "tasks to improve efficiency using the ``inline`` function. For example:" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": null, 145 | "id": "lightweight-gamma", 146 | "metadata": {}, 147 | "outputs": [], 148 | "source": [ 149 | "from dask.optimization import inline\n", 150 | "\n", 151 | "dsk2 = inline(dsk1, dependencies=dependencies)\n", 152 | "results = get(dsk2, outputs)\n", 153 | "dask.visualize(dsk2, verbose=True, collapse_outputs=True)" 154 | ] 155 | }, 156 | { 157 | "attachments": {}, 158 | "cell_type": "markdown", 159 | "id": "promising-retro", 160 | "metadata": {}, 161 | "source": [ 162 | "Now we have two sets of *almost* linear task chains. The only link between them\n", 163 | "is the word counting function. For cheap operations like this, the\n", 164 | "serialization cost may be larger than the actual computation, so it may be\n", 165 | "faster to do the computation more than once, rather than passing the results to\n", 166 | "all nodes. 
To perform this function inlining, the ``inline_functions`` function\n", 167 | "can be used:" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "id": "stretch-doctor", 174 | "metadata": {}, 175 | "outputs": [], 176 | "source": [ 177 | "from dask.optimization import inline_functions\n", 178 | "\n", 179 | "dsk3 = inline_functions(dsk2, outputs, [len, str.split],\n", 180 | " dependencies=dependencies)\n", 181 | "results = get(dsk3, outputs)\n", 182 | "dask.visualize(dsk3, verbose=True, collapse_outputs=True)" 183 | ] 184 | }, 185 | { 186 | "attachments": {}, 187 | "cell_type": "markdown", 188 | "id": "5e54f219", 189 | "metadata": {}, 190 | "source": [ 191 | "Now we have a set of purely linear tasks. We’d like to have the scheduler run all of these on the same worker to reduce data serialization between workers. One option is just to merge these linear chains into one big task using the ``fuse`` function:" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "id": "cleared-shoulder", 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "from dask.optimization import fuse\n", 202 | "\n", 203 | "dsk4, dependencies = fuse(dsk3)\n", 204 | "results = get(dsk4, outputs)\n", 205 | "dask.visualize(dsk4, verbose=True, collapse_outputs=True)" 206 | ] 207 | }, 208 | { 209 | "attachments": {}, 210 | "cell_type": "markdown", 211 | "id": "weird-blade", 212 | "metadata": {}, 213 | "source": [ 214 | "### Result\n", 215 | "\n", 216 | "Putting it all together:" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "id": "experimental-special", 223 | "metadata": {}, 224 | "outputs": [], 225 | "source": [ 226 | "def optimize_and_get(dsk, keys):\n", 227 | " dsk1, deps = cull(dsk, keys)\n", 228 | " dsk2 = inline(dsk1, dependencies=deps)\n", 229 | " dsk3 = inline_functions(dsk2, keys, [len, str.split],\n", 230 | " dependencies=deps)\n", 231 | " dsk4, deps = fuse(dsk3)\n", 232 | " return get(dsk4, keys)\n", 233 | "\n", 234 | "optimize_and_get(dsk, outputs)" 235 | ] 236 | }, 237 | { 238 | "attachments": {}, 239 | "cell_type": "markdown", 240 | "id": "hearing-rochester", 241 | "metadata": {}, 242 | "source": [ 243 | "In summary, the above operations accomplish the following:\n", 244 | "\n", 245 | "1. Removed tasks unnecessary for the desired output using ``cull``\n", 246 | "2. Inlined constants using ``inline``\n", 247 | "3. Inlined cheap computations using ``inline_functions``, improving parallelism\n", 248 | "4. Fused linear tasks together to ensure they run on the same worker using ``fuse``\n", 249 | "\n", 250 | "These optimizations are already performed automatically in the Dask collections." 251 | ] 252 | }, 253 | { 254 | "attachments": {}, 255 | "cell_type": "markdown", 256 | "id": "b222c529-d522-47ec-853d-d23d2744b7b1", 257 | "metadata": {}, 258 | "source": [ 259 | "## Conclusion\n", 260 | "\n", 261 | "Optimizations in Dask let's you simplify computation and improve parallelism. There are some great ones included by default (`cull`, `inline`, `fuse`), but sometimes it can be really powerful to write custom optimizations and either use them on existing collections or on custom collections. We'll touch on this a bit more in the next section about custom collections." 
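The conclusion above mentions that it can be powerful to write custom optimizations and apply them to existing collections, but the notebook only shows the low-level `get` route. As a minimal sketch (not part of the original notebook), the snippet below bundles the built-in `cull` and `fuse` passes into one function and registers it via `dask.config`; the `delayed_optimize` key follows the collection-level customization hooks described in the Dask optimization docs, so treat the exact key name as an assumption rather than a guarantee.

```python
import dask
from dask import delayed
from dask.optimization import cull, fuse

def my_optimize(dsk, keys):
    # Reuse the built-in passes: drop tasks not needed for `keys`,
    # then fuse linear task chains into single tasks.
    dsk, dependencies = cull(dsk, keys)
    dsk, dependencies = fuse(dsk, keys, dependencies=dependencies)
    return dsk

# A tiny delayed computation to optimize (illustrative only).
lazy_total = delayed(sum)([delayed(abs)(i) for i in range(-3, 3)])

# Register the custom pass for delayed-based graphs within this block only.
# (`delayed_optimize` is assumed from the documented array_optimize /
#  dataframe_optimize / delayed_optimize configuration hooks.)
with dask.config.set(delayed_optimize=my_optimize):
    print(lazy_total.compute())  # 9
```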
262 | ] 263 | } 264 | ], 265 | "metadata": { 266 | "jupytext": { 267 | "cell_metadata_filter": "-all", 268 | "main_language": "python", 269 | "notebook_metadata_filter": "-all", 270 | "text_representation": { 271 | "extension": ".md", 272 | "format_name": "markdown" 273 | } 274 | }, 275 | "kernelspec": { 276 | "display_name": "Python 3 (ipykernel)", 277 | "language": "python", 278 | "name": "python3" 279 | }, 280 | "language_info": { 281 | "codemirror_mode": { 282 | "name": "ipython", 283 | "version": 3 284 | }, 285 | "file_extension": ".py", 286 | "mimetype": "text/x-python", 287 | "name": "python", 288 | "nbconvert_exporter": "python", 289 | "pygments_lexer": "ipython3", 290 | "version": "3.10.11" 291 | }, 292 | "toc-autonumbering": false, 293 | "toc-showcode": false, 294 | "toc-showmarkdowntxt": false, 295 | "toc-showtags": false 296 | }, 297 | "nbformat": 4, 298 | "nbformat_minor": 5 299 | } 300 | -------------------------------------------------------------------------------- /1-overview.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "attachments": {}, 5 | "cell_type": "markdown", 6 | "metadata": { 7 | "slideshow": { 8 | "slide_type": "slide" 9 | } 10 | }, 11 | "source": [ 12 | "\"Dask\n", 15 | " \n", 16 | "# Parallel Computing in Python with Dask\n", 17 | "\n", 18 | "This notebook provides a high-level overview of Dask. We discuss why you might want to use Dask, high-level and low-level APIs for generating computational graphs, and Dask's schedulers which enable the parallel execution of these graphs." 19 | ] 20 | }, 21 | { 22 | "attachments": {}, 23 | "cell_type": "markdown", 24 | "metadata": { 25 | "slideshow": { 26 | "slide_type": "subslide" 27 | } 28 | }, 29 | "source": [ 30 | "# Overview\n", 31 | "\n", 32 | "[Dask](https://docs.dask.org) is a flexible, [open source](https://github.com/dask/dask) library for parallel and distributed computing in Python. Dask is designed to scale the existing Python ecosystem.\n", 33 | "\n", 34 | "You might want to use Dask because it:\n", 35 | "\n", 36 | "- Enables parallel and larger-than-memory computations\n", 37 | "\n", 38 | "- Uses familiar APIs you're used to from projects like NumPy, pandas, and scikit-learn\n", 39 | "\n", 40 | "- Allows you to scale existing workflows with minimal code changes\n", 41 | "\n", 42 | "- Dask works on your laptop, but also scales out to large clusters\n", 43 | "\n", 44 | "- Offers great built-in diagnosic tools" 45 | ] 46 | }, 47 | { 48 | "attachments": {}, 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "### Components of Dask\n", 53 | "\n", 54 | "From a high level, Dask is comprised of two main components:\n", 55 | "\n", 56 | "1. **Dask collections** which extend common interfaces like NumPy, pandas, and Python iterators to larger-than-memory or distributed environments by creating *task graphs*\n", 57 | "2. 
**Schedulers** which compute task graphs produced by Dask collections in parallel\n", 58 | "\n", 59 | "\"Dask" 62 | ] 63 | }, 64 | { 65 | "attachments": {}, 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "### Task Graphs" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": null, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "def inc(i):\n", 79 | " return i + 1\n", 80 | "\n", 81 | "def add(a, b):\n", 82 | " return a + b\n", 83 | "\n", 84 | "a, b = 1, 12\n", 85 | "c = inc(a)\n", 86 | "d = inc(b)\n", 87 | "output = add(c, d)\n", 88 | "\n", 89 | "print(f'output = {output}')" 90 | ] 91 | }, 92 | { 93 | "attachments": {}, 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "This computation can be encoded in the following task graph:" 98 | ] 99 | }, 100 | { 101 | "attachments": {}, 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "![](images/inc-add.png)" 106 | ] 107 | }, 108 | { 109 | "attachments": {}, 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "\n", 114 | "- Graph of inter-related tasks with dependencies between them\n", 115 | "\n", 116 | "- Circular nodes in the graph are Python function calls\n", 117 | "\n", 118 | "- Square nodes are Python objects that are created by one task as output and can be used as inputs in another task" 119 | ] 120 | }, 121 | { 122 | "attachments": {}, 123 | "cell_type": "markdown", 124 | "metadata": { 125 | "slideshow": { 126 | "slide_type": "subslide" 127 | } 128 | }, 129 | "source": [ 130 | "# Dask Collections\n", 131 | "\n", 132 | "Let's looks at two Dask user interfaces: Dask Array and Dask Delayed.\n", 133 | "\n", 134 | "## Dask Arrays\n", 135 | "\n", 136 | "- Dask arrays are chunked, n-dimensional arrays\n", 137 | "\n", 138 | "- Can think of a Dask array as a collection of NumPy `ndarray` arrays\n", 139 | "\n", 140 | "- Dask arrays implement a large subset of the NumPy API using blocked algorithms\n", 141 | "\n", 142 | "- For many purposes Dask arrays can serve as drop-in replacements for NumPy arrays" 143 | ] 144 | }, 145 | { 146 | "attachments": {}, 147 | "cell_type": "markdown", 148 | "metadata": { 149 | "slideshow": { 150 | "slide_type": "subslide" 151 | } 152 | }, 153 | "source": [ 154 | "" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": { 161 | "slideshow": { 162 | "slide_type": "subslide" 163 | } 164 | }, 165 | "outputs": [], 166 | "source": [ 167 | "import numpy as np\n", 168 | "import dask.array as da" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": null, 174 | "metadata": {}, 175 | "outputs": [], 176 | "source": [ 177 | "x_np = np.random.random(size=(1_000, 1_000))\n", 178 | "x_np" 179 | ] 180 | }, 181 | { 182 | "attachments": {}, 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "We can create a Dask array in a similar manner, but need to specify a `chunks` argument to tell Dask how to break up the underlying array into chunks." 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [ 195 | "x = da.random.random(size=(1_000, 1_000), chunks=(250, 500))\n", 196 | "x # Dask arrays have nice HTML output in Jupyter notebooks" 197 | ] 198 | }, 199 | { 200 | "attachments": {}, 201 | "cell_type": "markdown", 202 | "metadata": {}, 203 | "source": [ 204 | "Dask arrays look and feel like NumPy arrays. 
For example, they have `dtype` and `shape` attributes:" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "print(x.dtype)\n", 214 | "print(x.shape)" 215 | ] 216 | }, 217 | { 218 | "attachments": {}, 219 | "cell_type": "markdown", 220 | "metadata": { 221 | "slideshow": { 222 | "slide_type": "subslide" 223 | } 224 | }, 225 | "source": [ 226 | "Dask collections are _lazily_ evaluated; the result from a computation isn't computed until you ask for it. Instead, a Dask task graph for the computation is produced. You can visualize this task graph using the `visualize()` method." 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": null, 232 | "metadata": { 233 | "slideshow": { 234 | "slide_type": "-" 235 | } 236 | }, 237 | "outputs": [], 238 | "source": [ 239 | "x.visualize()" 240 | ] 241 | }, 242 | { 243 | "attachments": {}, 244 | "cell_type": "markdown", 245 | "metadata": { 246 | "slideshow": { 247 | "slide_type": "subslide" 248 | } 249 | }, 250 | "source": [ 251 | "To compute a task graph call the `compute()` method" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": null, 257 | "metadata": {}, 258 | "outputs": [], 259 | "source": [ 260 | "result = x.compute() # We'll go into more detail about .compute() later on\n", 261 | "result" 262 | ] 263 | }, 264 | { 265 | "attachments": {}, 266 | "cell_type": "markdown", 267 | "metadata": {}, 268 | "source": [ 269 | "The result of this computation is a familiar NumPy `ndarray`:" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": null, 275 | "metadata": {}, 276 | "outputs": [], 277 | "source": [ 278 | "type(result)" 279 | ] 280 | }, 281 | { 282 | "attachments": {}, 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "Dask arrays support a large portion of the NumPy interface:\n", 287 | "\n", 288 | "- Arithmetic and scalar mathematics: `+`, `*`, `exp`, `log`, ...\n", 289 | "\n", 290 | "- Reductions along axes: `sum()`, `mean()`, `std()`, `sum(axis=0)`, ...\n", 291 | "\n", 292 | "- Tensor contractions / dot products / matrix multiply: `tensordot`\n", 293 | "\n", 294 | "- Axis reordering / transpose: `transpose`\n", 295 | "\n", 296 | "- Slicing: `x[:100, 500:100:-2]`\n", 297 | "\n", 298 | "- Fancy indexing along single axes with lists or numpy arrays: `x[:, [10, 1, 5]]`\n", 299 | "\n", 300 | "- Array protocols like `__array__` and `__array_ufunc__`\n", 301 | "\n", 302 | "- Some linear algebra: `svd`, `qr`, `solve`, `solve_triangular`, `lstsq`, ...\n", 303 | "\n", 304 | "- ...\n", 305 | "\n", 306 | "See the [Dask array API docs](http://docs.dask.org/en/latest/array-api.html) for full details about what portion of the NumPy API is implemented for Dask arrays." 307 | ] 308 | }, 309 | { 310 | "attachments": {}, 311 | "cell_type": "markdown", 312 | "metadata": {}, 313 | "source": [ 314 | "We can build more complex computations using the familiar NumPy operations we're used to." 
315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": null, 320 | "metadata": { 321 | "slideshow": { 322 | "slide_type": "subslide" 323 | } 324 | }, 325 | "outputs": [], 326 | "source": [ 327 | "result = (x + x.T).sum(axis=0).mean()\n", 328 | "result" 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": null, 334 | "metadata": {}, 335 | "outputs": [], 336 | "source": [ 337 | "result.visualize()" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": null, 343 | "metadata": {}, 344 | "outputs": [], 345 | "source": [ 346 | "result.compute()" 347 | ] 348 | }, 349 | { 350 | "attachments": {}, 351 | "cell_type": "markdown", 352 | "metadata": {}, 353 | "source": [ 354 | "**Note**: Dask can be used to scale other array-like libraries that support the NumPy `ndarray` interface. For example, [pydata/sparse](https://sparse.pydata.org/en/latest/) for sparse arrays or [CuPy](https://cupy.dev/) for GPU-accelerated arrays." 355 | ] 356 | }, 357 | { 358 | "attachments": {}, 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "## Dask Delayed\n", 363 | "\n", 364 | "Sometimes problems don’t fit nicely into one of the high-level collections like Dask arrays or Dask DataFrames. In these cases, you can parallelize custom algorithms using the lower-level Dask `delayed` interface. This allows one to manually create task graphs with a light annotation of normal Python code." 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": null, 370 | "metadata": {}, 371 | "outputs": [], 372 | "source": [ 373 | "import time\n", 374 | "import random\n", 375 | "\n", 376 | "def inc(x):\n", 377 | " time.sleep(random.random())\n", 378 | " return x + 1\n", 379 | "\n", 380 | "def double(x):\n", 381 | " time.sleep(random.random())\n", 382 | " return 2 * x\n", 383 | " \n", 384 | "def add(x, y):\n", 385 | " time.sleep(random.random())\n", 386 | " return x + y " 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": null, 392 | "metadata": {}, 393 | "outputs": [], 394 | "source": [ 395 | "%%time\n", 396 | "\n", 397 | "data = [1, 2, 3, 4]\n", 398 | "\n", 399 | "output = []\n", 400 | "for i in data:\n", 401 | " a = inc(i)\n", 402 | " b = double(i)\n", 403 | " c = add(a, b)\n", 404 | " output.append(c)\n", 405 | "\n", 406 | "total = sum(output)" 407 | ] 408 | }, 409 | { 410 | "attachments": {}, 411 | "cell_type": "markdown", 412 | "metadata": {}, 413 | "source": [ 414 | "Dask `delayed` wraps function calls and delays their execution. `delayed` functions record what we want to compute (a function and input parameters) as a task in a graph that we’ll run later on parallel hardware by calling `compute`." 
415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "execution_count": null, 420 | "metadata": {}, 421 | "outputs": [], 422 | "source": [ 423 | "from dask import delayed" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": null, 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [ 432 | "@delayed\n", 433 | "def lazy_inc(x):\n", 434 | " time.sleep(random.random())\n", 435 | " return x + 1\n", 436 | "\n", 437 | "lazy_inc" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": null, 443 | "metadata": {}, 444 | "outputs": [], 445 | "source": [ 446 | "inc_output = lazy_inc(3) # lazily evaluate inc(3)\n", 447 | "inc_output" 448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": null, 453 | "metadata": {}, 454 | "outputs": [], 455 | "source": [ 456 | "inc_output.compute()" 457 | ] 458 | }, 459 | { 460 | "attachments": {}, 461 | "cell_type": "markdown", 462 | "metadata": {}, 463 | "source": [ 464 | "Using `delayed` functions, we can build up a task graph for the particular computation we want to perform" 465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": null, 470 | "metadata": {}, 471 | "outputs": [], 472 | "source": [ 473 | "double_inc_output = lazy_inc(inc_output)\n", 474 | "double_inc_output" 475 | ] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "execution_count": null, 480 | "metadata": {}, 481 | "outputs": [], 482 | "source": [ 483 | "double_inc_output.visualize()" 484 | ] 485 | }, 486 | { 487 | "cell_type": "code", 488 | "execution_count": null, 489 | "metadata": {}, 490 | "outputs": [], 491 | "source": [ 492 | "double_inc_output.compute()" 493 | ] 494 | }, 495 | { 496 | "attachments": {}, 497 | "cell_type": "markdown", 498 | "metadata": {}, 499 | "source": [ 500 | "We can use `delayed` to make our previous example computation lazy by wrapping all the function calls with delayed:" 501 | ] 502 | }, 503 | { 504 | "cell_type": "code", 505 | "execution_count": null, 506 | "metadata": {}, 507 | "outputs": [], 508 | "source": [ 509 | "import time\n", 510 | "import random\n", 511 | "\n", 512 | "@delayed\n", 513 | "def inc(x):\n", 514 | " time.sleep(random.random())\n", 515 | " return x + 1\n", 516 | "\n", 517 | "@delayed\n", 518 | "def double(x):\n", 519 | " time.sleep(random.random())\n", 520 | " return 2 * x\n", 521 | "\n", 522 | "@delayed\n", 523 | "def add(x, y):\n", 524 | " time.sleep(random.random())\n", 525 | " return x + y" 526 | ] 527 | }, 528 | { 529 | "cell_type": "code", 530 | "execution_count": null, 531 | "metadata": {}, 532 | "outputs": [], 533 | "source": [ 534 | "%%time\n", 535 | "\n", 536 | "data = [1, 2, 3, 4]\n", 537 | "\n", 538 | "output = []\n", 539 | "for i in data:\n", 540 | " a = inc(i)\n", 541 | " b = double(i)\n", 542 | " c = add(a, b)\n", 543 | " output.append(c)\n", 544 | "\n", 545 | "total = delayed(sum)(output)\n", 546 | "total" 547 | ] 548 | }, 549 | { 550 | "cell_type": "code", 551 | "execution_count": null, 552 | "metadata": {}, 553 | "outputs": [], 554 | "source": [ 555 | "total.visualize()" 556 | ] 557 | }, 558 | { 559 | "cell_type": "code", 560 | "execution_count": null, 561 | "metadata": {}, 562 | "outputs": [], 563 | "source": [ 564 | "%%time\n", 565 | "\n", 566 | "total.compute()" 567 | ] 568 | }, 569 | { 570 | "attachments": {}, 571 | "cell_type": "markdown", 572 | "metadata": {}, 573 | "source": [ 574 | "We highly recommend checking out the [Dask delayed best practices](http://docs.dask.org/en/latest/delayed-best-practices.html) page to avoid 
some common pitfalls when using `delayed`. " 575 | ] 576 | }, 577 | { 578 | "attachments": {}, 579 | "cell_type": "markdown", 580 | "metadata": {}, 581 | "source": [ 582 | "# Schedulers\n", 583 | "\n", 584 | "High-level collections like Dask arrays and Dask DataFrames, as well as the low-level `dask.delayed` interface build up task graphs for a computation. After these graphs are generated, they need to be executed (potentially in parallel). This is the job of a task scheduler. Different task schedulers exist within Dask. Each will consume a task graph and compute the same result, but with different performance characteristics. " 585 | ] 586 | }, 587 | { 588 | "attachments": {}, 589 | "cell_type": "markdown", 590 | "metadata": {}, 591 | "source": [ 592 | "![grid-search](images/animation.gif \"grid-search\")\n" 593 | ] 594 | }, 595 | { 596 | "attachments": {}, 597 | "cell_type": "markdown", 598 | "metadata": {}, 599 | "source": [ 600 | "Dask has two different classes of schedulers: \n", 601 | "\n", 602 | "1. **Single machine schedulers** provide basic features on a local process or thread pool with minimal setup.\n", 603 | "2. **Distributed schedulers** are more sophisticated and offer more features, but also require a bit more effort to set up." 604 | ] 605 | }, 606 | { 607 | "attachments": {}, 608 | "cell_type": "markdown", 609 | "metadata": {}, 610 | "source": [ 611 | "## Single Machine Schedulers\n", 612 | "\n", 613 | "Single machine schedulers provide basic features on a local process or thread pool and require no setup (only use the Python standard library). The different single machine schedulers Dask provides are:\n", 614 | "\n", 615 | "- `'threads'`: The threaded scheduler executes computations with a local `concurrent.futures.ThreadPoolExecutor`. The threaded scheduler is the default choice for Dask arrays, Dask DataFrames, and Dask delayed. \n", 616 | "\n", 617 | "- `'processes'`: The multiprocessing scheduler executes computations with a local `concurrent.futures.ProcessPoolExecutor`.\n", 618 | "\n", 619 | "- `'synchronous'`: The single-threaded synchronous scheduler executes all computations in the local thread, with no parallelism at all. This is particularly valuable for debugging and profiling, which are more difficult when using threads or processes." 620 | ] 621 | }, 622 | { 623 | "attachments": {}, 624 | "cell_type": "markdown", 625 | "metadata": {}, 626 | "source": [ 627 | "You can configure which scheduler is used in a few different ways. 
You can set the scheduler globally by using the `dask.config.set(scheduler=)` command:" 628 | ] 629 | }, 630 | { 631 | "cell_type": "code", 632 | "execution_count": null, 633 | "metadata": {}, 634 | "outputs": [], 635 | "source": [ 636 | "import dask\n", 637 | "\n", 638 | "dask.config.set(scheduler='threads')\n", 639 | "x.compute() # Will use the multi-threading scheduler" 640 | ] 641 | }, 642 | { 643 | "attachments": {}, 644 | "cell_type": "markdown", 645 | "metadata": {}, 646 | "source": [ 647 | "or use it as a context manager to set the scheduler for a block of code:" 648 | ] 649 | }, 650 | { 651 | "cell_type": "code", 652 | "execution_count": null, 653 | "metadata": {}, 654 | "outputs": [], 655 | "source": [ 656 | "with dask.config.set(scheduler='processes'):\n", 657 | " x.compute() # Will use the multi-processing scheduler" 658 | ] 659 | }, 660 | { 661 | "attachments": {}, 662 | "cell_type": "markdown", 663 | "metadata": {}, 664 | "source": [ 665 | "or even within a single compute call:" 666 | ] 667 | }, 668 | { 669 | "cell_type": "code", 670 | "execution_count": null, 671 | "metadata": {}, 672 | "outputs": [], 673 | "source": [ 674 | "x.compute(scheduler='threads') # Will use the multi-threading scheduler" 675 | ] 676 | }, 677 | { 678 | "attachments": {}, 679 | "cell_type": "markdown", 680 | "metadata": {}, 681 | "source": [ 682 | "The `num_workers` argument can be used to specify the number of threads or processes to use:" 683 | ] 684 | }, 685 | { 686 | "cell_type": "code", 687 | "execution_count": null, 688 | "metadata": {}, 689 | "outputs": [], 690 | "source": [ 691 | "x.compute(scheduler='threads', num_workers=4)" 692 | ] 693 | }, 694 | { 695 | "attachments": {}, 696 | "cell_type": "markdown", 697 | "metadata": {}, 698 | "source": [ 699 | "## Distributed Scheduler\n", 700 | "\n", 701 | "Despite having \"distributed\" in its name, the distributed scheduler works well on both single and multiple machines. Think of it as an \"advanced\" scheduler.\n", 702 | "\n", 703 | "A Dask distributed cluster is composed of a single centralized scheduler and one or more worker processes. A `Client` object is used as the user-facing entry point to interact with the cluster. We will talk about the components of Dask clusters in more detail later on in [3-distributed-scheduler.ipynb](3-distributed-scheduler.ipynb).\n", 704 | "\n", 705 | "\"Dask" 708 | ] 709 | }, 710 | { 711 | "attachments": {}, 712 | "cell_type": "markdown", 713 | "metadata": {}, 714 | "source": [ 715 | "The distributed scheduler has many features:\n", 716 | "\n", 717 | "- [Real-time, `concurrent.futures`-like interface](https://docs.dask.org/en/latest/futures.html)\n", 718 | "\n", 719 | "- [Sophisticated memory management](https://distributed.dask.org/en/latest/memory.html)\n", 720 | "\n", 721 | "- [Data locality](https://distributed.dask.org/en/latest/locality.html)\n", 722 | "\n", 723 | "- [Adaptive deployments](https://distributed.dask.org/en/latest/adaptive.html)\n", 724 | "\n", 725 | "- [Cluster resilience](https://distributed.dask.org/en/latest/resilience.html)\n", 726 | "\n", 727 | "- ...\n", 728 | "\n", 729 | "See the [Dask distributed documentation](https://distributed.dask.org) for full details about all the distributed scheduler features." 
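The feature list above highlights the real-time, `concurrent.futures`-like interface of the distributed scheduler, which the rest of this notebook does not demonstrate. As a minimal sketch (not part of the original notebook), the snippet below submits work to a small in-process cluster with `client.map` and `client.submit`; the `inc` helper is defined here purely for illustration.

```python
from dask.distributed import Client

def inc(x):
    return x + 1

client = Client(processes=False)       # small local, in-process cluster for this demo
futures = client.map(inc, range(10))   # submit many calls; returns futures immediately
total = client.submit(sum, futures)    # futures can be passed as inputs to new tasks
print(total.result())                  # blocks until the result arrives; prints 55
client.close()
```

Futures start executing as soon as they are submitted, in contrast to the lazy `compute()`-driven collections shown elsewhere in this notebook.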
730 | ] 731 | }, 732 | { 733 | "cell_type": "code", 734 | "execution_count": null, 735 | "metadata": {}, 736 | "outputs": [], 737 | "source": [ 738 | "from dask.distributed import Client\n", 739 | "\n", 740 | "# Creates a local Dask cluster\n", 741 | "client = Client()\n", 742 | "client" 743 | ] 744 | }, 745 | { 746 | "cell_type": "code", 747 | "execution_count": null, 748 | "metadata": {}, 749 | "outputs": [], 750 | "source": [ 751 | "x = da.ones((20_000, 20_000), chunks=(400, 400))\n", 752 | "result = (x + x.T).sum(axis=0).mean()" 753 | ] 754 | }, 755 | { 756 | "cell_type": "code", 757 | "execution_count": null, 758 | "metadata": {}, 759 | "outputs": [], 760 | "source": [ 761 | "result.compute()" 762 | ] 763 | }, 764 | { 765 | "cell_type": "code", 766 | "execution_count": null, 767 | "metadata": {}, 768 | "outputs": [], 769 | "source": [ 770 | "client.close()" 771 | ] 772 | }, 773 | { 774 | "attachments": {}, 775 | "cell_type": "markdown", 776 | "metadata": {}, 777 | "source": [ 778 | "# Next steps\n", 779 | "\n", 780 | "Next, let's learn more about Dask internals and how to do custom operations in the [2-custom-operations.ipynb](2-custom-operations.ipynb) notebook." 781 | ] 782 | } 783 | ], 784 | "metadata": { 785 | "kernelspec": { 786 | "display_name": "Python 3 (ipykernel)", 787 | "language": "python", 788 | "name": "python3" 789 | }, 790 | "language_info": { 791 | "codemirror_mode": { 792 | "name": "ipython", 793 | "version": 3 794 | }, 795 | "file_extension": ".py", 796 | "mimetype": "text/x-python", 797 | "name": "python", 798 | "nbconvert_exporter": "python", 799 | "pygments_lexer": "ipython3", 800 | "version": "3.10.12" 801 | } 802 | }, 803 | "nbformat": 4, 804 | "nbformat_minor": 4 805 | } 806 | -------------------------------------------------------------------------------- /4-performance-optimization.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f06a2f61-89b5-4a1f-9928-da4fe14d62a5", 6 | "metadata": {}, 7 | "source": [ 8 | "\"Dask\n", 9 | "\n", 10 | "# Performance Optimization\n", 11 | "\n", 12 | "This notebook walks through a Dask DataFrame ETL workload. We'll demonstrate how to diagnose performance issues, utilize the Dask dashboard, and cover several common DataFrame best practices. 
\n", 13 | "\n", 14 | "## Dataset: Uber/Lyft TLC Trip Records\n", 15 | "\n", 16 | "The New York City Taxi and Limousine Commission (TLC) collects trip information for each taxi and for-hire vehicle trip completed by licensed drivers and vehicles; Here we'll analyze a subset of the [High-Volume For-Hire Services](https://www.nyc.gov/site/tlc/businesses/high-volume-for-hire-services.page) datset stored which provides a good example of an out-of-core dataset that's too large for a standard laptop due to memory limitations.\n", 17 | "\n", 18 | "Some characteristics of the dataset:\n", 19 | "\n", 20 | "- CSV dataset that's ~115 GB in memory\n", 21 | "- Stored in `s3://coiled-datasets/uber-lyft-tlc-sample/csv-10/`\n", 22 | "- In region `us-east-2`\n", 23 | "\n", 24 | "## Cluster setup\n", 25 | "\n", 26 | "Because the dataset is too large for a laptop, we'll create a larger Dask cluster on AWS using [Coiled](https://www.coiled.io).\n", 27 | "(Disclaimer: Some of the instructors for this tutorial are employed by Coiled.):" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": null, 33 | "id": "29f3298c-6626-4eb4-8389-1e3ca7f83f67", 34 | "metadata": { 35 | "tags": [] 36 | }, 37 | "outputs": [], 38 | "source": [ 39 | "import coiled\n", 40 | "\n", 41 | "cluster = coiled.Cluster(\n", 42 | " n_workers=20,\n", 43 | " region=\"us-east-2\", # start workers close to data to minimize costs\n", 44 | ")\n", 45 | "client = cluster.get_client()" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "id": "547788e0", 51 | "metadata": {}, 52 | "source": [ 53 | "Once we have initialized a cluster and client, we can easily view the Dask dashboard either through widgets provided by [dask-labextension](https://github.com/dask/dask-labextension), or by visiting the dashboard URL directly:" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": null, 59 | "id": "b3396b6a-6432-4e86-b826-17556efee8ec", 60 | "metadata": { 61 | "tags": [] 62 | }, 63 | "outputs": [], 64 | "source": [ 65 | "client" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "id": "2a4c4623-e950-4685-849c-9793b5f18e59", 71 | "metadata": {}, 72 | "source": [ 73 | "Using `dask.dataframe.read_csv()`, we can lazily read this data in and do some low-level exploration before performing more complex computations:" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "id": "3b486301-0df1-4700-b618-343bbb397013", 80 | "metadata": { 81 | "tags": [] 82 | }, 83 | "outputs": [], 84 | "source": [ 85 | "%%time\n", 86 | "\n", 87 | "import dask.dataframe as dd\n", 88 | "\n", 89 | "ddf = dd.read_csv(\n", 90 | " \"s3://coiled-datasets/uber-lyft-tlc-sample/csv-0.2-10/*\", \n", 91 | " dtype={\"wav_match_flag\": \"category\"},\n", 92 | ")" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "id": "72a8ba8c-7ebe-4c26-81c4-76ac345e135b", 99 | "metadata": { 100 | "tags": [] 101 | }, 102 | "outputs": [], 103 | "source": [ 104 | "ddf.dtypes" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "id": "da2e59c0", 110 | "metadata": {}, 111 | "source": [ 112 | "After some initial exploration, we see that the columns representing on-scene and pickup times are stored as `object`s. We decide to do some feature engineering by converting these to datetimes and moving relevant date components into separate columns." 
113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": null, 118 | "id": "63f2390e-5983-4844-8c37-2651afd8d940", 119 | "metadata": { 120 | "tags": [] 121 | }, 122 | "outputs": [], 123 | "source": [ 124 | "%%time\n", 125 | "\n", 126 | "# Convert to datetime\n", 127 | "ddf[\"on_scene_datetime\"] = dd.to_datetime(ddf[\"on_scene_datetime\"], format=\"mixed\")\n", 128 | "ddf[\"pickup_datetime\"] = dd.to_datetime(ddf[\"pickup_datetime\"], format=\"mixed\")\n", 129 | "\n", 130 | "# Unpack columns\n", 131 | "ddf = ddf.assign(\n", 132 | " accessible_vehicle=ddf.on_scene_datetime.isnull(),\n", 133 | " pickup_month=ddf.pickup_datetime.dt.month,\n", 134 | " pickup_dow=ddf.pickup_datetime.dt.dayofweek,\n", 135 | " pickup_hour=ddf.pickup_datetime.dt.hour,\n", 136 | ")\n", 137 | "ddf = ddf.drop(columns=[\"on_scene_datetime\", \"pickup_datetime\"])" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "id": "166a0eb0", 143 | "metadata": {}, 144 | "source": [ 145 | "From here, some data sanitization and improvements to readability:\n", 146 | "\n", 147 | "- Normalize airport fees to non-null floats\n", 148 | "- Remove trip time outliers\n", 149 | "- Rename service codes to their corresponding rideshare companies" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": null, 155 | "id": "3b502560", 156 | "metadata": { 157 | "tags": [] 158 | }, 159 | "outputs": [], 160 | "source": [ 161 | "%%time\n", 162 | "\n", 163 | "# Format airport_fee\n", 164 | "ddf[\"airport_fee\"] = ddf[\"airport_fee\"].fillna(0)\n", 165 | "\n", 166 | "# Remove outliers\n", 167 | "lower_bound = 0\n", 168 | "Q3 = ddf[\"trip_time\"].quantile(0.75)\n", 169 | "upper_bound = Q3 + (1.5 * (Q3 - lower_bound))\n", 170 | "ddf = ddf.loc[(ddf[\"trip_time\"] >= lower_bound) & (ddf[\"trip_time\"] <= upper_bound)]\n", 171 | "\n", 172 | "service_names = {\n", 173 | " \"HV0002\": \"juno\",\n", 174 | " \"HV0005\": \"lyft\",\n", 175 | " \"HV0003\": \"uber\",\n", 176 | " \"HV0004\": \"via\",\n", 177 | "}\n", 178 | "\n", 179 | "ddf[\"service_names\"] = ddf[\"hvfhs_license_num\"].map(service_names)\n", 180 | "ddf = ddf.drop(columns=[\"hvfhs_license_num\"])" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "id": "24e4e66b-24f7-4f6a-907a-96a34d48b7d0", 186 | "metadata": { 187 | "tags": [] 188 | }, 189 | "source": [ 190 | "Now that the data is cleaned up, we can do some computations on our data.\n", 191 | "\n", 192 | "First, let's compute the average tip amount across all riders:" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "id": "abfae138-c7cc-4bf4-9b83-446e3cfd7492", 199 | "metadata": { 200 | "tags": [] 201 | }, 202 | "outputs": [], 203 | "source": [ 204 | "%%time\n", 205 | "\n", 206 | "(ddf.tips > 0).mean().compute()" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "id": "c932453a", 212 | "metadata": {}, 213 | "source": [ 214 | "Or some metrics of tipping grouped by rideshare company:" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": null, 220 | "id": "7ceab8e1-6696-4ca1-b870-931c93ca684a", 221 | "metadata": { 222 | "tags": [] 223 | }, 224 | "outputs": [], 225 | "source": [ 226 | "%%time\n", 227 | "\n", 228 | "ddf.loc[lambda x: x.tips > 0].groupby(\"service_names\").tips.sum().compute()" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": null, 234 | "id": "24e6b445-fa87-40aa-aa53-45e410a34e37", 235 | "metadata": { 236 | "tags": [] 237 | }, 238 | "outputs": [], 239 | "source": [ 240 
| "%%time\n", 241 | "\n", 242 | "ddf.loc[lambda x: x.tips > 0].groupby(\"service_names\").tips.mean().compute()" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "id": "f3bb9d04-dd26-45bb-87be-3b57ef8358b6", 248 | "metadata": {}, 249 | "source": [ 250 | "# Persist when possible\n", 251 | "\n", 252 | "Looking at the dashboard while performing the above analysis, it should become clear that whenever we compute operations on `ddf`, we must also run through all the dependent operations that read in and sanitize `ddf`, which forces several repeated computation steps.\n", 253 | "\n", 254 | "When doing mutliple computations on the same dataset, it can save both time and money to `.persist()` it first - this incurs the time and cost of computing the dataset once, in exchange for future computations on the dataset working with an in-memory copy of the computed data:" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": null, 260 | "id": "4e7489e9-0408-4550-a756-baf7fdeadccc", 261 | "metadata": { 262 | "tags": [] 263 | }, 264 | "outputs": [], 265 | "source": [ 266 | "%%time\n", 267 | "\n", 268 | "ddf = ddf.persist()" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "id": "ef63458e-25fa-4be1-bd60-fad2e7b0987d", 275 | "metadata": { 276 | "tags": [] 277 | }, 278 | "outputs": [], 279 | "source": [ 280 | "%%time\n", 281 | "\n", 282 | "from distributed import wait\n", 283 | "wait(ddf);" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "id": "d5b14a0e", 289 | "metadata": {}, 290 | "source": [ 291 | "Now that `ddf` has been persisted, we can see that the same analysis as above can be computed much faster, with the initial creation of `ddf` no longer being included:" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": null, 297 | "id": "830696a8-f007-4be7-a482-375d7abbeb8e", 298 | "metadata": { 299 | "tags": [] 300 | }, 301 | "outputs": [], 302 | "source": [ 303 | "%%time\n", 304 | "\n", 305 | "(ddf.tips > 0).mean().compute()" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": null, 311 | "id": "b4c6e467-ebb7-4e6f-a56c-b97be7f19f96", 312 | "metadata": { 313 | "tags": [] 314 | }, 315 | "outputs": [], 316 | "source": [ 317 | "%%time\n", 318 | "\n", 319 | "ddf.loc[lambda x: x.tips > 0].groupby(\"service_names\").tips.sum().compute()" 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": null, 325 | "id": "f1f97070-2081-4803-9ae9-bc1ce6a17f70", 326 | "metadata": { 327 | "tags": [] 328 | }, 329 | "outputs": [], 330 | "source": [ 331 | "%%time\n", 332 | "\n", 333 | "ddf.loc[lambda x: x.tips > 0].groupby(\"service_names\").tips.mean().compute()" 334 | ] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "id": "8eea0988", 339 | "metadata": {}, 340 | "source": [ 341 | "Note that the choice to persist data depends on several factors, including:\n", 342 | "\n", 343 | "- Whether or not it fits into your clusters memory\n", 344 | "- If it's being reused in enough computations\n", 345 | "\n", 346 | "In general, a best practice to follow is persisting the dataset(s) you expect to use the most throughout computations." 
347 | ] 348 | }, 349 | { 350 | "cell_type": "markdown", 351 | "id": "2b87bd8b-96fe-4e88-9a19-79871ece71a0", 352 | "metadata": {}, 353 | "source": [ 354 | "# Avoid repeated compute calls\n", 355 | "\n", 356 | "When working with related results that share computations with one another, calling `.compute()` on each object individually forces Dask to discard shared work that could otherwise be reused to speed up the remaining computations.\n", 357 | "\n", 358 | "For example:" 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": null, 364 | "id": "09785813-271d-4314-8d57-5024c5972a13", 365 | "metadata": { 366 | "tags": [] 367 | }, 368 | "outputs": [], 369 | "source": [ 370 | "trip_frac = (ddf.tips > 0).mean()\n", 371 | "gb_sum = ddf.loc[lambda x: x.tips > 0].groupby(\"service_names\").tips.sum()\n", 372 | "gb_mean = ddf.loc[lambda x: x.tips > 0].groupby(\"service_names\").tips.mean()" 373 | ] 374 | }, 375 | { 376 | "cell_type": "markdown", 377 | "id": "d84a1502", 378 | "metadata": {}, 379 | "source": [ 380 | "Intuitively, we know that `gb_sum` and `gb_mean` both depend on `ddf.loc[lambda x: x.tips > 0].groupby(\"service_names\")`, but calling `.compute()` on each object forces us to compute this result twice.\n", 381 | "\n", 382 | "To compute all of these objects in parallel and compute shared parts of the computation only once, we can use [`dask.compute()`](https://docs.dask.org/en/stable/api.html#dask.compute):" 383 | ] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "execution_count": null, 388 | "id": "bfd793ed-6a19-40f0-ac1a-3fd8c93e336c", 389 | "metadata": { 390 | "tags": [] 391 | }, 392 | "outputs": [], 393 | "source": [ 394 | "%%time\n", 395 | "\n", 396 | "import dask\n", 397 | "\n", 398 | "trip_frac, gb_sum, gb_mean = dask.compute(trip_frac, gb_sum, gb_mean)" 399 | ] 400 | }, 401 | { 402 | "cell_type": "markdown", 403 | "id": "b8d460cc-d6c3-46c8-983a-9f4a88663db7", 404 | "metadata": {}, 405 | "source": [ 406 | "# Store data efficiently\n", 407 | "\n", 408 | "Up until this point, all of our performance optimizations have taken place after the initial reading of the data.\n", 409 | "However, as compute capacity increases, data access and I/O become more significant bottlenecks.\n", 410 | "Additionally, parallel computing often adds new constraints to how you store your data, particularly around providing random access to blocks of your data in a layout that matches how you plan to compute on it.\n", 411 | "\n", 412 | "## File format\n", 413 | "\n", 414 | "[Parquet](https://parquet.apache.org) is a popular, columnar file format designed for efficient data storage and retrieval. It handles random access, metadata storage, and binary encoding well. We [recommend using Parquet](https://docs.dask.org/en/stable/dataframe-best-practices.html#use-parquet) when working with tabular data."
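If your data starts out as CSV, converting it to Parquet is typically a one-time job that quickly pays for itself. A sketch of what that conversion could look like; the destination bucket below is a placeholder, not part of this tutorial's datasets:

```python
import dask.dataframe as dd

# Read the CSV once...
ddf_csv = dd.read_csv(
    "s3://coiled-datasets/uber-lyft-tlc-sample/csv-0.2-10/*",
    dtype={"wav_match_flag": "category"},
)

# ...and write Parquet to storage you control (hypothetical path)
ddf_csv.to_parquet(
    "s3://your-bucket/uber-lyft-tlc-sample/parquet/",
    engine="pyarrow",
    write_index=False,
)
```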
415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "execution_count": null, 420 | "id": "343c83ef-aeb0-460b-9853-3ba026cffb62", 421 | "metadata": { 422 | "tags": [] 423 | }, 424 | "outputs": [], 425 | "source": [ 426 | "%%time\n", 427 | "\n", 428 | "import dask.dataframe as dd\n", 429 | "\n", 430 | "# ddf = dd.read_csv(\n", 431 | "# \"s3://coiled-datasets/uber-lyft-tlc-sample/csv-ill/*\", \n", 432 | "# dtype={\"wav_match_flag\": \"category\"},\n", 433 | "# )\n", 434 | "\n", 435 | "ddf = dd.read_parquet(\"s3://coiled-datasets/uber-lyft-tlc-sample/parquet-0.2-10/\")" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": null, 441 | "id": "26a2c940-bf12-4f25-b1e8-bedc4a0b7909", 442 | "metadata": { 443 | "tags": [] 444 | }, 445 | "outputs": [], 446 | "source": [ 447 | "ddf.dtypes" 448 | ] 449 | }, 450 | { 451 | "cell_type": "markdown", 452 | "id": "6f2e173e", 453 | "metadata": {}, 454 | "source": [ 455 | "From here, we can see that the same data sanitization as earlier can be done much faster:" 456 | ] 457 | }, 458 | { 459 | "cell_type": "code", 460 | "execution_count": null, 461 | "id": "16ca27f6-5d46-451e-bccf-3e2f3b24d0cc", 462 | "metadata": { 463 | "tags": [] 464 | }, 465 | "outputs": [], 466 | "source": [ 467 | "%%time\n", 468 | "\n", 469 | "# # Convert to datetime\n", 470 | "# ddf[\"on_scene_datetime\"] = dd.to_datetime(ddf[\"on_scene_datetime\"], format=\"mixed\")\n", 471 | "# ddf[\"pickup_datetime\"] = dd.to_datetime(ddf[\"pickup_datetime\"], format=\"mixed\")\n", 472 | "\n", 473 | "# Unpack columns\n", 474 | "ddf = ddf.assign(\n", 475 | " accessible_vehicle=ddf.on_scene_datetime.isnull(),\n", 476 | " pickup_month=ddf.pickup_datetime.dt.month,\n", 477 | " pickup_dow=ddf.pickup_datetime.dt.dayofweek,\n", 478 | " pickup_hour=ddf.pickup_datetime.dt.hour,\n", 479 | ")\n", 480 | "ddf = ddf.drop(columns=[\"on_scene_datetime\", \"pickup_datetime\"])\n", 481 | "\n", 482 | "# Format airport_fee\n", 483 | "ddf[\"airport_fee\"] = ddf[\"airport_fee\"].fillna(0)\n", 484 | "\n", 485 | "# Remove outliers\n", 486 | "lower_bound = 0\n", 487 | "Q3 = ddf[\"trip_time\"].quantile(0.75)\n", 488 | "upper_bound = Q3 + (1.5 * (Q3 - lower_bound))\n", 489 | "ddf = ddf.loc[(ddf[\"trip_time\"] >= lower_bound) & (ddf[\"trip_time\"] <= upper_bound)]\n", 490 | "\n", 491 | "service_names = {\n", 492 | " \"HV0002\": \"juno\",\n", 493 | " \"HV0005\": \"lyft\",\n", 494 | " \"HV0003\": \"uber\",\n", 495 | " \"HV0004\": \"via\",\n", 496 | "}\n", 497 | "\n", 498 | "ddf[\"service_names\"] = ddf[\"hvfhs_license_num\"].map(service_names)\n", 499 | "ddf = ddf.drop(columns=[\"hvfhs_license_num\"])" 500 | ] 501 | }, 502 | { 503 | "cell_type": "markdown", 504 | "id": "8a6455e3", 505 | "metadata": {}, 506 | "source": [ 507 | "Following best practices, we will now persist this sanitized dataset, so we no longer need to incur repeated I/O costs:" 508 | ] 509 | }, 510 | { 511 | "cell_type": "code", 512 | "execution_count": null, 513 | "id": "4f53c0e3-ad21-4c0a-98dd-a138adf0c378", 514 | "metadata": { 515 | "tags": [] 516 | }, 517 | "outputs": [], 518 | "source": [ 519 | "ddf = ddf.persist()" 520 | ] 521 | }, 522 | { 523 | "cell_type": "code", 524 | "execution_count": null, 525 | "id": "bf8bcba0-1510-45eb-a1be-2bf8604c2ecf", 526 | "metadata": { 527 | "tags": [] 528 | }, 529 | "outputs": [], 530 | "source": [ 531 | "%%time\n", 532 | "\n", 533 | "from distributed import wait\n", 534 | "wait(ddf);" 535 | ] 536 | }, 537 | { 538 | "cell_type": "markdown", 539 | "id": "159af1ee", 540 | "metadata": {}, 541 | 
"source": [ 542 | "From here, analysis can continue as normally:" 543 | ] 544 | }, 545 | { 546 | "cell_type": "code", 547 | "execution_count": null, 548 | "id": "39634c19-079a-46d0-9e0a-ddca38921df2", 549 | "metadata": { 550 | "tags": [] 551 | }, 552 | "outputs": [], 553 | "source": [ 554 | "%%time\n", 555 | "\n", 556 | "(ddf.tips > 0).mean().compute()" 557 | ] 558 | }, 559 | { 560 | "cell_type": "code", 561 | "execution_count": null, 562 | "id": "f4278dd9-c037-4b06-a3a6-ff7d41a7d4cd", 563 | "metadata": { 564 | "tags": [] 565 | }, 566 | "outputs": [], 567 | "source": [ 568 | "%%time\n", 569 | "\n", 570 | "ddf.loc[lambda x: x.tips > 0].groupby(\"service_names\").tips.sum().compute()" 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": null, 576 | "id": "4610c9db-f83d-4b62-ad7d-e4a1fff3e788", 577 | "metadata": { 578 | "tags": [] 579 | }, 580 | "outputs": [], 581 | "source": [ 582 | "%%time\n", 583 | "\n", 584 | "ddf.loc[lambda x: x.tips > 0].groupby(\"service_names\").tips.mean().compute()" 585 | ] 586 | }, 587 | { 588 | "cell_type": "markdown", 589 | "id": "98af8e81-e9fd-4e0d-b0de-b7e6755a3fbf", 590 | "metadata": {}, 591 | "source": [ 592 | "Note that since we persisted the data, the impact of the improved I/O is gone by the time we get to the analysis.\n", 593 | "This is because at this point, the data is stored in memory with pandas objects and datatypes; how it was originally stored no longer matters.\n", 594 | "Put differently, all analysis beyond I/O and sanitization creates an identical task graph to the previous dataset.\n", 595 | "In the next section, we will see how to troubleshoot and optimize our analysis independent of I/O.\n", 596 | "\n", 597 | "## Partition size\n", 598 | "\n", 599 | "So far, we've been working with the default partition size which, for this dataset, is pretty small (~10 MB).\n", 600 | "A small partition size results in very many partition, which in turn results in very many tasks in our computation graphs.\n", 601 | "\n", 602 | "When choosing a partition size, the goal is to give Dask enough to do per task that the scheduler overhead isn't taking up a disproportionate amount of time, but not so much that the workers run out of memory.\n", 603 | "A good rule of thumb for partition sizes is between 100 MB and 1 GB per partition ([excellent blog post on this](https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes)).\n", 604 | "\n", 605 | "So the first step is to see what our partiton size currently is:" 606 | ] 607 | }, 608 | { 609 | "cell_type": "code", 610 | "execution_count": null, 611 | "id": "01581f63-3060-4bba-b97f-3e34ed0aae49", 612 | "metadata": { 613 | "tags": [] 614 | }, 615 | "outputs": [], 616 | "source": [ 617 | "import dask\n", 618 | "dask.utils.format_bytes(ddf.partitions[0].compute().memory_usage(deep=True).sum())" 619 | ] 620 | }, 621 | { 622 | "cell_type": "markdown", 623 | "id": "d7fb4d26-c657-417a-b031-0827711c2a72", 624 | "metadata": {}, 625 | "source": [ 626 | "Let's repartition to a bigger size." 
627 | ] 628 | }, 629 | { 630 | "cell_type": "code", 631 | "execution_count": null, 632 | "id": "d0654d49-f504-4a5f-9eda-ce1dcd1f0c36", 633 | "metadata": { 634 | "tags": [] 635 | }, 636 | "outputs": [], 637 | "source": [ 638 | "%%time\n", 639 | "\n", 640 | "ddf = ddf.repartition(\"100MiB\")\n", 641 | "ddf = ddf.persist()\n", 642 | "wait(ddf);" 643 | ] 644 | }, 645 | { 646 | "cell_type": "markdown", 647 | "id": "7e08158e-8d36-4b8d-8025-f977270093fb", 648 | "metadata": {}, 649 | "source": [ 650 | "Note that we persist after we repartition so we don't repeat the repartitioning work every time we compute.\n", 651 | "\n", 652 | "As a sanity check, let's check the new partition size:" 653 | ] 654 | }, 655 | { 656 | "cell_type": "code", 657 | "execution_count": null, 658 | "id": "280633da-95ac-4f04-ab23-c0e8a132eed9", 659 | "metadata": { 660 | "tags": [] 661 | }, 662 | "outputs": [], 663 | "source": [ 664 | "dask.utils.format_bytes(ddf.partitions[0].compute().memory_usage(deep=True).sum())" 665 | ] 666 | }, 667 | { 668 | "cell_type": "markdown", 669 | "id": "5ad83a25-b00a-46f9-9ae5-22e8ad4cb7b8", 670 | "metadata": {}, 671 | "source": [ 672 | "Nice! Now let's do our analyses again.\n", 673 | "Remember that this time, the task graph will be much smaller.\n", 674 | "You can always inspect the graph by calling `.visualize()` rather than `.compute()` or by looking at the \"Graph\" page in the dashboard." 675 | ] 676 | }, 677 | { 678 | "cell_type": "code", 679 | "execution_count": null, 680 | "id": "27254373-f82a-4fe6-b57e-64217b308e64", 681 | "metadata": { 682 | "tags": [] 683 | }, 684 | "outputs": [], 685 | "source": [ 686 | "%%time\n", 687 | "\n", 688 | "(ddf.tips > 0).mean().compute()" 689 | ] 690 | }, 691 | { 692 | "cell_type": "code", 693 | "execution_count": null, 694 | "id": "b460febd-ad73-4b2e-8606-bd29a2868e5f", 695 | "metadata": { 696 | "tags": [] 697 | }, 698 | "outputs": [], 699 | "source": [ 700 | "%%time\n", 701 | "\n", 702 | "ddf.loc[lambda x: x.tips > 0].groupby(\"service_names\").tips.sum().compute()" 703 | ] 704 | }, 705 | { 706 | "cell_type": "code", 707 | "execution_count": null, 708 | "id": "99548028-fc5b-46dc-8300-5a1d0c9e69b5", 709 | "metadata": { 710 | "tags": [] 711 | }, 712 | "outputs": [], 713 | "source": [ 714 | "%%time\n", 715 | "\n", 716 | "ddf.loc[lambda x: x.tips > 0].groupby(\"service_names\").tips.mean().compute()" 717 | ] 718 | }, 719 | { 720 | "cell_type": "markdown", 721 | "id": "4dff0b64-e4d3-4f2d-b00e-b950985d9ee0", 722 | "metadata": {}, 723 | "source": [ 724 | "That was fast 🔥\n", 725 | "\n", 726 | "Here we improved on the task graph by increasing the partition size, but we haven't improved the performance of the tasks themselves.\n", 727 | "In the next section, we'll explore how changing the data type of your columns can make individual tasks more perfomant." 728 | ] 729 | }, 730 | { 731 | "cell_type": "markdown", 732 | "id": "a309e787-5189-41ef-85ed-5aed652054ff", 733 | "metadata": {}, 734 | "source": [ 735 | "# Use efficient data types\n", 736 | "\n", 737 | "Up until this point, we've been using the default data types inferred by Dask for most of our columns. 
In the case of string data, this means we are using the Python `object` type, which can be slow to process:" 738 | ] 739 | }, 740 | { 741 | "cell_type": "code", 742 | "execution_count": null, 743 | "id": "5f0820b9-c715-4c4f-b861-88b384c0a1d1", 744 | "metadata": { 745 | "tags": [] 746 | }, 747 | "outputs": [], 748 | "source": [ 749 | "ddf.dtypes" 750 | ] 751 | }, 752 | { 753 | "cell_type": "markdown", 754 | "id": "3f58a23d", 755 | "metadata": {}, 756 | "source": [ 757 | "Recent versions of [Dask and pandas have improved support for PyArrow data types, most notably PyArrow strings](https://medium.com/coiled-hq/pyarrow-strings-in-dask-dataframes-55a0c4871586), which are faster and more memory efficient than Python `objects`.\n", 758 | "\n", 759 | "Let's enjoy some of the benefits of PyArrow strings by casting relevant string columns to `string[pyarrow]`:" 760 | ] 761 | }, 762 | { 763 | "cell_type": "code", 764 | "execution_count": null, 765 | "id": "da37e748-9f2a-4e88-a176-6b48bf82f94e", 766 | "metadata": { 767 | "tags": [] 768 | }, 769 | "outputs": [], 770 | "source": [ 771 | "%%time\n", 772 | "\n", 773 | "ddf = ddf.astype({\n", 774 | " \"service_names\": \"string[pyarrow]\",\n", 775 | " \"dispatching_base_num\": \"string[pyarrow]\",\n", 776 | " \"originating_base_num\": \"string[pyarrow]\",\n", 777 | "})\n", 778 | "\n", 779 | "ddf = ddf.persist()\n", 780 | "wait(ddf);" 781 | ] 782 | }, 783 | { 784 | "cell_type": "code", 785 | "execution_count": null, 786 | "id": "644e13ff-3dfa-446e-900a-c47e76317428", 787 | "metadata": { 788 | "tags": [] 789 | }, 790 | "outputs": [], 791 | "source": [ 792 | "ddf.dtypes" 793 | ] 794 | }, 795 | { 796 | "cell_type": "markdown", 797 | "id": "6e518106", 798 | "metadata": {}, 799 | "source": [ 800 | "With that done, let's revisit our partition sizes to see how they've been impacted:" 801 | ] 802 | }, 803 | { 804 | "cell_type": "code", 805 | "execution_count": null, 806 | "id": "4da5a812-ff2a-4cf0-bfad-b65f7108d3de", 807 | "metadata": { 808 | "tags": [] 809 | }, 810 | "outputs": [], 811 | "source": [ 812 | "dask.utils.format_bytes(ddf.partitions[1].compute().memory_usage(deep=True).sum())" 813 | ] 814 | }, 815 | { 816 | "cell_type": "markdown", 817 | "id": "76399f57", 818 | "metadata": {}, 819 | "source": [ 820 | "Nice! 
With PyArrow strings, our partitions are noticeably smaller, and we can once again repartition our data to land at a solid 100 MB partition size:" 821 | ] 822 | }, 823 | { 824 | "cell_type": "code", 825 | "execution_count": null, 826 | "id": "2da4346f-a50f-4c32-9596-73f6d95c1dd1", 827 | "metadata": { 828 | "tags": [] 829 | }, 830 | "outputs": [], 831 | "source": [ 832 | "%%time\n", 833 | "\n", 834 | "ddf = ddf.repartition(\"100MB\")\n", 835 | "ddf = ddf.persist()\n", 836 | "wait(ddf);" 837 | ] 838 | }, 839 | { 840 | "cell_type": "markdown", 841 | "id": "29c5a0a3", 842 | "metadata": {}, 843 | "source": [ 844 | "With these new data types, we can now see that the analyses results in an even smaller task graph; on top of that, the improved performance of the PyArrow strings means that each individual task is more performant:" 845 | ] 846 | }, 847 | { 848 | "cell_type": "code", 849 | "execution_count": null, 850 | "id": "c8512af8-3a9f-4c85-a3c5-0643fb8b6331", 851 | "metadata": { 852 | "tags": [] 853 | }, 854 | "outputs": [], 855 | "source": [ 856 | "%%time\n", 857 | "\n", 858 | "(ddf.tips != 0).mean().compute()" 859 | ] 860 | }, 861 | { 862 | "cell_type": "code", 863 | "execution_count": null, 864 | "id": "e4122917-3f67-42e2-83c6-e9c40d6e954d", 865 | "metadata": { 866 | "tags": [] 867 | }, 868 | "outputs": [], 869 | "source": [ 870 | "%%time\n", 871 | "\n", 872 | "ddf.loc[lambda x: x.tips > 0].groupby(\"service_names\").tips.sum().compute()" 873 | ] 874 | }, 875 | { 876 | "cell_type": "code", 877 | "execution_count": null, 878 | "id": "9aedb328-377d-4e35-9c8b-793700dc9a16", 879 | "metadata": { 880 | "tags": [] 881 | }, 882 | "outputs": [], 883 | "source": [ 884 | "%%time\n", 885 | "\n", 886 | "ddf.loc[lambda x: x.tips > 0].groupby(\"service_names\").tips.mean().compute()" 887 | ] 888 | }, 889 | { 890 | "cell_type": "markdown", 891 | "id": "a15d4325", 892 | "metadata": {}, 893 | "source": [ 894 | "Note that as of `dask=2023.3.1`, we can skip the effort of manually recasting Python object columns to PyArrow strings by modifying the value of `dataframe.convert-string` in our Dask config:" 895 | ] 896 | }, 897 | { 898 | "cell_type": "code", 899 | "execution_count": null, 900 | "id": "29eef754-42e6-4251-b7da-acb4468305dd", 901 | "metadata": { 902 | "tags": [] 903 | }, 904 | "outputs": [], 905 | "source": [ 906 | "# dask.config.set({\"dataframe.convert-string\": True});" 907 | ] 908 | }, 909 | { 910 | "cell_type": "markdown", 911 | "id": "410f2fbf", 912 | "metadata": {}, 913 | "source": [ 914 | "The benefits of PyArrow strings aren't just limited to computation. 
By setting them as the default data type when reading in Parquet data, we can also improve the performance of I/O.\n", 915 | "\n", 916 | "# Summary\n", 917 | "\n", 918 | "In this notebook, we took a look at a representative Dask DataFrame workload that could benefit from Dask.\n", 919 | "\n", 920 | "Starting from a suboptimal place performance-wise, we explored the dashboard to find potentials areas for improvement.\n", 921 | "We then went through some basic Dask best practices that allowed us to shrink our task graph and improve the performance of individual tasks, which was reflected both in our analyses runtimes and dashboard plots.\n", 922 | "\n", 923 | "# Additional Resources\n", 924 | "\n", 925 | "- Repositories on GitHub:\n", 926 | " - Dask https://github.com/dask/dask\n", 927 | " - Distributed https://github.com/dask/distributed\n", 928 | "\n", 929 | "- Documentation:\n", 930 | " - Dask documentation https://docs.dask.org\n", 931 | " - Distributed documentation https://distributed.dask.org\n", 932 | "\n", 933 | "- If you have a Dask usage questions, please ask it on the [Dask GitHub discussions board](https://github.com/dask/dask/discussions).\n", 934 | "\n", 935 | "- If you run into a bug, feel free to file a report on the [Dask GitHub issue tracker](https://github.com/dask/dask/issues).\n", 936 | "\n", 937 | "- If you're interested in getting involved and contributing to Dask. Please check out our [contributing guide](https://docs.dask.org/en/latest/develop.html).\n", 938 | "\n", 939 | "# Thank you!" 940 | ] 941 | } 942 | ], 943 | "metadata": { 944 | "kernelspec": { 945 | "display_name": "Python 3 (ipykernel)", 946 | "language": "python", 947 | "name": "python3" 948 | }, 949 | "language_info": { 950 | "codemirror_mode": { 951 | "name": "ipython", 952 | "version": 3 953 | }, 954 | "file_extension": ".py", 955 | "mimetype": "text/x-python", 956 | "name": "python", 957 | "nbconvert_exporter": "python", 958 | "pygments_lexer": "ipython3", 959 | "version": "3.10.12" 960 | } 961 | }, 962 | "nbformat": 4, 963 | "nbformat_minor": 5 964 | } 965 | -------------------------------------------------------------------------------- /2-custom-operations.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f3e6c3f1-9673-4eb0-81b1-3fe9ac7e84cf", 6 | "metadata": {}, 7 | "source": [ 8 | "\"Dask\n", 9 | " \n", 10 | "# Custom Operations\n", 11 | "\n", 12 | "In the overview notebook, we discussed some of the many algorithms that are pre-defined for different types of Dask collections\n", 13 | "(such as Arrays and DataFrames).\n", 14 | "These include operations like `mean`, `max`, `value_counts`, and many other standard operations.\n", 15 | "\n", 16 | "In this notebook we'll:\n", 17 | " - explore how those operations are implemented\n", 18 | " - learn how to construct our own custom operations and optimizations\n", 19 | " - get a sense of when you might need to write a custom operations\n", 20 | "\n", 21 | "**Related Documentation**\n", 22 | "\n", 23 | " - [Array Tutorial](https://tutorial.dask.org/02_array.html)\n", 24 | " - [Best Practices for Customization](https://docs.dask.org/en/latest/best-practices.html#learn-techniques-for-customization)" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "id": "7c0fe241-96cf-414d-b5ee-d700b51651df", 30 | "metadata": {}, 31 | "source": [ 32 | "## Blocked Algorithms\n", 33 | "\n", 34 | "Dask computations are implemented using _blocked algorithms_. 
These algorithms break up a computation on a large array into many computations on smaller pieces of the array. This minimizes the memory load (amount of RAM) of computations and allows for working with larger-than-memory datasets in parallel." 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "id": "fca494a4-362a-4503-9725-e79ebeee84c4", 41 | "metadata": { 42 | "slideshow": { 43 | "slide_type": "subslide" 44 | }, 45 | "tags": [] 46 | }, 47 | "outputs": [], 48 | "source": [ 49 | "import dask.array as da\n", 50 | "\n", 51 | "x = da.random.random(size=(1_000, 1_000), chunks=(250, 500))\n", 52 | "x" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "id": "849c5b30-fab9-4cd8-9e7f-a2dd4bbf1f4c", 58 | "metadata": {}, 59 | "source": [ 60 | "In the overview notebook we looked at the task graph for the following computation:" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "id": "776c7645-3c80-4390-a8ed-dde0d4463712", 67 | "metadata": { 68 | "slideshow": { 69 | "slide_type": "subslide" 70 | }, 71 | "tags": [] 72 | }, 73 | "outputs": [], 74 | "source": [ 75 | "result = (x + x.T).sum(axis=0).mean()" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "id": "b9f181e4-0205-4f9a-89db-4f8f3a54d801", 81 | "metadata": {}, 82 | "source": [ 83 | "Now let's break that down a bit and look at the task graph for just one part of that computation." 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "id": "ad44e637-1782-4eb1-89c3-0025230e1fce", 90 | "metadata": { 91 | "tags": [] 92 | }, 93 | "outputs": [], 94 | "source": [ 95 | "x.T.visualize()" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "id": "4087e539-2d92-4ed4-98ab-03471de15c8d", 101 | "metadata": {}, 102 | "source": [ 103 | "This graph demonstrates how blocked algorithms work. In the perfectly parallelizable situation, Dask can operate on each block in isolation and then reassemble the results from the outputs. Dask makes it easy to construct graphs like this using a NumPy-like API. " 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "id": "negative-minority", 109 | "metadata": {}, 110 | "source": [ 111 | "## Custom Block Computations\n", 112 | "Block computations operate on a per-block basis. So each block gets the function applied to it, and the output has the same chunk location as the input.\n", 113 | "\n", 114 | "Some examples include the following:\n", 115 | "- custom I/O operations\n", 116 | "- applying embarrassingly parallel functions for which there is no existing Dask implementation\n", 117 | "\n", 118 | "![map_blocks](images/custom_operations_map_blocks.png)\n", 119 | "\n", 120 | "**Related Documentation**\n", 121 | "\n", 122 | " - [`dask.array.map_blocks`](https://docs.dask.org/en/latest/generated/dask.array.map_blocks.html#dask.array.map_blocks)\n", 123 | " - [`dask.dataframe.map_partitions`](https://dask.pydata.org/en/latest/generated/dask.dataframe.DataFrame.map_partitions.html#dask.dataframe.DataFrame.map_partitions)" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "id": "wired-albert", 129 | "metadata": {}, 130 | "source": [ 131 | "### `map_blocks`\n", 132 | "\n", 133 | "Let's imagine that there was no `da.random.random` method. We can create our own version using `map_blocks`. 
" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "id": "8262fd43-b432-4e40-aced-4f9ad633d0c7", 140 | "metadata": { 141 | "tags": [] 142 | }, 143 | "outputs": [], 144 | "source": [ 145 | "import numpy as np\n", 146 | "\n", 147 | "def random_sample():\n", 148 | " return np.random.random(size=(250, 500))\n", 149 | "\n", 150 | "x = da.map_blocks(random_sample, chunks=((250, 250, 250, 250), (500, 500)), dtype=float)\n", 151 | "x" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": null, 157 | "id": "8660b3ed-494e-4c41-af0f-59b0e1626194", 158 | "metadata": { 159 | "tags": [] 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "x.visualize()" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "id": "requested-congo", 169 | "metadata": {}, 170 | "source": [ 171 | "#### Understanding `chunks` argument\n", 172 | "\n", 173 | "In the example above we explicitly declare what the size of the output chunks will be ``chunks=((250, 250, 250, 250), (500, 500))`` this means 8 chunks each with shape `(250, 500)` you'll also see the chunks argument written in the short version where only the shape of one chunk is defined ``chunks=(250, 500)``. These mean the same thing.\n", 174 | "\n", 175 | "Specifying the output chunks is very useful when doing more involved operations with ``map_blocks``. By specifying ``chunks``, you can guarantee that the output will have the right shape. Having the right shape lets you properly chain together other operations. " 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "id": "555f0551-bf38-4e95-a8a4-3aa9ac5b26fd", 181 | "metadata": {}, 182 | "source": [ 183 | "In that example we created an array from scratch by passing in `dtype` and `chunks`. Next we'll consider the case of applying `map_blocks` to existing arrays." 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "id": "nervous-devon", 189 | "metadata": {}, 190 | "source": [ 191 | "#### Multiple arrays\n", 192 | "\n", 193 | "``map_blocks`` can be used on single arrays or to combine several arrays. When multiple arrays are passed, ``map_blocks``\n", 194 | "aligns blocks by block location without regard to shape.\n", 195 | "\n", 196 | "In the following example we have two arrays with the same number of blocks\n", 197 | "but with different shape and chunk sizes." 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": null, 203 | "id": "effective-correspondence", 204 | "metadata": { 205 | "tags": [] 206 | }, 207 | "outputs": [], 208 | "source": [ 209 | "a = da.arange(1000, chunks=(100,))\n", 210 | "b = da.arange(100, chunks=(10,))" 211 | ] 212 | }, 213 | { 214 | "cell_type": "markdown", 215 | "id": "biblical-maldives", 216 | "metadata": {}, 217 | "source": [ 218 | "Let's take a look at these arrays:" 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": null, 224 | "id": "representative-twins", 225 | "metadata": { 226 | "tags": [] 227 | }, 228 | "outputs": [], 229 | "source": [ 230 | "a" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": null, 236 | "id": "sound-johns", 237 | "metadata": { 238 | "tags": [] 239 | }, 240 | "outputs": [], 241 | "source": [ 242 | "b" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "id": "polish-excess", 248 | "metadata": {}, 249 | "source": [ 250 | "We can pass these arrays into ``map_blocks`` using a function that takes two inputs, calculates the max of each, then returns a numpy array of the outputs. 
" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": null, 256 | "id": "exceptional-aberdeen", 257 | "metadata": { 258 | "tags": [] 259 | }, 260 | "outputs": [], 261 | "source": [ 262 | "def func(a, b):\n", 263 | " return np.array([a.max(), b.max()])\n", 264 | "\n", 265 | "result = da.map_blocks(func, a, b, chunks=(2,))\n", 266 | "result.visualize()" 267 | ] 268 | }, 269 | { 270 | "cell_type": "markdown", 271 | "id": "adapted-output", 272 | "metadata": {}, 273 | "source": [ 274 | "#### Special arguments\n", 275 | "\n", 276 | "There are special arguments (``block_info`` and ``block_id``) that you can use within ``map_blocks`` functions. \n", 277 | "\n", 278 | " - ``block_id`` gives the index of the block within the chunks, so for a 1D array it will be something like `(i,)`. \n", 279 | " - ``block_info`` is a dictionary where there is an integer key for each input dask array and a `None` key for the output array.\n", 280 | " \n", 281 | "These special arguments let you know where you are in the ``map_blocks`` from within the custom function." 282 | ] 283 | }, 284 | { 285 | "cell_type": "markdown", 286 | "id": "atlantic-interim", 287 | "metadata": {}, 288 | "source": [ 289 | "### ``map_partitions``\n", 290 | "\n", 291 | "In Dask dataframe there is a similar method to ``map_blocks`` but it is called ``map_partitions``.\n", 292 | "\n", 293 | "Here is an example of using it to check if the sum of two columns is greater than some arbitrary threshold." 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": null, 299 | "id": "d01aeac1-a313-4ab4-b940-719ee5fbd116", 300 | "metadata": { 301 | "tags": [] 302 | }, 303 | "outputs": [], 304 | "source": [ 305 | "import dask\n", 306 | "import dask.dataframe as dd\n", 307 | "\n", 308 | "ddf = dask.datasets.timeseries()\n", 309 | "\n", 310 | "result = ddf.map_partitions(lambda df, threshold: (df.x + df.y) > 0, threshold=0)\n", 311 | "result.visualize()" 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "id": "collected-preparation", 317 | "metadata": {}, 318 | "source": [ 319 | "#### Internal uses\n", 320 | "\n", 321 | "In practice ``map_partitions`` is used to implement many of the helper dataframe methods\n", 322 | "that let Dask dataframe mimic Pandas. Here is the [implementation](https://github.com/dask/dask/blob/67e648922512615f94f8a90726423e721d0e3eb2/dask/dataframe/core.py#L662-L671) of `ddf.index` for instance:\n", 323 | "\n", 324 | "```python\n", 325 | "@property\n", 326 | "def index(self):\n", 327 | " \"\"\"Return dask Index instance\"\"\"\n", 328 | " return self.map_partitions(\n", 329 | " getattr,\n", 330 | " \"index\",\n", 331 | " token=key_split(self._name) + \"-index\",\n", 332 | " meta=self._meta.index,\n", 333 | " enforce_metadata=False,\n", 334 | " )\n", 335 | "```" 336 | ] 337 | }, 338 | { 339 | "cell_type": "markdown", 340 | "id": "optical-norwegian", 341 | "metadata": {}, 342 | "source": [ 343 | "#### Understanding `meta` argument\n", 344 | "\n", 345 | "Dask dataframes and dask arrays have a special attribute called `_meta` that allows them to know metadata about the type of dataframe/array that they represent. 
This metadata includes:\n", 346 | " - dtype (int, float)\n", 347 | " - column names and order\n", 348 | " - name\n", 349 | " - type (pandas dataframe, cudf dataframe)\n", 350 | " \n", 351 | "**Related documentation**\n", 352 | "\n", 353 | "- [Dataframe metadata](https://docs.dask.org/en/latest/dataframe-design.html#metadata)\n", 354 | "\n", 355 | "This information is stored in an empty object of the proper type." 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": null, 361 | "id": "external-treaty", 362 | "metadata": { 363 | "tags": [] 364 | }, 365 | "outputs": [], 366 | "source": [ 367 | "print(ddf._meta)" 368 | ] 369 | }, 370 | { 371 | "cell_type": "markdown", 372 | "id": "tender-architect", 373 | "metadata": {}, 374 | "source": [ 375 | "That's how Dask knows what to render when you display a dask object:" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": null, 381 | "id": "operating-estimate", 382 | "metadata": { 383 | "tags": [] 384 | }, 385 | "outputs": [], 386 | "source": [ 387 | "print(ddf)" 388 | ] 389 | }, 390 | { 391 | "cell_type": "markdown", 392 | "id": "conscious-prairie", 393 | "metadata": {}, 394 | "source": [ 395 | "When you add an item to the task graph, Dask tries to run the function on the meta before you call compute. \n", 396 | "\n", 397 | "This approach has several benefits:\n", 398 | "\n", 399 | "- it gives Dask a sense of what the output will look like. \n", 400 | "- if there are fundamental issues, Dask will fail fast\n", 401 | "\n", 402 | "Here's a few examples. " 403 | ] 404 | }, 405 | { 406 | "cell_type": "code", 407 | "execution_count": null, 408 | "id": "bd3fb018-de20-4f75-9d7d-441a6ec3a97b", 409 | "metadata": { 410 | "tags": [] 411 | }, 412 | "outputs": [], 413 | "source": [ 414 | "ddf.sum()" 415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "execution_count": null, 420 | "id": "essential-excitement", 421 | "metadata": { 422 | "tags": [] 423 | }, 424 | "outputs": [], 425 | "source": [ 426 | "ddf.name.str.startswith(\"A\")" 427 | ] 428 | }, 429 | { 430 | "cell_type": "markdown", 431 | "id": "dcfa14a6-fd76-4dff-89de-def4039b0103", 432 | "metadata": {}, 433 | "source": [ 434 | "See how the output looks right? The dtypes are correct, the type is a `Series` rather than a `DataFrame` like the input.\n", 435 | "\n", 436 | "**Exercise**\n", 437 | "\n", 438 | "Try using `startswith` on a different column and see what you get :)" 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": null, 444 | "id": "7344e1d3-e417-4324-82e0-57e8efab5fd4", 445 | "metadata": { 446 | "jupyter": { 447 | "source_hidden": true 448 | }, 449 | "tags": [] 450 | }, 451 | "outputs": [], 452 | "source": [ 453 | "# solution\n", 454 | "ddf.x.str.startswith(\"A\")" 455 | ] 456 | }, 457 | { 458 | "cell_type": "markdown", 459 | "id": "congressional-memory", 460 | "metadata": {}, 461 | "source": [ 462 | "### Declaring meta\n", 463 | "\n", 464 | "When setting up custom operations, sometimes running the function on a miniature version of the data doesn't produce a result that is similar enough to your expected output.\n", 465 | "\n", 466 | "In those cases you can manually provide a `meta` to use as the `meta` of the output. You're basically telling Dask what to expect from the output. 
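If you want to see exactly what empty placeholder a given `meta` specification turns into, the `make_meta` helper can be useful. A small sketch; note that `make_meta` lives in Dask's internal utilities (`dask.dataframe.utils` at the time of writing), so treat the import location as an implementation detail:

```python
import pandas as pd
from dask.dataframe.utils import make_meta

# A pandas object maps to an empty object with the same structure and dtypes
print(make_meta(pd.DataFrame({"x": [1.0], "name": ["a"]})))

# A (name, dtype) tuple maps to an empty Series, like the meta used below
print(make_meta(("greater_than", bool)))
```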
" 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": null, 472 | "id": "royal-carroll", 473 | "metadata": { 474 | "tags": [] 475 | }, 476 | "outputs": [], 477 | "source": [ 478 | "result = ddf.map_partitions(lambda df, threshold: (df.x + df.y) > threshold, threshold=0, meta=(\"greater_than\", bool))\n", 479 | "result" 480 | ] 481 | }, 482 | { 483 | "cell_type": "markdown", 484 | "id": "4627e5ae-f31a-4868-85d4-3b8119870c49", 485 | "metadata": {}, 486 | "source": [ 487 | "Note that Dask is trusting you to give it the correct meta. There is no enforcement." 488 | ] 489 | }, 490 | { 491 | "cell_type": "markdown", 492 | "id": "expressed-sharing", 493 | "metadata": {}, 494 | "source": [ 495 | "### `map_overlap`\n", 496 | "Sometimes you want to operate on a per-block basis, but you need some information from neighboring blocks. \n", 497 | "\n", 498 | "Example operations include the following:\n", 499 | "\n", 500 | "- Convolve a filter across an image\n", 501 | "- Rolling sum/mean/max, …\n", 502 | "- Search for image motifs like a Gaussian blob that might span the border of a block\n", 503 | "- Evaluate a partial derivative\n", 504 | "\n", 505 | "Dask Array supports these operations by creating a new array where each block is slightly expanded by the borders of its neighbors. \n", 506 | "\n", 507 | "![](https://docs.dask.org/en/latest/_images/overlapping-neighbors.svg)\n", 508 | "\n", 509 | "This costs an excess copy and the communication of many small chunks, but allows localized functions to evaluate in an embarrassingly parallel manner.\n", 510 | "\n", 511 | "**Related Documentation**\n", 512 | " - [Array Overlap](https://docs.dask.org/en/latest/array-overlap.html)\n", 513 | "\n", 514 | "The main API for these computations is the ``map_overlap`` method. ``map_overlap`` is very similar to ``map_blocks`` but has the additional arguments: ``depth``, ``boundary``, and ``trim``.\n", 515 | "\n", 516 | "Here is an example of calculating the derivative:" 517 | ] 518 | }, 519 | { 520 | "cell_type": "code", 521 | "execution_count": null, 522 | "id": "8e78c226-a318-4772-959d-f099f8564185", 523 | "metadata": { 524 | "tags": [] 525 | }, 526 | "outputs": [], 527 | "source": [ 528 | "import numpy as np\n", 529 | "import dask.array as da\n", 530 | "import matplotlib.pyplot as plt\n", 531 | "\n", 532 | "a = np.array([1, 1, 2, 3, 3, 3, 2, 1, 1])\n", 533 | "a = da.from_array(a, chunks=5)\n", 534 | "\n", 535 | "plt.plot(a)" 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": null, 541 | "id": "cde80b97-1781-4104-8ca2-b43e4544fc4c", 542 | "metadata": { 543 | "tags": [] 544 | }, 545 | "outputs": [], 546 | "source": [ 547 | "def derivative(a):\n", 548 | " return a - np.roll(a, 1)\n", 549 | "\n", 550 | "b = a.map_overlap(derivative, depth=1, boundary=None)\n", 551 | "b.compute()" 552 | ] 553 | }, 554 | { 555 | "cell_type": "markdown", 556 | "id": "restricted-lying", 557 | "metadata": {}, 558 | "source": [ 559 | "In this case each block shares 1 value from its neighboring block: ``depth``. And since we set ``boundary=0``on the outer edges of the array, the first and last block are padded with the integer 0. Since we haven't specified ``trim`` it is true by default meaning that the overlap is removed before returning the results.\n", 560 | "\n", 561 | "If you inspect the task graph you'll see two mostly independent towers of tasks, with just some value sharing at the edges." 
562 | ] 563 | }, 564 | { 565 | "cell_type": "code", 566 | "execution_count": null, 567 | "id": "eight-mirror", 568 | "metadata": { 569 | "tags": [] 570 | }, 571 | "outputs": [], 572 | "source": [ 573 | "b.visualize(collapse_outputs=True)" 574 | ] 575 | }, 576 | { 577 | "cell_type": "markdown", 578 | "id": "great-crowd", 579 | "metadata": {}, 580 | "source": [ 581 | "**Exercise**\n", 582 | "\n", 583 | "Lets apply a Gaussian filter to an image following the example from the [SciPy docs](https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.gaussian_filter.html).\n", 584 | "\n", 585 | "First create a Dask array from the NumPy array:" 586 | ] 587 | }, 588 | { 589 | "cell_type": "code", 590 | "execution_count": null, 591 | "id": "handmade-infrared", 592 | "metadata": { 593 | "tags": [] 594 | }, 595 | "outputs": [], 596 | "source": [ 597 | "from scipy.datasets import ascent\n", 598 | "import dask.array as da\n", 599 | "\n", 600 | "a = da.from_array(ascent(), chunks=(128, 128))\n", 601 | "a" 602 | ] 603 | }, 604 | { 605 | "cell_type": "markdown", 606 | "id": "disabled-editing", 607 | "metadata": {}, 608 | "source": [ 609 | "Now use ``map_overlap`` to apply ``gaussian_filter`` to each block.\n", 610 | "\n", 611 | "```python\n", 612 | "from scipy.ndimage import gaussian_filter\n", 613 | "\n", 614 | "b = a.map_overlap(gaussian_filter, sigma=5, ...)\n", 615 | "```" 616 | ] 617 | }, 618 | { 619 | "cell_type": "code", 620 | "execution_count": null, 621 | "id": "informative-worcester", 622 | "metadata": { 623 | "jupyter": { 624 | "source_hidden": true 625 | }, 626 | "tags": [] 627 | }, 628 | "outputs": [], 629 | "source": [ 630 | "# solution\n", 631 | "from scipy.ndimage import gaussian_filter\n", 632 | "\n", 633 | "b = a.map_overlap(gaussian_filter, sigma=5, depth=10, boundary=\"periodic\")" 634 | ] 635 | }, 636 | { 637 | "cell_type": "markdown", 638 | "id": "printable-anaheim", 639 | "metadata": {}, 640 | "source": [ 641 | "Check what you've come up with by plotting the results:" 642 | ] 643 | }, 644 | { 645 | "cell_type": "code", 646 | "execution_count": null, 647 | "id": "satisfactory-basket", 648 | "metadata": { 649 | "tags": [] 650 | }, 651 | "outputs": [], 652 | "source": [ 653 | "import matplotlib.pyplot as plt\n", 654 | "\n", 655 | "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))\n", 656 | "ax1.imshow(a)\n", 657 | "ax2.imshow(b)\n", 658 | "plt.show()" 659 | ] 660 | }, 661 | { 662 | "cell_type": "markdown", 663 | "id": "2908a29e-061e-4964-9018-c1ea1240ba42", 664 | "metadata": {}, 665 | "source": [ 666 | "Notice that if you set the depth to a smaller value, you can see the edges of the blocks in the output image." 667 | ] 668 | }, 669 | { 670 | "cell_type": "markdown", 671 | "id": "22cb0c3f-1aca-42db-9ec3-565a45c68a45", 672 | "metadata": {}, 673 | "source": [ 674 | "## Reduction\n", 675 | "Each dask collection has a `reduction` method. 
This is the generalized method that supports operations that reduce the dimensionality of the inputs.\n", 676 | "\n", 677 | "![Custom operations: reduction](images/custom_operations_reduction.png)\n", 678 | "\n", 679 | "**Related Documentation**\n", 680 | " - [`dask.array.reduction`](https://dask.pydata.org/en/latest/generated/dask.array.reduction.html#dask.array.reduction)\n", 681 | " - [`dask.dataframe.reduction`](https://dask.pydata.org/en/latest/generated/dask.dataframe.DataFrame.reduction.html#dask.dataframe.DataFrame.reduction)" 682 | ] 683 | }, 684 | { 685 | "cell_type": "markdown", 686 | "id": "unexpected-representation", 687 | "metadata": {}, 688 | "source": [ 689 | "### Internal uses\n", 690 | "\n", 691 | "This is the [internal definition](https://github.com/dask/dask/blob/67e648922512615f94f8a90726423e721d0e3eb2/dask/array/reductions.py#L394-L407) of `sum` on `dask.Array`. In it you can see that there is a\n", 692 | "regular `np.sum` applied across each block and then tree-reduced with `np.sum` again.\n", 693 | "\n", 694 | "```python\n", 695 | "def sum(a, axis=None, dtype=None, keepdims=False, split_every=None, out=None):\n", 696 | " if dtype is None:\n", 697 | " dtype = getattr(np.zeros(1, dtype=a.dtype).sum(), \"dtype\", object)\n", 698 | " result = reduction(\n", 699 | " a,\n", 700 | " chunk.sum,\n", 701 | " chunk.sum,\n", 702 | " axis=axis,\n", 703 | " keepdims=keepdims,\n", 704 | " dtype=dtype,\n", 705 | " split_every=split_every,\n", 706 | " out=out,\n", 707 | " )\n", 708 | " return result\n", 709 | "```\n", 710 | "\n", 711 | "Here is `da.sum` reimplemented as a custom reduction:" 712 | ] 713 | }, 714 | { 715 | "cell_type": "code", 716 | "execution_count": null, 717 | "id": "0eab1d5a-c839-4b7c-8fed-9bf88bfa12d5", 718 | "metadata": { 719 | "tags": [] 720 | }, 721 | "outputs": [], 722 | "source": [ 723 | "da.reduction(x, np.sum, np.sum, dtype=x.dtype).visualize()" 724 | ] 725 | }, 726 | { 727 | "cell_type": "markdown", 728 | "id": "negative-display", 729 | "metadata": {}, 730 | "source": [ 731 | "By visualizing `b`, we can see how the tree reduction works. First, `sum` is applied to each block, then every 4 chunks are combined using `sum-partial`.\n", 732 | "This keeps going until there are less than 4 results left, then `sum-aggregate` is used to finish up." 733 | ] 734 | }, 735 | { 736 | "cell_type": "markdown", 737 | "id": "greek-heritage", 738 | "metadata": {}, 739 | "source": [ 740 | "**Exercise**\n", 741 | "\n", 742 | "See how the graph changes when you set the chunks - maybe to `(100, 250)` or `(250, 250)`:" 743 | ] 744 | }, 745 | { 746 | "cell_type": "code", 747 | "execution_count": null, 748 | "id": "4b7a74a2-4e02-4807-b4d1-2221818ef95d", 749 | "metadata": { 750 | "jupyter": { 751 | "source_hidden": true 752 | }, 753 | "tags": [] 754 | }, 755 | "outputs": [], 756 | "source": [ 757 | "# solution\n", 758 | "x = da.random.random(size=(1_000, 1_000), chunks=(100, 250))\n", 759 | "x.sum().visualize()" 760 | ] 761 | }, 762 | { 763 | "cell_type": "markdown", 764 | "id": "cb0f3153-ac83-46c3-abaa-3e487a81c0b4", 765 | "metadata": {}, 766 | "source": [ 767 | "### Understanding `split_every`\n", 768 | "\n", 769 | "`split_every` controls the number of chunk outputs that are used as input to each `partial` call. 
\n", 770 | "\n", 771 | "Here is an example of doing partial aggregation on every 5 blocks along the 0 axis and every 2 blocks along the 1 axis (so 10 blocks go into each `partial-sum`):" 772 | ] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "execution_count": null, 777 | "id": "e57621ca-58c4-4506-964f-9cf53746e962", 778 | "metadata": { 779 | "tags": [] 780 | }, 781 | "outputs": [], 782 | "source": [ 783 | "x.sum(split_every={0: 5, 1: 2}).visualize()" 784 | ] 785 | }, 786 | { 787 | "cell_type": "markdown", 788 | "id": "902b2b79-af4d-4d79-bd2e-daa90e80bcd5", 789 | "metadata": {}, 790 | "source": [ 791 | "**Exercise**\n", 792 | "\n", 793 | "Try setting different values for `split_every` and visualizing the task graph to see the impact." 794 | ] 795 | }, 796 | { 797 | "cell_type": "code", 798 | "execution_count": null, 799 | "id": "37b1dcaa-83fa-474f-9351-dc443500595b", 800 | "metadata": { 801 | "jupyter": { 802 | "source_hidden": true 803 | }, 804 | "tags": [] 805 | }, 806 | "outputs": [], 807 | "source": [ 808 | "# solution\n", 809 | "x.sum(split_every={0: 10, 1: 2}).visualize()" 810 | ] 811 | }, 812 | { 813 | "cell_type": "markdown", 814 | "id": "8e5bd9c4-5aaf-4e09-b655-210bf0cc3169", 815 | "metadata": {}, 816 | "source": [ 817 | "> **Side note**\n", 818 | ">\n", 819 | "> You can use reductions to calculate aggregations per-block reduction even if you don't want to combine and aggregate the results of those blocks:\n", 820 | ">\n", 821 | "> ```python\n", 822 | "> da.reduction(x, np.sum, lambda x, **kwargs: x, dtype=int).compute()\n", 823 | "> ```" 824 | ] 825 | }, 826 | { 827 | "cell_type": "markdown", 828 | "id": "disturbed-snapshot", 829 | "metadata": {}, 830 | "source": [ 831 | "## When to use which method\n", 832 | "\n", 833 | "In this notebook we've covered several different mechanisms for applying arbitrary functions to the blocks in arrays or dataframes. Here's a brief summary of when you should use these various methods\n", 834 | "\n", 835 | "- `map_block`, `map_partition` - block organization of the input matches the block organization of the output and the function is fully parallelizable. \n", 836 | "- `map_overlap` - block organizations of input and output match, but the function is not fully parallelizable (requires input from neighboring chunks).\n", 837 | "- `blockwise` - same function can be applied to the blocks as to the partial and aggregated versions. Also output blocks can be in different orientations.\n", 838 | "- `reduction` - dimensionality of output does not necessarily match that of input and function is fully parallelizable.\n", 839 | "- `groupby().agg` - data needs to be aggregated per group (the index of the output will be the group keys).\n", 840 | "- `dask.delayed` - data doesn't have a complex block organization or the data is small and the computation is pretty fast." 841 | ] 842 | }, 843 | { 844 | "attachments": {}, 845 | "cell_type": "markdown", 846 | "id": "stone-surfing", 847 | "metadata": {}, 848 | "source": [ 849 | "## Customizing Optimization\n", 850 | "\n", 851 | "Dask defines a default optimization strategy for each collection type (Array,\n", 852 | "Bag, DataFrame, Delayed). However, different applications may have different\n", 853 | "needs. 
\n", 854 | "\n", 855 | "Oftentimes this looks like turning off particular types of optimization using \n", 856 | "the Dask config settings:\n", 857 | "\n", 858 | "```python\n", 859 | "dask.config.set({\"optimization.fuse.active\": False})\n", 860 | "```\n", 861 | "\n", 862 | "You can also construct your own custom optimization function and use it \n", 863 | "instead of the default. An optimization function takes in a task graph \n", 864 | "and list of desired keys and returns a new task graph:\n", 865 | "\n", 866 | "```python\n", 867 | "def my_optimize_function(dsk, keys):\n", 868 | " new_dsk = {...}\n", 869 | " return new_dsk\n", 870 | "```\n", 871 | "\n", 872 | "You can then register this optimization class against whichever collection type\n", 873 | "you prefer and it will be used instead of the default scheme:\n", 874 | "\n", 875 | "```python\n", 876 | "with dask.config.set(array_optimize=my_optimize_function):\n", 877 | " x, y = dask.compute(x, y)\n", 878 | "```\n", 879 | "\n", 880 | "You can register separate optimization functions for different collections, or\n", 881 | "you can register ``None`` if you do not want particular types of collections to\n", 882 | "be optimized:\n", 883 | "\n", 884 | "```python\n", 885 | "with dask.config.set(array_optimize=my_optimize_function,\n", 886 | " dataframe_optimize=None,\n", 887 | " delayed_optimize=my_other_optimize_function):\n", 888 | " ...\n", 889 | "```\n", 890 | "\n", 891 | "You do not need to specify all collections. Collections will default to their\n", 892 | "standard optimization scheme (which is usually a good choice).\n", 893 | "\n", 894 | "# Next steps\n", 895 | "\n", 896 | "Next, let's learn more about Dask's distributed scheduler in the [3-distributed-scheduler.ipynb](3-distributed-scheduler.ipynb) notebook." 897 | ] 898 | } 899 | ], 900 | "metadata": { 901 | "jupytext": { 902 | "cell_metadata_filter": "-all", 903 | "main_language": "python", 904 | "notebook_metadata_filter": "-all", 905 | "text_representation": { 906 | "extension": ".md", 907 | "format_name": "markdown" 908 | } 909 | }, 910 | "kernelspec": { 911 | "display_name": "Python 3 (ipykernel)", 912 | "language": "python", 913 | "name": "python3" 914 | }, 915 | "language_info": { 916 | "codemirror_mode": { 917 | "name": "ipython", 918 | "version": 3 919 | }, 920 | "file_extension": ".py", 921 | "mimetype": "text/x-python", 922 | "name": "python", 923 | "nbconvert_exporter": "python", 924 | "pygments_lexer": "ipython3", 925 | "version": "3.11.4" 926 | } 927 | }, 928 | "nbformat": 4, 929 | "nbformat_minor": 5 930 | } 931 | -------------------------------------------------------------------------------- /3-distributed-scheduler.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "9d19f489-59e6-48c7-aade-25530dcf253c", 6 | "metadata": {}, 7 | "source": [ 8 | "\"Dask\n", 9 | "\n", 10 | "# Dask clusters\n", 11 | "\n", 12 | "This notebook covers Dask's distributed clusters in more detail. We provide a more in depth look at the components of a cluster, illustrate how to inspect the internal state of a cluster, and how you can extend the functionality of your cluster using Dask's plugin system." 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "id": "c45762cb-a5a2-4f7c-ae29-02f05c78ba89", 18 | "metadata": {}, 19 | "source": [ 20 | "# Cluster overview\n", 21 | "\n", 22 | "In this section we'll discuss:\n", 23 | "\n", 24 | "1. 
The different components which make up a Dask cluster\n", 25 | "2. The different ways to launch a cluster\n", 26 | "\n", 27 | "## Components of a cluster\n", 28 | "\n", 29 | "A Dask cluster is composed of three different types of objects:\n", 30 | "\n", 31 | "1. **Scheduler**: A single, centralized scheduler process which responds to requests for computations, maintains relevant state about tasks and workers, and sends tasks to workers to be computed.\n", 32 | "2. **Workers**: One or more worker processes which compute tasks and store/serve their results.\n", 33 | "3. **Clients**: One or more client objects which are the user-facing entry point to interact with the cluster.\n", 34 | "\n", 35 | "\"Dask\n", 38 | "\n", 39 | "A couple of notes about workers:\n", 40 | "\n", 41 | "- Each worker runs in its own Python process. Each worker Python process has its own `concurrent.futures.ThreadPoolExecutor` which it uses to compute tasks in parallel. The same threads vs. processes considerations we discussed earlier also apply to Dask workers.\n", 42 | "- There's actually a fourth cluster object which is often not discussed: the **Nanny**. By default Dask workers are launched and managed by a separate nanny process. This separate process allows workers to restart themselves when you use the `Client.restart` method, or to restart workers automatically if they exceed a certain memory limit threshold.\n", 43 | "\n", 44 | "#### Related Documentation\n", 45 | "\n", 46 | "- [Cluster architecture](https://distributed.dask.org/en/latest/#architecture)\n", 47 | "- [Journey of a task](https://distributed.dask.org/en/latest/journey.html)\n", 48 | "\n", 49 | "## Deploying Dask clusters\n", 50 | "\n", 51 | "Deploying a Dask cluster means launching scheduler, worker, and client processes and setting up the appropriate network connections so these processes can communicate with one another. 
Dask clusters can be lauched in a few different ways which we highlight in the following sections.\n", 52 | "\n", 53 | "\n", 54 | "### Manual setup\n", 55 | "\n", 56 | "Launch a scheduler process using the `dask-scheduler` command line utility:\n", 57 | "\n", 58 | "```terminal\n", 59 | "$ dask-scheduler\n", 60 | "Scheduler at: tcp://192.0.0.100:8786\n", 61 | "```\n", 62 | "\n", 63 | "and then launch several workers by using the `dask-worker` command and providing them the address of the scheduler they should connect to:\n", 64 | "\n", 65 | "```terminal\n", 66 | "$ dask-worker tcp://192.0.0.100:8786\n", 67 | "Start worker at: tcp://192.0.0.1:12345\n", 68 | "Registered to: tcp://192.0.0.100:8786\n", 69 | "\n", 70 | "$ dask-worker tcp://192.0.0.100:8786\n", 71 | "Start worker at: tcp://192.0.0.2:40483\n", 72 | "Registered to: tcp://192.0.0.100:8786\n", 73 | "\n", 74 | "$ dask-worker tcp://192.0.0.100:8786\n", 75 | "Start worker at: tcp://192.0.0.3:27372\n", 76 | "Registered to: tcp://192.0.0.100:8786\n", 77 | "```\n", 78 | "\n", 79 | "### Python API (advanced)\n", 80 | "\n", 81 | "⚠️ **Warning**: Creating `Scheduler` / `Worker` objects explicitly in Python is rarely needed in practice and is intended for more advanced users ⚠️" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "id": "9adfaf00-139c-4a4e-8971-318c9d8c31c7", 88 | "metadata": { 89 | "tags": [] 90 | }, 91 | "outputs": [], 92 | "source": [ 93 | "from dask.distributed import Scheduler, Worker, Client" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "id": "392a2886-05e0-4af7-8cb2-63015ba13e5a", 100 | "metadata": { 101 | "tags": [] 102 | }, 103 | "outputs": [], 104 | "source": [ 105 | "# Launch a scheduler\n", 106 | "async with Scheduler() as scheduler: # Launch a scheduler\n", 107 | " # Launch a worker which connects to the scheduler\n", 108 | " async with Worker(scheduler.address) as worker:\n", 109 | " # Launch a client which connects to the scheduler\n", 110 | " async with Client(scheduler.address, asynchronous=True) as client:\n", 111 | " result = await client.submit(sum, range(100))\n", 112 | " print(f\"{result = }\")" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "id": "ec8031ac-58a0-40bb-a0b1-07847d54b964", 118 | "metadata": {}, 119 | "source": [ 120 | "### Cluster managers (recommended)\n", 121 | "\n", 122 | "Dask has the notion of cluster manager objects. Cluster managers offer a consistent interface for common activities like adding/removing workers to a cluster, retrieving logs, etc." 
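For example, in addition to the manual scaling shown in the cells below, most cluster managers also support adaptive scaling, where the cluster resizes itself based on the amount of queued work. Here is a minimal sketch (assuming a local machine with a few spare cores; the worker counts are arbitrary):

```python
from dask.distributed import Client, LocalCluster

# Start small; the cluster manager launches the scheduler and workers for us
cluster = LocalCluster(n_workers=2, threads_per_worker=2)
client = Client(cluster)

# Let the cluster add or remove workers (between 1 and 8) based on load
cluster.adapt(minimum=1, maximum=8)
```

The same `scale`/`adapt` interface generally carries over to the other cluster managers discussed below.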
123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": null, 128 | "id": "304c076c-5f10-4760-88b2-cf4dff1aed99", 129 | "metadata": { 130 | "tags": [] 131 | }, 132 | "outputs": [], 133 | "source": [ 134 | "from dask.distributed import LocalCluster" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "id": "584a2ae1-32ad-4ca0-b563-c5f58595c250", 141 | "metadata": { 142 | "tags": [] 143 | }, 144 | "outputs": [], 145 | "source": [ 146 | "# Launch a scheduler and 4 workers on my local machine\n", 147 | "cluster = LocalCluster(n_workers=4, threads_per_worker=2)\n", 148 | "cluster" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": null, 154 | "id": "41daa9b9-51c5-4996-85a5-144f19714389", 155 | "metadata": { 156 | "tags": [] 157 | }, 158 | "outputs": [], 159 | "source": [ 160 | "# Scale up to 10 workers\n", 161 | "cluster.scale(10)" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "id": "c8937b5b-2559-4eb1-96ae-d76b385526c2", 168 | "metadata": { 169 | "tags": [] 170 | }, 171 | "outputs": [], 172 | "source": [ 173 | "# Scale down to 2 workers\n", 174 | "cluster.scale(2)" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": null, 180 | "id": "3b6098c3-1f8b-4e42-8d0a-996f2b44b1e3", 181 | "metadata": { 182 | "tags": [] 183 | }, 184 | "outputs": [], 185 | "source": [ 186 | "# Retrieve cluster logs\n", 187 | "cluster.get_logs()" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": null, 193 | "id": "e350ead3-229a-4974-bd89-7174d99edbef", 194 | "metadata": { 195 | "tags": [] 196 | }, 197 | "outputs": [], 198 | "source": [ 199 | "# Shut down cluster\n", 200 | "cluster.close()" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "id": "94169465-79c0-4063-b7bf-dbc9264cdfe4", 206 | "metadata": {}, 207 | "source": [ 208 | "There are several projects in the Dask ecosystem for easily deploying clusters on commonly used computing resources:\n", 209 | "\n", 210 | "- [Dask-Kubernetes](https://kubernetes.dask.org/en/latest/) for deploying Dask using native Kubernetes APIs\n", 211 | "- [Dask-Cloudprovider](https://cloudprovider.dask.org/en/latest/) for deploying Dask clusters on various cloud platforms (e.g. AWS, GCP, Azure, etc.)\n", 212 | "- [Dask-Yarn](https://yarn.dask.org/en/latest/) for deploying Dask on YARN clusters\n", 213 | "- [Dask-MPI](http://mpi.dask.org/en/latest/) for deploying Dask on existing MPI environments\n", 214 | "- [Dask-Jobqueue](https://jobqueue.dask.org/en/latest/) for deploying Dask on job queuing systems (e.g. PBS, Slurm, etc.)\n", 215 | "\n", 216 | "Launching clusters with any of these projects follows a similar pattern as using Dask's built-in `LocalCluster`:\n", 217 | "\n", 218 | "```python\n", 219 | "# Launch a Dask cluster on a Kubernetes cluster\n", 220 | "from dask_kubernetes import KubeCluster\n", 221 | "cluster = KubeCluster(...)\n", 222 | "\n", 223 | "# Launch a Dask cluster on AWS Fargate\n", 224 | "from dask_cloudprovider.aws import FargateCluster\n", 225 | "cluster = FargateCluster(...)\n", 226 | "\n", 227 | "# Launch a Dask cluster on a PBS job queueing system\n", 228 | "from dask_jobqueue import PBSCluster\n", 229 | "cluster = PBSCluster(...)\n", 230 | "```\n", 231 | "\n", 232 | "Additionally, there are companies like [Coiled](https://coiled.io) and [Saturn Cloud](https://www.saturncloud.io) which have Dask deployment-as-a-service offerings. 
*Disclaimer*: Some of the instructors for this tutorial are employed by Coiled. \n", 233 | "\n", 234 | "#### Related Documentation\n", 235 | "\n", 236 | "- [Deploy Dask Clusters](https://docs.dask.org/en/stable/deploying.html)\n", 237 | "- [Cluster setup](https://docs.dask.org/en/latest/setup.html)" 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "id": "20ba2911-0ac4-43bd-bdb9-54a2806e5ea9", 243 | "metadata": {}, 244 | "source": [ 245 | "# Inspecting a cluster's state\n", 246 | "\n", 247 | "In this section we'll:\n", 248 | "\n", 249 | "1. Familiarize ourselves with Dask's scheduler and worker processes\n", 250 | "2. Explore the various pieces of state that are tracked throughout the cluster\n", 251 | "3. Learn how to inspect remote scheduler and worker processes\n", 252 | "\n", 253 | "Dask has a variety of ways to provide users with insight into what's going on during their computations. For example, Dask's [diagnostic dashboard](https://docs.dask.org/en/latest/diagnostics-distributed.html) displays real-time information about which tasks are currently running, overall progress on a computation, worker CPU and memory load, statistical profiling information, and much more. Additionally, Dask's [performance reports](https://distributed.dask.org/en/latest/diagnosing-performance.html#performance-reports) allow you to save the diagnostic dashboards as static HTML plots. Performance reports are particularly useful when benchmarking/profiling workloads or when sharing workload performance with colleagues." 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": null, 259 | "id": "ee7a1ae0-f029-41a8-90bc-7d6a5e6651e8", 260 | "metadata": { 261 | "tags": [] 262 | }, 263 | "outputs": [], 264 | "source": [ 265 | "from dask.distributed import LocalCluster, Client, Worker\n", 266 | "\n", 267 | "cluster = LocalCluster(worker_class=Worker, processes=True) \n", 268 | "client = Client(cluster)\n", 269 | "client" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": null, 275 | "id": "33ce7759-2113-4dcc-9581-2a473eead63b", 276 | "metadata": { 277 | "tags": [] 278 | }, 279 | "outputs": [], 280 | "source": [ 281 | "import dask.array as da\n", 282 | "from dask.distributed import performance_report\n", 283 | "\n", 284 | "with performance_report(\"my_report.html\"):\n", 285 | " x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))\n", 286 | " result = (x + x.T).mean(axis=0).mean()\n", 287 | " result.compute()" 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "id": "bed8743b-a34e-488a-bc51-abe923914cf0", 293 | "metadata": {}, 294 | "source": [ 295 | "These are invaluable tools and we highly recommend using them. Oftentimes Dask's dashboard alone is sufficient to understand the performance of your computations.\n", 296 | "\n", 297 | "However, sometimes it can be useful to dive more deeply into the internals of your cluster and directly inspect the state of your scheduler and workers. Let's start by submitting some tasks to the cluster to be computed." 
298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "id": "98cd7868-ceb2-4066-bb8f-094ad3fc0053", 304 | "metadata": { 305 | "tags": [] 306 | }, 307 | "outputs": [], 308 | "source": [ 309 | "import random\n", 310 | "\n", 311 | "def double(x):\n", 312 | " random.seed(x)\n", 313 | " # Simulate some random task failures\n", 314 | " if random.random() < 0.1:\n", 315 | " raise ValueError(\"Oh no!\")\n", 316 | " return 2 * x\n", 317 | "\n", 318 | "futures = client.map(double, range(50))" 319 | ] 320 | }, 321 | { 322 | "cell_type": "markdown", 323 | "id": "3b8a3c19-97d2-47a9-8216-3f73abe6189a", 324 | "metadata": {}, 325 | "source": [ 326 | "One of the nice things about `LocalCluster` is it gives us direct access the `Scheduler` Python object. This allows us to easily inspect the scheduler directly." 327 | ] 328 | }, 329 | { 330 | "cell_type": "code", 331 | "execution_count": null, 332 | "id": "5b842c4f-64c6-4a0b-b068-45332552d6a9", 333 | "metadata": { 334 | "tags": [] 335 | }, 336 | "outputs": [], 337 | "source": [ 338 | "scheduler = cluster.scheduler\n", 339 | "scheduler" 340 | ] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "id": "6b8e8155-68a2-4cfc-8cfb-424180e98bdb", 345 | "metadata": {}, 346 | "source": [ 347 | "ℹ️ Note that often times you won't have direct access to the `Scheduler` Python object (e.g. when the scheduler is running on separate machine). In these cases it's still possible to inspect the scheduler and we will discuss how to do this later on.\n", 348 | "\n", 349 | "The scheduler tracks **a lot** of state. Let's start to explore the scheduler to get a sense for what information it keeps track of." 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": null, 355 | "id": "a4a43b6c-210c-41a0-bfff-38c38bfb6316", 356 | "metadata": { 357 | "tags": [] 358 | }, 359 | "outputs": [], 360 | "source": [ 361 | "scheduler.address # Scheduler's address" 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": null, 367 | "id": "eb25dc76-df8c-418e-84ea-ba3d33fcd373", 368 | "metadata": { 369 | "tags": [] 370 | }, 371 | "outputs": [], 372 | "source": [ 373 | "scheduler.time_started # Time the scheduler was started" 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": null, 379 | "id": "59c8b1be-3adb-4cdd-893c-f83d71a75a92", 380 | "metadata": { 381 | "tags": [] 382 | }, 383 | "outputs": [], 384 | "source": [ 385 | "dict(scheduler.workers)" 386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": null, 391 | "id": "f9f68f77-69c6-4b17-90ce-cf179af7d287", 392 | "metadata": { 393 | "tags": [] 394 | }, 395 | "outputs": [], 396 | "source": [ 397 | "worker_state = next(iter(scheduler.workers.values()))\n", 398 | "worker_state" 399 | ] 400 | }, 401 | { 402 | "cell_type": "markdown", 403 | "id": "f5ba8180-8ef7-4bc7-ae0a-cf7b9708c9e6", 404 | "metadata": {}, 405 | "source": [ 406 | "Let's take a look at the `WorkerState` attributes" 407 | ] 408 | }, 409 | { 410 | "cell_type": "code", 411 | "execution_count": null, 412 | "id": "a73d206f-7a65-49f0-8361-3f9e37f5a73a", 413 | "metadata": { 414 | "tags": [] 415 | }, 416 | "outputs": [], 417 | "source": [ 418 | "worker_state.address # Worker's address" 419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "execution_count": null, 424 | "id": "c0d58de4-ec72-4d16-a77b-f35a05d74563", 425 | "metadata": { 426 | "tags": [] 427 | }, 428 | "outputs": [], 429 | "source": [ 430 | "worker_state.status # Current status of the worker (e.g. 
\"running\", \"closed\")" 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": null, 436 | "id": "9b0add25-1edc-4bad-92dd-c2b18947d451", 437 | "metadata": { 438 | "tags": [] 439 | }, 440 | "outputs": [], 441 | "source": [ 442 | "worker_state.nthreads # Number of threads in the worker's `ThreadPoolExecutor`" 443 | ] 444 | }, 445 | { 446 | "cell_type": "code", 447 | "execution_count": null, 448 | "id": "24be2aab-a7c7-454d-91f9-78897e32eadb", 449 | "metadata": { 450 | "tags": [] 451 | }, 452 | "outputs": [], 453 | "source": [ 454 | "worker_state.executing # Dictionary of all tasks which are currently being processed, along with the current duration of the task" 455 | ] 456 | }, 457 | { 458 | "cell_type": "code", 459 | "execution_count": null, 460 | "id": "8e414215-dc91-4468-af2f-4119945bca3a", 461 | "metadata": { 462 | "tags": [] 463 | }, 464 | "outputs": [], 465 | "source": [ 466 | "worker_state.metrics # Various metrics describing the current state of the worker" 467 | ] 468 | }, 469 | { 470 | "cell_type": "markdown", 471 | "id": "e2bbacd5-a368-4566-912c-4a4db4b8a211", 472 | "metadata": {}, 473 | "source": [ 474 | "Workers check in with the scheduler inform it when certain event occur (e.g. when a worker has completed a task) so the scheduler can update it's internal state." 475 | ] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "execution_count": null, 480 | "id": "abf2af29-be3e-4d78-a5c2-ca595d380357", 481 | "metadata": { 482 | "tags": [] 483 | }, 484 | "outputs": [], 485 | "source": [ 486 | "worker_state.last_seen" 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "execution_count": null, 492 | "id": "be601b2d-7047-44b2-81c2-5dabd344053c", 493 | "metadata": { 494 | "tags": [] 495 | }, 496 | "outputs": [], 497 | "source": [ 498 | "import time\n", 499 | "\n", 500 | "for _ in range(10):\n", 501 | " print(f\"{worker_state.last_seen = }\")\n", 502 | " time.sleep(1)" 503 | ] 504 | }, 505 | { 506 | "cell_type": "markdown", 507 | "id": "13b53603-8d32-40c8-b71d-b3ed1af57848", 508 | "metadata": {}, 509 | "source": [ 510 | "In addition to the state of each worker, the scheduler also tracks information for each task it has been asked to run." 511 | ] 512 | }, 513 | { 514 | "cell_type": "code", 515 | "execution_count": null, 516 | "id": "ca9ddfa1-7ab7-4585-961d-01522c343b42", 517 | "metadata": { 518 | "tags": [] 519 | }, 520 | "outputs": [], 521 | "source": [ 522 | "scheduler.tasks" 523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "execution_count": null, 528 | "id": "ac277222-ac14-4743-be25-d2460f5d9d9f", 529 | "metadata": { 530 | "tags": [] 531 | }, 532 | "outputs": [], 533 | "source": [ 534 | "task_state = next(iter(scheduler.tasks.values()))" 535 | ] 536 | }, 537 | { 538 | "cell_type": "code", 539 | "execution_count": null, 540 | "id": "9d8093d0-9542-42d7-a417-1e9b59a6970d", 541 | "metadata": { 542 | "tags": [] 543 | }, 544 | "outputs": [], 545 | "source": [ 546 | "task_state" 547 | ] 548 | }, 549 | { 550 | "cell_type": "code", 551 | "execution_count": null, 552 | "id": "1bc9a48c-6c3a-4f39-be74-fae5e26ea645", 553 | "metadata": { 554 | "tags": [] 555 | }, 556 | "outputs": [], 557 | "source": [ 558 | "task_state.key # Task's name (unique identifier)" 559 | ] 560 | }, 561 | { 562 | "cell_type": "code", 563 | "execution_count": null, 564 | "id": "cfe7c942-172b-4e54-9548-8e94c17da4c0", 565 | "metadata": { 566 | "tags": [] 567 | }, 568 | "outputs": [], 569 | "source": [ 570 | "task_state.state # Task's state (e.g. 
\"memory\", \"waiting\", \"processing\", \"erred\", etc.)" 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": null, 576 | "id": "b9cd9140-19ce-45dd-adc0-d74c7a671ba1", 577 | "metadata": { 578 | "tags": [] 579 | }, 580 | "outputs": [], 581 | "source": [ 582 | "task_state.who_has # Set of workers (`WorkerState`s) who have this task's result in memory" 583 | ] 584 | }, 585 | { 586 | "cell_type": "code", 587 | "execution_count": null, 588 | "id": "fce9b8e3-e8c7-49d0-8343-00d899c35a8a", 589 | "metadata": { 590 | "tags": [] 591 | }, 592 | "outputs": [], 593 | "source": [ 594 | "task_state.nbytes # The number of bytes of the result of this finished task" 595 | ] 596 | }, 597 | { 598 | "cell_type": "code", 599 | "execution_count": null, 600 | "id": "4b2ec17e-8bb3-47a7-b078-869b0019f6ae", 601 | "metadata": { 602 | "tags": [] 603 | }, 604 | "outputs": [], 605 | "source": [ 606 | "task_state.type # The type of the the task's result (as a string)" 607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": null, 612 | "id": "1ee16f2c-3f95-41f6-ad80-8530b1c33802", 613 | "metadata": { 614 | "tags": [] 615 | }, 616 | "outputs": [], 617 | "source": [ 618 | "task_state.retries # The number of times this task can automatically be retried in case of failure" 619 | ] 620 | }, 621 | { 622 | "cell_type": "markdown", 623 | "id": "f498bef9-988d-4991-981f-d62158dec477", 624 | "metadata": {}, 625 | "source": [ 626 | "## Exercise 1\n", 627 | "\n", 628 | "Spend the next 5 minutes continuing to explore the attributes the scheduler keeps track of. Try to answer the following questions:\n", 629 | "\n", 630 | "1. What are the keys for the tasks which failed?\n", 631 | "2. How many tasks successfully ran on each worker?" 632 | ] 633 | }, 634 | { 635 | "cell_type": "code", 636 | "execution_count": null, 637 | "id": "0f35a9e7-05e6-4395-9551-bf6ba39d6aad", 638 | "metadata": { 639 | "tags": [] 640 | }, 641 | "outputs": [], 642 | "source": [ 643 | "# What are the keys for the tasks which failed?\n", 644 | "# Your solution goes here" 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": null, 650 | "id": "4b4e6c15-bf52-4bf5-b841-9646770ac99e", 651 | "metadata": { 652 | "jupyter": { 653 | "source_hidden": true 654 | }, 655 | "tags": [] 656 | }, 657 | "outputs": [], 658 | "source": [ 659 | "# Solution to \"What are the keys for the tasks which failed?\"\n", 660 | "erred_tasks = [key for key, ts in scheduler.tasks.items() if ts.state == \"erred\"]\n", 661 | "erred_tasks" 662 | ] 663 | }, 664 | { 665 | "cell_type": "code", 666 | "execution_count": null, 667 | "id": "5d22af01-a1e8-400e-b049-9889c24142ae", 668 | "metadata": { 669 | "tags": [] 670 | }, 671 | "outputs": [], 672 | "source": [ 673 | "# How many tasks successfull ran on each worker?\n", 674 | "# Your solution goes here" 675 | ] 676 | }, 677 | { 678 | "cell_type": "code", 679 | "execution_count": null, 680 | "id": "b698f481-b9d6-4e9c-ac0c-325d1ef725d8", 681 | "metadata": { 682 | "jupyter": { 683 | "source_hidden": true 684 | }, 685 | "tags": [] 686 | }, 687 | "outputs": [], 688 | "source": [ 689 | "# Solution to \"How many tasks successfull ran on each worker?\"\n", 690 | "from collections import defaultdict\n", 691 | "\n", 692 | "erred_tasks = [key for key, ts in scheduler.tasks.items() if ts.state == \"erred\"]\n", 693 | "counter = defaultdict(int)\n", 694 | "for key, ts in scheduler.tasks.items():\n", 695 | " if key in erred_tasks:\n", 696 | " continue\n", 697 | " for worker in ts.who_has: \n", 698 | " 
counter[worker] += 1\n", 699 | "print(counter)\n" 700 | ] 701 | }, 702 | { 703 | "cell_type": "code", 704 | "execution_count": null, 705 | "id": "52e806fc-f8ee-431e-879c-6be9b3ea5d9c", 706 | "metadata": { 707 | "jupyter": { 708 | "source_hidden": true 709 | }, 710 | "tags": [] 711 | }, 712 | "outputs": [], 713 | "source": [ 714 | "# Solution 2 \n", 715 | "counter = {address: worker_state.metrics['task_counts']['memory'] \n", 716 | " for address, worker_state in scheduler.workers.items()}\n", 717 | "print(counter)" 718 | ] 719 | }, 720 | { 721 | "cell_type": "markdown", 722 | "id": "5e07c7c2-f157-4fbb-82cb-81b59fff7da6", 723 | "metadata": {}, 724 | "source": [ 725 | "In addition to inspecting the scheduler, we can also investigate the state of each of our workers." 726 | ] 727 | }, 728 | { 729 | "cell_type": "code", 730 | "execution_count": null, 731 | "id": "fbc95933-bff8-4a8e-9b28-430c4a042034", 732 | "metadata": { 733 | "tags": [] 734 | }, 735 | "outputs": [], 736 | "source": [ 737 | "cluster.workers" 738 | ] 739 | }, 740 | { 741 | "cell_type": "code", 742 | "execution_count": null, 743 | "id": "42800bd2-8ef2-456a-87d7-0354cf7c9626", 744 | "metadata": { 745 | "tags": [] 746 | }, 747 | "outputs": [], 748 | "source": [ 749 | "worker = next(iter(cluster.workers.values()))\n", 750 | "worker" 751 | ] 752 | }, 753 | { 754 | "cell_type": "code", 755 | "execution_count": null, 756 | "id": "b3f5ccb3-4495-4290-9209-25b2687c951e", 757 | "metadata": { 758 | "tags": [] 759 | }, 760 | "outputs": [], 761 | "source": [ 762 | "worker.address # Worker's address" 763 | ] 764 | }, 765 | { 766 | "cell_type": "code", 767 | "execution_count": null, 768 | "id": "e89df365-8547-4157-823c-0b33c6ef83e9", 769 | "metadata": { 770 | "tags": [] 771 | }, 772 | "outputs": [], 773 | "source": [ 774 | "worker.state.executing_count # Number of tasks the worker is currenting computing" 775 | ] 776 | }, 777 | { 778 | "cell_type": "code", 779 | "execution_count": null, 780 | "id": "969a7bed-a6e3-4708-9f95-1900d48b27f5", 781 | "metadata": { 782 | "tags": [] 783 | }, 784 | "outputs": [], 785 | "source": [ 786 | "worker.state.executed_count # Running total of all tasks processed on this worker" 787 | ] 788 | }, 789 | { 790 | "cell_type": "code", 791 | "execution_count": null, 792 | "id": "aaeebb27-e9c3-456b-9267-c5180559e82e", 793 | "metadata": { 794 | "tags": [] 795 | }, 796 | "outputs": [], 797 | "source": [ 798 | "worker.state.nthreads # Number of threads in the worker's ThreadPoolExecutor" 799 | ] 800 | }, 801 | { 802 | "cell_type": "code", 803 | "execution_count": null, 804 | "id": "ab866a75-7e95-4c5f-b3ff-e54306492fa2", 805 | "metadata": { 806 | "tags": [] 807 | }, 808 | "outputs": [], 809 | "source": [ 810 | "worker.executor # Worker's ThreadPoolExecutor where it computes tasks" 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": null, 816 | "id": "257fd911-8d37-421a-87ad-1f87e4f68f3e", 817 | "metadata": { 818 | "tags": [] 819 | }, 820 | "outputs": [], 821 | "source": [ 822 | "worker.keys() # Keys the worker currently has in memory" 823 | ] 824 | }, 825 | { 826 | "cell_type": "code", 827 | "execution_count": null, 828 | "id": "c1b9afab-edad-4a99-8471-71a943f81d35", 829 | "metadata": { 830 | "tags": [] 831 | }, 832 | "outputs": [], 833 | "source": [ 834 | "worker.data # Where the worker stores task results" 835 | ] 836 | }, 837 | { 838 | "cell_type": "code", 839 | "execution_count": null, 840 | "id": "5d45c97e-d8c5-4a50-882c-689ec62378b9", 841 | "metadata": { 842 | "tags": [] 843 | }, 844 | 
"outputs": [], 845 | "source": [ 846 | "{key: worker.data[key] for key in worker.keys()} # results stored on the worker" 847 | ] 848 | }, 849 | { 850 | "cell_type": "markdown", 851 | "id": "c1e20d0f-38d7-4694-a65e-e223557d5d3a", 852 | "metadata": {}, 853 | "source": [ 854 | "## Accessing remote scheduler and workers\n", 855 | "\n", 856 | "As we noted earlier, often times you won't have direct access to the `Scheduler` or `Worker` Python objects for your cluster. However, in these cases it's still possible to examine the state of the scheduler and workers in your cluster using the `Client.run` ([docs](https://distributed.dask.org/en/latest/api.html#distributed.Client.run)) and `Client.run_on_scheduler`([docs](https://distributed.dask.org/en/latest/api.html#distributed.Client.run_on_scheduler)) methods.\n", 857 | "\n", 858 | "`Client.run` allows you to run a function on worker processes in your cluster. If the function has a `dask_worker` parameter, then that variable will be populated with the `Worker` instance when the function is run. Likewise, `Client.run_on_scheduler` allows you to run a function on the scheduler processes in your cluster. If the function has a `dask_scheduler` parameter, then that variable will be populated with the `Scheduler` instance when the function is run.\n", 859 | "\n", 860 | "Let's look at an examples of custom function. If the function has a `dask_worker` parameter ..." 861 | ] 862 | }, 863 | { 864 | "cell_type": "code", 865 | "execution_count": null, 866 | "id": "f2567d89-6409-42ec-b147-c24ab84ac643", 867 | "metadata": { 868 | "tags": [] 869 | }, 870 | "outputs": [], 871 | "source": [ 872 | "def get_worker_name(dask_worker):\n", 873 | " return dask_worker.name\n", 874 | "\n", 875 | "client.run(get_worker_name)" 876 | ] 877 | }, 878 | { 879 | "cell_type": "markdown", 880 | "id": "09ac4f6b-c0ae-4c7e-a66b-1f39a3ceb83b", 881 | "metadata": {}, 882 | "source": [ 883 | "Similarly, we can do the same thing on the scheduler by using `Client.run_on_scheduler`" 884 | ] 885 | }, 886 | { 887 | "cell_type": "code", 888 | "execution_count": null, 889 | "id": "0daf0c15-c20d-421e-a21c-3ee27df6e723", 890 | "metadata": { 891 | "tags": [] 892 | }, 893 | "outputs": [], 894 | "source": [ 895 | "def get_erred_tasks(dask_scheduler):\n", 896 | " return [key for key, ts in dask_scheduler.tasks.items() if ts.state == \"erred\"]\n", 897 | "\n", 898 | "client.run_on_scheduler(get_erred_tasks)" 899 | ] 900 | }, 901 | { 902 | "cell_type": "code", 903 | "execution_count": null, 904 | "id": "780f7e4f-8ad2-4e93-ac86-e2d8c5331902", 905 | "metadata": { 906 | "tags": [] 907 | }, 908 | "outputs": [], 909 | "source": [ 910 | "client.close()\n", 911 | "cluster.close()" 912 | ] 913 | }, 914 | { 915 | "cell_type": "markdown", 916 | "id": "aec78fc7-3125-4d40-a03e-3595c384382d", 917 | "metadata": {}, 918 | "source": [ 919 | "#### Related Documentation\n", 920 | "\n", 921 | "- [Dask worker](https://distributed.dask.org/en/latest/worker.html)\n", 922 | "- [Scheduling state](https://distributed.dask.org/en/latest/scheduling-state.html)" 923 | ] 924 | }, 925 | { 926 | "cell_type": "markdown", 927 | "id": "71b05ace-1e10-4716-ae83-f7c659c46d28", 928 | "metadata": {}, 929 | "source": [ 930 | "# Extending the scheduler and workers: Dask's plugin system\n", 931 | "\n", 932 | "In this section we'll siscuss Dask's scheduler and worker plugin systems and write our own plugin to extend the scheduler's functionality.\n", 933 | "\n", 934 | "So far we've primarily focused on inspecting the state of a cluster. 
However, there are times when it's useful to extend the functionality of the scheduler and/or workers in a cluster. To help facilitate this, Dask has scheduler and worker plugin systems which enable you to hook into different events that happen throughout a cluster's lifecycle. This allows you to run custom code when a specific type of event occurs on the cluster.\n", 935 | "\n", 936 | "Specifically, the [scheduler plugin system](https://distributed.dask.org/en/latest/plugins.html#scheduler-plugins) enables you to run custom code when the following events occur:\n", 937 | "\n", 938 | "1. The scheduler starts, stops, or is restarted\n", 939 | "2. A client connects to or disconnects from the scheduler\n", 940 | "3. A worker enters or leaves the cluster\n", 941 | "4. A new task enters the scheduler\n", 942 | "5. A task changes state (e.g. from \"processing\" to \"memory\")\n", 943 | "\n", 944 | "Meanwhile, the [worker plugin system](https://distributed.dask.org/en/latest/plugins.html#worker-plugins) enables you to run custom code when the following events occur:\n", 945 | "\n", 946 | "1. A worker starts or stops\n", 947 | "2. A worker releases a task\n", 948 | "3. A task changes state (e.g. from \"processing\" to \"memory\")\n", 949 | "\n", 950 | "Implementing your own custom plugin consists of creating a Python class with certain methods (each method corresponds to a particular lifecycle event)." 951 | ] 952 | }, 953 | { 954 | "cell_type": "code", 955 | "execution_count": null, 956 | "id": "de6f767a-52e6-4c9f-a26a-7a642b4bb953", 957 | "metadata": { 958 | "tags": [] 959 | }, 960 | "outputs": [], 961 | "source": [ 962 | "from distributed import SchedulerPlugin, WorkerPlugin" 963 | ] 964 | }, 965 | { 966 | "cell_type": "code", 967 | "execution_count": null, 968 | "id": "6f35b157-5502-4809-a1e6-5991799c8d31", 969 | "metadata": { 970 | "tags": [] 971 | }, 972 | "outputs": [], 973 | "source": [ 974 | "# Lifecycle SchedulerPlugin methods\n", 975 | "[attr for attr in dir(SchedulerPlugin) if not attr.startswith(\"_\")]" 976 | ] 977 | }, 978 | { 979 | "cell_type": "code", 980 | "execution_count": null, 981 | "id": "69ac1db2-1170-49dd-a043-a646df6cedc6", 982 | "metadata": { 983 | "tags": [] 984 | }, 985 | "outputs": [], 986 | "source": [ 987 | "# Lifecycle WorkerPlugin methods\n", 988 | "[attr for attr in dir(WorkerPlugin) if not attr.startswith(\"_\")]" 989 | ] 990 | }, 991 | { 992 | "cell_type": "markdown", 993 | "id": "131294ca-74ee-4754-bd3e-15eba1de758b", 994 | "metadata": {}, 995 | "source": [ 996 | "For the exact signature of each method, please refer to the [`SchedulerPlugin`](https://distributed.dask.org/en/latest/plugins.html#scheduler-plugins) and [`WorkerPlugin`](https://distributed.dask.org/en/latest/plugins.html#worker-plugins) documentation.\n", 997 | "\n", 998 | "Below is a minimal worker plugin sketch; after that, let's look at an example scheduler plugin." 
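This worker plugin sketch is illustrative only (the class name and counter attribute are invented for this example); `setup`, `teardown`, and `transition` are among the documented `WorkerPlugin` hooks, and registering plugins via `Client.register_worker_plugin` is covered later in this notebook:

```python
from distributed import WorkerPlugin


class FinishedTaskCounter(WorkerPlugin):
    """Counts how many tasks finish in memory on the worker this plugin is attached to."""

    def setup(self, worker):
        # Called once when the plugin is attached to a worker
        self.worker = worker
        self.n_finished = 0

    def transition(self, key, start, finish, **kwargs):
        # Called whenever a task changes state on this worker
        if finish == "memory":
            self.n_finished += 1

    def teardown(self, worker):
        # Called when the worker shuts down
        pass


# Register on all current and future workers of an existing client, e.g.:
# client.register_worker_plugin(FinishedTaskCounter(), name="finished-task-counter")
```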
999 | ] 1000 | }, 1001 | { 1002 | "cell_type": "code", 1003 | "execution_count": null, 1004 | "id": "4fc97c90-a858-40ea-9f31-e7f90aba3e85", 1005 | "metadata": { 1006 | "tags": [] 1007 | }, 1008 | "outputs": [], 1009 | "source": [ 1010 | "class Counter(SchedulerPlugin):\n", 1011 | " \"\"\"Keeps a running count of the total number of completed tasks\"\"\"\n", 1012 | " def __init__(self):\n", 1013 | " self.n_tasks = 0\n", 1014 | "\n", 1015 | " def transition(self, key, start, finish, *args, **kwargs):\n", 1016 | " if start == \"processing\" and finish == \"memory\":\n", 1017 | " self.n_tasks += 1\n", 1018 | "\n", 1019 | " def restart(self, scheduler):\n", 1020 | " self.n_tasks = 0" 1021 | ] 1022 | }, 1023 | { 1024 | "cell_type": "markdown", 1025 | "id": "51ae7356-5a48-4315-abb4-343378b88f1f", 1026 | "metadata": {}, 1027 | "source": [ 1028 | "Check documentation on [Task States](https://distributed.dask.org/en/stable/scheduling-state.html#scheduler-task-state)" 1029 | ] 1030 | }, 1031 | { 1032 | "cell_type": "markdown", 1033 | "id": "e2586c29-e5e3-495d-84aa-d30f02639ff7", 1034 | "metadata": {}, 1035 | "source": [ 1036 | "To add a custom scheduler plugin to your cluster, use the `Scheduler.add_plugin` method:" 1037 | ] 1038 | }, 1039 | { 1040 | "cell_type": "code", 1041 | "execution_count": null, 1042 | "id": "56d0b325-2524-4c33-809c-5b853c94594b", 1043 | "metadata": { 1044 | "tags": [] 1045 | }, 1046 | "outputs": [], 1047 | "source": [ 1048 | "# Create LocalCluster and Client\n", 1049 | "cluster = LocalCluster()\n", 1050 | "client = Client(cluster)\n", 1051 | "\n", 1052 | "# Instantiate and add the Counter to our cluster\n", 1053 | "counter = Counter()\n", 1054 | "cluster.scheduler.add_plugin(counter)" 1055 | ] 1056 | }, 1057 | { 1058 | "cell_type": "code", 1059 | "execution_count": null, 1060 | "id": "485d1388-4cec-4a16-a2d1-2f2e5d3d2448", 1061 | "metadata": { 1062 | "tags": [] 1063 | }, 1064 | "outputs": [], 1065 | "source": [ 1066 | "counter.n_tasks" 1067 | ] 1068 | }, 1069 | { 1070 | "cell_type": "code", 1071 | "execution_count": null, 1072 | "id": "220a76d3-25d4-4a3b-800e-fea64b055cb7", 1073 | "metadata": { 1074 | "tags": [] 1075 | }, 1076 | "outputs": [], 1077 | "source": [ 1078 | "from distributed import wait\n", 1079 | "futures = client.map(lambda x: x + 1, range(27))\n", 1080 | "wait(futures);" 1081 | ] 1082 | }, 1083 | { 1084 | "cell_type": "code", 1085 | "execution_count": null, 1086 | "id": "5a92ad14-2976-4552-b654-8a7ad8645d8a", 1087 | "metadata": { 1088 | "tags": [] 1089 | }, 1090 | "outputs": [], 1091 | "source": [ 1092 | "counter.n_tasks" 1093 | ] 1094 | }, 1095 | { 1096 | "cell_type": "code", 1097 | "execution_count": null, 1098 | "id": "a1184a1e-bd18-4384-abf3-9acb1ffeccde", 1099 | "metadata": {}, 1100 | "outputs": [], 1101 | "source": [ 1102 | "client.restart()" 1103 | ] 1104 | }, 1105 | { 1106 | "cell_type": "code", 1107 | "execution_count": null, 1108 | "id": "f91f910d-022f-4e51-b762-36113c4f4184", 1109 | "metadata": {}, 1110 | "outputs": [], 1111 | "source": [ 1112 | "counter.n_tasks" 1113 | ] 1114 | }, 1115 | { 1116 | "cell_type": "code", 1117 | "execution_count": null, 1118 | "id": "c4a05fc9-8292-48c7-a67d-21ac536e4d3c", 1119 | "metadata": { 1120 | "tags": [] 1121 | }, 1122 | "outputs": [], 1123 | "source": [ 1124 | "client.close()\n", 1125 | "cluster.close()" 1126 | ] 1127 | }, 1128 | { 1129 | "cell_type": "markdown", 1130 | "id": "2137bfc5-255d-4047-821d-54a10791b451", 1131 | "metadata": {}, 1132 | "source": [ 1133 | "This is a relatively straightforward plugin one 
could write. Let's look at some of the `distributed`s built-in worker plugins to see two more real-world example." 1134 | ] 1135 | }, 1136 | { 1137 | "cell_type": "code", 1138 | "execution_count": null, 1139 | "id": "02ac484a-f5a4-4c3b-8af2-80f79d98d749", 1140 | "metadata": { 1141 | "tags": [] 1142 | }, 1143 | "outputs": [], 1144 | "source": [ 1145 | "from distributed import Environ, UploadFile, UploadDirectory, PipInstall, CondaInstall, PackageInstall" 1146 | ] 1147 | }, 1148 | { 1149 | "cell_type": "code", 1150 | "execution_count": null, 1151 | "id": "1060ed83-4a35-4cb4-8e41-49ce2f0b51a5", 1152 | "metadata": { 1153 | "tags": [] 1154 | }, 1155 | "outputs": [], 1156 | "source": [ 1157 | "Environ?? " 1158 | ] 1159 | }, 1160 | { 1161 | "cell_type": "code", 1162 | "execution_count": null, 1163 | "id": "09e73aa0-3e1c-47d6-9372-2f6df644f87f", 1164 | "metadata": {}, 1165 | "outputs": [], 1166 | "source": [ 1167 | "PackageInstall??" 1168 | ] 1169 | }, 1170 | { 1171 | "cell_type": "markdown", 1172 | "id": "530d27a2-8f93-4735-bc9f-b4e508b7ad53", 1173 | "metadata": {}, 1174 | "source": [ 1175 | "To add a custom worker plugin to your cluster, use the `Client.register_worker_plugin` method." 1176 | ] 1177 | }, 1178 | { 1179 | "cell_type": "markdown", 1180 | "id": "efde3b89-47c8-4d75-b015-be7641284ea6", 1181 | "metadata": {}, 1182 | "source": [ 1183 | "## Exercise 2\n", 1184 | "\n", 1185 | "Over the next 10 minutes, create a `TaskTimerPlugin` scheduler plugin which keeps tracks of how long each task takes to run.\n", 1186 | "\n", 1187 | "```python\n", 1188 | "\n", 1189 | "class TaskTimerPlugin(SchedulerPlugin):\n", 1190 | " ...\n", 1191 | "\n", 1192 | "# Create LocalCluster and Client\n", 1193 | "cluster = LocalCluster()\n", 1194 | "client = Client(cluster)\n", 1195 | "\n", 1196 | "# Instantiate and add the TaskTimerPlugin to our cluster\n", 1197 | "plugin = TaskTimerPlugin()\n", 1198 | "cluster.scheduler.add_plugin(plugin)\n", 1199 | "\n", 1200 | "import dask.array as da\n", 1201 | "\n", 1202 | "x = da.random.random((20_000, 20_000), chunks=(5_000, 1_000))\n", 1203 | "result = (x + x.T).mean(axis=0).sum()\n", 1204 | "result.compute()\n", 1205 | "```" 1206 | ] 1207 | }, 1208 | { 1209 | "cell_type": "code", 1210 | "execution_count": null, 1211 | "id": "46fd60d2-87c0-47df-a857-a37cf6ee26f7", 1212 | "metadata": {}, 1213 | "outputs": [], 1214 | "source": [ 1215 | "# Your solution to Exercise 2 here" 1216 | ] 1217 | }, 1218 | { 1219 | "cell_type": "code", 1220 | "execution_count": null, 1221 | "id": "628ceef7-8602-4e13-8191-02d6078cdcaf", 1222 | "metadata": { 1223 | "jupyter": { 1224 | "source_hidden": true 1225 | }, 1226 | "scrolled": true, 1227 | "tags": [] 1228 | }, 1229 | "outputs": [], 1230 | "source": [ 1231 | "# Solution to Exercise 2\n", 1232 | "import time\n", 1233 | "\n", 1234 | "class TaskTimerPlugin(SchedulerPlugin):\n", 1235 | " def __init__(self):\n", 1236 | " self.start_times = {}\n", 1237 | " self.stop_times = {}\n", 1238 | " self.task_durations = {}\n", 1239 | "\n", 1240 | " def transition(self, key, start, finish, *args, **kwargs):\n", 1241 | " if finish == \"processing\": \n", 1242 | " self.start_times[key] = time.time()\n", 1243 | " elif start ==\"processing\" and finish == \"memory\":\n", 1244 | " self.stop_times[key] = time.time()\n", 1245 | " self.task_durations[key] = self.stop_times[key] - self.start_times[key]\n", 1246 | "\n", 1247 | "# Create LocalCluster and Client\n", 1248 | "cluster = LocalCluster()\n", 1249 | "client = Client(cluster)\n", 1250 | "\n", 1251 | "# Instantiate and 
add the TaskTimerPlugin to our cluster\n", 1252 | "plugin = TaskTimerPlugin()\n", 1253 | "cluster.scheduler.add_plugin(plugin)\n", 1254 | "\n", 1255 | "import dask.array as da\n", 1256 | "\n", 1257 | "x = da.random.random((20_000, 20_000), chunks=(5_000, 1_000))\n", 1258 | "result = (x + x.T).mean(axis=0).sum()\n", 1259 | "result.compute()\n", 1260 | "\n", 1261 | "plugin.task_durations" 1262 | ] 1263 | }, 1264 | { 1265 | "cell_type": "markdown", 1266 | "id": "9e230a95-353e-4d7a-a5b1-16f8b6e97b9f", 1267 | "metadata": {}, 1268 | "source": [ 1269 | "**Bonus**: If you have extra time, make a plot of the task duration distribution (hint: `pandas` and `matplotlib` are installed)" 1270 | ] 1271 | }, 1272 | { 1273 | "cell_type": "code", 1274 | "execution_count": null, 1275 | "id": "3ab78fb2-ef32-4587-a0fe-0b4c31930926", 1276 | "metadata": { 1277 | "tags": [] 1278 | }, 1279 | "outputs": [], 1280 | "source": [ 1281 | "# Your plotting code here" 1282 | ] 1283 | }, 1284 | { 1285 | "cell_type": "code", 1286 | "execution_count": null, 1287 | "id": "d5369156-5524-49c0-bb00-37adffa1d175", 1288 | "metadata": { 1289 | "jupyter": { 1290 | "source_hidden": true 1291 | }, 1292 | "tags": [] 1293 | }, 1294 | "outputs": [], 1295 | "source": [ 1296 | "#solution\n", 1297 | "import pandas as pd\n", 1298 | "\n", 1299 | "df = pd.DataFrame([(key, 1_000 * value) for key, value in plugin.task_durations.items()],\n", 1300 | " columns=[\"key\", \"duration\"])\n", 1301 | "ax = df.duration.plot(kind=\"hist\", bins=50, logy=True)\n", 1302 | "ax.set_xlabel(\"Task duration [ms]\")\n", 1303 | "ax.set_ylabel(\"Counts\");" 1304 | ] 1305 | }, 1306 | { 1307 | "cell_type": "code", 1308 | "execution_count": null, 1309 | "id": "5bf57a9e-8fb6-4355-9aff-3f426efab9ce", 1310 | "metadata": {}, 1311 | "outputs": [], 1312 | "source": [ 1313 | "client.close()\n", 1314 | "cluster.close()" 1315 | ] 1316 | }, 1317 | { 1318 | "cell_type": "markdown", 1319 | "id": "f09c76dd-7523-42f5-a483-30d8ba69f2ec", 1320 | "metadata": {}, 1321 | "source": [ 1322 | "# Summary\n", 1323 | "\n", 1324 | "This notebook we took a detailed look at the components of a Dask cluster, illustrated how to inspect the internal state of a cluster (both the scheduler and workers), and how you can use Dask's plugin system to execute custom code during a cluster's lifecycle.\n" 1325 | ] 1326 | } 1327 | ], 1328 | "metadata": { 1329 | "kernelspec": { 1330 | "display_name": "Python 3 (ipykernel)", 1331 | "language": "python", 1332 | "name": "python3" 1333 | }, 1334 | "language_info": { 1335 | "codemirror_mode": { 1336 | "name": "ipython", 1337 | "version": 3 1338 | }, 1339 | "file_extension": ".py", 1340 | "mimetype": "text/x-python", 1341 | "name": "python", 1342 | "nbconvert_exporter": "python", 1343 | "pygments_lexer": "ipython3", 1344 | "version": "3.11.4" 1345 | } 1346 | }, 1347 | "nbformat": 4, 1348 | "nbformat_minor": 5 1349 | } 1350 | --------------------------------------------------------------------------------