├── .dask └── config.yaml ├── .github └── workflows │ └── ci-binder.yml ├── .gitignore ├── LICENSE ├── README.md ├── binder ├── environment.yml ├── jupyterlab-workspace.json └── start ├── data └── .gitkeep ├── notebooks ├── 0_Dask_what_and_when.ipynb ├── 1_Delayed.ipynb ├── 2_Schedulers.ipynb ├── 3_DataFrames.ipynb └── 4_Machine_learning.ipynb └── prep_data.py /.dask/config.yaml: -------------------------------------------------------------------------------- 1 | distributed: 2 | dashboard: 3 | link: "{JUPYTERHUB_BASE_URL}user/{JUPYTERHUB_USER}/proxy/{port}/status" 4 | -------------------------------------------------------------------------------- /.github/workflows/ci-binder.yml: -------------------------------------------------------------------------------- 1 | name: Binder 2 | on: [push] 3 | 4 | jobs: 5 | build: 6 | runs-on: ubuntu-latest 7 | steps: 8 | 9 | - name: Build and cache on mybinder.org 10 | uses: jupyterhub/repo2docker-action@master 11 | with: 12 | NO_PUSH: true 13 | MYBINDERORG_TAG: ${{ github.event.ref }} -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | *.dot 3 | *.pdf 4 | *.png 5 | .ipynb_checkpoints 6 | *.gz 7 | data/accounts.*.csv 8 | data/accounts.h5 9 | data/random.hdf5 10 | data/weather-big 11 | data/myfile.hdf5 12 | data/flightjson 13 | data/holidays 14 | data/nycflights 15 | data/myfile.zarr 16 | data/accounts.parquet 17 | dask-worker-space/ 18 | profile.html 19 | log 20 | .idea/ 21 | _build/ 22 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2021, Coiled 4 | All rights reserved. 
5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | 1. Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | 2. Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | 3. Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 30 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Dask Live by Coiled Tutorial 2 | 3 | The purpose of this tutorial is to introduce folks to Dask and show them how to scale their python data-science and machine learning workflows. The materials covered are: 4 | 5 | 0. Overview of Dask - How it works and when to use it. 
6 | 1. Dask Delayed: How to parallelize existing Python code and your custom algorithms. 7 | 2. Schedulers: Single Machine vs Distributed, and the Dashboard. 8 | 3. From pandas to Dask: How to manipulate bigger-than-memory DataFrames using Dask. 9 | 4. Dask-ML: Scalable machine learning using Dask. 10 | 11 | ## Prerequisites 12 | 13 | To follow along and get the most out of this tutorial, it would help if you know: 14 | 15 | - Programming fundamentals in Python (e.g. variables, data structures, for loops, etc.). 16 | - A bit about `numpy`, `pandas` and `scikit-learn`. 17 | - Jupyter Lab / Jupyter Notebooks 18 | - Your way around the shell/terminal 19 | 20 | However, the most important prerequisite is being willing to learn, and everyone is 21 | welcome to tag along and enjoy the ride. If you would like to watch and not code along, 22 | not a problem. 23 | 24 | ## Get set up 25 | 26 | We have two options for you to follow this tutorial: 27 | 28 | 1. Click on the Binder button right below. This will spin up the necessary computational environment for you so you can write and execute the notebooks directly in the browser. Binder is a free service, so resources are not guaranteed, but it usually works. One thing 29 | to keep in mind is that the amount of resources is limited, and sometimes you won't be able to see the benefits of parallelism due to this limitation. 30 | 31 | *IMPORTANT*: If you are joining the live session, make sure to click on the button a few minutes before we start so we are ready to go. 32 | 33 | [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/coiled/dask-mini-tutorial/HEAD) 34 | 35 | 36 | 2. You can create your own set-up locally. To do this you need to be comfortable with git and GitHub as well as installing packages and creating software environments. 
If so, follow the next steps: 37 | 38 | *IMPORTANT:* If you are joining for a live session please make sure you do the setup in advance, and be ready to go once the session starts. 39 | 40 | 1. **Clone this repository** 41 | In your terminal: 42 | 43 | ``` 44 | git clone https://github.com/coiled/dask-mini-tutorial.git 45 | ``` 46 | Alternatively, you can download the zip file of the repository at the top of the main page of the repository. This is a good option if you don't have experience with git. 47 | 48 | 2. Download Anaconda 49 | If you do not have Anaconda already installed, you will need the Python 3 [Anaconda Distribution](https://www.anaconda.com/products/individual). If you don't want to install Anaconda, you can install all the packages with `pip`. If you take this route, you will need to install `graphviz` separately before installing `pygraphviz`. 50 | 51 | 3. Create a conda environment 52 | In your terminal, navigate to the directory where you have cloned/downloaded the `dask-mini-tutorial` repo and install the required packages by doing: 53 | 54 | ``` 55 | conda env create -f binder/environment.yml 56 | ``` 57 | 58 | This will create a new environment called `dask-mini-tutorial`. To activate the environment do: 59 | 60 | ``` 61 | conda activate dask-mini-tutorial 62 | ``` 63 | 64 | 4. Open Jupyter Lab 65 | Once your environment has been activated and you are in the `dask-mini-tutorial` repository, in your terminal do: 66 | 67 | ``` 68 | jupyter lab 69 | ``` 70 | 71 | You will see a `notebooks` directory; click on it and you will be ready to go. 
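Once the environment is active, one quick way to confirm that the local setup worked is to check that the main packages can be found. The sketch below is not part of the tutorial materials. It uses only the Python standard library, and the module names in `EXPECTED_MODULES` are assumed import names for some of the packages pinned in `binder/environment.yml`; adjust them if your setup differs.

```python
from importlib.util import find_spec

# Assumed import names for packages listed in binder/environment.yml
# (for example, scikit-learn is imported as `sklearn`). This list is an
# illustration, not something defined by the tutorial itself.
EXPECTED_MODULES = ["dask", "sklearn", "dask_ml", "graphviz", "ipywidgets"]


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(EXPECTED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All expected modules were found.")
```

If anything is reported as missing, re-running `conda env create -f binder/environment.yml` and activating the environment again is a reasonable first thing to try.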
72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | -------------------------------------------------------------------------------- /binder/environment.yml: -------------------------------------------------------------------------------- 1 | name: dask-mini-tutorial 2 | channels: 3 | - conda-forge 4 | dependencies: 5 | - python=3.9.7 6 | - dask=2021.11.2 7 | # JupyterLab extensions 8 | - jupyterlab>=3 9 | - dask-labextension=5.1.0 10 | - ipywidgets=7.6.5 11 | - graphviz=2.49.0 12 | - python-graphviz=0.17 13 | - scikit-learn=1.0.1 14 | - dask-ml=2021.11.16 15 | - coiled=0.0.56 -------------------------------------------------------------------------------- /binder/jupyterlab-workspace.json: -------------------------------------------------------------------------------- 1 | { 2 | "data": { 3 | "file-browser-filebrowser:cwd": { 4 | "path": "" 5 | }, 6 | "dask-dashboard-launcher": { 7 | "url": "DASK_DASHBOARD_URL" 8 | } 9 | }, 10 | "metadata": { 11 | "id": "/lab" 12 | } 13 | } -------------------------------------------------------------------------------- /binder/start: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Replace DASK_DASHBOARD_URL with the proxy location 4 | sed -i -e "s|DASK_DASHBOARD_URL|${JUPYTERHUB_BASE_URL}user/${JUPYTERHUB_USER}/proxy/8787|g" binder/jupyterlab-workspace.json 5 | 6 | # Import the workspace 7 | jupyter lab workspaces import binder/jupyterlab-workspace.json 8 | export DASK_TUTORIAL_SMALL=1 9 | 10 | exec "$@" -------------------------------------------------------------------------------- /data/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/coiled/dask-mini-tutorial/38ffa24ed47abe66c305345a3ef7f3b00ef73095/data/.gitkeep -------------------------------------------------------------------------------- /notebooks/0_Dask_what_and_when.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "fa397203", 6 | "metadata": {}, 7 | "source": [ 8 | "\"Dask\n", 11 | "\n", 12 | "\n", 13 | "# What is it and when to use it? \n", 14 | "\n", 15 | "\n", 16 | "If you have ever heard of Dask, you might have some form of these questions. If you have never heard of Dask but you want to know what it is and when/if you should use it, then you are in the right place. \n", 17 | "\n", 18 | "Before we give a short overview and attempt to answer these questions, we strongly recommend that you check the documentation that the Dask community has in place. \n", 19 | "\n", 20 | "- Documentation: https://docs.dask.org\n", 21 | "\n", 22 | "Contribute to the project:\n", 23 | "\n", 24 | "- GitHub: https://github.com/dask/dask\n", 25 | "\n", 26 | "Engage with the community:\n", 27 | "\n", 28 | "- Slack: https://dask.slack.com/\n", 29 | "\n", 30 | "Looking for answers about how to use Dask:\n", 31 | "\n", 32 | "- Discourse: https://dask.discourse.group/" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "id": "5c317728", 38 | "metadata": {}, 39 | "source": [ 40 | "### What is Dask? \n", 41 | "\n", 42 | "Dask is a flexible library for parallel computing in Python that follows the syntax of the PyData ecosystem. If you are familiar with NumPy, pandas and scikit-learn, then think of Dask as their faster cousin. For example, compare the pandas code on the left with the Dask code on the right:\n", 43 | "\n", 44 | "```python\n", 45 | "import pandas as pd                   import dask.dataframe as dd\n", 46 | "df = pd.read_csv('2015-01-01.csv')    df = dd.read_csv('2015-*-*.csv')\n", 47 | "df.groupby(df.user_id).value.mean()   df.groupby(df.user_id).value.mean().compute()\n", 48 | "```\n", 49 | "\n", 50 | " Since they are all family, Dask allows you to scale your existing workflows with a small number of changes. Dask enables you to accelerate computations and perform those that don't fit in memory. 
It works on your laptop, but it also scales out to large clusters while providing a dashboard with great diagnostic tools. " 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "id": "c70d4db3", 56 | "metadata": {}, 57 | "source": [ 58 | "\"Dask" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "id": "3a8359e2", 66 | "metadata": {}, 67 | "source": [ 68 | "### Dask jargon: Client, Scheduler and Workers \n", 69 | "\n", 70 | "- Client: The user-facing entry point for cluster users. In other words, the client lives where your Python code lives, and it communicates with the scheduler, passing along the tasks to be executed.\n", 71 | "- Scheduler: The task manager; it sends the tasks to the workers.\n", 72 | "- Workers: The ones that compute the tasks.\n", 73 | "\n", 74 | "Note: The Scheduler and the Workers are on the same network; they could live on your laptop or on a separate cluster.\n", 75 | "\n", 76 | "\"Dask" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "id": "d78d7794", 84 | "metadata": { 85 | "tags": [] 86 | }, 87 | "source": [ 88 | "## When to use Dask?\n", 89 | "\n", 90 | "Before trying to use Dask, there are some questions to determine if Dask might be suitable for you. \n", 91 | "\n", 92 | "- Does your data fit in memory? \n", 93 | "    - Yes: Use pandas or NumPy. \n", 94 | "    - No : Dask might be able to help. \n", 95 | "- Do your computations take forever?\n", 96 | "    - Yes: Dask might be able to help. \n", 97 | "    - No : Awesome.\n", 98 | "- Do you have embarrassingly parallelizable code?\n", 99 | "    - Yes: Dask might be able to help.\n", 100 | "    - No?: If you are not sure, here are some [examples](https://examples.dask.org/applications/embarrassingly-parallel.html) \n", 101 | "    - No: I'm sorry, although Dask might have some hope for you.\n", 102 | " \n", 103 | " \n", 104 | "**Bottom Left:** You don't need Dask. 
\n", 105 | "**Elsewhere:** Dask fair game.\n", 106 | "\n", 107 | "\n", 108 | "\"Dask\n", 111 | "\n", 112 | "\n", 113 | "**Disclaimers:**\n", 114 | "\n", 115 | "1. When we say \"Dask might be able to help\" it is because you should try first to accelerate your code with NumPy and or Numba, checking types used on your DataFrames, and then maybe consider Dask. Now even when using Dask, we can't guarantee that things will be faster, it depends on what is the code behind. \n", 116 | "\n", 117 | "2. Even when you have large datasets, at some point you want to double check if you have reduced things to a manageable level where going back to pandas or NumPy might be the best call.\n", 118 | "\n", 119 | "**Best practices:**\n", 120 | "\n", 121 | "The learning curve to use Dask can be a bit intimidating, that's why we want to point you out to some best practices links that will make the process smoother. We will go over some of these topics but we want to leave here these links for future reference\n", 122 | "\n", 123 | "- Are you working with arrays? Check this [array best practices](https://docs.dask.org/en/latest/array-best-practices.html)\n", 124 | "- Dealing with DataFrames? Check this [DataFrames best practices](https://docs.dask.org/en/latest/dataframe-best-practices.html)\n", 125 | "- Are you trying to accelerate your code using `delayed`? Check this [delayed best practices](https://docs.dask.org/en/latest/delayed-best-practices.html)\n", 126 | "- For overall good practices check [Dask good practices](https://docs.dask.org/en/latest/best-practices.html)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "id": "ea50907f", 132 | "metadata": {}, 133 | "source": [ 134 | "## Why Dask? 
\n", 135 | "\n", 136 | "If you are interested in knowing why Dask might be a good option for you we recommend you to check the Dask documentation [Why Dask?](https://docs.dask.org/en/latest/why.html)\n", 137 | "\n", 138 | "But if you are already convinced that Dask is right for you and/or want to learn more about it. The topics that we will cover on this mini-tutorial are:\n", 139 | "\n", 140 | "1. Dask Delayed: How to parallelize existing Python code and your custom algorithms. \n", 141 | "2. Schedulers: Single Machine vs Distributed, and the Dashboard. \n", 142 | "3. From pandas to Dask: How to manipulate bigger-than-memory DataFrames using Dask. \n", 143 | "4. Dask-ML: Scalable machine learning using Dask.\n", 144 | "\n", 145 | "## Extra learning material:\n", 146 | "\n", 147 | "1. Self-paced Dask-Tutorial: https://tutorial.dask.org/\n", 148 | "2. Dask training by Coiled: [Scaling Python with Dask](https://coiled.io/course/scaling-python-with-dask/)" 149 | ] 150 | } 151 | ], 152 | "metadata": { 153 | "kernelspec": { 154 | "display_name": "Python 3 (ipykernel)", 155 | "language": "python", 156 | "name": "python3" 157 | }, 158 | "language_info": { 159 | "codemirror_mode": { 160 | "name": "ipython", 161 | "version": 3 162 | }, 163 | "file_extension": ".py", 164 | "mimetype": "text/x-python", 165 | "name": "python", 166 | "nbconvert_exporter": "python", 167 | "pygments_lexer": "ipython3", 168 | "version": "3.9.7" 169 | } 170 | }, 171 | "nbformat": 4, 172 | "nbformat_minor": 5 173 | } 174 | -------------------------------------------------------------------------------- /notebooks/1_Delayed.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "be4dd2fc", 6 | "metadata": {}, 7 | "source": [ 8 | "\"Dask\n", 11 | "\n", 12 | "This notebook was inspired in the materials from: \n", 13 | "\n", 14 | "- https://github.com/coiled/pydata-global-dask/\n", 15 | "- 
https://github.com/dask/dask-tutorial/\n", 16 | "\n", 17 | "# Dask Delayed\n", 18 | "\n", 19 | "Sometimes we have problems that are parallelizable. Dask Delayed is an interface that can be used to parallelize existing Python code and custom algorithms. \n", 20 | "\n", 21 | "A first step to determine if we can use `dask.delayed` is to identify whether there is some level of parallelism that we haven't exploited yet; if there is, `dask.delayed` can hopefully take care of it. We will start by showing a simple example inspired by the main [Dask tutorial](https://tutorial.dask.org/), and we will parallelize it using `dask.delayed`.\n", 22 | "\n", 23 | "The following two functions will perform simple computations, where we use `sleep` to simulate work. " 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 1, 29 | "id": "c7cbb4e5", 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "from time import sleep\n", 34 | "\n", 35 | "def inc(x):\n", 36 | " \"\"\"Increments x by one\"\"\"\n", 37 | " sleep(1)\n", 38 | " return x + 1\n", 39 | "\n", 40 | "def add(x, y):\n", 41 | " \"\"\"Adds x and y\"\"\"\n", 42 | " sleep(1)\n", 43 | " return x + y" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "id": "700f0a97", 49 | "metadata": {}, 50 | "source": [ 51 | "Let's do some operations and time these functions using the `%%time` magic at the beginning of the cell. 
" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 2, 57 | "id": "1c0053fb", 58 | "metadata": {}, 59 | "outputs": [ 60 | { 61 | "name": "stdout", 62 | "output_type": "stream", 63 | "text": [ 64 | "CPU times: user 933 µs, sys: 1.62 ms, total: 2.55 ms\n", 65 | "Wall time: 3.01 s\n" 66 | ] 67 | } 68 | ], 69 | "source": [ 70 | "%%time\n", 71 | "\n", 72 | "x = inc(1)\n", 73 | "y = inc(2)\n", 74 | "z = add(x, y)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "id": "a99dd0ec", 80 | "metadata": {}, 81 | "source": [ 82 | "The execution of the cell above took three seconds, this happens because we are calling each function sequentially. The computations above can be represented by the following graph:\n", 83 | "\n", 84 | "\"Dask\n", 87 | "\n", 88 | "\n", 89 | "Where the circles are function calls, squares represent objects that are created by one task as output and can be inputs into other tasks, and arrows represent the dependencies between the tasks. From looking at the task graph, the opportunity for parallelization is more evident since the the two calls to the `inc` function are completely independent of one-another. Let's explore how `dask.delayed` can help us with this.\n", 90 | "\n", 91 | "\n", 92 | "### `dask.delayed` \n", 93 | "\n", 94 | "Using the `dask.delayed` decorator we'll transform the `inc` and `add` functions. 
" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 3, 100 | "id": "31abdfa5", 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "from dask import delayed" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 4, 110 | "id": "03aed361", 111 | "metadata": {}, 112 | "outputs": [ 113 | { 114 | "name": "stdout", 115 | "output_type": "stream", 116 | "text": [ 117 | "CPU times: user 119 µs, sys: 24 µs, total: 143 µs\n", 118 | "Wall time: 132 µs\n" 119 | ] 120 | } 121 | ], 122 | "source": [ 123 | "%%time\n", 124 | "\n", 125 | "a = delayed(inc)(1)\n", 126 | "b = delayed(inc)(2)\n", 127 | "c = delayed(add)(a, b)" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "id": "f5546859", 133 | "metadata": {}, 134 | "source": [ 135 | "When we call the `delayed` version of the functions by passing the arguments, the original function is isn't actually called yet, that's why the execution finishes very quickly. When we called the `delayed` version of the functions, a `delayed` object is made, which keeps track of the functions to call and what arguments to pass to it. \n", 136 | "\n", 137 | "If we inspect `c`, we will notice that it instead of having the value five, we have what is called a `delayed` object." 
138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 5, 143 | "id": "33c59a9b", 144 | "metadata": {}, 145 | "outputs": [ 146 | { 147 | "name": "stdout", 148 | "output_type": "stream", 149 | "text": [ 150 | "Delayed('add-9b51a77b-5b91-4b92-88d8-48db94783550')\n" 151 | ] 152 | } 153 | ], 154 | "source": [ 155 | "print(c)" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "id": "4ae1bfee", 161 | "metadata": {}, 162 | "source": [ 163 | "We can visualize this object by doing:" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 6, 169 | "id": "ba3c6980", 170 | "metadata": {}, 171 | "outputs": [ 172 | { 173 | "data": { 174 | "image/png": "", 175 | "text/plain": [ 176 | "" 177 | ] 178 | }, 179 | "execution_count": 6, 180 | "metadata": {}, 181 | "output_type": "execute_result" 182 | } 183 | ], 184 | "source": [ 185 | "c.visualize()" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "id": "2714579a", 191 | "metadata": {}, 192 | "source": [ 193 | "Up to this point the object `c` holds all the information we need to compute the result. We can evaluate the result with `.compute()`." 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 7, 199 | "id": "3cfea4e1", 200 | "metadata": {}, 201 | "outputs": [ 202 | { 203 | "name": "stdout", 204 | "output_type": "stream", 205 | "text": [ 206 | "CPU times: user 1.41 ms, sys: 1.33 ms, total: 2.73 ms\n", 207 | "Wall time: 2.01 s\n" 208 | ] 209 | }, 210 | { 211 | "data": { 212 | "text/plain": [ 213 | "5" 214 | ] 215 | }, 216 | "execution_count": 7, 217 | "metadata": {}, 218 | "output_type": "execute_result" 219 | } 220 | ], 221 | "source": [ 222 | "%%time\n", 223 | "\n", 224 | "c.compute()" 225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "id": "b69fe21d", 230 | "metadata": {}, 231 | "source": [ 232 | "Notice that now the computation took 2s instead of 3s, this is because the two `inc` computations are run in parallel. 
\n", 233 | "\n", 234 | "**Note for Binder users**\n", 235 | "\n", 236 | "If you are running this notebook using binder, you will probably not see a speed-up. This happens because binder instances tend to have only one core with no threads so you can't see any parallelism. We can \"fix\" this by setting the number of workers to a higher number, but there is no guarantee that we will get these resources. \n", 237 | "\n", 238 | "For now, you can try copying the following lines in a cell and executing the same computation as before and see what happens. In one cell execute:\n", 239 | "\n", 240 | "\n", 241 | "```python\n", 242 | "import dask\n", 243 | "dask.config.set(scheduler='threads', num_workers=4) #setting num_workers\n", 244 | "```\n", 245 | "\n", 246 | "and in a separate cell try to run this again:\n", 247 | "\n", 248 | "```python\n", 249 | "%%time\n", 250 | "c.compute()\n", 251 | "```\n", 252 | "\n", 253 | "Don't worry about the syntax for now, we will explain this in the next lesson. " 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "id": "833397dd", 259 | "metadata": {}, 260 | "source": [ 261 | "## Parallelizing a `for`-loop\n", 262 | "\n", 263 | "When we perform the same group of operations multiple times in the form of a `for-loop`, there is a chance that we can perform these computations in parallel. 
For example, the following serial code can be parallelized using `delayed`: " 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 8, 269 | "id": "c79c75b0", 270 | "metadata": {}, 271 | "outputs": [], 272 | "source": [ 273 | "data = list(range(8))" 274 | ] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "id": "5926bbf6", 279 | "metadata": {}, 280 | "source": [ 281 | "#### Sequential code" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": 9, 287 | "id": "8153b1b6", 288 | "metadata": {}, 289 | "outputs": [ 290 | { 291 | "name": "stdout", 292 | "output_type": "stream", 293 | "text": [ 294 | "CPU times: user 1.31 ms, sys: 1.35 ms, total: 2.67 ms\n", 295 | "Wall time: 8.02 s\n" 296 | ] 297 | } 298 | ], 299 | "source": [ 300 | "%%time\n", 301 | "results = []\n", 302 | "for i in data:\n", 303 | " y = inc(i) # do something here\n", 304 | " results.append(y)\n", 305 | " \n", 306 | "total = sum(results) # do something here" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": 10, 312 | "id": "a18b54f1", 313 | "metadata": {}, 314 | "outputs": [ 315 | { 316 | "name": "stdout", 317 | "output_type": "stream", 318 | "text": [ 319 | "total = 36\n" 320 | ] 321 | } 322 | ], 323 | "source": [ 324 | "print(f'{total = }')" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "id": "5d269dee", 330 | "metadata": {}, 331 | "source": [ 332 | "### Exercise: \n", 333 | "\n", 334 | "Notice that the `inc` calls are independent of each other and can run in parallel, while `sum` needs all of their results. Use `delayed` to parallelize the sequential code above, compute the `total`, and time it using `%%time`. " 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": 11, 340 | "id": "d6905529", 341 | "metadata": {}, 342 | "outputs": [], 343 | "source": [ 344 | "#solution\n", 345 | "results = []\n", 346 | "for i in data:\n", 347 | " y = delayed(inc)(i) \n", 348 | " results.append(y)\n", 349 | " \n", 350 | "total = delayed(sum)(results) "
351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "id": "fd452def", 356 | "metadata": {}, 357 | "source": [ 358 | "In the code above, the `sum` step is not run in parallel, but it depends on each of the `inc` steps, that's why it needs the `delayed` decorator too. The `inc`steps will be parallelized, then aggregated with the `sum` step.\n", 359 | "\n", 360 | "Notice that we can apply delayed to built-in functions, as we did in the case of `sum` in the code above. " 361 | ] 362 | }, 363 | { 364 | "cell_type": "code", 365 | "execution_count": 12, 366 | "id": "6f52a461", 367 | "metadata": {}, 368 | "outputs": [ 369 | { 370 | "data": { 371 | "text/plain": [ 372 | "Delayed('sum-c14bce50-fd0f-4eae-a781-b94226956e95')" 373 | ] 374 | }, 375 | "execution_count": 12, 376 | "metadata": {}, 377 | "output_type": "execute_result" 378 | } 379 | ], 380 | "source": [ 381 | "total" 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": 13, 387 | "id": "bd0270ec", 388 | "metadata": {}, 389 | "outputs": [ 390 | { 391 | "data": { 392 | "image/png": "", 393 | "text/plain": [ 394 | "" 395 | ] 396 | }, 397 | "execution_count": 13, 398 | "metadata": {}, 399 | "output_type": "execute_result" 400 | } 401 | ], 402 | "source": [ 403 | "total.visualize()" 404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "execution_count": 14, 409 | "id": "ab0269d9", 410 | "metadata": {}, 411 | "outputs": [ 412 | { 413 | "name": "stdout", 414 | "output_type": "stream", 415 | "text": [ 416 | "CPU times: user 1.83 ms, sys: 1.35 ms, total: 3.18 ms\n", 417 | "Wall time: 1.01 s\n" 418 | ] 419 | }, 420 | { 421 | "data": { 422 | "text/plain": [ 423 | "36" 424 | ] 425 | }, 426 | "execution_count": 14, 427 | "metadata": {}, 428 | "output_type": "execute_result" 429 | } 430 | ], 431 | "source": [ 432 | "%%time\n", 433 | "total.compute()" 434 | ] 435 | }, 436 | { 437 | "cell_type": "markdown", 438 | "id": "ba56c847", 439 | "metadata": {}, 440 | "source": [ 441 | "**Note:**\n", 442 
| "\n", 443 | "When we used `dask.delayed` without having a distributed scheduler (will see this later), we are relying on a single-machine scheduler and dask will use the threadpool executor, which by default will use the resources available on your machine. This can cause you to see different time values for the parallel version, since it'll depend on the resources you have available.\n", 444 | "\n", 445 | "You can check this by doing:" 446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": 15, 451 | "id": "fc6aab81", 452 | "metadata": {}, 453 | "outputs": [ 454 | { 455 | "data": { 456 | "text/plain": [ 457 | "8" 458 | ] 459 | }, 460 | "execution_count": 15, 461 | "metadata": {}, 462 | "output_type": "execute_result" 463 | } 464 | ], 465 | "source": [ 466 | "import os\n", 467 | "os.cpu_count()" 468 | ] 469 | }, 470 | { 471 | "cell_type": "markdown", 472 | "id": "38659821", 473 | "metadata": {}, 474 | "source": [ 475 | "### The `@delayed` syntax \n", 476 | "\n", 477 | "The `delayed` decorator can be also used by \"decorating\" with `@delayed` the function you want to parallelize." 
478 | ] 479 | }, 480 | { 481 | "cell_type": "code", 482 | "execution_count": 16, 483 | "id": "294b1737", 484 | "metadata": {}, 485 | "outputs": [], 486 | "source": [ 487 | "@delayed \n", 488 | "def double(x):\n", 489 | " \"\"\"Doubles x\"\"\"\n", 490 | " sleep(1)\n", 491 | " return 2*x " 492 | ] 493 | }, 494 | { 495 | "cell_type": "markdown", 496 | "id": "acf79fa0", 497 | "metadata": {}, 498 | "source": [ 499 | "Then when we call this new `double` function we obtain a delayed object:" 500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": 17, 505 | "id": "dd9c01f2", 506 | "metadata": {}, 507 | "outputs": [ 508 | { 509 | "name": "stdout", 510 | "output_type": "stream", 511 | "text": [ 512 | "Delayed('double-ea458133-3bf3-4e8d-8111-531e9158f458')\n" 513 | ] 514 | } 515 | ], 516 | "source": [ 517 | "d = double(4)\n", 518 | "print(d)" 519 | ] 520 | }, 521 | { 522 | "cell_type": "markdown", 523 | "id": "abdb64b5", 524 | "metadata": {}, 525 | "source": [ 526 | "### Exercise\n", 527 | "\n", 528 | "Using the `delayed` decorator, create the parallel versions of `inc` and `add`." 529 | ] 530 | }, 531 | { 532 | "cell_type": "code", 533 | "execution_count": 18, 534 | "id": "57f47d98", 535 | "metadata": {}, 536 | "outputs": [], 537 | "source": [ 538 | "#solution\n", 539 | "\n", 540 | "@delayed\n", 541 | "def inc(x):\n", 542 | " \"\"\"Increments x by one\"\"\"\n", 543 | " sleep(1)\n", 544 | " return x + 1\n", 545 | "\n", 546 | "@delayed\n", 547 | "def add(x, y):\n", 548 | " \"\"\"Adds x and y\"\"\"\n", 549 | " sleep(1)\n", 550 | " return x + y" 551 | ] 552 | }, 553 | { 554 | "cell_type": "markdown", 555 | "id": "b3866d3d", 556 | "metadata": {}, 557 | "source": [ 558 | "``Delayed`` objects support several standard Python operations, each of which creates another ``Delayed`` object representing the result:\n", 559 | "\n", 560 | "- Arithmetic operators, e.g. `*`, `-`, `+`\n", 561 | "- Item access and slicing, e.g. 
`x[0]`, `x[1:3]`\n", 562 | "- Attribute access, e.g. `x.size`\n", 563 | "- Method calls, e.g. `x.index(0)`\n", 564 | "\n", 565 | "For example you can do:" 566 | ] 567 | }, 568 | { 569 | "cell_type": "code", 570 | "execution_count": 19, 571 | "id": "bfa8a187", 572 | "metadata": {}, 573 | "outputs": [ 574 | { 575 | "data": { 576 | "image/png": "", 577 | "text/plain": [ 578 | "" 579 | ] 580 | }, 581 | "execution_count": 19, 582 | "metadata": {}, 583 | "output_type": "execute_result" 584 | } 585 | ], 586 | "source": [ 587 | "result = (inc(5) * inc(7)) + (inc(3) * inc(2))\n", 588 | "result.visualize()" 589 | ] 590 | }, 591 | { 592 | "cell_type": "code", 593 | "execution_count": 20, 594 | "id": "93cee5c9", 595 | "metadata": {}, 596 | "outputs": [ 597 | { 598 | "name": "stdout", 599 | "output_type": "stream", 600 | "text": [ 601 | "CPU times: user 1.54 ms, sys: 1.39 ms, total: 2.93 ms\n", 602 | "Wall time: 1.01 s\n" 603 | ] 604 | }, 605 | { 606 | "data": { 607 | "text/plain": [ 608 | "60" 609 | ] 610 | }, 611 | "execution_count": 20, 612 | "metadata": {}, 613 | "output_type": "execute_result" 614 | } 615 | ], 616 | "source": [ 617 | "%%time\n", 618 | "result.compute()" 619 | ] 620 | }, 621 | { 622 | "cell_type": "markdown", 623 | "id": "c5c755fb", 624 | "metadata": {}, 625 | "source": [ 626 | "## Extra resources\n", 627 | "\n", 628 | "For more examples on `dask.delayed` check:\n", 629 | "- Main Dask tutorial: [Delayed lesson](https://github.com/dask/dask-tutorial/blob/main/01_dask.delayed.ipynb)\n", 630 | "- More examples on Delayed: [PyData global - Dask tutorial - Delayed](https://github.com/coiled/pydata-global-dask/blob/master/1-delayed.ipynb)\n", 631 | "- Short screencast on Dask delayed: [How to parallelize Python code with Dask Delayed (3min)](https://www.youtube.com/watch?v=-EUlNJI2QYs)\n", 632 | "- [Dask Delayed documentation](https://docs.dask.org/en/latest/delayed.html)\n", 633 | "- [Delayed Best 
Practices](https://docs.dask.org/en/latest/delayed-best-practices.html)\n" 634 | ] 635 | } 636 | ], 637 | "metadata": { 638 | "kernelspec": { 639 | "display_name": "Python 3 (ipykernel)", 640 | "language": "python", 641 | "name": "python3" 642 | }, 643 | "language_info": { 644 | "codemirror_mode": { 645 | "name": "ipython", 646 | "version": 3 647 | }, 648 | "file_extension": ".py", 649 | "mimetype": "text/x-python", 650 | "name": "python", 651 | "nbconvert_exporter": "python", 652 | "pygments_lexer": "ipython3", 653 | "version": "3.9.7" 654 | } 655 | }, 656 | "nbformat": 4, 657 | "nbformat_minor": 5 658 | } 659 | -------------------------------------------------------------------------------- /notebooks/2_Schedulers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "ba6b2bc0", 6 | "metadata": {}, 7 | "source": [ 8 | "\"Dask\n", 11 | " \n", 12 | "This notebook was inspired by the materials from: \n", 13 | "\n", 14 | "- https://github.com/coiled/pydata-global-dask/\n", 15 | "\n", 16 | " \n", 17 | "# Schedulers\n", 18 | "\n", 19 | "So far we have seen the power of `dask.delayed`, become familiar with the idea of task graphs, and learned that these task graphs need to be executed to get the results of our computation. But what does it mean \"to be executed\"? Who takes care of this? Well, as you might have guessed from the title of this notebook, this is the job of the Dask task scheduler. \n", 20 | "\n", 21 | "\n", 22 | "\"Grid\n", 25 | "\n", 26 | "\n", 27 | "There are different task schedulers in Dask, and even though they will all compute the same result, they might perform differently.
There are two different classes of schedulers: single-machine and distributed schedulers.\n", 28 | "\n", 29 | "\n", 30 | "## Single Machine Schedulers\n", 31 | "\n", 32 | "Single machine schedulers require no setup, they only use the Python standard library, and they provide basic features on a local process or thread pool. Dask provides different single machine schedulers:\n", 33 | "\n", 34 | "\n", 35 | "- \"threads\": The threaded scheduler executes computations with a local `concurrent.futures.ThreadPoolExecutor`. The threaded scheduler is the default choice for Dask arrays, Dask DataFrames, and Dask delayed.\n", 36 | "\n", 37 | "- \"processes\": The multiprocessing scheduler executes computations with a local `concurrent.futures.ProcessPoolExecutor`. The multiprocessing scheduler is the default choice for Dask Bag.\n", 38 | "\n", 39 | "- \"single-threaded\": The single-threaded synchronous scheduler executes all computations in the local thread, with no parallelism at all. This is particularly valuable for debugging and profiling, which are more difficult when using threads or processes.\n", 40 | "\n", 41 | "### Single machine schedulers in action\n", 42 | "\n", 43 | "Using the same examples we used in the Delayed lesson, let's see how we can change the scheduler and how this affects the performance of our computations. 
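
To build intuition for why the choice of scheduler matters, here is a minimal standard-library sketch (not Dask itself, and not how Dask is implemented internally). It runs the same sleep-bound `inc` tasks serially, as the single-threaded scheduler would, and through a `concurrent.futures.ThreadPoolExecutor`, the pool type the threaded scheduler is described as using above. The helper names `run_serial` and `run_threaded` and the shortened 0.05 s delay are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from time import perf_counter, sleep


def inc(x, delay=0.05):
    # Pretend to do some slow work, then increment x by one
    sleep(delay)
    return x + 1


def run_serial(data):
    # One task after another, similar in spirit to the single-threaded scheduler
    return sum(inc(i) for i in data)


def run_threaded(data, num_workers=4):
    # Tasks share a local thread pool, similar in spirit to the threaded scheduler
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return sum(pool.map(inc, data))


data = list(range(8))

start = perf_counter()
serial_total = run_serial(data)
serial_time = perf_counter() - start

start = perf_counter()
threaded_total = run_threaded(data)
threaded_time = perf_counter() - start

print(f'serial:   total={serial_total}, time={serial_time:.2f}s')
print(f'threaded: total={threaded_total}, time={threaded_time:.2f}s')
```

Both runs return the same total; only the wall time differs, because sleeping tasks release the interpreter and can overlap in threads. CPU-heavy pure-Python tasks would not overlap this way, which is one reason a process-based scheduler exists.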
" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 1, 49 | "id": "e15fbe82", 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "import dask\n", 54 | "from dask import delayed\n", 55 | "from time import sleep" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 2, 61 | "id": "d0b88f61", 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "@delayed\n", 66 | "def inc(x):\n", 67 | " \"\"\"Increments x by one\"\"\"\n", 68 | " sleep(1)\n", 69 | " return x + 1" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 3, 75 | "id": "a73e948a", 76 | "metadata": {}, 77 | "outputs": [ 78 | { 79 | "data": { 80 | "text/plain": [ 81 | "Delayed('sum-de7db1d6-1e32-477b-aded-a34ba2c60cd9')" 82 | ] 83 | }, 84 | "execution_count": 3, 85 | "metadata": {}, 86 | "output_type": "execute_result" 87 | } 88 | ], 89 | "source": [ 90 | "data = list(range(8))\n", 91 | "\n", 92 | "results = []\n", 93 | "for i in data:\n", 94 | " y = inc(i) \n", 95 | " results.append(y)\n", 96 | " \n", 97 | "total = delayed(sum)(results)\n", 98 | "total" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "id": "4bf0cf9f", 104 | "metadata": {}, 105 | "source": [ 106 | "### The multi-threading scheduler (default)" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 4, 112 | "id": "0d6e3709", 113 | "metadata": {}, 114 | "outputs": [ 115 | { 116 | "name": "stdout", 117 | "output_type": "stream", 118 | "text": [ 119 | "CPU times: user 2.09 ms, sys: 1.19 ms, total: 3.29 ms\n", 120 | "Wall time: 1.01 s\n" 121 | ] 122 | }, 123 | { 124 | "data": { 125 | "text/plain": [ 126 | "36" 127 | ] 128 | }, 129 | "execution_count": 4, 130 | "metadata": {}, 131 | "output_type": "execute_result" 132 | } 133 | ], 134 | "source": [ 135 | "%%time \n", 136 | "dask.config.set(scheduler='threads')\n", 137 | "total.compute()" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 5, 143 | "id": 
"1bc1197d", 144 | "metadata": {}, 145 | "outputs": [ 146 | { 147 | "name": "stdout", 148 | "output_type": "stream", 149 | "text": [ 150 | "CPU times: user 4.79 ms, sys: 2.42 ms, total: 7.21 ms\n", 151 | "Wall time: 2.01 s\n" 152 | ] 153 | }, 154 | { 155 | "data": { 156 | "text/plain": [ 157 | "36" 158 | ] 159 | }, 160 | "execution_count": 5, 161 | "metadata": {}, 162 | "output_type": "execute_result" 163 | } 164 | ], 165 | "source": [ 166 | "%%time \n", 167 | "dask.config.set(scheduler='threads', num_workers=4) #setting num_workers\n", 168 | "total.compute()" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "id": "6a294197", 174 | "metadata": {}, 175 | "source": [ 176 | "### The multi-process scheduler \n", 177 | "\n", 178 | "Notice that we can also set the scheduler with a context manager " 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": 6, 184 | "id": "e158bf20", 185 | "metadata": {}, 186 | "outputs": [ 187 | { 188 | "name": "stdout", 189 | "output_type": "stream", 190 | "text": [ 191 | "CPU times: user 10.6 ms, sys: 19 ms, total: 29.7 ms\n", 192 | "Wall time: 6.19 s\n" 193 | ] 194 | } 195 | ], 196 | "source": [ 197 | "%%time\n", 198 | "with dask.config.set(scheduler='processes'): \n", 199 | " total.compute() " 200 | ] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "id": "d0403b14", 205 | "metadata": {}, 206 | "source": [ 207 | "### The single-threaded scheduler \n", 208 | "\n", 209 | "Tools like `pdb` do not work well with multiple threads or processes, but you can work around this by using the single-threaded scheduler when debugging."
210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": 7, 215 | "id": "4ecb2d51", 216 | "metadata": {}, 217 | "outputs": [ 218 | { 219 | "name": "stdout", 220 | "output_type": "stream", 221 | "text": [ 222 | "CPU times: user 5.29 ms, sys: 1.45 ms, total: 6.74 ms\n", 223 | "Wall time: 8.04 s\n" 224 | ] 225 | }, 226 | { 227 | "data": { 228 | "text/plain": [ 229 | "36" 230 | ] 231 | }, 232 | "execution_count": 7, 233 | "metadata": {}, 234 | "output_type": "execute_result" 235 | } 236 | ], 237 | "source": [ 238 | "%%time\n", 239 | "total.compute(scheduler=\"single-threaded\") " 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "id": "b88c0692", 245 | "metadata": {}, 246 | "source": [ 247 | "For more information about single-machine schedulers and which one to choose, see the Dask documentation on [single-machine schedulers](https://docs.dask.org/en/latest/setup/single-machine.html). " 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "id": "0ffcbd5f", 253 | "metadata": {}, 254 | "source": [ 255 | "## Distributed Scheduler\n", 256 | "\n", 257 | "The Dask distributed scheduler, despite having \"distributed\" in its name, also works well on a single machine. We recommend using the distributed scheduler as it offers more features and diagnostics. You can think of the distributed scheduler as an \"advanced scheduler\". \n", 258 | "\n", 259 | "The distributed scheduler can be used in a cluster as well as locally. Deploying a remote Dask cluster involves additional setup that you can read more about in the Dask [setup documentation](https://docs.dask.org/en/latest/setup.html). Alternatively, you can use [Coiled](https://docs.coiled.io/user_guide/index.html#what-is-coiled), which provides cluster-as-a-service functionality to provision hosted Dask clusters on demand, and you can try it for free. \n", 260 | "\n", 261 | "For now, we will set up the scheduler locally.
To set up the distributed scheduler locally we need to create a `Client` object, which will let you interact with the \"cluster\" (local threads or processes on your machine)" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 8, 267 | "id": "5cd38299", 268 | "metadata": {}, 269 | "outputs": [], 270 | "source": [ 271 | "from dask.distributed import Client" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": 9, 277 | "id": "0354e700", 278 | "metadata": {}, 279 | "outputs": [ 280 | { 281 | "data": { 282 | "text/html": [ 283 | "
Client: Client-3142c864-167e-11ec-9fb4-1e00ea0a0276; Connection method: Cluster object; Cluster type: distributed.LocalCluster; Dashboard: http://127.0.0.1:8787/status; Workers: 4; Total threads: 8; Total memory: 16.00 GiB; Status: running; Using processes: True; Scheduler comm: tcp://127.0.0.1:57649; each of workers 0-3: 2 threads, 4.00 GiB memory
" 581 | ], 582 | "text/plain": [ 583 | "" 584 | ] 585 | }, 586 | "execution_count": 9, 587 | "metadata": {}, 588 | "output_type": "execute_result" 589 | } 590 | ], 591 | "source": [ 592 | "client = Client(n_workers=4)\n", 593 | "client" 594 | ] 595 | }, 596 | { 597 | "cell_type": "markdown", 598 | "id": "8dc55a56", 599 | "metadata": {}, 600 | "source": [ 601 | "When we create a distributed scheduler `Client`, by default it registers itself as the default Dask scheduler. From now on, all `.compute()` calls will use the distributed scheduler unless otherwise specified. \n", 602 | "\n", 603 | "The distributed scheduler has many features that you can learn more about in the [Dask distributed documentation](https://distributed.dask.org/en/latest/), but a nice feature to explore is the diagnostic dashboard. We will be taking a look at the dashboard as we perform computations, but for a brief overview of its main components you can check the Dask documentation on [diagnosing performance](https://distributed.dask.org/en/latest/diagnosing-performance.html).\n", 604 | "\n", 605 | "If you click the dashboard link in the cell above and run the computation of `total` as we did before, you will see some activity on the dashboard.
" 606 | ] 607 | }, 608 | { 609 | "cell_type": "code", 610 | "execution_count": 10, 611 | "id": "3c9e1199", 612 | "metadata": {}, 613 | "outputs": [ 614 | { 615 | "data": { 616 | "text/plain": [ 617 | "36" 618 | ] 619 | }, 620 | "execution_count": 10, 621 | "metadata": {}, 622 | "output_type": "execute_result" 623 | } 624 | ], 625 | "source": [ 626 | "total.compute()" 627 | ] 628 | }, 629 | { 630 | "cell_type": "code", 631 | "execution_count": null, 632 | "id": "adfd314d", 633 | "metadata": {}, 634 | "outputs": [], 635 | "source": [ 636 | "client.close()" 637 | ] 638 | }, 639 | { 640 | "cell_type": "markdown", 641 | "id": "ffbd7734", 642 | "metadata": {}, 643 | "source": [ 644 | "## Extra resources\n", 645 | "\n", 646 | "- [Dask documentation on scheduling](https://docs.dask.org/en/latest/scheduling.html)\n", 647 | "- Example Dynamic computations using Futures: [PyData Global Dask tutorial - schedulers](https://github.com/coiled/pydata-global-dask/blob/master/3-schedulers.ipynb)\n", 648 | "- Advance Delayed with distributed scheduler: [Dask tutorial - Advanced delayed](https://github.com/dask/dask-tutorial/blob/main/06_distributed_advanced.ipynb)" 649 | ] 650 | } 651 | ], 652 | "metadata": { 653 | "kernelspec": { 654 | "display_name": "Python 3 (ipykernel)", 655 | "language": "python", 656 | "name": "python3" 657 | }, 658 | "language_info": { 659 | "codemirror_mode": { 660 | "name": "ipython", 661 | "version": 3 662 | }, 663 | "file_extension": ".py", 664 | "mimetype": "text/x-python", 665 | "name": "python", 666 | "nbconvert_exporter": "python", 667 | "pygments_lexer": "ipython3", 668 | "version": "3.9.7" 669 | } 670 | }, 671 | "nbformat": 4, 672 | "nbformat_minor": 5 673 | } 674 | -------------------------------------------------------------------------------- /notebooks/4_Machine_learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "33012a14", 6 | 
"metadata": {}, 7 | "source": [ 8 | "\"Dask\n", 11 | "\n", 12 | "# Parallel and Distributed Machine Learning\n", 13 | "\n", 14 | "The material in this notebook was based on the open-source content from [Dask's tutorial repository](https://github.com/dask/dask-tutorial) and the [Machine learning notebook](https://github.com/coiled/data-science-at-scale/blob/master/3-machine-learning.ipynb) from data science at scale from coiled\n", 15 | "\n", 16 | "So far we have seen how Dask makes data analysis scalable with parallelization via Dask DataFrames. Let's now see how [Dask-ML](https://ml.dask.org/) allows us to do machine learning in a parallel and distributed manner. Note, machine learning is really just a special case of data analysis (one that automates analytical model building), so the 💪 Dask gains 💪 we've seen will apply here as well!\n", 17 | "\n", 18 | "(If you'd like a refresher on the difference between parallel and distributed computing, [here's a good discussion on StackExchange](https://cs.stackexchange.com/questions/1580/distributed-vs-parallel-computing).)\n", 19 | "\n", 20 | "\n", 21 | "## Types of scaling problems in machine learning\n", 22 | "\n", 23 | "There are two main types of scaling challenges you can run into in your machine learning workflow: scaling the **size of your data** and scaling the **size of your model**. That is:\n", 24 | "\n", 25 | "1. **CPU-bound problems**: Data fits in RAM, but training takes too long. Many hyperparameter combinations, a large ensemble of many models, etc.\n", 26 | "2. **Memory-bound problems**: Data is larger than RAM, and sampling isn't an option.\n", 27 | "\n", 28 | "Here's a handy diagram for visualizing these problems:\n", 29 | "\n", 30 | "\"scaling\n", 33 | "\n", 34 | "\n", 35 | "In the bottom-left quadrant, your datasets are not too large (they fit comfortably in RAM) and your model is not too large either. 
When these conditions are met, you are much better off using something like scikit-learn, XGBoost, and similar libraries. You don't need to leverage multiple machines in a distributed manner with a library like Dask-ML. However, if you are in any of the other quadrants, distributed machine learning is the way to go.\n", 36 | "\n", 37 | "Summarizing: \n", 38 | "\n", 39 | "* For in-memory problems, just use scikit-learn (or your favorite ML library).\n", 40 | "* For large models, use `joblib` with the Dask parallel backend and your favorite scikit-learn estimator.\n", 41 | "* For large datasets, use `dask_ml` estimators.\n", 42 | "\n", 43 | "## Scikit-learn in five minutes\n", 44 | "\n", 45 | "\"sklearn\n", 48 | "\n", 49 | "\n", 50 | "In this section, we'll quickly run through a typical scikit-learn workflow:\n", 51 | "\n", 52 | "* Load some data (in this case, we'll generate it)\n", 53 | "* Import the scikit-learn module for our chosen ML algorithm\n", 54 | "* Create an estimator for that algorithm and fit it with our data\n", 55 | "* Inspect the learned attributes\n", 56 | "* Check the accuracy of our model\n", 57 | "\n", 58 | "Scikit-learn has a nice, consistent API:\n", 59 | "\n", 60 | "* You instantiate an `Estimator` (e.g. `LinearRegression`, `RandomForestClassifier`, etc.). All of the model's *hyperparameters* (user-specified parameters, not the ones learned by the estimator) are passed to the estimator when it's created.\n", 61 | "* You call `estimator.fit(X, y)` to train the estimator.\n", 62 | "* Use `estimator` to inspect attributes, make predictions, etc. \n", 63 | "\n", 64 | "Here `X` is an array of *feature variables* (what we're using to predict) and `y` is an array of *target variables* (what we're trying to predict)."
65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "id": "b6e7379d", 70 | "metadata": {}, 71 | "source": [ 72 | "### Generate some random data" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "id": "8613c202", 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "from sklearn.datasets import make_classification\n", 83 | "\n", 84 | "# Generate data\n", 85 | "X, y = make_classification(n_samples=10000, n_features=4, random_state=0)" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "id": "1fbcf18d", 91 | "metadata": {}, 92 | "source": [ 93 | "**Refreshing some ML concepts**\n", 94 | "\n", 95 | "- `X` is the samples matrix (or design matrix). The shape of `X` is typically (`n_samples`, `n_features`), which means that samples are represented as rows and features are represented as columns.\n", 96 | "- A \"feature\" (also called an \"attribute\") is a measurable property of the phenomenon we're trying to analyze. A feature for a dataset of employees might be their hire date, for example.\n", 97 | "- `y` holds the target values, which are real numbers for regression tasks, or integers for classification (or any other discrete set of values). For unsupervised learning tasks, `y` does not need to be specified. `y` is usually a 1d array where the `i`th entry corresponds to the target of the `i`th sample (row) of `X`."
98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "id": "85824219", 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "# Let's take a look at X\n", 108 | "X[:8]" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "id": "7dd8ef9a", 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [ 118 | "# Let's take a look at y\n", 119 | "y[:8]" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "id": "80d61bf2", 125 | "metadata": {}, 126 | "source": [ 127 | "### Fitting an SVC\n", 128 | "\n", 129 | "For this example, we will fit a [Support Vector Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)." 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "id": "5d638ce9", 136 | "metadata": {}, 137 | "outputs": [], 138 | "source": [ 139 | "from sklearn.svm import SVC\n", 140 | "\n", 141 | "estimator = SVC(random_state=0)\n", 142 | "estimator.fit(X, y)" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "id": "4068406a", 148 | "metadata": {}, 149 | "source": [ 150 | "We can inspect the learned attributes by taking a look at the `support_vectors_`:" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "id": "b891e47a", 157 | "metadata": {}, 158 | "outputs": [], 159 | "source": [ 160 | "estimator.support_vectors_[:4]" 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "id": "a175852d", 166 | "metadata": {}, 167 | "source": [ 168 | "And we check the accuracy:" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": null, 174 | "id": "8300ec8e", 175 | "metadata": {}, 176 | "outputs": [], 177 | "source": [ 178 | "estimator.score(X, y)" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "id": "1efac1bd", 184 | "metadata": {}, 185 | "source": [ 186 | "There are [3 different
approaches](https://scikit-learn.org/0.15/modules/model_evaluation.html) to evaluate the quality of predictions of a model. One of them is the **estimator score method**. Estimators have a score method providing a default evaluation criterion for the problem they are designed to solve, which is discussed in each estimator's documentation.\n", 187 | "\n", 188 | "### Hyperparameter Optimization\n", 189 | "\n", 190 | "There are a few ways to learn the best *hyper*parameters while training. One is `GridSearchCV`.\n", 191 | "As the name implies, this does a brute-force search over a grid of hyperparameter combinations. scikit-learn provides tools to automatically find the best parameter combinations via cross-validation (which is the \"CV\" in `GridSearchCV`)." 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "id": "b4659297", 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "from sklearn.model_selection import GridSearchCV" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "id": "f6c889f4", 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [ 211 | "%%time\n", 212 | "estimator = SVC(gamma='auto', random_state=0, probability=True)\n", 213 | "param_grid = {\n", 214 | " 'C': [0.001, 10.0],\n", 215 | " 'kernel': ['rbf', 'poly'],\n", 216 | "}\n", 217 | "\n", 218 | "# Brute-force search over a grid of hyperparameter combinations\n", 219 | "grid_search = GridSearchCV(estimator, param_grid, verbose=2, cv=2)\n", 220 | "grid_search.fit(X, y)" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": null, 226 | "id": "ad134157", 227 | "metadata": {}, 228 | "outputs": [], 229 | "source": [ 230 | "grid_search.best_params_, grid_search.best_score_" 231 | ] 232 | }, 233 | { 234 | "cell_type": "markdown", 235 | "id": "228deb1b", 236 | "metadata": {}, 237 | "source": [ 238 | "## Compute Bound: Single-machine parallelism with Joblib\n", 239 | "\n", 240 | 
"\"Joblib\n", 243 | "\n", 244 | "In this section we'll see how [Joblib](https://joblib.readthedocs.io/en/latest/) (\"*a set of tools to provide lightweight pipelining in Python*\") gives us parallelism on our laptop. Here's what our grid search graph would look like if we set up six training \"jobs\" in parallel:\n", 245 | "\n", 246 | "\"grid\n", 249 | "\n", 250 | "With Joblib, we can say that scikit-learn has *single-machine* parallelism.\n", 251 | "Any scikit-learn estimator that can operate in parallel exposes an `n_jobs` keyword, which sets how many tasks to run in parallel. Specifying `n_jobs=-1` means running the maximum possible number of tasks in parallel." 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": null, 257 | "id": "d9bead16", 258 | "metadata": {}, 259 | "outputs": [], 260 | "source": [ 261 | "%%time\n", 262 | "grid_search = GridSearchCV(estimator, param_grid, verbose=2, cv=2, n_jobs=-1)\n", 263 | "grid_search.fit(X, y)" 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "id": "e7c73f87", 269 | "metadata": {}, 270 | "source": [ 271 | "Notice that the computation above is faster than before. If you are running this computation on Binder, you might not see a speed-up, because Binder instances tend to have only one core, so there is no parallelism to exploit. " 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "id": "1fe6255f", 277 | "metadata": {}, 278 | "source": [ 279 | "## Compute Bound: Multi-machine parallelism with Dask\n", 280 | "\n", 281 | "\n", 282 | "In this section we'll see how Dask (plus Joblib and scikit-learn) gives us multi-machine parallelism.
Here's what our grid search graph would look like if we allowed Dask to schedule our training \"jobs\" over multiple machines in our cluster:\n", 283 | "\n", 284 | "\"merged\n", 287 | " \n", 288 | "We can say that Dask can talk to scikit-learn (via Joblib) so that our *cluster* is used to train a model. \n", 289 | "\n", 290 | "If we run this on a laptop, it will take quite some time, but the CPU usage will be satisfyingly near 100% for the duration. To run faster, we would need a distributed cluster. For details on how to create a LocalCluster you can check the Dask documentation on [Single Machine: dask.distributed](https://docs.dask.org/en/latest/setup/single-distributed.html). \n", 291 | "\n", 292 | "Let's instantiate a Client with `n_workers=4`, which will give us a `LocalCluster`." 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": null, 298 | "id": "cdf776a5", 299 | "metadata": {}, 300 | "outputs": [], 301 | "source": [ 302 | "import dask.distributed\n", 303 | "\n", 304 | "client = dask.distributed.Client(n_workers=4)\n", 305 | "client" 306 | ] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "id": "bb9c77aa", 311 | "metadata": {}, 312 | "source": [ 313 | "**Note:** Click on Cluster Info, to see more details about the cluster. You can see the configuration of the cluster and some other specs. \n", 314 | "\n", 315 | "We can expand our problem by specifying more hyperparameters before training, and see how using `dask` as backend can help us. 
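
Before expanding the grid, it helps to estimate how much work a grid search schedules. The standard-library sketch below counts the hyperparameter combinations and the resulting model fits for the grid used in the next cell with `cv=2`. The helper name `count_fits` is an assumption for illustration, and the count ignores the final refit that `GridSearchCV` performs by default.

```python
from itertools import product


def count_fits(param_grid, cv):
    # Every combination of values is one candidate; each candidate is fit once per CV fold
    n_candidates = len(list(product(*param_grid.values())))
    return n_candidates, n_candidates * cv


param_grid = {
    'C': [0.001, 0.1, 1.0, 2.5, 5, 10.0],
    'kernel': ['rbf', 'poly', 'linear'],
    'shrinking': [True, False],
}

n_candidates, n_fits = count_fits(param_grid, cv=2)
print(f'{n_candidates} candidates, {n_fits} fits')
```

Each fit is independent of the others, which is why they can be handed to separate workers.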
" 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": null, 321 | "id": "367f1c5a", 322 | "metadata": {}, 323 | "outputs": [], 324 | "source": [ 325 | "param_grid = {\n", 326 | " 'C': [0.001, 0.1, 1.0, 2.5, 5, 10.0],\n", 327 | " 'kernel': ['rbf', 'poly', 'linear'],\n", 328 | " 'shrinking': [True, False],\n", 329 | "}\n", 330 | "\n", 331 | "grid_search = GridSearchCV(estimator, param_grid, verbose=2, cv=2, n_jobs=-1)" 332 | ] 333 | }, 334 | { 335 | "cell_type": "markdown", 336 | "id": "2854c927", 337 | "metadata": {}, 338 | "source": [ 339 | "### Dask parallel backend\n", 340 | "\n", 341 | "We can fit our estimator with multi-machine parallelism by quickly *switching to a Dask parallel backend* when using joblib. " 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": null, 347 | "id": "efb681a4", 348 | "metadata": {}, 349 | "outputs": [], 350 | "source": [ 351 | "import joblib" 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": null, 357 | "id": "4d8228e2", 358 | "metadata": {}, 359 | "outputs": [], 360 | "source": [ 361 | "%%time\n", 362 | "with joblib.parallel_backend(\"dask\", scatter=[X, y]):\n", 363 | " grid_search.fit(X, y)" 364 | ] 365 | }, 366 | { 367 | "cell_type": "markdown", 368 | "id": "4b9194cf", 369 | "metadata": {}, 370 | "source": [ 371 | "**What just happened?**\n", 372 | "\n", 373 | "Dask-ML developers worked with the scikit-learn and Joblib developers to implement a Dask parallel backend.
So internally, scikit-learn now talks to Joblib, and Joblib talks to Dask, and Dask is what handles scheduling all of those tasks on multiple machines.\n", 374 | "\n", 375 | "The best parameters and best score:" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": null, 381 | "id": "626f0886", 382 | "metadata": {}, 383 | "outputs": [], 384 | "source": [ 385 | "grid_search.best_params_, grid_search.best_score_" 386 | ] 387 | }, 388 | { 389 | "cell_type": "markdown", 390 | "id": "e5ea35f8", 391 | "metadata": {}, 392 | "source": [ 393 | "## Memory Bound: Single/Multi machine parallelism with Dask-ML\n", 394 | "\n", 395 | "We have seen how to work with larger models, but sometimes you'll want to train on a larger than memory dataset. `dask-ml` has implemented estimators that work well on Dask `Arrays` and `DataFrames` that may be larger than your machine's RAM." 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": null, 401 | "id": "01ed30f6", 402 | "metadata": {}, 403 | "outputs": [], 404 | "source": [ 405 | "import dask.array as da\n", 406 | "import dask.delayed\n", 407 | "from sklearn.datasets import make_blobs\n", 408 | "import numpy as np" 409 | ] 410 | }, 411 | { 412 | "cell_type": "markdown", 413 | "id": "3ebd444f", 414 | "metadata": {}, 415 | "source": [ 416 | "We'll make a small (random) dataset locally using scikit-learn." 
417 | ] 418 | }, 419 | { 420 | "cell_type": "code", 421 | "execution_count": null, 422 | "id": "5f703b07", 423 | "metadata": {}, 424 | "outputs": [], 425 | "source": [ 426 | "n_centers = 12\n", 427 | "n_features = 20\n", 428 | "\n", 429 | "X_small, y_small = make_blobs(n_samples=1000, centers=n_centers, n_features=n_features, random_state=0)\n", 430 | "\n", 431 | "centers = np.zeros((n_centers, n_features))\n", 432 | "\n", 433 | "for i in range(n_centers):\n", 434 | " centers[i] = X_small[y_small == i].mean(0)\n", 435 | " \n", 436 | "centers[:4]" 437 | ] 438 | }, 439 | { 440 | "cell_type": "markdown", 441 | "id": "c28f3881", 442 | "metadata": {}, 443 | "source": [ 444 | "**Note**: The small dataset will be the template for our large random dataset.\n", 445 | "We'll use `dask.delayed` to adapt `sklearn.datasets.make_blobs`, so that the actual dataset is generated on our workers. \n", 446 | "\n", 447 | "If you are not in binder and your machine has 16GB of RAM, you can set `n_samples_per_block=200_000`; the computation then takes around 10 minutes. If you are in binder, the resources are limited and the problem below is big enough. 
" 448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": null, 453 | "id": "17d91e43", 454 | "metadata": {}, 455 | "outputs": [], 456 | "source": [ 457 | "n_samples_per_block = 60_000 # on binder, replace this with 15_000\n", 458 | "n_blocks = 500\n", 459 | "\n", 460 | "delayeds = [dask.delayed(make_blobs)(n_samples=n_samples_per_block,\n", 461 | " centers=centers,\n", 462 | " n_features=n_features,\n", 463 | " random_state=i)[0]\n", 464 | " for i in range(n_blocks)]\n", 465 | "arrays = [da.from_delayed(obj, shape=(n_samples_per_block, n_features), dtype=X_small.dtype)\n", 466 | " for obj in delayeds]\n", 467 | "X = da.concatenate(arrays)\n", 468 | "X" 469 | ] 470 | }, 471 | { 472 | "cell_type": "markdown", 473 | "id": "dc609b5f", 474 | "metadata": {}, 475 | "source": [ 476 | "### KMeans from Dask-ML\n", 477 | "\n", 478 | "The algorithms implemented in Dask-ML are scalable. They handle larger-than-memory datasets just fine.\n", 479 | "\n", 480 | "They follow the scikit-learn API, so if you're familiar with scikit-learn, you'll feel at home with Dask-ML."
481 | ] 482 | }, 483 | { 484 | "cell_type": "code", 485 | "execution_count": null, 486 | "id": "095e644f", 487 | "metadata": {}, 488 | "outputs": [], 489 | "source": [ 490 | "from dask_ml.cluster import KMeans" 491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": null, 496 | "id": "51a4b716", 497 | "metadata": {}, 498 | "outputs": [], 499 | "source": [ 500 | "clf = KMeans(init_max_iter=3, oversampling_factor=10)" 501 | ] 502 | }, 503 | { 504 | "cell_type": "code", 505 | "execution_count": null, 506 | "id": "a0a1e790", 507 | "metadata": {}, 508 | "outputs": [], 509 | "source": [ 510 | "%time clf.fit(X)" 511 | ] 512 | }, 513 | { 514 | "cell_type": "code", 515 | "execution_count": null, 516 | "id": "7cfc4c3d", 517 | "metadata": {}, 518 | "outputs": [], 519 | "source": [ 520 | "clf.labels_" 521 | ] 522 | }, 523 | { 524 | "cell_type": "code", 525 | "execution_count": null, 526 | "id": "30535766", 527 | "metadata": {}, 528 | "outputs": [], 529 | "source": [ 530 | "clf.labels_[:10].compute()" 531 | ] 532 | }, 533 | { 534 | "cell_type": "code", 535 | "execution_count": null, 536 | "id": "1d9a83d7", 537 | "metadata": {}, 538 | "outputs": [], 539 | "source": [ 540 | "client.close()" 541 | ] 542 | }, 543 | { 544 | "cell_type": "markdown", 545 | "id": "8103498a", 546 | "metadata": {}, 547 | "source": [ 548 | "## Multi-machine parallelism in the cloud with Coiled\n", 549 | "\n", 550 | "
\n", 551 | "\"Coiled\n", 554 | "
\n", 555 | "\n", 556 | "In this section we'll see how Coiled allows us to solve machine learning problems with multi-machine parallelism in the cloud.\n", 557 | "\n", 558 | "Coiled, [among other things](https://coiled.io/product/), provides hosted and scalable Dask clusters. The biggest barriers to entry for doing machine learning at scale are \"Do you have access to a cluster?\" and \"Do you know how to manage it?\" Coiled solves both of those problems. \n", 559 | "\n", 560 | "We'll spin up a Coiled cluster (with 10 workers in this case), then instantiate a Dask Client to use with that cluster.\n", 561 | "\n", 562 | "If you are running on your local machine and not in binder, and you want to give Coiled a try, you can sign up [here](https://cloud.coiled.io/login?redirect_uri=/) and you will get some free credits. If you installed the environment by following the steps in the repository's [README](https://github.com/coiled/dask-mini-tutorial/blob/main/README.md), you will have `coiled` installed. You will just need to log in by following the steps on the [setup page](https://docs.coiled.io/user_guide/getting_started.html), and you will be ready to go. \n", 563 | "\n", 564 | "To learn more about how to set up an environment, you can visit the Coiled documentation on [Creating software environments](https://docs.coiled.io/user_guide/software_environment_creation.html). But for now you can use the environment we set up for this tutorial. 
" 565 | ] 566 | }, 567 | { 568 | "cell_type": "code", 569 | "execution_count": null, 570 | "id": "58d6d915", 571 | "metadata": {}, 572 | "outputs": [], 573 | "source": [ 574 | "import coiled\n", 575 | "from dask.distributed import Client" 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": null, 581 | "id": "f4268dc5", 582 | "metadata": {}, 583 | "outputs": [], 584 | "source": [ 585 | "# Spin up a Coiled cluster, instantiate a Client\n", 586 | "cluster = coiled.Cluster(n_workers=10, software=\"ncclementi/dask-mini-tutorial\",)" 587 | ] 588 | }, 589 | { 590 | "cell_type": "code", 591 | "execution_count": null, 592 | "id": "e50ddd7c", 593 | "metadata": {}, 594 | "outputs": [], 595 | "source": [ 596 | "client = Client(cluster)\n", 597 | "client" 598 | ] 599 | }, 600 | { 601 | "cell_type": "markdown", 602 | "id": "ff3980aa", 603 | "metadata": {}, 604 | "source": [ 605 | "### Memory bound: Dask-ML" 606 | ] 607 | }, 608 | { 609 | "cell_type": "markdown", 610 | "id": "efccea68", 611 | "metadata": {}, 612 | "source": [ 613 | "We can use Dask-ML estimators on the cloud to work with larger datasets." 
614 | ] 615 | }, 616 | { 617 | "cell_type": "code", 618 | "execution_count": null, 619 | "id": "dc52c251", 620 | "metadata": {}, 621 | "outputs": [], 622 | "source": [ 623 | "n_centers = 12\n", 624 | "n_features = 20\n", 625 | "\n", 626 | "X_small, y_small = make_blobs(n_samples=1000, centers=n_centers, n_features=n_features, random_state=0)\n", 627 | "\n", 628 | "centers = np.zeros((n_centers, n_features))\n", 629 | "\n", 630 | "for i in range(n_centers):\n", 631 | " centers[i] = X_small[y_small == i].mean(0)\n" 632 | ] 633 | }, 634 | { 635 | "cell_type": "code", 636 | "execution_count": null, 637 | "id": "ba1c77b3", 638 | "metadata": {}, 639 | "outputs": [], 640 | "source": [ 641 | "n_samples_per_block = 200_000\n", 642 | "n_blocks = 500\n", 643 | "\n", 644 | "delayeds = [dask.delayed(make_blobs)(n_samples=n_samples_per_block,\n", 645 | " centers=centers,\n", 646 | " n_features=n_features,\n", 647 | " random_state=i)[0]\n", 648 | " for i in range(n_blocks)]\n", 649 | "arrays = [da.from_delayed(obj, shape=(n_samples_per_block, n_features), dtype=X_small.dtype)\n", 650 | " for obj in delayeds]\n", 651 | "X = da.concatenate(arrays)\n" 652 | ] 653 | }, 654 | { 655 | "cell_type": "code", 656 | "execution_count": null, 657 | "id": "d891a7b6", 658 | "metadata": {}, 659 | "outputs": [], 660 | "source": [ 661 | "X = X.persist()" 662 | ] 663 | }, 664 | { 665 | "cell_type": "code", 666 | "execution_count": null, 667 | "id": "9fef03bb", 668 | "metadata": {}, 669 | "outputs": [], 670 | "source": [ 671 | "from dask_ml.cluster import KMeans" 672 | ] 673 | }, 674 | { 675 | "cell_type": "code", 676 | "execution_count": null, 677 | "id": "8a186141", 678 | "metadata": {}, 679 | "outputs": [], 680 | "source": [ 681 | "clf = KMeans(init_max_iter=3, oversampling_factor=10)" 682 | ] 683 | }, 684 | { 685 | "cell_type": "code", 686 | "execution_count": null, 687 | "id": "56241fad", 688 | "metadata": {}, 689 | "outputs": [], 690 | "source": [ 691 | "%time clf.fit(X)" 692 | ] 693 | }, 694 | { 
695 | "cell_type": "markdown", 696 | "id": "9a3682c1", 697 | "metadata": {}, 698 | "source": [ 699 | "Computing the labels:" 700 | ] 701 | }, 702 | { 703 | "cell_type": "code", 704 | "execution_count": null, 705 | "id": "f0b9980e", 706 | "metadata": {}, 707 | "outputs": [], 708 | "source": [ 709 | "clf.labels_[:10].compute()" 710 | ] 711 | }, 712 | { 713 | "cell_type": "code", 714 | "execution_count": null, 715 | "id": "9988ef14", 716 | "metadata": {}, 717 | "outputs": [], 718 | "source": [ 719 | "client.close()" 720 | ] 721 | }, 722 | { 723 | "cell_type": "markdown", 724 | "id": "6c5f839e", 725 | "metadata": {}, 726 | "source": [ 727 | "## Extra resources:\n", 728 | "\n", 729 | "- [Dask-ML documentation](https://ml.dask.org/)\n", 730 | "- [Getting started with Coiled](https://docs.coiled.io/user_guide/getting_started.html)" 731 | ] 732 | } 733 | ], 734 | "metadata": { 735 | "kernelspec": { 736 | "display_name": "Python 3 (ipykernel)", 737 | "language": "python", 738 | "name": "python3" 739 | }, 740 | "language_info": { 741 | "codemirror_mode": { 742 | "name": "ipython", 743 | "version": 3 744 | }, 745 | "file_extension": ".py", 746 | "mimetype": "text/x-python", 747 | "name": "python", 748 | "nbconvert_exporter": "python", 749 | "pygments_lexer": "ipython3", 750 | "version": "3.9.7" 751 | } 752 | }, 753 | "nbformat": 4, 754 | "nbformat_minor": 5 755 | } 756 | -------------------------------------------------------------------------------- /prep_data.py: -------------------------------------------------------------------------------- 1 | # This script was modified from the original at https://github.com/coiled/pydata-global-dask/blob/master/prep.py 2 | import time 3 | import sys 4 | import argparse 5 | import os 6 | from glob import glob 7 | import tarfile 8 | import urllib.request 9 | 10 | import pandas as pd 11 | 12 | 13 | DATASETS = ["flights", "all"] 14 | here = os.path.dirname(__file__) 15 | data_dir = os.path.abspath(os.path.join(here, "data")) 16 | 17 | 
print(f"{data_dir=}") 18 | 19 | def parse_args(args=None): 20 | parser = argparse.ArgumentParser( 21 | description="Downloads, generates and prepares data for the Dask tutorial." 22 | ) 23 | parser.add_argument( 24 | "--no-ssl-verify", 25 | dest="no_ssl_verify", 26 | action="store_true", 27 | default=False, 28 | help="Disables SSL verification.", 29 | ) 30 | parser.add_argument( 31 | "--small", 32 | action="store_true", 33 | default=None, 34 | help="Whether to use smaller example datasets. Checks DASK_TUTORIAL_SMALL environment variable if not specified.", 35 | ) 36 | parser.add_argument( 37 | "-d", "--dataset", choices=DATASETS, help="Datasets to generate.", default="all" 38 | ) 39 | 40 | return parser.parse_args(args) 41 | 42 | 43 | if not os.path.exists(data_dir): 44 | raise OSError( 45 | "data/ directory not found, aborting data preparation. " 46 | 'Restore it with "git checkout data" from the base ' 47 | "directory." 48 | ) 49 | 50 | 51 | def flights(small=None): 52 | start = time.time() 53 | flights_raw = os.path.join(data_dir, "nycflights.tar.gz") 54 | flightdir = os.path.join(data_dir, "nycflights") 55 | if small is None: 56 | small = bool(os.environ.get("DASK_TUTORIAL_SMALL", False)) 57 | 58 | if small: 59 | N = 500 60 | else: 61 | N = 10_000 62 | 63 | if not os.path.exists(flights_raw): 64 | print("- Downloading NYC Flights dataset... ", end="", flush=True) 65 | url = "https://storage.googleapis.com/dask-tutorial-data/nycflights.tar.gz" 66 | urllib.request.urlretrieve(url, flights_raw) 67 | print("done", flush=True) 68 | 69 | if not os.path.exists(flightdir): 70 | print("- Extracting flight data... 
", end="", flush=True) 71 | tar_path = os.path.join(data_dir, "nycflights.tar.gz") 72 | with tarfile.open(tar_path, mode="r:gz") as flights: 73 | flights.extractall(data_dir) 74 | 75 | if small: 76 | for path in glob(os.path.join(data_dir, "nycflights", "*.csv")): 77 | with open(path, "r") as f: 78 | lines = f.readlines()[:1000] 79 | 80 | with open(path, "w") as f: 81 | f.writelines(lines) 82 | 83 | print("done", flush=True) 84 | 85 | else: 86 | return 87 | 88 | end = time.time() 89 | print("** Created flights dataset! in {:0.2f}s**".format(end - start)) 90 | 91 | 92 | def main(args=None): 93 | args = parse_args(args) 94 | 95 | if args.no_ssl_verify: 96 | print("- Disabling SSL Verification... ", end="", flush=True) 97 | import ssl 98 | 99 | ssl._create_default_https_context = ssl._create_unverified_context 100 | print("done", flush=True) 101 | 102 | if args.dataset == "flights" or args.dataset == "all": 103 | flights(args.small) 104 | 105 | 106 | if __name__ == "__main__": 107 | sys.exit(main()) 108 | --------------------------------------------------------------------------------