├── data └── .gitkeep ├── binder ├── apt.txt ├── start ├── environment.yml └── jupyterlab-workspace.json ├── images ├── distributed-overview.png └── high_vs_low_level_coll_analogy.png ├── .dask └── config.yaml ├── .github └── workflows │ ├── ci-repo2docker.yaml │ ├── ci-binder-link-comment.yaml │ └── ci-build.yaml ├── .gitignore ├── Dockerfile ├── CONTRIBUTING.md ├── index.rst ├── LICENSE.txt ├── README.md ├── github_deploy_key_dask_dask_tutorial.enc ├── prep.py ├── conf.py ├── 00_overview.ipynb ├── 05_futures.ipynb ├── 04_distributed.ipynb ├── 03_dask.delayed.ipynb ├── 02_array.ipynb └── 01_dataframe.ipynb /data/.gitkeep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /binder/apt.txt: -------------------------------------------------------------------------------- 1 | graphviz 2 | -------------------------------------------------------------------------------- /images/distributed-overview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dask/dask-tutorial/HEAD/images/distributed-overview.png -------------------------------------------------------------------------------- /images/high_vs_low_level_coll_analogy.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dask/dask-tutorial/HEAD/images/high_vs_low_level_coll_analogy.png -------------------------------------------------------------------------------- /.dask/config.yaml: -------------------------------------------------------------------------------- 1 | distributed: 2 | logging: 3 | bokeh: critical 4 | 5 | dashboard: 6 | link: "{JUPYTERHUB_BASE_URL}user/{JUPYTERHUB_USER}/proxy/{port}/status" 7 | 8 | admin: 9 | tick: 10 | limit: 5s 11 | -------------------------------------------------------------------------------- /.github/workflows/ci-repo2docker.yaml: -------------------------------------------------------------------------------- 1 | name: repo2docker CI 2 | on: [push, pull_request] 3 | 4 | jobs: 5 | build: 6 | runs-on: ubuntu-latest 7 | steps: 8 | - name: Build and cache on mybinder.org 9 | uses: jupyterhub/repo2docker-action@master 10 | with: 11 | NO_PUSH: true 12 | MYBINDERORG_TAG: ${{ github.event.ref }} 13 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | *.dot 3 | *.pdf 4 | *.png 5 | .ipynb_checkpoints 6 | *.gz 7 | data/accounts.*.csv 8 | data/accounts.h5 9 | data/random.hdf5 10 | data/weather-big 11 | data/myfile.hdf5 12 | data/flightjson 13 | data/holidays 14 | data/nycflights 15 | data/myfile.zarr 16 | data/accounts.parquet 17 | dask-worker-space/ 18 | profile.html 19 | log 20 | .idea/ 21 | _build/ 22 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM jupyter/base-notebook:lab-2.2.5 2 | 3 | USER root 4 | # python3 setup 5 | RUN apt-get update && apt-get install -y graphviz git 6 | 7 | USER jovyan 8 | 9 | RUN git clone https://github.com/dask/dask-tutorial.git ./dask-tutorial 10 | RUN cd dask-tutorial && conda env update -n base -f binder/environment.yml --prune && . binder/postBuild && cd .. 
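# Note: the encrypted deploy key that ships with the repo is not needed to run the tutorial itself, so the next step removes it from the image.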
11 | RUN rm dask-tutorial/github_deploy_key_dask_dask_tutorial.enc 12 | 13 | CMD jupyter lab 14 | -------------------------------------------------------------------------------- /binder/start: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Replace DASK_DASHBOARD_URL with the proxy location 4 | sed -i -e "s|DASK_DASHBOARD_URL|${JUPYTERHUB_BASE_URL}user/${JUPYTERHUB_USER}/proxy/8787|g" binder/jupyterlab-workspace.json 5 | export DASK_DISTRIBUTED__DASHBOARD__LINK="${JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status" 6 | 7 | # Import the workspace 8 | jupyter lab workspaces import binder/jupyterlab-workspace.json 9 | 10 | export DASK_TUTORIAL_SMALL=1 11 | 12 | exec "$@" 13 | -------------------------------------------------------------------------------- /binder/environment.yml: -------------------------------------------------------------------------------- 1 | name: dask-tutorial 2 | channels: 3 | - conda-forge 4 | dependencies: 5 | - python=3.10 6 | - jupyterlab=3 7 | - numpy=1.24 8 | - scipy=1.10 9 | - bokeh=2.4 10 | - dask=2023.1.0 11 | - dask-labextension 12 | - distributed=2023.1.0 13 | - cryptography 14 | - matplotlib 15 | - pandas=1.5 16 | - pip 17 | - python-graphviz 18 | - ipycytoscape 19 | - pyarrow 20 | - s3fs 21 | - zarr 22 | - pooch 23 | - xarray 24 | - mamba 25 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | Dask is a community maintained project. We welcome contributions in the form of bug reports, documentation, code, design proposals, and more. 2 | 3 | For general information on how to contribute see https://docs.dask.org/en/latest/develop.html. 4 | 5 | ## Project specific notes 6 | 7 | As this repository mainly consists of Jupyter Notebooks and Python snippets for the solutions please test that you changes load correctly over 8 | at https://mybinder.org/. You can test this by entering the URL of your fork of this repo and the branch you've created and hitting "launch". 9 | -------------------------------------------------------------------------------- /.github/workflows/ci-binder-link-comment.yaml: -------------------------------------------------------------------------------- 1 | name: Comment with Binder link 2 | 3 | on: pull_request_target 4 | 5 | jobs: 6 | binder_link_comment: 7 | runs-on: ubuntu-latest 8 | steps: 9 | - name: Checkout 10 | uses: actions/checkout@v3 11 | 12 | - name: Comment PR 13 | uses: thollander/actions-comment-pull-request@v1 14 | with: 15 | message: | 16 | Beep boop! Here is a Binder link where you can try out this change. [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/${{ (github.event.pull_request_target || github.event.pull_request).head.repo.full_name }}/${{ (github.event.pull_request_target || github.event.pull_request).head.sha }}) 17 | comment_includes: "Binder link" 18 | GITHUB_TOKEN: ${{ secrets.DASK_BOT_TOKEN }} 19 | -------------------------------------------------------------------------------- /index.rst: -------------------------------------------------------------------------------- 1 | Dask Tutorial 2 | ============= 3 | 4 | You can run this tutorial in a live session here: |Binder| 5 | 6 | This tutorial was last given at SciPy 2020 in Austin Texas. 7 | `A video is available online `_. 8 | 9 | .. 10 | .. 
|Binder| image:: https://static.mybinder.org/badge_logo.svg 11 | :target: https://mybinder.org/v2/gh/dask/dask-tutorial/main?urlpath=lab 12 | 13 | 14 | .. toctree:: 15 | :maxdepth: 1 16 | 17 | 00_overview 18 | 01_dataframe 19 | 02_array 20 | 03_dask.delayed 21 | 04_distributed 22 | 05_futures 23 | 24 | 25 | More tutorials from our community 26 | --------------------------------- 27 | 28 | - You may want to check out these `tutorials from Coiled `_. 29 | - `Quansight `_ offers a number of PyData courses, including Dask and Dask-ML. 30 | 31 | For a more comprehensive list of past talks and other resources see `Talks & Tutorials `_ in the Dask documentation. 32 | -------------------------------------------------------------------------------- /.github/workflows/ci-build.yaml: -------------------------------------------------------------------------------- 1 | name: CI 2 | on: [push, pull_request] 3 | 4 | jobs: 5 | build-and-deploy: 6 | runs-on: ubuntu-latest 7 | steps: 8 | - name: Checkout source 9 | uses: actions/checkout@v2 10 | 11 | - name: Setup Conda Environment 12 | uses: conda-incubator/setup-miniconda@v3 13 | with: 14 | mamba-version: "*" 15 | environment-file: binder/environment.yml 16 | activate-environment: dask-tutorial 17 | auto-activate-base: false 18 | 19 | - name: Install testing and docs dependencies 20 | shell: bash -l {0} 21 | run: | 22 | mamba install -c conda-forge nbconvert nbformat jupyter_client ipykernel 23 | pip install nbsphinx dask-sphinx-theme>=3.0.5 sphinx 24 | 25 | - name: Build 26 | shell: bash -l {0} 27 | run: | 28 | python prep.py --small 29 | sphinx-build -M html . _build -v 30 | 31 | - name: Deploy 32 | if: ${{ github.ref == 'refs/heads/main' && github.event_name != 'pull_request'}} 33 | uses: JamesIves/github-pages-deploy-action@3.7.1 34 | with: 35 | GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} 36 | BRANCH: gh-pages 37 | FOLDER: _build/html 38 | CLEAN: true 39 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | Copyright (c) 2017-2018, Anaconda, Inc. and contributors 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without modification, 5 | are permitted provided that the following conditions are met: 6 | 7 | Redistributions of source code must retain the above copyright notice, 8 | this list of conditions and the following disclaimer. 9 | 10 | Redistributions in binary form must reproduce the above copyright notice, 11 | this list of conditions and the following disclaimer in the documentation 12 | and/or other materials provided with the distribution. 13 | 14 | Neither the name of Anaconda nor the names of any contributors may be used to 15 | endorse or promote products derived from this software without specific prior 16 | written permission. 17 | 18 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 19 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 20 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 21 | ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 22 | LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR 23 | CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF 24 | SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 25 | INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN 26 | CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 27 | ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF 28 | THE POSSIBILITY OF SUCH DAMAGE. 29 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Dask Tutorial 2 | 3 | This tutorial was last given at SciPy 2022 in Austin Texas. 4 | [A video of the SciPy 2022 tutorial is available online](https://youtu.be/J0NcbvkYPoE). 5 | 6 | [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/dask/dask-tutorial/main?urlpath=lab) 7 | [![Build Status](https://github.com/dask/dask-tutorial/workflows/CI/badge.svg)](https://github.com/dask/dask-tutorial/actions?query=workflow%3ACI) 8 | 9 | Dask is a parallel and distributed computing library that scales the existing Python and PyData ecosystem. Dask can scale up to your full laptop capacity and out to a cloud cluster. 10 | 11 | ## Prepare 12 | 13 | #### 1. You should clone this repository 14 | 15 | git clone http://github.com/dask/dask-tutorial 16 | 17 | and then install necessary packages. 18 | There are three different ways to achieve this, pick the one that best suits you, and ***only pick one option***. 19 | They are, in order of preference: 20 | 21 | #### 2a) Create a conda environment (preferred) 22 | 23 | In the main repo directory 24 | 25 | conda env create -f binder/environment.yml 26 | conda activate dask-tutorial 27 | 28 | #### 2b) Install into an existing environment 29 | 30 | You will need the following core libraries 31 | 32 | conda install -c conda-forge ipycytoscape jupyterlab python-graphviz matplotlib zarr xarray pooch pyarrow s3fs scipy dask distributed dask-labextension 33 | 34 | Note that these options will alter your existing environment, potentially changing the versions of packages you already 35 | have installed. 36 | 37 | #### 2c) Use Dockerfile 38 | 39 | You can build a docker image from the provided Dockerfile. 40 | 41 | $ docker build . # This will build using the same env as in a) 42 | 43 | Run a container, replacing the ID with the output of the previous command 44 | 45 | $ docker run -it -p 8888:8888 -p 8787:8787 46 | 47 | The above command will give an URL (`Like http://(container_id or 127.0.0.1):8888/?token=`) which 48 | can be used to access the notebook from browser. You may need to replace the given hostname with "localhost" or 49 | "127.0.0.1". 50 | 51 | #### You should follow only one of the options above! 52 | 53 | ### Launch Jupyter 54 | 55 | From the repo directory 56 | 57 | jupyter lab 58 | 59 | This was already done for method c) and does not need repeating. 60 | 61 | You are welcome to use Jupyter notebook if you prefer, but we'll be using lab in the live tutorial. 
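Once JupyterLab is up, a quick sanity check (a minimal sketch, not part of the tutorial material) is to start a throwaway local cluster from Python and confirm the dashboard is reachable:

    import dask
    from dask.distributed import Client

    client = Client()                # starts a LocalCluster on this machine
    print(dask.__version__)          # the installed Dask version
    print(client.dashboard_link)     # URL of the Dask dashboard
    client.close()

If this prints a version and a dashboard URL without errors, the environment is ready for the notebooks.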
62 | 63 | ## Links 64 | 65 | * Reference 66 | * [Docs](https://dask.org/) 67 | * [Examples](https://examples.dask.org/) 68 | * [Code](https://github.com/dask/dask/) 69 | * [Blog](https://blog.dask.org/) 70 | * Ask for help 71 | * [`dask`](http://stackoverflow.com/questions/tagged/dask) tag on Stack Overflow, for usage questions 72 | * [github issues](https://github.com/dask/dask/issues/new) for bug reports and feature requests 73 | * [discourse forum](https://dask.discourse.group/) for general, non-bug, questions and discussion 74 | * Attend a live tutorial 75 | 76 | ## Outline 77 | 78 | 0. [Overview](00_overview.ipynb) - dask's place in the universe. 79 | 80 | 1. [Dataframe](01_dataframe.ipynb) - parallelized operations on many pandas dataframes spread across your cluster. 81 | 82 | 2. [Array](02_array.ipynb) - blocked numpy-like functionality with a collection of numpy arrays spread across your cluster. 83 | 84 | 3. [Delayed](03_dask.delayed.ipynb) - the single-function way to parallelize general python code. 85 | 86 | 4. [Deployment/Distributed](04_distributed.ipynb) - Dask's scheduler for clusters, with details of how to view the UI. 87 | 88 | 5. [Distributed Futures](05_futures.ipynb) - non-blocking results that compute asynchronously. 89 | 90 | 6. Conclusion 91 | -------------------------------------------------------------------------------- /github_deploy_key_dask_dask_tutorial.enc: -------------------------------------------------------------------------------- 1 | gAAAAABdSHGH8t4hcCyDa0rFI26HbD9fHH3OZ4cNb7EQ6ijW5Oozkg6ru7-14LGeE2O2Q40wCYPErBBtJMNCrTHd6MxwXygu5ISFieyRkXujl9ofiokw-rt-0WCDz30FC_7bxxGIvV9uvJcjHYYiEU547O4u8YbrSzrvDf-Cz8qtQ4ehis6k5DqIgVCWfYJPyC_p1SXDOfNtBSqj--bYSt8U_NfL2z2xTkzbehSQiA5rufVFTuSc0bKrap3Z8OrrAT7ZdSi5QvtCAwhDWHCO054NOyjsfzKXw9XEaA7P6Jhge2A4iLY61Tx27USdDtR2aevG6mmqUanFpk-6Pnv-DcAhdaN646yP5fd4avwwLORS5xjs3elOMtH2f1dbfgWGGEpXuG8dLb0d9JxnnYmlb2Ib4ljf4qSJjBj20BqoRSelvvAdhNE4r1EEK8LpNTwRt3Qc3HMLmNKr0almVnJjhiP0hBoz_MKz_kX971e4B2eOYh9F6wSbiu8jxGv0iZqS6usRH2xnowHg38zXBg31862rJNQQw6CWjwN09hkCdycQ1bXwdZ6FH0YCSNr5GWZa_o9j9Wk_pcqIUe1-CjieyQkC4gbF7gSYTVkGL2uyuKDQEO_TfN1pxkp_osyGmmFg3HwxYRgwvu1CHFNJdaSEY_f-PCJXwD5pyGACNmXbSx-udU4nPctF_iH39G62eOOZ5fUqfpA-9h1Ut8UGho1qszbqANYzysACLTEVm8r3mXNr6x_qvRIo_dcKajJINB94kyugeVqcpgdPzHdL9brd_3tlnjx0XsxrcXLLvPaU_3NLbmX4WBuzx4RCfnGRFv8O5nUuRkBWerjh06E70QEKZ93L_3Hy2Jt85lRo7eE0f4WP9bWqobzHeTiw31Znt3QzfvWgDertmvs-HpH41i4ffPrwi2LpnlryTPRtDpTWwFSP4Pa0rhfKOI_E4HwBmSjZ6v5jt7IbFS2IbeOZOJz2MgvRgIRBRqz8o8bqCnGjR09dHQu0XBTVbS_pTq9menHLCXbfCTa4kfEfRzINf4PtAafGWKRlQ1dYw9RBzxCm2sLNmJfzVoT_0aUwUHPBjTZMHQSTsXQPgXNda-tmu3_Q_qb3tKedrwMOqmLqsn8elgUavDXxzF0hqhUM4GJu67jU_s8Fwy7gn-h3UeEFau-HCkeZJNE-t0rDktkm-ijAQqmD3rMfDvn_v06IhahklaUXpetBD5tkiZVZOeyIYPu9BWBuVC-lpA0U_izTZY8Cx_rqPVP7y4B7-n4HcpQaKdx5NWWQH1L-YPr3yR2Scd1jhtAZ1WaPTWPVKARXVo2XTxcCbeZJtS5zNpQWUiZTrj_5I1W6GUa5MNwVNtYjU-JDBt1GwkhjV3WcjbxxGioUGcXgoCy2eotaAKN5lkugJm8fqcYKHtz6osOjnuVjX8-HOP7K3UOd6QVpPbjE3zPhkGi-ufAM_XEnk_zUry0qWTs7ZFRu8grXRmNKrCfRk-RuSo3G83VN87qP75IM_T6tSGGYf0zBP9iKslc-gvnvWNIUjHOz_R8flyPL4M627TM7sykQPhwfceuPmfNBfNtW_jcWfO4MbBBB0ydv8kBCw814NUXb7kvOWWHRBkWrUVEtpALK_mcrdMqfYNa2x9-wA7a9sqm5gMgGuw4BDjjFRMBzbo0YlTmKiSmUkjTRoP2OIcQZbrq4_FqMw9XzrXKoyUfX_--mr-hMgKAQwQGV4hGIT90llcJcJ__zle2B1tJuVQXXuf3iuYkinf3vEgrmEKIoReqfeZBZ93ebwc7HuX-gs4-E5B-Nz-KvYFGFf4CYYyKgiUY3QUAVVW_XJr4gZROme76ej5ZOeM1mLrjGppibB_CBnoy8BNA8CvPyVtiJEk7xbLrDwcxLLtFUCWYXg7v4zYpfKEpIPe3jWStmN9JS69z-O1JTnEIpua4qpPQV6zIzu0xuRGBkdkmpRxEHMpgbNuMhoYwSAEg3xp5aVpOeKzVQiwz5gxMBSGYFr33juXwmIi31imQYIOvz8N3O5SzJMO2r5Jd0MiPPzJkhzvkRbVkM0woJ
o2SnlWiGSZBEfHzlYjTnGB4ejEh_9GF7PS_Q5XKXeSbmhMvjssiViVcScVd_FpFIwuHeozWkrXIEpBMs0372yFfJrvYIrvE1XZxUxwrPBl4TmkteUKKENvtTzXoLJ0Yp4mfX4-gz9LsxGctRFFuEp8RocjjhRbIutyTM_DN3QStKliZWbgDcBveO4eOn0KtNVhG5dmecZNZ-EeOCtnOxgkb6sbGG9OLyV9eBXvD2ma53BKZCfBlWOIugKqyARQAutX4pSwjL9Gk0aZIXGn1qVx5Dw9Ca1nFwrvzX3E6Y8C6GaZbukkM8F8_Df7f7VhZbNCy8wbb-h6NwyQHqcGpoqb1xFx3fv3SJokxg-aXMvBd6H7l03AAV5lYJKTEaIBgcgFSzpP0DWzB3MXzLZLTgPh5hg7mZHjNVdNhGm08THbgnqBDWojEfXYANz8PRizDIJr-zbz05FThv7DQnvGAtatqGN0qGSKwpdv--XZcnvAbdNc0Bk-WPSXgiZ23Ly35e0cqydiFlC0OmiP-w3nk8Gtg_douLPJTMh4_DP4f_N_xyCVkmGw-rtw467nlWaf2uY-baZo-j4Ics_SNjgrDQZFK4POQuCauljONpShxrUjTPG63g3FUVIYU-SOJYuUjImWNurSLtdnvugsNGHwYC7g8NqF5UbtW38pzHWpXRs8l1pmn3NyuBQ6tLXRPfNSrJgdxSN-g_vnVkLkvNfR7fnAsuuPDLDbpWMjIhzZ0c5Lqc4jrXJWKUq4IafTGPiL814MdgNxiyemdo5pB03Dj6O7AY6WVeZhnIoo2u1DiYrl6BcO-z35hjJenohtGzf6szgkZG68SfBFOzWCQXDnH68WGcfqZuJky7U6jqbCBNqTa2RrkMNHRUqfYVzNjsJCaW5TsIC9OqiXPYGKA2z840hNogE57Is42iifMtn6MS2t8NxvvTi6A9OyeD9DZaBCkFM4J1LRLdDy8K-IKLNIXFMsy7izB19q5cOv3WBpxUjogpBNE3chTsvqKbk5R4yZEOS8AkrBpkKPEnRjScJvHwI7vRabfZhMzMex7cBA3hJu_9nUqbvnsPohczE6Vig4tvcJQEQ9JzosbgesEXaBjXvZGHyfv4wo539_cqzmoz5x1ejsLFfr-UeAawxnpANlHjqgx5wM7SCbM2p8brw4EU4voy7U_4yjt9-M5hlUr38YpgoYECtH7bY44mT6xJYPVkGtO-tjeaEB5Xa41UO8AmZvcHvg9kKqyzui2UqTGrOs35CdUBKQxEF8IH_oxSatwLbkifAu2lCEKAFvI9mCP2gvPBaR51bVe0grZN7uIEBh454R7EBhV3bx1Aaw-cx08Ucmqyl_Mu9YdcBnDcR2rqOp-u8AWSesAVhwtprjbwJELbQKPVZfo0JDmSAaT-Sv2yz49ubcbmp8msqmfAaqyylt0ZPsKnHi3fdWDuTNkxWzz_r1JqinsLWYn1WY87-3GodO4Pm6PsTHp8fys4dyYzK8Qnv7zE50WX3bd2cdCxx1qDJhmun9EhIwcRDoYpsmiW-5LyMajHN7TSsYi_donesXDig6YkToGOz7sLX5_blfVjBpfOnnWjy2rusyzjSdYcnVAmdzR1df_bvuBs6kVh6HkoZ1Liwg4wUbioIV-uN6_JAhcKIzDsa3tTSOjsAx5KZrOgW3P5sqsQNXKikc0KA31pmPOJbA8NIlyijdP7olphQuxtLCyrIf_GZWbrpsssRO-zjC_Tis4MFnR0IS8UeePJh98CDBEjaIkhMTbqjIv20YvejpQWGYkOcVNFBZbygdSIulRGa4A8wBuxe9u9fSyu5Kt0ouPC7VS9C703awgDPObLmN-jhFIscxIR7vZOl6gjl5qs5kg8uuoBt8R_AcJBLgJwczJUnlC4zsLy-xE-j3SiPPgvtM6lgOfFx2QWl3JaOmbkK15t3F7LMdaNaFLb2KRUw640SBdnM28ZDysRucfwgnuexgnzASSd2GxhB2tOnwKbqBkMefe4J1Jiet91XkmjtxPvPpe8hAqdgTQf7HPKDZt8QwNyHYEwyq_h1Tr3p-dhAX-aStWUtpm7QHzPcTF5I5pYJNcew6kC6mWL7CUf6HAsKXSxzqPOyFg-3dVIQ4l13_O8OP_IUheIAhN9BAWuj2921Zgs3u7qCCuN1GgY21XBGLDYix2GRPdqCB6yZpzNxIy7VyjhdmlgMW9z8pAZFtBGR5i_e9Y98Iktwk18Zwp1_Iafo6Slk6uoGelS0Ny5nqdGamuFCby0Jc1hgdiHaco3ifEb9VoFnO3_VYYnnnqiex0YQ17vBfNKHldUU5zlK3Jlv-yB4woOxPl8WPwHVRuK41gFgR4OidRp4SIh6YdktRvtndSgbeezUtehOieUpll30Jx-aiyTEKPusQ== -------------------------------------------------------------------------------- /prep.py: -------------------------------------------------------------------------------- 1 | import time 2 | import sys 3 | import argparse 4 | import os 5 | from glob import glob 6 | 7 | import tarfile 8 | import urllib.request 9 | 10 | import pandas as pd 11 | import dask.array as da 12 | 13 | DATASETS = ["random", "flights", "all"] 14 | here = os.path.dirname(__file__) 15 | data_dir = os.path.abspath(os.path.join(here, "data")) 16 | 17 | 18 | def parse_args(args=None): 19 | parser = argparse.ArgumentParser( 20 | description="Downloads, generates and prepares data for the Dask tutorial." 21 | ) 22 | parser.add_argument( 23 | "--small", 24 | action="store_true", 25 | default=None, 26 | help="Whether to use smaller example datasets. 
Checks DASK_TUTORIAL_SMALL environment variable if not specified.", 27 | ) 28 | parser.add_argument( 29 | "-d", "--dataset", choices=DATASETS, help="Datasets to generate.", default="all" 30 | ) 31 | 32 | return parser.parse_args(args) 33 | 34 | 35 | if not os.path.exists(data_dir): 36 | raise OSError( 37 | "data/ directory not found, aborting data preparation. " 38 | 'Restore it with "git checkout data" from the base ' 39 | "directory." 40 | ) 41 | 42 | 43 | def flights(small=None): 44 | start = time.time() 45 | flights_raw = os.path.join(data_dir, "nycflights.tar.gz") 46 | flightdir = os.path.join(data_dir, "nycflights") 47 | jsondir = os.path.join(data_dir, "flightjson") 48 | if small is None: 49 | small = bool(os.environ.get("DASK_TUTORIAL_SMALL", False)) 50 | 51 | if small: 52 | N = 500 53 | else: 54 | N = 10_000 55 | 56 | if not os.path.exists(flights_raw): 57 | print("- Downloading NYC Flights dataset... ", end="", flush=True) 58 | url = "https://storage.googleapis.com/dask-tutorial-data/nycflights.tar.gz" 59 | urllib.request.urlretrieve(url, flights_raw) 60 | print("done", flush=True) 61 | 62 | if not os.path.exists(flightdir): 63 | print("- Extracting flight data... ", end="", flush=True) 64 | tar_path = os.path.join(data_dir, "nycflights.tar.gz") 65 | with tarfile.open(tar_path, mode="r:gz") as flights: 66 | flights.extractall("data/") 67 | 68 | if small: 69 | for path in glob(os.path.join(data_dir, "nycflights", "*.csv")): 70 | with open(path, "r") as f: 71 | lines = f.readlines()[:1000] 72 | 73 | with open(path, "w") as f: 74 | f.writelines(lines) 75 | 76 | print("done", flush=True) 77 | 78 | if not os.path.exists(jsondir): 79 | print("- Creating json data... ", end="", flush=True) 80 | os.mkdir(jsondir) 81 | for path in glob(os.path.join(data_dir, "nycflights", "*.csv")): 82 | prefix = os.path.splitext(os.path.basename(path))[0] 83 | df = pd.read_csv(path, nrows=N) 84 | df.to_json( 85 | os.path.join(data_dir, "flightjson", prefix + ".json"), 86 | orient="records", 87 | lines=True, 88 | ) 89 | print("done", flush=True) 90 | else: 91 | return 92 | 93 | end = time.time() 94 | print("** Created flights dataset! in {:0.2f}s**".format(end - start)) 95 | 96 | 97 | def random_array(small=None): 98 | if small is None: 99 | small = bool(os.environ.get("DASK_TUTORIAL_SMALL", False)) 100 | 101 | t0 = time.time() 102 | print("- Generating random array data... 
", end="", flush=True) 103 | if os.path.exists(os.path.join(data_dir, "random.zarr")) and os.path.exists( 104 | os.path.join(data_dir, "random_sc.zarr") 105 | ): 106 | return 107 | 108 | if small: 109 | size = 20_000_000 110 | random_arr = da.random.random(size=(size,), chunks=(625000,)) 111 | random_arr_small_chunks = da.random.random(size=(size,), chunks=(1000,)) 112 | else: 113 | size = 200_000_000 114 | random_arr = da.random.random(size=(size,), chunks=(6250000,)) 115 | random_arr_small_chunks = da.random.random(size=(size,), chunks=(10000,)) 116 | 117 | random_arr.to_zarr(os.path.join(data_dir, "random.zarr")) 118 | random_arr_small_chunks.to_zarr(os.path.join(data_dir, "random_sc.zarr")) 119 | 120 | t1 = time.time() 121 | print("** Created random data for array exercise in {:0.2f}s".format(t1 - t0)) 122 | 123 | 124 | def main(args=None): 125 | args = parse_args(args) 126 | if args.dataset == "random" or args.dataset == "all": 127 | random_array(args.small) 128 | if args.dataset == "flights" or args.dataset == "all": 129 | flights(args.small) 130 | 131 | 132 | if __name__ == "__main__": 133 | sys.exit(main()) 134 | -------------------------------------------------------------------------------- /binder/jupyterlab-workspace.json: -------------------------------------------------------------------------------- 1 | { 2 | "data": { 3 | "file-browser-filebrowser:cwd": { 4 | "path": "" 5 | }, 6 | "layout-restorer:data": { 7 | "main": { 8 | "dock": { 9 | "type": "split-area", 10 | "orientation": "horizontal", 11 | "sizes": [ 12 | 0.5, 13 | 0.5 14 | ], 15 | "children": [ 16 | { 17 | "type": "tab-area", 18 | "currentIndex": 1, 19 | "widgets": [ 20 | "notebook:00_overview.ipynb" 21 | ] 22 | }, 23 | { 24 | "type": "split-area", 25 | "orientation": "vertical", 26 | "sizes": [ 27 | 0.5, 28 | 0.5 29 | ], 30 | "children": [ 31 | { 32 | "type": "tab-area", 33 | "currentIndex": 0, 34 | "widgets": [ 35 | "dask-dashboard-launcher:/individual-task-stream" 36 | ] 37 | }, 38 | { 39 | "type": "split-area", 40 | "orientation": "horizontal", 41 | "sizes": [ 42 | 0.5, 43 | 0.5 44 | ], 45 | "children": [ 46 | { 47 | "type": "tab-area", 48 | "currentIndex": 0, 49 | "widgets": [ 50 | "dask-dashboard-launcher:/individual-progress" 51 | ] 52 | }, 53 | { 54 | "type": "tab-area", 55 | "currentIndex": 0, 56 | "widgets": [ 57 | "dask-dashboard-launcher:/individual-workers-memory" 58 | ] 59 | } 60 | ] 61 | } 62 | ] 63 | } 64 | ] 65 | }, 66 | "current": "notebook:00_overview.ipynb" 67 | }, 68 | "down": { 69 | "size": 0, 70 | "widgets": [] 71 | }, 72 | "left": { 73 | "collapsed": false, 74 | "current": "dask-dashboard-launcher", 75 | "widgets": [ 76 | "filebrowser", 77 | "running-sessions", 78 | "dask-dashboard-launcher", 79 | "@jupyterlab/toc:plugin", 80 | "extensionmanager.main-view" 81 | ] 82 | }, 83 | "right": { 84 | "collapsed": true, 85 | "widgets": [ 86 | "jp-property-inspector", 87 | "debugger-sidebar" 88 | ] 89 | }, 90 | "relativeSizes": [ 91 | 0.14820850561134083, 92 | 0.8517914943886592, 93 | 0 94 | ] 95 | }, 96 | "notebook:00_overview.ipynb": { 97 | "data": { 98 | "path": "00_overview.ipynb", 99 | "factory": "Notebook" 100 | } 101 | }, 102 | "dask-dashboard-launcher:/individual-task-stream": { 103 | "data": { 104 | "route": "/individual-task-stream", 105 | "label": "Task Stream", 106 | "key": "Task Stream" 107 | } 108 | }, 109 | "dask-dashboard-launcher:/individual-progress": { 110 | "data": { 111 | "route": "/individual-progress", 112 | "label": "Progress", 113 | "key": "Progress" 114 | } 115 | }, 116 | 
"dask-dashboard-launcher:/individual-workers-memory": { 117 | "data": { 118 | "route": "/individual-workers-memory", 119 | "label": "Workers Memory", 120 | "key": "Workers Memory" 121 | } 122 | }, 123 | "dask-dashboard-launcher": { 124 | "url": "DASK_DASHBOARD_URL", 125 | "cluster": "" 126 | } 127 | }, 128 | "metadata": { 129 | "id": "default", 130 | "last_modified": "2022-10-27T18:31:51.619821+00:00", 131 | "created": "2022-10-27T18:31:51.619821+00:00" 132 | } 133 | } 134 | -------------------------------------------------------------------------------- /conf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # Configuration file for the Sphinx documentation builder. 4 | # 5 | # This file does only contain a selection of the most common options. For a 6 | # full list see the documentation: 7 | # http://www.sphinx-doc.org/en/master/config 8 | 9 | # -- Path setup -------------------------------------------------------------- 10 | 11 | # If extensions (or modules to document with autodoc) are in another directory, 12 | # add these directories to sys.path here. If the directory is relative to the 13 | # documentation root, use os.path.abspath to make it absolute, like shown here. 14 | # 15 | # import os 16 | # import sys 17 | # sys.path.insert(0, os.path.abspath('.')) 18 | 19 | 20 | # -- Project information ----------------------------------------------------- 21 | 22 | project = "Dask Tutorial" 23 | copyright = "2018, Dask Developers" 24 | author = "Dask Developers" 25 | 26 | # The short X.Y version 27 | version = "" 28 | # The full version, including alpha/beta/rc tags 29 | release = "" 30 | 31 | 32 | # -- General configuration --------------------------------------------------- 33 | 34 | # If your documentation needs a minimal Sphinx version, state it here. 35 | # 36 | # needs_sphinx = '1.0' 37 | 38 | # Add any Sphinx extension module names here, as strings. They can be 39 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 40 | # ones. 41 | extensions = [ 42 | "sphinx.ext.mathjax", 43 | "nbsphinx", 44 | ] 45 | 46 | nbsphinx_timeout = 600 47 | # nbsphinx_execute = "always" 48 | 49 | 50 | nbsphinx_prolog = """ 51 | {% set docname = env.doc2path(env.docname, base=None) %} 52 | 53 | You can run this notebook in a `live session `_ |Binder| or view it `on Github `_. 55 | 56 | .. |Binder| image:: https://static.mybinder.org/badge_logo.svg 57 | :target: https://mybinder.org/v2/gh/dask/dask-tutorial/main?urlpath=lab/tree/{{ docname }} 58 | """ 59 | 60 | 61 | # Add any paths that contain templates here, relative to this directory. 62 | templates_path = ["_templates"] 63 | 64 | # The suffix(es) of source filenames. 65 | # You can specify multiple suffix as a list of string: 66 | # 67 | # source_suffix = ['.rst', '.md'] 68 | source_suffix = ".rst" 69 | 70 | # The master toctree document. 71 | master_doc = "index" 72 | 73 | # The language for content autogenerated by Sphinx. Refer to documentation 74 | # for a list of supported languages. 75 | # 76 | # This is also used if you do content translation via gettext catalogs. 77 | # Usually you set "language" from the command line for these cases. 78 | language = "en" 79 | 80 | # List of patterns, relative to source directory, that match files and 81 | # directories to ignore when looking for source files. 82 | # This pattern also affects html_static_path and html_extra_path . 
83 | exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", "**.ipynb_checkpoints"] 84 | 85 | # The name of the Pygments (syntax highlighting) style to use. 86 | # Commenting this out for now, if we register dask pygments, 87 | # then eventually this line can be: 88 | # pygments_style = "dask" 89 | 90 | 91 | # -- Options for HTML output ------------------------------------------------- 92 | 93 | # The theme to use for HTML and HTML Help pages. See the documentation for 94 | # a list of builtin themes. 95 | # 96 | html_theme = "dask_sphinx_theme" 97 | 98 | # Theme options are theme-specific and customize the look and feel of a theme 99 | # further. For a list of options available for each theme, see the 100 | # documentation. 101 | # 102 | # html_theme_options = {} 103 | 104 | # Add any paths that contain custom static files (such as style sheets) here, 105 | # relative to this directory. They are copied after the builtin static files, 106 | # so a file named "default.css" will overwrite the builtin "default.css". 107 | html_static_path = [] 108 | 109 | # Custom sidebar templates, must be a dictionary that maps document names 110 | # to template names. 111 | # 112 | # The default sidebars (for documents that don't match any pattern) are 113 | # defined by theme itself. Builtin themes are using these templates by 114 | # default: ``['localtoc.html', 'relations.html', 'sourcelink.html', 115 | # 'searchbox.html']``. 116 | # 117 | # html_sidebars = {} 118 | 119 | 120 | # -- Options for HTMLHelp output --------------------------------------------- 121 | 122 | # Output file base name for HTML help builder. 123 | htmlhelp_basename = "DaskTutorialdoc" 124 | 125 | 126 | # -- Options for LaTeX output ------------------------------------------------ 127 | 128 | latex_elements = { 129 | # The paper size ('letterpaper' or 'a4paper'). 130 | # 131 | # 'papersize': 'letterpaper', 132 | # The font size ('10pt', '11pt' or '12pt'). 133 | # 134 | # 'pointsize': '10pt', 135 | # Additional stuff for the LaTeX preamble. 136 | # 137 | # 'preamble': '', 138 | # Latex figure (float) alignment 139 | # 140 | # 'figure_align': 'htbp', 141 | } 142 | 143 | # Grouping the document tree into LaTeX files. List of tuples 144 | # (source start file, target name, title, 145 | # author, documentclass [howto, manual, or own class]). 146 | latex_documents = [ 147 | ( 148 | master_doc, 149 | "DaskTutorial.tex", 150 | "Dask Tutorial Documentation", 151 | "Dask Developers", 152 | "manual", 153 | ), 154 | ] 155 | 156 | 157 | # -- Options for manual page output ------------------------------------------ 158 | 159 | # One entry per manual page. List of tuples 160 | # (source start file, name, description, authors, manual section). 161 | man_pages = [(master_doc, "dasktutorial", "Dask Tutorial Documentation", [author], 1)] 162 | 163 | 164 | # -- Options for Texinfo output ---------------------------------------------- 165 | 166 | # Grouping the document tree into Texinfo files. 
List of tuples 167 | # (source start file, target name, title, author, 168 | # dir menu entry, description, category) 169 | texinfo_documents = [ 170 | ( 171 | master_doc, 172 | "DaskTutorial", 173 | "Dask Tutorial Documentation", 174 | author, 175 | "DaskTutorial", 176 | "One line description of project.", 177 | "Miscellaneous", 178 | ), 179 | ] 180 | -------------------------------------------------------------------------------- /00_overview.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Welcome to the Dask Tutorial\n", 8 | "\n", 9 | "\"Dask\n", 10 | "\n", 11 | "Dask is a parallel and distributed computing library that scales the existing Python and PyData ecosystem.\n", 12 | "\n", 13 | "Dask can scale up to your full laptop capacity and out to a cloud cluster." 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "## An example Dask computation\n", 21 | "\n", 22 | "In the following lines of code, we're reading the NYC taxi cab data from 2015 and finding the mean tip amount. Don't worry about the code, this is just for a quick demonstration. We'll go over all of this in the next notebook. :)\n", 23 | "\n", 24 | "**Note for learners:** This might be heavy for Binder.\n", 25 | "\n", 26 | "**Note for instructors:** Don't forget to open the Dask Dashboard!" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": null, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "import dask.dataframe as dd\n", 36 | "from dask.distributed import Client" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "client = Client()\n", 46 | "client" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "ddf = dd.read_parquet(\n", 56 | " \"s3://coiled-data/uber/\",\n", 57 | " columns=[\"tips\"],\n", 58 | " storage_options={\"anon\": True},\n", 59 | ")" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "result = ddf[\"tips\"][ddf[\"tips\"] > 0].mean().compute()\n", 69 | "result" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "## What is [Dask](\"https://www.dask.org/\")?\n", 77 | "\n", 78 | "There are many parts to the \"Dask\" the project:\n", 79 | "* Collections/API also known as \"core-library\".\n", 80 | "* Distributed -- to create clusters\n", 81 | "* Intergrations and broader ecosystem\n" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "### Dask Collections\n", 89 | "\n", 90 | "Dask provides **multi-core** and **distributed+parallel** execution on **larger-than-memory** datasets\n", 91 | "\n", 92 | "We can think of Dask's APIs (also called collections) at a high and a low level:\n", 93 | "\n", 94 | "
\n", 95 | "\"High\n", 96 | "
\n", 97 | "\n", 98 | "* **High-level collections:** Dask provides high-level Array, Bag, and DataFrame\n", 99 | " collections that mimic NumPy, lists, and pandas but can operate in parallel on\n", 100 | " datasets that don't fit into memory.\n", 101 | "* **Low-level collections:** Dask also provides low-level Delayed and Futures\n", 102 | " collections that give you finer control to build custom parallel and distributed computations." 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "### Dask Cluster\n", 110 | "\n", 111 | "Most of the times when you are using Dask, you will be using a distributed scheduler, which exists in the context of a Dask cluster. The Dask cluster is structured as:\n", 112 | "\n", 113 | "
\n", 114 | "\"Distributed\n", 115 | "
" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "### Dask Ecosystem\n", 123 | "\n", 124 | "In addition to the core Dask library and its distributed scheduler, the Dask ecosystem connects several additional initiatives, including:\n", 125 | "\n", 126 | "- Dask-ML (parallel scikit-learn-style API)\n", 127 | "- Dask-image\n", 128 | "- Dask-cuDF\n", 129 | "- Dask-sql\n", 130 | "- Dask-snowflake\n", 131 | "- Dask-mongo\n", 132 | "- Dask-bigquery\n", 133 | "\n", 134 | "Community libraries that have built-in dask integrations like:\n", 135 | "\n", 136 | "- Xarray\n", 137 | "- XGBoost\n", 138 | "- Prefect\n", 139 | "- Airflow\n", 140 | "\n", 141 | "Dask deployment libraries\n", 142 | "- Dask-kubernetes\n", 143 | "- Dask-YARN\n", 144 | "- Dask-gateway\n", 145 | "- Dask-cloudprovider\n", 146 | "- jobqueue\n", 147 | "\n", 148 | "... When we talk about the Dask project we include all these efforts as part of the community. " 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "## Dask Use Cases\n", 156 | "\n", 157 | "Dask is used in multiple fields such as:\n", 158 | "\n", 159 | "* Geospatial\n", 160 | "* Finance\n", 161 | "* Astrophysics\n", 162 | "* Microbiology\n", 163 | "* Environmental science\n", 164 | "\n", 165 | "Check out the Dask [use cases](https://stories.dask.org/en/latest/) page that provides a number of sample workflows." 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "## Prepare" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": { 178 | "tags": [] 179 | }, 180 | "source": [ 181 | "#### 1. You should clone this repository\n", 182 | "\n", 183 | "\n", 184 | " git clone http://github.com/dask/dask-tutorial\n", 185 | "\n", 186 | "and then install necessary packages.\n", 187 | "There are three different ways to achieve this, pick the one that best suits you, and ***only pick one option***.\n", 188 | "They are, in order of preference:\n", 189 | "\n", 190 | "#### 2a) Create a conda environment (preferred)\n", 191 | "\n", 192 | "In the main repo directory\n", 193 | "\n", 194 | "\n", 195 | " conda env create -f binder/environment.yml\n", 196 | " conda activate dask-tutorial\n", 197 | "\n", 198 | "\n", 199 | "#### 2b) Install into an existing environment\n", 200 | "\n", 201 | "You will need the following core libraries\n", 202 | "\n", 203 | "\n", 204 | " conda install -c conda-forge ipycytoscape jupyterlab python-graphviz matplotlib zarr xarray pooch pyarrow s3fs scipy dask distributed dask-labextension\n", 205 | "\n", 206 | "Note that these options will alter your existing environment, potentially changing the versions of packages you already\n", 207 | "have installed." 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "metadata": {}, 213 | "source": [ 214 | "## Tutorial Structure\n", 215 | "\n", 216 | "Each section is a Jupyter notebook. There's a mixture of text, code, and exercises.\n", 217 | "\n", 218 | "0. [Overview](00_overview.ipynb) - dask's place in the universe.\n", 219 | "\n", 220 | "1. [Dataframe](01_dataframe.ipynb) - parallelized operations on many pandas dataframes spread across your cluster.\n", 221 | "\n", 222 | "2. [Array](02_array.ipynb) - blocked numpy-like functionality with a collection of numpy arrays spread across your cluster.\n", 223 | "\n", 224 | "3. [Delayed](03_dask.delayed.ipynb) - the single-function way to parallelize general python code.\n", 225 | "\n", 226 | "4. 
[Deployment/Distributed](04_distributed.ipynb) - Dask's scheduler for clusters, with details of how to view the UI.\n", 227 | "\n", 228 | "5. [Distributed Futures](05_futures.ipynb) - non-blocking results that compute asynchronously.\n", 229 | "\n", 230 | "6. Conclusion\n", 231 | "\n", 232 | "\n", 233 | "If you haven't used Jupyterlab, it's similar to the Jupyter Notebook. If you haven't used the Notebook, the quick intro is\n", 234 | "\n", 235 | "1. There are two modes: command and edit\n", 236 | "2. From command mode, press `Enter` to edit a cell (like this markdown cell)\n", 237 | "3. From edit mode, press `Esc` to change to command mode\n", 238 | "4. Press `shift+enter` to execute a cell and move to the next cell.\n", 239 | "\n", 240 | "The toolbar has commands for executing, converting, and creating cells." 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "### Exercise: Print `Hello, world!`\n", 248 | "Each notebook will have exercises for you to solve. You'll be given a blank or partially completed cell, followed by a hidden cell with a solution. For example.\n", 249 | "\n", 250 | "\n", 251 | "Print the text \"Hello, world!\"." 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": null, 257 | "metadata": {}, 258 | "outputs": [], 259 | "source": [ 260 | "# Your code here" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "The next cell has the solution. Click the ellipses to expand the solution, and always make sure to run the solution cell,\n", 268 | "in case later sections of the notebook depend on the output from the solution." 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": { 275 | "jupyter": { 276 | "source_hidden": true 277 | }, 278 | "tags": [] 279 | }, 280 | "outputs": [], 281 | "source": [ 282 | "print(\"Hello, world!\")" 283 | ] 284 | }, 285 | { 286 | "cell_type": "markdown", 287 | "metadata": { 288 | "tags": [] 289 | }, 290 | "source": [ 291 | "## Useful Links" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "* Reference\n", 299 | " * [Docs](https://dask.org/)\n", 300 | " * [Examples](https://examples.dask.org/)\n", 301 | " * [Code](https://github.com/dask/dask/)\n", 302 | " * [Blog](https://blog.dask.org/)\n", 303 | "* Ask for help\n", 304 | " * [`dask`](http://stackoverflow.com/questions/tagged/dask) tag on Stack Overflow, for usage questions\n", 305 | " * [github issues](https://github.com/dask/dask/issues/new) for bug reports and feature requests\n", 306 | " * [discourse forum](https://dask.discourse.group/) for general, non-bug, questions and discussion\n", 307 | " * Attend a live tutorial" 308 | ] 309 | } 310 | ], 311 | "metadata": { 312 | "anaconda-cloud": {}, 313 | "kernelspec": { 314 | "display_name": "Python 3 (ipykernel)", 315 | "language": "python", 316 | "name": "python3" 317 | }, 318 | "language_info": { 319 | "codemirror_mode": { 320 | "name": "ipython", 321 | "version": 3 322 | }, 323 | "file_extension": ".py", 324 | "mimetype": "text/x-python", 325 | "name": "python", 326 | "nbconvert_exporter": "python", 327 | "pygments_lexer": "ipython3", 328 | "version": "3.10.5" 329 | } 330 | }, 331 | "nbformat": 4, 332 | "nbformat_minor": 4 333 | } 334 | -------------------------------------------------------------------------------- /05_futures.ipynb: -------------------------------------------------------------------------------- 1 | { 
2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "\"Dask\n", 11 | "\n", 12 | "# Futures - non-blocking distributed calculations\n", 13 | "\n", 14 | "Submit arbitrary functions for computation in a parallelized, eager, and non-blocking way. \n", 15 | "\n", 16 | "The `futures` interface (derived from the built-in `concurrent.futures`) provide fine-grained real-time execution for custom situations. We can submit individual functions for evaluation with one set of inputs, or evaluated over a sequence of inputs with `submit()` and `map()`. The call returns immediately, giving one or more *futures*, whose status begins as \"pending\" and later becomes \"finished\". There is no blocking of the local Python session.\n", 17 | "\n", 18 | "This is the important difference between futures and delayed. Both can be used to support arbitrary task scheduling, but delayed is lazy (it just constructs a graph) whereas futures are eager. With futures, as soon as the inputs are available and there is compute available, the computation starts. \n", 19 | "\n", 20 | "**Related Documentation**\n", 21 | "\n", 22 | "* [Futures documentation](https://docs.dask.org/en/latest/futures.html)\n", 23 | "* [Futures screencast](https://www.youtube.com/watch?v=07EiCpdhtDE)\n", 24 | "* [Futures examples](https://examples.dask.org/futures.html)" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "from dask.distributed import Client\n", 34 | "\n", 35 | "client = Client(n_workers=4)\n", 36 | "client" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "## A Typical Workflow\n", 44 | "\n", 45 | "This is the same workflow that we saw in the delayed notebook. It is for-loopy and the data is not necessarily an array or a dataframe. The following example outlines a read-transform-write:\n", 46 | "\n", 47 | "```python\n", 48 | "def process_file(filename):\n", 49 | " data = read_a_file(filename)\n", 50 | " data = do_a_transformation(data)\n", 51 | " destination = f\"results/{filename}\"\n", 52 | " write_out_data(data, destination)\n", 53 | " return destination\n", 54 | "\n", 55 | "futures = []\n", 56 | "for filename in filenames:\n", 57 | " future = client.submit(process_file, filename)\n", 58 | " futures.append(future)\n", 59 | " \n", 60 | "futures\n", 61 | "```" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "## Basics\n", 69 | "\n", 70 | "Just like we did in the delayed notebook, let's make some toy functions, `inc` and `add`, that sleep for a while to simulate work. We'll then time running these functions normally." 
71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": null, 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "from time import sleep\n", 80 | "\n", 81 | "\n", 82 | "def inc(x):\n", 83 | " sleep(1)\n", 84 | " return x + 1\n", 85 | "\n", 86 | "\n", 87 | "def double(x):\n", 88 | " sleep(2)\n", 89 | " return 2 * x\n", 90 | "\n", 91 | "\n", 92 | "def add(x, y):\n", 93 | " sleep(1)\n", 94 | " return x + y" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "We can run these locally" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "inc(1)" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "Or we can submit them to run remotely with Dask. This immediately returns a future that points to the ongoing computation, and eventually to the stored result." 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "future = client.submit(inc, 1) # returns immediately with pending future\n", 127 | "future" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "If you wait a second, and then check on the future again, you’ll see that it has finished." 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "future" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "You can block on the computation and gather the result with the `.result()` method." 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "future.result()" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "#### Other ways to wait for a future\n", 167 | "```python\n", 168 | "from dask.distributed import wait, progress\n", 169 | "progress(future)\n", 170 | "```\n", 171 | "\n", 172 | "shows a progress bar in *this* notebook, rather than having to go to the dashboard. This progress bar is also asynchronous, and doesn't block the execution of other code in the meanwhile.\n", 173 | "\n", 174 | "```python\n", 175 | "wait(future)\n", 176 | "```\n", 177 | "blocks and forces the notebook to wait until the computation pointed to by `future` is done. However, note that if the result of `inc()` is sitting in the cluster, it would take **no time** to execute the computation now, because Dask notices that we are asking for the result of a computation it already knows about. More on this later.\n", 178 | "\n", 179 | "#### Other ways to gather results\n", 180 | "```python\n", 181 | "client.gather(futures)\n", 182 | "```\n", 183 | "\n", 184 | "gathers results from more than one future." 
185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "## `client.compute`\n", 192 | "\n", 193 | "Generally, any Dask operation that is executed using `.compute()` or `dask.compute()` can be submitted for asynchronous execution using `client.compute()` instead.\n", 194 | "\n", 195 | "Here is an example from the delayed notebook:" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": null, 201 | "metadata": {}, 202 | "outputs": [], 203 | "source": [ 204 | "import dask\n", 205 | "\n", 206 | "\n", 207 | "@dask.delayed\n", 208 | "def inc(x):\n", 209 | " sleep(1)\n", 210 | " return x + 1\n", 211 | "\n", 212 | "\n", 213 | "@dask.delayed\n", 214 | "def add(x, y):\n", 215 | " sleep(1)\n", 216 | " return x + y\n", 217 | "\n", 218 | "\n", 219 | "x = inc(1)\n", 220 | "y = inc(2)\n", 221 | "z = add(x, y)" 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": {}, 227 | "source": [ 228 | "So far we have a regular `dask.delayed` output. When we pass `z` to `client.compute` we get a future back and Dask starts evaluating the task graph. " 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": null, 234 | "metadata": {}, 235 | "outputs": [], 236 | "source": [ 237 | "# notice the difference from z.compute()\n", 238 | "# notice that this cell completes immediately\n", 239 | "future = client.compute(z)\n", 240 | "future" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": null, 246 | "metadata": {}, 247 | "outputs": [], 248 | "source": [ 249 | "future.result() # waits until result is ready" 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": {}, 255 | "source": [ 256 | "When using futures, the *computation moves to the data* rather than the other way around, and the client, in the local Python session, need never see the intermediate values." 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "## `client.submit`\n", 264 | "\n", 265 | "`client.submit` takes a function and arguments, pushes these to the cluster, returning a `Future` representing the result to be computed. The function is passed to a worker process for evaluation. This looks a lot like doing `client.compute()`, above, except now we are passing the function and arguments directly to the cluster." 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": null, 271 | "metadata": {}, 272 | "outputs": [], 273 | "source": [ 274 | "def inc(x):\n", 275 | " sleep(1)\n", 276 | " return x + 1\n", 277 | "\n", 278 | "\n", 279 | "future_x = client.submit(inc, 1)\n", 280 | "future_y = client.submit(inc, 2)\n", 281 | "future_z = client.submit(sum, [future_x, future_y])\n", 282 | "future_z" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": null, 288 | "metadata": {}, 289 | "outputs": [], 290 | "source": [ 291 | "future_z.result() # waits until result is ready" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "The arguments to`client.submit` can be regular Python functions and objects, futures from other submit operations or `dask.delayed` objects." 299 | ] 300 | }, 301 | { 302 | "cell_type": "markdown", 303 | "metadata": {}, 304 | "source": [ 305 | "### How does it work?\n", 306 | "\n", 307 | "Each future represents a result held, or being evaluated by the cluster. 
Thus we can control caching of intermediate values - when a future is no longer referenced, its value is forgotten. In the solution, above, futures are held for each of the function calls. These results would not need to be re-evaluated if we chose to submit more work that needed them.\n", 308 | "\n", 309 | "We can explicitly pass data from our local session into the cluster using `client.scatter()`, but usually it is better to construct functions that do the loading of data within the workers themselves, so that there is no need to serialize and communicate the data. Most of the loading functions within Dask, such as `dd.read_csv`, work this way. Similarly, we normally don't want to `gather()` results that are too big in memory." 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "metadata": {}, 315 | "source": [ 316 | "## Example: Sporadically failing task\n", 317 | "\n", 318 | "Let's imagine a task that sometimes fails. You might encounter this when dealing with input data where sometimes a file is malformed, or maybe a request times out." 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": null, 324 | "metadata": {}, 325 | "outputs": [], 326 | "source": [ 327 | "from random import random\n", 328 | "\n", 329 | "\n", 330 | "def flaky_inc(i):\n", 331 | " if random() < 0.2:\n", 332 | " raise ValueError(\"You hit the error!\")\n", 333 | " return i + 1" 334 | ] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "metadata": {}, 339 | "source": [ 340 | "If you run this function over and over again, it will sometimes fail. \n", 341 | "\n", 342 | "```python\n", 343 | ">>> flaky_inc(2)\n", 344 | "---------------------------------------------------------------------------\n", 345 | "ValueError Traceback (most recent call last)\n", 346 | "Input In [65], in ()\n", 347 | "----> 1 flaky_inc(2)\n", 348 | "\n", 349 | "Input In [61], in flaky_inc(i)\n", 350 | " 3 def flaky_inc(i):\n", 351 | " 4 if random() < 0.5:\n", 352 | "----> 5 raise ValueError(\"You hit the error!\")\n", 353 | " 6 return i + 1\n", 354 | "\n", 355 | "ValueError: You hit the error!\n", 356 | "```" 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "We can run this function on a range of inputs using `client.map`." 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": null, 369 | "metadata": {}, 370 | "outputs": [], 371 | "source": [ 372 | "futures = client.map(flaky_inc, range(10))" 373 | ] 374 | }, 375 | { 376 | "cell_type": "markdown", 377 | "metadata": {}, 378 | "source": [ 379 | "Notice how the cell returned even though some of the computations failed. 
We can inspect these futures one by one and find the ones that failed:" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": null, 385 | "metadata": {}, 386 | "outputs": [], 387 | "source": [ 388 | "for i, future in enumerate(futures):\n", 389 | " print(i, future.status)" 390 | ] 391 | }, 392 | { 393 | "cell_type": "markdown", 394 | "metadata": {}, 395 | "source": [ 396 | "You can rerun those specific futures to try to get the task to successfully complete:" 397 | ] 398 | }, 399 | { 400 | "cell_type": "code", 401 | "execution_count": null, 402 | "metadata": {}, 403 | "outputs": [], 404 | "source": [ 405 | "futures[5].retry()" 406 | ] 407 | }, 408 | { 409 | "cell_type": "code", 410 | "execution_count": null, 411 | "metadata": {}, 412 | "outputs": [], 413 | "source": [ 414 | "for i, future in enumerate(futures):\n", 415 | " print(i, future.status)" 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "metadata": {}, 421 | "source": [ 422 | "A more concise way of retrying in the case of sporadic failures is by setting the number of retries in the `client.compute`, `client.submit` or `client.map` method.\n", 423 | "\n", 424 | "**Note**: In this example we also need to set `pure=False` to let Dask know that the arguments to the function do not totally determine the output." 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": null, 430 | "metadata": {}, 431 | "outputs": [], 432 | "source": [ 433 | "futures = client.map(flaky_inc, range(10), retries=5, pure=False)\n", 434 | "future_z = client.submit(sum, futures)\n", 435 | "future_z.result()" 436 | ] 437 | }, 438 | { 439 | "cell_type": "markdown", 440 | "metadata": {}, 441 | "source": [ 442 | "You will see a lot of warnings, but the computation should eventually succeed." 443 | ] 444 | }, 445 | { 446 | "cell_type": "markdown", 447 | "metadata": {}, 448 | "source": [ 449 | "## Why use Futures?\n", 450 | "\n", 451 | "The futures API offers a work submission style that can easily emulate the map/reduce paradigm. If that is familiar to you then futures might be the simplest entrypoint into Dask. \n", 452 | "\n", 453 | "The other big benefit of futures is that the intermediate results, represented by futures, can be passed to new tasks without having to pull data locally from the cluster. New operations can be setup to work on the output of previous jobs that haven't even begun yet." 
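For example, here is a minimal map/reduce-style sketch that reuses the `inc` function from earlier; the intermediate futures feed straight into the reduction without ever being gathered locally:

```python
futures = client.map(inc, range(10))  # "map" step: one future per input
total = client.submit(sum, futures)   # "reduce" step: consumes the futures remotely
total.result()
```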
454 | ] 455 | } 456 | ], 457 | "metadata": { 458 | "anaconda-cloud": {}, 459 | "kernelspec": { 460 | "display_name": "Python 3 (ipykernel)", 461 | "language": "python", 462 | "name": "python3" 463 | }, 464 | "language_info": { 465 | "codemirror_mode": { 466 | "name": "ipython", 467 | "version": 3 468 | }, 469 | "file_extension": ".py", 470 | "mimetype": "text/x-python", 471 | "name": "python", 472 | "nbconvert_exporter": "python", 473 | "pygments_lexer": "ipython3", 474 | "version": "3.10.5" 475 | } 476 | }, 477 | "nbformat": 4, 478 | "nbformat_minor": 4 479 | } 480 | -------------------------------------------------------------------------------- /04_distributed.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "\"Dask\n", 11 | "\n", 12 | "# Distributed - spread your data and computation across a cluster\n", 13 | "\n", 14 | "As we covered at the beginning Dask has the ability to run work on multiple machines using the distributed scheduler.\n", 15 | "\n", 16 | "Until now we have actually been using the distributed scheduler for our work, but just on a single machine.\n", 17 | "\n", 18 | "When we instantiate a `Client()` object with no arguments it will attempt to locate a Dask cluster. It will check your local Dask config and environment variables to see if connection information has been specified. If not it will create an instance of `LocalCluster` and use that.\n", 19 | "\n", 20 | "*Specifying connection information in config is useful for system administrators to provide access to their users. We do this in the [Dask Helm Chart for Kubernetes](https://github.com/dask/helm-chart/blob/master/dask/templates/dask-jupyter-deployment.yaml#L46-L48), the chart installs a multi-node Dask cluster and a Jupyter server on a Kubernetes cluster and Jupyter is preconfigured to discover the distributed cluster.*" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "## Local Cluster\n", 28 | "\n", 29 | "Let's explore the `LocalCluster` object ourselves and see what it is doing." 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "from dask.distributed import LocalCluster, Client" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [ 47 | "cluster = LocalCluster()\n", 48 | "cluster" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "Creating a cluster object will create a Dask scheduler and a number of Dask workers. If no arguments are specified then it will autodetect the number of CPU cores your system has and the amount of memory and create workers to appropriately fill that.\n", 56 | "\n", 57 | "You can also specify these arguments yourself. 
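For example (a sketch only; sensible values depend on your machine), you could ask for two workers with two threads each and a memory cap per worker:

```python
from dask.distributed import LocalCluster, Client

# Explicitly sized local cluster instead of letting Dask autodetect resources.
cluster = LocalCluster(
    n_workers=2,            # two worker processes
    threads_per_worker=2,   # two threads in each worker
    memory_limit="2GiB",    # memory cap per worker
)
client = Client(cluster)
client
```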
Let's have a look at the docstring to see the options we have available.\n", 58 | "\n", 59 | "*These arguments can also be passed to `Client` and in the case where it creates a `LocalCluster` they will just be passed on down the line.*" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "?LocalCluster" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "Our cluster object has attributes and methods which we can use to access information about our cluster. For instance we can get the log output from the scheduler and all the workers with the `get_logs()` method." 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "cluster.get_logs()" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "We can access the url that the Dask dashboard is being hosted at." 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "cluster.dashboard_link" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "In order for Dask to use our cluster we still need to create a `Client` object, but as we have already created a cluster we can pass that directly to our client." 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "client = Client(cluster)\n", 117 | "client" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "del client, cluster" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "## Remote clusters via SSH\n", 134 | "\n", 135 | "A common way to distribute your work onto multiple machines is via SSH. Dask has a cluster manager which will handle creating SSH connections for you called `SSHCluster`." 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [ 142 | "```python\n", 143 | "from dask.distributed import SSHCluster\n", 144 | "```" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "When constructing this cluster manager we need to pass a list of addresses, either hostnames or IP addresses, which we will SSH into and attempt to start a Dask scheduler or worker on." 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "```python\n", 159 | "cluster = SSHCluster([\"localhost\", \"hostA\", \"hostB\"])\n", 160 | "cluster\n", 161 | "```" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "When we create our `SSHCluster` object we have given a list of three hostnames.\n", 169 | "\n", 170 | "The first host in the list will be used as the scheduler, all other hosts will be used as workers. If you're on the same network it wouldn't be unreasonable to set your local machine as the scheduler and then use other machines as workers.\n", 171 | "\n", 172 | "If your servers are remote to you, in the cloud for instance, you may want the scheduler to be a remote machine too to avoid network bottlenecks." 
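Putting the pieces together (a sketch; `hostA` and `hostB` are placeholder hostnames, and passwordless SSH access to them is assumed), the SSH cluster manager is used like any other:

```python
from dask.distributed import SSHCluster, Client

# The first host runs the scheduler, the remaining hosts run workers.
cluster = SSHCluster(["localhost", "hostA", "hostB"])
client = Client(cluster)

# Work is now submitted via the scheduler to the SSH-launched workers.
print(client.submit(lambda x: x + 1, 10).result())

client.close()
cluster.close()
```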
173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "## Scalable clusters\n", 180 | "\n", 181 | "Both of the clusters we have seen so far are fixed-size clusters. We are either running locally and using all the resources in our machine, or we are using an explicit number of other machines via SSH.\n", 182 | "\n", 183 | "With some cluster managers it is possible to increase and decrease the number of workers, either by calling `cluster.scale(n)` in your code, where `n` is the desired number of workers, or by letting Dask do this dynamically by calling `cluster.adapt(minimum=1, maximum=100)`, where `minimum` and `maximum` are your preferred limits for Dask to abide by.\n", 184 | "\n", 185 | "It is always good to keep your minimum at 1 or higher, as Dask will start running work on a single worker in order to profile how long things take and extrapolate how many additional workers it thinks it needs. Getting new workers may take time depending on your setup, so keeping this at 1 or above means this profiling will start immediately.\n", 186 | "\n", 187 | "We currently have cluster managers for [Kubernetes](https://kubernetes.dask.org/en/latest/), [Hadoop/Yarn](https://yarn.dask.org/en/latest/), [cloud platforms](https://cloudprovider.dask.org/en/latest/) and [batch systems including PBS, SLURM and SGE](http://jobqueue.dask.org/en/latest/).\n", 188 | "\n", 189 | "These cluster managers allow users who have access to resources such as these to bootstrap Dask clusters onto them. If an institution wishes to provide a central service that users can request Dask clusters from, there is also [Dask Gateway](https://gateway.dask.org/)." 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "## Cluster components\n", 197 | "\n", 198 | "The minimum requirement for a functioning Dask cluster is a scheduler process and one worker process.\n", 199 | "\n", 200 | "We can start these processes manually via the CLI. 
Let's start with the scheduler.\n", 201 | "\n", 202 | "```console\n", 203 | "$ dask-scheduler \n", 204 | "2022-07-07 14:11:35,661 - distributed.scheduler - INFO - -----------------------------------------------\n", 205 | "2022-07-07 14:11:37,405 - distributed.scheduler - INFO - State start\n", 206 | "2022-07-07 14:11:37,408 - distributed.scheduler - INFO - -----------------------------------------------\n", 207 | "2022-07-07 14:11:37,409 - distributed.scheduler - INFO - Clear task state\n", 208 | "2022-07-07 14:11:37,409 - distributed.scheduler - INFO - Scheduler at: tcp://10.51.100.80:8786\n", 209 | "2022-07-07 14:11:37,409 - distributed.scheduler - INFO - dashboard at: :8787\n", 210 | "```\n", 211 | "\n", 212 | "Then we can connect a worker on the address that the scheduler is listening on.\n", 213 | "\n", 214 | "```console\n", 215 | "$ dask-worker tcp://10.51.100.80:8786 --nworkers=auto\n", 216 | "2022-07-07 14:12:53,915 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.51.100.80:58051'\n", 217 | "2022-07-07 14:12:53,922 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.51.100.80:58052'\n", 218 | "2022-07-07 14:12:53,924 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.51.100.80:58053'\n", 219 | "2022-07-07 14:12:53,925 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.51.100.80:58054'\n", 220 | "2022-07-07 14:12:55,222 - distributed.worker - INFO - Start worker at: tcp://10.51.100.80:58065\n", 221 | "2022-07-07 14:12:55,222 - distributed.worker - INFO - Listening to: tcp://10.51.100.80:58065\n", 222 | "2022-07-07 14:12:55,223 - distributed.worker - INFO - dashboard at: 10.51.100.80:58068\n", 223 | "2022-07-07 14:12:55,223 - distributed.worker - INFO - Waiting to connect to: tcp://10.51.100.80:8786\n", 224 | "2022-07-07 14:12:55,223 - distributed.worker - INFO - -------------------------------------------------\n", 225 | "2022-07-07 14:12:55,223 - distributed.worker - INFO - Threads: 3\n", 226 | "2022-07-07 14:12:55,223 - distributed.worker - INFO - Memory: 4.00 GiB\n", 227 | "2022-07-07 14:12:55,224 - distributed.worker - INFO - Local Directory: /Users/jtomlinson/Projects/dask/dask-tutorial/dask-worker-space/worker-hlvac6m5\n", 228 | "2022-07-07 14:12:55,225 - distributed.worker - INFO - -------------------------------------------------\n", 229 | "2022-07-07 14:12:55,227 - distributed.worker - INFO - Start worker at: tcp://10.51.100.80:58066\n", 230 | "2022-07-07 14:12:55,227 - distributed.worker - INFO - Listening to: tcp://10.51.100.80:58066\n", 231 | "2022-07-07 14:12:55,227 - distributed.worker - INFO - dashboard at: 10.51.100.80:58070\n", 232 | "2022-07-07 14:12:55,227 - distributed.worker - INFO - Waiting to connect to: tcp://10.51.100.80:8786\n", 233 | "2022-07-07 14:12:55,227 - distributed.worker - INFO - -------------------------------------------------\n", 234 | "2022-07-07 14:12:55,227 - distributed.worker - INFO - Threads: 3\n", 235 | "2022-07-07 14:12:55,228 - distributed.worker - INFO - Memory: 4.00 GiB\n", 236 | "2022-07-07 14:12:55,228 - distributed.worker - INFO - Local Directory: /Users/jtomlinson/Projects/dask/dask-tutorial/dask-worker-space/worker-e1suf_7o\n", 237 | "2022-07-07 14:12:55,229 - distributed.worker - INFO - -------------------------------------------------\n", 238 | "2022-07-07 14:12:55,231 - distributed.worker - INFO - Start worker at: tcp://10.51.100.80:58063\n", 239 | "2022-07-07 14:12:55,233 - distributed.worker - INFO - Listening to: tcp://10.51.100.80:58063\n", 240 | "2022-07-07 14:12:55,233 - distributed.worker - INFO - 
dashboard at: 10.51.100.80:58067\n", 241 | "2022-07-07 14:12:55,233 - distributed.worker - INFO - Waiting to connect to: tcp://10.51.100.80:8786\n", 242 | "2022-07-07 14:12:55,233 - distributed.worker - INFO - -------------------------------------------------\n", 243 | "2022-07-07 14:12:55,234 - distributed.worker - INFO - Threads: 3\n", 244 | "2022-07-07 14:12:55,234 - distributed.worker - INFO - Memory: 4.00 GiB\n", 245 | "2022-07-07 14:12:55,235 - distributed.worker - INFO - Local Directory: /Users/jtomlinson/Projects/dask/dask-tutorial/dask-worker-space/worker-oq39ihb4\n", 246 | "2022-07-07 14:12:55,236 - distributed.worker - INFO - -------------------------------------------------\n", 247 | "2022-07-07 14:12:55,246 - distributed.worker - INFO - Registered to: tcp://10.51.100.80:8786\n", 248 | "2022-07-07 14:12:55,246 - distributed.worker - INFO - -------------------------------------------------\n", 249 | "2022-07-07 14:12:55,249 - distributed.core - INFO - Starting established connection\n", 250 | "2022-07-07 14:12:55,264 - distributed.worker - INFO - Registered to: tcp://10.51.100.80:8786\n", 251 | "2022-07-07 14:12:55,264 - distributed.worker - INFO - -------------------------------------------------\n", 252 | "2022-07-07 14:12:55,267 - distributed.worker - INFO - Registered to: tcp://10.51.100.80:8786\n", 253 | "2022-07-07 14:12:55,267 - distributed.core - INFO - Starting established connection\n", 254 | "2022-07-07 14:12:55,267 - distributed.worker - INFO - -------------------------------------------------\n", 255 | "2022-07-07 14:12:55,269 - distributed.core - INFO - Starting established connection\n", 256 | "2022-07-07 14:12:55,273 - distributed.worker - INFO - Start worker at: tcp://10.51.100.80:58064\n", 257 | "2022-07-07 14:12:55,273 - distributed.worker - INFO - Listening to: tcp://10.51.100.80:58064\n", 258 | "2022-07-07 14:12:55,273 - distributed.worker - INFO - dashboard at: 10.51.100.80:58069\n", 259 | "2022-07-07 14:12:55,273 - distributed.worker - INFO - Waiting to connect to: tcp://10.51.100.80:8786\n", 260 | "2022-07-07 14:12:55,274 - distributed.worker - INFO - -------------------------------------------------\n", 261 | "2022-07-07 14:12:55,274 - distributed.worker - INFO - Threads: 3\n", 262 | "2022-07-07 14:12:55,275 - distributed.worker - INFO - Memory: 4.00 GiB\n", 263 | "2022-07-07 14:12:55,275 - distributed.worker - INFO - Local Directory: /Users/jtomlinson/Projects/dask/dask-tutorial/dask-worker-space/worker-zfie55ku\n", 264 | "2022-07-07 14:12:55,276 - distributed.worker - INFO - -------------------------------------------------\n", 265 | "2022-07-07 14:12:55,299 - distributed.worker - INFO - Registered to: tcp://10.51.100.80:8786\n", 266 | "2022-07-07 14:12:55,300 - distributed.worker - INFO - -------------------------------------------------\n", 267 | "2022-07-07 14:12:55,302 - distributed.core - INFO - Starting established connection\n", 268 | "```\n", 269 | "\n", 270 | "Then in Python we can connect a client to this cluster and submit some work.\n", 271 | "\n", 272 | "```python\n", 273 | ">>> from dask.distributed import Client\n", 274 | ">>> client = Client(\"tcp://10.51.100.80:8786\")\n", 275 | ">>> client.submit(lambda: 1+1)\n", 276 | "```\n", 277 | "\n", 278 | "We can also do this in Python by importing cluster components and creating them directly." 
279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": null, 284 | "metadata": {}, 285 | "outputs": [], 286 | "source": [ 287 | "from dask.distributed import Scheduler, Worker, Client\n", 288 | "\n", 289 | "async with Scheduler() as scheduler:\n", 290 | " async with Worker(scheduler.address) as worker:\n", 291 | " async with Client(scheduler.address, asynchronous=True) as client:\n", 292 | " print(await client.submit(lambda: 1 + 1))" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [ 299 | "Most of the time we never have to create these components ourselves and instead can rely on cluster manager objects to do this for us. But in some situations it can be useful to be able to construct a cluster yourself manually.\n", 300 | "\n", 301 | "You may also see a `Nanny` process being referenced from time to time. This is a wrapper for the worker that handles restarting the process if it is killed. When we run `dask-worker` via the CLI a nanny is automatically created for us." 302 | ] 303 | }, 304 | { 305 | "cell_type": "markdown", 306 | "metadata": {}, 307 | "source": [ 308 | "## Cluster networking\n", 309 | "\n", 310 | "By default Dask uses a custom TCP-based remote procedure call protocol to communicate between processes. The scheduler and workers all listen on TCP ports for communication.\n", 311 | "\n", 312 | "When you start a scheduler it typically listens on port `8786`. When a worker is created it listens on a random high port and communicates that port to the scheduler when it first connects.\n", 313 | "\n", 314 | "The scheduler maintains a list of all workers and their addresses, which the workers can also access, therefore both the scheduler and any of the workers can open connections to any other worker at any time. Connections are closed automatically when not in use.\n", 315 | "\n", 316 | "The `Client` will only ever connect to the scheduler and all communication to the workers will pass through it. This means that when deploying Dask clusters the scheduler and workers must typically be on the same network and able to access each other via IP and port directly. But the client can run anywhere, as long as it can access the scheduler communication port. It is common to configure firewall rules or load balancers to provide access to just the scheduler port.\n", 317 | "\n", 318 | "Dask also supports other network protocols such as [TLS](https://distributed.dask.org/en/stable/tls.html), [websockets](https://distributed.dask.org/en/stable/protocol.html) and [UCX](https://docs.rapids.ai/api/dask-cuda/nightly/examples/ucx.html)." 319 | ] 320 | }, 321 | { 322 | "cell_type": "markdown", 323 | "metadata": {}, 324 | "source": [ 325 | "### TLS/SSL for secure communication\n", 326 | "\n", 327 | "Dask cluster components can use certificates to mutually authenticate and communicate securely if run in an untrusted environment. You can either generate certificates for the scheduler, worker and client yourself and distribute those, or you can generate temporary credentials automatically. \n", 328 | "\n", 329 | "Some cluster managers such as `dask-cloudprovider` will automatically enable TLS and generate one-time certificates when exposing clusters to the internet from the public cloud." 
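The next cell demonstrates the temporary-credentials route. As a sketch of the other option (all of the certificate paths below are hypothetical placeholders), pre-generated certificates can be loaded into a `Security` object and passed to each component in the same way:

```python
from distributed.security import Security

# Sketch: build a Security object from certificate files you have generated and
# distributed yourself. The paths are placeholders for wherever your files live.
security = Security(
    tls_ca_file="certs/ca.pem",
    tls_client_cert="certs/client.pem",
    tls_client_key="certs/client-key.pem",
    tls_scheduler_cert="certs/scheduler.pem",
    tls_scheduler_key="certs/scheduler-key.pem",
    tls_worker_cert="certs/worker.pem",
    tls_worker_key="certs/worker-key.pem",
    require_encryption=True,
)
# This object is then passed as `security=...` to the Scheduler, Worker and Client,
# just like the temporary credentials in the next cell.
```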
330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": null, 335 | "metadata": {}, 336 | "outputs": [], 337 | "source": [ 338 | "from dask.distributed import Scheduler, Worker, Client\n", 339 | "from distributed.security import Security\n", 340 | "\n", 341 | "security = Security.temporary()\n", 342 | "\n", 343 | "async with Scheduler(security=security) as scheduler:\n", 344 | " async with Worker(scheduler.address, security=security) as worker:\n", 345 | " async with Client(\n", 346 | " scheduler.address, security=security, asynchronous=True\n", 347 | " ) as client:\n", 348 | " print(await client.submit(lambda: 1 + 1))" 349 | ] 350 | }, 351 | { 352 | "cell_type": "markdown", 353 | "metadata": {}, 354 | "source": [ 355 | "### Websockets\n", 356 | "\n", 357 | "Dask can also communicate via websockets instead of TCP. There is a very small performance overhead to doing this, but it means that the dashboard and communication happen on the same port and can be reverse proxied by a layer 7 proxy like nginx. This is necessary for some deployment scenarios where you cannot export ports but you can proxy web services." 358 | ] 359 | }, 360 | { 361 | "cell_type": "markdown", 362 | "metadata": {}, 363 | "source": [ 364 | "### UCX\n", 365 | "\n", 366 | "On systems with high-performance networking such as InfiniBand or NVLink, Dask can also leverage [UCX](https://openucx.org/) which provides a unified communication protocol that automatically upgrades communication to use the fastest hardware available. This is vital for good performance on HPC systems with InfiniBand or systems with multiple GPU workers." 367 | ] 368 | } 369 | ], 370 | "metadata": { 371 | "anaconda-cloud": {}, 372 | "interpreter": { 373 | "hash": "b7cfefc49ba37c014f46ba4939b08cd2bb5b25f27a21eb8109ce6441079573de" 374 | }, 375 | "kernelspec": { 376 | "display_name": "Python 3 (ipykernel)", 377 | "language": "python", 378 | "name": "python3" 379 | }, 380 | "language_info": { 381 | "codemirror_mode": { 382 | "name": "ipython", 383 | "version": 3 384 | }, 385 | "file_extension": ".py", 386 | "mimetype": "text/x-python", 387 | "name": "python", 388 | "nbconvert_exporter": "python", 389 | "pygments_lexer": "ipython3", 390 | "version": "3.10.5" 391 | } 392 | }, 393 | "nbformat": 4, 394 | "nbformat_minor": 4 395 | } 396 | -------------------------------------------------------------------------------- /03_dask.delayed.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "\"Dask\n", 11 | "\n", 12 | "# dask.delayed - parallelize any code\n", 13 | "\n", 14 | "What if you don't have an array or dataframe? Instead of having blocks where the function is applied to each block, you can decorate functions with `@delayed` and _have the functions themselves be lazy_. \n", 15 | "\n", 16 | "This is a simple way to use `dask` to parallelize existing codebases or build [complex systems](https://blog.dask.org/2018/02/09/credit-models-with-dask). 
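As a minimal first taste of what that looks like (a sketch only; the rest of this notebook builds the pattern up properly), decorating an ordinary function makes every call to it lazy until you explicitly compute:

```python
import dask


@dask.delayed
def square(x):
    return x**2


# Calling the decorated function builds tasks instead of running them...
lazy_results = [square(i) for i in range(4)]

# ...and compute() executes them, in parallel where possible.
dask.compute(*lazy_results)  # (0, 1, 4, 9)
```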
\n", 17 | "\n", 18 | "**Related Documentation**\n", 19 | "\n", 20 | "* [Delayed documentation](https://docs.dask.org/en/latest/delayed.html)\n", 21 | "* [Delayed screencast](https://www.youtube.com/watch?v=SHqFmynRxVU)\n", 22 | "* [Delayed API](https://docs.dask.org/en/latest/delayed-api.html)\n", 23 | "* [Delayed examples](https://examples.dask.org/delayed.html)\n", 24 | "* [Delayed best practices](https://docs.dask.org/en/latest/delayed-best-practices.html)" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "As we'll see in the [distributed scheduler notebook](05_distributed.ipynb), Dask has several ways of executing code in parallel. We'll use the distributed scheduler by creating a `dask.distributed.Client`. For now, this will provide us with some nice diagnostics. We'll talk about schedulers in depth later." 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "from dask.distributed import Client\n", 41 | "\n", 42 | "client = Client(n_workers=4)" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "## A Typical Workflow\n", 50 | "\n", 51 | "Typically if a workflow contains a for-loop it can benefit from delayed. The following example outlines a read-transform-write:\n", 52 | "\n", 53 | "```python\n", 54 | "import dask\n", 55 | " \n", 56 | "@dask.delayed\n", 57 | "def process_file(filename):\n", 58 | " data = read_a_file(filename)\n", 59 | " data = do_a_transformation(data)\n", 60 | " destination = f\"results/{filename}\"\n", 61 | " write_out_data(data, destination)\n", 62 | " return destination\n", 63 | "\n", 64 | "results = []\n", 65 | "for filename in filenames:\n", 66 | " results.append(process_file(filename))\n", 67 | " \n", 68 | "dask.compute(results)\n", 69 | "```" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "## Basics\n", 77 | "\n", 78 | "First let's make some toy functions, `inc` and `add`, that sleep for a while to simulate work. We'll then time running these functions normally.\n", 79 | "\n", 80 | "In the next section we'll parallelize this code." 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "from time import sleep\n", 90 | "\n", 91 | "\n", 92 | "def inc(x):\n", 93 | " sleep(1)\n", 94 | " return x + 1\n", 95 | "\n", 96 | "\n", 97 | "def add(x, y):\n", 98 | " sleep(1)\n", 99 | " return x + y" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "We time the execution of this normal code using the `%%time` magic, which is a special function of the Jupyter Notebook." 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "%%time\n", 116 | "# This takes three seconds to run because we call each\n", 117 | "# function sequentially, one after the other\n", 118 | "\n", 119 | "x = inc(1)\n", 120 | "y = inc(2)\n", 121 | "z = add(x, y)" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "### Parallelize with the `dask.delayed` decorator\n", 129 | "\n", 130 | "Those two increment calls *could* be called in parallel, because they are totally independent of one-another.\n", 131 | "\n", 132 | "We'll make the `inc` and `add` functions lazy using the `dask.delayed` decorator. 
When we call the delayed version by passing the arguments, exactly as before, the original function isn't actually called yet - which is why the cell execution finishes very quickly.\n", 133 | "Instead, a *delayed object* is made, which keeps track of the function to call and the arguments to pass to it.\n" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [ 142 | "import dask\n", 143 | "\n", 144 | "\n", 145 | "@dask.delayed\n", 146 | "def inc(x):\n", 147 | " sleep(1)\n", 148 | " return x + 1\n", 149 | "\n", 150 | "\n", 151 | "@dask.delayed\n", 152 | "def add(x, y):\n", 153 | " sleep(1)\n", 154 | " return x + y" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": {}, 161 | "outputs": [], 162 | "source": [ 163 | "%%time\n", 164 | "# This runs immediately, all it does is build a graph\n", 165 | "\n", 166 | "x = inc(1)\n", 167 | "y = inc(2)\n", 168 | "z = add(x, y)" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "This ran immediately, since nothing has really happened yet.\n", 176 | "\n", 177 | "To get the result, call `compute`. Notice that this runs faster than the original code." 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "metadata": {}, 184 | "outputs": [], 185 | "source": [ 186 | "%%time\n", 187 | "# This actually runs our computation using a local thread pool\n", 188 | "\n", 189 | "z.compute()" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "## What just happened?\n", 197 | "\n", 198 | "The `z` object is a lazy `Delayed` object. This object holds everything we need to compute the final result, including references to all of the functions that are required and their inputs and relationship to one-another. We can evaluate the result with `.compute()` as above or we can visualize the task graph for this value with `.visualize()`." 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "metadata": {}, 205 | "outputs": [], 206 | "source": [ 207 | "z" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "# Look at the task graph for `z`\n", 217 | "z.visualize()" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": {}, 223 | "source": [ 224 | "Notice that this includes the names of the functions from before, and the logical flow of the outputs of the `inc` functions to the inputs of `add`." 225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "metadata": {}, 230 | "source": [ 231 | "### Some questions to consider:\n", 232 | "\n", 233 | "- Why did we go from 3s to 2s? Why weren't we able to parallelize down to 1s?\n", 234 | "- What would have happened if the inc and add functions didn't include the `sleep(1)`? Would Dask still be able to speed up this code?\n", 235 | "- What if we have multiple outputs or also want to get access to x or y?" 236 | ] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": {}, 241 | "source": [ 242 | "## Exercise: Parallelize a for loop\n", 243 | "\n", 244 | "`for` loops are one of the most common things that we want to parallelize. 
Use `dask.delayed` on `inc` and `sum` to parallelize the computation below:" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": null, 250 | "metadata": {}, 251 | "outputs": [], 252 | "source": [ 253 | "data = [1, 2, 3, 4, 5, 6, 7, 8]" 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": null, 259 | "metadata": {}, 260 | "outputs": [], 261 | "source": [ 262 | "%%time\n", 263 | "# Sequential code\n", 264 | "\n", 265 | "\n", 266 | "def inc(x):\n", 267 | " sleep(1)\n", 268 | " return x + 1\n", 269 | "\n", 270 | "\n", 271 | "results = []\n", 272 | "for x in data:\n", 273 | " y = inc(x)\n", 274 | " results.append(y)\n", 275 | "\n", 276 | "total = sum(results)" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": null, 282 | "metadata": {}, 283 | "outputs": [], 284 | "source": [ 285 | "total" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": null, 291 | "metadata": {}, 292 | "outputs": [], 293 | "source": [ 294 | "%%time\n", 295 | "# Your parallel code here..." 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": null, 301 | "metadata": { 302 | "jupyter": { 303 | "source_hidden": true 304 | }, 305 | "tags": [] 306 | }, 307 | "outputs": [], 308 | "source": [ 309 | "@dask.delayed\n", 310 | "def inc(x):\n", 311 | " sleep(1)\n", 312 | " return x + 1\n", 313 | "\n", 314 | "\n", 315 | "results = []\n", 316 | "for x in data:\n", 317 | " y = inc(x)\n", 318 | " results.append(y)\n", 319 | "\n", 320 | "total = sum(results)\n", 321 | "print(\"Before computing:\", total) # Let's see what type of thing total is\n", 322 | "result = total.compute()\n", 323 | "print(\"After computing :\", result) # After it's computed" 324 | ] 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "metadata": {}, 329 | "source": [ 330 | "How do the graph visualizations compare with the given solution, compared to a version with the `sum` function used directly rather than wrapped with `delayed`? Can you explain the latter version? You might find the result of the following expression illuminating\n", 331 | "```python\n", 332 | "inc(1) + inc(2)\n", 333 | "```" 334 | ] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "metadata": {}, 339 | "source": [ 340 | "## Exercise: Parallelize a for-loop code with control flow\n", 341 | "\n", 342 | "Often we want to delay only *some* functions, running a few of them immediately. This is especially helpful when those functions are fast and help us to determine what other slower functions we should call. This decision, to delay or not to delay, is usually where we need to be thoughtful when using `dask.delayed`.\n", 343 | "\n", 344 | "In the example below we iterate through a list of inputs. If that input is even then we want to call `inc`. If the input is odd then we want to call `double`. This `is_even` decision to call `inc` or `double` has to be made immediately (not lazily) in order for our graph-building Python code to proceed." 
345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": null, 350 | "metadata": {}, 351 | "outputs": [], 352 | "source": [ 353 | "def double(x):\n", 354 | " sleep(1)\n", 355 | " return 2 * x\n", 356 | "\n", 357 | "\n", 358 | "def is_even(x):\n", 359 | " return not x % 2\n", 360 | "\n", 361 | "\n", 362 | "data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": null, 368 | "metadata": {}, 369 | "outputs": [], 370 | "source": [ 371 | "%%time\n", 372 | "# Sequential code\n", 373 | "\n", 374 | "results = []\n", 375 | "for x in data:\n", 376 | " if is_even(x):\n", 377 | " y = double(x)\n", 378 | " else:\n", 379 | " y = inc(x)\n", 380 | " results.append(y)\n", 381 | "\n", 382 | "total = sum(results)\n", 383 | "print(total)" 384 | ] 385 | }, 386 | { 387 | "cell_type": "code", 388 | "execution_count": null, 389 | "metadata": {}, 390 | "outputs": [], 391 | "source": [ 392 | "%%time\n", 393 | "# Your parallel code here...\n", 394 | "# TODO: parallelize the sequential code above using dask.delayed\n", 395 | "# You will need to delay some functions, but not all" 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": null, 401 | "metadata": { 402 | "jupyter": { 403 | "source_hidden": true 404 | }, 405 | "tags": [] 406 | }, 407 | "outputs": [], 408 | "source": [ 409 | "@dask.delayed\n", 410 | "def double(x):\n", 411 | " sleep(1)\n", 412 | " return 2 * x\n", 413 | "\n", 414 | "\n", 415 | "results = []\n", 416 | "for x in data:\n", 417 | " if is_even(x): # even\n", 418 | " y = double(x)\n", 419 | " else: # odd\n", 420 | " y = inc(x)\n", 421 | " results.append(y)\n", 422 | "\n", 423 | "total = sum(results)" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": null, 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [ 432 | "%time total.compute()" 433 | ] 434 | }, 435 | { 436 | "cell_type": "code", 437 | "execution_count": null, 438 | "metadata": {}, 439 | "outputs": [], 440 | "source": [ 441 | "total.visualize()" 442 | ] 443 | }, 444 | { 445 | "cell_type": "markdown", 446 | "metadata": {}, 447 | "source": [ 448 | "### Some questions to consider:\n", 449 | "\n", 450 | "- What are other examples of control flow where we can't use delayed?\n", 451 | "- What would have happened if we had delayed the evaluation of `is_even(x)` in the example above?\n", 452 | "- What are your thoughts on delaying `sum`? This function is both computational but also fast to run." 453 | ] 454 | }, 455 | { 456 | "cell_type": "markdown", 457 | "metadata": {}, 458 | "source": [ 459 | "## Exercise: Parallelize a Pandas Groupby Reduction\n", 460 | "\n", 461 | "In this exercise we read several CSV files and perform a groupby operation in parallel. We are given sequential code to do this and parallelize it with `dask.delayed`.\n", 462 | "\n", 463 | "The computation we will parallelize is to compute the mean departure delay per airport from some historical flight data. We will do this by using `dask.delayed` together with `pandas`. In a future section we will do this same exercise with `dask.dataframe`." 464 | ] 465 | }, 466 | { 467 | "cell_type": "markdown", 468 | "metadata": {}, 469 | "source": [ 470 | "## Create data\n", 471 | "\n", 472 | "Run this code to prep some data.\n", 473 | "\n", 474 | "This downloads and extracts some historical flight data for flights out of NYC between 1990 and 2000. The data is originally from [here](http://stat-computing.org/dataexpo/2009/the-data.html)." 
475 | ] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "execution_count": null, 480 | "metadata": {}, 481 | "outputs": [], 482 | "source": [ 483 | "%run prep.py -d flights" 484 | ] 485 | }, 486 | { 487 | "cell_type": "markdown", 488 | "metadata": {}, 489 | "source": [ 490 | "### Inspect data" 491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": null, 496 | "metadata": {}, 497 | "outputs": [], 498 | "source": [ 499 | "import os\n", 500 | "\n", 501 | "sorted(os.listdir(os.path.join(\"data\", \"nycflights\")))" 502 | ] 503 | }, 504 | { 505 | "cell_type": "markdown", 506 | "metadata": {}, 507 | "source": [ 508 | "### Read one file with `pandas.read_csv` and compute mean departure delay" 509 | ] 510 | }, 511 | { 512 | "cell_type": "code", 513 | "execution_count": null, 514 | "metadata": {}, 515 | "outputs": [], 516 | "source": [ 517 | "import pandas as pd\n", 518 | "\n", 519 | "df = pd.read_csv(os.path.join(\"data\", \"nycflights\", \"1990.csv\"))\n", 520 | "df.head()" 521 | ] 522 | }, 523 | { 524 | "cell_type": "code", 525 | "execution_count": null, 526 | "metadata": {}, 527 | "outputs": [], 528 | "source": [ 529 | "# What is the schema?\n", 530 | "df.dtypes" 531 | ] 532 | }, 533 | { 534 | "cell_type": "code", 535 | "execution_count": null, 536 | "metadata": {}, 537 | "outputs": [], 538 | "source": [ 539 | "# What originating airports are in the data?\n", 540 | "df.Origin.unique()" 541 | ] 542 | }, 543 | { 544 | "cell_type": "code", 545 | "execution_count": null, 546 | "metadata": {}, 547 | "outputs": [], 548 | "source": [ 549 | "# Mean departure delay per-airport for one year\n", 550 | "df.groupby(\"Origin\").DepDelay.mean()" 551 | ] 552 | }, 553 | { 554 | "cell_type": "markdown", 555 | "metadata": {}, 556 | "source": [ 557 | "### Sequential code: Mean Departure Delay Per Airport\n", 558 | "\n", 559 | "The above cell computes the mean departure delay per-airport for one year. Here we expand that to all years using a sequential for loop." 560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "execution_count": null, 565 | "metadata": {}, 566 | "outputs": [], 567 | "source": [ 568 | "from glob import glob\n", 569 | "\n", 570 | "filenames = sorted(glob(os.path.join(\"data\", \"nycflights\", \"*.csv\")))" 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": null, 576 | "metadata": {}, 577 | "outputs": [], 578 | "source": [ 579 | "%%time\n", 580 | "\n", 581 | "sums = []\n", 582 | "counts = []\n", 583 | "for fn in filenames:\n", 584 | " # Read in file\n", 585 | " df = pd.read_csv(fn)\n", 586 | "\n", 587 | " # Groupby origin airport\n", 588 | " by_origin = df.groupby(\"Origin\")\n", 589 | "\n", 590 | " # Sum of all departure delays by origin\n", 591 | " total = by_origin.DepDelay.sum()\n", 592 | "\n", 593 | " # Number of flights by origin\n", 594 | " count = by_origin.DepDelay.count()\n", 595 | "\n", 596 | " # Save the intermediates\n", 597 | " sums.append(total)\n", 598 | " counts.append(count)\n", 599 | "\n", 600 | "# Combine intermediates to get total mean-delay-per-origin\n", 601 | "total_delays = sum(sums)\n", 602 | "n_flights = sum(counts)\n", 603 | "mean = total_delays / n_flights" 604 | ] 605 | }, 606 | { 607 | "cell_type": "code", 608 | "execution_count": null, 609 | "metadata": {}, 610 | "outputs": [], 611 | "source": [ 612 | "mean" 613 | ] 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "metadata": {}, 618 | "source": [ 619 | "### Parallelize the code above\n", 620 | "\n", 621 | "Use `dask.delayed` to parallelize the code above. 
Some extra things you will need to know.\n", 622 | "\n", 623 | "1. Methods and attribute access on delayed objects work automatically, so if you have a delayed object you can perform normal arithmetic, slicing, and method calls on it and it will produce the correct delayed calls.\n", 624 | " \n", 625 | "2. Calling the `.compute()` method works well when you have a single output. When you have multiple outputs you might want to use the `dask.compute` function. This way Dask can share the intermediate values.\n", 626 | " \n", 627 | "So your goal is to parallelize the code above (which has been copied below) using `dask.delayed`. You may also want to visualize a bit of the computation to see if you're doing it correctly." 628 | ] 629 | }, 630 | { 631 | "cell_type": "code", 632 | "execution_count": null, 633 | "metadata": {}, 634 | "outputs": [], 635 | "source": [ 636 | "%%time\n", 637 | "# your code here" 638 | ] 639 | }, 640 | { 641 | "cell_type": "markdown", 642 | "metadata": {}, 643 | "source": [ 644 | "If you load the solution, add `%%time` to the top of the cell to measure the running time." 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": null, 650 | "metadata": { 651 | "jupyter": { 652 | "source_hidden": true 653 | }, 654 | "tags": [] 655 | }, 656 | "outputs": [], 657 | "source": [ 658 | "%%time\n", 659 | "\n", 660 | "# This is just one possible solution, there are\n", 661 | "# several ways to do this using `dask.delayed`\n", 662 | "\n", 663 | "\n", 664 | "@dask.delayed\n", 665 | "def read_file(filename):\n", 666 | " # Read in file\n", 667 | " return pd.read_csv(filename)\n", 668 | "\n", 669 | "\n", 670 | "sums = []\n", 671 | "counts = []\n", 672 | "for fn in filenames:\n", 673 | " # Delayed read in file\n", 674 | " df = read_file(fn)\n", 675 | "\n", 676 | " # Groupby origin airport\n", 677 | " by_origin = df.groupby(\"Origin\")\n", 678 | "\n", 679 | " # Sum of all departure delays by origin\n", 680 | " total = by_origin.DepDelay.sum()\n", 681 | "\n", 682 | " # Number of flights by origin\n", 683 | " count = by_origin.DepDelay.count()\n", 684 | "\n", 685 | " # Save the intermediates\n", 686 | " sums.append(total)\n", 687 | " counts.append(count)\n", 688 | "\n", 689 | "# Combine intermediates to get total mean-delay-per-origin\n", 690 | "total_delays = sum(sums)\n", 691 | "n_flights = sum(counts)\n", 692 | "mean, *_ = dask.compute(total_delays / n_flights)" 693 | ] 694 | }, 695 | { 696 | "cell_type": "code", 697 | "execution_count": null, 698 | "metadata": {}, 699 | "outputs": [], 700 | "source": [ 701 | "(sum(sums)).visualize()" 702 | ] 703 | }, 704 | { 705 | "cell_type": "code", 706 | "execution_count": null, 707 | "metadata": {}, 708 | "outputs": [], 709 | "source": [ 710 | "# ensure the results still match\n", 711 | "mean" 712 | ] 713 | }, 714 | { 715 | "cell_type": "markdown", 716 | "metadata": {}, 717 | "source": [ 718 | "### Some questions to consider:\n", 719 | "\n", 720 | "- How much speedup did you get? Is this how much speedup you'd expect?\n", 721 | "- Experiment with where to call `compute`. What happens when you call it on `sums` and `counts`? What happens if you wait and call it on `mean`?\n", 722 | "- Experiment with delaying the call to `sum`. What does the graph look like if `sum` is delayed? 
What does the graph look like if it isn't?\n", 723 | "- Can you think of any reason why you'd want to do the reduction one way over the other?\n", 724 | "\n", 725 | "### Learn More\n", 726 | "\n", 727 | "Visit the [Delayed documentation](https://docs.dask.org/en/latest/delayed.html). In particular, this [delayed screencast](https://www.youtube.com/watch?v=SHqFmynRxVU) will reinforce the concepts you learned here and the [delayed best practices](https://docs.dask.org/en/latest/delayed-best-practices.html) document collects advice on using `dask.delayed` well." 728 | ] 729 | }, 730 | { 731 | "cell_type": "markdown", 732 | "metadata": {}, 733 | "source": [ 734 | "## Close the Client\n", 735 | "\n", 736 | "Before moving on to the next exercise, make sure to close your client or stop this kernel." 737 | ] 738 | }, 739 | { 740 | "cell_type": "code", 741 | "execution_count": null, 742 | "metadata": {}, 743 | "outputs": [], 744 | "source": [ 745 | "client.close()" 746 | ] 747 | } 748 | ], 749 | "metadata": { 750 | "kernelspec": { 751 | "display_name": "Python 3 (ipykernel)", 752 | "language": "python", 753 | "name": "python3" 754 | }, 755 | "language_info": { 756 | "codemirror_mode": { 757 | "name": "ipython", 758 | "version": 3 759 | }, 760 | "file_extension": ".py", 761 | "mimetype": "text/x-python", 762 | "name": "python", 763 | "nbconvert_exporter": "python", 764 | "pygments_lexer": "ipython3", 765 | "version": "3.10.5" 766 | } 767 | }, 768 | "nbformat": 4, 769 | "nbformat_minor": 4 770 | } 771 | -------------------------------------------------------------------------------- /02_array.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "\"Dask\n", 11 | "\n", 12 | "# Dask Arrays - parallelized numpy\n", 13 | "Parallel, larger-than-memory, n-dimensional array using blocked algorithms. \n", 14 | "\n", 15 | "* **Parallel**: Uses all of the cores on your computer\n", 16 | "* **Larger-than-memory**: Lets you work on datasets that are larger than your available memory by breaking up your array into many small pieces, operating on those pieces in an order that minimizes the memory footprint of your computation, and effectively streaming data from disk.\n", 17 | "* **Blocked Algorithms**: Perform large computations by performing many smaller computations.\n" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "\n", 25 | "\n", 26 | "\n", 27 | "In other words, Dask Array implements a subset of the NumPy ndarray interface using blocked algorithms, cutting up the large array into many small arrays. This lets us compute on arrays larger than memory using all of our cores. 
We coordinate these blocked algorithms using Dask graphs.\n", 28 | "\n", 29 | "In this notebook, we'll build some understanding by implementing some blocked algorithms from scratch.\n", 30 | "We'll then use Dask Array to analyze large datasets, in parallel, using a familiar NumPy-like API.\n", 31 | "\n", 32 | "**Related Documentation**\n", 33 | "\n", 34 | "* [Array documentation](https://docs.dask.org/en/latest/array.html)\n", 35 | "* [Array screencast](https://youtu.be/9h_61hXCDuI)\n", 36 | "* [Array API](https://docs.dask.org/en/latest/array-api.html)\n", 37 | "* [Array examples](https://examples.dask.org/array.html)" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "## Create datasets" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "Create the datasets you will be using in this notebook:" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "%run prep.py -d random" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "## Start the Client" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 | "metadata": {}, 74 | "outputs": [], 75 | "source": [ 76 | "from dask.distributed import Client\n", 77 | "\n", 78 | "client = Client(n_workers=4)\n", 79 | "client" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "## Blocked Algorithms in a nutshell\n", 87 | "\n", 88 | "Let's do side by side the sum of the elements of an array using a NumPy array and a Dask array. " 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "import numpy as np\n", 98 | "import dask.array as da" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "# NumPy array\n", 108 | "a_np = np.ones(10)\n", 109 | "a_np" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | "source": [ 116 | "We know that we can use `sum()` to compute the sum of the elements of our array, but to show what a blocksized operation would look like, let's do:" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": null, 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "a_np_sum = a_np[:5].sum() + a_np[5:].sum()\n", 126 | "a_np_sum" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "Now notice that each sum in the computation above is completely independent so they could be done in parallel. \n", 134 | "To do this with Dask array, we need to define our \"slices\", we do this by defining the amount of elements we want per block using the variable `chunks`. " 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "a_da = da.ones(10, chunks=5)\n", 144 | "a_da" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "**Important!**\n", 152 | "\n", 153 | "Note here that to get two blocks, we specify `chunks=5`, in other words, we have 5 elements per block. 
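To connect this back to the manual NumPy split above (a sketch only; you would normally just call `sum()` as in the next cell), the same two-part sum can be written explicitly against the Dask array's blocks:

```python
# Sum each of the two blocks separately, then add the partial results,
# mirroring a_np[:5].sum() + a_np[5:].sum() from the NumPy example.
manual_sum = a_da.blocks[0].sum() + a_da.blocks[1].sum()
manual_sum.compute()
```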
" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [ 162 | "a_da_sum = a_da.sum()\n", 163 | "a_da_sum" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "## Task Graphs\n", 171 | "\n", 172 | "In general, the code that humans write rely on compilers or interpreters so the computers can understand what we wrote. When we move to parallel execution there is a desire to shift responsibility from the compilers to the human, as they often bring the analysis, optimization, and execution of code into the code itself. In these cases, we often represent the structure of our program explicitly as data within the program itself.\n", 173 | "\n", 174 | "In Dask we use task scheduling, where we break our program into into many medium-sized tasks or units of computation.We represent these tasks as nodes in a graph with edges between nodes if one task depends on data produced by another. We call upon a task scheduler to execute this graph in a way that respects these data dependencies and leverages parallelism where possible, so multiple independent tasks can be run simultaneously." 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": null, 180 | "metadata": {}, 181 | "outputs": [], 182 | "source": [ 183 | "# visualize the low level Dask graph using cytoscape\n", 184 | "a_da_sum.visualize(engine=\"cytoscape\")" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": null, 190 | "metadata": {}, 191 | "outputs": [], 192 | "source": [ 193 | "a_da_sum.compute()" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "Performance comparison\n", 201 | "------------------------------\n", 202 | "\n", 203 | "Let's try a more interesting example. We will create a 20_000 x 20_000 array with normally distributed values, and take the mean along one of its axis.\n", 204 | "\n", 205 | "**Note:**\n", 206 | "\n", 207 | "If you are running on Binder, the Numpy example might need to be a smaller one due to memory issues. 
" 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "metadata": {}, 213 | "source": [ 214 | "### Numpy version " 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": null, 220 | "metadata": {}, 221 | "outputs": [], 222 | "source": [ 223 | "%%time\n", 224 | "xn = np.random.normal(10, 0.1, size=(30_000, 30_000))\n", 225 | "yn = xn.mean(axis=0)\n", 226 | "yn" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "### Dask array version" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "metadata": {}, 240 | "outputs": [], 241 | "source": [ 242 | "xd = da.random.normal(10, 0.1, size=(30_000, 30_000), chunks=(3000, 3000))\n", 243 | "xd" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": null, 249 | "metadata": {}, 250 | "outputs": [], 251 | "source": [ 252 | "xd.nbytes / 1e9 # Gigabytes of the input processed lazily" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": null, 258 | "metadata": {}, 259 | "outputs": [], 260 | "source": [ 261 | "yd = xd.mean(axis=0)\n", 262 | "yd" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": null, 268 | "metadata": {}, 269 | "outputs": [], 270 | "source": [ 271 | "%%time\n", 272 | "xd = da.random.normal(10, 0.1, size=(30_000, 30_000), chunks=(3000, 3000))\n", 273 | "yd = xd.mean(axis=0)\n", 274 | "yd.compute()" 275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "**Questions to think about:**\n", 282 | "\n", 283 | "* What happens if the Dask chunks=(10000,10000)?\n", 284 | "* What happens if the Dask chunks=(30,30)?" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "**Exercise:** \n", 292 | "\n", 293 | "For Dask arrays, compute the mean along `axis=1` of the sum of the x array and its transpose. " 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": null, 299 | "metadata": {}, 300 | "outputs": [], 301 | "source": [ 302 | "# Your code here" 303 | ] 304 | }, 305 | { 306 | "cell_type": "markdown", 307 | "metadata": {}, 308 | "source": [ 309 | "**Solution**" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": null, 315 | "metadata": { 316 | "jupyter": { 317 | "source_hidden": true 318 | }, 319 | "tags": [] 320 | }, 321 | "outputs": [], 322 | "source": [ 323 | "x_sum = xd + xd.T\n", 324 | "res = x_sum.mean(axis=1)\n", 325 | "res.compute()" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "metadata": {}, 331 | "source": [ 332 | "## Choosing good chunk sizes\n", 333 | "This section was inspired on a Dask blog by Genevieve Buckley you can read it [here](https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes)\n", 334 | "\n", 335 | "A common problem when getting started with Dask array is determine what is a good chunk size. But what is a good size, and how do we determine this? \n", 336 | "\n", 337 | "\n", 338 | "### Get to know the chunks \n", 339 | "\n", 340 | "We can think of Dask arrays as a big structure composed by chunks of a smaller size, where these chunks are typically an a single `numpy` array, and they are all arranged to form a larger Dask array. \n", 341 | "\n", 342 | "If you have a Dask array and want to know more information about chunks and their size, you can use the `chunksize` and `chunks` attributes to access this information. 
If you are in a jupyter notebook\n", 343 | "you can also visualize the Dask array via its HTML representation. " 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": null, 349 | "metadata": {}, 350 | "outputs": [], 351 | "source": [ 352 | "darr = da.random.random((1000, 1000, 1000))\n", 353 | "darr" 354 | ] 355 | }, 356 | { 357 | "cell_type": "markdown", 358 | "metadata": {}, 359 | "source": [ 360 | "Notice that when we created the Dask array, we did not specify the `chunks`. Dask has set by default `chunks='auto'` which accommodates ideal chunk sizes. To learn more on how auto-chunking works you can go to this documentation https://docs.dask.org/en/stable/array-chunks.html#automatic-chunking\n", 361 | "\n", 362 | "`darr.chunksize` shows the largest chunk size. If you expect your array to have uniform chunk sizes this is a a good summary of the chunk size information. But if your array have irregular chunks, `darr.chunks` will show you the explicit sizes of all the chunks along all the dimensions of your dask array." 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": null, 368 | "metadata": {}, 369 | "outputs": [], 370 | "source": [ 371 | "darr.chunksize" 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": null, 377 | "metadata": {}, 378 | "outputs": [], 379 | "source": [ 380 | "darr.chunks" 381 | ] 382 | }, 383 | { 384 | "cell_type": "markdown", 385 | "metadata": {}, 386 | "source": [ 387 | "Let's modify our example to see explore chunking a bit more. We can rechunk our array:" 388 | ] 389 | }, 390 | { 391 | "cell_type": "code", 392 | "execution_count": null, 393 | "metadata": {}, 394 | "outputs": [], 395 | "source": [ 396 | "darr = darr.rechunk({0: -1, 1: 100, 2: \"auto\"})" 397 | ] 398 | }, 399 | { 400 | "cell_type": "code", 401 | "execution_count": null, 402 | "metadata": {}, 403 | "outputs": [], 404 | "source": [ 405 | "darr" 406 | ] 407 | }, 408 | { 409 | "cell_type": "code", 410 | "execution_count": null, 411 | "metadata": {}, 412 | "outputs": [], 413 | "source": [ 414 | "darr.chunksize" 415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "execution_count": null, 420 | "metadata": {}, 421 | "outputs": [], 422 | "source": [ 423 | "darr.chunks" 424 | ] 425 | }, 426 | { 427 | "cell_type": "markdown", 428 | "metadata": {}, 429 | "source": [ 430 | "**Exercise:**\n", 431 | "\n", 432 | "- What does -1 do when specify as the chunk on a certain axis?" 433 | ] 434 | }, 435 | { 436 | "cell_type": "markdown", 437 | "metadata": {}, 438 | "source": [ 439 | "### Too small is a problem\n", 440 | "\n", 441 | "If your chunks are too small, the amount of actual work done by every task is very tiny, and the overhead of coordinating all these tasks results in a very inefficient process. \n", 442 | "\n", 443 | "In general, the dask scheduler takes approximately one millisecond to coordinate a single task. That means we want the computation time to be comparatively large, i.e in the order of seconds. \n", 444 | "\n", 445 | "Intuitive analogy by Genevieve Buckley:\n", 446 | "\n", 447 | "> Lets imagine we are building a house. It is a pretty big job, and if there were only one worker it would take much too long to build. So we have a team of workers and a site foreman. The site foreman is equivalent to the Dask scheduler: their job is to tell the workers what tasks they need to do. \n", 448 | "Say we have a big pile of bricks to build a wall, sitting in the corner of the building site. 
If the foreman (the Dask scheduler) tells workers to go and fetch a single brick at a time, then bring each one to where the wall is being built, you can see how this is going to be very slow and inefficient! The workers are spending most of their time moving between the wall and the pile of bricks. Much less time is going towards doing the actual work of mortaring bricks onto the wall. \n", 449 | "Instead, we can do this in a smarter way. The foreman (Dask scheduler) can tell the workers to go and bring one full wheelbarrow load of bricks back each time. Now workers are spending much less time moving between the wall and the pile of bricks, and the wall will be finished much quicker. \n", 450 | "\n", 451 | "### Too big is a problem\n", 452 | "\n", 453 | "If your chunks are too big, this is also a problem because you will likely run out of memory. You will start seeing in the dashboard that data is being spilled to disk, which leads to degraded performance. \n", 454 | "\n", 455 | "If we load too much data into memory, Dask workers will start to spill data to disk to avoid crashing. Spilling data to disk slows things down significantly, because of all the extra read and write operations to disk. This is definitely a situation that we want to avoid. \n", 456 | "\n", 457 | "To watch out for this, look at the worker memory plot on the Dask dashboard: orange bars are a warning that you are close to the limit, and gray means data is being spilled to disk - not good! To learn more about the memory plot, check the [dashboard documentation](https://docs.dask.org/en/stable/dashboard.html#bytes-stored-and-bytes-per-worker).\n", 458 | "\n", 459 | "\n", 460 | "### Rules of thumb\n", 461 | "\n", 462 | "- Users have reported that chunk sizes smaller than 1MB tend to be bad. In general, a chunk size between **100MB and 1GB is good**, while going over 1 or 2GB means you have a really big dataset and/or a lot of memory available per worker (see the short sketch after this list).\n", 463 | "- Upper bound: Avoid very large task graphs. More than 10,000 or 100,000 chunks may start to perform poorly.\n", 464 | "- Lower bound: To get the advantage of parallelization, you need the number of chunks to at least equal the number of worker cores available (or better, the number of worker cores times 2). Otherwise, some workers will stay idle.\n", 465 | "- The time taken to compute each task should be much larger than the time needed to schedule the task. The Dask scheduler takes roughly 1 millisecond to coordinate a single task, so a good task computation time would be on the order of seconds (not milliseconds).\n", 466 | "- Chunks should be aligned with array storage on disk. Modern NDArray storage formats (HDF5, NetCDF, TIFF, Zarr) allow arrays to be stored in chunks so that the blocks of data can be pulled efficiently. However, data stores often chunk more finely than is ideal for Dask arrays, so it is common to choose a chunking that is a multiple of your storage chunk size; otherwise you might incur high overhead. For example, if you are loading data that is chunked in blocks of (100, 100), you might choose a chunking strategy more like (1000, 2000) that is larger but still divisible by (100, 100). 
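To make the chunk-size arithmetic concrete, here is a minimal sketch; the ~100 MB target, the array shape, and the (100, 100) storage block are all made up for illustration and are not taken from the tutorial data:

```python
import dask.array as da

# Aim for roughly 100 MB chunks of float64 data (8 bytes per element):
# 3500 * 3500 * 8 bytes ~= 98 MB per chunk, and 3500 is also a multiple
# of a hypothetical (100, 100) storage block size.
x = da.random.random((50_000, 50_000), chunks=(3_500, 3_500))

print(x.chunksize)                                   # (3500, 3500)
print(x.npartitions)                                 # total number of chunks (225 here)
print(x.dtype.itemsize * 3_500 * 3_500 / 1e6, "MB per chunk")
```

A couple of hundred chunks of ~100 MB each sits comfortably between the lower bound (more chunks than worker cores) and the upper bound (far fewer than 10,000 tasks).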
\n", 467 | "\n", 468 | "For more more advice on chunking see https://docs.dask.org/en/stable/array-chunks.html" 469 | ] 470 | }, 471 | { 472 | "cell_type": "markdown", 473 | "metadata": {}, 474 | "source": [ 475 | "## Example of chunked data with Zarr\n", 476 | "\n", 477 | "Zarr is a format for the storage of chunked, compressed, N-dimensional arrays. Zarr provides classes and functions for working with N-dimensional arrays that behave like NumPy arrays (Dask array behave like Numpy arrays) but whose data is divided into chunks and each chunk is compressed. If you are already familiar with HDF5 then Zarr arrays provide similar functionality, but with some additional flexibility.\n", 478 | "\n", 479 | "For extra material check the [Zarr tutorial](https://zarr.readthedocs.io/en/stable/tutorial.html)\n", 480 | "\n", 481 | "**Let's read an array from zarr:**" 482 | ] 483 | }, 484 | { 485 | "cell_type": "code", 486 | "execution_count": null, 487 | "metadata": {}, 488 | "outputs": [], 489 | "source": [ 490 | "import zarr" 491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": null, 496 | "metadata": {}, 497 | "outputs": [], 498 | "source": [ 499 | "a = da.from_zarr(\"data/random.zarr\")" 500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": null, 505 | "metadata": {}, 506 | "outputs": [], 507 | "source": [ 508 | "a" 509 | ] 510 | }, 511 | { 512 | "cell_type": "markdown", 513 | "metadata": {}, 514 | "source": [ 515 | "Notice that the array is already chunked, and we didn't specify anything when loading it. Now notice that the chunks have a nice chunk size, let's compute the mean and see how long it takes to run" 516 | ] 517 | }, 518 | { 519 | "cell_type": "code", 520 | "execution_count": null, 521 | "metadata": {}, 522 | "outputs": [], 523 | "source": [ 524 | "%%time\n", 525 | "a.mean().compute()" 526 | ] 527 | }, 528 | { 529 | "cell_type": "markdown", 530 | "metadata": {}, 531 | "source": [ 532 | "Let's load a separate example where the `chunksize` is much smaller, and see what happen" 533 | ] 534 | }, 535 | { 536 | "cell_type": "code", 537 | "execution_count": null, 538 | "metadata": {}, 539 | "outputs": [], 540 | "source": [ 541 | "b = da.from_zarr(\"data/random_sc.zarr\")\n", 542 | "b" 543 | ] 544 | }, 545 | { 546 | "cell_type": "code", 547 | "execution_count": null, 548 | "metadata": {}, 549 | "outputs": [], 550 | "source": [ 551 | "%%time\n", 552 | "b.mean().compute()" 553 | ] 554 | }, 555 | { 556 | "cell_type": "markdown", 557 | "metadata": {}, 558 | "source": [ 559 | "### Exercise:\n", 560 | "\n", 561 | "Provide a `chunksize` when reading `b` that will improve the time of computation of the mean. Try multiple `chunks` values and see what happens." 562 | ] 563 | }, 564 | { 565 | "cell_type": "code", 566 | "execution_count": null, 567 | "metadata": {}, 568 | "outputs": [], 569 | "source": [ 570 | "# Your code here" 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": null, 576 | "metadata": { 577 | "jupyter": { 578 | "source_hidden": true 579 | }, 580 | "tags": [] 581 | }, 582 | "outputs": [], 583 | "source": [ 584 | "# 1 possible Solution (imitate original). 
chunks will vary if you are in binder\n", 585 | "c = da.from_zarr(\"data/random_sc.zarr\", chunks=(6250000,))\n", 586 | "c" 587 | ] 588 | }, 589 | { 590 | "cell_type": "code", 591 | "execution_count": null, 592 | "metadata": {}, 593 | "outputs": [], 594 | "source": [ 595 | "%%time\n", 596 | "c.mean().compute()" 597 | ] 598 | }, 599 | { 600 | "cell_type": "markdown", 601 | "metadata": {}, 602 | "source": [ 603 | "## Xarray \n", 604 | "\n", 605 | "In some applications we have multidimensional data, and sometimes working with all this dimensions can be confusing. Xarray is an open source project and Python package that makes working with labeled multi-dimensional arrays easier. \n", 606 | "\n", 607 | "Xarray is inspired by and borrows heavily from pandas, the popular data analysis package focused on labeled tabular data. It is particularly tailored to working with netCDF files, which were the source of xarray’s data model, and integrates tightly with Dask for parallel computing.\n", 608 | "\n", 609 | "Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like arrays, which allows for a more intuitive, more concise, and less error-prone developer experience. \n", 610 | "\n", 611 | "Let's learn how to use xarray and Dask together:\n" 612 | ] 613 | }, 614 | { 615 | "cell_type": "code", 616 | "execution_count": null, 617 | "metadata": {}, 618 | "outputs": [], 619 | "source": [ 620 | "import xarray as xr" 621 | ] 622 | }, 623 | { 624 | "cell_type": "code", 625 | "execution_count": null, 626 | "metadata": {}, 627 | "outputs": [], 628 | "source": [ 629 | "ds = xr.tutorial.open_dataset(\n", 630 | " \"air_temperature\",\n", 631 | " chunks={ # this tells xarray to open the dataset as a dask array\n", 632 | " \"lat\": 25,\n", 633 | " \"lon\": 25,\n", 634 | " \"time\": -1,\n", 635 | " },\n", 636 | ")\n", 637 | "ds" 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": null, 643 | "metadata": {}, 644 | "outputs": [], 645 | "source": [ 646 | "ds.air" 647 | ] 648 | }, 649 | { 650 | "cell_type": "code", 651 | "execution_count": null, 652 | "metadata": {}, 653 | "outputs": [], 654 | "source": [ 655 | "ds.air.chunks" 656 | ] 657 | }, 658 | { 659 | "cell_type": "code", 660 | "execution_count": null, 661 | "metadata": {}, 662 | "outputs": [], 663 | "source": [ 664 | "mean = ds.air.mean(\"time\") # no activity on dashboard\n", 665 | "mean # contains a dask array" 666 | ] 667 | }, 668 | { 669 | "cell_type": "code", 670 | "execution_count": null, 671 | "metadata": {}, 672 | "outputs": [], 673 | "source": [ 674 | "# we will see dashboard activity\n", 675 | "mean.load()" 676 | ] 677 | }, 678 | { 679 | "cell_type": "markdown", 680 | "metadata": {}, 681 | "source": [ 682 | "### Standard Xarray Operations\n", 683 | "\n", 684 | "Let's grab the air variable and do some operations. Operations using xarray objects are identical, regardless if the underlying data is stored as a Dask array or a NumPy array." 
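As a minimal sketch before grabbing the variable (reusing the `ds` opened above), you can confirm that the data really is backed by a Dask array and that operations stay lazy until you ask for the result:

```python
import dask.array

# The DataArray wraps a dask array rather than an in-memory NumPy array
print(type(ds.air.data))                            # dask.array.core.Array
print(isinstance(ds.air.data, dask.array.Array))    # True

# Arithmetic stays lazy: this builds a task graph but moves no data yet
anomaly = ds.air - ds.air.mean("time")
print(type(anomaly.data))                           # still a dask array

# Only .compute() / .load() turn the result into NumPy-backed data
```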
685 | ] 686 | }, 687 | { 688 | "cell_type": "code", 689 | "execution_count": null, 690 | "metadata": {}, 691 | "outputs": [], 692 | "source": [ 693 | "dair = ds.air" 694 | ] 695 | }, 696 | { 697 | "cell_type": "code", 698 | "execution_count": null, 699 | "metadata": {}, 700 | "outputs": [], 701 | "source": [ 702 | "dair2 = dair.groupby(\"time.month\").mean(\"time\")\n", 703 | "dair_new = dair - dair2\n", 704 | "dair_new" 705 | ] 706 | }, 707 | { 708 | "cell_type": "markdown", 709 | "metadata": {}, 710 | "source": [ 711 | "Call `.compute()` or `.load()` when you want your result as a `xarray.DataArray` with data stored as NumPy arrays." 712 | ] 713 | }, 714 | { 715 | "cell_type": "code", 716 | "execution_count": null, 717 | "metadata": {}, 718 | "outputs": [], 719 | "source": [ 720 | "# things happen in the dashboard\n", 721 | "dair_new.load()" 722 | ] 723 | }, 724 | { 725 | "cell_type": "markdown", 726 | "metadata": {}, 727 | "source": [ 728 | "### Time Series Operations with xarray\n", 729 | "\n", 730 | "Because we have a datetime index time-series operations work efficiently, for example we can do a resample and then plot the result." 731 | ] 732 | }, 733 | { 734 | "cell_type": "code", 735 | "execution_count": null, 736 | "metadata": {}, 737 | "outputs": [], 738 | "source": [ 739 | "dair_resample = dair.resample(time=\"1w\").mean(\"time\").std(\"time\")" 740 | ] 741 | }, 742 | { 743 | "cell_type": "code", 744 | "execution_count": null, 745 | "metadata": {}, 746 | "outputs": [], 747 | "source": [ 748 | "dair_resample.load().plot(figsize=(12, 8))" 749 | ] 750 | }, 751 | { 752 | "cell_type": "markdown", 753 | "metadata": {}, 754 | "source": [ 755 | "### Learn More \n", 756 | "\n", 757 | "Both xarray and zarr have their own tutorials that go into greater depth:\n", 758 | "\n", 759 | "* [Zarr tutorial](https://zarr.readthedocs.io/en/stable/tutorial.html)\n", 760 | "* [Xarray tutorial](https://tutorial.xarray.dev/intro.html)" 761 | ] 762 | }, 763 | { 764 | "cell_type": "markdown", 765 | "metadata": {}, 766 | "source": [ 767 | "## Close your cluster\n", 768 | "\n", 769 | "It's good practice to close any Dask cluster you create:" 770 | ] 771 | }, 772 | { 773 | "cell_type": "code", 774 | "execution_count": null, 775 | "metadata": {}, 776 | "outputs": [], 777 | "source": [ 778 | "client.shutdown()" 779 | ] 780 | } 781 | ], 782 | "metadata": { 783 | "anaconda-cloud": {}, 784 | "kernelspec": { 785 | "display_name": "Python 3 (ipykernel)", 786 | "language": "python", 787 | "name": "python3" 788 | }, 789 | "language_info": { 790 | "codemirror_mode": { 791 | "name": "ipython", 792 | "version": 3 793 | }, 794 | "file_extension": ".py", 795 | "mimetype": "text/x-python", 796 | "name": "python", 797 | "nbconvert_exporter": "python", 798 | "pygments_lexer": "ipython3", 799 | "version": "3.10.5" 800 | } 801 | }, 802 | "nbformat": 4, 803 | "nbformat_minor": 4 804 | } 805 | -------------------------------------------------------------------------------- /01_dataframe.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "tags": [] 7 | }, 8 | "source": [ 9 | "\"Dask\n", 13 | "\n", 14 | "\n", 15 | "# Dask DataFrame - parallelized pandas\n", 16 | "\n", 17 | "Looks and feels like the pandas API, but for parallel and distributed workflows. 
\n", 18 | "\n", 19 | "At its core, the `dask.dataframe` module implements a \"blocked parallel\" `DataFrame` object that looks and feels like the pandas API, but for parallel and distributed workflows. One Dask `DataFrame` is comprised of many in-memory pandas `DataFrame`s separated along the index. One operation on a Dask `DataFrame` triggers many pandas operations on the constituent pandas `DataFrame`s in a way that is mindful of potential parallelism and memory constraints.\n" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": { 25 | "tags": [] 26 | }, 27 | "source": [ 28 | "\"Dask\n", 32 | "\n", 33 | "**Related Documentation**\n", 34 | "\n", 35 | "* [DataFrame documentation](https://docs.dask.org/en/latest/dataframe.html)\n", 36 | "* [DataFrame screencast](https://youtu.be/AT2XtFehFSQ)\n", 37 | "* [DataFrame API](https://docs.dask.org/en/latest/dataframe-api.html)\n", 38 | "* [DataFrame examples](https://examples.dask.org/dataframe.html)\n", 39 | "* [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)\n", 40 | "\n", 41 | "## When to use `dask.dataframe`\n", 42 | "\n", 43 | "pandas is great for tabular datasets that fit in memory. A general rule of thumb for pandas is:\n", 44 | "\n", 45 | "> \"Have 5 to 10 times as much RAM as the size of your dataset\"\n", 46 | ">\n", 47 | "> ~ Wes McKinney (2017) in [10 things I hate about pandas](https://wesmckinney.com/blog/apache-arrow-pandas-internals/)\n", 48 | "\n", 49 | "Here \"size of dataset\" means dataset size on _the disk_.\n", 50 | "\n", 51 | "Dask becomes useful when the datasets exceed the above rule.\n", 52 | "\n", 53 | "In this notebook, you will be working with the New York City Airline data. This dataset is only ~200MB, so that you can download it in a reasonable time, but `dask.dataframe` will scale to datasets **much** larger than memory.\n", 54 | "\n" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "## Create datasets" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "Create the datasets you will be using in this notebook:" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "%run prep.py -d flights" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "## Set up your local cluster" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "Create a local Dask cluster and connect it to the client. Don't worry about this bit of code for now, you will learn more in the Distributed notebook." 
92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "from dask.distributed import Client\n", 101 | "\n", 102 | "client = Client(n_workers=4)\n", 103 | "client" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "### Dask Diagnostic Dashboard\n", 111 | "\n", 112 | "Dask Distributed provides a useful Dashboard to visualize the state of your cluster and computations.\n", 113 | "\n", 114 | "If you're on **JupyterLab or Binder**, you can use the [Dask JupyterLab extension](https://github.com/dask/dask-labextension) (which should be already installed in your environment) to open the dashboard plots:\n", 115 | "* Click on the Dask logo in the left sidebar\n", 116 | "* Click on the magnifying glass icon, which will automatically connect to the active dashboard (if that doesn't work, you can type/paste the dashboard link http://127.0.0.1:8787 in the field)\n", 117 | "* Click on **\"Task Stream\"**, **\"Progress Bar\"**, and **\"Worker Memory\"**, which will open these plots in new tabs\n", 118 | "* Re-organize the tabs to suit your workflow!\n", 119 | "\n", 120 | "Alternatively, click on the dashboard link displayed in the Client details above: http://127.0.0.1:8787/status. It will open a new browser tab with the Dashboard." 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": { 126 | "tags": [] 127 | }, 128 | "source": [ 129 | "## Reading and working with datasets\n", 130 | "\n", 131 | "Let's read an extract of flights in the USA across several years. This data is specific to flights out of the three airports in the New York City area." 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": null, 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "import os\n", 141 | "import dask" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "By convention, we import the module `dask.dataframe` as `dd`, and call the corresponding `DataFrame` object `ddf`.\n", 149 | "\n", 150 | "**Note**: The term \"Dask DataFrame\" is slightly overloaded. Depending on the context, it can refer to the module or the DataFrame object. To avoid confusion, throughout this notebook:\n", 151 | "- `dask.dataframe` (note the all lowercase) refers to the API, and\n", 152 | "- `DataFrame` (note the CamelCase) refers to the object.\n", 153 | "\n", 154 | "The following filename includes a glob pattern `*`, so all files in the path matching that pattern will be read into the same `DataFrame`." 
155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": {}, 161 | "outputs": [], 162 | "source": [ 163 | "import dask.dataframe as dd\n", 164 | "\n", 165 | "ddf = dd.read_csv(\n", 166 | " os.path.join(\"data\", \"nycflights\", \"*.csv\"), parse_dates={\"Date\": [0, 1, 2]}\n", 167 | ")\n", 168 | "ddf" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "Dask has not loaded the data yet, it has:\n", 176 | "- investigated the input path and found that there are ten matching files\n", 177 | "- intelligently created a set of jobs for each chunk -- one per original CSV file in this case" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | "Notice that the representation of the `DataFrame` object contains no data - Dask has just done enough to read the start of the first file, and infer the column names and dtypes." 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "### Lazy Evaluation\n", 192 | "\n", 193 | "Most Dask Collections, including Dask `DataFrame` are evaluated lazily, which means Dask constructs the logic (called task graph) of your computation immediately but \"evaluates\" them only when necessary. You can view this task graph using `.visualize()`.\n", 194 | "\n", 195 | "You will learn more about this in the Delayed notebook, but for now, note that we need to call `.compute()` to trigger actual computations." 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": null, 201 | "metadata": {}, 202 | "outputs": [], 203 | "source": [ 204 | "ddf.visualize()" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "Some functions like `len` and `head` also trigger a computation. 
Specifically, calling `len` will:\n", 212 | "- load actual data, (that is, load each file into a pandas DataFrame)\n", 213 | "- then apply the corresponding functions to each pandas DataFrame (also known as a partition)\n", 214 | "- combine the subtotals to give you the final grand total" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": null, 220 | "metadata": {}, 221 | "outputs": [], 222 | "source": [ 223 | "# load and count number of rows\n", 224 | "len(ddf)" 225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "metadata": {}, 230 | "source": [ 231 | "You can view the start and end of the data as you would in pandas:" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": null, 237 | "metadata": {}, 238 | "outputs": [], 239 | "source": [ 240 | "ddf.head()" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": { 246 | "tags": [ 247 | "raises-exception" 248 | ] 249 | }, 250 | "source": [ 251 | "```python\n", 252 | "ddf.tail()\n", 253 | "\n", 254 | "# ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.\n", 255 | "\n", 256 | "# +----------------+---------+----------+\n", 257 | "# | Column | Found | Expected |\n", 258 | "# +----------------+---------+----------+\n", 259 | "# | CRSElapsedTime | float64 | int64 |\n", 260 | "# | TailNum | object | float64 |\n", 261 | "# +----------------+---------+----------+\n", 262 | "\n", 263 | "# The following columns also raised exceptions on conversion:\n", 264 | "\n", 265 | "# - TailNum\n", 266 | "# ValueError(\"could not convert string to float: 'N54711'\")\n", 267 | "\n", 268 | "# Usually this is due to dask's dtype inference failing, and\n", 269 | "# *may* be fixed by specifying dtypes manually by adding:\n", 270 | "\n", 271 | "# dtype={'CRSElapsedTime': 'float64',\n", 272 | "# 'TailNum': 'object'}\n", 273 | "\n", 274 | "# to the call to `read_csv`/`read_table`.\n", 275 | "\n", 276 | "```" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "Unlike `pandas.read_csv` which reads in the entire file before inferring datatypes, `dask.dataframe.read_csv` only reads in a sample from the beginning of the file (or first file if using a glob). These inferred datatypes are then enforced when reading all partitions.\n", 284 | "\n", 285 | "In this case, the datatypes inferred in the sample are incorrect. The first `n` rows have no value for `CRSElapsedTime` (which pandas infers as a `float`), and later on turn out to be strings (`object` dtype). Note that Dask gives an informative error message about the mismatch. When this happens you have a few options:\n", 286 | "\n", 287 | "- Specify dtypes directly using the `dtype` keyword. This is the recommended solution, as it's the least error prone (better to be explicit than implicit) and also the most performant.\n", 288 | "- Increase the size of the `sample` keyword (in bytes)\n", 289 | "- Use `assume_missing` to make `dask` assume that columns inferred to be `int` (which don't allow missing values) are actually `floats` (which do allow missing values). In our particular case this doesn't apply.\n", 290 | "\n", 291 | "In our case we'll use the first option and directly specify the `dtypes` of the offending columns. 
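For reference, the other two options would look roughly like this. This is only a sketch of the API: the `sample` value is arbitrary, and as noted above neither option is actually used (or needed) for this dataset in the rest of the notebook.

```python
# Option 2: let Dask read more bytes before inferring dtypes
ddf = dd.read_csv(
    os.path.join("data", "nycflights", "*.csv"),
    parse_dates={"Date": [0, 1, 2]},
    sample=10_000_000,  # number of bytes used for dtype inference
)

# Option 3: treat columns inferred as int as nullable floats instead
ddf = dd.read_csv(
    os.path.join("data", "nycflights", "*.csv"),
    parse_dates={"Date": [0, 1, 2]},
    assume_missing=True,
)
```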
" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": null, 297 | "metadata": {}, 298 | "outputs": [], 299 | "source": [ 300 | "ddf = dd.read_csv(\n", 301 | " os.path.join(\"data\", \"nycflights\", \"*.csv\"),\n", 302 | " parse_dates={\"Date\": [0, 1, 2]},\n", 303 | " dtype={\"TailNum\": str, \"CRSElapsedTime\": float, \"Cancelled\": bool},\n", 304 | ")" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": null, 310 | "metadata": {}, 311 | "outputs": [], 312 | "source": [ 313 | "ddf.tail() # now works" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "### Reading from remote storage\n", 321 | "\n", 322 | "If you're thinking about distributed computing, your data is probably stored remotely on services (like Amazon's S3 or Google's cloud storage) and is in a friendlier format (like Parquet). Dask can read data in various formats directly from these remote locations **lazily** and **in parallel**.\n", 323 | "\n", 324 | "Here's how you can read the NYC taxi cab data from Amazon S3:\n", 325 | "\n", 326 | "```python\n", 327 | "ddf = dd.read_parquet(\n", 328 | " \"s3://nyc-tlc/trip data/yellow_tripdata_2012-*.parquet\",\n", 329 | ")\n", 330 | "```\n", 331 | "\n", 332 | "You can also leverage Parquet-specific optimizations like column selection and metadata handling, learn more in [the Dask documentation on working with Parquet files](https://docs.dask.org/en/stable/dataframe-parquet.html)." 333 | ] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": { 338 | "tags": [] 339 | }, 340 | "source": [ 341 | "## Computations with `dask.dataframe`\n", 342 | "\n", 343 | "Let's compute the maximum of the flight delay.\n", 344 | "\n", 345 | "With just pandas, we would loop over each file to find the individual maximums, then find the final maximum over all the individual maximums.\n", 346 | "\n", 347 | "```python\n", 348 | "import pandas as pd\n", 349 | "\n", 350 | "files = os.listdir(os.path.join('data', 'nycflights'))\n", 351 | "\n", 352 | "maxes = []\n", 353 | "\n", 354 | "for file in files:\n", 355 | " df = pd.read_csv(os.path.join('data', 'nycflights', file))\n", 356 | " maxes.append(df.DepDelay.max())\n", 357 | " \n", 358 | "final_max = max(maxes)\n", 359 | "```\n", 360 | "\n", 361 | "`dask.dataframe` lets us write pandas-like code, that operates on larger-than-memory datasets in parallel." 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": null, 367 | "metadata": {}, 368 | "outputs": [], 369 | "source": [ 370 | "%%time\n", 371 | "result = ddf.DepDelay.max()\n", 372 | "result.compute()" 373 | ] 374 | }, 375 | { 376 | "cell_type": "markdown", 377 | "metadata": {}, 378 | "source": [ 379 | "This creates the lazy computation for us and then runs it. " 380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "metadata": {}, 385 | "source": [ 386 | "**Note:** Dask will delete intermediate results (like the full pandas DataFrame for each file) as soon as possible. This means you can handle datasets that are larger than memory but, repeated computations will have to load all of the data in each time. 
(Run the code above again, is it faster or slower than you would expect?)" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "metadata": {}, 392 | "source": [ 393 | "You can view the underlying task graph using `.visualize()`:" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": null, 399 | "metadata": {}, 400 | "outputs": [], 401 | "source": [ 402 | "# notice the parallelism\n", 403 | "result.visualize()" 404 | ] 405 | }, 406 | { 407 | "cell_type": "markdown", 408 | "metadata": {}, 409 | "source": [ 410 | "## Exercises\n", 411 | "\n", 412 | "In this section you will do a few `dask.dataframe` computations. If you are comfortable with pandas then these should be familiar. You will have to think about when to call `.compute()`." 413 | ] 414 | }, 415 | { 416 | "cell_type": "markdown", 417 | "metadata": {}, 418 | "source": [ 419 | "### 1. How many rows are in our dataset?\n", 420 | "\n", 421 | "_Hint_: how would you check how many items are in a list?" 422 | ] 423 | }, 424 | { 425 | "cell_type": "code", 426 | "execution_count": null, 427 | "metadata": {}, 428 | "outputs": [], 429 | "source": [ 430 | "# Your code here" 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": null, 436 | "metadata": { 437 | "jupyter": { 438 | "source_hidden": true 439 | }, 440 | "tags": [] 441 | }, 442 | "outputs": [], 443 | "source": [ 444 | "len(ddf)" 445 | ] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "metadata": {}, 450 | "source": [ 451 | "### 2. In total, how many non-canceled flights were taken?\n", 452 | "\n", 453 | "_Hint_: use [boolean indexing](https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing)." 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": null, 459 | "metadata": {}, 460 | "outputs": [], 461 | "source": [ 462 | "# Your code here" 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": null, 468 | "metadata": { 469 | "jupyter": { 470 | "source_hidden": true 471 | }, 472 | "tags": [] 473 | }, 474 | "outputs": [], 475 | "source": [ 476 | "len(ddf[~ddf.Cancelled])" 477 | ] 478 | }, 479 | { 480 | "cell_type": "markdown", 481 | "metadata": {}, 482 | "source": [ 483 | "### 3. In total, how many non-canceled flights were taken from each airport?\n", 484 | "\n", 485 | "*Hint*: use [groupby](https://pandas.pydata.org/pandas-docs/stable/groupby.html)." 486 | ] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "execution_count": null, 491 | "metadata": {}, 492 | "outputs": [], 493 | "source": [ 494 | "# Your code here" 495 | ] 496 | }, 497 | { 498 | "cell_type": "code", 499 | "execution_count": null, 500 | "metadata": { 501 | "jupyter": { 502 | "source_hidden": true 503 | }, 504 | "tags": [] 505 | }, 506 | "outputs": [], 507 | "source": [ 508 | "ddf[~ddf.Cancelled].groupby(\"Origin\").Origin.count().compute()" 509 | ] 510 | }, 511 | { 512 | "cell_type": "markdown", 513 | "metadata": {}, 514 | "source": [ 515 | "### 4. What was the average departure delay from each airport?" 
516 | ] 517 | }, 518 | { 519 | "cell_type": "code", 520 | "execution_count": null, 521 | "metadata": {}, 522 | "outputs": [], 523 | "source": [ 524 | "# Your code here" 525 | ] 526 | }, 527 | { 528 | "cell_type": "code", 529 | "execution_count": null, 530 | "metadata": { 531 | "jupyter": { 532 | "source_hidden": true 533 | }, 534 | "tags": [] 535 | }, 536 | "outputs": [], 537 | "source": [ 538 | "ddf.groupby(\"Origin\").DepDelay.mean().compute()" 539 | ] 540 | }, 541 | { 542 | "cell_type": "markdown", 543 | "metadata": {}, 544 | "source": [ 545 | "### 5. What day of the week has the worst average departure delay?" 546 | ] 547 | }, 548 | { 549 | "cell_type": "code", 550 | "execution_count": null, 551 | "metadata": {}, 552 | "outputs": [], 553 | "source": [ 554 | "# Your code here" 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": null, 560 | "metadata": { 561 | "jupyter": { 562 | "source_hidden": true 563 | }, 564 | "tags": [] 565 | }, 566 | "outputs": [], 567 | "source": [ 568 | "ddf.groupby(\"DayOfWeek\").DepDelay.mean().idxmax().compute()" 569 | ] 570 | }, 571 | { 572 | "cell_type": "markdown", 573 | "metadata": {}, 574 | "source": [ 575 | "### 6. Let's say the distance column is erroneous and you need to add 1 to all values, how would you do this?" 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": null, 581 | "metadata": {}, 582 | "outputs": [], 583 | "source": [ 584 | "# Your code here" 585 | ] 586 | }, 587 | { 588 | "cell_type": "code", 589 | "execution_count": null, 590 | "metadata": { 591 | "jupyter": { 592 | "source_hidden": true 593 | }, 594 | "tags": [] 595 | }, 596 | "outputs": [], 597 | "source": [ 598 | "ddf[\"Distance\"].apply(\n", 599 | " lambda x: x + 1\n", 600 | ").compute() # don't worry about the warning, we'll discuss in the next sections\n", 601 | "\n", 602 | "# OR\n", 603 | "\n", 604 | "(ddf[\"Distance\"] + 1).compute()" 605 | ] 606 | }, 607 | { 608 | "cell_type": "markdown", 609 | "metadata": {}, 610 | "source": [ 611 | "## Sharing Intermediate Results\n", 612 | "\n", 613 | "When computing all of the above, we sometimes did the same operation more than once. For most operations, `dask.dataframe` stores the arguments, allowing duplicate computations to be shared and only computed once.\n", 614 | "\n", 615 | "For example, let's compute the mean and standard deviation for departure delay of all non-canceled flights. Since Dask operations are lazy, those values aren't the final results yet. They're just the steps required to get the result.\n", 616 | "\n", 617 | "If you compute them with two calls to compute, there is no sharing of intermediate computations." 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": null, 623 | "metadata": {}, 624 | "outputs": [], 625 | "source": [ 626 | "non_canceled = ddf[~ddf.Cancelled]\n", 627 | "mean_delay = non_canceled.DepDelay.mean()\n", 628 | "std_delay = non_canceled.DepDelay.std()" 629 | ] 630 | }, 631 | { 632 | "cell_type": "code", 633 | "execution_count": null, 634 | "metadata": { 635 | "tags": [] 636 | }, 637 | "outputs": [], 638 | "source": [ 639 | "%%time\n", 640 | "\n", 641 | "mean_delay_res = mean_delay.compute()\n", 642 | "std_delay_res = std_delay.compute()" 643 | ] 644 | }, 645 | { 646 | "cell_type": "markdown", 647 | "metadata": {}, 648 | "source": [ 649 | "### `dask.compute`" 650 | ] 651 | }, 652 | { 653 | "cell_type": "markdown", 654 | "metadata": {}, 655 | "source": [ 656 | "But let's try by passing both to a single `compute` call." 
657 | ] 658 | }, 659 | { 660 | "cell_type": "code", 661 | "execution_count": null, 662 | "metadata": { 663 | "tags": [] 664 | }, 665 | "outputs": [], 666 | "source": [ 667 | "%%time\n", 668 | "\n", 669 | "mean_delay_res, std_delay_res = dask.compute(mean_delay, std_delay)" 670 | ] 671 | }, 672 | { 673 | "cell_type": "markdown", 674 | "metadata": {}, 675 | "source": [ 676 | "Using `dask.compute` takes roughly 1/2 the time. This is because the task graphs for both results are merged when calling `dask.compute`, allowing shared operations to only be done once instead of twice. In particular, using `dask.compute` only does the following once:\n", 677 | "\n", 678 | "- the calls to `read_csv`\n", 679 | "- the filter (`df[~df.Cancelled]`)\n", 680 | "- some of the necessary reductions (`sum`, `count`)" 681 | ] 682 | }, 683 | { 684 | "cell_type": "markdown", 685 | "metadata": {}, 686 | "source": [ 687 | "To see what the merged task graphs between multiple results look like (and what's shared), you can use the `dask.visualize` function (you might want to use `filename='graph.pdf'` to save the graph to disk so that you can zoom in more easily):" 688 | ] 689 | }, 690 | { 691 | "cell_type": "code", 692 | "execution_count": null, 693 | "metadata": {}, 694 | "outputs": [], 695 | "source": [ 696 | "dask.visualize(mean_delay, std_delay, engine=\"cytoscape\")" 697 | ] 698 | }, 699 | { 700 | "cell_type": "markdown", 701 | "metadata": { 702 | "tags": [] 703 | }, 704 | "source": [ 705 | "### `.persist()`\n", 706 | "\n", 707 | "While using a distributed scheduler (you will learn more about schedulers in the upcoming notebooks), you can keep some _data that you want to use often_ in the _distributed memory_. \n", 708 | "\n", 709 | "`persist` generates \"Futures\" (more on this later as well) and stores them in the same structure as your output. You can use `persist` with any data or computation that fits in memory." 710 | ] 711 | }, 712 | { 713 | "cell_type": "markdown", 714 | "metadata": {}, 715 | "source": [ 716 | "If you want to analyze data only for non-canceled flights departing from JFK airport, you can either have two compute calls like in the previous section:" 717 | ] 718 | }, 719 | { 720 | "cell_type": "code", 721 | "execution_count": null, 722 | "metadata": {}, 723 | "outputs": [], 724 | "source": [ 725 | "non_cancelled = ddf[~ddf.Cancelled]\n", 726 | "ddf_jfk = non_cancelled[non_cancelled.Origin == \"JFK\"]" 727 | ] 728 | }, 729 | { 730 | "cell_type": "code", 731 | "execution_count": null, 732 | "metadata": {}, 733 | "outputs": [], 734 | "source": [ 735 | "%%time\n", 736 | "ddf_jfk.DepDelay.mean().compute()\n", 737 | "ddf_jfk.DepDelay.sum().compute()" 738 | ] 739 | }, 740 | { 741 | "cell_type": "markdown", 742 | "metadata": {}, 743 | "source": [ 744 | "Or, consider persisting that subset of data in memory.\n", 745 | "\n", 746 | "See the \"Graph\" dashboard plot, the red squares indicate persisted data stored as Futures in memory. You will also notice an increase in Worker Memory (another dashboard plot) consumption." 
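One aside before running the example below: persisted results occupy distributed memory until the last reference to them goes away. A hypothetical pattern (the `subset` name and the `EWR` filter are illustrative, not part of this notebook's flow) might look like:

```python
# Persist a subset you will reuse, then release it when you are done so the
# workers can free the memory backing it.
subset = ddf[ddf.Origin == "EWR"].persist()
# ... repeated computations on `subset` ...
del subset                    # dropping the reference lets the scheduler clean up
# or, explicitly: client.cancel(subset)
```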
747 | ] 748 | }, 749 | { 750 | "cell_type": "code", 751 | "execution_count": null, 752 | "metadata": {}, 753 | "outputs": [], 754 | "source": [ 755 | "ddf_jfk = ddf_jfk.persist() # returns back control immediately" 756 | ] 757 | }, 758 | { 759 | "cell_type": "code", 760 | "execution_count": null, 761 | "metadata": {}, 762 | "outputs": [], 763 | "source": [ 764 | "%%time\n", 765 | "ddf_jfk.DepDelay.mean().compute()\n", 766 | "ddf_jfk.DepDelay.std().compute()" 767 | ] 768 | }, 769 | { 770 | "cell_type": "markdown", 771 | "metadata": {}, 772 | "source": [ 773 | "Analyses on this persisted data is faster because we are not repeating the loading and selecting (non-canceled, JFK departure) operations." 774 | ] 775 | }, 776 | { 777 | "cell_type": "markdown", 778 | "metadata": {}, 779 | "source": [ 780 | "## Custom code with Dask DataFrame\n", 781 | "\n", 782 | "`dask.dataframe` only covers a small but well-used portion of the pandas API.\n", 783 | "\n", 784 | "This limitation is for two reasons:\n", 785 | "\n", 786 | "1. The Pandas API is *huge*\n", 787 | "2. Some operations are genuinely hard to do in parallel, e.g, sorting.\n", 788 | "\n", 789 | "Additionally, some important operations like `set_index` work, but are slower than in pandas because they include substantial shuffling of data, and may write out to disk.\n", 790 | "\n", 791 | "**What if you want to use some custom functions that aren't (or can't be) implemented for Dask DataFrame yet?**\n", 792 | "\n", 793 | "You can open an issue on the [Dask issue tracker](https://github.com/dask/dask/issues) to check how feasible the function could be to implement, and you can consider contributing this function to Dask.\n", 794 | "\n", 795 | "In case it's a custom function or tricky to implement, `dask.dataframe` provides a few methods to make applying custom functions to Dask DataFrames easier:\n", 796 | "\n", 797 | "- [`map_partitions`](https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.map_partitions.html): to run a function on each partition (each pandas DataFrame) of the Dask DataFrame\n", 798 | "- [`map_overlap`](https://docs.dask.org/en/latest/generated/dask.dataframe.rolling.map_overlap.html): to run a function on each partition (each pandas DataFrame) of the Dask DataFrame, with some rows shared between neighboring partitions\n", 799 | "- [`reduction`](https://docs.dask.org/en/latest/generated/dask.dataframe.Series.reduction.html): for custom row-wise reduction operations." 800 | ] 801 | }, 802 | { 803 | "cell_type": "markdown", 804 | "metadata": {}, 805 | "source": [ 806 | "Let's take a quick look at the `map_partitions()` function:" 807 | ] 808 | }, 809 | { 810 | "cell_type": "code", 811 | "execution_count": null, 812 | "metadata": { 813 | "tags": [] 814 | }, 815 | "outputs": [], 816 | "source": [ 817 | "help(ddf.map_partitions)" 818 | ] 819 | }, 820 | { 821 | "cell_type": "markdown", 822 | "metadata": {}, 823 | "source": [ 824 | "The \"Distance\" column in `ddf` is currently in miles. Let's say we want to convert the units to kilometers and we have a general helper function as shown below. In this case, we can use `map_partitions` to apply this function across each of the internal pandas `DataFrame`s in parallel. 
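For instance, here is a tiny sketch of the same idea, independent of the Distance example that follows: applying `len` to every partition returns the number of rows held by each underlying pandas DataFrame.

```python
# Each partition is a pandas DataFrame; mapping ``len`` over the partitions
# gives the per-partition row counts (one value per original CSV file here).
rows_per_partition = ddf.map_partitions(len)
rows_per_partition.compute()
```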
" 825 | ] 826 | }, 827 | { 828 | "cell_type": "code", 829 | "execution_count": null, 830 | "metadata": {}, 831 | "outputs": [], 832 | "source": [ 833 | "def my_custom_converter(df, multiplier=1):\n", 834 | " return df * multiplier\n", 835 | "\n", 836 | "\n", 837 | "meta = pd.Series(name=\"Distance\", dtype=\"float64\")\n", 838 | "\n", 839 | "distance_km = ddf.Distance.map_partitions(\n", 840 | " my_custom_converter, multiplier=0.6, meta=meta\n", 841 | ")" 842 | ] 843 | }, 844 | { 845 | "cell_type": "code", 846 | "execution_count": null, 847 | "metadata": {}, 848 | "outputs": [], 849 | "source": [ 850 | "distance_km.visualize()" 851 | ] 852 | }, 853 | { 854 | "cell_type": "code", 855 | "execution_count": null, 856 | "metadata": {}, 857 | "outputs": [], 858 | "source": [ 859 | "distance_km.head()" 860 | ] 861 | }, 862 | { 863 | "cell_type": "markdown", 864 | "metadata": {}, 865 | "source": [ 866 | "### What is `meta`?\n", 867 | "\n", 868 | "Since Dask operates lazily, it doesn't always have enough information to infer the output structure (which includes datatypes) of certain operations.\n", 869 | "\n", 870 | "`meta` is a _suggestion_ to Dask about the output of your computation. Importantly, `meta` _never infers with the output structure_. Dask uses this `meta` until it can determine the actual output structure.\n", 871 | "\n", 872 | "Even though there are many ways to define `meta`, we suggest using a small pandas Series or DataFrame that matches the structure of your final output." 873 | ] 874 | }, 875 | { 876 | "cell_type": "markdown", 877 | "metadata": { 878 | "tags": [] 879 | }, 880 | "source": [ 881 | "## Close you local Dask Cluster" 882 | ] 883 | }, 884 | { 885 | "cell_type": "markdown", 886 | "metadata": {}, 887 | "source": [ 888 | "It's good practice to always close any Dask cluster you create:" 889 | ] 890 | }, 891 | { 892 | "cell_type": "code", 893 | "execution_count": null, 894 | "metadata": {}, 895 | "outputs": [], 896 | "source": [ 897 | "client.shutdown()" 898 | ] 899 | } 900 | ], 901 | "metadata": { 902 | "anaconda-cloud": {}, 903 | "kernelspec": { 904 | "display_name": "Python 3 (ipykernel)", 905 | "language": "python", 906 | "name": "python3" 907 | }, 908 | "language_info": { 909 | "codemirror_mode": { 910 | "name": "ipython", 911 | "version": 3 912 | }, 913 | "file_extension": ".py", 914 | "mimetype": "text/x-python", 915 | "name": "python", 916 | "nbconvert_exporter": "python", 917 | "pygments_lexer": "ipython3", 918 | "version": "3.10.5" 919 | } 920 | }, 921 | "nbformat": 4, 922 | "nbformat_minor": 4 923 | } 924 | --------------------------------------------------------------------------------