├── images
│   ├── SO_page.png
│   ├── data_from_post.png
│   ├── storage-files.png
│   ├── sketch_bar_plots.png
│   ├── dask_SO_posts_links.png
│   ├── coiled-logo.svg
│   └── sketch_bar_plots.svg
├── .dask
│   └── config.yml
├── binder
│   ├── jupyterlab-workspace.json
│   ├── start
│   └── environment.yml
├── .gitignore
├── LICENSE
├── README.md
├── tutorials-v2
│   ├── 1-Parallelize-your-python-code_Futures_API.ipynb
│   └── 2-Get_better-at-dask-dataframes.ipynb
├── 1-Parallelize-your-python-code_Futures_API.ipynb
└── 2-Get_better-at-dask-dataframes.ipynb

/images/SO_page.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/coiled/dask-tutorial/main/images/SO_page.png
--------------------------------------------------------------------------------
/images/data_from_post.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/coiled/dask-tutorial/main/images/data_from_post.png
--------------------------------------------------------------------------------
/images/storage-files.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/coiled/dask-tutorial/main/images/storage-files.png
--------------------------------------------------------------------------------
/images/sketch_bar_plots.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/coiled/dask-tutorial/main/images/sketch_bar_plots.png
--------------------------------------------------------------------------------
/images/dask_SO_posts_links.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/coiled/dask-tutorial/main/images/dask_SO_posts_links.png
--------------------------------------------------------------------------------
/.dask/config.yml:
--------------------------------------------------------------------------------
1 | distributed:
2 |   logging:
3 |     bokeh: critical
4 | 
5 |   dashboard:
6 |     link: "{JUPYTERHUB_BASE_URL}user/{JUPYTERHUB_USER}/proxy/{port}/status"
7 | 
8 |   admin:
9 |     tick:
10 |       limit: 5s
--------------------------------------------------------------------------------
/binder/jupyterlab-workspace.json:
--------------------------------------------------------------------------------
1 | {
2 |   "data": {
3 |     "file-browser-filebrowser:cwd": {
4 |       "path": ""
5 |     },
6 |     "dask-dashboard-launcher": {
7 |       "url": "DASK_DASHBOARD_URL"
8 |     }
9 |   },
10 |   "metadata": {
11 |     "id": "/lab"
12 |   }
13 | }
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | *.dot
3 | *.pdf
4 | *.png
5 | .ipynb_checkpoints
6 | *.gz
7 | data/accounts.*.csv
8 | data/accounts.h5
9 | data/random.hdf5
10 | data/weather-big
11 | data/myfile.hdf5
12 | data/flightjson
13 | data/holidays
14 | data/nycflights
15 | data/myfile.zarr
16 | data/accounts.parquet
17 | dask-worker-space/
18 | profile.html
19 | log
20 | .idea/
21 | _build/
--------------------------------------------------------------------------------
/binder/start:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | 
3 | # Replace DASK_DASHBOARD_URL with the proxy location
4 | sed -i -e "s|DASK_DASHBOARD_URL|${JUPYTERHUB_BASE_URL}user/${JUPYTERHUB_USER}/proxy/8787|g" binder/jupyterlab-workspace.json
5 | export 
DASK_DISTRIBUTED__DASHBOARD__LINK="${JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status" 6 | 7 | 8 | # Import the workspace 9 | jupyter lab workspaces import binder/jupyterlab-workspace.json 10 | 11 | exec "$@" -------------------------------------------------------------------------------- /binder/environment.yml: -------------------------------------------------------------------------------- 1 | name: dask-tutorial 2 | channels: 3 | - conda-forge 4 | dependencies: 5 | - python==3.10 6 | - dask==2023.5.0 7 | - distributed==2023.5.0 8 | - coiled==0.7.7 9 | - numpy==1.23.5 10 | - pandas==2.0.1 11 | - pyarrow==10.0.1 12 | - ipykernel==6.21.2 13 | - dask-labextension==6.1.0 14 | - jupyterlab==3.5.0 15 | - s3fs==2023.5.0 16 | - python-graphviz==0.20.1 17 | - beautifulsoup4==4.11.1 18 | - matplotlib==3.6.2 19 | - gilknocker==0.4.1 20 | -------------------------------------------------------------------------------- /images/coiled-logo.svg: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2022, Coiled 4 | 5 | Redistribution and use in source and binary forms, with or without 6 | modification, are permitted provided that the following conditions are met: 7 | 8 | 1. Redistributions of source code must retain the above copyright notice, this 9 | list of conditions and the following disclaimer. 10 | 11 | 2. Redistributions in binary form must reproduce the above copyright notice, 12 | this list of conditions and the following disclaimer in the documentation 13 | and/or other materials provided with the distribution. 14 | 15 | 3. Neither the name of the copyright holder nor the names of its 16 | contributors may be used to endorse or promote products derived from 17 | this software without specific prior written permission. 18 | 19 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 20 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 21 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 22 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 23 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 24 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 25 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 26 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 27 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 28 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 29 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/coiled/dask-tutorial/HEAD) 2 | ## Dask Tutorial by Coiled 3 | 4 | Hands-on 1 hour sessions to learn how to scale workflows using Dask and Coiled. Each session is independent of the others, so you can choose the ones you want or come to all of them. 
5 | 
6 | #### Sign up for the next live session https://www.coiled.io/tutorials
7 | --------------------------------------------------------------------------------------------------------------------------------------------------------
8 | 
9 | ### Parallelize your Python Code: Futures API
10 | [YouTube Recording](https://www.youtube.com/watch?v=32w33L7hseQ)
11 | 
12 | In this lesson, we'll parallelize a custom Python workflow that scrapes, parses, and cleans data from Stack Overflow. We'll get to:
13 | 
14 | - Learn how to do arbitrary task scheduling using the Dask Futures API
15 | - Utilize blocking and non-blocking distributed calculations
16 | 
17 | By the end, we'll see how much faster this workflow is using Dask and how the Dask Futures API is particularly well-suited for this type of fine-grained execution.
18 | 
19 | --------------------------------------------------------------------------------------------------------------------------------------------------------
20 | 
21 | ### Get Better at Dask Dataframes
22 | [YouTube Recording](https://www.youtube.com/watch?v=8bd7DswSxw4)
23 | 
24 | In this lesson, we'll learn some best practices around working with larger-than-memory datasets. We'll use the Uber/Lyft dataset to:
25 | 
26 | - Manipulate Parquet files and optimize queries
27 | - Navigate inconvenient file sizes and data types
28 | - Extract useful features with Dask Dataframe
29 | 
30 | By the end, we'll learn the advantages of working with the Parquet file format and how to efficiently perform an exploratory analysis with Dask.
31 | 
32 | --------------------------------------------------------------------------------------------------------------------------------------------------------
33 | 
34 | ### Setup
35 | 
36 | You have three options for following this tutorial:
37 | 
38 | ### Run locally
39 | 
40 | If you are joining a live session, please do the setup in advance so you are ready to go once the session starts.
41 | 
42 | 1. **Clone this repository**
43 | In your terminal:
44 | 
45 | ```
46 | git clone https://github.com/coiled/dask-tutorial.git
47 | cd dask-tutorial
48 | ```
49 | Alternatively, you can download a zip file of the repository from the top of its main page. This is a good option if you don't have experience with git.
50 | 
51 | 2. **Create Conda Environment**
52 | 
53 | In your terminal, navigate to the directory where you cloned/downloaded the `dask-tutorial` repository and install the required packages:
54 | 
55 | ```
56 | conda env create -f binder/environment.yml
57 | ```
58 | 
59 | This will create a new environment called `dask-tutorial`. To activate the environment, do:
60 | 
61 | ```
62 | conda activate dask-tutorial
63 | ```
64 | 
65 | 3. **Open Jupyter Lab**
66 | 
67 | Once your environment has been activated and you are in the `dask-tutorial` repository, in your terminal do:
68 | 
69 | ```
70 | jupyter lab
71 | ```
72 | 
73 | You will see a notebooks directory; click on it and you will be ready to go.
74 | 
75 | ### Use Coiled notebooks
76 | 
77 | ```
78 | pip install coiled jupyter
79 | coiled login --token ### --account dask-tutorials
80 | coiled notebook start --software dask-tutorials
81 | ```
82 | 
83 | ### Run on binder
84 | 
85 | Click on this button [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/coiled/dask-tutorial/HEAD) to run in a pre-configured cloud environment.
86 | 
87 | If you are joining the live session, please click on the button a few minutes before we start so we are ready to go. 
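
#### Verify your setup

If you chose the local option, you can optionally confirm that the pinned packages import correctly inside the activated environment. (This check is a suggested sketch, not part of the original instructions.)

```python
import dask
import distributed

print(dask.__version__)         # expect 2023.5.0, as pinned in binder/environment.yml
print(distributed.__version__)  # expect 2023.5.0
```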
88 | 
89 | 
--------------------------------------------------------------------------------
/images/sketch_bar_plots.svg:
--------------------------------------------------------------------------------
[Hand-drawn sketch of two bar plots: "Best answers" (count per username: usr1, usr2, usr3) and "Most voted" (votes per username: usr1, usr2, usr3). The embedded Excalidraw scene data and SVG markup are omitted here.]
--------------------------------------------------------------------------------
/tutorials-v2/1-Parallelize-your-python-code_Futures_API.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "id": "bd982e68-605d-461d-a0f1-0166ebb1e55c",
6 |    "metadata": {},
7 |    "source": [
8 |     "<img\n",
9 |     "     src=\"https://raw.githubusercontent.com/coiled/dask-tutorial/main/images/coiled-logo.svg\"\n",
10 |     "     alt=\"Coiled logo\"\n",
11 |     ">\n",
12 |     "\n",
13 |     "### Sign up for the next live session https://www.coiled.io/tutorials\n"
14 |    ]
15 |   },
16 |   {
17 |    "cell_type": "markdown",
18 |    "id": "b04c1daa-d1af-45b1-bcfc-4d56fd3b7593",
19 |    "metadata": {},
20 |    "source": [
21 |     "<img\n",
22 |     "     src=\"...\"\n",
23 |     "     alt=\"Dask logo\"\n",
24 |     ">\n",
25 |     " \n",
26 |     "# Parallelize your Python code\n",
27 |     "\n",
28 |     "In this lesson you will learn how to parallelize custom Python code with Dask, using the Futures API. 
We will take normal for-loopy Python code that looks like this:\n", 29 | "\n", 30 | "```python\n", 31 | "urls = [...]\n", 32 | "results = []\n", 33 | "for url in urls:\n", 34 | " page = download(url)\n", 35 | " result = process(page)\n", 36 | " results.append(result)\n", 37 | "```\n", 38 | "\n", 39 | "or more dynamic Python code that looks like this:\n", 40 | "\n", 41 | "```python\n", 42 | "urls = [...]\n", 43 | "results = []\n", 44 | "while urls:\n", 45 | " url = urls.pop()\n", 46 | " page = download(url)\n", 47 | " result = process(page)\n", 48 | " results.append(result)\n", 49 | " \n", 50 | " new_urls = scrape(page)\n", 51 | " urls.extend(new_urls)\n", 52 | "```\n", 53 | "\n", 54 | "and parallelize it using [Dask Futures](https://docs.dask.org/en/stable/futures.html). \n", 55 | "\n", 56 | "\n", 57 | "## Futures: a low-level collection.\n", 58 | "\n", 59 | "Dask low-level collections are the best tools when you need to have fine control to build custom parallel and distributed computations. \n", 60 | "\n", 61 | "The `futures` interface (derived from the built-in `concurrent.futures`) provides fine-grained real-time execution for custom situations. It allows you to submit arbitrary functions for computation in a parallelized, eager, and non-blocking way. \n", 62 | "\n", 63 | "### Why use Futures?\n", 64 | "\n", 65 | "The `futures` API offers a work submission style that can easily emulate the map/reduce paradigm. If that is familiar to you then futures might be the simplest entrypoint into Dask.\n", 66 | "\n", 67 | "The other big benefit of futures is that the intermediate results, represented by futures, can be passed to new tasks without having to pull data locally from the cluster. The **call returns immediately**, giving one or more *futures*, whose status begins as \"pending\" and later becomes \"finished\". There is no blocking of the local Python session. With futures, as soon as the inputs are available and there is compute available, the computation starts. " 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "id": "6abb22f5-3030-4357-8cfd-baa1ccbe3909", 73 | "metadata": {}, 74 | "source": [ 75 | "## Outline\n", 76 | "\n", 77 | "We will learn how to use futures, and then use them on a real-world example, first in a simple case, and then in a complex case:\n", 78 | "\n", 79 | "1. How to use Futures \n", 80 | "2. Use futures to download and parse webpages\n", 81 | "3. Dynamic/changing workloads\n", 82 | "4. Crawl and scrape a website\n" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "id": "28a67dbd-765c-4459-9a60-5002f72d889e", 88 | "metadata": {}, 89 | "source": [ 90 | "### Parallel Code with low-level Futures\n", 91 | "\n", 92 | "This is an example of an embarrassingly parallel computation. We want to run the same Python code on many pieces of data. This is a very simple and also very common case that comes up all the time.\n", 93 | "\n", 94 | "First, we're going to see a very simple example, then we'll try to parallelize the code above." 
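,
    "\n",
    "Before we parallelize anything real, here is a sketch of the submission style we will practice. (It assumes the `client` and the `inc` function defined in the next cells, so treat it as a preview rather than a cell to run right now.)\n",
    "\n",
    "```python\n",
    "futures = client.map(inc, range(10))  # submit one task per input, all at once\n",
    "total = client.submit(sum, futures)   # futures can feed new tasks; data stays on the cluster\n",
    "total.result()                        # block and fetch the final result\n",
    "```"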
95 |    ]
96 |   },
97 |   {
98 |    "cell_type": "markdown",
99 |    "id": "b37bef54-58d1-4ac7-802b-518bcbb10cb3",
100 |    "metadata": {},
101 |    "source": [
102 |     "### Set up a Dask cluster locally"
103 |    ]
104 |   },
105 |   {
106 |    "cell_type": "code",
107 |    "execution_count": null,
108 |    "id": "1cae81cd-7b3b-42c1-98bf-d83eaaf9b112",
109 |    "metadata": {},
110 |    "outputs": [],
111 |    "source": [
112 |     "from dask.distributed import Client\n",
113 |     "\n",
114 |     "client = Client()\n",
115 |     "client"
116 |    ]
117 |   },
118 |   {
119 |    "cell_type": "markdown",
120 |    "id": "6301907f-8c9d-4f49-9873-ea57a2595e15",
121 |    "metadata": {
122 |     "tags": []
123 |    },
124 |    "source": [
125 |     "### Dask Futures introduction"
126 |    ]
127 |   },
128 |   {
129 |    "cell_type": "code",
130 |    "execution_count": null,
131 |    "id": "c0744577-3f0b-4dec-a427-f4af4ed907d0",
132 |    "metadata": {},
133 |    "outputs": [],
134 |    "source": [
135 |     "import time\n",
136 |     "import random\n",
137 |     "\n",
138 |     "def inc(x):\n",
139 |     "    time.sleep(random.random())\n",
140 |     "    return x + 1\n",
141 |     "\n",
142 |     "def double(x):\n",
143 |     "    time.sleep(random.random())\n",
144 |     "    return 2 * x\n",
145 |     "\n",
146 |     "def add(x, y):\n",
147 |     "    time.sleep(random.random())\n",
148 |     "    return x + y\n",
149 |     "    "
150 |    ]
151 |   },
152 |   {
153 |    "cell_type": "code",
154 |    "execution_count": null,
155 |    "id": "866d9a8b-5424-46ee-8cd1-ae516b1cb59d",
156 |    "metadata": {},
157 |    "outputs": [],
158 |    "source": [
159 |     "%%time\n",
160 |     "\n",
161 |     "y = inc(10)\n",
162 |     "z = double(y)\n",
163 |     "z"
164 |    ]
165 |   },
166 |   {
167 |    "cell_type": "markdown",
168 |    "id": "936d4ac8-9939-4195-84c1-37e90b67bc58",
169 |    "metadata": {},
170 |    "source": [
171 |     "Dask futures let us run Python functions remotely on parallel hardware. Rather than calling the function directly, like in the cell above, we can ask Dask to run that function, `inc`, on the data `10` by passing each as arguments into the `client.submit` method. The first argument is the function to call and the rest of the arguments are arguments to that function.\n",
172 |     "\n",
173 |     "Normal execution\n",
174 |     "\n",
175 |     "```python\n",
176 |     "result = function(*args, **kwargs)  # e.g. inc(10)\n",
177 |     "```\n",
178 |     "\n",
179 |     "Submit the function for remote execution\n",
180 |     "\n",
181 |     "```python\n",
182 |     "future = client.submit(function, *args, **kwargs)  # instantaneously fire off work\n",
183 |     "...\n",
184 |     "result = future.result()  # when we need it, block until done and collect the result\n",
185 |     "```"
186 |    ]
187 |   },
188 |   {
189 |    "cell_type": "code",
190 |    "execution_count": null,
191 |    "id": "0c397a32-a6fa-4e7b-aa69-cb2eb930e12a",
192 |    "metadata": {},
193 |    "outputs": [],
194 |    "source": [
195 |     "%%time\n",
196 |     "\n",
197 |     "y = client.submit(inc, 10)\n",
198 |     "z = client.submit(double, y)\n",
199 |     "z"
200 |    ]
201 |   },
202 |   {
203 |    "cell_type": "markdown",
204 |    "id": "37bc6676-bb3b-4635-b2e2-7561f4cfeef9",
205 |    "metadata": {},
206 |    "source": [
207 |     "You'll notice that that happened immediately. 
That's because all we did was submit the `inc` function to run on Dask, and then return a `Future`, or a pointer to where the data will eventually be.\n", 208 | "\n", 209 | "We can gather the future by calling `future.result()`" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "id": "972bc35b-bd30-427e-bcab-2a424c157cea", 216 | "metadata": {}, 217 | "outputs": [], 218 | "source": [ 219 | "z" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": null, 225 | "id": "d6a4bd1e-a269-4b86-be47-ecbcce873c13", 226 | "metadata": {}, 227 | "outputs": [], 228 | "source": [ 229 | "z.result()" 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "id": "22280cd9-1bb6-469f-9434-99c716e3d59c", 235 | "metadata": {}, 236 | "source": [ 237 | "### Submit many tasks in a loop\n", 238 | "\n", 239 | "We can submit lots of functions to run at once, and then gather them when we're done. This allows us to easily parallelize simple for loops.\n", 240 | "\n", 241 | "*This section uses the following API*:\n", 242 | "\n", 243 | "- [Client.submit and Future.result](https://docs.dask.org/en/stable/futures.html#submit-tasks)" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "id": "d6fcf290-6638-4877-b719-fabe9a654347", 249 | "metadata": {}, 250 | "source": [ 251 | "#### Sequential code" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": null, 257 | "id": "8de1b192-0ca7-4ead-b738-92e632f3eb74", 258 | "metadata": {}, 259 | "outputs": [], 260 | "source": [ 261 | "%%time \n", 262 | "\n", 263 | "data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n", 264 | "results = []\n", 265 | "\n", 266 | "for x in data:\n", 267 | " y = inc(x)\n", 268 | " z = double(y)\n", 269 | " results.append(z)\n", 270 | " \n", 271 | "results" 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "id": "22e16d8c-d49d-451c-b416-f0b1bea0d2fc", 277 | "metadata": {}, 278 | "source": [ 279 | "#### Parallel code" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": null, 285 | "id": "bc34cfb1-1d84-445c-8094-eac1a7e0f5a3", 286 | "metadata": {}, 287 | "outputs": [], 288 | "source": [ 289 | "%%time \n", 290 | "\n", 291 | "data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n", 292 | "results = []\n", 293 | "\n", 294 | "for x in data:\n", 295 | " y = client.submit(inc, x)\n", 296 | " z = client.submit(double, y)\n", 297 | " results.append(z)\n", 298 | " \n", 299 | "results = client.gather(results)\n", 300 | "results" 301 | ] 302 | }, 303 | { 304 | "cell_type": "markdown", 305 | "id": "c7393bff-27b0-46f7-bba7-b6d4c8c56f62", 306 | "metadata": {}, 307 | "source": [ 308 | "### Lessons:\n", 309 | "\n", 310 | "1. Submit a function to run elsewhere\n", 311 | "\n", 312 | " ```python\n", 313 | " y = f(x)\n", 314 | " future = client.submit(f, x)\n", 315 | " ```\n", 316 | " \n", 317 | " \n", 318 | "2. 
Get results when you're done\n", 319 | "\n", 320 | " ```python\n", 321 | " y = future.result()\n", 322 | " # or \n", 323 | " results = client.gather(futures)\n", 324 | " ```" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "id": "9e364aee-cb00-4b67-8cba-444af569f80e", 330 | "metadata": {}, 331 | "source": [ 332 | "## Use futures to download and parse webpages\n", 333 | "\n", 334 | "### Sequential Code\n", 335 | "\n", 336 | "The code below downloads 50 question pages from a Stack Overflow tag, parses those pages, and collects the title and list of tags from each page.\n", 337 | "\n", 338 | "We then count up all the tags to see what are the most popular kinds of questions. We divide this code into four sections:\n", 339 | "\n", 340 | "1. Define useful functions\n", 341 | "2. Get a list of pages to download and scrape\n", 342 | "3. Download and scrape\n", 343 | "4. Analyze results" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "id": "abce7ac1-7e53-480e-869e-e97afc8f15f1", 349 | "metadata": {}, 350 | "source": [ 351 | "#### Define useful functions\n", 352 | "\n", 353 | "You don't need to study these. Feel free to skip." 354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "execution_count": null, 359 | "id": "ebf48fc6-b488-4266-95c0-83ac57c04173", 360 | "metadata": {}, 361 | "outputs": [], 362 | "source": [ 363 | "import re\n", 364 | "import requests\n", 365 | "from bs4 import BeautifulSoup\n", 366 | "import time\n", 367 | "\n", 368 | "def download(url: str, delay=0) -> str:\n", 369 | " time.sleep(delay)\n", 370 | " response = requests.get(url)\n", 371 | " if response.status_code == 200:\n", 372 | " return response.text\n", 373 | " else:\n", 374 | " response.raise_for_status()\n", 375 | " \n", 376 | " \n", 377 | "def scrape_title(body: str) -> str:\n", 378 | " html = BeautifulSoup(body, \"html.parser\")\n", 379 | " return str(html.html.title)\n", 380 | "\n", 381 | "\n", 382 | "def scrape_links(body: str, base_url=\"\") -> list[str]:\n", 383 | " html = BeautifulSoup(body, \"html.parser\")\n", 384 | " \n", 385 | " return [\n", 386 | " str(base_url + link.attrs[\"href\"]).split(\"?\")[0]\n", 387 | " for link in html.find_all(\"a\") \n", 388 | " if re.match(\"/questions/\\d{5}\", link.attrs.get(\"href\", \"\"))\n", 389 | " ]\n", 390 | "\n", 391 | "\n", 392 | "def scrape_tags(body: str) -> list[str]:\n", 393 | " html = BeautifulSoup(body, \"html.parser\")\n", 394 | " \n", 395 | " return sorted({\n", 396 | " str(list(link.children)[0])\n", 397 | " for link in html.find_all(\"a\", class_=\"post-tag\")\n", 398 | " })" 399 | ] 400 | }, 401 | { 402 | "cell_type": "markdown", 403 | "id": "6eb36481-ad2f-4a3c-a3f5-7ef39b02bcb3", 404 | "metadata": {}, 405 | "source": [ 406 | "### Serial for-loopy code" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "id": "c4eae6b5-22b6-40b8-a146-de174b2a1fdb", 412 | "metadata": {}, 413 | "source": [ 414 | "#### Get list of pages to download and scrape" 415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "execution_count": null, 420 | "id": "de04dc62-abcc-4e77-b127-c7ad6e31c855", 421 | "metadata": {}, 422 | "outputs": [], 423 | "source": [ 424 | "url = \"https://stackoverflow.com/questions/tagged/dask\"\n", 425 | "body = download(url)\n", 426 | "urls = scrape_links(body, base_url=\"https://stackoverflow.com\")\n", 427 | "urls[:5]" 428 | ] 429 | }, 430 | { 431 | "cell_type": "code", 432 | "execution_count": null, 433 | "id": "b2d107cc-e13c-498e-afaf-6af47d0ac1ec", 434 | "metadata": {}, 435 | "outputs": [], 436 | 
"source": [ 437 | "len(urls)" 438 | ] 439 | }, 440 | { 441 | "cell_type": "markdown", 442 | "id": "9a618f52-b89b-4988-930e-8f9a46f2beba", 443 | "metadata": {}, 444 | "source": [ 445 | "#### Download and scrape" 446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": null, 451 | "id": "ac648177-8525-446c-b9fd-85698ef2a8f2", 452 | "metadata": {}, 453 | "outputs": [], 454 | "source": [ 455 | "%%time\n", 456 | "\n", 457 | "all_tags = []\n", 458 | "titles = []\n", 459 | "\n", 460 | "for url in urls:\n", 461 | " page = download(url)\n", 462 | " print(\".\", end=\"\")\n", 463 | " tags = scrape_tags(page)\n", 464 | " title = scrape_title(page)\n", 465 | " \n", 466 | " all_tags.append(tags)\n", 467 | " titles.append(title)\n", 468 | "print()" 469 | ] 470 | }, 471 | { 472 | "cell_type": "markdown", 473 | "id": "0a2c30b7-dd2b-4d84-b43c-0e31574bbc96", 474 | "metadata": {}, 475 | "source": [ 476 | "#### Analyze Results\n", 477 | "\n", 478 | "Aggregate tags to find related topics" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": null, 484 | "id": "36b06197-f8a2-4e2c-8a9d-f19c0dad1d09", 485 | "metadata": {}, 486 | "outputs": [], 487 | "source": [ 488 | "import collections\n", 489 | "\n", 490 | "tag_counter = collections.defaultdict(int)\n", 491 | "\n", 492 | "for tags in all_tags:\n", 493 | " for tag in tags:\n", 494 | " tag_counter[tag] += 1\n", 495 | " \n", 496 | "sorted(tag_counter.items(), key=lambda kv: kv[1], reverse=True)[:10]" 497 | ] 498 | }, 499 | { 500 | "cell_type": "markdown", 501 | "id": "6624ee94-5cbe-43be-b37f-f3d64375fb7c", 502 | "metadata": {}, 503 | "source": [ 504 | "### Exercise: Parallelize this code\n", 505 | "\n", 506 | "Take the code above, and use Dask futures to run it in parallel\n", 507 | "\n", 508 | "Which sections should we think about parallelizing?" 
509 |    ]
510 |   },
511 |   {
512 |    "cell_type": "code",
513 |    "execution_count": null,
514 |    "id": "2c07a529-7a06-4fb8-987c-d9c0b3a346f4",
515 |    "metadata": {},
516 |    "outputs": [],
517 |    "source": [
518 |     "url = \"https://stackoverflow.com/questions/tagged/dask\"\n",
519 |     "body = download(url)\n",
520 |     "urls = scrape_links(body, base_url=\"https://stackoverflow.com\")"
521 |    ]
522 |   },
523 |   {
524 |    "cell_type": "code",
525 |    "execution_count": null,
526 |    "id": "9e9f891e-9cb6-46bb-9ad1-172e0e3fb236",
527 |    "metadata": {},
528 |    "outputs": [],
529 |    "source": [
530 |     "%%time\n",
531 |     "\n",
532 |     "# TODO: parallelize me\n",
533 |     "\n",
534 |     "all_tags = []\n",
535 |     "titles = []\n",
536 |     "\n",
537 |     "for url in urls:\n",
538 |     "    page = download(url)\n",
539 |     "    tags = scrape_tags(page)\n",
540 |     "    title = scrape_title(page)\n",
541 |     "    \n",
542 |     "    all_tags.append(tags)\n",
543 |     "    titles.append(title)\n",
544 |     "print()"
545 |    ]
546 |   },
547 |   {
548 |    "cell_type": "markdown",
549 |    "id": "d0f7d1d0-0f3a-4ff4-9f7f-3e9b5b5b1bd6",
550 |    "metadata": {},
551 |    "source": [
552 |     "#### Solution\n",
553 |     "\n",
554 |     "Expand the three dots below if you want to see the answer"
555 |    ]
556 |   },
557 |   {
558 |    "cell_type": "code",
559 |    "execution_count": null,
560 |    "id": "da65abf2-b2a2-4c15-b991-d030b88d01ad",
561 |    "metadata": {
562 |     "jupyter": {
563 |      "source_hidden": true
564 |     },
565 |     "tags": []
566 |    },
567 |    "outputs": [],
568 |    "source": [
569 |     "%%time\n",
570 |     "\n",
571 |     "all_tags = []\n",
572 |     "titles = []\n",
573 |     "\n",
574 |     "for url in urls:\n",
575 |     "    page = client.submit(download, url)\n",
576 |     "    tags = client.submit(scrape_tags, page)\n",
577 |     "    title = client.submit(scrape_title, page)\n",
578 |     "    \n",
579 |     "    all_tags.append(tags)\n",
580 |     "    titles.append(title)\n",
581 |     "    \n",
582 |     "all_tags = client.gather(all_tags)\n",
583 |     "titles = client.gather(titles)"
584 |    ]
585 |   },
586 |   {
587 |    "cell_type": "code",
588 |    "execution_count": null,
589 |    "id": "9536ff6a-9247-414d-bcab-bb414cc8ae86",
590 |    "metadata": {},
591 |    "outputs": [],
592 |    "source": [
593 |     "import collections\n",
594 |     "\n",
595 |     "tag_counter = collections.defaultdict(int)\n",
596 |     "\n",
597 |     "for tags in all_tags:\n",
598 |     "    for tag in tags:\n",
599 |     "        tag_counter[tag] += 1\n",
600 |     "    \n",
601 |     "sorted(tag_counter.items(), key=lambda kv: kv[1], reverse=True)[:10]"
602 |    ]
603 |   },
604 |   {
605 |    "cell_type": "markdown",
606 |    "id": "84a0c0c5-4656-41e4-b701-d9b26b09921b",
607 |    "metadata": {},
608 |    "source": [
609 |     "### Exercise: Scale out\n",
610 |     "\n",
611 |     "There are different reasons to scale out for this problem:\n",
612 |     "\n",
613 |     "1. Parallelize bandwidth\n",
614 |     "2. StackOverflow's rate-limits won't affect us as much if we spread out our requests from many different machines\n",
615 |     "3. ~CPU Processing speed~ (not really an issue here)\n",
616 |     "\n",
617 |     "Let's ask for some machines from Coiled, and switch our Dask client to use that cluster." 
618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": null, 623 | "id": "2164349e-185f-4040-85db-1bb58a181e87", 624 | "metadata": {}, 625 | "outputs": [], 626 | "source": [ 627 | "client.close()" 628 | ] 629 | }, 630 | { 631 | "cell_type": "code", 632 | "execution_count": null, 633 | "id": "a2656f3c-2ed4-4c46-b2c9-2559a42f663d", 634 | "metadata": {}, 635 | "outputs": [], 636 | "source": [ 637 | "import coiled\n", 638 | "\n", 639 | "cluster = coiled.Cluster(\n", 640 | " n_workers=20,\n", 641 | " account=\"dask-tutorials\",\n", 642 | ")\n", 643 | "\n", 644 | "client = cluster.get_client()" 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": null, 650 | "id": "7b5152a8-fc00-4c44-8f59-23d320d5aac3", 651 | "metadata": {}, 652 | "outputs": [], 653 | "source": [ 654 | "client" 655 | ] 656 | }, 657 | { 658 | "cell_type": "markdown", 659 | "id": "5c4ef8ef-05a1-49d9-b07c-01c8c3dcbbc2", 660 | "metadata": {}, 661 | "source": [ 662 | "**Rerun your computation and see.**" 663 | ] 664 | }, 665 | { 666 | "cell_type": "markdown", 667 | "id": "4b692762-e08f-4856-b07e-c24ef9685851", 668 | "metadata": {}, 669 | "source": [ 670 | "#### Solution" 671 | ] 672 | }, 673 | { 674 | "cell_type": "code", 675 | "execution_count": null, 676 | "id": "4f9cb8c5-723b-423c-85d5-8bfa82454e3c", 677 | "metadata": { 678 | "jupyter": { 679 | "source_hidden": true 680 | }, 681 | "tags": [] 682 | }, 683 | "outputs": [], 684 | "source": [ 685 | "%%time\n", 686 | "\n", 687 | "all_tags = []\n", 688 | "titles = []\n", 689 | "\n", 690 | "for url in urls:\n", 691 | " page = client.submit(download, url)\n", 692 | " tags = client.submit(scrape_tags, page)\n", 693 | " title = client.submit(scrape_title, page)\n", 694 | " \n", 695 | " all_tags.append(tags)\n", 696 | " titles.append(title)\n", 697 | " \n", 698 | "all_tags = client.gather(all_tags)\n", 699 | "titles = client.gather(titles)" 700 | ] 701 | }, 702 | { 703 | "cell_type": "markdown", 704 | "id": "c09a9bd0-f8f1-4111-a591-eb21fc1ce03e", 705 | "metadata": {}, 706 | "source": [ 707 | "## 3. Evolving computations\n", 708 | "\n", 709 | "Dask futures are flexible. There are many ways to coordinate them including ...\n", 710 | "\n", 711 | "1. Distributed locks and semaphores\n", 712 | "2. Distributed queues\n", 713 | "3. Launching tasks from tasks\n", 714 | "4. Global variables\n", 715 | "5. ... [and lots more](https://docs.dask.org/en/stable/futures.html)\n", 716 | "\n", 717 | "We're going to get a taste of this by learning about one Dask futures feature, [`as_completed`](https://docs.dask.org/en/stable/futures.html#distributed.as_completed), which lets us dynamically build up a computation as it completes.\n", 718 | "\n", 719 | "We will use this to build a parallel web crawler over Stack Overflow. \n", 720 | "\n", 721 | "1. First, we'll build this sequentially.\n", 722 | "2. Second, we'll learn how `as_completed` works in a simple example\n", 723 | "3. 
Third, we'll convert the sequential code into parallel code"
724 |    ]
725 |   },
726 |   {
727 |    "cell_type": "markdown",
728 |    "id": "935e9a45-b547-48ee-bbec-efa69793368a",
729 |    "metadata": {},
730 |    "source": [
731 |     "### Sequential Code to Crawl Stack Overflow"
732 |    ]
733 |   },
734 |   {
735 |    "cell_type": "code",
736 |    "execution_count": null,
737 |    "id": "2cd922ee-db5c-480b-98f4-40365df48dad",
738 |    "metadata": {},
739 |    "outputs": [],
740 |    "source": [
741 |     "%%time\n",
742 |     "from collections import deque\n",
743 |     "\n",
744 |     "urls = deque()\n",
745 |     "urls.append(\"https://stackoverflow.com/questions/tagged/dask\")  # seed with a single page\n",
746 |     "\n",
747 |     "all_tags = []\n",
748 |     "titles = []\n",
749 |     "seen = set()\n",
750 |     "i = 0\n",
751 |     "\n",
752 |     "while urls and i < 10: \n",
753 |     "    url = urls.popleft()\n",
754 |     "    \n",
755 |     "    # Don't scrape the same page twice\n",
756 |     "    if url in seen: \n",
757 |     "        continue\n",
758 |     "    else:\n",
759 |     "        seen.add(url)\n",
760 |     "    \n",
761 |     "    print(\".\", end=\"\")\n",
762 |     "    i += 1\n",
763 |     "    \n",
764 |     "    # This is like before\n",
765 |     "    page = download(url)\n",
766 |     "    tags = scrape_tags(page)\n",
767 |     "    title = scrape_title(page)\n",
768 |     "    all_tags.append(tags)\n",
769 |     "    titles.append(title)\n",
770 |     "\n",
771 |     "    # This is new! \n",
772 |     "    # We scrape links on this page, and add them to the list of URLs\n",
773 |     "    new_urls = scrape_links(page, base_url=\"https://stackoverflow.com\")\n",
774 |     "    urls.extend(new_urls)"
775 |    ]
776 |   },
777 |   {
778 |    "cell_type": "markdown",
779 |    "id": "f0ee92de-c1c2-44e3-ab55-62bb18b5d507",
780 |    "metadata": {},
781 |    "source": [
782 |     "## Exercise: Parallelize code to crawl Stack Overflow\n",
783 |     "\n",
784 |     "Expand the sequential code that we saw above. Parallelize it with futures and `as_completed`." 
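,
    "\n",
    "Before you start, here is the promised simple example of `as_completed` (a minimal sketch using the `inc` helper and `client` from earlier):\n",
    "\n",
    "```python\n",
    "from dask.distributed import as_completed\n",
    "\n",
    "futures = [client.submit(inc, i) for i in range(4)]\n",
    "\n",
    "seq = as_completed(futures)  # called with no arguments it starts empty; seq.add(f) appends later\n",
    "for future in seq:           # yields each future as soon as it finishes\n",
    "    print(future.result())\n",
    "```"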
785 |    ]
786 |   },
787 |   {
788 |    "cell_type": "code",
789 |    "execution_count": null,
790 |    "id": "0f1cd8d9-525d-4a1d-bd2d-4fa518ba71c3",
791 |    "metadata": {},
792 |    "outputs": [],
793 |    "source": [
794 |     "from collections import deque\n",
795 |     "from dask.distributed import as_completed"
796 |    ]
797 |   },
798 |   {
799 |    "cell_type": "code",
800 |    "execution_count": null,
801 |    "id": "ec078790-36ef-4e54-91ce-b91c02b95b37",
802 |    "metadata": {},
803 |    "outputs": [],
804 |    "source": [
805 |     "%%time\n",
806 |     "\n",
807 |     "urls = deque()\n",
808 |     "urls.append(\"https://stackoverflow.com/questions/tagged/dask\")  # seed with a single page\n",
809 |     "\n",
810 |     "all_tags = []\n",
811 |     "titles = []\n",
812 |     "url_futures = as_completed()\n",
813 |     "seen = set()\n",
814 |     "i = 0\n",
815 |     "\n",
816 |     "while (urls or not url_futures.is_empty()) and i < 1000:\n",
817 |     "    \n",
818 |     "    # TODO: If urls is empty, \n",
819 |     "    #    get the next future from url_futures\n",
820 |     "    #    collect those new url results to the local notebook\n",
821 |     "    #    and add those new urls to urls\n",
822 |     "\n",
823 |     "    url = urls.popleft()\n",
824 |     "\n",
825 |     "    if url in seen:\n",
826 |     "        continue\n",
827 |     "    else:\n",
828 |     "        seen.add(url)\n",
829 |     "    \n",
830 |     "    print(\".\", end=\"\")\n",
831 |     "    i += 1\n",
832 |     "\n",
833 |     "    # This is like before\n",
834 |     "    # TODO: Submit this work to happen in parallel\n",
835 |     "    page = download(url, delay=0.25)\n",
836 |     "    tags = scrape_tags(page)\n",
837 |     "    title = scrape_title(page)\n",
838 |     "    \n",
839 |     "    all_tags.append(tags)\n",
840 |     "    titles.append(title)\n",
841 |     "\n",
842 |     "    # We scrape links on this page, and add them to the list of URLs\n",
843 |     "    # TODO: Submit this work to happen in parallel. Add the future to url_futures\n",
844 |     "    new_urls = scrape_links(page, base_url=\"https://stackoverflow.com\")\n",
845 |     "    urls.extend(new_urls)"
846 |    ]
847 |   },
848 |   {
849 |    "cell_type": "markdown",
850 |    "id": "a50258b6-d4a5-47d2-8a0a-739bf68c850d",
851 |    "metadata": {},
852 |    "source": [
853 |     "#### Solution"
854 |    ]
855 |   },
856 |   {
857 |    "cell_type": "code",
858 |    "execution_count": null,
859 |    "id": "92efc708-316c-408a-8806-ac65ec124ec2",
860 |    "metadata": {
861 |     "jupyter": {
862 |      "source_hidden": true
863 |     },
864 |     "tags": []
865 |    },
866 |    "outputs": [],
867 |    "source": [
868 |     "%%time\n",
869 |     "\n",
870 |     "urls = deque()\n",
871 |     "urls.append(\"https://stackoverflow.com/questions/tagged/dask\")  # seed with a single page\n",
872 |     "\n",
873 |     "all_tags = []\n",
874 |     "titles = []\n",
875 |     "url_futures = as_completed()\n",
876 |     "seen = set()\n",
877 |     "i = 0\n",
878 |     "\n",
879 |     "while (urls or not url_futures.is_empty()) and i < 1000:\n",
880 |     "    \n",
881 |     "    # TODO: If urls is empty, \n",
882 |     "    #    get the next future from url_futures\n",
883 |     "    #    collect those new url results to the local notebook\n",
884 |     "    #    and add those new urls to urls\n",
885 |     "    if not urls:\n",
886 |     "        future = url_futures.next()\n",
887 |     "        new_urls = future.result()\n",
888 |     "        urls.extend(new_urls)\n",
889 |     "        continue\n",
890 |     "    \n",
891 |     "    url = urls.popleft()\n",
892 |     "    \n",
893 |     "    if url in seen:\n",
894 |     "        continue\n",
895 |     "    else:\n",
896 |     "        seen.add(url)\n",
897 |     "    \n",
898 |     "    print(\".\", end=\"\")\n",
899 |     "    i += 1\n",
900 |     "\n",
901 |     "    # This is like before\n",
902 |     "    # TODO: Submit this work to happen in parallel\n",
903 |     "    page = client.submit(download, url, delay=0.25)\n",
904 |     "    tags = client.submit(scrape_tags, page)\n",
905 |     "    title = 
client.submit(scrape_title, page)\n", 906 | "\n", 907 | " all_tags.append(tags)\n", 908 | " titles.append(title)\n", 909 | " \n", 910 | " # We scrape links on this page, and add them to the list of URLs\n", 911 | " # TODO: Submit this work to happen in parallel. Add the future to url_futures\n", 912 | " new_urls = client.submit(scrape_links, page, base_url=\"https://stackoverflow.com\")\n", 913 | " url_futures.add(new_urls)" 914 | ] 915 | }, 916 | { 917 | "cell_type": "markdown", 918 | "id": "3dcda5db-d6eb-4b3c-9140-e45254953029", 919 | "metadata": {}, 920 | "source": [ 921 | "### Analyze results\n", 922 | "\n", 923 | "At this point you likely have lists `titles` and `all_tags` that are lists of futures. Let's gather them and analyze results." 924 | ] 925 | }, 926 | { 927 | "cell_type": "code", 928 | "execution_count": null, 929 | "id": "02d12830-10d1-4fca-a6c7-4ca6ecc5ce51", 930 | "metadata": {}, 931 | "outputs": [], 932 | "source": [ 933 | "titles = client.gather(titles)" 934 | ] 935 | }, 936 | { 937 | "cell_type": "code", 938 | "execution_count": null, 939 | "id": "f4dcb1d1-826f-4957-8711-2c6b3058afa4", 940 | "metadata": {}, 941 | "outputs": [], 942 | "source": [ 943 | "len(titles)" 944 | ] 945 | }, 946 | { 947 | "cell_type": "code", 948 | "execution_count": null, 949 | "id": "e81c4b16-b2ba-4234-980f-e4af10c21779", 950 | "metadata": {}, 951 | "outputs": [], 952 | "source": [ 953 | "titles[:20]" 954 | ] 955 | }, 956 | { 957 | "cell_type": "code", 958 | "execution_count": null, 959 | "id": "a0b1923a-a4f7-452b-9801-3e83fb473fd2", 960 | "metadata": {}, 961 | "outputs": [], 962 | "source": [ 963 | "all_tags = client.gather(all_tags)" 964 | ] 965 | }, 966 | { 967 | "cell_type": "code", 968 | "execution_count": null, 969 | "id": "98cd8656-90af-4381-beb5-58ad12cc7506", 970 | "metadata": {}, 971 | "outputs": [], 972 | "source": [ 973 | "import collections\n", 974 | "\n", 975 | "tag_counter = collections.defaultdict(int)\n", 976 | "\n", 977 | "for tags in all_tags:\n", 978 | " for tag in tags:\n", 979 | " tag_counter[tag] += 1\n", 980 | " \n", 981 | "sorted(tag_counter.items(), key=lambda kv: kv[1], reverse=True)[:20]" 982 | ] 983 | }, 984 | { 985 | "cell_type": "markdown", 986 | "id": "39486ea8-930c-46fe-a0a9-f88b9e8ad940", 987 | "metadata": {}, 988 | "source": [ 989 | "## Clean up" 990 | ] 991 | }, 992 | { 993 | "cell_type": "code", 994 | "execution_count": null, 995 | "id": "a852e4ac-4a15-461a-9d59-5bb65115ccf2", 996 | "metadata": {}, 997 | "outputs": [], 998 | "source": [ 999 | "cluster.shutdown()\n", 1000 | "client.close()" 1001 | ] 1002 | }, 1003 | { 1004 | "cell_type": "markdown", 1005 | "id": "40d6a9cd-2d15-4ef4-9d8b-4f6ada4d9904", 1006 | "metadata": {}, 1007 | "source": [ 1008 | "### Useful links\n", 1009 | "\n", 1010 | "- https://tutorial.dask.org/05_futures.html\n", 1011 | "- [Futures documentation](https://docs.dask.org/en/latest/futures.html)\n", 1012 | "- [Futures screencast](https://www.youtube.com/watch?v=07EiCpdhtDE)\n", 1013 | "- [Futures examples](https://examples.dask.org/futures.html)\n", 1014 | "\n", 1015 | "### More Dask Tutorials\n", 1016 | "\n", 1017 | "Coiled also runs regular Dask tutorials. See [coiled.io/tutorials](https://www.coiled.io/tutorials) for more information. 
\n" 1018 | ] 1019 | } 1020 | ], 1021 | "metadata": { 1022 | "kernelspec": { 1023 | "display_name": "Python 3 (ipykernel)", 1024 | "language": "python", 1025 | "name": "python3" 1026 | }, 1027 | "language_info": { 1028 | "codemirror_mode": { 1029 | "name": "ipython", 1030 | "version": 3 1031 | }, 1032 | "file_extension": ".py", 1033 | "mimetype": "text/x-python", 1034 | "name": "python", 1035 | "nbconvert_exporter": "python", 1036 | "pygments_lexer": "ipython3", 1037 | "version": "3.10.0" 1038 | } 1039 | }, 1040 | "nbformat": 4, 1041 | "nbformat_minor": 5 1042 | } 1043 | -------------------------------------------------------------------------------- /tutorials-v2/2-Get_better-at-dask-dataframes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "d3f8ffbf-9dc9-42d4-9482-c7fb9e4333a9", 6 | "metadata": {}, 7 | "source": [ 8 | "\"Coiled\n", 12 | "\n", 13 | "### Sign up for the next live session https://www.coiled.io/tutorials" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "id": "036478b0-c89d-4051-a513-f4436f48968f", 19 | "metadata": {}, 20 | "source": [ 21 | "\"Dask\n", 25 | "\n", 26 | "# Get better at Dask Dataframes\n", 27 | "\n", 28 | "In this lesson, you will learn the advantages of working with the parquet data format and best practices when working with big data. You will learn how to manipulate inconvenient file sizes and datatypes, as well as how to make your data easier to manipulate. You will be exploring the Uber/Lyft dataset and learning some key practices of feature engineering with Dask Dataframes.\n", 29 | "\n", 30 | "## Dask Dataframes \n", 31 | "\n", 32 | "\"Dask\n", 36 | "\n", 37 | "At its core, the `dask.dataframe` module implements a \"blocked parallel\" `DataFrame` object that looks and feels like the `pandas` API, but for parallel and distributed workflows. One Dask `DataFrame` is comprised of many in-memory pandas `DataFrame`s separated along the index. One operation on a Dask `DataFrame` triggers many pandas operations on the constituent pandas `DataFrame`s in a way that is mindful of potential parallelism and memory constraints.\n", 38 | "\n", 39 | "Dask dataframes are very useful, but getting the most out of them can be tricky. Where your data is stored, the format your data was saved in, the size of each file and the data types, are some examples of things you need to care when it comes to working with dataframes. " 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "id": "4307fa56-f051-467b-bf1f-e3acae08e3a8", 45 | "metadata": {}, 46 | "source": [ 47 | "### Work close to your data\n", 48 | "\n", 49 | "To get started when you are working with data that is in the cloud it's always better to work close to your data to minimize the impact of IO networking. \n", 50 | "\n", 51 | "In this lesson, we will use Coiled Clusters that will be created on the same region that our datasets are stored. (the region is `\"us-east-2\"`)\n" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "id": "7a52c65b-f0bc-4341-9841-4de07f710dce", 57 | "metadata": {}, 58 | "source": [ 59 | "## Parquet vs CSV\n", 60 | "\n", 61 | "Most people are familiarized with **csv** files, but when it comes to working with data, working with **parquet** can make a big difference. \n", 62 | "\n", 63 | "### Parquet is where it's at!!\n", 64 | "\n", 65 | "The Parquet file format is column-oriented and it is designed to efficiently store and retrieve data. 
66 |     "\n",
67 |     "<img\n",
68 |     "     src=\"https://raw.githubusercontent.com/coiled/dask-tutorial/main/images/storage-files.png\"\n",
69 |     "     alt=\"Row-oriented vs column-oriented storage\"\n",
70 |     ">\n",
71 |     " \n",
72 |     " \n",
73 |     "- **Column pruning:** Parquet lets you read specific columns from a dataset without reading the entire file.\n",
74 |     "- **Better compression:** Because the data types within each column are fairly similar, each column compresses well. (This saves on storage.)\n",
75 |     "- **Schema:** Parquet stores the file schema in the file metadata.\n",
76 |     "- **Column metadata:** Parquet stores metadata statistics for each column, which can make certain types of queries a lot more efficient.\n",
77 |     "\n",
78 |     " "
79 |    ]
80 |   },
81 |   {
82 |    "cell_type": "code",
83 |    "execution_count": null,
84 |    "id": "e205e7c7-1d13-4f16-9035-ade902f43fad",
85 |    "metadata": {},
86 |    "outputs": [],
87 |    "source": [
88 |     "### coiled login\n",
89 |     "#!coiled login --token ### --account dask-tutorials"
90 |    ]
91 |   },
92 |   {
93 |    "cell_type": "code",
94 |    "execution_count": null,
95 |    "id": "bea1905f-09aa-4865-844e-d752639e29c2",
96 |    "metadata": {},
97 |    "outputs": [],
98 |    "source": [
99 |     "import coiled\n",
100 |     "import dask\n",
101 |     "import dask.dataframe as dd\n",
102 |     "from dask.distributed import Client"
103 |    ]
104 |   },
105 |   {
106 |    "cell_type": "code",
107 |    "execution_count": null,
108 |    "id": "f1162a7b-969a-46c7-a6b2-bfef92cade59",
109 |    "metadata": {},
110 |    "outputs": [],
111 |    "source": [
112 |     "# we use this to avoid re-using clusters on a team\n",
113 |     "import uuid\n",
114 |     "\n",
115 |     "id_cluster = uuid.uuid4().hex[:4]"
116 |    ]
117 |   },
118 |   {
119 |    "cell_type": "markdown",
120 |    "id": "8e323f48-9bbb-46bd-af08-7d6ded1ccead",
121 |    "metadata": {},
122 |    "source": [
123 |     "## Uber/Lyft data transformation\n",
124 |     "\n",
125 |     "The NYC Taxi dataset is a timeless classic.\n",
126 |     "\n",
127 |     "The NYC Taxi and Limousine Commission (TLC) has data from all ride-share services in the city of New York. This includes private limousine services, van services, and a new category, \"High Volume For Hire Vehicle\" services: those that dispatch 10,000 rides per day or more. This is a special category defined for Uber and Lyft.\n",
128 |     "\n",
129 |     "Let's use the Uber/Lyft dataset as an example of a `parquet` dataset to learn how to troubleshoot the nuances of working with real data. 
The data comes from [High-Volume For-Hire Services](https://www.nyc.gov/site/tlc/businesses/high-volume-for-hire-services.page)\n", 130 | "\n", 131 | "_Data dictionary:_\n", 132 | "\n", 133 | "https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_hvfhs.pdf" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "id": "3c9ab02d-f58b-48a3-bd1e-f04ab40697b5", 139 | "metadata": {}, 140 | "source": [ 141 | "### Let's get a cluster" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "id": "3476887b-5643-4e4a-9cdb-d2488692cdf7", 148 | "metadata": {}, 149 | "outputs": [], 150 | "source": [ 151 | "%%time\n", 152 | "cluster = coiled.Cluster(\n", 153 | " name=f\"uber-lyft-{id_cluster}\",\n", 154 | " n_workers=20,\n", 155 | " account=\"dask-tutorials\",\n", 156 | " worker_vm_types=[\"m6i.xlarge\"],\n", 157 | " backend_options={\"region_name\": \"us-east-2\"},\n", 158 | ")" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": null, 164 | "id": "304ba45b-5604-4bdd-8a33-7433cd9df26d", 165 | "metadata": {}, 166 | "outputs": [], 167 | "source": [ 168 | "client = Client(cluster)\n", 169 | "client" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "id": "aed8c836-0a39-4219-b85c-ee63019af46c", 175 | "metadata": {}, 176 | "source": [ 177 | "### Explore the data\n", 178 | "\n", 179 | "We have a public version of this data set that is ready to use to get some insights, at\n", 180 | "`\"s3://coiled-datasets/uber-lyft-tlc/\"`" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": null, 186 | "id": "b76f77fb-67bd-4c3e-a6aa-3698365a0dd7", 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "dask.config.set({\"dataframe.convert-string\": True}) #Use PyArrow strings" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": null, 196 | "id": "15530dc1-703e-4491-9eec-c391cd663a34", 197 | "metadata": {}, 198 | "outputs": [], 199 | "source": [ 200 | "df = dd.read_parquet(\n", 201 | " \"s3://coiled-datasets/uber-lyft-tlc/\" \n", 202 | ")" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "id": "2e2900ba-cbdf-46fc-9ee4-9fbc69ea881c", 209 | "metadata": {}, 210 | "outputs": [], 211 | "source": [ 212 | "df.dtypes" 213 | ] 214 | }, 215 | { 216 | "cell_type": "markdown", 217 | "id": "80eb7dc4-e84e-40b9-9842-dad6670dad3b", 218 | "metadata": {}, 219 | "source": [ 220 | "## Memory usage \n", 221 | "\n", 222 | "```python\n", 223 | "dask.utils.format_bytes(\n", 224 | " df.memory_usage(deep=True).sum().compute()\n", 225 | ")\n", 226 | "```\n", 227 | "'82.81 GiB'\n" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": null, 233 | "id": "834816b0-5899-43a8-853e-ffc91a0a7da8", 234 | "metadata": {}, 235 | "outputs": [], 236 | "source": [ 237 | "df.head()" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": null, 243 | "id": "dff7b58b-97c5-4e06-9fdf-f2fe4fc979a3", 244 | "metadata": {}, 245 | "outputs": [], 246 | "source": [ 247 | "df.columns" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": null, 253 | "id": "d52ca6d7-4030-4103-b128-4263f95044c7", 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [ 257 | "# len(df)\n", 258 | "## 783_431_901" 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": null, 264 | "id": "cdcd250f-0a9f-4268-b90e-6a4f409a3f93", 265 | "metadata": {}, 266 | "outputs": [], 267 | "source": [ 268 | "#We have enough 
memory, so we persist the dataset\n",
269 |     "df = df.persist()"
270 |    ]
271 |   },
272 |   {
273 |    "cell_type": "markdown",
274 |    "id": "07a108e9-984c-498e-8c59-99cd7a269370",
275 |    "metadata": {},
276 |    "source": [
277 |     "### Get some insights\n",
278 |     "\n",
279 |     "We assume you know pandas, so use pandas syntax, adding `.compute()` at the end, to compute the following quantities. "
280 |    ]
281 |   },
282 |   {
283 |    "cell_type": "markdown",
284 |    "id": "a5bb5f87-55bf-4b3b-b63a-8c37e1fcddda",
285 |    "metadata": {},
286 |    "source": [
287 |     "How much did New Yorkers pay Uber/Lyft? Sum the `base_passenger_fare` column."
288 |    ]
289 |   },
290 |   {
291 |    "cell_type": "code",
292 |    "execution_count": null,
293 |    "id": "447931fd-eb9c-450a-be0f-5c6c9c18f745",
294 |    "metadata": {},
295 |    "outputs": [],
296 |    "source": []
297 |   },
298 |   {
299 |    "cell_type": "code",
300 |    "execution_count": null,
301 |    "id": "af9c909b-2264-456e-b495-bc76cc974464",
302 |    "metadata": {
303 |     "jupyter": {
304 |      "source_hidden": true
305 |     },
306 |     "tags": []
307 |    },
308 |    "outputs": [],
309 |    "source": [
310 |     "# solution\n",
311 |     "df.base_passenger_fare.sum().compute() / 1e9"
312 |    ]
313 |   },
314 |   {
315 |    "cell_type": "markdown",
316 |    "id": "7cdd5225-0479-4277-a767-1b43bb853459",
317 |    "metadata": {},
318 |    "source": [
319 |     "How much did Uber/Lyft pay drivers?"
320 |    ]
321 |   },
322 |   {
323 |    "cell_type": "code",
324 |    "execution_count": null,
325 |    "id": "8cf105ff-e366-468d-b736-d530d798041d",
326 |    "metadata": {},
327 |    "outputs": [],
328 |    "source": []
329 |   },
330 |   {
331 |    "cell_type": "code",
332 |    "execution_count": null,
333 |    "id": "68bbfa3d-7a7e-4c2b-b84d-7653daa20640",
334 |    "metadata": {
335 |     "jupyter": {
336 |      "source_hidden": true
337 |     },
338 |     "tags": []
339 |    },
340 |    "outputs": [],
341 |    "source": [
342 |     "# solution\n",
343 |     "df.driver_pay.sum().compute() / 1e9"
344 |    ]
345 |   },
346 |   {
347 |    "cell_type": "markdown",
348 |    "id": "d0e16825-8598-4cf0-bde0-db419f17e1ac",
349 |    "metadata": {},
350 |    "source": [
351 |     "How much did Uber/Lyft drivers make in tips?"
352 |    ]
353 |   },
354 |   {
355 |    "cell_type": "code",
356 |    "execution_count": null,
357 |    "id": "f78a96c9-3d7f-44c6-8c52-f56a21a80460",
358 |    "metadata": {},
359 |    "outputs": [],
360 |    "source": []
361 |   },
362 |   {
363 |    "cell_type": "code",
364 |    "execution_count": null,
365 |    "id": "ed858f9e-c6e3-44da-a6d5-296f985c9aa0",
366 |    "metadata": {
367 |     "jupyter": {
368 |      "source_hidden": true
369 |     },
370 |     "tags": []
371 |    },
372 |    "outputs": [],
373 |    "source": [
374 |     "# solution\n",
375 |     "df.tips.sum().compute() / 1e6"
376 |    ]
377 |   },
378 |   {
379 |    "cell_type": "markdown",
380 |    "id": "3794ddc5-77fc-4d82-bd37-d02eb8ea6a12",
381 |    "metadata": {},
382 |    "source": [
383 |     "### Are New Yorkers tippers? \n",
384 |     "\n",
385 |     "Let's make our data set smaller and create a column that is a Yes/No for the tip. 
" 386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": null, 391 | "id": "b1262fb9-adaa-4e23-9b5f-7e50e62128ce", 392 | "metadata": {}, 393 | "outputs": [], 394 | "source": [ 395 | "%%time\n", 396 | "##let's count to see NaN\n", 397 | "df.count().compute()" 398 | ] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": null, 403 | "id": "eb55c7c8-bb9b-4fe5-9729-6b91f263a3ac", 404 | "metadata": { 405 | "tags": [] 406 | }, 407 | "outputs": [], 408 | "source": [ 409 | "# Create a column tip > 0 = True\n", 410 | "df[\"tip_flag\"] = df.tips > 0\n", 411 | "\n", 412 | "df = df[\n", 413 | " [\n", 414 | " \"hvfhs_license_num\",\n", 415 | " \"tips\",\n", 416 | " \"base_passenger_fare\",\n", 417 | " \"driver_pay\",\n", 418 | " \"trip_miles\",\n", 419 | " \"trip_time\",\n", 420 | " \"shared_request_flag\",\n", 421 | " \"tip_flag\",\n", 422 | " ]\n", 423 | "].persist()" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": null, 429 | "id": "54707176-6ddf-4c3c-9870-7cdfd03f8335", 430 | "metadata": {}, 431 | "outputs": [], 432 | "source": [ 433 | "df.head()" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": null, 439 | "id": "2fcaad19-5ad8-4ce8-a5e6-669a802c37e3", 440 | "metadata": {}, 441 | "outputs": [], 442 | "source": [ 443 | "df.columns" 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "id": "bba714f3-2a98-4e2d-83b2-bfa5516ae95f", 449 | "metadata": {}, 450 | "source": [ 451 | "### Exercise\n", 452 | "\n", 453 | "What percentage of rides received a tip?" 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": null, 459 | "id": "61cce3d7-c954-4b14-8f37-75b0eaff867a", 460 | "metadata": {}, 461 | "outputs": [], 462 | "source": [] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": null, 467 | "id": "4c625daa-c794-4853-96bc-9a6685c36d26", 468 | "metadata": { 469 | "jupyter": { 470 | "source_hidden": true 471 | }, 472 | "tags": [] 473 | }, 474 | "outputs": [], 475 | "source": [ 476 | "#solution\n", 477 | "tip_count = df[\"tip_flag\"].value_counts().compute()\n", 478 | "\n", 479 | "perc_trip_tips = tip_count[True] * 100 / (tip_count[True] + tip_count[False])\n", 480 | "perc_trip_tips" 481 | ] 482 | }, 483 | { 484 | "cell_type": "markdown", 485 | "id": "fc2110ff-d2fe-4a57-8556-f8381cd12d48", 486 | "metadata": {}, 487 | "source": [ 488 | "### How many trips have tip by provider?" 
489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": null, 494 | "id": "ac3192f3-7913-41e6-9b7b-ed01bb8a96c5", 495 | "metadata": {}, 496 | "outputs": [], 497 | "source": [ 498 | "tip_by_provider = df.groupby([\"hvfhs_license_num\"]).tip_flag.value_counts().compute()" 499 | ] 500 | }, 501 | { 502 | "cell_type": "code", 503 | "execution_count": null, 504 | "id": "13062d36-4168-41b7-b2c5-d6c8e62de44e", 505 | "metadata": {}, 506 | "outputs": [], 507 | "source": [ 508 | "tip_by_provider" 509 | ] 510 | }, 511 | { 512 | "cell_type": "markdown", 513 | "id": "fac3870d-0b82-47e9-856b-e15e77cf3425", 514 | "metadata": { 515 | "tags": [] 516 | }, 517 | "source": [ 518 | "**From the data dictionary we know:**\n", 519 | "\n", 520 | "As of September 2019, the HVFHS licenses are the following:\n", 521 | "\n", 522 | "- HV0002: Juno \n", 523 | "- HV0003: Uber \n", 524 | "- HV0004: Via \n", 525 | "- HV0005: Lyft " 526 | ] 527 | }, 528 | { 529 | "cell_type": "code", 530 | "execution_count": null, 531 | "id": "28c54f0e-0526-447e-8f5c-19fc6e6b1962", 532 | "metadata": {}, 533 | "outputs": [], 534 | "source": [ 535 | "type(tip_by_provider)" 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": null, 541 | "id": "fd1d5b68-5e2e-487d-9db7-b1bc0c5b8e2d", 542 | "metadata": {}, 543 | "outputs": [], 544 | "source": [ 545 | "## this is a pandas\n", 546 | "tip_by_provider = tip_by_provider.unstack(level=\"tip_flag\")\n", 547 | "tip_by_provider / 1e6" 548 | ] 549 | }, 550 | { 551 | "cell_type": "markdown", 552 | "id": "3c42ee9b-b7db-4fde-8125-84c5aca7483c", 553 | "metadata": {}, 554 | "source": [ 555 | "### sum and mean of tips by provider " 556 | ] 557 | }, 558 | { 559 | "cell_type": "code", 560 | "execution_count": null, 561 | "id": "9b10f437-ba9d-41d3-bc43-0d4e96e9dcdb", 562 | "metadata": {}, 563 | "outputs": [], 564 | "source": [ 565 | "tips_total = (\n", 566 | " df.loc[lambda x: x.tip_flag]\n", 567 | " .groupby(\"hvfhs_license_num\")\n", 568 | " .tips.agg([\"sum\", \"mean\"])\n", 569 | " .compute()\n", 570 | ")\n", 571 | "tips_total" 572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "execution_count": null, 577 | "id": "ef8c80f8-0d20-4bf4-a4ee-fa4a5325ce26", 578 | "metadata": {}, 579 | "outputs": [], 580 | "source": [ 581 | "provider = {\"HV0002\": \"Juno\", \"HV0005\": \"Lyft\", \"HV0003\": \"Uber\", \"HV0004\": \"Via\"}" 582 | ] 583 | }, 584 | { 585 | "cell_type": "code", 586 | "execution_count": null, 587 | "id": "776e2689-133d-4cfb-bd08-0ca34ee05f31", 588 | "metadata": {}, 589 | "outputs": [], 590 | "source": [ 591 | "tips_total = tips_total.assign(provider=lambda df: df.index.map(provider)).set_index(\n", 592 | " \"provider\"\n", 593 | ")\n", 594 | "tips_total" 595 | ] 596 | }, 597 | { 598 | "cell_type": "markdown", 599 | "id": "e63f9aad-c1e0-49c6-9de2-8d99e4bd44e2", 600 | "metadata": {}, 601 | "source": [ 602 | "### What percentage of the passenger fare is the tip?" 
603 | ] 604 | }, 605 | { 606 | "cell_type": "markdown", 607 | "id": "8d089e27-053a-4f8c-820a-0fedb4d232ea", 608 | "metadata": {}, 609 | "source": [ 610 | "### Exercise\n", 611 | "\n", 612 | "Create a new column named \"tip_percentage\" that represents what fraction of the passenger fare the tip is." 613 | ] 614 | }, 615 | { 616 | "cell_type": "code", 617 | "execution_count": null, 618 | "id": "0c03b374-d006-4b36-9de7-d583ba0f8e90", 619 | "metadata": {}, 620 | "outputs": [], 621 | "source": [] 622 | }, 623 | { 624 | "cell_type": "code", 625 | "execution_count": null, 626 | "id": "cec9faaf-f329-4ebb-8e60-d1fe95d22f2c", 627 | "metadata": { 628 | "jupyter": { 629 | "source_hidden": true 630 | }, 631 | "tags": [] 632 | }, 633 | "outputs": [], 634 | "source": [ 635 | "# solution\n", 636 | "tip_percentage = df.tips / df.base_passenger_fare\n", 637 | "df[\"tip_percentage\"] = tip_percentage" 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": null, 643 | "id": "5b14e09d-7a4b-454f-ab39-12cbcd5d7ef4", 644 | "metadata": {}, 645 | "outputs": [], 646 | "source": [ 647 | "df = df.persist()" 648 | ] 649 | }, 650 | { 651 | "cell_type": "markdown", 652 | "id": "6c5aaa54-f344-464e-90af-4270c1792528", 653 | "metadata": {}, 654 | "source": [ 655 | "## Mean tip percentage per trip" 656 | ] 657 | }, 658 | { 659 | "cell_type": "code", 660 | "execution_count": null, 661 | "id": "0ab6fa5d-1b7b-4cd6-aa02-f5986784cd49", 662 | "metadata": {}, 663 | "outputs": [], 664 | "source": [ 665 | "tips_perc_mean = (\n", 666 | " df.loc[lambda x: x.tip_flag]\n", 667 | " .groupby(\"hvfhs_license_num\")\n", 668 | " .tip_percentage.mean()\n", 669 | " .compute()\n", 670 | ")\n", 671 | "tips_perc_mean" 672 | ] 673 | }, 674 | { 675 | "cell_type": "code", 676 | "execution_count": null, 677 | "id": "c16d4260-46a4-4a5a-842d-b80f3e0cc76b", 678 | "metadata": {}, 679 | "outputs": [], 680 | "source": [ 681 | "(tips_perc_mean.to_frame().set_index(tips_perc_mean.index.map(provider)))" 682 | ] 683 | }, 684 | { 685 | "cell_type": "markdown", 686 | "id": "5508b41a-e186-4af6-8a35-2a659c3a6975", 687 | "metadata": {}, 688 | "source": [ 689 | "### Get insight on the data\n", 690 | "\n", 691 | "We are seeing some weird numbers. Let's take a deeper look and remove some outliers." 692 | ] 693 | }, 694 | { 695 | "cell_type": "code", 696 | "execution_count": null, 697 | "id": "15df0aab-84fb-45bc-af6d-d2f9892b38be", 698 | "metadata": { 699 | "tags": [] 700 | }, 701 | "outputs": [], 702 | "source": [ 703 | "(\n", 704 | " df[[\"trip_miles\", \"base_passenger_fare\", \"tips\", \"tip_flag\"]]\n", 705 | " .loc[lambda x: x.tip_flag]\n", 706 | " .describe()\n", 707 | " .compute()\n", 708 | ")" 709 | ] 710 | }, 711 | { 712 | "cell_type": "markdown", 713 | "id": "5e970925-c96d-40e5-a906-176a6f69a44c", 714 | "metadata": {}, 715 | "source": [ 716 | "### Getting to know the data\n", 717 | "\n", 718 | "- How would you get more insights on the data?\n", 719 | "- Can you visualize it?\n", 720 | "\n", 721 | "**Hint:** Get a small sample, e.g. 0.1% of the data (~700_000 rows; go smaller if needed depending on your machine), compute it, and work with that pandas dataframe."
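, "\n", "Once the sample is an in-memory pandas dataframe, any pandas/matplotlib plot works. A quick sketch (assuming the sample is named `df_sample`, as in the next cells):\n", "\n", "```python\n", "df_sample.tips.plot.hist(bins=50)  # distribution of tip amounts in the sample\n", "```"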
722 | ] 723 | }, 724 | { 725 | "cell_type": "code", 726 | "execution_count": null, 727 | "id": "21a73002-ab6b-4627-b5af-7ecdeedd0d78", 728 | "metadata": { 729 | "tags": [] 730 | }, 731 | "outputs": [], 732 | "source": [ 733 | "# needed to avoid plots from breaking\n", 734 | "%matplotlib inline" 735 | ] 736 | }, 737 | { 738 | "cell_type": "code", 739 | "execution_count": null, 740 | "id": "f1abd1c5-8d60-4772-b1a9-4a180790c0d8", 741 | "metadata": { 742 | "tags": [] 743 | }, 744 | "outputs": [], 745 | "source": [ 746 | "## Take a sample\n", 747 | "df_sample = (\n", 748 | " df.loc[lambda x: x.tip_flag][[\"trip_miles\", \"base_passenger_fare\", \"tips\"]]\n", 749 | " .sample(frac=0.001)\n", 750 | " .compute()\n", 751 | ")" 752 | ] 753 | }, 754 | { 755 | "cell_type": "code", 756 | "execution_count": null, 757 | "id": "a46b9993-3009-4e5d-a0e4-2c8368ec0613", 758 | "metadata": { 759 | "tags": [] 760 | }, 761 | "outputs": [], 762 | "source": [ 763 | "# box plot\n", 764 | "df_sample.boxplot()" 765 | ] 766 | }, 767 | { 768 | "cell_type": "markdown", 769 | "id": "500530b7-0f24-4566-b234-1f1a5acc2cc3", 770 | "metadata": {}, 771 | "source": [ 772 | "### Cleaning up outliers\n", 773 | "\n", 774 | "- Play with the pandas dataframe `df_sample` to get insights on good filters for the bigger dataframe. \n", 775 | "\n", 776 | "Hint: think about pandas dataframe quantiles [docs here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html)" 777 | ] 778 | }, 779 | { 780 | "cell_type": "code", 781 | "execution_count": null, 782 | "id": "0d674a25-104a-4506-8602-25e214023954", 783 | "metadata": { 784 | "tags": [] 785 | }, 786 | "outputs": [], 787 | "source": [ 788 | "df_sample.tips.quantile([0.25, 0.75])" 789 | ] 790 | }, 791 | { 792 | "cell_type": "markdown", 793 | "id": "d552cdcf-2ea9-4443-a9a5-b55523e33e85", 794 | "metadata": {}, 795 | "source": [ 796 | "### Exercise\n", 797 | "\n", 798 | "Calculate the first and third quartiles (the 0.25 and 0.75 quantiles) for `base_passenger_fare` and `trip_miles`." 799 | ] 800 | }, 801 | { 802 | "cell_type": "code", 803 | "execution_count": null, 804 | "id": "7c5673e2-072f-46e1-abcf-efd2721c1051", 805 | "metadata": {}, 806 | "outputs": [], 807 | "source": [] 808 | }, 809 | { 810 | "cell_type": "code", 811 | "execution_count": null, 812 | "id": "905ed15c-106a-42eb-bfee-59f6d12b84d6", 813 | "metadata": { 814 | "jupyter": { 815 | "source_hidden": true 816 | }, 817 | "tags": [] 818 | }, 819 | "outputs": [], 820 | "source": [ 821 | "# solution\n", 822 | "df_sample.base_passenger_fare.quantile([0.25, 0.75])" 823 | ] 824 | }, 825 | { 826 | "cell_type": "code", 827 | "execution_count": null, 828 | "id": "f08228ab-5f07-460a-b268-d176e77aef94", 829 | "metadata": { 830 | "jupyter": { 831 | "source_hidden": true 832 | }, 833 | "tags": [] 834 | }, 835 | "outputs": [], 836 | "source": [ 837 | "# solution\n", 838 | "df_sample.trip_miles.quantile([0.25, 0.75])" 839 | ] 840 | }, 841 | { 842 | "cell_type": "markdown", 843 | "id": "b0ee5c4c-3423-443c-8ce9-d0c597aa0758", 844 | "metadata": {}, 845 | "source": [ 846 | "### Conditions to filter the dataset\n", 847 | "\n", 848 | "We can use Q1 and Q3 to create conditions to filter the dataset." 849 | ] 850 | }, 851 | { 852 | "cell_type": "code", 853 | "execution_count": null, 854 | "id": "8fcd0e94-2c7a-423d-a410-e43eedecb3f7", 855 | "metadata": {}, 856 | "outputs": [], 857 | "source": [ 858 | "tips_filter_vals = df_sample.tips.quantile([0.25, 0.75]).values\n", 859 | "tips_condition = df_sample.tips.between(*tips_filter_vals)\n", 860 | 
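"# note: between() is inclusive on both ends, so this keeps Q1 <= tips <= Q3"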
] 861 | }, 862 | { 863 | "cell_type": "code", 864 | "execution_count": null, 865 | "id": "1133eddf-b478-4f61-9889-3f41e9050689", 866 | "metadata": { 867 | "tags": [] 868 | }, 869 | "outputs": [], 870 | "source": [ 871 | "tips_condition" 872 | ] 873 | }, 874 | { 875 | "cell_type": "markdown", 876 | "id": "da2dfb9b-af06-4183-8ea5-a30dc02bdd5f", 877 | "metadata": {}, 878 | "source": [ 879 | "### Exercise\n", 880 | "\n", 881 | "Create filter conditions for the `base_passenger_fare` and `trip_miles`" 882 | ] 883 | }, 884 | { 885 | "cell_type": "code", 886 | "execution_count": null, 887 | "id": "56cd321b-2aae-473b-b53b-5225b4f894ce", 888 | "metadata": {}, 889 | "outputs": [], 890 | "source": [] 891 | }, 892 | { 893 | "cell_type": "code", 894 | "execution_count": null, 895 | "id": "89d5bdc4-1685-471d-898e-9ed239ad281b", 896 | "metadata": { 897 | "jupyter": { 898 | "source_hidden": true 899 | }, 900 | "tags": [] 901 | }, 902 | "outputs": [], 903 | "source": [ 904 | "## Solution\n", 905 | "fare_filter_vals = df_sample.base_passenger_fare.quantile([0.25, 0.75]).values\n", 906 | "fares_condition = df_sample.base_passenger_fare.between(*fare_filter_vals)\n", 907 | "\n", 908 | "miles_filter_vals = df_sample.trip_miles.quantile([0.25, 0.75]).values\n", 909 | "miles_condition = df_sample.trip_miles.between(*miles_filter_vals)" 910 | ] 911 | }, 912 | { 913 | "cell_type": "markdown", 914 | "id": "2dacba7f-bb18-4027-99e2-43a7c91ebd21", 915 | "metadata": {}, 916 | "source": [ 917 | "### Filter dataframe and plot" 918 | ] 919 | }, 920 | { 921 | "cell_type": "code", 922 | "execution_count": null, 923 | "id": "6a82012b-ee3a-4df9-b24b-877f114a207a", 924 | "metadata": { 925 | "tags": [] 926 | }, 927 | "outputs": [], 928 | "source": [ 929 | "df_sample.loc[(tips_condition & fares_condition) & miles_condition].boxplot()" 930 | ] 931 | }, 932 | { 933 | "cell_type": "markdown", 934 | "id": "d1337177-5c12-4f73-bc68-f9156c1bbc6c", 935 | "metadata": {}, 936 | "source": [ 937 | "## Filtering our big dataset based on the insights\n", 938 | "\n", 939 | "Based on these numbers let's go back to our `df` dataset and try to filter it.\n" 940 | ] 941 | }, 942 | { 943 | "cell_type": "code", 944 | "execution_count": null, 945 | "id": "75a5a9be-5e13-4f0d-8c76-13ab7caea857", 946 | "metadata": {}, 947 | "outputs": [], 948 | "source": [ 949 | "tips_condition = df.tips.between(*tips_filter_vals)\n", 950 | "miles_condition = df.trip_miles.between(*miles_filter_vals)\n", 951 | "fares_condition = df.base_passenger_fare.between(*fare_filter_vals)" 952 | ] 953 | }, 954 | { 955 | "cell_type": "code", 956 | "execution_count": null, 957 | "id": "879314c1-b9de-483f-af5d-59347db07f61", 958 | "metadata": {}, 959 | "outputs": [], 960 | "source": [ 961 | "df = df.loc[(tips_condition & fares_condition) & miles_condition].persist()" 962 | ] 963 | }, 964 | { 965 | "cell_type": "markdown", 966 | "id": "e4933836-e833-4787-b561-f2d0bb5f76ee", 967 | "metadata": {}, 968 | "source": [ 969 | "### Let's look at the `tip_percentage` again" 970 | ] 971 | }, 972 | { 973 | "cell_type": "markdown", 974 | "id": "aded0737-d5bf-4181-b9aa-2c5c72435d06", 975 | "metadata": {}, 976 | "source": [ 977 | "### Exercise \n", 978 | "Compute the `tip_percentage` mean by provider " 979 | ] 980 | }, 981 | { 982 | "cell_type": "code", 983 | "execution_count": null, 984 | "id": "c6db8ed3-3aab-4e98-99aa-5379b3f1f0a0", 985 | "metadata": {}, 986 | "outputs": [], 987 | "source": [] 988 | }, 989 | { 990 | "cell_type": "code", 991 | "execution_count": null, 992 | "id": 
"9e745175-5725-4b92-a58d-4f870a3d3ace", 993 | "metadata": { 994 | "jupyter": { 995 | "source_hidden": true 996 | }, 997 | "tags": [] 998 | }, 999 | "outputs": [], 1000 | "source": [ 1001 | "#Solution\n", 1002 | "tips_perc_avg = df.groupby(\"hvfhs_license_num\").tip_percentage.mean().compute()\n", 1003 | "tips_perc_avg" 1004 | ] 1005 | }, 1006 | { 1007 | "cell_type": "code", 1008 | "execution_count": null, 1009 | "id": "46be3c7a-8128-4ab1-b8f1-3b398faa4238", 1010 | "metadata": {}, 1011 | "outputs": [], 1012 | "source": [ 1013 | "(tips_perc_avg.to_frame().set_index(tips_perc_avg.index.map(provider)))" 1014 | ] 1015 | }, 1016 | { 1017 | "cell_type": "code", 1018 | "execution_count": null, 1019 | "id": "1538a243-8de1-4dde-82f0-dd5434b21f11", 1020 | "metadata": {}, 1021 | "outputs": [], 1022 | "source": [ 1023 | "len(df)" 1024 | ] 1025 | }, 1026 | { 1027 | "cell_type": "markdown", 1028 | "id": "c3665fc2-8b8e-4e0e-a656-e47cefbafb86", 1029 | "metadata": {}, 1030 | "source": [ 1031 | "### Average trip time by provider" 1032 | ] 1033 | }, 1034 | { 1035 | "cell_type": "code", 1036 | "execution_count": null, 1037 | "id": "176f6aa4-8eec-4d4b-9959-678fa61d50ea", 1038 | "metadata": { 1039 | "tags": [] 1040 | }, 1041 | "outputs": [], 1042 | "source": [ 1043 | "trips_time_avg = (\n", 1044 | " df.groupby(\"hvfhs_license_num\")\n", 1045 | " .trip_time.agg([\"min\", \"max\", \"mean\", \"std\"])\n", 1046 | " .compute()\n", 1047 | ")\n", 1048 | "trips_time_avg" 1049 | ] 1050 | }, 1051 | { 1052 | "cell_type": "markdown", 1053 | "id": "50ade8e4-2fd4-4cf4-b533-033c694ae04d", 1054 | "metadata": {}, 1055 | "source": [ 1056 | "### In minutes" 1057 | ] 1058 | }, 1059 | { 1060 | "cell_type": "code", 1061 | "execution_count": null, 1062 | "id": "b3715c5a-7520-42c3-a41c-c1295a19d5ff", 1063 | "metadata": { 1064 | "tags": [] 1065 | }, 1066 | "outputs": [], 1067 | "source": [ 1068 | "trips_time_avg.set_index(trips_time_avg.index.map(provider)) / 60" 1069 | ] 1070 | }, 1071 | { 1072 | "cell_type": "markdown", 1073 | "id": "bf7fdf77-320b-4d87-802c-1f260867a141", 1074 | "metadata": {}, 1075 | "source": [ 1076 | "## What we've learned\n", 1077 | "- Most New Yorkers do not tip\n", 1078 | "- But it looks like of those who tip, it is common to tip around 20% regardless of the provider. Unless it's Via, they tend to tip slightly less.\n", 1079 | "- The trip_time column needs some cleaning of outliers. " 1080 | ] 1081 | }, 1082 | { 1083 | "cell_type": "code", 1084 | "execution_count": null, 1085 | "id": "451a0ec9-fc40-4f23-bd8a-69803f0a6fe2", 1086 | "metadata": {}, 1087 | "outputs": [], 1088 | "source": [ 1089 | "cluster.shutdown()\n", 1090 | "client.close()" 1091 | ] 1092 | }, 1093 | { 1094 | "cell_type": "markdown", 1095 | "id": "1c592b6f-6386-4e82-9cb2-b74637846a22", 1096 | "metadata": {}, 1097 | "source": [ 1098 | "### Useful links\n", 1099 | "\n", 1100 | "- https://tutorial.dask.org/01_dataframe.html\n", 1101 | "\n", 1102 | "**Useful links**\n", 1103 | "\n", 1104 | "* [DataFrames documentation](https://docs.dask.org/en/stable/dataframe.html)\n", 1105 | "* [Dataframes and parquet](https://docs.dask.org/en/stable/dataframe-parquet.html)\n", 1106 | "* [Dataframes examples](https://examples.dask.org/dataframe.html)\n", 1107 | "\n", 1108 | "### Other lesson\n", 1109 | "\n", 1110 | "Register [here](https://www.coiled.io/tutorials) for reminders. \n", 1111 | "\n", 1112 | "We have another lesson, where we’ll parallelize a custom Python workflow that scrapes, parses, and cleans data from Stack Overflow. 
We’ll get to: ‍\n", 1113 | "\n", 1114 | "- Learn how to do arbitrary task scheduling using the Dask Futures API\n", 1115 | "- Utilize blocking and non-blocking distributed calculations\n", 1116 | "\n", 1117 | "By the end, we’ll see how much faster this workflow is using Dask and how the Dask Futures API is particularly well-suited for this type of fine-grained execution.\n" 1118 | ] 1119 | } 1120 | ], 1121 | "metadata": { 1122 | "kernelspec": { 1123 | "display_name": "Python 3 (ipykernel)", 1124 | "language": "python", 1125 | "name": "python3" 1126 | }, 1127 | "language_info": { 1128 | "codemirror_mode": { 1129 | "name": "ipython", 1130 | "version": 3 1131 | }, 1132 | "file_extension": ".py", 1133 | "mimetype": "text/x-python", 1134 | "name": "python", 1135 | "nbconvert_exporter": "python", 1136 | "pygments_lexer": "ipython3", 1137 | "version": "3.10.0" 1138 | } 1139 | }, 1140 | "nbformat": 4, 1141 | "nbformat_minor": 5 1142 | } 1143 | -------------------------------------------------------------------------------- /1-Parallelize-your-python-code_Futures_API.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f04a0743-3d30-44ae-945b-c159160deaaf", 6 | "metadata": {}, 7 | "source": [ 8 | "\"Coiled\n", 12 | "\n", 13 | "### Sign up for the next live session https://www.coiled.io/tutorials\n" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "id": "3786371b-cdd0-473b-89bd-94573b3676b5", 19 | "metadata": {}, 20 | "source": [ 21 | "\"Dask\n", 25 | " \n", 26 | "# Parallelize your Python code\n", 27 | "\n", 28 | "In this lesson you will learn how to parallelize custom Python code using Dask using the Futures API.\n", 29 | "\n", 30 | "## Futures: a low-level collection.\n", 31 | "\n", 32 | "Dask low-level collections are the best tools when you need to have fine control to build custom parallel and distributed computations. \n", 33 | "\n", 34 | "The `futures` interface (derived from the built-in `concurrent.futures`) provides fine-grained real-time execution for custom situations. It allows you to submit arbitrary functions for computation in a parallelized, eager, and non-blocking way. \n", 35 | "\n", 36 | "### Why use Futures?\n", 37 | "\n", 38 | "The `futures` API offers a work submission style that can easily emulate the map/reduce paradigm. If that is familiar to you then futures might be the simplest entrypoint into Dask.\n", 39 | "\n", 40 | "The other big benefit of futures is that the intermediate results, represented by futures, can be passed to new tasks without having to pull data locally from the cluster. The **call returns immediately**, giving one or more *futures*, whose status begins as \"pending\" and later becomes \"finished\". There is no blocking of the local Python session. With futures, as soon as the inputs are available and there is compute available, the computation starts. \n", 41 | "\n", 42 | "### When do we use Futures?\n", 43 | "\n", 44 | "One of the most common cases where you can use `Futures` is when you have a for loop. For example, you need to apply a **read-transform-write** function over multiple files. 
Your serial code will look something like:\n", 45 | "\n", 46 | "\n", 47 | "```python\n", 48 | "# Serial code\n", 49 | "def process_file(filename):\n", 50 | " data = read_a_file(filename)\n", 51 | " data_transformed = do_a_transformation(data)\n", 52 | " destination = f\"results/{filename}\"\n", 53 | " write_out_data(data_transformed, destination)\n", 54 | " return destination\n", 55 | "\n", 56 | "files = [\"file_1\", \"file_2\", \"file_3\", ..., \"file_n\"] # list of files\n", 57 | "new_files = [] # where we save the destination file names\n", 58 | "\n", 59 | "for f in files:\n", 60 | " new_files.append(process_file(f))\n", 61 | "```\n", 62 | "\n", 63 | "Notice that every call of `process_file` is independent of the others; this is what is called an embarrassingly parallel problem. You can run this in parallel with Dask by doing:\n", 64 | "\n", 65 | "```python\n", 66 | "# Parallel code\n", 67 | "futures = []\n", 68 | "for f in files:\n", 69 | " future = client.submit(process_file, f)\n", 70 | " futures.append(future)\n", 71 | " \n", 72 | "futures\n", 73 | "```\n", 74 | "\n", 75 | "## Example: Get the SO questions page title \n", 76 | "\n", 77 | "During this lesson, you will be working with the Stack Overflow question pages. To start, let's see how to grab the title of each page, and how we can do this for multiple pages in parallel. \n", 78 | "\n", 79 | "If you go to https://stackoverflow.com/questions/ you will see a list of the newest posts; at the bottom of the page you can switch to the next page. For example, the top of page number two, at the moment this notebook was created, looked like \n", 80 | "\n", 81 | "
\n", 82 | "\"SO\n", 85 | "
\n", 86 | "\n", 87 | "### Get the title\n", 88 | "\n", 89 | "The title of the page is what is shown in the tab. The following function gets the title of a page given its page number:" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "id": "e6bebd77-c616-4899-9f38-d1a5d33b554d", 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "import time\n", 100 | "\n", 101 | "import pandas as pd\n", 102 | "import requests\n", 103 | "from bs4 import BeautifulSoup as bs" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "id": "485fe9fe-da20-4e17-98be-3572f32077d8", 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [ 113 | "def get_questions_page_title(page_num):\n", 114 | " \"\"\"Get title of a SO questions page\"\"\"\n", 115 | " url = f\"https://stackoverflow.com/questions?tab=newest&page={page_num}\"\n", 116 | " req = requests.get(url)\n", 117 | " html = bs(req.text, \"html.parser\")\n", 118 | "\n", 119 | " return html.title.text" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "id": "033f1c7e-0769-439b-a640-b3bfeb3b3969", 126 | "metadata": { 127 | "tags": [] 128 | }, 129 | "outputs": [], 130 | "source": [ 131 | "page_2 = get_questions_page_title(2)\n", 132 | "page_2" 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "id": "486c8e08-7044-4a8e-a1b4-0b4ea93b7d30", 138 | "metadata": { 139 | "tags": [] 140 | }, 141 | "source": [ 142 | "### Serial code to get 8 pages would be:" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": null, 148 | "id": "57f2a7db-3b6b-4a85-bf09-f02c91fa5e05", 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "%%time\n", 153 | "page_title = []\n", 154 | "for p in range(1, 9): # page numbers start in 1\n", 155 | " page_title.append(get_questions_page_title(p))" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "id": "c903a112-7802-417b-9565-32457fa663c8", 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "page_title[:3]" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "id": "a06d3c08-b560-4abe-9d7b-40cdd41bba2a", 171 | "metadata": {}, 172 | "source": [ 173 | "### Exercise\n", 174 | "\n", 175 | "Run the code in parallel, using futures." 
176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": null, 181 | "id": "27785fcb-93fe-4197-8133-6744cc02e710", 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "from dask.distributed import Client, wait" 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": null, 191 | "id": "9467a843-1364-4458-b4b4-8d68060c256a", 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [ 195 | "client = Client(n_workers=4)\n", 196 | "client" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": null, 202 | "id": "810b134d-c7af-4d34-a1e8-48aef0f16312", 203 | "metadata": { 204 | "tags": [] 205 | }, 206 | "outputs": [], 207 | "source": [ 208 | "# Solution\n", 209 | "futures = []\n", 210 | "for p in range(1, 9):\n", 211 | " future = client.submit(get_questions_page_title, p)\n", 212 | " futures.append(future)\n", 213 | "\n", 214 | "futures" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": null, 220 | "id": "6514a138-a39f-49b2-ae7c-e7b786ea4511", 221 | "metadata": { 222 | "tags": [] 223 | }, 224 | "outputs": [], 225 | "source": [ 226 | "futures[0]" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": null, 232 | "id": "31f5aad8-1d2c-4a14-9c19-8c2a6e179b3d", 233 | "metadata": { 234 | "tags": [] 235 | }, 236 | "outputs": [], 237 | "source": [ 238 | "futures[0].result()" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": null, 244 | "id": "84ddfbd4-484d-403f-a219-c848b9982d56", 245 | "metadata": { 246 | "tags": [] 247 | }, 248 | "outputs": [], 249 | "source": [ 250 | "results = [future.result() for future in futures]\n", 251 | "results" 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "id": "d809a1f0-b209-4ad8-b62a-01104eee28a0", 257 | "metadata": {}, 258 | "source": [ 259 | "**Extra:**\n", 260 | "\n", 261 | "To be able to `%%time` the cell and compare times with the serial version, you will need to wait for the futures to finish doing `wait(futures)`. If you try to do that and re-run the cell, you will notice it is immediate, this is because by default, distributed assumes that all functions are pure. Pure functions:\n", 262 | "\n", 263 | "- always return the same output for a given set of inputs\n", 264 | "- do not have side effects, like modifying global state or creating files\n", 265 | "\n", 266 | " \n", 267 | "You can use the `pure=False` keyword argument in the `client.submit()`. Modify your solution to match this code\n" 268 | ] 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "id": "fe36ccb1-0248-456e-9370-6f9e0bee2ef6", 273 | "metadata": { 274 | "tags": [] 275 | }, 276 | "source": [ 277 | "```python\n", 278 | "%%time\n", 279 | "futures = []\n", 280 | "for p in range(1,9):\n", 281 | " future = client.submit(get_questions_page_title, p, pure=False)\n", 282 | " futures.append(future)\n", 283 | " \n", 284 | "wait(futures)\n", 285 | "```" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "id": "4019bbbe-de28-49e9-bf4b-ca18aa4afba2", 291 | "metadata": {}, 292 | "source": [ 293 | "**`client.map()`**\n", 294 | "\n", 295 | "With `client.submit()` you can submit individual functions for evaluation with one set of inputs, and together with a `for-loop` you can also evaluate over a sequence of inputs. 
`client.map()` provides a simpler interface to perform the latter; let's see how to perform the example above using `client.map()`.\n" 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": null, 301 | "id": "a8f3eb84-ab8a-491d-a961-fa2bab0bd8c6", 302 | "metadata": {}, 303 | "outputs": [], 304 | "source": [ 305 | "futures = client.map(get_questions_page_title, range(1, 9))" 306 | ] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "id": "7a758e9f-a391-4eba-875a-8fd56ab03b03", 311 | "metadata": {}, 312 | "source": [ 313 | "`client.map()` returns a list of futures; you can block on the computation and gather the results by doing:" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": null, 319 | "id": "668dfa0e-5dd3-4304-b6d7-e54eadfa5550", 320 | "metadata": {}, 321 | "outputs": [], 322 | "source": [ 323 | "res = client.gather(futures)\n", 324 | "res" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "id": "2829f110-5843-4174-8d2a-b59d0de6b308", 330 | "metadata": {}, 331 | "source": [ 332 | "### Futures are great...\n", 333 | "\n", 334 | "The other big benefit of `futures` is that the intermediate results, represented by `futures`, can be passed to new tasks without having to pull data locally from the cluster. New operations can be set up to work on the output of previous jobs that have not even begun yet.\n", 335 | "\n", 336 | "Let's break our steps into multiple functions (chained in the sketch after this list):\n", 337 | "\n", 338 | "- `request_html_page`: given a url, returns the html of that page\n", 339 | "- `get_page_html_links`: given a SO questions page number, returns the html of that page\n", 340 | "- `get_post_links_per_page`: given a SO questions html page, returns a list with the post links of that page\n", 341 | "\n", 342 | "\n",
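 "Because `client.submit()` also accepts futures as arguments, the output of one step can feed the next without the intermediate html ever coming back to your local session. A minimal sketch chaining two of these functions (assuming a connected `client`):\n", "\n", "```python\n", "html_future = client.submit(get_page_html_links, 1)\n", "links_future = client.submit(get_post_links_per_page, html_future)  # the future itself is passed, not the data\n", "```\n", "\n", "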
\n", 344 | "\"dask\n", 347 | "
\n", 348 | "\n", 349 | "\n" 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": null, 355 | "id": "524bb7f1-49bd-4045-b3d8-f0a79228064b", 356 | "metadata": {}, 357 | "outputs": [], 358 | "source": [ 359 | "def request_html_page(url):\n", 360 | " \"\"\"Given a url returns the html of that page\"\"\"\n", 361 | " req = requests.get(url)\n", 362 | " html = bs(req.text, \"html.parser\")\n", 363 | " return html" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": null, 369 | "id": "25a592e2-b30e-4c5e-8811-854426e34d1f", 370 | "metadata": {}, 371 | "outputs": [], 372 | "source": [ 373 | "def get_page_html_links(page_num, tag=\"dask\", query_filter=\"MostVotes\"):\n", 374 | " \"\"\"Given a SO questions page number returns the html for that page number\n", 375 | " for a tag and query_filter.\n", 376 | " \"\"\"\n", 377 | " base_url = \"https://stackoverflow.com/questions/tagged/\"\n", 378 | "\n", 379 | " page_url = f\"{base_url}{tag}?sort={query_filter}&page={page_num}\"\n", 380 | "\n", 381 | " page_html = request_html_page(page_url)\n", 382 | "\n", 383 | " return page_html" 384 | ] 385 | }, 386 | { 387 | "cell_type": "code", 388 | "execution_count": null, 389 | "id": "c7230061-0551-44fd-9014-d75ffc362d3b", 390 | "metadata": {}, 391 | "outputs": [], 392 | "source": [ 393 | "def get_post_links_per_page(html_page):\n", 394 | " \"\"\"Given a SO questions html page, returns a list with the posts of that page.\"\"\"\n", 395 | " question_href = html_page.find_all(\"a\", class_=\"s-link\")[2:-1]\n", 396 | "\n", 397 | " question_link = [f\"https://stackoverflow.com{q['href']}\" for q in question_href]\n", 398 | "\n", 399 | " return question_link" 400 | ] 401 | }, 402 | { 403 | "cell_type": "markdown", 404 | "id": "78e6bf98-faef-4343-afb3-2e800c3e416e", 405 | "metadata": {}, 406 | "source": [ 407 | "### Explore the functions: " 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": null, 413 | "id": "614af67b-1868-4c3b-a212-559ab8136e7f", 414 | "metadata": {}, 415 | "outputs": [], 416 | "source": [ 417 | "page_number = 3\n", 418 | "\n", 419 | "page_3_html = get_page_html_links(page_num=page_number)" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": null, 425 | "id": "3318156f-a299-43e1-a08f-d585503938d6", 426 | "metadata": {}, 427 | "outputs": [], 428 | "source": [ 429 | "post_links_page_3 = get_post_links_per_page(page_3_html)" 430 | ] 431 | }, 432 | { 433 | "cell_type": "code", 434 | "execution_count": null, 435 | "id": "4ce191a0-7746-4fd7-93c5-620e69453c56", 436 | "metadata": {}, 437 | "outputs": [], 438 | "source": [ 439 | "len(post_links_page_3)" 440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": null, 445 | "id": "4cc7fe9d-5129-431d-92e5-939b8e777252", 446 | "metadata": {}, 447 | "outputs": [], 448 | "source": [ 449 | "post_links_page_3[:3]" 450 | ] 451 | }, 452 | { 453 | "cell_type": "markdown", 454 | "id": "359aeb5d-8c2e-4334-88c8-90d3d7fdcdd2", 455 | "metadata": {}, 456 | "source": [ 457 | "### Get post links for multiple pages" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": null, 463 | "id": "532930c8-a46a-4788-990d-0079b149a70b", 464 | "metadata": {}, 465 | "outputs": [], 466 | "source": [ 467 | "# serial code\n", 468 | "page_posts_links = []\n", 469 | "for page in range(1, 5):\n", 470 | " page_html = get_page_html_links(page_num=page)\n", 471 | " posts_links = get_post_links_per_page(page_html)\n", 472 | "\n", 473 | " 
page_posts_links.append(posts_links)" 474 | ] 475 | }, 476 | { 477 | "cell_type": "code", 478 | "execution_count": null, 479 | "id": "e7377b99-415b-400d-bc41-38e8142da3c8", 480 | "metadata": {}, 481 | "outputs": [], 482 | "source": [ 483 | "len(page_posts_links)" 484 | ] 485 | }, 486 | { 487 | "cell_type": "code", 488 | "execution_count": null, 489 | "id": "7ed0ada1-8875-417f-a8b6-a887706b8eb7", 490 | "metadata": {}, 491 | "outputs": [], 492 | "source": [ 493 | "[len(l) for l in page_posts_links]" 494 | ] 495 | }, 496 | { 497 | "cell_type": "markdown", 498 | "id": "d189e69c-648f-4e43-88df-08b3222b943c", 499 | "metadata": {}, 500 | "source": [ 501 | "**Parallel code: using `client.map()`**\n", 502 | "\n", 503 | "We can first get the futures for every page's html, and then pass those futures as the iterable to get the links per page." 504 | ] 505 | }, 506 | { 507 | "cell_type": "code", 508 | "execution_count": null, 509 | "id": "7417901f-f351-48ff-9083-685404f28742", 510 | "metadata": {}, 511 | "outputs": [], 512 | "source": [ 513 | "pages_html_futures = client.map(get_page_html_links, range(1, 5))\n", 514 | "wait(pages_html_futures) # wait until completed" 515 | ] 516 | }, 517 | { 518 | "cell_type": "markdown", 519 | "id": "f8443c00-6592-4dbd-b660-80d0453872a8", 520 | "metadata": {}, 521 | "source": [ 522 | "**`wait()`**\n", 523 | "\n", 524 | "Notice that here we used `wait()`. You can wait on a future or collection of futures using the `wait` function, which blocks until all futures are finished or have erred. This is useful when you need all the futures to be completed before proceeding with your computations. " 525 | ] 526 | }, 527 | { 528 | "cell_type": "code", 529 | "execution_count": null, 530 | "id": "3f9741ae-24e6-406c-ac69-e52ff5937d8c", 531 | "metadata": {}, 532 | "outputs": [], 533 | "source": [ 534 | "pages_html_futures[0]" 535 | ] 536 | }, 537 | { 538 | "cell_type": "markdown", 539 | "id": "b14229c3-e063-4ebd-996f-ca3b9f449c5f", 540 | "metadata": {}, 541 | "source": [ 542 | "### Exercise:\n", 543 | "\n", 544 | "Use `client.map()` and the `pages_html_futures` you just got to get the post links for the four pages, in parallel." 545 | ] 546 | }, 547 | { 548 | "cell_type": "code", 549 | "execution_count": null, 550 | "id": "a0c18aaf-dce4-4a18-965f-3fc37e4532b0", 551 | "metadata": { 552 | "tags": [] 553 | }, 554 | "outputs": [], 555 | "source": [ 556 | "# Solution\n", 557 | "posts_links_futures = client.map(get_post_links_per_page, pages_html_futures)\n", 558 | "posts_links_futures" 559 | ] 560 | }, 561 | { 562 | "cell_type": "code", 563 | "execution_count": null, 564 | "id": "6060850e-0647-431d-ab91-a7bf5015a4f8", 565 | "metadata": { 566 | "tags": [] 567 | }, 568 | "outputs": [], 569 | "source": [ 570 | "posts_links_futures[0]" 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": null, 576 | "id": "9a456089-a046-42f0-8dc5-ce76ea6d6496", 577 | "metadata": { 578 | "tags": [] 579 | }, 580 | "outputs": [], 581 | "source": [ 582 | "posts_links_futures[0].result()[:3]" 583 | ] 584 | }, 585 | { 586 | "cell_type": "markdown", 587 | "id": "7242493f-9026-40e8-80aa-48e16c8cbff7", 588 | "metadata": {}, 589 | "source": [ 590 | "**`as_completed()`**\n", 591 | "\n", 592 | "In the example above we waited for the `pages_html_futures` to finish before proceeding to get the `posts_links_futures`. However, we can get the `post_links_futures` for every page as soon as its `pages_html_futures` entry finishes. \n", 593 | "\n", 594 | "`as_completed()` makes this possible. 
It returns an iterator that yields the input future objects in the order in which they complete. " 595 | ] 596 | }, 597 | { 598 | "cell_type": "code", 599 | "execution_count": null, 600 | "id": "bf892eab-26f8-4fc5-9367-7711d52468f4", 601 | "metadata": {}, 602 | "outputs": [], 603 | "source": [ 604 | "from dask.distributed import as_completed" 605 | ] 606 | }, 607 | { 608 | "cell_type": "code", 609 | "execution_count": null, 610 | "id": "11a3bb79-0116-40a4-8c14-9bbdbfb96652", 611 | "metadata": {}, 612 | "outputs": [], 613 | "source": [ 614 | "pages_html_futures = client.map(\n", 615 | " get_page_html_links, range(1, 5), pure=False\n", 616 | ") # use pure=False to re-compute\n", 617 | "\n", 618 | "post_links_futures = []\n", 619 | "for p in as_completed(pages_html_futures):\n", 620 | " post_links_futures.append(client.submit(get_post_links_per_page, p))" 621 | ] 622 | }, 623 | { 624 | "cell_type": "code", 625 | "execution_count": null, 626 | "id": "38a1dd26-699b-4800-b8c7-1229d3dcae00", 627 | "metadata": {}, 628 | "outputs": [], 629 | "source": [ 630 | "post_links_futures" 631 | ] 632 | }, 633 | { 634 | "cell_type": "markdown", 635 | "id": "16467916-008c-4a21-bb1e-d374c530ee57", 636 | "metadata": {}, 637 | "source": [ 638 | "## Grown-up example: Scrape, crawl and get SO data\n", 639 | "\n", 640 | "Let's use all what we've learned in the examples above, to do something a bit more advanced. In this section, we graduate to a grown-up example. You will learn how to parallelize a scraping, crawling and get data workflow.\n", 641 | "\n", 642 | "Up to now, you learned how to scrape multiple pages from https://stackoverflow.com/questions/, and to get a list of the post links for every page. Let's go a step further and get some data of each post. For example we can \n", 643 | "\n", 644 | "- Title\n", 645 | "- Question body\n", 646 | "- Most voted answer\n", 647 | "- Number of votes for the best answer\n", 648 | "- Who authored the most voted answer\n", 649 | "\n", 650 | "
\n", 651 | "\"data\n", 654 | "
\n", 655 | "\n", 656 | "\n", 657 | "#### Data insights\n", 658 | "\n", 659 | "For every page we will end up with one dictionary per post that contains the information above, we can convert them into a dataframe and for example find useful aggregated information like:\n", 660 | "\n", 661 | "- Which username gets the most \"best answers\"?\n", 662 | "- Which of the best answer usernames is the most voted?\n", 663 | "\n", 664 | "\n", 665 | "
\n", 666 | "\"bar\n", 669 | "
\n", 670 | "\n", 671 | "**Note about throttling:**\n", 672 | "\n", 673 | "When scrapping directly from the pages and not using the API, it is not clear what are the throttling limitations, but from experience we run into them pretty quickly.\n", 674 | "\n", 675 | "The following examples, work as they are, if you change the number of pages you will likely hit a limit and be banned for few minutes. We will work around this towards the end, in the meantime avoid changing the number of pages" 676 | ] 677 | }, 678 | { 679 | "cell_type": "markdown", 680 | "id": "d22c3321-d821-4c89-b90d-a1620312db0b", 681 | "metadata": {}, 682 | "source": [ 683 | "## Scrape, crawl, get data, and plot\n", 684 | "\n", 685 | "Below you have our set of functions that we use above, plus function that will allow us to scrape the data needed to get some insights.\n", 686 | "\n", 687 | "You will see this functions in action in serial and together we will use all what we learned about futures to run things in parallel." 688 | ] 689 | }, 690 | { 691 | "cell_type": "code", 692 | "execution_count": null, 693 | "id": "6c4faad5-4bf2-4534-b3e1-26e6b2eb0432", 694 | "metadata": {}, 695 | "outputs": [], 696 | "source": [ 697 | "def request_html_page(url):\n", 698 | " \"\"\"Given a url returns the html of that page\"\"\"\n", 699 | " req = requests.get(url)\n", 700 | " html = bs(req.text, \"html.parser\")\n", 701 | " return html" 702 | ] 703 | }, 704 | { 705 | "cell_type": "code", 706 | "execution_count": null, 707 | "id": "308a11f4-69cc-412c-bcf9-e9994435a70c", 708 | "metadata": {}, 709 | "outputs": [], 710 | "source": [ 711 | "def get_page_html_links(page_num, tag=\"dask\", query_filter=\"MostVotes\"):\n", 712 | " \"\"\"Given a SO questions page number returns the html for that page number\n", 713 | " for a tag and query_filter.\n", 714 | " \"\"\"\n", 715 | " base_url = \"https://stackoverflow.com/questions/tagged/\"\n", 716 | "\n", 717 | " page_url = f\"{base_url}{tag}?sort={query_filter}&page={page_num}\"\n", 718 | "\n", 719 | " page_html = request_html_page(page_url)\n", 720 | "\n", 721 | " return page_html" 722 | ] 723 | }, 724 | { 725 | "cell_type": "code", 726 | "execution_count": null, 727 | "id": "ccdb5443-9067-40c2-a56c-a5cc204acade", 728 | "metadata": {}, 729 | "outputs": [], 730 | "source": [ 731 | "def get_post_links_per_page(html_page):\n", 732 | " \"\"\"Given a SO questions html page, returns a list with the posts of that page.\"\"\"\n", 733 | " question_href = html_page.find_all(\"a\", class_=\"s-link\")[2:-1]\n", 734 | "\n", 735 | " question_link = [f\"https://stackoverflow.com{q['href']}\" for q in question_href]\n", 736 | "\n", 737 | " return question_link" 738 | ] 739 | }, 740 | { 741 | "cell_type": "code", 742 | "execution_count": null, 743 | "id": "2bd88acf-eb3b-4d7d-8735-8449fc53b788", 744 | "metadata": {}, 745 | "outputs": [], 746 | "source": [ 747 | "def get_data(post_link):\n", 748 | " \"\"\"Get data from a SO post as a dictionary\n", 749 | "\n", 750 | " - Title\n", 751 | " - Question body\n", 752 | " - Number of votes for the best answer\n", 753 | " - Who authored the most voted answer\n", 754 | "\n", 755 | " \"\"\"\n", 756 | " html_post = request_html_page(post_link)\n", 757 | " post_info = {}\n", 758 | "\n", 759 | " post_info[\"title\"] = html_post.title.text\n", 760 | " post_info[\"question\"] = html_post.find(\"div\", class_=\"s-prose js-post-body\").text\n", 761 | "\n", 762 | " answ = html_post.find(\n", 763 | " \"div\", class_=\"answer\"\n", 764 | " ) # this will gets us the first/most voted 
answer\n", 765 | "\n", 766 | " if answ:\n", 767 | " post_info[\"best_answer_votes\"] = int(answ[\"data-score\"])\n", 768 | "\n", 769 | " best_answer_author_obj = answ.find(\"span\", itemprop=\"name\")\n", 770 | "\n", 771 | " if best_answer_author_obj:\n", 772 | " best_answer_author = best_answer_author_obj.text\n", 773 | " else:\n", 774 | " best_answer_author = \"comunity_post\"\n", 775 | "\n", 776 | " post_info[\"best_answer_usrname\"] = best_answer_author\n", 777 | " else:\n", 778 | " post_info[\"best_answer_votes\"] = 0\n", 779 | " post_info[\"best_answer_usrname\"] = \"no-answer\"\n", 780 | "\n", 781 | " return post_info" 782 | ] 783 | }, 784 | { 785 | "cell_type": "markdown", 786 | "id": "7ea45c41-734c-4dc0-bc64-cd3b401eb9a6", 787 | "metadata": {}, 788 | "source": [ 789 | "## Serial" 790 | ] 791 | }, 792 | { 793 | "cell_type": "markdown", 794 | "id": "f8ce2ce1-6cff-40d4-a495-d6a75ead372c", 795 | "metadata": {}, 796 | "source": [ 797 | "### Let's try only 1 page" 798 | ] 799 | }, 800 | { 801 | "cell_type": "code", 802 | "execution_count": null, 803 | "id": "f7dde5aa-69f2-4290-a8e6-4a2c86ee80c4", 804 | "metadata": {}, 805 | "outputs": [], 806 | "source": [ 807 | "%%time\n", 808 | "page_num=1 #we try 1 page \n", 809 | "page_html = get_page_html_links(page_num)\n", 810 | "posts_links = get_post_links_per_page(page_html)\n", 811 | "list_post_data = []\n", 812 | "\n", 813 | "for link in posts_links:\n", 814 | " p_data = get_data(link)\n", 815 | " list_post_data.append(p_data)\n", 816 | "\n", 817 | "df1 = pd.DataFrame(list_post_data)" 818 | ] 819 | }, 820 | { 821 | "cell_type": "code", 822 | "execution_count": null, 823 | "id": "3dc9e1e5-3a50-430c-b840-ec501d39aceb", 824 | "metadata": {}, 825 | "outputs": [], 826 | "source": [ 827 | "df1.head()" 828 | ] 829 | }, 830 | { 831 | "cell_type": "markdown", 832 | "id": "33778b54-cda1-4b4a-8b7a-b477d2a2131d", 833 | "metadata": {}, 834 | "source": [ 835 | "### How about 2? \n", 836 | "\n", 837 | "We'll try 2 pages, we'll likely hit throttling, if you get an error message that says \n", 838 | "\n", 839 | "```python-traceback\n", 840 | "AttributeError: 'NoneType' object has no attribute 'text'\n", 841 | "```\n", 842 | "that's a result of hitting throttling. " 843 | ] 844 | }, 845 | { 846 | "cell_type": "code", 847 | "execution_count": null, 848 | "id": "2402f42a-e528-45dd-bce8-0b3bc64022f7", 849 | "metadata": {}, 850 | "outputs": [], 851 | "source": [ 852 | "%%time\n", 853 | "\n", 854 | "df_list = []\n", 855 | "for page_num in range(1, 3): \n", 856 | " page_html = get_page_html_links(page_num)\n", 857 | " posts_links = get_post_links_per_page(page_html)\n", 858 | " list_post_data = []\n", 859 | "\n", 860 | " for link in posts_links:\n", 861 | " p_data = get_data(link)\n", 862 | " list_post_data.append(p_data)\n", 863 | "\n", 864 | " df = pd.DataFrame(list_post_data)\n", 865 | " df_list.append(df)" 866 | ] 867 | }, 868 | { 869 | "cell_type": "code", 870 | "execution_count": null, 871 | "id": "8a8a0cf1-4b0c-48d9-a037-d43306ff510c", 872 | "metadata": {}, 873 | "outputs": [], 874 | "source": [ 875 | "df_list[0].head()" 876 | ] 877 | }, 878 | { 879 | "cell_type": "markdown", 880 | "id": "27c2fce7-3a4d-4d8c-9d39-1bb76bb581fc", 881 | "metadata": {}, 882 | "source": [ 883 | "## Parallel\n", 884 | "\n", 885 | "### Avoid throttling using a cluster. \n", 886 | "\n", 887 | "In the example above, if we try to work with more pages, you will hit throttling issues. This is a problem, but we can solve it by scaling out to a lot of machines. 
When using a cluster each worker has its own public IP-address so it is like we are requesting from different machines. \n", 888 | "\n", 889 | "Let's create a coiled cluster and scrape a bigger number of pages." 890 | ] 891 | }, 892 | { 893 | "cell_type": "code", 894 | "execution_count": null, 895 | "id": "479ef7ce-e9ca-48ae-8782-fe1691b34ac0", 896 | "metadata": {}, 897 | "outputs": [], 898 | "source": [ 899 | "# shutdown Local Cluster\n", 900 | "client.close()" 901 | ] 902 | }, 903 | { 904 | "cell_type": "markdown", 905 | "id": "4358ef9b-87c7-4033-86c4-48f7c26a88f1", 906 | "metadata": {}, 907 | "source": [ 908 | "**Note - Windows Users**\n", 909 | "\n", 910 | "If the following cell doesn't work, you will need to go to a command prompt or PowerShell window within an environment that includes coiled and run the following command from there:" 911 | ] 912 | }, 913 | { 914 | "cell_type": "code", 915 | "execution_count": null, 916 | "id": "ab088ce0-b8f7-47f0-9a64-153c90dbebb3", 917 | "metadata": {}, 918 | "outputs": [], 919 | "source": [ 920 | "### coiled login\n", 921 | "#!coiled login --token ### --account dask-tutorials" 922 | ] 923 | }, 924 | { 925 | "cell_type": "code", 926 | "execution_count": null, 927 | "id": "9b81208e-1075-412a-ac48-e2193c984acb", 928 | "metadata": {}, 929 | "outputs": [], 930 | "source": [ 931 | "import coiled" 932 | ] 933 | }, 934 | { 935 | "cell_type": "code", 936 | "execution_count": null, 937 | "id": "c8bbdbe4-f486-49dc-9d0b-0e6230c5ad2c", 938 | "metadata": {}, 939 | "outputs": [], 940 | "source": [ 941 | "# we use this to avoid re-using clusters on a team\n", 942 | "import uuid\n", 943 | "\n", 944 | "id_cluster = uuid.uuid4().hex[:4]\n", 945 | "id_cluster" 946 | ] 947 | }, 948 | { 949 | "cell_type": "code", 950 | "execution_count": null, 951 | "id": "6d91a1f5-b9e1-4553-b9c4-3da14127c59f", 952 | "metadata": {}, 953 | "outputs": [], 954 | "source": [ 955 | "%%time\n", 956 | "cluster = coiled.Cluster(\n", 957 | " name=f\"my-cluster-{id_cluster}\",\n", 958 | " account=\"dask-tutorials\",\n", 959 | " n_workers=10,\n", 960 | ")" 961 | ] 962 | }, 963 | { 964 | "cell_type": "code", 965 | "execution_count": null, 966 | "id": "4334ae4e-43a2-450e-a6fc-7f1b9c302f3f", 967 | "metadata": {}, 968 | "outputs": [], 969 | "source": [ 970 | "# When running from binder, the dask-lab extension won't work, use link to dashboard\n", 971 | "client = Client(cluster)\n", 972 | "client" 973 | ] 974 | }, 975 | { 976 | "cell_type": "code", 977 | "execution_count": null, 978 | "id": "401e1817-fe7a-4d3c-927e-44db2f5be74e", 979 | "metadata": {}, 980 | "outputs": [], 981 | "source": [ 982 | "%%time\n", 983 | "pages_futures = client.map(get_page_html_links, range(1, 11))\n", 984 | "wait(pages_futures)\n", 985 | "\n", 986 | "posts_links_futures = client.map(get_post_links_per_page, pages_futures)\n", 987 | "crawling = as_completed(posts_links_futures)\n", 988 | "\n", 989 | "dfs_data = []\n", 990 | "for future in crawling:\n", 991 | " list_links = future.result() # list of links per page\n", 992 | " df_data = []\n", 993 | " for link in list_links:\n", 994 | " fut_data = client.submit(get_data, link)\n", 995 | " df_data.append(fut_data)\n", 996 | "\n", 997 | " dfs_data.append(df_data)\n", 998 | "_ = wait(dfs_data)" 999 | ] 1000 | }, 1001 | { 1002 | "cell_type": "code", 1003 | "execution_count": null, 1004 | "id": "01a64300-e971-4235-b9a9-b16c778c1069", 1005 | "metadata": {}, 1006 | "outputs": [], 1007 | "source": [ 1008 | "len(dfs_data) # 10 pages" 1009 | ] 1010 | }, 1011 | { 1012 | "cell_type": 
"markdown", 1013 | "id": "6fda0881-f0a1-47d0-b019-8f7c4270b5f7", 1014 | "metadata": {}, 1015 | "source": [ 1016 | "At this point, we have the data to build each page dataframe. " 1017 | ] 1018 | }, 1019 | { 1020 | "cell_type": "code", 1021 | "execution_count": null, 1022 | "id": "67513dcf-0ee9-4981-a948-8f3697d0db0b", 1023 | "metadata": {}, 1024 | "outputs": [], 1025 | "source": [ 1026 | "dfs_data[0]" 1027 | ] 1028 | }, 1029 | { 1030 | "cell_type": "markdown", 1031 | "id": "1fa7fdfb-4a50-43ef-b662-76ccc892c141", 1032 | "metadata": {}, 1033 | "source": [ 1034 | "### Let's get some dataframes:\n", 1035 | "In the serial code you ended up with dataframes and you could get insights on the data working with pandas dataframes. At the moment our futures do not have dataframes, but we can convert them by mapping pandas.Dataframe into our `dfs_data` futures." 1036 | ] 1037 | }, 1038 | { 1039 | "cell_type": "code", 1040 | "execution_count": null, 1041 | "id": "d13dedd5-18c7-41b4-b0a7-755502554b52", 1042 | "metadata": {}, 1043 | "outputs": [], 1044 | "source": [ 1045 | "df_futures = client.map(pd.DataFrame, dfs_data)" 1046 | ] 1047 | }, 1048 | { 1049 | "cell_type": "code", 1050 | "execution_count": null, 1051 | "id": "df97d442-f647-4a86-82f9-78bf85efd327", 1052 | "metadata": {}, 1053 | "outputs": [], 1054 | "source": [ 1055 | "df_futures[0]" 1056 | ] 1057 | }, 1058 | { 1059 | "cell_type": "markdown", 1060 | "id": "f0f1de10-3712-4a87-902e-3e26ecc8e276", 1061 | "metadata": {}, 1062 | "source": [ 1063 | "Now we have some pandas dataframes!!\n", 1064 | "\n", 1065 | "### Exercise\n", 1066 | "\n", 1067 | "Use the following function that returns the `value_counts` of the `best_answer_usrname`, to get the best answer value counts for the `df_futures`.\n" 1068 | ] 1069 | }, 1070 | { 1071 | "cell_type": "code", 1072 | "execution_count": null, 1073 | "id": "f01e35bf-c20a-4eac-847c-cf2298771cde", 1074 | "metadata": { 1075 | "tags": [] 1076 | }, 1077 | "outputs": [], 1078 | "source": [ 1079 | "def best_ans_val_counts(df):\n", 1080 | " return df.best_answer_usrname.value_counts()" 1081 | ] 1082 | }, 1083 | { 1084 | "cell_type": "code", 1085 | "execution_count": null, 1086 | "id": "cbcb8357-5c47-4afe-a8c8-c7fca943d259", 1087 | "metadata": { 1088 | "tags": [] 1089 | }, 1090 | "outputs": [], 1091 | "source": [ 1092 | "###Solution\n", 1093 | "best_ans_val_counts_futures = client.map(best_ans_val_counts, df_futures)\n", 1094 | "best_ans_val_counts_futures[0].result()" 1095 | ] 1096 | }, 1097 | { 1098 | "cell_type": "markdown", 1099 | "id": "f51a90b3-036b-4f44-a03d-982e97646af8", 1100 | "metadata": {}, 1101 | "source": [ 1102 | "### Exercise\n", 1103 | "\n", 1104 | "Write a function that calculates the total amount of votes by best answer, and use it get the username with most votes for the `df_futures`. 
" 1105 | ] 1106 | }, 1107 | { 1108 | "cell_type": "code", 1109 | "execution_count": null, 1110 | "id": "fa871753-e011-4987-89f2-e4b8fedfa58a", 1111 | "metadata": { 1112 | "tags": [] 1113 | }, 1114 | "outputs": [], 1115 | "source": [ 1116 | "# solution\n", 1117 | "def most_votes(df):\n", 1118 | " return df.groupby(\"best_answer_usrname\")[\"best_answer_votes\"].sum()\n", 1119 | "\n", 1120 | "\n", 1121 | "most_votes_futures = client.map(most_votes, df_futures, pure=False)" 1122 | ] 1123 | }, 1124 | { 1125 | "cell_type": "markdown", 1126 | "id": "690fe19b-0484-4984-8646-f6d98bfdebb8", 1127 | "metadata": {}, 1128 | "source": [ 1129 | "## Results for 10 pages aggregation\n", 1130 | "\n", 1131 | "Now we have 10 futures that each of them is a `pd.Series`. We can bring this to the client, concatenate them and re-do our plots.\n", 1132 | "\n", 1133 | "### Exercise\n", 1134 | "\n", 1135 | "Gather the results into a a list of `pd.Series`." 1136 | ] 1137 | }, 1138 | { 1139 | "cell_type": "code", 1140 | "execution_count": null, 1141 | "id": "2bde999e-bc8d-477b-9be2-04c4f778e058", 1142 | "metadata": { 1143 | "tags": [] 1144 | }, 1145 | "outputs": [], 1146 | "source": [ 1147 | "# Solution\n", 1148 | "best_ans_count_res = client.gather(best_ans_val_counts_futures)\n", 1149 | "most_votes_res = client.gather(most_votes_futures)" 1150 | ] 1151 | }, 1152 | { 1153 | "cell_type": "markdown", 1154 | "id": "cadf3e1d-a4da-46fa-a914-82a5e1d56dcc", 1155 | "metadata": {}, 1156 | "source": [ 1157 | "### Let's Plot!" 1158 | ] 1159 | }, 1160 | { 1161 | "cell_type": "code", 1162 | "execution_count": null, 1163 | "id": "44e7962c-019b-471c-a3ca-fba7fbe11f7b", 1164 | "metadata": {}, 1165 | "outputs": [], 1166 | "source": [ 1167 | "most_answers_tot = (\n", 1168 | " pd.concat(best_ans_count_res, axis=1).sum(axis=1).sort_values(ascending=False)\n", 1169 | ")\n", 1170 | "most_voted_tot = (\n", 1171 | " pd.concat(most_votes_res, axis=1).sum(axis=1).sort_values(ascending=False)\n", 1172 | ")" 1173 | ] 1174 | }, 1175 | { 1176 | "cell_type": "code", 1177 | "execution_count": null, 1178 | "id": "d5d50ca2-ad8e-4b74-aa46-cd7d8c7ec0c1", 1179 | "metadata": {}, 1180 | "outputs": [], 1181 | "source": [ 1182 | "most_answers_tot[:5].plot.bar(\n", 1183 | " title=\"Most Best Answers\", ylabel=\"Votes\", xlabel=\"Usernames\"\n", 1184 | ");" 1185 | ] 1186 | }, 1187 | { 1188 | "cell_type": "code", 1189 | "execution_count": null, 1190 | "id": "8e5b6813-e234-4f07-83a6-f083132098d8", 1191 | "metadata": {}, 1192 | "outputs": [], 1193 | "source": [ 1194 | "most_voted_tot[:5].plot.bar(title=\"Most Voted Usernames\", ylabel=\"Votes\");" 1195 | ] 1196 | }, 1197 | { 1198 | "cell_type": "markdown", 1199 | "id": "b3b05936-5863-4583-b0f0-44fe6fa44c00", 1200 | "metadata": {}, 1201 | "source": [ 1202 | "### Dask dataframes API\n", 1203 | "\n", 1204 | "Now we are on dataframe world, we can do pandas-like operations, for example." 1205 | ] 1206 | }, 1207 | { 1208 | "cell_type": "markdown", 1209 | "id": "1a749893-30d1-4427-a293-ed91ef14ab5a", 1210 | "metadata": {}, 1211 | "source": [ 1212 | "We can do multiple operations on these dataframes using `futures` but at this point since we are working with dataframes we can use `dask.dataframes`. 
" 1213 | ] 1214 | }, 1215 | { 1216 | "cell_type": "code", 1217 | "execution_count": null, 1218 | "id": "4429b8f4-79f5-4af4-a3cb-30d1730d82d1", 1219 | "metadata": {}, 1220 | "outputs": [], 1221 | "source": [ 1222 | "import dask.dataframe as dd" 1223 | ] 1224 | }, 1225 | { 1226 | "cell_type": "code", 1227 | "execution_count": null, 1228 | "id": "b63bef3e-9532-4fbe-b58e-2699633074bb", 1229 | "metadata": {}, 1230 | "outputs": [], 1231 | "source": [ 1232 | "ddf_so = dd.from_delayed(df_futures)" 1233 | ] 1234 | }, 1235 | { 1236 | "cell_type": "code", 1237 | "execution_count": null, 1238 | "id": "e17b6c1d-84e8-4160-a0b3-c0a38469fefe", 1239 | "metadata": {}, 1240 | "outputs": [], 1241 | "source": [ 1242 | "ddf_so" 1243 | ] 1244 | }, 1245 | { 1246 | "cell_type": "code", 1247 | "execution_count": null, 1248 | "id": "d4f19f68-1dee-4e4a-a748-df96260843aa", 1249 | "metadata": {}, 1250 | "outputs": [], 1251 | "source": [ 1252 | "ddf_so.columns" 1253 | ] 1254 | }, 1255 | { 1256 | "cell_type": "markdown", 1257 | "id": "158c1bce-23af-4310-ba78-cadfede71968", 1258 | "metadata": {}, 1259 | "source": [ 1260 | "We can check which of the user that got a best answer, has the most \"best answers\"" 1261 | ] 1262 | }, 1263 | { 1264 | "cell_type": "code", 1265 | "execution_count": null, 1266 | "id": "a8c70813-5717-49c7-86ba-f705e1f32520", 1267 | "metadata": {}, 1268 | "outputs": [], 1269 | "source": [ 1270 | "ddf_so.best_answer_usrname.value_counts().compute()[:6]" 1271 | ] 1272 | }, 1273 | { 1274 | "cell_type": "markdown", 1275 | "id": "14ffeb25-b404-419f-a1f9-5a1d70b0616e", 1276 | "metadata": {}, 1277 | "source": [ 1278 | "We can also check how many votes, these users got:" 1279 | ] 1280 | }, 1281 | { 1282 | "cell_type": "code", 1283 | "execution_count": null, 1284 | "id": "90e8e18a-3004-4c5f-8e26-6a92fc65768f", 1285 | "metadata": {}, 1286 | "outputs": [], 1287 | "source": [ 1288 | "ddf_so.groupby(\"best_answer_usrname\")[\"best_answer_votes\"].sum().compute()" 1289 | ] 1290 | }, 1291 | { 1292 | "cell_type": "markdown", 1293 | "id": "e329af0b-bf41-43b3-9163-92d58386c8d4", 1294 | "metadata": {}, 1295 | "source": [ 1296 | "## Clean up" 1297 | ] 1298 | }, 1299 | { 1300 | "cell_type": "code", 1301 | "execution_count": null, 1302 | "id": "b0d37a84-97ee-4969-8846-876fe43e3455", 1303 | "metadata": {}, 1304 | "outputs": [], 1305 | "source": [ 1306 | "cluster.shutdown()\n", 1307 | "client.close()" 1308 | ] 1309 | }, 1310 | { 1311 | "cell_type": "markdown", 1312 | "id": "57f28d2f-2b44-428c-afca-402b7988a6ea", 1313 | "metadata": {}, 1314 | "source": [ 1315 | "## Extra:\n", 1316 | "\n", 1317 | "Repeat the analysis we did for the `tag=\"dask\"` for a different one, like `tag=\"python\"`.\n", 1318 | "\n", 1319 | "You will need to modify this portion of the code to\n", 1320 | "\n", 1321 | "```python\n", 1322 | "pages_futures = client.map(get_page_html_links, range(1,3))\n", 1323 | "wait(pages_futures)\n", 1324 | "```\n", 1325 | "to use `client.map()` with `lambda` functions like:\n", 1326 | "\n", 1327 | "```python\n", 1328 | "# Solution\n", 1329 | "pages_py_futures = client.map(lambda p: get_page_html_links(p, tag=\"python\"), range(1, 3))\n", 1330 | "wait(pages_py_futures)\n", 1331 | "```" 1332 | ] 1333 | }, 1334 | { 1335 | "cell_type": "markdown", 1336 | "id": "62a7a6ef-b94b-46f8-af23-1a7ef27725f4", 1337 | "metadata": {}, 1338 | "source": [ 1339 | "### Useful links\n", 1340 | "\n", 1341 | "- https://tutorial.dask.org/05_futures.html\n", 1342 | "\n", 1343 | "**Useful links**\n", 1344 | "\n", 1345 | "* [Futures 
documentation](https://docs.dask.org/en/latest/futures.html)\n", 1346 | "* [Futures screencast](https://www.youtube.com/watch?v=07EiCpdhtDE)\n", 1347 | "* [Futures examples](https://examples.dask.org/futures.html)\n", 1348 | "\n", 1349 | "### Next lesson\n", 1350 | "\n", 1351 | "Register [here](https://www.coiled.io/tutorials) for reminders. \n", 1352 | "\n", 1353 | "In the next lesson, we’ll learn some best practices around working with larger-than-memory datasets. We’ll use the Uber/Lyft dataset to:\n", 1354 | "\n", 1355 | "- Manipulate Parquet files and optimize queries\n", 1356 | "- Navigate inconvenient file sizes and data types\n", 1357 | "- Extract useful features with Dask Dataframe\n", 1358 | "\n", 1359 | "By the end, we’ll learn the advantages of working with the Parquet file format and how to efficiently perform an exploratory analysis with Dask." 1360 | ] 1361 | } 1362 | ], 1363 | "metadata": { 1364 | "kernelspec": { 1365 | "display_name": "Python 3 (ipykernel)", 1366 | "language": "python", 1367 | "name": "python3" 1368 | }, 1369 | "language_info": { 1370 | "codemirror_mode": { 1371 | "name": "ipython", 1372 | "version": 3 1373 | }, 1374 | "file_extension": ".py", 1375 | "mimetype": "text/x-python", 1376 | "name": "python", 1377 | "nbconvert_exporter": "python", 1378 | "pygments_lexer": "ipython3", 1379 | "version": "3.10.0" 1380 | } 1381 | }, 1382 | "nbformat": 4, 1383 | "nbformat_minor": 5 1384 | } 1385 | -------------------------------------------------------------------------------- /2-Get_better-at-dask-dataframes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "d3f8ffbf-9dc9-42d4-9482-c7fb9e4333a9", 6 | "metadata": {}, 7 | "source": [ 8 | "\"Coiled\n", 12 | "\n", 13 | "### Sign up for the next live session https://www.coiled.io/tutorials" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "id": "036478b0-c89d-4051-a513-f4436f48968f", 19 | "metadata": {}, 20 | "source": [ 21 | "\"Dask\n", 25 | "\n", 26 | "# Get better at Dask Dataframes\n", 27 | "\n", 28 | "In this lesson, you will learn the advantages of working with the parquet data format and best practices when working with big data. You will learn how to manipulate inconvenient file sizes and datatypes, as well as how to make your data easier to manipulate. You will be exploring the Uber/Lyft dataset and learning some key practices of feature engineering with Dask Dataframes.\n", 29 | "\n", 30 | "## Dask Dataframes \n", 31 | "\n", 32 | "\"Dask\n", 36 | "\n", 37 | "At its core, the `dask.dataframe` module implements a \"blocked parallel\" `DataFrame` object that looks and feels like the `pandas` API, but for parallel and distributed workflows. One Dask `DataFrame` is comprised of many in-memory pandas `DataFrame`s separated along the index. One operation on a Dask `DataFrame` triggers many pandas operations on the constituent pandas `DataFrame`s in a way that is mindful of potential parallelism and memory constraints.\n", 38 | "\n", 39 | "Dask dataframes are very useful, but getting the most out of them can be tricky. Where your data is stored, the format your data was saved in, the size of each file and the data types, are some examples of things you need to care when it comes to working with dataframes. 
" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "id": "4307fa56-f051-467b-bf1f-e3acae08e3a8", 45 | "metadata": {}, 46 | "source": [ 47 | "### Work close to your data\n", 48 | "\n", 49 | "To get started when you are working with data that is in the cloud it's always better to work close to your data to minimize the impact of IO networking. \n", 50 | "\n", 51 | "In this lesson, we will use Coiled Clusters that will be created on the same region that our datasets are stored. (the region is `\"us-east-2\"`)\n" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "id": "7a52c65b-f0bc-4341-9841-4de07f710dce", 57 | "metadata": {}, 58 | "source": [ 59 | "## Parquet vs CSV\n", 60 | "\n", 61 | "Most people are familiarized with **csv** files, but when it comes to working with data, working with **parquet** can make a big difference. \n", 62 | "\n", 63 | "### Parquet is where it's at!!\n", 64 | "\n", 65 | "The Parquet file format is column-oriented and it is designed to efficiently store and retrieve data. Columnar formats provide better compression and improved performance, and enable you to query data column by column. Consequently, aggregation queries are faster compared to row-oriented storage.\n", 66 | "\n", 67 | "\"Dask\n", 71 | " \n", 72 | " \n", 73 | "- **Column pruning:** Parquet lets you read specific columns from a dataset without reading the entire file.\n", 74 | "- **Better compression:** Because in each column the data types are fairly similar, the compression of each column is quite straightforward. (saves on storage)\n", 75 | "- **Schema:** Parquet stores the file schema in the file metadata.\n", 76 | "- **Column metadata:** Parquet stores metadata statistics for each column, which can make certain types of queries a lot more efficient.\n", 77 | "\n", 78 | " \n", 79 | "### Small motivation example: \n", 80 | "\n", 81 | "Let's see an example where we compare reading the same data but in one case it is stored as `csv` files, while the other as `parquet` files. 
" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "id": "d336f266-3094-45e0-a1f7-f71dfb5f8962", 87 | "metadata": {}, 88 | "source": [ 89 | "**Note - Windows Users**\n", 90 | "\n", 91 | "Unless you are using WSL, you will need to go to a command prompt or PowerShell window within an environment that includes coiled and run the following command from there.\n" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "id": "e205e7c7-1d13-4f16-9035-ade902f43fad", 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "### coiled login\n", 102 | "#!coiled login --token ### --account dask-tutorials" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "id": "bea1905f-09aa-4865-844e-d752639e29c2", 109 | "metadata": {}, 110 | "outputs": [], 111 | "source": [ 112 | "import coiled\n", 113 | "import dask\n", 114 | "import dask.dataframe as dd\n", 115 | "from dask.distributed import Client" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "id": "f1162a7b-969a-46c7-a6b2-bfef92cade59", 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "# we use this to avoid re-using clusters on a team\n", 126 | "import uuid\n", 127 | "\n", 128 | "id_cluster = uuid.uuid4().hex[:4]" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "id": "e0ac86a4-5336-4c28-98e9-64323e450863", 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "%%time\n", 139 | "cluster = coiled.Cluster(\n", 140 | " n_workers=10,\n", 141 | " name=f\"nyc-uber-lyft-{id_cluster}\",\n", 142 | " account=\"dask-tutorials\",\n", 143 | " worker_vm_types=[\"r6i.2xlarge\"],\n", 144 | " backend_options={\"region_name\": \"us-east-2\"},\n", 145 | ")" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "id": "254390a1-e055-446c-ab44-c8d76121928d", 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "client = Client(cluster)" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "id": "2c665a6e-650a-4668-9a0c-8ae53c9a795f", 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "client" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": null, 171 | "id": "6cb22fca-88e1-4019-9b26-d1f398b176bb", 172 | "metadata": {}, 173 | "outputs": [], 174 | "source": [ 175 | "# data dictionary\n", 176 | "data = {\n", 177 | " \"5GB-csv\": \"s3://coiled-datasets/h2o-benchmark/N_1e8_K_1e2/*.csv\",\n", 178 | " \"5GB-pq\": \"s3://coiled-datasets/h2o-benchmark/N_1e8_K_1e2_parquet/*.parquet\",\n", 179 | "}" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": null, 185 | "id": "d6829081-0fe8-432b-a126-254ba8139f82", 186 | "metadata": {}, 187 | "outputs": [], 188 | "source": [ 189 | "ddf_csv = dd.read_csv(data[\"5GB-csv\"], storage_options={\"anon\": True})\n", 190 | "ddf_pq = dd.read_parquet(data[\"5GB-pq\"], storage_options={\"anon\": True})" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": null, 196 | "id": "78de69de-fe56-46e4-b8b7-63936c7de7f1", 197 | "metadata": {}, 198 | "outputs": [], 199 | "source": [ 200 | "ddf_csv" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": null, 206 | "id": "d1a9742b-a915-4e7f-8d81-59a127bf9e9c", 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "ddf_pq" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "id": 
"f938ab3c-30da-416c-87a4-58128c7973b7", 217 | "metadata": {}, 218 | "outputs": [], 219 | "source": [ 220 | "%%time\n", 221 | "ddf_csv.groupby(\"id1\").agg({\"v1\": \"sum\"}).compute()" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "id": "84a6f2fb-b1ba-4b8d-ade5-be1c2c6f4a93", 228 | "metadata": {}, 229 | "outputs": [], 230 | "source": [ 231 | "%%time\n", 232 | "ddf_pq.groupby(\"id1\").agg({\"v1\": \"sum\"}).compute()" 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "id": "0530b707-cad2-4b43-8921-1eb821616822", 238 | "metadata": {}, 239 | "source": [ 240 | "### Memory usage \n", 241 | "\n", 242 | "Notice that the `parquet` version without doing much it is already ~5X faster. Let's take a look at the memory usage as well as the `dtypes` in both cases." 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": null, 248 | "id": "4be1cc4f-4656-4565-bd83-dfec45e613fe", 249 | "metadata": {}, 250 | "outputs": [], 251 | "source": [ 252 | "## memory usage for 1 partition\n", 253 | "ddf_csv.partitions[0].memory_usage(deep=True).compute().apply(dask.utils.format_bytes)" 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": null, 259 | "id": "011266f8-b9d7-46c4-b88d-574582f5528b", 260 | "metadata": {}, 261 | "outputs": [], 262 | "source": [ 263 | "ddf_pq.partitions[0].memory_usage(deep=True).compute().apply(dask.utils.format_bytes)" 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "id": "8e323f48-9bbb-46bd-af08-7d6ded1ccead", 269 | "metadata": {}, 270 | "source": [ 271 | "## Uber/Lyft data transformation\n", 272 | "\n", 273 | "In the example above we saw that the format in which the data is stored, already makes a big difference. \n", 274 | "\n", 275 | "**Working with parquet** \n", 276 | "\n", 277 | "Let's use the Uber/Lyft dataset, as an example of a `parquet` dataset to learn how to troubleshoot the nuances of working with real data. 
The data comes from [High-Volume For-Hire Services](https://www.nyc.gov/site/tlc/businesses/high-volume-for-hire-services.page)\n", 278 | "\n", 279 | "_Data dictionary:_\n", 280 | "\n", 281 | "https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_hvfhs.pdf" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": null, 287 | "id": "fda99d96-4b61-4bd8-882b-d7b457279276", 288 | "metadata": {}, 289 | "outputs": [], 290 | "source": [ 291 | "# inspect data\n", 292 | "import s3fs\n", 293 | "\n", 294 | "s3 = s3fs.S3FileSystem()\n", 295 | "files = s3.glob(\"nyc-tlc/trip data/fhvhv_tripdata_*.parquet\")\n", 296 | "files[:3]" 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": null, 302 | "id": "3af054b4-894c-4ff0-b5f2-e1dd9d117066", 303 | "metadata": {}, 304 | "outputs": [], 305 | "source": [ 306 | "len(files)" 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "id": "2b48f56f-2937-4b6d-8477-7c07721981c9", 312 | "metadata": {}, 313 | "source": [ 314 | "**Inspect the data**" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": null, 320 | "id": "882652d4-cc6b-454a-be90-d6b8111186ef", 321 | "metadata": {}, 322 | "outputs": [], 323 | "source": [ 324 | "ddf = dd.read_parquet(\n", 325 | " \"s3://nyc-tlc/trip data/fhvhv_tripdata_*.parquet\",\n", 326 | ")\n", 327 | "ddf" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "id": "6c371785-d8c3-4882-86fc-0c6c9fd8ce43", 333 | "metadata": {}, 334 | "source": [ 335 | "## Note:\n", 336 | "If you are having problems reading the data, it is because even though the data is public, it requires local AWS credentials. We will provide a set of credentials live for this tutorial\"\n", 337 | "Edit the code above as:\n", 338 | "\n", 339 | "```python\n", 340 | "s3_storage_options = {\"key\": \"***\", \"secret\": \"***\"}\n", 341 | "\n", 342 | "ddf = dd.read_parquet(\n", 343 | " \"s3://nyc-tlc/trip data/fhvhv_tripdata_*.parquet\",\n", 344 | " storage_options=s3_storage_options,\n", 345 | ")\n", 346 | "```" 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": null, 352 | "id": "3727d3f4-3544-4889-8499-d4ea867bbf85", 353 | "metadata": {}, 354 | "outputs": [], 355 | "source": [ 356 | "# inspect dtypes\n", 357 | "ddf.dtypes" 358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": null, 363 | "id": "70ea4e9a-d812-4d50-9f73-c34054b34007", 364 | "metadata": {}, 365 | "outputs": [], 366 | "source": [ 367 | "%%time\n", 368 | "# inspect memory usage of 1 partition\n", 369 | "ddf.partitions[0].memory_usage(deep=True).compute().apply(dask.utils.format_bytes)" 370 | ] 371 | }, 372 | { 373 | "cell_type": "markdown", 374 | "id": "0547a9bf-436a-4a49-8cdd-7d79f4c281f0", 375 | "metadata": {}, 376 | "source": [ 377 | "### Challenges:\n", 378 | "\n", 379 | "- Big partitions\n", 380 | "- Inefficient data types\n", 381 | "\n", 382 | "### Recommendations and best practices:\n", 383 | "\n", 384 | "**Partition size**\n", 385 | "\n", 386 | "In general we recommend starting with partitions that are in the order of ~100MB (in memory). However, the choice of the partition size can vary depending on the worker memory that you have available. 
\n", 387 | "\n", 388 | "For documentation on partition sizes visit the [repartition docs](https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.repartition.html) as well as the repartition section in the Dask Dataframe [best practices](https://docs.dask.org/en/stable/dataframe-best-practices.html#repartition-to-reduce-overhead)\n", 389 | "\n", 390 | "**Data Types**\n", 391 | "\n", 392 | "- Avoid object types for strings: use `\"string[pyarrow]\"`\n", 393 | "- Reduce int/float representation if possible\n", 394 | "- Use categorical dtypes when possible (avoid high cardinality).\n", 395 | "- Consider using Nullable dtypes (very new/experimental)\n", 396 | "\n", 397 | "### Create conversions dictionary\n", 398 | "\n", 399 | "Based on these recommendations, let's work on better `dtypes`" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": null, 405 | "id": "ac85d0af-b961-4fff-be71-b14b60cd52bd", 406 | "metadata": {}, 407 | "outputs": [], 408 | "source": [ 409 | "import pandas as pd" 410 | ] 411 | }, 412 | { 413 | "cell_type": "code", 414 | "execution_count": null, 415 | "id": "b76f77fb-67bd-4c3e-a6aa-3698365a0dd7", 416 | "metadata": {}, 417 | "outputs": [], 418 | "source": [ 419 | "dask.config.set({\"dataframe.convert-string\": True})" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": null, 425 | "id": "b60dc7af-5906-42c3-8371-c20153e00394", 426 | "metadata": {}, 427 | "outputs": [], 428 | "source": [ 429 | "ddf = dd.read_parquet(\n", 430 | " \"s3://nyc-tlc/trip data/fhvhv_tripdata_*.parquet\",\n", 431 | ")\n", 432 | "ddf" 433 | ] 434 | }, 435 | { 436 | "cell_type": "code", 437 | "execution_count": null, 438 | "id": "b606cb75-32f6-450b-9126-120faad85dff", 439 | "metadata": {}, 440 | "outputs": [], 441 | "source": [ 442 | "ddf.dtypes" 443 | ] 444 | }, 445 | { 446 | "cell_type": "code", 447 | "execution_count": null, 448 | "id": "a9bccadb-6f7a-4dba-a4fd-113bd96fbbc9", 449 | "metadata": {}, 450 | "outputs": [], 451 | "source": [ 452 | "conversions = {}\n", 453 | "for column, dtype in ddf.dtypes.items():\n", 454 | " if dtype == \"float64\":\n", 455 | " conversions[column] = \"float32\"\n", 456 | " if dtype == \"int64\":\n", 457 | " conversions[column] = \"int32\"\n", 458 | " if \"flag\" in column:\n", 459 | " conversions[column] = pd.CategoricalDtype(categories=[\"Y\", \"N\"])\n", 460 | " if column == \"airport_fee\":\n", 461 | " conversions[\n", 462 | " column\n", 463 | " ] = \"float32\" # noticed that this has floats and the is making it an object\n", 464 | "conversions" 465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": null, 470 | "id": "53645d0d-8cd9-4fe8-892a-b4c11f2b787d", 471 | "metadata": {}, 472 | "outputs": [], 473 | "source": [ 474 | "# use new dtypes this takes a bit of time\n", 475 | "ddf = ddf.astype(conversions)\n", 476 | "ddf = ddf.persist()" 477 | ] 478 | }, 479 | { 480 | "cell_type": "code", 481 | "execution_count": null, 482 | "id": "eed8a8ed-abb8-4fa4-8686-3a7a68bd28df", 483 | "metadata": {}, 484 | "outputs": [], 485 | "source": [ 486 | "ddf.partitions[0].memory_usage(deep=True).compute().apply(dask.utils.format_bytes)" 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "execution_count": null, 492 | "id": "f5970a21-6c8c-40dd-869e-07d998e7c2e3", 493 | "metadata": {}, 494 | "outputs": [], 495 | "source": [ 496 | "dask.utils.format_bytes(ddf.partitions[0].memory_usage(deep=True).compute().sum())" 497 | ] 498 | }, 499 | { 500 | "cell_type": "markdown", 501 | "id": 
"ec42ac57-72fd-4ff3-8416-dcdb287e25b7", 502 | "metadata": {}, 503 | "source": [ 504 | "### Repartition" 505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": null, 510 | "id": "4815f73c-0ae4-4b8b-84f2-1750cbaa969a", 511 | "metadata": {}, 512 | "outputs": [], 513 | "source": [ 514 | "ddf = ddf.repartition(partition_size=\"128MB\").persist()" 515 | ] 516 | }, 517 | { 518 | "cell_type": "code", 519 | "execution_count": null, 520 | "id": "d6c13761-4906-47a6-b138-1ec5875e1242", 521 | "metadata": {}, 522 | "outputs": [], 523 | "source": [ 524 | "dask.utils.format_bytes(ddf.memory_usage(deep=True).compute().sum())" 525 | ] 526 | }, 527 | { 528 | "cell_type": "code", 529 | "execution_count": null, 530 | "id": "329c807b-1dfe-4173-9d34-73bc8fc38695", 531 | "metadata": {}, 532 | "outputs": [], 533 | "source": [ 534 | "ddf.npartitions" 535 | ] 536 | }, 537 | { 538 | "cell_type": "code", 539 | "execution_count": null, 540 | "id": "589ca6bf-5bba-4444-8ca2-671f45dec7c7", 541 | "metadata": {}, 542 | "outputs": [], 543 | "source": [ 544 | "dask.utils.format_bytes(ddf.partitions[0].memory_usage(deep=True).compute().sum())" 545 | ] 546 | }, 547 | { 548 | "cell_type": "markdown", 549 | "id": "7c3fa314-86a2-4343-9a9a-c5b368e6a67a", 550 | "metadata": {}, 551 | "source": [ 552 | "### Other repartition options \n", 553 | "\n", 554 | "Sometimes, a repartition by size is not convenient for your use case. You can also repartition on a period of time if you have a timeseries with a datetime index. For example: if you where to need your data partition every `1d` you can do:\n", 555 | "\n", 556 | "```python\n", 557 | "ddf = ddf.set_index(\"request_datetime\")\n", 558 | "ddf = ddf.repartition(freq=\"1d\")\n", 559 | "```\n", 560 | "\n", 561 | "**Note:**\n", 562 | "Read more about repartition in the [dask documentation on this feature](https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.repartition.html#dask-dataframe-dataframe-repartition)" 563 | ] 564 | }, 565 | { 566 | "cell_type": "markdown", 567 | "id": "9603233c-7e0f-41bf-aa70-84c4d55a7f12", 568 | "metadata": {}, 569 | "source": [ 570 | "### Save data to and S3 bucket" 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": null, 576 | "id": "56a3cf0e-edde-4979-b892-5149d0dbccf7", 577 | "metadata": {}, 578 | "outputs": [], 579 | "source": [ 580 | "# creds to be provided in live.\n", 581 | "s3_storage_options = {\"key\": \"***\", \"secret\": \"***\"}" 582 | ] 583 | }, 584 | { 585 | "cell_type": "code", 586 | "execution_count": null, 587 | "id": "2ca7ac61-9346-4ff9-82bc-4d8024e22ea8", 588 | "metadata": {}, 589 | "outputs": [], 590 | "source": [ 591 | "usr_id = \"your_name\"" 592 | ] 593 | }, 594 | { 595 | "cell_type": "code", 596 | "execution_count": null, 597 | "id": "7b0d899c-2155-4872-ade7-d80a19f5190e", 598 | "metadata": {}, 599 | "outputs": [], 600 | "source": [ 601 | "ddf.to_parquet(\n", 602 | " f\"s3://dask-tutorials-datasets/{usr_id}/\",\n", 603 | " storage_options=s3_storage_options,\n", 604 | ")" 605 | ] 606 | }, 607 | { 608 | "cell_type": "code", 609 | "execution_count": null, 610 | "id": "1c4291e8-0dce-467b-95ce-badd2513b10c", 611 | "metadata": {}, 612 | "outputs": [], 613 | "source": [ 614 | "cluster.shutdown()\n", 615 | "client.close()" 616 | ] 617 | }, 618 | { 619 | "cell_type": "markdown", 620 | "id": "beeff184-6c3d-44bb-86f3-f071c1ab5781", 621 | "metadata": {}, 622 | "source": [ 623 | "## Let's do some data analysis\n", 624 | "\n", 625 | "Now we are at a stage that our whole dataset is ~80GB 
in memory. When it comes to exploring data we do not necessarily need the whole data set, we can work with a sample, as well as only select a subset of columns. One of the beauties of the parquet file format is **column pruning**\n", 626 | "\n", 627 | "Note: Keep in mind, that if you will do feature engineering, your data size will increase and having extra memory can help." 628 | ] 629 | }, 630 | { 631 | "cell_type": "markdown", 632 | "id": "287e76d2-04ce-4f5c-a620-39017f8b05a0", 633 | "metadata": {}, 634 | "source": [ 635 | "### Read data back\n", 636 | "\n", 637 | "After you save your data, you will want to read it back to do some data analysis or train a model. When reading data back, there are some caveats regarding the `dtypes`.\n", 638 | "\n", 639 | "- **Roundtriping for string pyarrow dtype** is not yet supported in pandas/dask. Hence when you read your data you need to tell pandas/dask to cast those columns as \"string[pyarrow]\" otherwise they'll be \"string[python]\". \n", 640 | "- **Nullable dtypes:** Using nullable dtypes is a fairly new feature and still under development, consider this experimental. Available in `dask >= 2022.12.0`\n", 641 | "\n", 642 | "**What are nullable dtypes?**\n", 643 | "\n", 644 | "Pandas (hence Dask) primarily uses NaN to represent missing data. Because NaN is a float, this forces an array of integers with any missing values to become floating point. In some cases, this may not matter much. But if your integer column is, say, an identifier, casting to float can be problematic. \n", 645 | "\n", 646 | "Nullable dtypes, allow you to work around this issue. \n", 647 | "\n", 648 | "If you want to read more about nullable dtypes, check the pandas [missing data docs](\n", 649 | "https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data)" 650 | ] 651 | }, 652 | { 653 | "cell_type": "markdown", 654 | "id": "00a71e42-785c-4d44-9ece-da7f138ff64c", 655 | "metadata": {}, 656 | "source": [ 657 | "NOTE: \n", 658 | "1. If you are in a live session you will be able to read the parquet files we stored, providing the credentials that we share with you live. \n", 659 | "2. 
If you are following this tutorial on your own the credentials will not work, but you can read a copy of the dataset we wrote, from `\"s3://coiled-datasets/uber-lyft-tlc/\"`" 660 | ] 661 | }, 662 | { 663 | "cell_type": "code", 664 | "execution_count": null, 665 | "id": "3476887b-5643-4e4a-9cdb-d2488692cdf7", 666 | "metadata": {}, 667 | "outputs": [], 668 | "source": [ 669 | "%%time\n", 670 | "cluster = coiled.Cluster(\n", 671 | " name=f\"uber-lyft-{id_cluster}\",\n", 672 | " n_workers=15,\n", 673 | " account=\"dask-tutorials\",\n", 674 | " worker_vm_types=[\"m6i.xlarge\"],\n", 675 | " backend_options={\"region_name\": \"us-east-2\"},\n", 676 | ")" 677 | ] 678 | }, 679 | { 680 | "cell_type": "code", 681 | "execution_count": null, 682 | "id": "304ba45b-5604-4bdd-8a33-7433cd9df26d", 683 | "metadata": {}, 684 | "outputs": [], 685 | "source": [ 686 | "client = Client(cluster)\n", 687 | "client" 688 | ] 689 | }, 690 | { 691 | "cell_type": "code", 692 | "execution_count": null, 693 | "id": "3d5f536a-2f94-406a-8716-91d43ebf1e26", 694 | "metadata": {}, 695 | "outputs": [], 696 | "source": [ 697 | "# use bucket where you wrote above if you are following from public session\n", 698 | "# or public data uri (\"s3://coiled-datasets/uber-lyft-tlc/\") otherwise\n", 699 | "file_to_read = (\n", 700 | " f\"s3://dask-tutorials-datasets/{usr_id}/\" # replace for public uri if needed\n", 701 | ")" 702 | ] 703 | }, 704 | { 705 | "cell_type": "code", 706 | "execution_count": null, 707 | "id": "15530dc1-703e-4491-9eec-c391cd663a34", 708 | "metadata": {}, 709 | "outputs": [], 710 | "source": [ 711 | "# if reading form public uri and in binder use storage_options=storage_options={'anon': True}\n", 712 | "df = dd.read_parquet(\n", 713 | " file_to_read, # replace for \"s3://coiled-datasets/uber-lyft-tlc/\" if needed\n", 714 | " storage_options = s3_storage_options,\n", 715 | ")" 716 | ] 717 | }, 718 | { 719 | "cell_type": "code", 720 | "execution_count": null, 721 | "id": "2e2900ba-cbdf-46fc-9ee4-9fbc69ea881c", 722 | "metadata": {}, 723 | "outputs": [], 724 | "source": [ 725 | "df.dtypes" 726 | ] 727 | }, 728 | { 729 | "cell_type": "markdown", 730 | "id": "80eb7dc4-e84e-40b9-9842-dad6670dad3b", 731 | "metadata": {}, 732 | "source": [ 733 | "## Memory usage \n", 734 | "\n", 735 | "```python\n", 736 | "dask.utils.format_bytes(\n", 737 | " df.memory_usage(deep=True).sum().compute()\n", 738 | ")\n", 739 | "```\n", 740 | "'82.81 GiB'\n" 741 | ] 742 | }, 743 | { 744 | "cell_type": "code", 745 | "execution_count": null, 746 | "id": "834816b0-5899-43a8-853e-ffc91a0a7da8", 747 | "metadata": {}, 748 | "outputs": [], 749 | "source": [ 750 | "df.head()" 751 | ] 752 | }, 753 | { 754 | "cell_type": "code", 755 | "execution_count": null, 756 | "id": "dff7b58b-97c5-4e06-9fdf-f2fe4fc979a3", 757 | "metadata": {}, 758 | "outputs": [], 759 | "source": [ 760 | "df.columns" 761 | ] 762 | }, 763 | { 764 | "cell_type": "code", 765 | "execution_count": null, 766 | "id": "d52ca6d7-4030-4103-b128-4263f95044c7", 767 | "metadata": {}, 768 | "outputs": [], 769 | "source": [ 770 | "# len(df)\n", 771 | "## 783_431_901" 772 | ] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "execution_count": null, 777 | "id": "b1262fb9-adaa-4e23-9b5f-7e50e62128ce", 778 | "metadata": {}, 779 | "outputs": [], 780 | "source": [ 781 | "%%time\n", 782 | "##let's count to see NaN\n", 783 | "df.count().compute()" 784 | ] 785 | }, 786 | { 787 | "cell_type": "code", 788 | "execution_count": null, 789 | "id": "eb55c7c8-bb9b-4fe5-9729-6b91f263a3ac", 790 | "metadata": { 791 
| "tags": [] 792 | }, 793 | "outputs": [], 794 | "source": [ 795 | "# Create a column tip > 0 = True\n", 796 | "df[\"tip_flag\"] = df.tips > 0\n", 797 | "\n", 798 | "df_small = df[\n", 799 | " [\n", 800 | " \"hvfhs_license_num\",\n", 801 | " \"tips\",\n", 802 | " \"base_passenger_fare\",\n", 803 | " \"driver_pay\",\n", 804 | " \"trip_miles\",\n", 805 | " \"trip_time\",\n", 806 | " \"shared_request_flag\",\n", 807 | " \"tip_flag\",\n", 808 | " ]\n", 809 | "].persist()" 810 | ] 811 | }, 812 | { 813 | "cell_type": "code", 814 | "execution_count": null, 815 | "id": "54707176-6ddf-4c3c-9870-7cdfd03f8335", 816 | "metadata": {}, 817 | "outputs": [], 818 | "source": [ 819 | "df_small.head()" 820 | ] 821 | }, 822 | { 823 | "cell_type": "code", 824 | "execution_count": null, 825 | "id": "fb809b7b-9cdf-4d54-b191-305a3c89eb97", 826 | "metadata": {}, 827 | "outputs": [], 828 | "source": [ 829 | "df_small.base_passenger_fare.sum().compute() / 1e9" 830 | ] 831 | }, 832 | { 833 | "cell_type": "code", 834 | "execution_count": null, 835 | "id": "286fc4f5-003b-47a9-8a56-fc227141e670", 836 | "metadata": {}, 837 | "outputs": [], 838 | "source": [ 839 | "df_small.driver_pay.sum().compute() / 1e9" 840 | ] 841 | }, 842 | { 843 | "cell_type": "code", 844 | "execution_count": null, 845 | "id": "0ca23fa1-ab0c-4426-8165-7742bcae7891", 846 | "metadata": {}, 847 | "outputs": [], 848 | "source": [ 849 | "df_small.tips.sum().compute() / 1e6" 850 | ] 851 | }, 852 | { 853 | "cell_type": "code", 854 | "execution_count": null, 855 | "id": "2fcaad19-5ad8-4ce8-a5e6-669a802c37e3", 856 | "metadata": {}, 857 | "outputs": [], 858 | "source": [ 859 | "df_small.columns" 860 | ] 861 | }, 862 | { 863 | "cell_type": "markdown", 864 | "id": "fc2110ff-d2fe-4a57-8556-f8381cd12d48", 865 | "metadata": {}, 866 | "source": [ 867 | "### Are New Yorkers tippers? 
\n", 868 | "\n", 869 | "Let's see how many trips have tip by provider " 870 | ] 871 | }, 872 | { 873 | "cell_type": "code", 874 | "execution_count": null, 875 | "id": "ac3192f3-7913-41e6-9b7b-ed01bb8a96c5", 876 | "metadata": {}, 877 | "outputs": [], 878 | "source": [ 879 | "tip_counts = df_small.groupby([\"hvfhs_license_num\"]).tip_flag.value_counts().compute()" 880 | ] 881 | }, 882 | { 883 | "cell_type": "code", 884 | "execution_count": null, 885 | "id": "13062d36-4168-41b7-b2c5-d6c8e62de44e", 886 | "metadata": {}, 887 | "outputs": [], 888 | "source": [ 889 | "tip_counts" 890 | ] 891 | }, 892 | { 893 | "cell_type": "markdown", 894 | "id": "fac3870d-0b82-47e9-856b-e15e77cf3425", 895 | "metadata": { 896 | "tags": [] 897 | }, 898 | "source": [ 899 | "**From the data dictionary we know:**\n", 900 | "\n", 901 | "As of September 2019, the HVFHS licenses are the following:\n", 902 | "\n", 903 | "- HV0002: Juno \n", 904 | "- HV0003: Uber \n", 905 | "- HV0004: Via \n", 906 | "- HV0005: Lyft " 907 | ] 908 | }, 909 | { 910 | "cell_type": "code", 911 | "execution_count": null, 912 | "id": "28c54f0e-0526-447e-8f5c-19fc6e6b1962", 913 | "metadata": {}, 914 | "outputs": [], 915 | "source": [ 916 | "type(tip_counts)" 917 | ] 918 | }, 919 | { 920 | "cell_type": "code", 921 | "execution_count": null, 922 | "id": "fd1d5b68-5e2e-487d-9db7-b1bc0c5b8e2d", 923 | "metadata": {}, 924 | "outputs": [], 925 | "source": [ 926 | "## this is a pandas\n", 927 | "tip_counts = tip_counts.unstack(level=\"tip_flag\")\n", 928 | "tip_counts / 1e6" 929 | ] 930 | }, 931 | { 932 | "cell_type": "markdown", 933 | "id": "e01315d8-0a69-47d1-a443-64bfa5d84aa7", 934 | "metadata": {}, 935 | "source": [ 936 | "### Percentage of total rides that tip" 937 | ] 938 | }, 939 | { 940 | "cell_type": "code", 941 | "execution_count": null, 942 | "id": "554de773-2424-44c5-b711-d1d4e4a6bad6", 943 | "metadata": {}, 944 | "outputs": [], 945 | "source": [ 946 | "tip_counts[True] * 100 / (tip_counts[True] + tip_counts[False])" 947 | ] 948 | }, 949 | { 950 | "cell_type": "markdown", 951 | "id": "3c42ee9b-b7db-4fde-8125-84c5aca7483c", 952 | "metadata": {}, 953 | "source": [ 954 | "### sum and mean of tips by provider " 955 | ] 956 | }, 957 | { 958 | "cell_type": "code", 959 | "execution_count": null, 960 | "id": "9b10f437-ba9d-41d3-bc43-0d4e96e9dcdb", 961 | "metadata": {}, 962 | "outputs": [], 963 | "source": [ 964 | "tips_total = (\n", 965 | " df_small.loc[lambda x: x.tip_flag]\n", 966 | " .groupby(\"hvfhs_license_num\")\n", 967 | " .tips.agg([\"sum\", \"mean\"])\n", 968 | " .compute()\n", 969 | ")\n", 970 | "tips_total" 971 | ] 972 | }, 973 | { 974 | "cell_type": "code", 975 | "execution_count": null, 976 | "id": "ef8c80f8-0d20-4bf4-a4ee-fa4a5325ce26", 977 | "metadata": {}, 978 | "outputs": [], 979 | "source": [ 980 | "provider = {\"HV0002\": \"Juno\", \"HV0005\": \"Lyft\", \"HV0003\": \"Uber\", \"HV0004\": \"Via\"}" 981 | ] 982 | }, 983 | { 984 | "cell_type": "code", 985 | "execution_count": null, 986 | "id": "776e2689-133d-4cfb-bd08-0ca34ee05f31", 987 | "metadata": {}, 988 | "outputs": [], 989 | "source": [ 990 | "tips_total = tips_total.assign(provider=lambda df: df.index.map(provider)).set_index(\n", 991 | " \"provider\"\n", 992 | ")\n", 993 | "tips_total" 994 | ] 995 | }, 996 | { 997 | "cell_type": "markdown", 998 | "id": "e63f9aad-c1e0-49c6-9de2-8d99e4bd44e2", 999 | "metadata": {}, 1000 | "source": [ 1001 | "### What percentage of the passenger fare is the tip" 1002 | ] 1003 | }, 1004 | { 1005 | "cell_type": "markdown", 1006 | "id": 
"8d089e27-053a-4f8c-820a-0fedb4d232ea", 1007 | "metadata": {}, 1008 | "source": [ 1009 | "### Exercise\n", 1010 | "- Create a new column named \"tip_percentage\" that represents the what fraction of the passenger fare is the tip" 1011 | ] 1012 | }, 1013 | { 1014 | "cell_type": "code", 1015 | "execution_count": null, 1016 | "id": "cec9faaf-f329-4ebb-8e60-d1fe95d22f2c", 1017 | "metadata": {}, 1018 | "outputs": [], 1019 | "source": [ 1020 | "# solution\n", 1021 | "tip_percentage = df_small.tips / df_small.base_passenger_fare\n", 1022 | "df_small[\"tip_percentage\"] = tip_percentage" 1023 | ] 1024 | }, 1025 | { 1026 | "cell_type": "code", 1027 | "execution_count": null, 1028 | "id": "5b14e09d-7a4b-454f-ab39-12cbcd5d7ef4", 1029 | "metadata": {}, 1030 | "outputs": [], 1031 | "source": [ 1032 | "df_small = df_small.persist()" 1033 | ] 1034 | }, 1035 | { 1036 | "cell_type": "markdown", 1037 | "id": "6c5aaa54-f344-464e-90af-4270c1792528", 1038 | "metadata": {}, 1039 | "source": [ 1040 | "## Tip percentage mean of trip with tip" 1041 | ] 1042 | }, 1043 | { 1044 | "cell_type": "code", 1045 | "execution_count": null, 1046 | "id": "0ab6fa5d-1b7b-4cd6-aa02-f5986784cd49", 1047 | "metadata": {}, 1048 | "outputs": [], 1049 | "source": [ 1050 | "tips_perc_mean = (\n", 1051 | " df_small.loc[lambda x: x.tip_flag]\n", 1052 | " .groupby(\"hvfhs_license_num\")\n", 1053 | " .tip_percentage.mean()\n", 1054 | " .compute()\n", 1055 | ")\n", 1056 | "tips_perc_mean" 1057 | ] 1058 | }, 1059 | { 1060 | "cell_type": "code", 1061 | "execution_count": null, 1062 | "id": "c16d4260-46a4-4a5a-842d-b80f3e0cc76b", 1063 | "metadata": {}, 1064 | "outputs": [], 1065 | "source": [ 1066 | "(tips_perc_mean.to_frame().set_index(tips_perc_mean.index.map(provider)))" 1067 | ] 1068 | }, 1069 | { 1070 | "cell_type": "markdown", 1071 | "id": "42ba5d51-9537-47ad-86db-3aa7a98a885a", 1072 | "metadata": {}, 1073 | "source": [ 1074 | "### Base pay per mile per - by provider\n" 1075 | ] 1076 | }, 1077 | { 1078 | "cell_type": "code", 1079 | "execution_count": null, 1080 | "id": "70eaed48-4b26-48ce-8036-fc26f4e13bad", 1081 | "metadata": {}, 1082 | "outputs": [], 1083 | "source": [ 1084 | "dollars_per_mile = df_small.base_passenger_fare / df_small.trip_miles\n", 1085 | "df_small[\"dollars_per_mile\"] = dollars_per_mile\n", 1086 | "df_small = df_small.persist()" 1087 | ] 1088 | }, 1089 | { 1090 | "cell_type": "code", 1091 | "execution_count": null, 1092 | "id": "fb6f2544-c41f-45dc-8f8a-f3863f369044", 1093 | "metadata": {}, 1094 | "outputs": [], 1095 | "source": [ 1096 | "(\n", 1097 | " df_small.groupby(\"hvfhs_license_num\")\n", 1098 | " .dollars_per_mile.agg([\"min\", \"max\", \"mean\", \"std\"])\n", 1099 | " .compute()\n", 1100 | ")" 1101 | ] 1102 | }, 1103 | { 1104 | "cell_type": "code", 1105 | "execution_count": null, 1106 | "id": "910754f2-478c-4eeb-99a5-8ba4b2cf5d75", 1107 | "metadata": {}, 1108 | "outputs": [], 1109 | "source": [ 1110 | "# filter: check only trips with tip\n", 1111 | "(\n", 1112 | " df_small.loc[lambda x: x.tip_flag]\n", 1113 | " .groupby(\"hvfhs_license_num\")\n", 1114 | " .dollars_per_mile.agg([\"min\", \"max\", \"mean\", \"std\"])\n", 1115 | " .compute()\n", 1116 | ")" 1117 | ] 1118 | }, 1119 | { 1120 | "cell_type": "markdown", 1121 | "id": "5508b41a-e186-4af6-8a35-2a659c3a6975", 1122 | "metadata": {}, 1123 | "source": [ 1124 | "### Get insight on the data\n", 1125 | "\n", 1126 | "We are seeing weird numbers, let's try to take a deeper look and remove some outliers" 1127 | ] 1128 | }, 1129 | { 1130 | "cell_type": "code", 
1131 | "execution_count": null, 1132 | "id": "15df0aab-84fb-45bc-af6d-d2f9892b38be", 1133 | "metadata": { 1134 | "tags": [] 1135 | }, 1136 | "outputs": [], 1137 | "source": [ 1138 | "(\n", 1139 | " df_small[[\"trip_miles\", \"base_passenger_fare\", \"tips\", \"tip_flag\"]]\n", 1140 | " .loc[lambda x: x.tip_flag]\n", 1141 | " .describe()\n", 1142 | " .compute()\n", 1143 | ")" 1144 | ] 1145 | }, 1146 | { 1147 | "cell_type": "markdown", 1148 | "id": "5e970925-c96d-40e5-a906-176a6f69a44c", 1149 | "metadata": {}, 1150 | "source": [ 1151 | "### Getting to know the data\n", 1152 | "\n", 1153 | "- How would you get more insights on the data?\n", 1154 | "- Can you visualize it?\n", 1155 | "\n", 1156 | "**Hint:** Get a small sample, like 0.1% of the data to plot ~700_000 rows (go smaller if needed depending on your machine), compute it and work with that pandas dataframe." 1157 | ] 1158 | }, 1159 | { 1160 | "cell_type": "code", 1161 | "execution_count": null, 1162 | "id": "21a73002-ab6b-4627-b5af-7ecdeedd0d78", 1163 | "metadata": { 1164 | "tags": [] 1165 | }, 1166 | "outputs": [], 1167 | "source": [ 1168 | "# needed to avoid plots from breaking\n", 1169 | "%matplotlib inline" 1170 | ] 1171 | }, 1172 | { 1173 | "cell_type": "code", 1174 | "execution_count": null, 1175 | "id": "f1abd1c5-8d60-4772-b1a9-4a180790c0d8", 1176 | "metadata": { 1177 | "tags": [] 1178 | }, 1179 | "outputs": [], 1180 | "source": [ 1181 | "## Take a sample\n", 1182 | "df_tiny = (\n", 1183 | " df_small.loc[lambda x: x.tip_flag][[\"trip_miles\", \"base_passenger_fare\", \"tips\"]]\n", 1184 | " .sample(frac=0.001)\n", 1185 | " .compute()\n", 1186 | ")" 1187 | ] 1188 | }, 1189 | { 1190 | "cell_type": "code", 1191 | "execution_count": null, 1192 | "id": "a46b9993-3009-4e5d-a0e4-2c8368ec0613", 1193 | "metadata": { 1194 | "tags": [] 1195 | }, 1196 | "outputs": [], 1197 | "source": [ 1198 | "# box plot\n", 1199 | "df_tiny.boxplot()" 1200 | ] 1201 | }, 1202 | { 1203 | "cell_type": "markdown", 1204 | "id": "500530b7-0f24-4566-b234-1f1a5acc2cc3", 1205 | "metadata": {}, 1206 | "source": [ 1207 | "### Cleaning up outliers\n", 1208 | "\n", 1209 | "- Play with the pandas dataframe `df_tiny` to get insights on good filters for the bigger dataframe. 
\n", 1210 | "\n", 1211 | "Hint: think about pandas dataframe quantiles [docs here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html)" 1212 | ] 1213 | }, 1214 | { 1215 | "cell_type": "code", 1216 | "execution_count": null, 1217 | "id": "0d674a25-104a-4506-8602-25e214023954", 1218 | "metadata": { 1219 | "tags": [] 1220 | }, 1221 | "outputs": [], 1222 | "source": [ 1223 | "df_tiny.tips.quantile([0.25, 0.75])" 1224 | ] 1225 | }, 1226 | { 1227 | "cell_type": "markdown", 1228 | "id": "d552cdcf-2ea9-4443-a9a5-b55523e33e85", 1229 | "metadata": {}, 1230 | "source": [ 1231 | "### Exercise\n", 1232 | "\n", 1233 | "- Calculate the first and third quantiles for `base_passenger_fare` and `trip_miles`" 1234 | ] 1235 | }, 1236 | { 1237 | "cell_type": "code", 1238 | "execution_count": null, 1239 | "id": "905ed15c-106a-42eb-bfee-59f6d12b84d6", 1240 | "metadata": { 1241 | "tags": [] 1242 | }, 1243 | "outputs": [], 1244 | "source": [ 1245 | "# solution\n", 1246 | "df_tiny.base_passenger_fare.quantile([0.25, 0.75])" 1247 | ] 1248 | }, 1249 | { 1250 | "cell_type": "code", 1251 | "execution_count": null, 1252 | "id": "f08228ab-5f07-460a-b268-d176e77aef94", 1253 | "metadata": { 1254 | "tags": [] 1255 | }, 1256 | "outputs": [], 1257 | "source": [ 1258 | "# solution\n", 1259 | "df_tiny.trip_miles.quantile([0.25, 0.75])" 1260 | ] 1261 | }, 1262 | { 1263 | "cell_type": "markdown", 1264 | "id": "b0ee5c4c-3423-443c-8ce9-d0c597aa0758", 1265 | "metadata": {}, 1266 | "source": [ 1267 | "### Conditions to filter the dataset\n", 1268 | "\n", 1269 | "We can use the information of Q1 and Q3 to create contions to filter the dataset" 1270 | ] 1271 | }, 1272 | { 1273 | "cell_type": "code", 1274 | "execution_count": null, 1275 | "id": "8fcd0e94-2c7a-423d-a410-e43eedecb3f7", 1276 | "metadata": {}, 1277 | "outputs": [], 1278 | "source": [ 1279 | "tips_filter_vals = df_tiny.tips.quantile([0.25, 0.75]).values\n", 1280 | "tips_condition = df_tiny.tips.between(*tips_filter_vals)" 1281 | ] 1282 | }, 1283 | { 1284 | "cell_type": "code", 1285 | "execution_count": null, 1286 | "id": "1133eddf-b478-4f61-9889-3f41e9050689", 1287 | "metadata": { 1288 | "tags": [] 1289 | }, 1290 | "outputs": [], 1291 | "source": [ 1292 | "tips_condition" 1293 | ] 1294 | }, 1295 | { 1296 | "cell_type": "markdown", 1297 | "id": "da2dfb9b-af06-4183-8ea5-a30dc02bdd5f", 1298 | "metadata": {}, 1299 | "source": [ 1300 | "### Exercise\n", 1301 | "\n", 1302 | "- Create filter conditions for the `base_passenger_fare` and `trip_miles`" 1303 | ] 1304 | }, 1305 | { 1306 | "cell_type": "code", 1307 | "execution_count": null, 1308 | "id": "89d5bdc4-1685-471d-898e-9ed239ad281b", 1309 | "metadata": { 1310 | "tags": [] 1311 | }, 1312 | "outputs": [], 1313 | "source": [ 1314 | "## Solution\n", 1315 | "fare_filter_vals = df_tiny.base_passenger_fare.quantile([0.25, 0.75]).values\n", 1316 | "fares_condition = df_tiny.base_passenger_fare.between(*fare_filter_vals)\n", 1317 | "\n", 1318 | "miles_filter_vals = df_tiny.trip_miles.quantile([0.25, 0.75]).values\n", 1319 | "miles_condition = df_tiny.trip_miles.between(*miles_filter_vals)" 1320 | ] 1321 | }, 1322 | { 1323 | "cell_type": "markdown", 1324 | "id": "2dacba7f-bb18-4027-99e2-43a7c91ebd21", 1325 | "metadata": {}, 1326 | "source": [ 1327 | "### Filter dataframe and plot" 1328 | ] 1329 | }, 1330 | { 1331 | "cell_type": "code", 1332 | "execution_count": null, 1333 | "id": "6a82012b-ee3a-4df9-b24b-877f114a207a", 1334 | "metadata": { 1335 | "tags": [] 1336 | }, 1337 | "outputs": [], 1338 | 
"source": [ 1339 | "# solution\n", 1340 | "df_tiny.loc[(tips_condition & fares_condition) & miles_condition].boxplot()" 1341 | ] 1342 | }, 1343 | { 1344 | "cell_type": "markdown", 1345 | "id": "d1337177-5c12-4f73-bc68-f9156c1bbc6c", 1346 | "metadata": {}, 1347 | "source": [ 1348 | "## Filtering our big dataset based on the insights\n", 1349 | "\n", 1350 | "Based on these numbers let's go back to our `df_small` dataset and try to filter it.\n", 1351 | "\n", 1352 | "**Note:**\n", 1353 | "\n", 1354 | "Sometimes when you are trying to filter and you have been doing feature engineering, you might get a divisions not known error.\n", 1355 | "If that's the case you can do \n", 1356 | "\n", 1357 | "```python\n", 1358 | "df_small = df_small.reset_index()\n", 1359 | "df_small = (df_small\n", 1360 | " .set_index(\"column_to_be_the_index\")\n", 1361 | " .persist()\n", 1362 | " )\n", 1363 | "```" 1364 | ] 1365 | }, 1366 | { 1367 | "cell_type": "code", 1368 | "execution_count": null, 1369 | "id": "75a5a9be-5e13-4f0d-8c76-13ab7caea857", 1370 | "metadata": {}, 1371 | "outputs": [], 1372 | "source": [ 1373 | "tips_condition = df_small.tips.between(*tips_filter_vals)\n", 1374 | "miles_condition = df_small.trip_miles.between(*miles_filter_vals)\n", 1375 | "fares_condition = df_small.base_passenger_fare.between(*fare_filter_vals)" 1376 | ] 1377 | }, 1378 | { 1379 | "cell_type": "code", 1380 | "execution_count": null, 1381 | "id": "879314c1-b9de-483f-af5d-59347db07f61", 1382 | "metadata": {}, 1383 | "outputs": [], 1384 | "source": [ 1385 | "df_small = df_small.loc[(tips_condition & fares_condition) & miles_condition].persist()" 1386 | ] 1387 | }, 1388 | { 1389 | "cell_type": "markdown", 1390 | "id": "39de8d6c-b94f-4b26-9f5e-5758f37efe1c", 1391 | "metadata": {}, 1392 | "source": [ 1393 | "### Stats on `dollars_per_mile`" 1394 | ] 1395 | }, 1396 | { 1397 | "cell_type": "code", 1398 | "execution_count": null, 1399 | "id": "e33c8db5-e8f1-4559-bc02-deb716e19fcc", 1400 | "metadata": {}, 1401 | "outputs": [], 1402 | "source": [ 1403 | "(\n", 1404 | " df_small.groupby(\"hvfhs_license_num\")\n", 1405 | " .dollars_per_mile.agg([\"min\", \"max\", \"mean\", \"std\"])\n", 1406 | " .compute()\n", 1407 | ")" 1408 | ] 1409 | }, 1410 | { 1411 | "cell_type": "markdown", 1412 | "id": "e4933836-e833-4787-b561-f2d0bb5f76ee", 1413 | "metadata": {}, 1414 | "source": [ 1415 | "### Let's look at the `tip_percentage` again" 1416 | ] 1417 | }, 1418 | { 1419 | "cell_type": "markdown", 1420 | "id": "aded0737-d5bf-4181-b9aa-2c5c72435d06", 1421 | "metadata": {}, 1422 | "source": [ 1423 | "### Exercise \n", 1424 | "- Compute the `tip_percentage` mean by provider " 1425 | ] 1426 | }, 1427 | { 1428 | "cell_type": "code", 1429 | "execution_count": null, 1430 | "id": "9e745175-5725-4b92-a58d-4f870a3d3ace", 1431 | "metadata": {}, 1432 | "outputs": [], 1433 | "source": [ 1434 | "#Solution\n", 1435 | "tips_perc_avg = df_small.groupby(\"hvfhs_license_num\").tip_percentage.mean().compute()\n", 1436 | "tips_perc_avg" 1437 | ] 1438 | }, 1439 | { 1440 | "cell_type": "code", 1441 | "execution_count": null, 1442 | "id": "46be3c7a-8128-4ab1-b8f1-3b398faa4238", 1443 | "metadata": {}, 1444 | "outputs": [], 1445 | "source": [ 1446 | "(tips_perc_avg.to_frame().set_index(tips_perc_avg.index.map(provider)))" 1447 | ] 1448 | }, 1449 | { 1450 | "cell_type": "code", 1451 | "execution_count": null, 1452 | "id": "1538a243-8de1-4dde-82f0-dd5434b21f11", 1453 | "metadata": {}, 1454 | "outputs": [], 1455 | "source": [ 1456 | "len(df_small)" 1457 | ] 1458 | }, 1459 | { 
1460 | "cell_type": "markdown", 1461 | "id": "c3665fc2-8b8e-4e0e-a656-e47cefbafb86", 1462 | "metadata": {}, 1463 | "source": [ 1464 | "### Average trip time by provider" 1465 | ] 1466 | }, 1467 | { 1468 | "cell_type": "code", 1469 | "execution_count": null, 1470 | "id": "176f6aa4-8eec-4d4b-9959-678fa61d50ea", 1471 | "metadata": { 1472 | "tags": [] 1473 | }, 1474 | "outputs": [], 1475 | "source": [ 1476 | "trips_time_avg = (\n", 1477 | " df_small.groupby(\"hvfhs_license_num\")\n", 1478 | " .trip_time.agg([\"min\", \"max\", \"mean\", \"std\"])\n", 1479 | " .compute()\n", 1480 | ")\n", 1481 | "trips_time_avg" 1482 | ] 1483 | }, 1484 | { 1485 | "cell_type": "markdown", 1486 | "id": "50ade8e4-2fd4-4cf4-b533-033c694ae04d", 1487 | "metadata": {}, 1488 | "source": [ 1489 | "### In minutes" 1490 | ] 1491 | }, 1492 | { 1493 | "cell_type": "code", 1494 | "execution_count": null, 1495 | "id": "b3715c5a-7520-42c3-a41c-c1295a19d5ff", 1496 | "metadata": { 1497 | "tags": [] 1498 | }, 1499 | "outputs": [], 1500 | "source": [ 1501 | "trips_time_avg.set_index(trips_time_avg.index.map(provider)) / 60" 1502 | ] 1503 | }, 1504 | { 1505 | "cell_type": "markdown", 1506 | "id": "bf7fdf77-320b-4d87-802c-1f260867a141", 1507 | "metadata": {}, 1508 | "source": [ 1509 | "## What we've learned\n", 1510 | "- Most New Yorkers do not tip\n", 1511 | "- But it looks like of those who tip, it is common to tip around 20% regardless of the provider. Unless it's Via, they tend to tip slightly less.\n", 1512 | "- The trip_time column needs some cleaning of outliers. " 1513 | ] 1514 | }, 1515 | { 1516 | "cell_type": "code", 1517 | "execution_count": null, 1518 | "id": "451a0ec9-fc40-4f23-bd8a-69803f0a6fe2", 1519 | "metadata": {}, 1520 | "outputs": [], 1521 | "source": [ 1522 | "cluster.shutdown()\n", 1523 | "client.close()" 1524 | ] 1525 | }, 1526 | { 1527 | "cell_type": "markdown", 1528 | "id": "1c592b6f-6386-4e82-9cb2-b74637846a22", 1529 | "metadata": {}, 1530 | "source": [ 1531 | "### Useful links\n", 1532 | "\n", 1533 | "- https://tutorial.dask.org/01_dataframe.html\n", 1534 | "\n", 1535 | "**Useful links**\n", 1536 | "\n", 1537 | "* [DataFrames documentation](https://docs.dask.org/en/stable/dataframe.html)\n", 1538 | "* [Dataframes and parquet](https://docs.dask.org/en/stable/dataframe-parquet.html)\n", 1539 | "* [Dataframes examples](https://examples.dask.org/dataframe.html)\n", 1540 | "\n", 1541 | "### Other lesson\n", 1542 | "\n", 1543 | "Register [here](https://www.coiled.io/tutorials) for reminders. \n", 1544 | "\n", 1545 | "We have another lesson, where we’ll parallelize a custom Python workflow that scrapes, parses, and cleans data from Stack Overflow. 
We’ll get to: ‍\n", 1546 | "\n", 1547 | "- Learn how to do arbitrary task scheduling using the Dask Futures API\n", 1548 | "- Utilize blocking and non-blocking distributed calculations\n", 1549 | "\n", 1550 | "By the end, we’ll see how much faster this workflow is using Dask and how the Dask Futures API is particularly well-suited for this type of fine-grained execution.\n" 1551 | ] 1552 | } 1553 | ], 1554 | "metadata": { 1555 | "kernelspec": { 1556 | "display_name": "Python 3 (ipykernel)", 1557 | "language": "python", 1558 | "name": "python3" 1559 | }, 1560 | "language_info": { 1561 | "codemirror_mode": { 1562 | "name": "ipython", 1563 | "version": 3 1564 | }, 1565 | "file_extension": ".py", 1566 | "mimetype": "text/x-python", 1567 | "name": "python", 1568 | "nbconvert_exporter": "python", 1569 | "pygments_lexer": "ipython3", 1570 | "version": "3.10.0" 1571 | } 1572 | }, 1573 | "nbformat": 4, 1574 | "nbformat_minor": 5 1575 | } 1576 | --------------------------------------------------------------------------------