├── .dask └── config.yaml ├── .gitignore ├── 00-setup.ipynb ├── 01-introduction.ipynb ├── 02-dataframe.ipynb ├── README.md ├── binder ├── environment.yml └── start └── environment.yml /.dask/config.yaml: -------------------------------------------------------------------------------- 1 | distributed: 2 | logging: 3 | bokeh: critical 4 | 5 | dashboard: 6 | link: "{JUPYTERHUB_BASE_URL}user/{JUPYTERHUB_USER}/proxy/{port}/status" 7 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints 2 | .DS_Store 3 | dask-worker-space 4 | -------------------------------------------------------------------------------- /00-setup.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Setup\n", 8 | "\n", 9 | "## Notebook Objectives\n", 10 | "\n", 11 | "* **Introduction to JupyterLab**, a web-based IDE for interactive Python.\n", 12 | "* Show how to **complete exercises**.\n" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "## Introduction to JupyterLab" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "**JupyterLab** is the web-based interactive development environment for Jupyter Notebooks. Jupyter Notebooks are web applications for creating and sharing documents that combine live code, visualizations, and text." 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 1, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "from time import sleep" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 2, 41 | "metadata": {}, 42 | "outputs": [ 43 | { 44 | "name": "stdout", 45 | "output_type": "stream", 46 | "text": [ 47 | "CPU times: user 603 µs, sys: 957 µs, total: 1.56 ms\n", 48 | "Wall time: 2 s\n" 49 | ] 50 | }, 51 | { 52 | "data": { 53 | "text/plain": [ 54 | "32" 55 | ] 56 | }, 57 | "execution_count": 2, 58 | "metadata": {}, 59 | "output_type": "execute_result" 60 | } 61 | ], 62 | "source": [ 63 | "%%time\n", 64 | "\n", 65 | "def inc(x):\n", 66 | " sleep(1)\n", 67 | " return x + 1\n", 68 | "\n", 69 | "a = inc(10)\n", 70 | "b = inc(20)\n", 71 | "\n", 72 | "a + b" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "**IPython** is the interactive shell for Python, powering JupyterLab and Jupyter Notebooks."
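,
"\n",
"\n",
"One handy IPython feature is its *magic commands*. The cell above already uses the `%%time` cell magic; the short sketch below (added here for illustration, not part of the original notebook) uses the closely related `%time` line magic to time a single statement:\n",
"\n",
"```python\n",
"from time import sleep\n",
"\n",
"# %time reports the wall time of a single statement; %%time times a whole cell\n",
"%time sleep(1)\n",
"```"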
80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 3, 85 | "metadata": {}, 86 | "outputs": [ 87 | { 88 | "data": { 89 | "text/plain": [ 90 | "'Hello World!'" 91 | ] 92 | }, 93 | "execution_count": 3, 94 | "metadata": {}, 95 | "output_type": "execute_result" 96 | } 97 | ], 98 | "source": [ 99 | "hi = \"Hello World!\"\n", 100 | "\n", 101 | "hi" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "### Keyboard Shortcuts for JupyterLab\n", 109 | "\n", 110 | "* `Shift+Enter` (`Shift+Return` in macOS) for executing a cell\n", 111 | "* `A` for inserting a cell above the current cell\n", 112 | "* `B` for inserting a cell below the current cell\n", 113 | "* `M` for switching the current cell to Markdown\n", 114 | "* `Y` for switching the current cell to code\n", 115 | "* `Ctrl+S` (`Cmd+S` in macOS) to save the state of the notebook" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "## Completing Exercises\n", 123 | "\n", 124 | "Every notebook has some exercises that consist of:\n", 125 | "* a question\n", 126 | "* a blank cell where you write your answer\n", 127 | "* a hidden cell with the answer that can be revealed by clicking on the ellipsis" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "**Question:** Write a Python function to add two numbers." 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "# Your answer goes here" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": 1, 149 | "metadata": { 150 | "jupyter": { 151 | "source_hidden": true 152 | } 153 | }, 154 | "outputs": [], 155 | "source": [ 156 | "# Answer 1\n", 157 | "\n", 158 | "def sum(a, b):\n", 159 | " return a+b" 160 | ] 161 | } 162 | ], 163 | "metadata": { 164 | "kernelspec": { 165 | "display_name": "Python 3", 166 | "language": "python", 167 | "name": "python3" 168 | }, 169 | "language_info": { 170 | "codemirror_mode": { 171 | "name": "ipython", 172 | "version": 3 173 | }, 174 | "file_extension": ".py", 175 | "mimetype": "text/x-python", 176 | "name": "python", 177 | "nbconvert_exporter": "python", 178 | "pygments_lexer": "ipython3", 179 | "version": "3.8.8" 180 | } 181 | }, 182 | "nbformat": 4, 183 | "nbformat_minor": 4 184 | } 185 | -------------------------------------------------------------------------------- /01-introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Introduction\n", 8 | "\n", 9 | "## Notebook Objectives\n", 10 | "* **Spin up a Dask cluster.** A cluster consists of a scheduler (that manages the flow of work) and workers (that perform the actual computations).\n", 11 | "* **Introduction to Dask Delayed API**, an interface for parallelizing Python operations." 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "## Spin up a Dask Cluster\n", 19 | "\n", 20 | "Spin up a new cluster with the following code. You can specify the number of workers with `n_workers`." 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 1, 26 | "metadata": {}, 27 | "outputs": [ 28 | { 29 | "data": { 30 | "text/html": [ 31 | "\n", 32 | "\n", 33 | "\n", 40 | "\n", 48 | "\n", 49 | "
\n", 34 | "

Client

\n", 35 | "\n", 39 | "
\n", 41 | "

Cluster

\n", 42 | "
    \n", 43 | "
  • Workers: 4
  • \n", 44 | "
  • Cores: 12
  • \n", 45 | "
  • Memory: 17.18 GB
  • \n", 46 | "
\n", 47 | "
" 50 | ], 51 | "text/plain": [ 52 | "" 53 | ] 54 | }, 55 | "execution_count": 1, 56 | "metadata": {}, 57 | "output_type": "execute_result" 58 | } 59 | ], 60 | "source": [ 61 | "from dask.distributed import Client\n", 62 | "\n", 63 | "client = Client(n_workers=4)\n", 64 | "\n", 65 | "client" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "Here the 4 workers have 12 cores overall and 17GB of memory to use (might vary on your machine!).\n", 73 | "\n", 74 | "The Daskboard link takes you to Dask's diagnostic dashboard that contains real-time information about the state of your cluster." 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "Always remember to close the session with:" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 2, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "client.close()" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "## Introduction to **Dask Delayed API**\n", 98 | "\n", 99 | "Dask Delayed is a low-level collection that can be used to parallelize most python operations.\n", 100 | "\n", 101 | "For example, consider the following functions for incrementing a number and adding two numbers." 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 3, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "from time import sleep\n", 111 | "\n", 112 | "def inc(x):\n", 113 | " sleep(1)\n", 114 | " return x + 1\n", 115 | "\n", 116 | "def add(x, y):\n", 117 | " sleep(1)\n", 118 | " return x + y" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 4, 124 | "metadata": {}, 125 | "outputs": [ 126 | { 127 | "name": "stdout", 128 | "output_type": "stream", 129 | "text": [ 130 | "CPU times: user 1.1 ms, sys: 1.12 ms, total: 2.22 ms\n", 131 | "Wall time: 3.01 s\n" 132 | ] 133 | }, 134 | { 135 | "data": { 136 | "text/plain": [ 137 | "22" 138 | ] 139 | }, 140 | "execution_count": 4, 141 | "metadata": {}, 142 | "output_type": "execute_result" 143 | } 144 | ], 145 | "source": [ 146 | "%%time\n", 147 | "\n", 148 | "a = 10\n", 149 | "b = 10\n", 150 | "\n", 151 | "a = inc(a)\n", 152 | "b = inc(b)\n", 153 | "\n", 154 | "c = add(a, b)\n", 155 | "c" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": {}, 161 | "source": [ 162 | "This can be parallelized using Dask Delayed." 
163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 5, 168 | "metadata": {}, 169 | "outputs": [], 170 | "source": [ 171 | "from dask import delayed" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": 6, 177 | "metadata": {}, 178 | "outputs": [ 179 | { 180 | "name": "stdout", 181 | "output_type": "stream", 182 | "text": [ 183 | "CPU times: user 773 µs, sys: 612 µs, total: 1.39 ms\n", 184 | "Wall time: 909 µs\n" 185 | ] 186 | }, 187 | { 188 | "data": { 189 | "text/plain": [ 190 | "Delayed('add-9158607d-7eb8-4415-b777-ec3030d51285')" 191 | ] 192 | }, 193 | "execution_count": 6, 194 | "metadata": {}, 195 | "output_type": "execute_result" 196 | } 197 | ], 198 | "source": [ 199 | "%%time\n", 200 | "\n", 201 | "x = delayed(inc)(10)\n", 202 | "y = delayed(inc)(10)\n", 203 | "\n", 204 | "z = delayed(add)(x, y)\n", 205 | "z" 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "metadata": {}, 211 | "source": [ 212 | "### Lazy evaluation\n", 213 | "\n", 214 | "The above code does not compute the result because Dask's Delayed API evaluates _lazily_. Lazy evaluation refers to the paradigm of generating the entire task graph first and evaluating it only when necessary.\n", 215 | "\n", 216 | "To evaluate and get an output, you can use the `compute()` method." 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 7, 222 | "metadata": {}, 223 | "outputs": [ 224 | { 225 | "name": "stdout", 226 | "output_type": "stream", 227 | "text": [ 228 | "CPU times: user 4.87 ms, sys: 3.21 ms, total: 8.08 ms\n", 229 | "Wall time: 2.01 s\n" 230 | ] 231 | }, 232 | { 233 | "data": { 234 | "text/plain": [ 235 | "22" 236 | ] 237 | }, 238 | "execution_count": 7, 239 | "metadata": {}, 240 | "output_type": "execute_result" 241 | } 242 | ], 243 | "source": [ 244 | "%%time\n", 245 | "\n", 246 | "z.compute()" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": {}, 252 | "source": [ 253 | "### Task Graph\n", 254 | "\n", 255 | "As mentioned earlier, the task graph determines how the computation will be executed in parallel. To view the task graph, you can call `visualize()` on objects that are backed by the Delayed API."
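,
"\n",
"\n",
"Because the whole task graph is known before anything runs, several lazy results can also be evaluated together and Dask can share intermediate work between them. A minimal sketch (added for illustration, not part of the original notebook), reusing the `inc`, `add`, and `delayed` definitions from the cells above:\n",
"\n",
"```python\n",
"import dask\n",
"\n",
"x = delayed(inc)(1)\n",
"y = delayed(inc)(2)\n",
"z = delayed(add)(x, y)\n",
"\n",
"# dask.compute evaluates many lazy objects in a single pass\n",
"dask.compute(x, y, z)  # (2, 3, 5)\n",
"```"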
256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": 8, 261 | "metadata": {}, 262 | "outputs": [ 263 | { 264 | "data": { 265 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAALMAAAF2CAYAAAAlRqlAAAAABmJLR0QA/wD/AP+gvaeTAAAgAElEQVR4nO3deVxU9f4/8NdsyDKImmsoYigoivuS2cPUruVaUKa551ZW6u1e00KzfNzsmnu5lFrXElxBXFpwKRFyRdTcUkDU3NJElGVwYZh5//7oCz+JRQbOzGfmc97Px4M/HA7nvHz7cjizfY6GiAiMub5oregEjCmFy8ykwWVm0tCLDqC0q1ev4sCBA6JjOL2BAweKjqA4jWwPAKOiojBo0CDRMZyeZP/sgMwPAImIv0r42rhxo+h/GruRtsxMfbjMTBpcZiYNLjOTBpeZSYPLzKTBZWbS4DIzaXCZmTS4zEwaXGYmDS4zkwaXmUmDy8ykwWVm0uAyM2lwmZk0uMxMGlxmJg0uM5MGl5lJg8vMpMFlZtLgMjNpcJmZNLjMTBpcZiYNLjOTBpeZSYPLzKTBZWbS4DIzaXCZmTS4zEwaXGYmDS4zkwaXmUmDy8ykwWVm0uAyM2lwmZk0uMxMGlxmJg0uM5MGl9kBLl68KDqCKuhFB7CXqKgo0REAACaTCV988QWmTp0qOgoA4ODBg6Ij2I20ZR40aJDoCEU4Wx4ZSXeaMXDgQBCR03x17twZALB06VLhWR7+kpF0ZXYmV65cwaFDhwAAERERgtPIj8tsRxs2bIBOpwMAJCUl8QNBO+My21FERAQsFgsAQK/XY+PGjYITyY3LbCfJyck4ffp04fmp2WzG6tWrBaeSG5fZTtavXw+DwVDktoKCM/vgMttJZGQkzGZzkdvc3NywYcMGQYnkx2W2g9Ie7OXl5eGbb76R9qkx0bjMdlDSKUaBP/74A4mJiQ5OpA5cZoVZrVasWbOm2ClGATc3N6xfv97BqdSBy6yw+Ph4pKenl/r9vLw8REZGIj8/34Gp1IHLrLB169aVeopR4M6dO9izZ4+DEqkHl1lBeXl52LRpEywWCwwGAwwGA/R6PfR6feGfC4rOpxrKk/ZdcyJkZ2dj7ty5RW47cuQIvvrqK6xYsaLI7VWrVnVkNFXQED9PZFdRUVEYNGgQPx1nf9F8msGkwWVm0uAyM2lwmZk0uMxMGlxmJg0uM5MGl5lJg8vMpMFlZtLgMjNpcJmZNLjMTBpcZiYNLjOTBpeZSYPLzKTBZWbS4DIzaXCZmTS4zEwaXGYmDS4zkwaXmUmDy8ykwWVm0uAyM2lwmZk0uMxMGlxmJg0uM5MGLzausLy8PJw/fx7Xrl2DyWTCvn37AADR0dHw8vKCl5cXAgICUL9+fcFJ5cOLjVfSmTNnEBcXh/j4eJw8eRIXL14scvEdg8EAT09PZGVlFfk5o9GIoKAgdOzYEd27d0f37t1Rs2ZNR8eXSTSXuQLOnDmDyMhIrF27FleuXIGPjw+6du2KDh06ICgoCIGBgfDz84PRaISbm1vhz5lMJphMJqSmpiI1NRXJycnYt28fjhw5AqvVis6dO2PEiBEYOHAgqlevLvBv6JKiQaxcrFYrbd26lTp37kwAyM/Pj6ZNm0aJiYmUn59fqX1nZWXR1q1baciQIeTp6UlVqlSh1157jc6ePatQelWI4jKXQ0xMDLVo0YI0Gg2FhoZSXFwcWSwWuxwrOzubvv76a2ratClptVp65ZVXKDU11S7HkgyXuSznzp2jXr16kUajocGDB9Pp06cddmyLxULR0dHUokULqlKlCs2YMYPu3r3rsOO7IC5zSaxWK3322Wfk7u5OLVu2pL179wrLYjabaeHChVS1alVq3LgxJSUlCcvi5LjMf3f79m0KDQ0lvV5Ps2bNIrPZLDoSERFdu3aNevbsSVWqVKHFixeLjuOMuMwPO3/+PDVu3Jh8fX3pl19+ER2nGIvFQh9//DHpdDoaOXKk0/xHcxJc5gLHjx+nevXqUdu2benGjRui45Rp+/bt5OXlRX379qXc3FzRcZxFFL+cDeD48ePo1q0bmjVrhj179qBOnTqiI5WpV69e2L17NxITE9GnTx/cv39fdCSnoPoyX7hwAb1790a7du0QGxvrMte07tSpE+Li4nDy5EkMHjwYFotFdCThVF3mjIwMPPfcc/D19cWWLVtQpUoV0ZFsEhISgu+//x47d+7ExIkTRccRTrVlJiK89tprMJvNiI2Nhbe3t+hIFdKlSxesXbsWy5cvx9q1a0XHEUv0Wbso8+bNI71eT/v27RMdRRHvvPMOGY1GOnPmjOgookSp8o1GZ8+eRevWrfGf//wH7733nug4isjLy8PTTz8NnU6H/fv3Q6tV3S9ddb5r7tlnn0VmZiYOHz4MnU4nOo5ifvvtN7Rp0wbLli3DuHHjRMdxtGjV/fddv3494uPj8cUXX0hVZABo3rw5Jk6ciPDwcGRkZIiO43Cqume2WCxo1qwZunTpgm+++UZ0HLvIyclBQEAAXn/9dcyaNUt0HEdS1z1zdHQ0Lly4gPDwcNFR7Mbb2xvvvPMOlixZgszMTNFxHEpVZZ4zZw5eeeUVBAYGio5iV2+//Ta0Wi2+/PJL0VEcSjVlPnLkCI4fP4533nlHdBS78/HxwahRo7Bq1Sqo6CxSPWWOiIhAkyZN0KlTJ9FRHGL48OFIS0vDwYMHRUdxGFWU2WKxYMOGDRg5cqToKA7Tpk0bhISEYM2aNaKjOIwqynz06FGkp6cjLCxMdBSHCg0Nxc6dO0XHcBhVlDkuLg516tRBs2bNhBzfZDLh+++/L/PVxvJsY6sePXrgwoUL+P333xXbpzNTRZnj4+PRo0cPaDQaIcffsWMHJk2ahA0bNlRqG1t17twZHh4e2LNnj2L7dGaqKPOJEyfQsWNHYccfMGAAOnbsCL2+9NXQyrONrapUqYJWrVrh+PHjiu3TmUlf5szMTNy4cQNNmzYVmkOr1T7yzT/l2cZWQUFBSE1NVXSfzkr6hRNTUlIA/PWPqpTU1FQcOnQIJ0+eRJcuXUp8YHn79m1s2rQJv//+O9q3bw8iKnaaU55tKisoKAgJCQmK7tNpCXv3qYNs3ryZAFBeXp4i+1u0aBF169aNrFYrXbx4kfz9/emLL74osk1ycjJ16NCBDhw4QGazmVasWEFVqlShwMBAm7ZRwpo1a8hgMCi6Tycl/wdac3Jy4OHhAYPBoMj+li1bhubNm0Oj0cDf3x+tW7fGDz/8UGSbkSNHolu3bujcuTP0ej3GjRsHX19fm7dRgre3N8xmMx48eKD4vp2NKspsNBoV2198fHzhu9HOnDmDK1eu4Ny5c4Xfj4uLQ2JiIrp37154m0ajQYcOHQpPIcqzjVIKPg6Wk5Oj6H6dkfRlfvDgAdzd3RXbn6+vLw4fPoxJkybh7NmzCAgIgNVqLfz+iRMnAAAtWrQo8nMPl7Q82yjFw8MDAH
D37l3F9+1spH8A6Onpqeg/5IwZM5CQkICdO3fCw8MDMTExRb6fnZ0NAEhMTESDBg2KfK+grOXZRim5ubkAoOhvJ2cl/T2z0WhU7FfsxYsXMWvWLAwbNqzwHu/he2Xgr4//A3+dSpSmPNsopeA/jqt++twmoh+C2tu2bdsIgCLLwZ48eZIAULdu3SgrK4t++eUXqlevHtWoUYNycnIoOzubzGYzNW3alIxGIyUkJBDRX4se1qtXj4xGI504cYLu3bv3yG2UWkfu22+/JQ8PD0X25eTkfzbD398fwF/3qpUVEhKC0aNHY9++fWjXrh3OnDmDJUuWwGQy4cUXX4TZbIZer8f27dvRrFkzPPPMMwgICMCUKVPQvn17tG7dGgcOHACAR27z8HVRKuP8+fOFM5Cd9J8BvHfvHoxGIzZt2qTYu+ZycnKK/Np+8OBBiashpaenw9PTE15eXjCZTCWet5Znm8oYNGgQ8vLysGXLFkX364Tk/wygh4cH/Pz8kJycrNg+/37+WdqyXrVq1YKXlxeA0h+AlWebykhJSVH01U9nJn2ZAaB9+/aFv97VJCsrC6dPn0aHDh1ER3EIVZS5e/fuSEhIgNlsFh3FoeLj40FEeOaZZ0RHcQhVlPnZZ59FTk4Ojhw5IjqKQ8XFxaFly5aquVimKsocFBSEJk2aICoqSnQUh7FarYiJiUG/fv1ER3EYVZQZAIYNG4Z169Yp9pSXs4uLi8O1a9cwdOhQ0VEcRjVlHj58ONLT07Fjxw7RURxi9erV6NSpk/APJTiS9M8zP6x37964e/eu9G9Wv3z5Mho3boyVK1fitddeEx3HUeR/nvlhH374IX755Rfs27dPdBS7mjNnDurVq4chQ4aIjuJQqrpnBoCuXbvCYDBg9+7doqPYxaVLl9C0aVMsWLAAb731lug4jqS+xcaTkpLw5JNPYv369Rg4cKDoOIoLDQ3FmTNncOrUKZe74FAlqes0AwA6dOiAUaNG4Z///CeysrJEx1HUjh07sG3bNixevFhtRQagwtMMALh16xaaNWuGvn374ttvvxUdRxHp6elo06YNunbtinXr1omOI4L67pkBoGbNmoiIiEBERARWr14tOk6lERHGjBkDrVaLJUuWiI4jjpj3UTuHKVOmkJeXFx07dkx0lEqZOXMmubm50aFDh0RHEUndF4LPy8ujnj17Up06dSgtLU10nAr56quvSKPR0PLly0VHEU3dZSYiys7Opvbt21NAQABdvXpVdBybxMTEkE6no48++kh0FGfAZSYiunnzJgUHB1PDhg0pOTlZdJxy+frrr0mv19Pbb78tOoqz4DIXuHXrFj355JNUq1YtOnDggOg4pbJarfTxxx+TRqOhGTNmiI7jTLjMD8vNzaX+/fuTm5sbLVy4kKxWq+hIRdy6dYv69etHer2+2Pp2jMtcjNVqpdmzZ5Ner6f+/fs7zXn0rl27yM/Pj/z8/Gj//v2i4zgjLnNp9u7dSwEBAeTt7U3z588vXEXUYrHY/dgPH+Pq1as0cOBAAkAvv/wy3bp1y+7Hd1FRqnwFsLzu37+PTz/9FHPmzEGDBg3w3nvvIS0tDbNnz7brcZOSkpCUlIQLFy5g+fLlqFOnDpYsWYI+ffrY9bguLprvmcvh/PnzNGrUKNJqtVSlShX69NNP6cqVK3Y51v79+2nIkCEEgKpXr04LFixQZDUmFZB/RSMl+Pv7w2q1wmq1on79+pg7dy4aNmyInj17YsmSJfjtt98qvO/79+8jLi4O06dPR5MmTdClSxecOHECGo0G2dnZaNSoUeG6dqxsfJrxCFarFePGjSt8Q9KECRMwb948/Pjjj1i7di12796NzMxM1K1bF23btkXTpk0RFBSEBg0awGg0wmg0wsvLCzk5OcjMzEROTg7S0tKQkpKC5ORkHDlyBPfv30fjxo3Rv39/DB8+HG3atEHNmjWRkZEBvV6PmJgYvPDCC2IH4fzU935mWxAR3n77baxYsQJWqxVubm74+OOPMXXq1MJtLBYLjh07hoSEBJw4cQIpKSlISUkpXH2zJPXr10dQUBACAwPRqVMndO/eHX5+fkW2admyJU6dOgWNRgOdTseFfrRo6ddnrigiwoQJEwqLDAD5+fmoX79+ke10Oh06dOhQbNWgnJwcmEwmmEwm5ObmomrVqvDx8YHRaCzXe40bNmyIU6dOgYhgsVjw8ssvY/Pmzejfv79yf0nJcJlLQESYNGkSli9fXmT9ZavVWu7rjnh7e1dqTeQGDRrAYDDAbDYXFvqll17C1q1b0bdv3wrvV2b8ALAE4eHhWLZsWbGFxAHY5SI6JfH19S1yTUAigtVqRVhYGGJjYx2SwdVwmf8mPDwcc+fORWkPJerVq+eQHL6+vsXWxrNarbBYLAgLC3PIqvuuhsv8kOnTp2POnDmlFtnb27tw+Vl78/X1LfE3g9VqRX5+Pvr06aOaa2KXF5f5/8yYMQOzZ88utcgAULduXYflKet0xmq1wmw2o3fv3oiPj3dYJmfHZcZfi8PMmjWrzCIDKHZlKHt61Lm51WpFXl4e+vXrp8q1p0vCZQbw8ssvY+DAgdBqtXBzcytxG71ej4YNGzosk4+PT5mv/Ol0Onh7e2PKlCmqWk+uLFxmAK1atcLGjRuRlpaGN954AwaDAXp90WctdTpdseeY7a127dpF/qzRaKDValGjRg188MEHuHz5Mj766CPUqFHDobmcFZf5IY0aNcLixYvRtWtX1KlTp8g1ty0WCx5//HGH5in4TaDVaqHRaODr6wuNRoN58+Zh5syZ8PHxcWgeZ8dl/puTJ08iLi4OK1euxNWrV/HBBx/Ax8cH+fn5DnuOuUBBmYODg7FhwwZcunQJQ4cOxbx580p8pkP1RLxXz5m9+uqr1LJlyyIfmTKZTLRw4UJKSUlxaJa1a9fS9u3bi9x29uxZ0mq1tGXLFodmcQH85vyHXbhwAUFBQYiMjMSrr74qOk6pwsLCcO3aNRw+fFh0FGeizuW5SjNnzhz4+flhwIABoqOUadq0aUhKSuJXAf+G75n/z40bN9CoUSN8/vnneP3110XHeaR//OMf0Gg0+Omnn0RHcRZ8z1xgwYIFqFatGkaMGCE6SrmEh4fj559/xsGDB0VHcRp8zwzg9u3b8Pf3x0cffYTJkyeLjlNuTz31FOrWrYvNmzeLjuIM+J4ZAJYuXQqDweASpxcPmzJlCrZu3VqpzyDKRPVlvnv3LpYuXYoJEyZU6s30IoSGhiI4OBjz5s0THcUpqL7MK1euRG5uLiZMmCA6is00Gg2mTJmCdevW4ffffxcdRzhVl9lsNmPRokV4/fXXUatWLdFxKmTIkCHw9fXFggULREcRTtVljoiIwPXr1/Gvf/1LdJQKMxgMmDx5Mr7++mvcuHFDdByhVFtmq9WKuXPnYsSIEcU+5u9qxo4di2rVqqn7eiZQcZljYmKQlpZWZA0MV+Xu7o6JEydi6dKlyMzMFB1HGNWWec6cOXj55ZcRGBgoOooiJkyYAK1Wi+XLl4uOIowqy7xz504cPXoU7733nugoiqlat
SrGjx+PRYsW4d69e6LjCKHKVwC7desGd3d37NixQ3QURd28eRP+/v6YP3++2q6bDajxFcDExEQkJCQgPDxcdBTF1a5dG6NGjcKcOXOKrbmhBqq7Z+7fvz/S09Nx6NAh0VHs4vLly2jcuDFWrVqFYcOGiY7jSOpaBfTMmTMICQnBtm3b0K9fP9Fx7Gb48OE4evQoTp8+XWSJL8mpq8xDhw7F8ePHcerUKan/kc+ePYsWLVpg8+bNePHFF0XHcRT1lPnixYsIDAzE6tWrMWTIENFx7C40NBTXr19HYmKi6CiOop4HgHPnzkWDBg0wcOBA0VEcYtq0aTh8+LCq1qNTxT3zn3/+iUaNGmHhwoUYP3686DgO06NHD+j1euzatUt0FEdQxz3zggULULVqVYwcOVJ0FIcKDw/HTz/9hKSkJNFRHEL6e+asrCw0bNgQ06dPx5QpU0THcbiOHTuiQYMGiImJER3F3uS/Zy54J5mrfSRKKe+//z62bNmiio9WSV3mu3fvYvHixZg4caJq12ULCwtDcHAw5s+fLzqK3Uld5q+++gq5ubmYNGmS6CjCaDQavPvuu1i7di0uXbokOo5dSVvmgo9EjRs3zmU/EqWUoUOH4vHHH8fChQtFR7Eracu8Zs0a/PHHHy79kSilGAwG/Pvf/8bXX3+Nmzdvio5jN1KW2Wq1Yv78+Rg2bJhDV7t3ZmPHjoXRaMTixYtFR7EbKcu8ZcsWJCcn49133xUdxWl4enpi0qRJWLp0KbKyskTHsQspy/zpp58WPopn/1/B2iArVqwQnMQ+pCvzrl27cOTIEbz//vuiozgdHx8fvPHGG1i4cKGUH62S7hXA7t27w83NDTt37hQdxSkVvE9lwYIFePPNN0XHUVJ0sQvB37lzB2lpaSLCVNq5c+cQHx+PL7/80u7vR+jQoYNd9uuI+ffp0wdz5sxBu3btoNFo7Hoseylx/n+/MMTGjRsJAH894steeP4Vnn9UsXvmAhcuXCjtW6oWGxvrkEUWef4lK2v+pZa5UaNGdgvkyhz1aiLPv2RlzV+6ZzOYenGZmTS4zEwaXGYmDS4zkwaXmUmDy8ykwWVm0uAyM2lwmZk0uMxMGlxmJg0uM5MGl5lJg8vMpMFlZtLgMjNpcJmZNLjMTBpcZiYNLjOTBpeZSYPLzKTBZWbS4DIzaXCZmTS4zEwaXGYmDS4zkwaXmUmDy8ykwWVm0uAyM2lwmZk0uMxMGlxmJg0uM5MGl5lJg8vMpMFlZtLgMjNpcJmZNLjMTBouUebc3FzREVTNVeZf6oXgk5KSHJmjVHl5eViyZAkmT54sOgoA4Pz58w45Ds+/ZGXNv9Qyd+zY0S5hKmrjxo2iIzgUz992GiKih2/Izc3FzZs3ReUp5q233sKOHTuwbNky9O7dW3ScQo0aNbLLfnn+5VPC/KOLldmZ5OTkoGbNmsjLy8OLL76IrVu3io6kKi42/2infgC4ZcsW5OfnAwBiY2ORmZkpOJG6uNr8nbrMa9asgUajAQBYrVZs2bJFcCJ1cbX5O22Z09PTERcXB4vFUnhbZGSkwETq4orzd9oyR0dHF/mzxWJBQkICbty4ISiRurji/J22zBEREfj7Y1OtVltsyMw+XHH+TvlsxpUrV9CwYcNiw9RoNGjXrp3TvKAgKxedv3M+m7Fu3TrodLpitxMRjhw5grS0NAGp1MNV5++UZY6IiCjywONhBoMBUVFRDk6kLq46f6c7zUhOTkazZs3K3KZx48Y4d+6cgxKpiwvP3/lOM9auXQuDwVDmNmlpaTh16pSDEqmLK8/f6cocEREBs9n8yO3Wr1/vgDTq48rzL/VdcyL88ccf6NSpEzp16lR42/Xr1/Hrr7+iT58+RbYtz8CZbVx9/k53zvx3UVFRGDRoULGniZhjuND8ne+cmbGK4jIzaXCZmTS4zEwaXGYmDS4zkwaXmUmDy8ykwWVm0uAyM2lwmZk0uMxMGlxmJg0uM5MGl5lJg8vMpMFlZtLgMjNpcJmZNLjMTBpcZiYNLjOTBpeZSYPLzKTBZWbS4DIzaXCZmTS4zEwaXGYmDS4zkwaXmUnDqRYbL5CRkYHLly8jMzMTx44dAwDs2rULXl5eqFatGho3bowqVaoITikvV52/8MXGTSYT9u7diz179uDgwYNITk7GrVu3yvwZrVYLf39/NG/eHM888wy6d++O1q1bQ6vlXzS2kmj+0ULKnJeXh+3bt2P16tX48ccfkZeXh+DgYHTt2hXNmzdHUFAQ/P39Ub16dXh6esLT0xNZWVnIzc1FRkYGUlNTkZqail9//RXx8fFIT09H3bp1MWTIEIwcORItW7Z09F/JpUg6/2iQA2VnZ9PcuXOpbt26pNVqqUePHvTNN9/QjRs3KrxPq9VKJ06coJkzZ1JAQAABoKeeeop++OEHslqtCqZ3fZLPP8ohZTabzTR//nyqUaMGeXt709SpU+ny5cuKH8dqtVJCQgL169ePNBoNtWnThhISEhQ/jqtRyfztX+Z9+/ZRSEgIubu70wcffEC3b9+29yGJiOjXX3+l3r17k0ajoREjRtDNmzcdclxno6L526/M+fn5NHPmTNLpdNSrVy9KS0uz16HKtHnzZvLz86N69erRnj17hGQQQYXzt0+ZMzIyqEePHuTu7k7Lli2zxyFskpmZSQMGDCCdTkeffPKJ6Dh2p9L5K1/mK1euUHBwMPn7+9OxY8eU3n2lLF68mPR6PY0fP57y8/NFx7ELFc9f2TKnpaWRn58ftWjRgq5evarkrhWzdetWcnd3pwEDBpDZbBYdR1Eqn79yZb5+/ToFBARQ+/btHfYgo6Li4+PJ09OTRo8eLc3Tdzx/hcqcnZ1NrVq1oqCgIJd51uCHH34gg8FA06dPFx2l0nj+RKRUmYcMGUK1a9em33//XYndOcyqVatIo9HQtm3bREepFJ4/ESlR5pUrV5JWq6WdO3cqEcjhxowZQ9WrV6eLFy+KjlIhPP9ClSvzpUuXyMvLi8LDwysbRJjc3Fxq3rw59ezZU3QUm/H8i6hcmcPCwqhJkyZ07969ygYRKjExkbRaLUVFRYmOYhOefxEVL/POnTsJAO3atasyAZzG6NGjydfXl+7evSs6Srnw/IupeJm7dOlCffr0qeiPO50bN26Qh4cHLV68WHSUcuH5F1OxMsfHxxMA2rt3b0UP7JQmTpxIDRo0oAcPHoiOUiaef4kqVuawsDDq2rVrRX7UqV2+fJn0ej2tX79edJQy8fxLFGXz51xu376N2NhYjB07VtnPCTiBBg0a4Pnnn0dkZKToKKXi+ZfO5jKvX78eBoMBL730UoUO6OyGDx+OXbt24caNG6KjlIjnXzqby/zdd9+hX79+8PLysvlgruCFF16AwWDA9u3bRUcpEc+/dDaVOS8vD/v378ezzz5r84H+7sKFCxg9ejSuXr1a6X0pycPDA0899RT27NkjOkoxPP+y2VTmpKQk5Obmonv37jYf6O+OHTuGb775BqdOnar0vpT27LPPIi4uTnSMYnj+j2DLw8WlS5fSY489VpFHmiVKT09XbF9K
+vnnnwmA070DjedfJtuezUhOTkZQUJDt/2NKUbNmTcX2paTAwEAAQGpqquAkRfH8y2ZTmVNTUxUbptVqxZ49e5CUlFTk9itXruDzzz+H1WrF6dOn8cknnyAyMhJWq7XIdiaTCWvWrMGMGTMQFRWFrKwsRXIBQP369eHl5YWUlBTF9qkEnv8j2HI/3rp1a5o2bZptvzNK8Ntvv9GAAQMIAH355ZeFt3/33XdUq1YtAkCLFi2iUaNGUb9+/QgA/fe//y3c7uzZs9SnTx86ceIEmc1mGjx4MD322GN0/vz5Smcr8MQTT9Ds2bMV258SeP5lsu00IycnB0aj0bb/LSUIDg7Ghx9+WOz2/v37Y8yYMQCAkJAQrFq1Ct9//z3atm2LmJgYAIDFYsHgwYMRGhqKli1bQq/X491330VOTg7OnDlT6WwFvL29kZOTo9j+lMDzL5tNq4CaTCZFhgmg1FUkPTw8AABNmzYtvC04OBg7d+4EAMTGxuL48aErHOsAAAWOSURBVOPo27dv4ffbtm2LnJwcuLm5KZINAKpWrYrs7GzF9qcEnn/ZbLpn1mg0IAGLhup0usLjnjhxAl5eXqhVq1aRbZQcJPDXOaVOp1N0n5XF8y+bTWV2hl+9VqsVubm5dn9RIzs7G97e3nY9hq14/mVzuTKHhIQAANatW1fk9oyMDGzZskWx45hMJi5zCZx5/jadM9erV0+xlz8fPHgAAMUWti44T8rLyyu87datW3jw4AGICC+88ALatGmD1atXw93dHa+88gpOnjyJ+Ph4REVFKZLNYrHg+vXrePzxxxXZn1J4/o9gy3MfkydPpvbt29vyIyU6dOhQ4VNDLVq0oB9++IGI/nrT+RNPPEEAaOzYsXT9+nVav349Va1alQDQzJkzyWw209WrV6lnz56k0WhIo9FQt27dFF3BJy0tjQDQ4cOHFdunEnj+ZbLtzfkrV64kb29vp1kF6M6dO5SRkaH4fn/88UcCQJmZmYrvuzJ4/mWy7Xnmdu3aIScnx2nenFKtWjXUqFFD8f0eOHAATZo0gY+Pj+L7rgyef9lsKnPr1q3x2GOPOeXbI5W0e/du9OjRQ3SMYnj+ZbOpzFqtFl27dsVPP/1k84FcxZ07d3DkyBGnLDPPv2w2f9IkLCwMu3bteuTltVxVdHQ0DAYDnn/+edFRSsTzL53NZX7ppZfg7u6ODRs22HwwVxAZGYnQ0FCnO18uwPMvQ0UebY4aNYpCQkKc5lG1Uk6dOkUajYa2b98uOkqZeP4lqti6Gb/99htptVqXXwr27wYPHkzBwcFksVhERykTz79EFV+eKzQ0lDp27CjNvUNycjLpdDqnXwCmAM+/mIqX+dixY6TT6Wj16tUV3YVT6dWrF4WEhLjMhXt4/sVUbknbt956i2rXru3019B4lJiYGNJoNC53nUCefxGVK/OdO3eoTp06NGzYsMrsRqg///yTHn/8cRo5cqToKDbj+RdR+ctAbN++nbRaLf3vf/+r7K4czmKx0PPPP09+fn52eY+BI/D8CylzgZ7w8HDy9PSkxMREJXbnMNOnTyc3NzdKSkoSHaVSeP5EpFSZzWYz9e3bl2rWrElnz55VYpd2t2zZMtJoNLRq1SrRUSqN509ESl7UMjc3lzp37kx+fn507tw5pXZrF6tXryatVlvk4/Oujuev8OWGMzIyqEOHDlSnTh06evSokrtWzLx580ij0Siy/oSzUfn8lb8QfE5ODj333HPk7e1NmzZtUnr3FXbv3j0aP348abVa+uyzz0THsRsVz1/5MhMRPXjwgN58800CQG+//bbwS3ulpqZSmzZtyMfHh2JiYoRmcQSVzt8+ZS6wefNmqlatGgUEBFBsbKw9D1WivLw8+uyzz8hoNFLbtm2d/lxSaSqbv33LTPTXRVdeeuklAkBhYWF0/Phxex+S8vPzad26ddSkSRPy8vKiTz/91OmvIGUvKpq//ctcIDY2llq1akUajYb69+9PP//8s+LvTsvMzKQVK1ZQYGAg6XQ6Gjp0KF26dEnRY7gqFczfcWUmIrJarfT9999Tly5dCADVr1+fpk6dSnv27KH79+9XaJ9//vknbdiwgQYOHEju7u7k7u5OY8aMUd0pRXlIPv8oDZGAxcsApKSkIDIyEhs3bkRaWho8PDzQqVMnBAcHIygoCAEBATAajYVfJpMJd+7cwZ07d5CamoqUlBQcP34cp0+fhk6nw9NPP41hw4ZhwIABTvspEWci4fyjhZX5YZcuXUJcXBz279+PlJQUpKSkID09vcRtdTod/P39ERgYiBYtWuCZZ55B165dnW4pLVciyfydo8wlyc3NhclkgslkQm5uLoxGI6pXrw6j0QiDwSA6nvRccP7OW2bGbBRt86ezGXNWXGYmDT2An0WHYEwBp/4fuwoG8EFo0LcAAAAASUVORK5CYII=\n", 266 | "text/plain": [ 267 | "" 268 | ] 269 | }, 270 | "execution_count": 8, 271 | "metadata": {}, 272 | "output_type": "execute_result" 273 | } 274 | ], 275 | "source": [ 276 | "z.visualize()" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "We will take a deeper look at the Delayed API in the next course." 
284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": null, 289 | "metadata": {}, 290 | "outputs": [], 291 | "source": [] 292 | } 293 | ], 294 | "metadata": { 295 | "kernelspec": { 296 | "display_name": "Python 3", 297 | "language": "python", 298 | "name": "python3" 299 | }, 300 | "language_info": { 301 | "codemirror_mode": { 302 | "name": "ipython", 303 | "version": 3 304 | }, 305 | "file_extension": ".py", 306 | "mimetype": "text/x-python", 307 | "name": "python", 308 | "nbconvert_exporter": "python", 309 | "pygments_lexer": "ipython3", 310 | "version": "3.8.8" 311 | } 312 | }, 313 | "nbformat": 4, 314 | "nbformat_minor": 4 315 | } 316 | -------------------------------------------------------------------------------- /02-dataframe.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Dask DataFrame\n", 8 | "\n", 9 | "## Notebook Objectives\n", 10 | "* **Download NYC Yellow Taxi Cab Dataset for 2019**.\n", 11 | "* **Reading and working with tabular data using pandas**, a popular library for data analysis.\n", 12 | "* **Reading and working with tabular data using Dask DataFrame** - an interface to scale pandas code, and a look at **Dask Dashboards** for real-time visualization of the state of your cluster.\n", 13 | "* **Scaling Dask computation to the Cloud** using Coiled, a deployment-as-a-service library for scaling Python. (Optional)\n", 14 | "* **Limitations of Dask DataFrame**.\n", 15 | "* **References** for further reading." 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "## Download NYC Yellow Taxi Cab Dataset for 2019\n", 23 | "\n", 24 | "A typical data science workflow starts with some data that needs to be understood. The first step is usually data cleaning and exploratory analysis to find interesting details and patterns.\n", 25 | "\n", 26 | "In this notebook, we will be working with the [New York City Yellow Taxi Trips Dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) for 2019." 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "## Reading and working with tabular data using **pandas**" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "### Reading data\n", 41 | "\n", 42 | "pandas has `read_csv` and `read_parquet` functions to import data into your workspace. We use `read_parquet` to read the taxi data for January 2019.\n", 43 | "\n", 44 | "`%%time` is a [magic function](https://ipython.readthedocs.io/en/stable/interactive/magics.html) in IPython that reports the execution time of a cell.\n", 45 | "\n", 46 | "pandas reads data in the form of a 'dataframe' -- a structured format consisting of rows and columns, along with some metadata about the values." 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 1, 52 | "metadata": {}, 53 | "outputs": [ 54 | { 55 | "name": "stdout", 56 | "output_type": "stream", 57 | "text": [ 58 | "CPU times: user 3.16 s, sys: 1.66 s, total: 4.82 s\n", 59 | "Wall time: 53.5 s\n" 60 | ] 61 | }, 62 | { 63 | "data": { 64 | "text/html": [ 65 | "
\n", 66 | "\n", 79 | "\n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | "
VendorIDtpep_pickup_datetimetpep_dropoff_datetimepassenger_counttrip_distanceRatecodeIDstore_and_fwd_flagPULocationIDDOLocationIDpayment_typefare_amountextramta_taxtip_amounttolls_amountimprovement_surchargetotal_amountcongestion_surchargeairport_fee
012019-01-01 00:46:402019-01-01 00:53:201.01.501.0N15123917.000.500.51.650.000.39.95NaNNone
112019-01-01 00:59:472019-01-01 01:18:591.02.601.0N239246114.000.500.51.000.000.316.30NaNNone
222018-12-21 13:48:302018-12-21 13:52:403.00.001.0N23623614.500.500.50.000.000.35.80NaNNone
322018-11-28 15:52:252018-11-28 15:55:455.00.001.0N19319323.500.500.50.000.000.37.55NaNNone
422018-11-28 15:56:572018-11-28 15:58:335.00.002.0N193193252.000.000.50.000.000.355.55NaNNone
............................................................
769661222019-01-31 23:37:202019-02-01 00:10:43NaN10.24NaNNone1429500.002.750.00.005.760.30.00NaNNone
769661322019-01-31 23:28:002019-01-31 23:50:50NaN12.43NaNNone48213048.805.500.00.000.000.354.60NaNNone
769661422019-01-31 23:11:002019-01-31 23:46:00NaN9.14NaNNone159246051.052.750.50.000.000.354.60NaNNone
769661522019-01-31 23:03:002019-01-31 23:14:00NaN0.00NaNNone26526500.000.000.59.820.000.30.00NaNNone
769661622019-01-31 23:41:032019-02-01 00:19:16NaN12.30NaNNone23719700.002.750.00.000.000.30.00NaNNone
\n", 349 | "

7696617 rows × 19 columns

\n", 350 | "
" 351 | ], 352 | "text/plain": [ 353 | " VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \\\n", 354 | "0 1 2019-01-01 00:46:40 2019-01-01 00:53:20 1.0 \n", 355 | "1 1 2019-01-01 00:59:47 2019-01-01 01:18:59 1.0 \n", 356 | "2 2 2018-12-21 13:48:30 2018-12-21 13:52:40 3.0 \n", 357 | "3 2 2018-11-28 15:52:25 2018-11-28 15:55:45 5.0 \n", 358 | "4 2 2018-11-28 15:56:57 2018-11-28 15:58:33 5.0 \n", 359 | "... ... ... ... ... \n", 360 | "7696612 2 2019-01-31 23:37:20 2019-02-01 00:10:43 NaN \n", 361 | "7696613 2 2019-01-31 23:28:00 2019-01-31 23:50:50 NaN \n", 362 | "7696614 2 2019-01-31 23:11:00 2019-01-31 23:46:00 NaN \n", 363 | "7696615 2 2019-01-31 23:03:00 2019-01-31 23:14:00 NaN \n", 364 | "7696616 2 2019-01-31 23:41:03 2019-02-01 00:19:16 NaN \n", 365 | "\n", 366 | " trip_distance RatecodeID store_and_fwd_flag PULocationID \\\n", 367 | "0 1.50 1.0 N 151 \n", 368 | "1 2.60 1.0 N 239 \n", 369 | "2 0.00 1.0 N 236 \n", 370 | "3 0.00 1.0 N 193 \n", 371 | "4 0.00 2.0 N 193 \n", 372 | "... ... ... ... ... \n", 373 | "7696612 10.24 NaN None 142 \n", 374 | "7696613 12.43 NaN None 48 \n", 375 | "7696614 9.14 NaN None 159 \n", 376 | "7696615 0.00 NaN None 265 \n", 377 | "7696616 12.30 NaN None 237 \n", 378 | "\n", 379 | " DOLocationID payment_type fare_amount extra mta_tax tip_amount \\\n", 380 | "0 239 1 7.00 0.50 0.5 1.65 \n", 381 | "1 246 1 14.00 0.50 0.5 1.00 \n", 382 | "2 236 1 4.50 0.50 0.5 0.00 \n", 383 | "3 193 2 3.50 0.50 0.5 0.00 \n", 384 | "4 193 2 52.00 0.00 0.5 0.00 \n", 385 | "... ... ... ... ... ... ... \n", 386 | "7696612 95 0 0.00 2.75 0.0 0.00 \n", 387 | "7696613 213 0 48.80 5.50 0.0 0.00 \n", 388 | "7696614 246 0 51.05 2.75 0.5 0.00 \n", 389 | "7696615 265 0 0.00 0.00 0.5 9.82 \n", 390 | "7696616 197 0 0.00 2.75 0.0 0.00 \n", 391 | "\n", 392 | " tolls_amount improvement_surcharge total_amount \\\n", 393 | "0 0.00 0.3 9.95 \n", 394 | "1 0.00 0.3 16.30 \n", 395 | "2 0.00 0.3 5.80 \n", 396 | "3 0.00 0.3 7.55 \n", 397 | "4 0.00 0.3 55.55 \n", 398 | "... ... ... ... \n", 399 | "7696612 5.76 0.3 0.00 \n", 400 | "7696613 0.00 0.3 54.60 \n", 401 | "7696614 0.00 0.3 54.60 \n", 402 | "7696615 0.00 0.3 0.00 \n", 403 | "7696616 0.00 0.3 0.00 \n", 404 | "\n", 405 | " congestion_surcharge airport_fee \n", 406 | "0 NaN None \n", 407 | "1 NaN None \n", 408 | "2 NaN None \n", 409 | "3 NaN None \n", 410 | "4 NaN None \n", 411 | "... ... ... \n", 412 | "7696612 NaN None \n", 413 | "7696613 NaN None \n", 414 | "7696614 NaN None \n", 415 | "7696615 NaN None \n", 416 | "7696616 NaN None \n", 417 | "\n", 418 | "[7696617 rows x 19 columns]" 419 | ] 420 | }, 421 | "execution_count": 1, 422 | "metadata": {}, 423 | "output_type": "execute_result" 424 | } 425 | ], 426 | "source": [ 427 | "%%time\n", 428 | "\n", 429 | "import pandas as pd\n", 430 | "\n", 431 | "df = pd.read_parquet(\"s3://nyc-tlc/trip data/yellow_tripdata_2019-01.parquet\")\n", 432 | "df" 433 | ] 434 | }, 435 | { 436 | "cell_type": "markdown", 437 | "metadata": {}, 438 | "source": [ 439 | "Note the time taken, it's ~5 seconds in our case. pandas has read all the data for January and inferred the datatypes for each column. The `.info()` method can be used to gather a concise summary of the dataframe." 
440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": 2, 445 | "metadata": {}, 446 | "outputs": [ 447 | { 448 | "name": "stdout", 449 | "output_type": "stream", 450 | "text": [ 451 | "\n", 452 | "RangeIndex: 7696617 entries, 0 to 7696616\n", 453 | "Data columns (total 19 columns):\n", 454 | " # Column Dtype \n", 455 | "--- ------ ----- \n", 456 | " 0 VendorID int64 \n", 457 | " 1 tpep_pickup_datetime datetime64[ns]\n", 458 | " 2 tpep_dropoff_datetime datetime64[ns]\n", 459 | " 3 passenger_count float64 \n", 460 | " 4 trip_distance float64 \n", 461 | " 5 RatecodeID float64 \n", 462 | " 6 store_and_fwd_flag object \n", 463 | " 7 PULocationID int64 \n", 464 | " 8 DOLocationID int64 \n", 465 | " 9 payment_type int64 \n", 466 | " 10 fare_amount float64 \n", 467 | " 11 extra float64 \n", 468 | " 12 mta_tax float64 \n", 469 | " 13 tip_amount float64 \n", 470 | " 14 tolls_amount float64 \n", 471 | " 15 improvement_surcharge float64 \n", 472 | " 16 total_amount float64 \n", 473 | " 17 congestion_surcharge float64 \n", 474 | " 18 airport_fee object \n", 475 | "dtypes: datetime64[ns](2), float64(11), int64(4), object(2)\n", 476 | "memory usage: 1.1+ GB\n" 477 | ] 478 | } 479 | ], 480 | "source": [ 481 | "df.info()" 482 | ] 483 | }, 484 | { 485 | "cell_type": "markdown", 486 | "metadata": {}, 487 | "source": [ 488 | "### Working with the data\n", 489 | "\n", 490 | "After importing the data, the next step is working on the data to find some useful information.\n", 491 | "\n", 492 | "In the following blocks, the mean of the tip amount is calculated as a function of passenger count." 493 | ] 494 | }, 495 | { 496 | "cell_type": "markdown", 497 | "metadata": {}, 498 | "source": [ 499 | "In pandas, you can use `mean()` to calculate mean, and `groupby()` for mapping to a column." 500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": 4, 505 | "metadata": {}, 506 | "outputs": [ 507 | { 508 | "name": "stdout", 509 | "output_type": "stream", 510 | "text": [ 511 | "CPU times: user 124 ms, sys: 17.1 ms, total: 141 ms\n", 512 | "Wall time: 138 ms\n" 513 | ] 514 | }, 515 | { 516 | "data": { 517 | "text/plain": [ 518 | "passenger_count\n", 519 | "0.0 1.786901\n", 520 | "1.0 1.828352\n", 521 | "2.0 1.833932\n", 522 | "3.0 1.795589\n", 523 | "4.0 1.702710\n", 524 | "5.0 1.869868\n", 525 | "6.0 1.856830\n", 526 | "7.0 6.542632\n", 527 | "8.0 6.480690\n", 528 | "9.0 3.116667\n", 529 | "Name: tip_amount, dtype: float64" 530 | ] 531 | }, 532 | "execution_count": 4, 533 | "metadata": {}, 534 | "output_type": "execute_result" 535 | } 536 | ], 537 | "source": [ 538 | "%%time\n", 539 | "\n", 540 | "df.groupby(\"passenger_count\").tip_amount.mean()" 541 | ] 542 | }, 543 | { 544 | "cell_type": "markdown", 545 | "metadata": {}, 546 | "source": [ 547 | "### Limitation in pandas\n", 548 | "\n", 549 | "pandas is the most popular library for exploratory data analysis, but it has a limitation. pandas is great at handling small quantities of data, but fails with a `MemoryError` when using larger datasets. This is where Dask comes in." 550 | ] 551 | }, 552 | { 553 | "cell_type": "markdown", 554 | "metadata": {}, 555 | "source": [ 556 | "Optional: Uncomment and run the following code block to read the entire dataset in pandas." 
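,
"\n",
"\n",
"Before attempting that, it can help to check how much memory a single month already occupies. A minimal sketch (added for illustration, not part of the original notebook), assuming `df` still holds the January data loaded above:\n",
"\n",
"```python\n",
"# deep=True also counts the memory behind object (string) columns\n",
"mem_gb = df.memory_usage(deep=True).sum() / 1e9\n",
"print(f\"January alone uses about {mem_gb:.1f} GB in memory\")\n",
"```"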
557 | ] 558 | }, 559 | { 560 | "cell_type": "code", 561 | "execution_count": 1, 562 | "metadata": {}, 563 | "outputs": [], 564 | "source": [ 565 | "# import glob\n", 566 | "\n", 567 | "# df = pd.concat(map(pd.read_csv, glob.glob('data/*.csv')))\n", 568 | "# df" 569 | ] 570 | }, 571 | { 572 | "cell_type": "markdown", 573 | "metadata": {}, 574 | "source": [ 575 | "## Reading and working with tabular data using **Dask DataFrame**" 576 | ] 577 | }, 578 | { 579 | "cell_type": "markdown", 580 | "metadata": {}, 581 | "source": [ 582 | "### Reading data" 583 | ] 584 | }, 585 | { 586 | "cell_type": "markdown", 587 | "metadata": {}, 588 | "source": [ 589 | "Dask can be used to scale pandas to larger datasets. Dask's DataFrame API has the same functions as the pandas API because it's a wrapper around pandas. This makes Dask code familiar and easy to use.\n", 590 | "\n", 591 | "First, spin up a cluster! " 592 | ] 593 | }, 594 | { 595 | "cell_type": "code", 596 | "execution_count": 5, 597 | "metadata": {}, 598 | "outputs": [ 599 | { 600 | "data": { 601 | "text/html": [ 602 | "
\n", 603 | "
\n", 604 | "
\n", 605 | "

Client

\n", 606 | "

Client-4e817d36-7a77-11ed-9f63-92492cdc1fe7

\n", 607 | " \n", 608 | "\n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | "\n", 616 | " \n", 617 | " \n", 618 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | "\n", 625 | "
Connection method: Cluster objectCluster type: distributed.LocalCluster
\n", 619 | " Dashboard: http://127.0.0.1:8787/status\n", 620 | "
\n", 626 | "\n", 627 | " \n", 628 | " \n", 631 | " \n", 632 | "\n", 633 | " \n", 634 | "
\n", 635 | "

Cluster Info

\n", 636 | "
\n", 637 | "
\n", 638 | "
\n", 639 | "
\n", 640 | "

LocalCluster

\n", 641 | "

bbd8c95a

\n", 642 | " \n", 643 | " \n", 644 | " \n", 647 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 655 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | "\n", 664 | "\n", 665 | " \n", 666 | "
\n", 645 | " Dashboard: http://127.0.0.1:8787/status\n", 646 | " \n", 648 | " Workers: 4\n", 649 | "
\n", 653 | " Total threads: 8\n", 654 | " \n", 656 | " Total memory: 16.00 GiB\n", 657 | "
Status: runningUsing processes: True
\n", 667 | "\n", 668 | "
\n", 669 | " \n", 670 | "

Scheduler Info

\n", 671 | "
\n", 672 | "\n", 673 | "
\n", 674 | "
\n", 675 | "
\n", 676 | "
\n", 677 | "

Scheduler

\n", 678 | "

Scheduler-2e085ad6-9c17-41bb-9173-d81745b7ff2b

\n", 679 | " \n", 680 | " \n", 681 | " \n", 684 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 692 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 700 | " \n", 703 | " \n", 704 | "
\n", 682 | " Comm: tcp://127.0.0.1:51662\n", 683 | " \n", 685 | " Workers: 4\n", 686 | "
\n", 690 | " Dashboard: http://127.0.0.1:8787/status\n", 691 | " \n", 693 | " Total threads: 8\n", 694 | "
\n", 698 | " Started: Just now\n", 699 | " \n", 701 | " Total memory: 16.00 GiB\n", 702 | "
\n", 705 | "
\n", 706 | "
\n", 707 | "\n", 708 | "
\n", 709 | " \n", 710 | "

Workers

\n", 711 | "
\n", 712 | "\n", 713 | " \n", 714 | "
\n", 715 | "
\n", 716 | "
\n", 717 | "
\n", 718 | " \n", 719 | "

Worker: 0

\n", 720 | "
\n", 721 | " \n", 722 | " \n", 723 | " \n", 726 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 734 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 748 | " \n", 749 | "\n", 750 | " \n", 751 | "\n", 752 | " \n", 753 | "\n", 754 | "
\n", 724 | " Comm: tcp://127.0.0.1:51680\n", 725 | " \n", 727 | " Total threads: 2\n", 728 | "
\n", 732 | " Dashboard: http://127.0.0.1:51681/status\n", 733 | " \n", 735 | " Memory: 4.00 GiB\n", 736 | "
\n", 740 | " Nanny: tcp://127.0.0.1:51666\n", 741 | "
\n", 746 | " Local directory: /var/folders/hf/2s7qjx7j5ndc5220_qxv8y800000gn/T/dask-worker-space/worker-uhglwxcd\n", 747 | "
\n", 755 | "
\n", 756 | "
\n", 757 | "
\n", 758 | " \n", 759 | "
\n", 760 | "
\n", 761 | "
\n", 762 | "
\n", 763 | " \n", 764 | "

Worker: 1

\n", 765 | "
\n", 766 | " \n", 767 | " \n", 768 | " \n", 771 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 779 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 793 | " \n", 794 | "\n", 795 | " \n", 796 | "\n", 797 | " \n", 798 | "\n", 799 | "
\n", 769 | " Comm: tcp://127.0.0.1:51679\n", 770 | " \n", 772 | " Total threads: 2\n", 773 | "
\n", 777 | " Dashboard: http://127.0.0.1:51684/status\n", 778 | " \n", 780 | " Memory: 4.00 GiB\n", 781 | "
\n", 785 | " Nanny: tcp://127.0.0.1:51665\n", 786 | "
\n", 791 | " Local directory: /var/folders/hf/2s7qjx7j5ndc5220_qxv8y800000gn/T/dask-worker-space/worker-vdo838m4\n", 792 | "
\n", 800 | "
\n", 801 | "
\n", 802 | "
\n", 803 | " \n", 804 | "
\n", 805 | "
\n", 806 | "
\n", 807 | "
\n", 808 | " \n", 809 | "

Worker: 2

\n", 810 | "
\n", 811 | " \n", 812 | " \n", 813 | " \n", 816 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 824 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 838 | " \n", 839 | "\n", 840 | " \n", 841 | "\n", 842 | " \n", 843 | "\n", 844 | "
\n", 814 | " Comm: tcp://127.0.0.1:51677\n", 815 | " \n", 817 | " Total threads: 2\n", 818 | "
\n", 822 | " Dashboard: http://127.0.0.1:51683/status\n", 823 | " \n", 825 | " Memory: 4.00 GiB\n", 826 | "
\n", 830 | " Nanny: tcp://127.0.0.1:51667\n", 831 | "
\n", 836 | " Local directory: /var/folders/hf/2s7qjx7j5ndc5220_qxv8y800000gn/T/dask-worker-space/worker-_g98czai\n", 837 | "
\n", 845 | "
\n", 846 | "
\n", 847 | "
\n", 848 | " \n", 849 | "
\n", 850 | "
\n", 851 | "
\n", 852 | "
\n", 853 | " \n", 854 | "

Worker: 3

\n", 855 | "
\n", 856 | " \n", 857 | " \n", 858 | " \n", 861 | " \n", 864 | " \n", 865 | " \n", 866 | " \n", 869 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 883 | " \n", 884 | "\n", 885 | " \n", 886 | "\n", 887 | " \n", 888 | "\n", 889 | "
\n", 859 | " Comm: tcp://127.0.0.1:51678\n", 860 | " \n", 862 | " Total threads: 2\n", 863 | "
\n", 867 | " Dashboard: http://127.0.0.1:51682/status\n", 868 | " \n", 870 | " Memory: 4.00 GiB\n", 871 | "
\n", 875 | " Nanny: tcp://127.0.0.1:51668\n", 876 | "
\n", 881 | " Local directory: /var/folders/hf/2s7qjx7j5ndc5220_qxv8y800000gn/T/dask-worker-space/worker-xqi396ff\n", 882 | "
\n", 890 | "
\n", 891 | "
\n", 892 | "
\n", 893 | " \n", 894 | "\n", 895 | "
\n", 896 | "
\n", 897 | "\n", 898 | "
\n", 899 | "
\n", 900 | "
\n", 901 | "
\n", 902 | " \n", 903 | "\n", 904 | "
\n", 905 | "
" 906 | ], 907 | "text/plain": [ 908 | "" 909 | ] 910 | }, 911 | "execution_count": 5, 912 | "metadata": {}, 913 | "output_type": "execute_result" 914 | } 915 | ], 916 | "source": [ 917 | "from dask.distributed import Client\n", 918 | "\n", 919 | "client = Client(n_workers=4)\n", 920 | "client" 921 | ] 922 | }, 923 | { 924 | "cell_type": "markdown", 925 | "metadata": {}, 926 | "source": [ 927 | "Open the Dask Dashboard in JupyterLab -- Cluster Map, Task Stream, and Dask workers\n", 928 | "\n", 929 | "* **Cluster map** (also called the pew-pew map) visualizes interactions between the scheduler and the workers.\n", 930 | "* **Task stream** shows tasks performed by each worker in real-time.\n", 931 | "* **Dask workers** displays CPU and memory being used by each worker." 932 | ] 933 | }, 934 | { 935 | "cell_type": "markdown", 936 | "metadata": {}, 937 | "source": [ 938 | "The same reading operation with Dask, but this time read the complete dataset - data for all the years." 939 | ] 940 | }, 941 | { 942 | "cell_type": "code", 943 | "execution_count": 6, 944 | "metadata": {}, 945 | "outputs": [ 946 | { 947 | "name": "stdout", 948 | "output_type": "stream", 949 | "text": [ 950 | "CPU times: user 443 ms, sys: 151 ms, total: 594 ms\n", 951 | "Wall time: 2.05 s\n" 952 | ] 953 | }, 954 | { 955 | "data": { 956 | "text/html": [ 957 | "
Dask DataFrame Structure:
\n", 958 | "
\n", 959 | "\n", 972 | "\n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | " \n", 1014 | " \n", 1015 | " \n", 1016 | " \n", 1017 | " \n", 1018 | " \n", 1019 | " \n", 1020 | " \n", 1021 | " \n", 1022 | " \n", 1023 | " \n", 1024 | " \n", 1025 | " \n", 1026 | " \n", 1027 | " \n", 1028 | " \n", 1029 | " \n", 1030 | " \n", 1031 | " \n", 1032 | " \n", 1033 | " \n", 1034 | " \n", 1035 | " \n", 1036 | " \n", 1037 | " \n", 1038 | " \n", 1039 | " \n", 1040 | " \n", 1041 | " \n", 1042 | " \n", 1043 | " \n", 1044 | " \n", 1045 | " \n", 1046 | " \n", 1047 | " \n", 1048 | " \n", 1049 | " \n", 1050 | " \n", 1051 | " \n", 1052 | " \n", 1053 | " \n", 1054 | " \n", 1055 | " \n", 1056 | " \n", 1057 | " \n", 1058 | " \n", 1059 | " \n", 1060 | " \n", 1061 | " \n", 1062 | " \n", 1063 | " \n", 1064 | " \n", 1065 | " \n", 1066 | " \n", 1067 | " \n", 1068 | " \n", 1069 | " \n", 1070 | " \n", 1071 | " \n", 1072 | " \n", 1073 | " \n", 1074 | " \n", 1075 | " \n", 1076 | " \n", 1077 | " \n", 1078 | " \n", 1079 | " \n", 1080 | " \n", 1081 | " \n", 1082 | " \n", 1083 | " \n", 1084 | " \n", 1085 | " \n", 1086 | " \n", 1087 | " \n", 1088 | " \n", 1089 | " \n", 1090 | " \n", 1091 | " \n", 1092 | " \n", 1093 | " \n", 1094 | " \n", 1095 | " \n", 1096 | " \n", 1097 | " \n", 1098 | " \n", 1099 | " \n", 1100 | " \n", 1101 | " \n", 1102 | " \n", 1103 | " \n", 1104 | " \n", 1105 | " \n", 1106 | " \n", 1107 | " \n", 1108 | " \n", 1109 | " \n", 1110 | " \n", 1111 | " \n", 1112 | " \n", 1113 | " \n", 1114 | " \n", 1115 | " \n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | "
VendorIDtpep_pickup_datetimetpep_dropoff_datetimepassenger_counttrip_distanceRatecodeIDstore_and_fwd_flagPULocationIDDOLocationIDpayment_typefare_amountextramta_taxtip_amounttolls_amountimprovement_surchargetotal_amountcongestion_surchargeairport_fee
npartitions=12
int64datetime64[ns]datetime64[ns]float64float64float64objectint64int64int64float64float64float64float64float64float64float64float64object
.........................................................
............................................................
.........................................................
.........................................................
\n", 1132 | "
\n", 1133 | "
Dask Name: read-parquet, 1 graph layer
" 1134 | ], 1135 | "text/plain": [ 1136 | "Dask DataFrame Structure:\n", 1137 | " VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee\n", 1138 | "npartitions=12 \n", 1139 | " int64 datetime64[ns] datetime64[ns] float64 float64 float64 object int64 int64 int64 float64 float64 float64 float64 float64 float64 float64 float64 object\n", 1140 | " ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...\n", 1141 | "... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...\n", 1142 | " ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...\n", 1143 | " ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...\n", 1144 | "Dask Name: read-parquet, 1 graph layer" 1145 | ] 1146 | }, 1147 | "execution_count": 6, 1148 | "metadata": {}, 1149 | "output_type": "execute_result" 1150 | } 1151 | ], 1152 | "source": [ 1153 | "%%time\n", 1154 | "\n", 1155 | "import dask.dataframe as dd\n", 1156 | "\n", 1157 | "df = dd.read_parquet(\"s3://nyc-tlc/trip data/yellow_tripdata_2019-*.parquet\")\n", 1158 | "df" 1159 | ] 1160 | }, 1161 | { 1162 | "cell_type": "markdown", 1163 | "metadata": {}, 1164 | "source": [ 1165 | "That took ~600 milliseconds because Dask hasn't actually imported all the data. It has created partitions and estimated the datatypes of each column.\n", 1166 | "\n", 1167 | "Let's look at the first few rows, `head()` pandas method can be used for this." 1168 | ] 1169 | }, 1170 | { 1171 | "cell_type": "code", 1172 | "execution_count": 7, 1173 | "metadata": {}, 1174 | "outputs": [ 1175 | { 1176 | "data": { 1177 | "text/html": [ 1178 | "
\n", 1179 | "\n", 1192 | "\n", 1193 | " \n", 1194 | " \n", 1195 | " \n", 1196 | " \n", 1197 | " \n", 1198 | " \n", 1199 | " \n", 1200 | " \n", 1201 | " \n", 1202 | " \n", 1203 | " \n", 1204 | " \n", 1205 | " \n", 1206 | " \n", 1207 | " \n", 1208 | " \n", 1209 | " \n", 1210 | " \n", 1211 | " \n", 1212 | " \n", 1213 | " \n", 1214 | " \n", 1215 | " \n", 1216 | " \n", 1217 | " \n", 1218 | " \n", 1219 | " \n", 1220 | " \n", 1221 | " \n", 1222 | " \n", 1223 | " \n", 1224 | " \n", 1225 | " \n", 1226 | " \n", 1227 | " \n", 1228 | " \n", 1229 | " \n", 1230 | " \n", 1231 | " \n", 1232 | " \n", 1233 | " \n", 1234 | " \n", 1235 | " \n", 1236 | " \n", 1237 | " \n", 1238 | " \n", 1239 | " \n", 1240 | " \n", 1241 | " \n", 1242 | " \n", 1243 | " \n", 1244 | " \n", 1245 | " \n", 1246 | " \n", 1247 | " \n", 1248 | " \n", 1249 | " \n", 1250 | " \n", 1251 | " \n", 1252 | " \n", 1253 | " \n", 1254 | " \n", 1255 | " \n", 1256 | " \n", 1257 | " \n", 1258 | " \n", 1259 | " \n", 1260 | " \n", 1261 | " \n", 1262 | " \n", 1263 | " \n", 1264 | " \n", 1265 | " \n", 1266 | " \n", 1267 | " \n", 1268 | " \n", 1269 | " \n", 1270 | " \n", 1271 | " \n", 1272 | " \n", 1273 | " \n", 1274 | " \n", 1275 | " \n", 1276 | " \n", 1277 | " \n", 1278 | " \n", 1279 | " \n", 1280 | " \n", 1281 | " \n", 1282 | " \n", 1283 | " \n", 1284 | " \n", 1285 | " \n", 1286 | " \n", 1287 | " \n", 1288 | " \n", 1289 | " \n", 1290 | " \n", 1291 | " \n", 1292 | " \n", 1293 | " \n", 1294 | " \n", 1295 | " \n", 1296 | " \n", 1297 | " \n", 1298 | " \n", 1299 | " \n", 1300 | " \n", 1301 | " \n", 1302 | " \n", 1303 | " \n", 1304 | " \n", 1305 | " \n", 1306 | " \n", 1307 | " \n", 1308 | " \n", 1309 | " \n", 1310 | " \n", 1311 | " \n", 1312 | " \n", 1313 | " \n", 1314 | " \n", 1315 | " \n", 1316 | " \n", 1317 | " \n", 1318 | " \n", 1319 | " \n", 1320 | " \n", 1321 | " \n", 1322 | " \n", 1323 | " \n", 1324 | " \n", 1325 | " \n", 1326 | " \n", 1327 | " \n", 1328 | " \n", 1329 | "
VendorIDtpep_pickup_datetimetpep_dropoff_datetimepassenger_counttrip_distanceRatecodeIDstore_and_fwd_flagPULocationIDDOLocationIDpayment_typefare_amountextramta_taxtip_amounttolls_amountimprovement_surchargetotal_amountcongestion_surchargeairport_fee
012019-01-01 00:46:402019-01-01 00:53:201.01.51.0N15123917.00.50.51.650.00.39.95NaNNone
112019-01-01 00:59:472019-01-01 01:18:591.02.61.0N239246114.00.50.51.000.00.316.30NaNNone
222018-12-21 13:48:302018-12-21 13:52:403.00.01.0N23623614.50.50.50.000.00.35.80NaNNone
322018-11-28 15:52:252018-11-28 15:55:455.00.01.0N19319323.50.50.50.000.00.37.55NaNNone
422018-11-28 15:56:572018-11-28 15:58:335.00.02.0N193193252.00.00.50.000.00.355.55NaNNone
\n", 1330 | "
" 1331 | ], 1332 | "text/plain": [ 1333 | " VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \\\n", 1334 | "0 1 2019-01-01 00:46:40 2019-01-01 00:53:20 1.0 \n", 1335 | "1 1 2019-01-01 00:59:47 2019-01-01 01:18:59 1.0 \n", 1336 | "2 2 2018-12-21 13:48:30 2018-12-21 13:52:40 3.0 \n", 1337 | "3 2 2018-11-28 15:52:25 2018-11-28 15:55:45 5.0 \n", 1338 | "4 2 2018-11-28 15:56:57 2018-11-28 15:58:33 5.0 \n", 1339 | "\n", 1340 | " trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID \\\n", 1341 | "0 1.5 1.0 N 151 239 \n", 1342 | "1 2.6 1.0 N 239 246 \n", 1343 | "2 0.0 1.0 N 236 236 \n", 1344 | "3 0.0 1.0 N 193 193 \n", 1345 | "4 0.0 2.0 N 193 193 \n", 1346 | "\n", 1347 | " payment_type fare_amount extra mta_tax tip_amount tolls_amount \\\n", 1348 | "0 1 7.0 0.5 0.5 1.65 0.0 \n", 1349 | "1 1 14.0 0.5 0.5 1.00 0.0 \n", 1350 | "2 1 4.5 0.5 0.5 0.00 0.0 \n", 1351 | "3 2 3.5 0.5 0.5 0.00 0.0 \n", 1352 | "4 2 52.0 0.0 0.5 0.00 0.0 \n", 1353 | "\n", 1354 | " improvement_surcharge total_amount congestion_surcharge airport_fee \n", 1355 | "0 0.3 9.95 NaN None \n", 1356 | "1 0.3 16.30 NaN None \n", 1357 | "2 0.3 5.80 NaN None \n", 1358 | "3 0.3 7.55 NaN None \n", 1359 | "4 0.3 55.55 NaN None " 1360 | ] 1361 | }, 1362 | "execution_count": 7, 1363 | "metadata": {}, 1364 | "output_type": "execute_result" 1365 | } 1366 | ], 1367 | "source": [ 1368 | "df.head()" 1369 | ] 1370 | }, 1371 | { 1372 | "cell_type": "markdown", 1373 | "metadata": {}, 1374 | "source": [ 1375 | "To look at the last few rows, use the `tail()` pandas method." 1376 | ] 1377 | }, 1378 | { 1379 | "cell_type": "code", 1380 | "execution_count": 8, 1381 | "metadata": {}, 1382 | "outputs": [ 1383 | { 1384 | "data": { 1385 | "text/html": [ 1386 | "
\n", 1387 | "\n", 1400 | "\n", 1401 | " \n", 1402 | " \n", 1403 | " \n", 1404 | " \n", 1405 | " \n", 1406 | " \n", 1407 | " \n", 1408 | " \n", 1409 | " \n", 1410 | " \n", 1411 | " \n", 1412 | " \n", 1413 | " \n", 1414 | " \n", 1415 | " \n", 1416 | " \n", 1417 | " \n", 1418 | " \n", 1419 | " \n", 1420 | " \n", 1421 | " \n", 1422 | " \n", 1423 | " \n", 1424 | " \n", 1425 | " \n", 1426 | " \n", 1427 | " \n", 1428 | " \n", 1429 | " \n", 1430 | " \n", 1431 | " \n", 1432 | " \n", 1433 | " \n", 1434 | " \n", 1435 | " \n", 1436 | " \n", 1437 | " \n", 1438 | " \n", 1439 | " \n", 1440 | " \n", 1441 | " \n", 1442 | " \n", 1443 | " \n", 1444 | " \n", 1445 | " \n", 1446 | " \n", 1447 | " \n", 1448 | " \n", 1449 | " \n", 1450 | " \n", 1451 | " \n", 1452 | " \n", 1453 | " \n", 1454 | " \n", 1455 | " \n", 1456 | " \n", 1457 | " \n", 1458 | " \n", 1459 | " \n", 1460 | " \n", 1461 | " \n", 1462 | " \n", 1463 | " \n", 1464 | " \n", 1465 | " \n", 1466 | " \n", 1467 | " \n", 1468 | " \n", 1469 | " \n", 1470 | " \n", 1471 | " \n", 1472 | " \n", 1473 | " \n", 1474 | " \n", 1475 | " \n", 1476 | " \n", 1477 | " \n", 1478 | " \n", 1479 | " \n", 1480 | " \n", 1481 | " \n", 1482 | " \n", 1483 | " \n", 1484 | " \n", 1485 | " \n", 1486 | " \n", 1487 | " \n", 1488 | " \n", 1489 | " \n", 1490 | " \n", 1491 | " \n", 1492 | " \n", 1493 | " \n", 1494 | " \n", 1495 | " \n", 1496 | " \n", 1497 | " \n", 1498 | " \n", 1499 | " \n", 1500 | " \n", 1501 | " \n", 1502 | " \n", 1503 | " \n", 1504 | " \n", 1505 | " \n", 1506 | " \n", 1507 | " \n", 1508 | " \n", 1509 | " \n", 1510 | " \n", 1511 | " \n", 1512 | " \n", 1513 | " \n", 1514 | " \n", 1515 | " \n", 1516 | " \n", 1517 | " \n", 1518 | " \n", 1519 | " \n", 1520 | " \n", 1521 | " \n", 1522 | " \n", 1523 | " \n", 1524 | " \n", 1525 | " \n", 1526 | " \n", 1527 | " \n", 1528 | " \n", 1529 | " \n", 1530 | " \n", 1531 | " \n", 1532 | " \n", 1533 | " \n", 1534 | " \n", 1535 | " \n", 1536 | " \n", 1537 | "
VendorIDtpep_pickup_datetimetpep_dropoff_datetimepassenger_counttrip_distanceRatecodeIDstore_and_fwd_flagPULocationIDDOLocationIDpayment_typefare_amountextramta_taxtip_amounttolls_amountimprovement_surchargetotal_amountcongestion_surchargeairport_fee
689631222019-12-31 23:56:292020-01-01 00:11:17NaN2.82NaNNone143141018.952.750.00.00.000.322.00NaNNone
689631322019-12-31 23:11:532019-12-31 23:30:56NaN3.75NaNNone148246022.452.750.00.00.000.325.50NaNNone
689631422019-12-31 23:57:212020-01-01 00:23:34NaN6.46NaNNone197205034.862.750.00.00.000.337.91NaNNone
689631522019-12-31 23:37:292020-01-01 00:28:21NaN5.66NaNNone9074036.452.750.00.00.000.339.50NaNNone
689631622019-12-31 23:09:002019-12-31 23:54:00NaN-15.50NaNNone142149053.032.750.50.06.120.362.70NaNNone
\n", 1538 | "
" 1539 | ], 1540 | "text/plain": [ 1541 | " VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \\\n", 1542 | "6896312 2 2019-12-31 23:56:29 2020-01-01 00:11:17 NaN \n", 1543 | "6896313 2 2019-12-31 23:11:53 2019-12-31 23:30:56 NaN \n", 1544 | "6896314 2 2019-12-31 23:57:21 2020-01-01 00:23:34 NaN \n", 1545 | "6896315 2 2019-12-31 23:37:29 2020-01-01 00:28:21 NaN \n", 1546 | "6896316 2 2019-12-31 23:09:00 2019-12-31 23:54:00 NaN \n", 1547 | "\n", 1548 | " trip_distance RatecodeID store_and_fwd_flag PULocationID \\\n", 1549 | "6896312 2.82 NaN None 143 \n", 1550 | "6896313 3.75 NaN None 148 \n", 1551 | "6896314 6.46 NaN None 197 \n", 1552 | "6896315 5.66 NaN None 90 \n", 1553 | "6896316 -15.50 NaN None 142 \n", 1554 | "\n", 1555 | " DOLocationID payment_type fare_amount extra mta_tax tip_amount \\\n", 1556 | "6896312 141 0 18.95 2.75 0.0 0.0 \n", 1557 | "6896313 246 0 22.45 2.75 0.0 0.0 \n", 1558 | "6896314 205 0 34.86 2.75 0.0 0.0 \n", 1559 | "6896315 74 0 36.45 2.75 0.0 0.0 \n", 1560 | "6896316 149 0 53.03 2.75 0.5 0.0 \n", 1561 | "\n", 1562 | " tolls_amount improvement_surcharge total_amount \\\n", 1563 | "6896312 0.00 0.3 22.00 \n", 1564 | "6896313 0.00 0.3 25.50 \n", 1565 | "6896314 0.00 0.3 37.91 \n", 1566 | "6896315 0.00 0.3 39.50 \n", 1567 | "6896316 6.12 0.3 62.70 \n", 1568 | "\n", 1569 | " congestion_surcharge airport_fee \n", 1570 | "6896312 NaN None \n", 1571 | "6896313 NaN None \n", 1572 | "6896314 NaN None \n", 1573 | "6896315 NaN None \n", 1574 | "6896316 NaN None " 1575 | ] 1576 | }, 1577 | "execution_count": 8, 1578 | "metadata": {}, 1579 | "output_type": "execute_result" 1580 | } 1581 | ], 1582 | "source": [ 1583 | "df.tail()" 1584 | ] 1585 | }, 1586 | { 1587 | "cell_type": "markdown", 1588 | "metadata": {}, 1589 | "source": [ 1590 | "This is different from pandas. pandas reads the complete dataset before inferring the datatypes and null-value information, which wouldn't be ideal for a larger-than-memory dataset.\n", 1591 | "\n", 1592 | "Dask estimates the datatypes with a small sample of data to stay efficient, so a good practice is to specify datatypes during the function call.\n", 1593 | "\n", 1594 | "*Note that Dask also provides a helpful error message to diagnose this issue.*" 1595 | ] 1596 | }, 1597 | { 1598 | "cell_type": "code", 1599 | "execution_count": 8, 1600 | "metadata": {}, 1601 | "outputs": [], 1602 | "source": [ 1603 | "df = dd.read_parquet(\n", 1604 | " \"s3://nyc-tlc/trip data/yellow_tripdata_2019-*.parquet\",\n", 1605 | " dtype={'RatecodeID': 'float64',\n", 1606 | " 'VendorID': 'float64',\n", 1607 | " 'passenger_count': 'float64',\n", 1608 | " 'payment_type': 'float64'}\n", 1609 | ")\n", 1610 | "# repartition the dataset to a more optimal size for faster computations\n", 1611 | "df = df.repartition(partition_size=\"100MB\").persist()" 1612 | ] 1613 | }, 1614 | { 1615 | "cell_type": "code", 1616 | "execution_count": 9, 1617 | "metadata": {}, 1618 | "outputs": [ 1619 | { 1620 | "data": { 1621 | "text/html": [ 1622 | "
\n", 1623 | "\n", 1636 | "\n", 1637 | " \n", 1638 | " \n", 1639 | " \n", 1640 | " \n", 1641 | " \n", 1642 | " \n", 1643 | " \n", 1644 | " \n", 1645 | " \n", 1646 | " \n", 1647 | " \n", 1648 | " \n", 1649 | " \n", 1650 | " \n", 1651 | " \n", 1652 | " \n", 1653 | " \n", 1654 | " \n", 1655 | " \n", 1656 | " \n", 1657 | " \n", 1658 | " \n", 1659 | " \n", 1660 | " \n", 1661 | " \n", 1662 | " \n", 1663 | " \n", 1664 | " \n", 1665 | " \n", 1666 | " \n", 1667 | " \n", 1668 | " \n", 1669 | " \n", 1670 | " \n", 1671 | " \n", 1672 | " \n", 1673 | " \n", 1674 | " \n", 1675 | " \n", 1676 | " \n", 1677 | " \n", 1678 | " \n", 1679 | " \n", 1680 | " \n", 1681 | " \n", 1682 | " \n", 1683 | " \n", 1684 | " \n", 1685 | " \n", 1686 | " \n", 1687 | " \n", 1688 | " \n", 1689 | " \n", 1690 | " \n", 1691 | " \n", 1692 | " \n", 1693 | " \n", 1694 | " \n", 1695 | " \n", 1696 | " \n", 1697 | " \n", 1698 | " \n", 1699 | " \n", 1700 | " \n", 1701 | " \n", 1702 | " \n", 1703 | " \n", 1704 | " \n", 1705 | " \n", 1706 | " \n", 1707 | " \n", 1708 | " \n", 1709 | " \n", 1710 | " \n", 1711 | " \n", 1712 | " \n", 1713 | " \n", 1714 | " \n", 1715 | " \n", 1716 | " \n", 1717 | " \n", 1718 | " \n", 1719 | " \n", 1720 | " \n", 1721 | " \n", 1722 | " \n", 1723 | " \n", 1724 | " \n", 1725 | " \n", 1726 | " \n", 1727 | " \n", 1728 | " \n", 1729 | " \n", 1730 | " \n", 1731 | " \n", 1732 | " \n", 1733 | " \n", 1734 | " \n", 1735 | " \n", 1736 | " \n", 1737 | " \n", 1738 | " \n", 1739 | " \n", 1740 | " \n", 1741 | " \n", 1742 | " \n", 1743 | " \n", 1744 | " \n", 1745 | " \n", 1746 | " \n", 1747 | " \n", 1748 | " \n", 1749 | " \n", 1750 | " \n", 1751 | " \n", 1752 | " \n", 1753 | " \n", 1754 | " \n", 1755 | " \n", 1756 | " \n", 1757 | " \n", 1758 | " \n", 1759 | " \n", 1760 | " \n", 1761 | " \n", 1762 | " \n", 1763 | " \n", 1764 | " \n", 1765 | " \n", 1766 | " \n", 1767 | " \n", 1768 | " \n", 1769 | " \n", 1770 | " \n", 1771 | " \n", 1772 | " \n", 1773 | "
VendorIDtpep_pickup_datetimetpep_dropoff_datetimepassenger_counttrip_distanceRatecodeIDstore_and_fwd_flagPULocationIDDOLocationIDpayment_typefare_amountextramta_taxtip_amounttolls_amountimprovement_surchargetotal_amountcongestion_surchargeairport_fee
012019-01-01 00:46:402019-01-01 00:53:201.01.51.0N15123917.00.50.51.650.00.39.95NaNNone
112019-01-01 00:59:472019-01-01 01:18:591.02.61.0N239246114.00.50.51.000.00.316.30NaNNone
222018-12-21 13:48:302018-12-21 13:52:403.00.01.0N23623614.50.50.50.000.00.35.80NaNNone
322018-11-28 15:52:252018-11-28 15:55:455.00.01.0N19319323.50.50.50.000.00.37.55NaNNone
422018-11-28 15:56:572018-11-28 15:58:335.00.02.0N193193252.00.00.50.000.00.355.55NaNNone
\n", 1774 | "
" 1775 | ], 1776 | "text/plain": [ 1777 | " VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \\\n", 1778 | "0 1 2019-01-01 00:46:40 2019-01-01 00:53:20 1.0 \n", 1779 | "1 1 2019-01-01 00:59:47 2019-01-01 01:18:59 1.0 \n", 1780 | "2 2 2018-12-21 13:48:30 2018-12-21 13:52:40 3.0 \n", 1781 | "3 2 2018-11-28 15:52:25 2018-11-28 15:55:45 5.0 \n", 1782 | "4 2 2018-11-28 15:56:57 2018-11-28 15:58:33 5.0 \n", 1783 | "\n", 1784 | " trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID \\\n", 1785 | "0 1.5 1.0 N 151 239 \n", 1786 | "1 2.6 1.0 N 239 246 \n", 1787 | "2 0.0 1.0 N 236 236 \n", 1788 | "3 0.0 1.0 N 193 193 \n", 1789 | "4 0.0 2.0 N 193 193 \n", 1790 | "\n", 1791 | " payment_type fare_amount extra mta_tax tip_amount tolls_amount \\\n", 1792 | "0 1 7.0 0.5 0.5 1.65 0.0 \n", 1793 | "1 1 14.0 0.5 0.5 1.00 0.0 \n", 1794 | "2 1 4.5 0.5 0.5 0.00 0.0 \n", 1795 | "3 2 3.5 0.5 0.5 0.00 0.0 \n", 1796 | "4 2 52.0 0.0 0.5 0.00 0.0 \n", 1797 | "\n", 1798 | " improvement_surcharge total_amount congestion_surcharge airport_fee \n", 1799 | "0 0.3 9.95 NaN None \n", 1800 | "1 0.3 16.30 NaN None \n", 1801 | "2 0.3 5.80 NaN None \n", 1802 | "3 0.3 7.55 NaN None \n", 1803 | "4 0.3 55.55 NaN None " 1804 | ] 1805 | }, 1806 | "execution_count": 9, 1807 | "metadata": {}, 1808 | "output_type": "execute_result" 1809 | } 1810 | ], 1811 | "source": [ 1812 | "df.head()" 1813 | ] 1814 | }, 1815 | { 1816 | "cell_type": "code", 1817 | "execution_count": 10, 1818 | "metadata": {}, 1819 | "outputs": [ 1820 | { 1821 | "data": { 1822 | "text/html": [ 1823 | "
\n", 1824 | "\n", 1837 | "\n", 1838 | " \n", 1839 | " \n", 1840 | " \n", 1841 | " \n", 1842 | " \n", 1843 | " \n", 1844 | " \n", 1845 | " \n", 1846 | " \n", 1847 | " \n", 1848 | " \n", 1849 | " \n", 1850 | " \n", 1851 | " \n", 1852 | " \n", 1853 | " \n", 1854 | " \n", 1855 | " \n", 1856 | " \n", 1857 | " \n", 1858 | " \n", 1859 | " \n", 1860 | " \n", 1861 | " \n", 1862 | " \n", 1863 | " \n", 1864 | " \n", 1865 | " \n", 1866 | " \n", 1867 | " \n", 1868 | " \n", 1869 | " \n", 1870 | " \n", 1871 | " \n", 1872 | " \n", 1873 | " \n", 1874 | " \n", 1875 | " \n", 1876 | " \n", 1877 | " \n", 1878 | " \n", 1879 | " \n", 1880 | " \n", 1881 | " \n", 1882 | " \n", 1883 | " \n", 1884 | " \n", 1885 | " \n", 1886 | " \n", 1887 | " \n", 1888 | " \n", 1889 | " \n", 1890 | " \n", 1891 | " \n", 1892 | " \n", 1893 | " \n", 1894 | " \n", 1895 | " \n", 1896 | " \n", 1897 | " \n", 1898 | " \n", 1899 | " \n", 1900 | " \n", 1901 | " \n", 1902 | " \n", 1903 | " \n", 1904 | " \n", 1905 | " \n", 1906 | " \n", 1907 | " \n", 1908 | " \n", 1909 | " \n", 1910 | " \n", 1911 | " \n", 1912 | " \n", 1913 | " \n", 1914 | " \n", 1915 | " \n", 1916 | " \n", 1917 | " \n", 1918 | " \n", 1919 | " \n", 1920 | " \n", 1921 | " \n", 1922 | " \n", 1923 | " \n", 1924 | " \n", 1925 | " \n", 1926 | " \n", 1927 | " \n", 1928 | " \n", 1929 | " \n", 1930 | " \n", 1931 | " \n", 1932 | " \n", 1933 | " \n", 1934 | " \n", 1935 | " \n", 1936 | " \n", 1937 | " \n", 1938 | " \n", 1939 | " \n", 1940 | " \n", 1941 | " \n", 1942 | " \n", 1943 | " \n", 1944 | " \n", 1945 | " \n", 1946 | " \n", 1947 | " \n", 1948 | " \n", 1949 | " \n", 1950 | " \n", 1951 | " \n", 1952 | " \n", 1953 | " \n", 1954 | " \n", 1955 | " \n", 1956 | " \n", 1957 | " \n", 1958 | " \n", 1959 | " \n", 1960 | " \n", 1961 | " \n", 1962 | " \n", 1963 | " \n", 1964 | " \n", 1965 | " \n", 1966 | " \n", 1967 | " \n", 1968 | " \n", 1969 | " \n", 1970 | " \n", 1971 | " \n", 1972 | " \n", 1973 | " \n", 1974 | "
VendorIDtpep_pickup_datetimetpep_dropoff_datetimepassenger_counttrip_distanceRatecodeIDstore_and_fwd_flagPULocationIDDOLocationIDpayment_typefare_amountextramta_taxtip_amounttolls_amountimprovement_surchargetotal_amountcongestion_surchargeairport_fee
689631222019-12-31 23:56:292020-01-01 00:11:17NaN2.82NaNNone143141018.952.750.00.00.000.322.00NaNNone
689631322019-12-31 23:11:532019-12-31 23:30:56NaN3.75NaNNone148246022.452.750.00.00.000.325.50NaNNone
689631422019-12-31 23:57:212020-01-01 00:23:34NaN6.46NaNNone197205034.862.750.00.00.000.337.91NaNNone
689631522019-12-31 23:37:292020-01-01 00:28:21NaN5.66NaNNone9074036.452.750.00.00.000.339.50NaNNone
689631622019-12-31 23:09:002019-12-31 23:54:00NaN-15.50NaNNone142149053.032.750.50.06.120.362.70NaNNone
\n", 1975 | "
" 1976 | ], 1977 | "text/plain": [ 1978 | " VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \\\n", 1979 | "6896312 2 2019-12-31 23:56:29 2020-01-01 00:11:17 NaN \n", 1980 | "6896313 2 2019-12-31 23:11:53 2019-12-31 23:30:56 NaN \n", 1981 | "6896314 2 2019-12-31 23:57:21 2020-01-01 00:23:34 NaN \n", 1982 | "6896315 2 2019-12-31 23:37:29 2020-01-01 00:28:21 NaN \n", 1983 | "6896316 2 2019-12-31 23:09:00 2019-12-31 23:54:00 NaN \n", 1984 | "\n", 1985 | " trip_distance RatecodeID store_and_fwd_flag PULocationID \\\n", 1986 | "6896312 2.82 NaN None 143 \n", 1987 | "6896313 3.75 NaN None 148 \n", 1988 | "6896314 6.46 NaN None 197 \n", 1989 | "6896315 5.66 NaN None 90 \n", 1990 | "6896316 -15.50 NaN None 142 \n", 1991 | "\n", 1992 | " DOLocationID payment_type fare_amount extra mta_tax tip_amount \\\n", 1993 | "6896312 141 0 18.95 2.75 0.0 0.0 \n", 1994 | "6896313 246 0 22.45 2.75 0.0 0.0 \n", 1995 | "6896314 205 0 34.86 2.75 0.0 0.0 \n", 1996 | "6896315 74 0 36.45 2.75 0.0 0.0 \n", 1997 | "6896316 149 0 53.03 2.75 0.5 0.0 \n", 1998 | "\n", 1999 | " tolls_amount improvement_surcharge total_amount \\\n", 2000 | "6896312 0.00 0.3 22.00 \n", 2001 | "6896313 0.00 0.3 25.50 \n", 2002 | "6896314 0.00 0.3 37.91 \n", 2003 | "6896315 0.00 0.3 39.50 \n", 2004 | "6896316 6.12 0.3 62.70 \n", 2005 | "\n", 2006 | " congestion_surcharge airport_fee \n", 2007 | "6896312 NaN None \n", 2008 | "6896313 NaN None \n", 2009 | "6896314 NaN None \n", 2010 | "6896315 NaN None \n", 2011 | "6896316 NaN None " 2012 | ] 2013 | }, 2014 | "execution_count": 10, 2015 | "metadata": {}, 2016 | "output_type": "execute_result" 2017 | } 2018 | ], 2019 | "source": [ 2020 | "df.tail()" 2021 | ] 2022 | }, 2023 | { 2024 | "cell_type": "markdown", 2025 | "metadata": {}, 2026 | "source": [ 2027 | "This now works!" 2028 | ] 2029 | }, 2030 | { 2031 | "cell_type": "markdown", 2032 | "metadata": {}, 2033 | "source": [ 2034 | "### Working with the data\n", 2035 | "\n", 2036 | "The same computation (to calculate mean for the tip amount as a function of passenger count) is now performed on the entire dataset using Dask DataFrame.\n", 2037 | "\n", 2038 | "*Note that Dask code is similar to pandas code.*" 2039 | ] 2040 | }, 2041 | { 2042 | "cell_type": "code", 2043 | "execution_count": 11, 2044 | "metadata": {}, 2045 | "outputs": [ 2046 | { 2047 | "name": "stdout", 2048 | "output_type": "stream", 2049 | "text": [ 2050 | "CPU times: user 5.54 ms, sys: 1.69 ms, total: 7.24 ms\n", 2051 | "Wall time: 6.61 ms\n" 2052 | ] 2053 | }, 2054 | { 2055 | "data": { 2056 | "text/plain": [ 2057 | "Dask Series Structure:\n", 2058 | "npartitions=1\n", 2059 | " float64\n", 2060 | " ...\n", 2061 | "Name: tip_amount, dtype: float64\n", 2062 | "Dask Name: truediv, 6 graph layers" 2063 | ] 2064 | }, 2065 | "execution_count": 11, 2066 | "metadata": {}, 2067 | "output_type": "execute_result" 2068 | } 2069 | ], 2070 | "source": [ 2071 | "%%time\n", 2072 | "\n", 2073 | "mean_tip_amount = df.groupby(\"passenger_count\").tip_amount.mean()\n", 2074 | "mean_tip_amount" 2075 | ] 2076 | }, 2077 | { 2078 | "cell_type": "markdown", 2079 | "metadata": {}, 2080 | "source": [ 2081 | "Dask DataFrame is backed by the Delayed API we saw in the previous notebook, so the evaluations here are also lazy.\n", 2082 | "\n", 2083 | "You can use `compute()` to get the output." 
2084 |    ]
2085 |   },
2086 |   {
2087 |    "cell_type": "code",
2088 |    "execution_count": null,
2089 |    "metadata": {},
2090 |    "outputs": [],
2091 |    "source": [
2092 |     "%%time\n",
2093 |     "\n",
2094 |     "mean_tip_amount.compute()"
2095 |    ]
2096 |   },
2097 |   {
2098 |    "cell_type": "markdown",
2099 |    "metadata": {},
2100 |    "source": [
2101 |     "Dask deletes intermediate results, like the full pandas dataframe for each file. This lets us handle datasets that are larger than memory, but also means that repeated computations will have to load all of the data in each time.\n",
2102 |     "\n",
2103 |     "You can use `persist()` to store intermediate results for future use:"
2104 |    ]
2105 |   },
2106 |   {
2107 |    "cell_type": "markdown",
2108 |    "metadata": {},
2109 |    "source": [
2110 |     "```\n",
2111 |     "mean_tip_persist = mean_tip_amount.persist()\n",
2112 |     "```"
2113 |    ]
2114 |   },
2115 |   {
2116 |    "cell_type": "markdown",
2117 |    "metadata": {},
2118 |    "source": [
2119 |     "### Checkpoint"
2120 |    ]
2121 |   },
2122 |   {
2123 |    "cell_type": "markdown",
2124 |    "metadata": {},
2125 |    "source": [
2126 |     "**Question:** Compute the standard deviation for tip_amount as a function of passenger_count for the entire dataset."
2127 |    ]
2128 |   },
2129 |   {
2130 |    "cell_type": "code",
2131 |    "execution_count": null,
2132 |    "metadata": {},
2133 |    "outputs": [],
2134 |    "source": [
2135 |     "#your answer here"
2136 |    ]
2137 |   },
2138 |   {
2139 |    "cell_type": "code",
2140 |    "execution_count": null,
2141 |    "metadata": {
2142 |     "jupyter": {
2143 |      "source_hidden": true
2144 |     }
2145 |    },
2146 |    "outputs": [],
2147 |    "source": [
2148 |     "# Solution 1\n",
2149 |     "\n",
2150 |     "std_tip = df.groupby(\"passenger_count\").tip_amount.std().compute()"
2151 |    ]
2152 |   },
2153 |   {
2154 |    "cell_type": "markdown",
2155 |    "metadata": {},
2156 |    "source": [
2157 |     "### Sharing intermediate outputs"
2158 |    ]
2159 |   },
2160 |   {
2161 |    "cell_type": "markdown",
2162 |    "metadata": {},
2163 |    "source": [
2164 |     "Sometimes individual computations are related to each other and can benefit from sharing intermediate results, for example, computing the minimum and maximum of the same column."
2165 |    ]
2166 |   },
2167 |   {
2168 |    "cell_type": "markdown",
2169 |    "metadata": {},
2170 |    "source": [
2171 |     "In pandas (and therefore in Dask DataFrame), you can use `min()`, `max()`, and `median()` to compute the minimum, maximum, and median respectively."
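    ,
    "\n",
    "Once the next cell has created these lazy results, you can optionally draw their combined task graph to see how much work they share. This is only an illustrative sketch, and it assumes the optional `graphviz` dependency is installed:\n",
    "\n",
    "```\n",
    "import dask\n",
    "\n",
    "# one picture of all three graphs highlights the tasks they have in common\n",
    "dask.visualize(max_tip_amount, min_tip_amount, median_tip_amount)\n",
    "```"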
2172 |    ]
2173 |   },
2174 |   {
2175 |    "cell_type": "code",
2176 |    "execution_count": 13,
2177 |    "metadata": {},
2178 |    "outputs": [],
2179 |    "source": [
2180 |     "max_tip_amount = df.tip_amount.max()\n",
2181 |     "min_tip_amount = df.tip_amount.min()\n",
2182 |     "median_tip_amount = df.tip_amount.median()"
2183 |    ]
2184 |   },
2185 |   {
2186 |    "cell_type": "markdown",
2187 |    "metadata": {},
2188 |    "source": [
2189 |     "### Without Sharing"
2190 |    ]
2191 |   },
2192 |   {
2193 |    "cell_type": "code",
2194 |    "execution_count": 14,
2195 |    "metadata": {},
2196 |    "outputs": [
2197 |     {
2198 |      "name": "stdout",
2199 |      "output_type": "stream",
2200 |      "text": [
2201 |       "CPU times: user 1min 16s, sys: 2.13 s, total: 1min 18s\n",
2202 |       "Wall time: 1min 33s\n"
2203 |      ]
2204 |     }
2205 |    ],
2206 |    "source": [
2207 |     "%%time\n",
2208 |     "max_tip = max_tip_amount.compute()\n",
2209 |     "min_tip = min_tip_amount.compute()\n",
2210 |     "median_tip = median_tip_amount.compute()"
2211 |    ]
2212 |   },
2213 |   {
2214 |    "cell_type": "markdown",
2215 |    "metadata": {},
2216 |    "source": [
2217 |     "### With Sharing"
2218 |    ]
2219 |   },
2220 |   {
2221 |    "cell_type": "code",
2222 |    "execution_count": 15,
2223 |    "metadata": {},
2224 |    "outputs": [],
2225 |    "source": [
2226 |     "import dask"
2227 |    ]
2228 |   },
2229 |   {
2230 |    "cell_type": "code",
2231 |    "execution_count": 16,
2232 |    "metadata": {},
2233 |    "outputs": [
2234 |     {
2235 |      "name": "stdout",
2236 |      "output_type": "stream",
2237 |      "text": [
2238 |       "CPU times: user 47.6 s, sys: 1.21 s, total: 48.8 s\n",
2239 |       "Wall time: 51.8 s\n"
2240 |      ]
2241 |     }
2242 |    ],
2243 |    "source": [
2244 |     "%%time\n",
2245 |     "max_tip, min_tip = dask.compute(max_tip_amount, min_tip_amount)"
2246 |    ]
2247 |   },
2248 |   {
2249 |    "cell_type": "markdown",
2250 |    "metadata": {},
2251 |    "source": [
2252 |     "Notice that the shared computation is significantly faster!"
2253 |    ]
2254 |   },
2255 |   {
2256 |    "cell_type": "markdown",
2257 |    "metadata": {},
2258 |    "source": [
2259 |     "### Checkpoint"
2260 |    ]
2261 |   },
2262 |   {
2263 |    "cell_type": "markdown",
2264 |    "metadata": {},
2265 |    "source": [
2266 |     "**Question:** Compute the mean and standard deviation for `total_amount` by sharing intermediate results."
2267 |    ]
2268 |   },
2269 |   {
2270 |    "cell_type": "code",
2271 |    "execution_count": null,
2272 |    "metadata": {},
2273 |    "outputs": [],
2274 |    "source": [
2275 |     "#your answer here"
2276 |    ]
2277 |   },
2278 |   {
2279 |    "cell_type": "code",
2280 |    "execution_count": null,
2281 |    "metadata": {
2282 |     "jupyter": {
2283 |      "source_hidden": true
2284 |     },
2285 |     "tags": []
2286 |    },
2287 |    "outputs": [],
2288 |    "source": [
2289 |     "# Solution 2\n",
2290 |     "\n",
2291 |     "import dask\n",
2292 |     "\n",
2293 |     "mean_total = df.total_amount.mean()\n",
2294 |     "std_total = df.total_amount.std()\n",
2295 |     "\n",
2296 |     "dask.compute(mean_total, std_total)"
2297 |    ]
2298 |   },
2299 |   {
2300 |    "cell_type": "code",
2301 |    "execution_count": 17,
2302 |    "metadata": {},
2303 |    "outputs": [],
2304 |    "source": [
2305 |     "client.close()"
2306 |    ]
2307 |   },
2308 |   {
2309 |    "cell_type": "markdown",
2310 |    "metadata": {},
2311 |    "source": [
2312 |     "## Scaling to the Cloud (Optional)\n",
2313 |     "\n",
2314 |     "We can now scale our Dask workflow to the cloud. There are many different ways to do this, but here we'll use [Coiled](https://www.coiled.io/). Coiled allows us to stay in this same notebook and makes the process much easier (see the [Coiled documentation](https://docs.coiled.io/user_guide/index.html)).\n",
2315 |     "\n",
2316 |     "1. Sign in to [cloud.coiled.io](https://cloud.coiled.io/)\n",
2317 |     "2. In your terminal (or command prompt on Windows), run `coiled login`\n",
2318 |     "3. Set up Coiled with your cloud provider account by running `coiled setup wizard`\n",
2319 |     "\n",
2320 |     "*Coiled is free to start!*\n",
2321 |     "\n",
2322 |     "That's it! Now, in the same notebook, let's connect to our Coiled cluster."
2323 |    ]
2324 |   },
2325 |   {
2326 |    "cell_type": "code",
2327 |    "execution_count": null,
2328 |    "metadata": {},
2329 |    "outputs": [],
2330 |    "source": [
2331 |     "import coiled\n",
2332 |     "\n",
2333 |     "cluster = coiled.Cluster(\n",
2334 |     "    name=\"talkpython\",\n",
2335 |     "    n_workers=10,\n",
2336 |     "    # uncomment if you're running on binder\n",
2337 |     "    # scheduler_port=443\n",
2338 |     ")"
2339 |    ]
2340 |   },
2341 |   {
2342 |    "cell_type": "code",
2343 |    "execution_count": 2,
2344 |    "metadata": {},
2345 |    "outputs": [],
2346 |    "source": [
2347 |     "from dask.distributed import Client\n",
2348 |     "\n",
2349 |     "client = Client(cluster)"
2350 |    ]
2351 |   },
2352 |   {
2353 |    "cell_type": "code",
2354 |    "execution_count": 3,
2355 |    "metadata": {
2356 |     "tags": []
2357 |    },
2358 |    "outputs": [],
2359 |    "source": [
2360 |     "import dask.dataframe as dd"
2361 |    ]
2362 |   },
2363 |   {
2364 |    "cell_type": "code",
2365 |    "execution_count": 13,
2366 |    "metadata": {},
2367 |    "outputs": [],
2368 |    "source": [
2369 |     "df = dd.read_parquet(\n",
2370 |     "    \"s3://nyc-tlc/trip data/yellow_tripdata_2019-*.parquet\"\n",
2371 |     ")\n",
2372 |     "df = df.repartition(partition_size=\"100MB\").persist()"
2373 |    ]
2374 |   },
2375 |   {
2376 |    "cell_type": "code",
2377 |    "execution_count": 14,
2378 |    "metadata": {},
2379 |    "outputs": [
2380 |     {
2381 |      "data": {
2382 |       "text/plain": [
2383 |        "passenger_count\n",
2384 |        "0.0    2.122789\n",
2385 |        "1.0    2.206793\n",
2386 |        "2.0    2.214356\n",
2387 |        "3.0    2.137791\n",
2388 |        "4.0    2.023801\n",
2389 |        "5.0    2.235441\n",
2390 |        "6.0    2.221106\n",
2391 |        "7.0    6.675962\n",
2392 |        "8.0    7.111625\n",
2393 |        "9.0    7.377822\n",
2394 |        "Name: tip_amount, dtype: float64"
2395 |       ]
2396 |      },
2397 |      "execution_count": 14,
2398 |      "metadata": {},
2399 |      "output_type": "execute_result"
2400 |     }
2401 |    ],
2402 |    "source": [
2403 |     "df.groupby(\"passenger_count\").tip_amount.mean().compute()"
2404 |    ]
2405 |   },
2406 |   {
2407 |    "cell_type": "code",
2408 |    "execution_count": 15,
2409 |    "metadata": {},
2410 |    "outputs": [],
2411 |    "source": [
2412 |     "# Close the cluster\n",
2413 |     "# Will close automatically after 20 minutes of inactivity\n",
2414 |     "cluster.close()\n",
2415 |     "\n",
2416 |     "# Close the client\n",
2417 |     "client.close()"
2418 |    ]
2419 |   },
2420 |   {
2421 |    "cell_type": "markdown",
2422 |    "metadata": {},
2423 |    "source": [
2424 |     "## Limitations of Dask DataFrame\n",
2425 |     "\n",
2426 |     "The Dask DataFrame API does not implement the complete pandas interface, because some pandas operations are not well suited to a parallel and distributed environment.\n",
2427 |     "\n",
2428 |     "### Data Shuffling\n",
2429 |     "\n",
2430 |     "Dask DataFrames consist of multiple pandas DataFrames, each of which has its own index starting from zero. Some operations, like setting or resetting the index (`set_index`, `reset_index`), may need the data to be sorted, which requires a lot of time-consuming shuffling of data. These operations are slower in Dask. 
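For instance, if you can guarantee that the new index column is already sorted (an assumption, and not something this dataset promises), a minimal sketch of setting the index without a full shuffle looks like this:\n",
    "\n",
    "```\n",
    "# sketch: sorted=True promises Dask the column is already in order (assumed here), so no shuffle is needed\n",
    "df = df.set_index(\"tpep_pickup_datetime\", sorted=True)\n",
    "```\n",
    "\n",
    "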
Hence, presorting the index and making logical partitions are good practices.\n" 2431 | ] 2432 | }, 2433 | { 2434 | "cell_type": "markdown", 2435 | "metadata": {}, 2436 | "source": [ 2437 | "## References\n", 2438 | "\n", 2439 | "* [Dask DataFrame documentation](https://docs.dask.org/en/latest/dataframe.html)\n", 2440 | "* [Dask DataFrame API](https://docs.dask.org/en/latest/dataframe-api.html)\n", 2441 | "* [Dask DataFrame examples](https://examples.dask.org/dataframe.html)\n", 2442 | "* [Dask Tutorial - DataFrames](https://github.com/pavithraes/dask-tutorial/blob/master/04_dataframe.ipynb)" 2443 | ] 2444 | }, 2445 | { 2446 | "cell_type": "code", 2447 | "execution_count": null, 2448 | "metadata": {}, 2449 | "outputs": [], 2450 | "source": [] 2451 | } 2452 | ], 2453 | "metadata": { 2454 | "kernelspec": { 2455 | "display_name": "Python 3 (ipykernel)", 2456 | "language": "python", 2457 | "name": "python3" 2458 | }, 2459 | "language_info": { 2460 | "codemirror_mode": { 2461 | "name": "ipython", 2462 | "version": 3 2463 | }, 2464 | "file_extension": ".py", 2465 | "mimetype": "text/x-python", 2466 | "name": "python", 2467 | "nbconvert_exporter": "python", 2468 | "pygments_lexer": "ipython3", 2469 | "version": "3.9.15" 2470 | } 2471 | }, 2472 | "nbformat": 4, 2473 | "nbformat_minor": 4 2474 | } 2475 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Getting Started with Dask 2 | 3 | This repository contains the material for **Talk Python Training course** on Getting Started with Dask. 4 | 5 | In this **free** course, we will get you up to speed with Dask and show you how to easily convert pandas workloads to blazing Dask clusters (locally across cores or scaled-out across cloud servers). 6 | 7 | Learn more and take the course at: [training.talkpython.fm](https://training.talkpython.fm/courses/introduction-to-scaling-python-and-pandas-with-dask) 8 | 9 | In this course, you will: 10 | 11 | * Explore the problems solved by Dask: What is big data and how can you work with it? 12 | * Learn the Dask API and how to use it 13 | * Analyze the NYC taxicab dataset with Dask on a local cluster 14 | * Scale that same computation to the cloud with Coiled 15 | * Connect to local and remote Dask cluster visualization and reporting dashboards 16 | * And more! 17 | 18 | 19 | ## Prerequisites 20 | 21 | - Basic Python 22 | 23 | Not required, but nice to have: 24 | - pandas 25 | - JupyterLab 26 | - conda (for local setup) 27 | - terminal (for local setup) 28 | 29 | ## Setup 30 | 31 | You get up and running in two ways: 32 | 33 | ### Launch Binder 34 | 35 | [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/coiled/talkpython-getting-started-with-dask/master?urlpath=lab/tree/00-setup.ipynb) 36 | 37 | The binder project allows you to open Jupyter notebooks in this repository in an online executable environment. Click on the "launch binder" link in your browser window to get started. It might take a few minutes to start. 
38 | 39 | *Note: Binder notebooks timeout if inactive for more than 10 mins.* 40 | 41 | ### Local setup (recommended) 42 | 43 | * [Fork this repository](https://docs.github.com/en/free-pro-team@latest/github/getting-started-with-github/fork-a-repo) 44 | 45 | * Clone your forked repository: 46 | 47 | ```git clone http://github.com//talkpython-getting-started-with-dask``` 48 | 49 | * From root directory: 50 | 51 | ```cd talkpython-getting-started-with-dask``` 52 | 53 | create a new conda environment: 54 | 55 | ```conda env create -f environment.yml``` 56 | 57 | * Activate the conda environment: 58 | 59 | ``` conda activate talkpython-dask``` 60 | 61 | * Start JupyterLab 62 | 63 | ```jupyter lab``` 64 | -------------------------------------------------------------------------------- /binder/environment.yml: -------------------------------------------------------------------------------- 1 | name: talkpython-dask 2 | channels: 3 | - conda-forge 4 | dependencies: 5 | - python=3.8 6 | - nodejs 7 | - dask>=2021.3.0 8 | - dask-ml>=1.7.0 9 | - distributed>=2021.3.0 10 | - jupyterlab>=3.0 11 | - notebook 12 | - pandas>=1.0.1 13 | - numpy>=1.19.2 14 | - scipy>=1.4.1 15 | - scikit-learn>=0.22.1 16 | - scikit-image>=0.15.0 17 | - ipywidgets>=7.5 18 | - bokeh>=2.3.0 19 | - pip>=20.3.0 20 | - pip: 21 | - dask-labextension>=3.0.0 22 | - coiled 23 | - python-graphviz 24 | - h5py 25 | - mimesis 26 | -------------------------------------------------------------------------------- /binder/start: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Replace DASK_DASHBOARD_URL with the proxy location 4 | sed -i -e "s|DASK_DASHBOARD_URL|${JUPYTERHUB_BASE_URL}user/${JUPYTERHUB_USER}/proxy/8787|g" binder/jupyterlab-workspace.json 5 | 6 | exec "$@" 7 | -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: talkpython-dask 2 | channels: 3 | - conda-forge 4 | dependencies: 5 | - python=3.9 6 | - coiled-runtime>=0.2.1 7 | --------------------------------------------------------------------------------