├── 1-Futures.ipynb ├── 2-Dataframes.ipynb ├── README.md ├── binder ├── environment.yml ├── jupyterlab-workspace.json └── start ├── images └── nyc-taxi-scatter.png └── requirements.txt /1-Futures.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "3786371b-cdd0-473b-89bd-94573b3676b5", 6 | "metadata": {}, 7 | "source": [ 8 | "\"Dask\n", 12 | " \n", 13 | "# Parallelize Python with
Dask Futures\n", 14 | "\n", 15 | "In this lesson, we will take normal for-loopy Python code that looks like this:\n", 16 | "\n", 17 | "```python\n", 18 | "urls = [...]\n", 19 | "results = []\n", 20 | "for url in urls:\n", 21 | " page = download(url)\n", 22 | " result = process(page)\n", 23 | " results.append(result)\n", 24 | "```\n", 25 | "\n", 26 | "or more dynamic Python code that looks like this:\n", 27 | "\n", 28 | "```python\n", 29 | "urls = [...]\n", 30 | "results = []\n", 31 | "while urls:\n", 32 | " url = urls.pop()\n", 33 | " page = download(url)\n", 34 | " result = process(page)\n", 35 | " results.append(result)\n", 36 | " \n", 37 | " new_urls = scrape(page)\n", 38 | " urls.extend(new_urls)\n", 39 | "```\n", 40 | "\n", 41 | "and turn it into a dynamic task graph (image right) that runs in parallel.\n", 42 | "\n", 43 | "This will give us the foundation of parallel computing used by Dask. In future examples, we'll see how other library developers have used this to give us higher-level parallel APIs.\n" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "id": "e5eca4d3-0a1a-4b30-946b-676fa0daa32b", 49 | "metadata": {}, 50 | "source": [ 51 | "\n", 52 | "\n", 53 | "## Outline\n", 54 | "\n", 55 | "We will learn how to use futures, and then apply them to a real-world example, first in a simple case and then in a more complex one:\n", 56 | "\n", 57 | "1. (Learn) How to use Futures\n", 58 | "2. (Do) Use futures to download and parse webpages\n", 59 | "3. (Learn) Dynamic/changing workloads\n", 60 | "4. (Do) Crawl and scrape a website\n", 61 | "\n" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "id": "c1810a74-7c2b-41bd-b795-3b1a9aba71b7", 67 | "metadata": {}, 68 | "source": [ 69 | "\n", 70 | "## 1. 
(Learn) How to use Futures\n", 71 | "\n" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "id": "80a6e52b-c243-43e9-b255-bd60d6f142ca", 77 | "metadata": { 78 | "tags": [] 79 | }, 80 | "source": [ 81 | "### Parallel Code with low-level Futures\n", 82 | "\n", 83 | "This is an example of an embarrassingly parallel computation. We want to run the same Python code on many pieces of data. This simple, common case comes up all the time.\n", 84 | "\n", 85 | "Let's learn how to do this with [Dask futures](https://docs.dask.org/en/stable/futures.html).\n", 86 | "\n", 87 | "First, we're going to see a very simple example, then we'll try to parallelize the code above.\n" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "id": "01cf74da-c8d7-402a-ab4e-09785061c70e", 93 | "metadata": {}, 94 | "source": [ 95 | "### Set up a Dask cluster locally" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "id": "f512e342-4446-4b23-ab78-7d6bb4405e77", 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "from dask.distributed import Client\n", 106 | "\n", 107 | "client = Client()\n", 108 | "client" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "id": "8f4478ef-fc35-4e26-a1de-7ecd690ead1f", 114 | "metadata": {}, 115 | "source": [ 116 | "### Dask Futures introduction" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": null, 122 | "id": "d7e89388-eea3-4c66-ae63-ebe792806f5c", 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "import time\n", 127 | "import random\n", 128 | "\n", 129 | "def inc(x):\n", 130 | " time.sleep(random.random())\n", 131 | " return x + 1\n", 132 | "\n", 133 | "def double(x):\n", 134 | " time.sleep(random.random())\n", 135 | " return 2 * x\n", 136 | "\n", 137 | "def add(x, y):\n", 138 | " time.sleep(random.random())\n", 139 | " return x + y\n", 140 | " " 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "id": 
"491cc8d1-1a37-4d3f-b9ab-fa27d6283771", 146 | "metadata": {}, 147 | "source": [ 148 | "Dask futures let us run Python functions remotely on parallel hardware. Rather than calling the function directly:" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": null, 154 | "id": "784cdc94-c20c-4d4a-93c8-5f390671fa18", 155 | "metadata": {}, 156 | "outputs": [], 157 | "source": [ 158 | "%%time\n", 159 | "\n", 160 | "y = inc(10)\n", 161 | "z = double(y)\n", 162 | "z" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "id": "be50d006-c1c3-4796-9240-9b42048f09be", 168 | "metadata": {}, 169 | "source": [ 170 | "We can ask Dask to run that function, `inc`, on the data `10` by passing both as arguments to the `client.submit` method. The first argument is the function to call, and the remaining arguments are passed to that function.\n", 171 | "\n", 172 | "Normal Execution\n", 173 | "\n", 174 | "```python\n", 175 | "result = function(*args, **kwargs)\n", 176 | "```\n", 177 | "\n", 178 | "Submit function for remote execution\n", 179 | "\n", 180 | "```python\n", 181 | "future = client.submit(function, *args, **kwargs) # instantaneously fire off work\n", 182 | "...\n", 183 | "result = future.result() # when needed, block until done and collect the result\n", 184 | "```" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": null, 190 | "id": "acbde717-4779-4679-b21a-e7e46b8a8ea9", 191 | "metadata": {}, 192 | "outputs": [], 193 | "source": [ 194 | "%%time\n", 195 | "\n", 196 | "y = client.submit(inc, 10)\n", 197 | "z = client.submit(double, y)\n", 198 | "z" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "id": "2bb7b612-e8e6-4df9-8931-291e1c4e37d1", 204 | "metadata": {}, 205 | "source": [ 206 | "You'll notice that this returned immediately. 
That's because all we did was submit the `inc` function to run on Dask, and then return a `Future`, or a pointer to where the data will eventually be.\n", 207 | "\n", 208 | "We can gather the future by calling `future.result()`" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "id": "d19b2bb0-8d16-4fc7-bf34-f62272c8bdf9", 215 | "metadata": {}, 216 | "outputs": [], 217 | "source": [ 218 | "z" 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": null, 224 | "id": "92857028-91f6-4c52-8ded-1b90824ff591", 225 | "metadata": {}, 226 | "outputs": [], 227 | "source": [ 228 | "z.result()" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "id": "41f0e330-f7da-43c5-b14c-fb23ea527663", 234 | "metadata": {}, 235 | "source": [ 236 | "### Submit many tasks in a loop\n", 237 | "\n", 238 | "We can submit lots of functions to run at once, and then gather them when we're done. This allows us to easily parallelize simple for loops.\n", 239 | "\n", 240 | "*This section uses the following API*:\n", 241 | "\n", 242 | "- [Client.submit and Future.result](https://docs.dask.org/en/stable/futures.html#submit-tasks)\n" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "id": "cc323bbf-fadb-4ced-8cc5-ac4434de93f5", 248 | "metadata": {}, 249 | "source": [ 250 | "#### Sequential code" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": null, 256 | "id": "27af5073-7276-477b-8a26-7b87e58c7e93", 257 | "metadata": {}, 258 | "outputs": [], 259 | "source": [ 260 | "%%time \n", 261 | "\n", 262 | "data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n", 263 | "results = []\n", 264 | "\n", 265 | "for x in data:\n", 266 | " y = inc(x)\n", 267 | " z = double(y)\n", 268 | " results.append(z)\n", 269 | " \n", 270 | "results" 271 | ] 272 | }, 273 | { 274 | "cell_type": "markdown", 275 | "id": "1426a61d-971f-44df-bfc0-21806d3d673d", 276 | "metadata": {}, 277 | "source": [ 278 | "#### Parallel code" 279 | ] 280 | }, 
281 | { 282 | "cell_type": "code", 283 | "execution_count": null, 284 | "id": "684ecb0c-9dc8-4aaf-af84-3d3114ca9fe6", 285 | "metadata": { 286 | "jupyter": { 287 | "source_hidden": true 288 | }, 289 | "tags": [] 290 | }, 291 | "outputs": [], 292 | "source": [ 293 | "%%time \n", 294 | "\n", 295 | "data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n", 296 | "results = []\n", 297 | "\n", 298 | "for x in data:\n", 299 | " y = client.submit(inc, x)\n", 300 | " z = client.submit(double, y)\n", 301 | " results.append(z)\n", 302 | " \n", 303 | "results = client.gather(results)\n", 304 | "results" 305 | ] 306 | }, 307 | { 308 | "cell_type": "markdown", 309 | "id": "e5675eb0-5533-457e-9ee1-da574300dc9c", 310 | "metadata": {}, 311 | "source": [ 312 | "### Lessons:\n", 313 | "\n", 314 | "1. Submit a function to run elsewhere\n", 315 | "\n", 316 | " ```python\n", 317 | " y = f(x)\n", 318 | " future = client.submit(f, x)\n", 319 | " ```\n", 320 | " \n", 321 | " \n", 322 | "2. Get results when you're done\n", 323 | "\n", 324 | " ```python\n", 325 | " y = future.result()\n", 326 | " # or \n", 327 | " results = client.gather(futures)\n", 328 | " ```" 329 | ] 330 | }, 331 | { 332 | "cell_type": "markdown", 333 | "id": "2d4cc17e-25c1-4fef-a0a4-697bf67fa099", 334 | "metadata": { 335 | "tags": [] 336 | }, 337 | "source": [ 338 | "## 2. (Do) Use futures to download and parse webpages\n", 339 | "\n", 340 | "### Sequential Code\n", 341 | "\n", 342 | "The code below downloads 50 question pages from a Stack Overflow tag, parses those pages, and collects the title and list of tags from each page.\n", 343 | "\n", 344 | "We then count up all the tags to see what are the most popular kinds of questions. We divide this code into four sections:\n", 345 | "\n", 346 | "1. Define useful functions\n", 347 | "2. Get a list of pages to download and scrape\n", 348 | "3. Download and scrape\n", 349 | "4. 
Analyze results" 350 | ] 351 | }, 352 | { 353 | "cell_type": "markdown", 354 | "id": "93e385ba-320b-49c3-8e82-eaec3c190750", 355 | "metadata": {}, 356 | "source": [ 357 | "#### Define useful functions\n", 358 | "\n", 359 | "You don't need to study these. Feel free to skip." 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": null, 365 | "id": "d68476f8-ab20-4f34-b3a2-e5a063c39835", 366 | "metadata": {}, 367 | "outputs": [], 368 | "source": [ 369 | "import re\n", 370 | "import requests\n", 371 | "from bs4 import BeautifulSoup\n", 372 | "import time\n", 373 | "\n", 374 | "def download(url: str, delay=0) -> str:\n", 375 | " time.sleep(delay)\n", 376 | " response = requests.get(url)\n", 377 | " if response.status_code == 200:\n", 378 | " return response.text\n", 379 | " else:\n", 380 | " response.raise_for_status()\n", 381 | " \n", 382 | " \n", 383 | "def scrape_title(body: str) -> str:\n", 384 | " html = BeautifulSoup(body, \"html.parser\")\n", 385 | " return str(html.html.title)\n", 386 | "\n", 387 | "\n", 388 | "def scrape_links(body: str, base_url=\"\") -> list[str]:\n", 389 | " html = BeautifulSoup(body, \"html.parser\")\n", 390 | " \n", 391 | " return [\n", 392 | " str(base_url + link.attrs[\"href\"]).split(\"?\")[0]\n", 393 | " for link in html.find_all(\"a\") \n", 394 | " if re.match(\"/questions/\\d{5}\", link.attrs.get(\"href\", \"\"))\n", 395 | " ]\n", 396 | "\n", 397 | "\n", 398 | "def scrape_tags(body: str) -> list[str]:\n", 399 | " html = BeautifulSoup(body, \"html.parser\")\n", 400 | " \n", 401 | " return sorted({\n", 402 | " str(list(link.children)[0])\n", 403 | " for link in html.find_all(\"a\", class_=\"post-tag\")\n", 404 | " })" 405 | ] 406 | }, 407 | { 408 | "cell_type": "markdown", 409 | "id": "c55e8b99-ce1b-4c0d-8c9d-6e9db2b47339", 410 | "metadata": {}, 411 | "source": [ 412 | "#### Get list of pages to download and scrape" 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": null, 418 | "id": 
"a4da310d-aab4-4e52-8597-77dc8a6432e5", 419 | "metadata": {}, 420 | "outputs": [], 421 | "source": [ 422 | "url = \"https://stackoverflow.com/questions/tagged/jupyter\"\n", 423 | "body = download(url)\n", 424 | "urls = scrape_links(body, base_url=\"https://stackoverflow.com\")\n", 425 | "urls[:5]" 426 | ] 427 | }, 428 | { 429 | "cell_type": "markdown", 430 | "id": "b0fae6d1-d0e2-46df-b130-e13655d8d981", 431 | "metadata": {}, 432 | "source": [ 433 | "#### Download and scrape" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": null, 439 | "id": "a7ee1f7d-f0de-42a0-b99d-173103a39b20", 440 | "metadata": {}, 441 | "outputs": [], 442 | "source": [ 443 | "%%time\n", 444 | "\n", 445 | "all_tags = []\n", 446 | "titles = []\n", 447 | "\n", 448 | "for url in urls:\n", 449 | " page = download(url)\n", 450 | " print(\".\", end=\"\")\n", 451 | " tags = scrape_tags(page)\n", 452 | " title = scrape_title(page)\n", 453 | " \n", 454 | " all_tags.append(tags)\n", 455 | " titles.append(title)\n", 456 | "print()" 457 | ] 458 | }, 459 | { 460 | "cell_type": "markdown", 461 | "id": "199ed8c4-9f9e-47bd-9ac9-a6453729c6a9", 462 | "metadata": {}, 463 | "source": [ 464 | "#### Analyze Results\n", 465 | "\n", 466 | "Aggregate tags to find related topics" 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": null, 472 | "id": "bba06861-4b0d-4e21-af30-875129713d83", 473 | "metadata": {}, 474 | "outputs": [], 475 | "source": [ 476 | "import collections\n", 477 | "\n", 478 | "tag_counter = collections.defaultdict(int)\n", 479 | "\n", 480 | "for tags in all_tags:\n", 481 | " for tag in tags:\n", 482 | " tag_counter[tag] += 1\n", 483 | " \n", 484 | "sorted(tag_counter.items(), key=lambda kv: kv[1], reverse=True)[:10]" 485 | ] 486 | }, 487 | { 488 | "cell_type": "markdown", 489 | "id": "55a5f41f-ad62-4503-ad1c-ab68d9a9869c", 490 | "metadata": {}, 491 | "source": [ 492 | "### Exercise: Parallelize this code\n", 493 | "\n", 494 | "Take the code above, 
and use Dask futures to run it in parallel.\n", 495 | "\n", 496 | "Which sections should we think about parallelizing?" 497 | ] 498 | }, 499 | { 500 | "cell_type": "code", 501 | "execution_count": null, 502 | "id": "bd3e7682-7220-480a-9ca3-4278a5383e8c", 503 | "metadata": {}, 504 | "outputs": [], 505 | "source": [ 506 | "url = \"https://stackoverflow.com/questions/tagged/jupyter\"\n", 507 | "body = download(url)\n", 508 | "urls = scrape_links(body, base_url=\"https://stackoverflow.com\")" 509 | ] 510 | }, 511 | { 512 | "cell_type": "code", 513 | "execution_count": null, 514 | "id": "91f71522-5808-4950-92a5-f25bdc769a29", 515 | "metadata": {}, 516 | "outputs": [], 517 | "source": [ 518 | "%%time\n", 519 | "\n", 520 | "# TODO: parallelize me\n", 521 | "\n", 522 | "all_tags = []\n", 523 | "titles = []\n", 524 | "\n", 525 | "for url in urls:\n", 526 | " page = download(url)\n", 527 | " print(\".\", end=\"\")\n", 528 | " tags = scrape_tags(page)\n", 529 | " title = scrape_title(page)\n", 530 | " \n", 531 | " all_tags.append(tags)\n", 532 | " titles.append(title)\n", 533 | "print()" 534 | ] 535 | }, 536 | { 537 | "cell_type": "markdown", 538 | "id": "5a430e17-5bca-4b48-b914-95657515867e", 539 | "metadata": { 540 | "tags": [] 541 | }, 542 | "source": [ 543 | "##### Solution\n", 544 | "\n", 545 | "Expand the three dots below if you want to see the answer." 546 | ] 547 | }, 548 | { 549 | "cell_type": "code", 550 | "execution_count": null, 551 | "id": "1da9e012-31dc-4339-ab33-5ea38712ad65", 552 | "metadata": { 553 | "jupyter": { 554 | "source_hidden": true 555 | }, 556 | "tags": [] 557 | }, 558 | "outputs": [], 559 | "source": [ 560 | "%%time\n", 561 | "\n", 562 | "all_tags = []\n", 563 | "titles = []\n", 564 | "\n", 565 | "for url in urls:\n", 566 | " page = client.submit(download, url)\n", 567 | " tags = client.submit(scrape_tags, page)\n", 568 | " title = client.submit(scrape_title, page)\n", 569 | " \n", 570 | " all_tags.append(tags)\n", 571 | " titles.append(title)\n", 
572 | " \n", 573 | "all_tags = client.gather(all_tags)\n", 574 | "titles = client.gather(titles)" 575 | ] 576 | }, 577 | { 578 | "cell_type": "code", 579 | "execution_count": null, 580 | "id": "0f5df132-43d8-40ce-934e-eb0735a86519", 581 | "metadata": {}, 582 | "outputs": [], 583 | "source": [ 584 | "import collections\n", 585 | "\n", 586 | "tag_counter = collections.defaultdict(int)\n", 587 | "\n", 588 | "for tags in all_tags:\n", 589 | " for tag in tags:\n", 590 | " tag_counter[tag] += 1\n", 591 | " \n", 592 | "sorted(tag_counter.items(), key=lambda kv: kv[1], reverse=True)[:10]" 593 | ] 594 | }, 595 | { 596 | "cell_type": "markdown", 597 | "id": "648048f6-53d8-4e64-a0cf-f3aa5c841ed6", 598 | "metadata": {}, 599 | "source": [ 600 | "### Exercise: Scale out\n", 601 | "\n", 602 | "There are different reasons to scale out for this problem:\n", 603 | "\n", 604 | "1. Parallelize bandwidth\n", 605 | "2. StackOverflow's rate-limits won't affect us as much if we spread out our requests from many different machines\n", 606 | "3. ~CPU Processing speed~ (not really an issue here)\n", 607 | "\n", 608 | "Let's ask for some machines from Coiled, and switch our Dask client to use that cluster." 
609 | ] 610 | }, 611 | { 612 | "cell_type": "code", 613 | "execution_count": null, 614 | "id": "f57c5ed2-5103-4811-b1ee-781046c16649", 615 | "metadata": {}, 616 | "outputs": [], 617 | "source": [ 618 | "client.close()" 619 | ] 620 | }, 621 | { 622 | "cell_type": "code", 623 | "execution_count": null, 624 | "id": "4ded7332-a475-40f6-a7b2-01f017573c35", 625 | "metadata": { 626 | "tags": [] 627 | }, 628 | "outputs": [], 629 | "source": [ 630 | "import coiled\n", 631 | "\n", 632 | "cluster = coiled.Cluster(\n", 633 | " n_workers=20,\n", 634 | " worker_cpu=2,\n", 635 | " worker_options={\"nthreads\": 1},\n", 636 | " account=\"dask-tutorials\",\n", 637 | ")\n", 638 | "\n", 639 | "client = cluster.get_client()" 640 | ] 641 | }, 642 | { 643 | "attachments": { 644 | "dbc8e0aa-6a12-436a-aef4-f8253b1739d7.png": { 645 | "image/png": "" 646 | } 647 | }, 648 | "cell_type": "markdown", 649 | "id": "92acc786-81bd-49cd-867f-66b3346d0890", 650 | "metadata": {}, 651 | "source": [ 652 | "You can then insert the Dashboard URL into the text field at the top of the Dask JupyterLab extension sidebar, or press the Magnifying Glass icon (🔍) in the upper right of that section.\n", 653 | "\n", 654 | "This will change your dashboard plots to the new cluster.\n", 655 | "\n", 656 | "![Screen Shot 2023-05-08 at 10.36.36 AM.png](attachment:dbc8e0aa-6a12-436a-aef4-f8253b1739d7.png)" 657 | ] 658 | }, 659 | { 660 | "cell_type": "markdown", 661 | "id": "4a1d539e-b1ba-467f-bb13-b8b3a4168778", 662 | "metadata": {}, 663 | "source": [ 664 | "Now rerun your computation and see how it feels." 665 | ] 666 | }, 667 | { 668 | "cell_type": "markdown", 669 | "id": "ba1bd389-7495-48be-9efb-78b4a3e46415", 670 | "metadata": {}, 671 | "source": [ 672 | "## 3. (Learn) Evolving computations\n", 673 | "\n", 674 | "Dask futures are flexible. There are many ways to coordinate them including ...\n", 675 | "\n", 676 | "1. Distributed locks and semaphores\n", 677 | "2. Distributed queues\n", 678 | "3. 
Launching tasks from tasks\n", 679 | "4. Global variables\n", 680 | "5. ... [and lots more](https://docs.dask.org/en/stable/futures.html)\n", 681 | "\n", 682 | "We're going to get a taste of this by learning about one Dask futures feature, [`as_completed`](https://docs.dask.org/en/stable/futures.html#distributed.as_completed), which lets us build up a computation dynamically as earlier tasks complete.\n", 683 | "\n", 684 | "We will use this to build a parallel web crawler over Stack Overflow. \n", 685 | "\n", 686 | "1. First, we'll build this sequentially.\n", 687 | "2. Second, we'll learn how `as_completed` works in a simple example.\n", 688 | "3. Third, we'll convert the sequential code into parallel code." 689 | ] 690 | }, 691 | { 692 | "cell_type": "markdown", 693 | "id": "9b814665-7240-43ae-95f9-6fa860da8c2f", 694 | "metadata": {}, 695 | "source": [ 696 | "### Sequential Code to Crawl Stack Overflow" 697 | ] 698 | }, 699 | { 700 | "cell_type": "code", 701 | "execution_count": null, 702 | "id": "7350e302-f792-448d-bc41-4afd0bf527ad", 703 | "metadata": { 704 | "tags": [] 705 | }, 706 | "outputs": [], 707 | "source": [ 708 | "%%time\n", 709 | "from collections import deque\n", 710 | "\n", 711 | "urls = deque()\n", 712 | "urls.append(\"https://stackoverflow.com/questions/tagged/dask\") # seed with a single page\n", 713 | "\n", 714 | "all_tags = []\n", 715 | "titles = []\n", 716 | "seen = set()\n", 717 | "i = 0\n", 718 | "\n", 719 | "while urls and i < 10:\n", 720 | " url = urls.popleft()\n", 721 | " \n", 722 | " # Don't scrape the same page twice\n", 723 | " if url in seen: \n", 724 | " continue\n", 725 | " else:\n", 726 | " seen.add(url)\n", 727 | " \n", 728 | " print(\".\", end=\"\")\n", 729 | " i += 1\n", 730 | " \n", 731 | " # This is like before\n", 732 | " page = download(url)\n", 733 | " tags = scrape_tags(page)\n", 734 | " title = scrape_title(page)\n", 735 | " all_tags.append(tags)\n", 736 | " titles.append(title)\n", 737 | "\n", 738 | " # This is new! 
\n", 739 | " # We scrape links on this page, and add them to the list of URLs\n", 740 | " new_urls = scrape_links(page, base_url=\"https://stackoverflow.com\")\n", 741 | " urls.extend(new_urls)" 742 | ] 743 | }, 744 | { 745 | "cell_type": "code", 746 | "execution_count": null, 747 | "id": "1239abe4-3c3e-44f0-a0dd-825b9520360e", 748 | "metadata": { 749 | "tags": [] 750 | }, 751 | "outputs": [], 752 | "source": [ 753 | "titles" 754 | ] 755 | }, 756 | { 757 | "cell_type": "code", 758 | "execution_count": null, 759 | "id": "f18c4b87-5dd3-41b3-9856-2c79f00d3aa5", 760 | "metadata": { 761 | "tags": [] 762 | }, 763 | "outputs": [], 764 | "source": [ 765 | "all_tags" 766 | ] 767 | }, 768 | { 769 | "cell_type": "markdown", 770 | "id": "1a919f2b-1fe2-44dd-b3e6-36eb8c08756a", 771 | "metadata": {}, 772 | "source": [ 773 | "### Learn about `as_completed`\n", 774 | "\n", 775 | "Let's use our `inc`/`double` example from before, and add up the results as they come in." 776 | ] 777 | }, 778 | { 779 | "cell_type": "code", 780 | "execution_count": null, 781 | "id": "be90ffe7-27e1-4178-8bb1-49c5b7f9a70a", 782 | "metadata": {}, 783 | "outputs": [], 784 | "source": [ 785 | "%%time \n", 786 | "\n", 787 | "total = 0\n", 788 | "\n", 789 | "for x in range(16):\n", 790 | " y = inc(x)\n", 791 | " z = double(y)\n", 792 | " total += z\n", 793 | " print(\"Total:\", total, end=\"\\r\")" 794 | ] 795 | }, 796 | { 797 | "cell_type": "markdown", 798 | "id": "8d44ceb1-c433-481e-aaa4-0af038277206", 799 | "metadata": {}, 800 | "source": [ 801 | "When we want to change our computation on the fly, the `as_completed` object becomes useful. 
It has the following API:\n", 802 | "\n", 803 | "```python\n", 804 | "ac = as_completed()\n", 805 | "\n", 806 | "# add futures to as_completed\n", 807 | "ac.add(future_1)\n", 808 | "ac.add(future_2)\n", 809 | "ac.add(future_3)\n", 810 | "\n", 811 | "# block until any of them finish\n", 812 | "future = ac.next()\n", 813 | "```\n", 814 | "\n", 815 | "You can add futures at any time, and pull futures off at any time. Futures will always be finished when they emerge from the `as_completed` object." 816 | ] 817 | }, 818 | { 819 | "cell_type": "code", 820 | "execution_count": null, 821 | "id": "357dff8d-836d-436b-91b7-f5a0a0f87765", 822 | "metadata": {}, 823 | "outputs": [], 824 | "source": [ 825 | "%%time \n", 826 | "\n", 827 | "from dask.distributed import as_completed\n", 828 | "\n", 829 | "total = 0\n", 830 | "futures = as_completed()\n", 831 | "\n", 832 | "for x in range(200):\n", 833 | " y = client.submit(inc, x)\n", 834 | " z = client.submit(double, y)\n", 835 | " futures.add(z)\n", 836 | "\n", 837 | "while futures.count() > 1:\n", 838 | " a = futures.next()\n", 839 | " b = futures.next()\n", 840 | " c = client.submit(add, a, b, priority=1)\n", 841 | " futures.add(c)\n", 842 | " print(\"Some results:\", a.result(), b.result(), end=\"\\r\")\n", 843 | " \n", 844 | "total = futures.next().result()\n", 845 | "print(\"Final Result:\", total)" 846 | ] 847 | }, 848 | { 849 | "cell_type": "markdown", 850 | "id": "e82279ad-3352-4664-b351-f3690e05a819", 851 | "metadata": {}, 852 | "source": [ 853 | "## 4. (Do) Parallelize code to crawl Stack Overflow" 854 | ] 855 | }, 856 | { 857 | "cell_type": "markdown", 858 | "id": "b9202ed6-55c5-4575-a1d1-f48e439498e4", 859 | "metadata": {}, 860 | "source": [ 861 | "#### Difficulty: Hard\n", 862 | "\n", 863 | "Expand the cell below to see the sequential code from earlier. Parallelize it with futures and `as_completed`.\n", 864 | "\n", 865 | "If it's too hard, consider the option below to get some hints." 
866 | ] 867 | }, 868 | { 869 | "cell_type": "code", 870 | "execution_count": null, 871 | "id": "fab394ec-bd8b-458e-b340-51b275050fad", 872 | "metadata": { 873 | "jupyter": { 874 | "source_hidden": true 875 | }, 876 | "tags": [] 877 | }, 878 | "outputs": [], 879 | "source": [ 880 | "%%time\n", 881 | "from collections import deque\n", 882 | "\n", 883 | "urls = deque()\n", 884 | "urls.append(\"https://stackoverflow.com/questions/tagged/dask\") # seed with a single page\n", 885 | "\n", 886 | "all_tags = []\n", 887 | "titles = []\n", 888 | "seen = set()\n", 889 | "i = 0\n", 890 | "\n", 891 | "while urls and i < 10:\n", 892 | " url = urls.popleft()\n", 893 | " \n", 894 | " # Don't scrape the same page twice\n", 895 | " if url in seen: \n", 896 | " continue\n", 897 | " else:\n", 898 | " seen.add(url)\n", 899 | " \n", 900 | " print(\".\", end=\"\")\n", 901 | " i += 1\n", 902 | " \n", 903 | " # This is like before\n", 904 | " page = download(url, delay=0.25)\n", 905 | " tags = scrape_tags(page)\n", 906 | " title = scrape_title(page)\n", 907 | " all_tags.append(tags)\n", 908 | " titles.append(title)\n", 909 | "\n", 910 | " # This is new! 
\n", 911 | " # We scrape links on this page, and add them to the list of URLs\n", 912 | " new_urls = scrape_links(page, base_url=\"https://stackoverflow.com\")\n", 913 | " urls.extend(new_urls)" 914 | ] 915 | }, 916 | { 917 | "cell_type": "markdown", 918 | "id": "6233dcc7-ac22-4916-a67e-ac2996fc3d54", 919 | "metadata": {}, 920 | "source": [ 921 | "#### Difficulty: Medium\n", 922 | "\n", 923 | "Expand the cell below to get hints about how to proceed." 924 | ] 925 | }, 926 | { 927 | "cell_type": "code", 928 | "execution_count": null, 929 | "id": "feefb51d-82c7-41f4-84c5-8b15a277a54c", 930 | "metadata": { 931 | "jupyter": { 932 | "source_hidden": true 933 | }, 934 | "tags": [] 935 | }, 936 | "outputs": [], 937 | "source": [ 938 | "%%time\n", 939 | "from collections import deque\n", 940 | "from dask.distributed import as_completed\n", 941 | "\n", 942 | "urls = deque()\n", 943 | "urls.append(\"https://stackoverflow.com/questions/tagged/jupyter\") # seed with a single page\n", 944 | "\n", 945 | "all_tags = []\n", 946 | "titles = []\n", 947 | "url_futures = as_completed()\n", 948 | "seen = set()\n", 949 | "i = 0\n", 950 | "\n", 951 | "while (urls or not url_futures.is_empty()) and i < 1000: \n", 952 | " # TODO: If urls is empty, \n", 953 | " # get the next future from url_futures\n", 954 | " # collect those new url results to the local notebook\n", 955 | " # and add those new urls to urls\n", 956 | " \n", 957 | " url = urls.popleft()\n", 958 | " \n", 959 | " if url in seen:\n", 960 | " continue\n", 961 | " else:\n", 962 | " seen.add(url)\n", 963 | " \n", 964 | " print(\".\", end=\"\")\n", 965 | " i += 1\n", 966 | "\n", 967 | " # This is like before\n", 968 | " # TODO: Submit this work to happen in parallel\n", 969 | " page = download(url, delay=0.25)\n", 970 | " tags = scrape_tags(page)\n", 971 | " title = scrape_title(page)\n", 972 | " \n", 973 | " all_tags.append(tags)\n", 974 | " titles.append(title)\n", 975 | "\n", 976 | " # We scrape links on this page, and add 
them to the list of URLs\n", 977 | " # TODO: Submit this work to happen in parallel. Add the future to url_futures\n", 978 | " new_urls = scrape_links(page, base_url=\"https://stackoverflow.com\")\n", 979 | " urls.extend(new_urls)" 980 | ] 981 | }, 982 | { 983 | "cell_type": "markdown", 984 | "id": "d41829cc-9405-47d7-8ee1-35c9fc51bcd0", 985 | "metadata": {}, 986 | "source": [ 987 | "#### Difficulty: Easy\n", 988 | "\n", 989 | "Expand the cell below to get the full solution. Study the solution and ask questions." 990 | ] 991 | }, 992 | { 993 | "cell_type": "code", 994 | "execution_count": null, 995 | "id": "dcf0e2ca-e3ea-4dd9-be27-a37fd852b47a", 996 | "metadata": { 997 | "jupyter": { 998 | "source_hidden": true 999 | }, 1000 | "tags": [] 1001 | }, 1002 | "outputs": [], 1003 | "source": [ 1004 | "%%time\n", 1005 | "from collections import deque\n", 1006 | "from dask.distributed import as_completed\n", 1007 | "\n", 1008 | "urls = deque()\n", 1009 | "urls.append(\"https://stackoverflow.com/questions/tagged/jupyter\") # seed with a single page\n", 1010 | "\n", 1011 | "all_tags = []\n", 1012 | "titles = []\n", 1013 | "url_futures = as_completed()\n", 1014 | "seen = set()\n", 1015 | "i = 0\n", 1016 | "\n", 1017 | "while (urls or not url_futures.is_empty()) and i < 1000:\n", 1018 | " if not urls:\n", 1019 | " future = url_futures.next()\n", 1020 | " new_urls = future.result()\n", 1021 | " urls.extend(new_urls)\n", 1022 | " continue\n", 1023 | " \n", 1024 | " url = urls.popleft()\n", 1025 | " \n", 1026 | " if url in seen:\n", 1027 | " continue\n", 1028 | " else:\n", 1029 | " seen.add(url)\n", 1030 | " \n", 1031 | " print(\".\", end=\"\")\n", 1032 | " i += 1\n", 1033 | "\n", 1034 | " page = client.submit(download, url, delay=0.25)\n", 1035 | " tags = client.submit(scrape_tags, page)\n", 1036 | " title = client.submit(scrape_title, page)\n", 1037 | "\n", 1038 | " all_tags.append(tags)\n", 1039 | " titles.append(title)\n", 1040 | " \n", 1041 | " new_urls = 
client.submit(scrape_links, page, base_url=\"https://stackoverflow.com\")\n", 1042 | " url_futures.add(new_urls)" 1043 | ] 1044 | }, 1045 | { 1046 | "cell_type": "markdown", 1047 | "id": "6a3330ea-5354-4f13-a256-5c59892aa2b2", 1048 | "metadata": {}, 1049 | "source": [ 1050 | "### Analyze results\n", 1051 | "\n", 1052 | "At this point you likely have lists `titles` and `all_tags` that are lists of futures. Let's gather them and analyze results." 1053 | ] 1054 | }, 1055 | { 1056 | "cell_type": "code", 1057 | "execution_count": null, 1058 | "id": "11ec75ee-9dc4-4d2b-a251-b21a88122934", 1059 | "metadata": { 1060 | "tags": [] 1061 | }, 1062 | "outputs": [], 1063 | "source": [ 1064 | "titles = client.gather(titles)\n", 1065 | "titles[:20]" 1066 | ] 1067 | }, 1068 | { 1069 | "cell_type": "code", 1070 | "execution_count": null, 1071 | "id": "399bec76-291b-4a82-8c35-3bbca648279e", 1072 | "metadata": {}, 1073 | "outputs": [], 1074 | "source": [ 1075 | "all_tags = client.gather(all_tags)" 1076 | ] 1077 | }, 1078 | { 1079 | "cell_type": "code", 1080 | "execution_count": null, 1081 | "id": "41e5d085-cd39-4f6e-b82b-5a9913d4bc44", 1082 | "metadata": {}, 1083 | "outputs": [], 1084 | "source": [ 1085 | "import collections\n", 1086 | "\n", 1087 | "tag_counter = collections.defaultdict(int)\n", 1088 | "\n", 1089 | "for tags in all_tags:\n", 1090 | " for tag in tags:\n", 1091 | " tag_counter[tag] += 1\n", 1092 | " \n", 1093 | "sorted(tag_counter.items(), key=lambda kv: kv[1], reverse=True)[:20]" 1094 | ] 1095 | }, 1096 | { 1097 | "cell_type": "markdown", 1098 | "id": "e329af0b-bf41-43b3-9163-92d58386c8d4", 1099 | "metadata": {}, 1100 | "source": [ 1101 | "## Clean up" 1102 | ] 1103 | }, 1104 | { 1105 | "cell_type": "code", 1106 | "execution_count": null, 1107 | "id": "b0d37a84-97ee-4969-8846-876fe43e3455", 1108 | "metadata": {}, 1109 | "outputs": [], 1110 | "source": [ 1111 | "cluster.shutdown()\n", 1112 | "client.close()" 1113 | ] 1114 | }, 1115 | { 1116 | "cell_type": "markdown", 
1117 | "id": "62a7a6ef-b94b-46f8-af23-1a7ef27725f4", 1118 | "metadata": {}, 1119 | "source": [ 1120 | "### Useful links\n", 1121 | "\n", 1122 | "- https://tutorial.dask.org/05_futures.html\n", 1123 | "- [Futures documentation](https://docs.dask.org/en/latest/futures.html)\n", 1124 | "- [Futures screencast](https://www.youtube.com/watch?v=07EiCpdhtDE)\n", 1125 | "- [Futures examples](https://examples.dask.org/futures.html)\n", 1126 | "\n", 1127 | "### More Dask Tutorials\n", 1128 | "\n", 1129 | "Coiled also runs regular Dask tutorials. See [coiled.io/tutorials](https://www.coiled.io/tutorials) for more information. \n" 1130 | ] 1131 | } 1132 | ], 1133 | "metadata": { 1134 | "kernelspec": { 1135 | "display_name": "Python 3 (ipykernel)", 1136 | "language": "python", 1137 | "name": "python3" 1138 | }, 1139 | "language_info": { 1140 | "codemirror_mode": { 1141 | "name": "ipython", 1142 | "version": 3 1143 | }, 1144 | "file_extension": ".py", 1145 | "mimetype": "text/x-python", 1146 | "name": "python", 1147 | "nbconvert_exporter": "python", 1148 | "pygments_lexer": "ipython3", 1149 | "version": "3.10.0" 1150 | } 1151 | }, 1152 | "nbformat": 4, 1153 | "nbformat_minor": 5 1154 | } 1155 | -------------------------------------------------------------------------------- /2-Dataframes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f8eafed2-77a1-4691-8e8b-aeb1187ce8f5", 6 | "metadata": { 7 | "tags": [] 8 | }, 9 | "source": [ 10 | "\"Dask\n", 14 | "\n", 15 | "\n", 16 | "Dask DataFrames\n", 17 | "===============\n", 18 | "\n", 19 | "Dask dataframes are like pandas dataframes, just bigger.\n", 20 | "\n", 21 | "\"Dask\n", 25 | "\n", 26 | "\n", 27 | "\n", 28 | "API-wise they're mostly the same, except that when you want an answer, add `.compute()` to the end.\n", 29 | "\n", 30 | "```python\n", 31 | "# Pandas\n", 32 | "df.groupby(\"name\").value.mean()\n", 33 | "\n", 34 | "# 
Dask DataFrame\n", 35 | "df.groupby(\"name\").value.mean().compute()\n", 36 | "```\n", 37 | "\n", 38 | "This brings the result back to your local machine, so it had better be small!\n", 39 | "\n", 40 | "```python\n", 41 | "df.compute() # this would be unwise\n", 42 | "```\n" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "id": "6ef10c79-b17d-4612-b808-ba26830d802f", 48 | "metadata": {}, 49 | "source": [ 50 | "## Ask for machines" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "id": "9a9e6076-c8b3-4282-90a4-0fe3ab49440d", 57 | "metadata": { 58 | "tags": [] 59 | }, 60 | "outputs": [], 61 | "source": [ 62 | "import coiled\n", 63 | "\n", 64 | "cluster = coiled.Cluster(\n", 65 | " n_workers=20,\n", 66 | " region=\"us-east-2\", # start workers close to data to minimize costs\n", 67 | " account=\"dask-tutorials\",\n", 68 | ")\n", 69 | "\n", 70 | "client = cluster.get_client()" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "id": "1491663a-b4f5-46d9-9fb4-fc1f5a58421d", 76 | "metadata": {}, 77 | "source": [ 78 | "## Ingest Uber/Lyft Data\n", 79 | "\n", 80 | "\n", 81 | "The NYC Taxi dataset is a timeless classic. \n", 82 | "\n", 83 | "Interestingly, there is a new variant. The NYC Taxi and Limousine Commission requires data from all ride-share services in the city of New York. This includes private limousine services, van services, and a new category, \"High Volume For Hire Vehicle\" services: those that dispatch 10,000 rides per day or more. This is a special category defined for Uber and Lyft. 
" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "id": "33b598a4-fe0a-43c5-8007-0e955ac193f9", 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [ 93 | "import dask\n", 94 | "import pandas\n", 95 | "import dask.dataframe as dd\n", 96 | "import pandas as pd\n", 97 | "\n", 98 | "dask.config.set({\"dataframe.convert-string\": True}) # use PyArrow strings by default\n", 99 | "\n", 100 | "# df = pd.read_parquet( # this would work if we had enough memory\n", 101 | "df = dd.read_parquet(\n", 102 | " \"s3://coiled-datasets/uber-lyft-tlc/\",\n", 103 | " storage_options={\"anon\": True},\n", 104 | ")\n", 105 | "df.head()" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": null, 111 | "id": "70789360-efcd-40a0-bd28-24cb33d8dbdf", 112 | "metadata": { 113 | "tags": [] 114 | }, 115 | "outputs": [], 116 | "source": [ 117 | "df = df.persist()" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "id": "f9d012d8-6055-4cdc-a37e-d6a01b36c7db", 123 | "metadata": {}, 124 | "source": [ 125 | "Play time\n", 126 | "---------\n", 127 | "\n", 128 | "We start by playing around. We assume that you understand Pandas syntax. Please use it to compute the following quantities:" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "id": "95a96932-2109-447c-9eb3-0d235de5e973", 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "df.columns" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "id": "7042c351-c813-4a7e-a923-1fc45b908d8c", 144 | "metadata": {}, 145 | "source": [ 146 | "How much did New Yorkers pay Uber/Lyft? Sum the `base_passenger_fare` column." 
147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "id": "62a36191-d619-4b6f-b923-c0d185611aa3", 153 | "metadata": {}, 154 | "outputs": [], 155 | "source": [] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "id": "d7c57df5-fd73-4d6e-847c-c7747041bc5a", 160 | "metadata": {}, 161 | "source": [ 162 | "How much did Uber/Lyft pay drivers?" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "id": "9d9a86ed-e286-48d2-8e54-52c421dc97c7", 169 | "metadata": {}, 170 | "outputs": [], 171 | "source": [] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "id": "25a05095-9a6e-44bf-bd38-dfafb386e565", 176 | "metadata": {}, 177 | "source": [ 178 | "Were there ever cases when Uber/Lyft paid drivers more than they made? How often did this occur?" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": null, 184 | "id": "3fc3c20c-6318-48cc-8900-65a8146fd927", 185 | "metadata": {}, 186 | "outputs": [], 187 | "source": [] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "id": "eba147d9-bea1-4073-8521-fd1329b71867", 192 | "metadata": {}, 193 | "source": [ 194 | "What fraction of rides had a non-zero tip?" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": null, 200 | "id": "c893ca7b-94a7-4c2d-8c8e-04926c83864c", 201 | "metadata": {}, 202 | "outputs": [], 203 | "source": [] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "id": "10f285be-6839-4866-9535-43adbcc965d0", 208 | "metadata": {}, 209 | "source": [ 210 | "## Broken down by carrier\n", 211 | "\n", 212 | "If we look at the frequencies of values in the `hvfhs_license_num` column we can identify rides as Uber/Lyft or other less common carriers." 
213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": null, 218 | "id": "b71729eb-f841-433a-b020-0f2b1c425355", 219 | "metadata": {}, 220 | "outputs": [], 221 | "source": [ 222 | "df.hvfhs_license_num.value_counts().compute()" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "id": "d3f6e586-c870-4df4-9d5f-3539f1e2881a", 228 | "metadata": {}, 229 | "source": [ 230 | "Probably HV0003 is Uber, and HV0005 is Lyft." 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": null, 236 | "id": "d19855fb-8851-4d73-b418-43673f200b96", 237 | "metadata": {}, 238 | "outputs": [], 239 | "source": [] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "id": "ba4191f1-6758-43dc-a631-e03fcd7755ca", 244 | "metadata": {}, 245 | "source": [ 246 | "How do the questions above break down by carrier?" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": null, 252 | "id": "74cb092c-11ea-40ef-8bcb-127b540681f3", 253 | "metadata": {}, 254 | "outputs": [], 255 | "source": [] 256 | } 257 | ], 258 | "metadata": { 259 | "kernelspec": { 260 | "display_name": "Python 3 (ipykernel)", 261 | "language": "python", 262 | "name": "python3" 263 | }, 264 | "language_info": { 265 | "codemirror_mode": { 266 | "name": "ipython", 267 | "version": 3 268 | }, 269 | "file_extension": ".py", 270 | "mimetype": "text/x-python", 271 | "name": "python", 272 | "nbconvert_exporter": "python", 273 | "pygments_lexer": "ipython3", 274 | "version": "3.10.0" 275 | } 276 | }, 277 | "nbformat": 4, 278 | "nbformat_minor": 5 279 | } 280 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Dask Tutorial 2 | 3 | In this tutorial you will learn the basics of Dask, specifically the following: 4 | 5 | 1. How to parallelize simple for-loop code with [Dask Futures](https://docs.dask.org/en/stable/futures.html) 6 | 2. 
How to scale up Pandas code with [Dask Dataframes](https://docs.dask.org/en/stable/dataframes.html) 7 | 8 | Additionally, you will work on real large-scale data using a cluster of machines on the cloud. 9 | 10 | ## Set up 11 | 12 | 1. **Clone this repository** 13 | 14 | In your terminal: 15 | 16 | ``` 17 | git clone https://github.com/mrocklin/dask-tutorial 18 | cd dask-tutorial 19 | ``` 20 | 21 | Alternatively, you can download the zip file of the repository at the top of the main page of the repository. This is a good option if you don't have experience with git. 22 | 23 | 2. **Create Conda Environment** 24 | 25 | In your terminal, navigate to the directory where you have cloned/downloaded the `dask-tutorial` repository and install the required packages: 26 | 27 | ``` 28 | conda env create -f binder/environment.yml 29 | ``` 30 | 31 | This will create a new environment called `dask-tutorial`. To activate the environment, run: 32 | 33 | ``` 34 | conda activate dask-tutorial 35 | ``` 36 | 37 | Alternatively, you can run `pip install -r requirements.txt`. 38 | This may or may not work as well. 39 | We recommend doing this from a fresh Python environment (this will make 40 | synchronizing with your cluster easier). 41 | 42 | 3. **Establish Coiled Access** 43 | 44 | This tutorial will use Dask clusters on the cloud. We will get these 45 | clusters using a SaaS product, Coiled. You can either ... 46 | 47 | 1. Sign up (it's free and there's no commitment) as follows: 48 | 49 | ``` 50 | coiled login 51 | ``` 52 | 53 | You'll be asked to authenticate with GitHub to make an account. Don't 54 | worry about connecting to your cloud resources. We'll add you to the 55 | `dask-tutorials` team, which is connected to an AWS account of ours. 56 | 57 | To get this access, ask to be added in the #dask-tutorial channel. 
58 | You'll also want to set your default account to `dask-tutorials`: 59 | 60 | ``` 61 | coiled config set account dask-tutorials 62 | ``` 63 | 64 | Alternatively, you can also ... 65 | 66 | 2. Use a short-lived auth token 67 | 68 | ``` 69 | coiled login --token 65924ef194cc4b658ff37c1c11caa357-2ad71e4ceeafd5a771f553306cff95eb9624ee2d --account dask-tutorials 70 | ``` 71 | 72 | This should just work, but will expire in a few days and you won't be 73 | able to access the web view. 74 | 75 | 4. **Open Jupyter Lab** 76 | 77 | Once your environment has been activated and you are in the `dask-tutorial` repository, start Jupyter Lab: 78 | 79 | ``` 80 | jupyter lab 81 | ``` 82 | 83 | You will see the tutorial notebooks in the file browser; open one and you will be ready to go. 84 | 85 | *We recommend Jupyter Lab due to the [Dask Jupyter extension](https://github.com/dask/dask-labextension).* 86 | 87 | ### Run on a Coiled notebook 88 | 89 | 1. **Set up virtual environment** 90 | ``` 91 | conda create -n dask-tutorial python=3.10 coiled jupyter 92 | conda activate dask-tutorial 93 | ``` 94 | 95 | 2. **Establish Coiled Access** 96 | 1. Sign up (it's free and there's no commitment) as follows: 97 | 98 | ``` 99 | coiled login 100 | ``` 101 | 102 | You'll be asked to authenticate with GitHub to make an account. Don't 103 | worry about connecting to your cloud resources. We'll add you to the 104 | `dask-tutorials` team, which is connected to an AWS account of ours. 105 | 106 | To get this access, ask to be added in the #dask-tutorial channel. 107 | You'll also want to set your default account to `dask-tutorials`: 108 | 109 | ``` 110 | coiled config set account dask-tutorials 111 | ``` 112 | 113 | Alternatively, you can also ... 114 | 115 | 2. 
Use a short-lived auth token 116 | 117 | ``` 118 | coiled login --token 65924ef194cc4b658ff37c1c11caa357-2ad71e4ceeafd5a771f553306cff95eb9624ee2d --account dask-tutorials 119 | ``` 120 | 121 | This should just work, but will expire in a few days and you won't be 122 | able to access the web view. 123 | 124 | 3. **Start Coiled notebook** 125 | ``` 126 | coiled notebook up --software jupytercon-notebook 127 | ``` 128 | 129 | **Note:** Don't forget to shut down your notebook after you're done! 130 | 131 | ### Run on mybinder.org 132 | 133 | The website [mybinder.org](https://mybinder.org) serves pre-configured Jupyter notebooks for 134 | free that you can also use. Here is the link → [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/mrocklin/dask-tutorial/HEAD). 135 | 136 | However, mybinder.org has tragically lost some of their funding recently, and 137 | so availability is not what it once was. We recommend running locally if 138 | possible. 139 | -------------------------------------------------------------------------------- /binder/environment.yml: -------------------------------------------------------------------------------- 1 | name: dask-tutorial 2 | channels: 3 | - conda-forge 4 | dependencies: 5 | - python==3.10 6 | - dask 7 | - coiled 8 | - pyarrow 9 | - ipykernel 10 | - dask-labextension 11 | - jupyterlab 12 | - s3fs 13 | -------------------------------------------------------------------------------- /binder/jupyterlab-workspace.json: -------------------------------------------------------------------------------- 1 | { 2 | "data": { 3 | "file-browser-filebrowser:cwd": { 4 | "path": "" 5 | }, 6 | "dask-dashboard-launcher": { 7 | "url": "DASK_DASHBOARD_URL" 8 | } 9 | }, 10 | "metadata": { 11 | "id": "/lab" 12 | } 13 | } -------------------------------------------------------------------------------- /binder/start: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Replace 
DASK_DASHBOARD_URL with the proxy location 4 | sed -i -e "s|DASK_DASHBOARD_URL|${JUPYTERHUB_BASE_URL}user/${JUPYTERHUB_USER}/proxy/8787|g" binder/jupyterlab-workspace.json 5 | export DASK_DISTRIBUTED__DASHBOARD__LINK="${JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status" 6 | 7 | 8 | # Import the workspace 9 | jupyter lab workspaces import binder/jupyterlab-workspace.json 10 | 11 | # Install Coiled token 12 | coiled login --token 65924ef194cc4b658ff37c1c11caa357-2ad71e4ceeafd5a771f553306cff95eb9624ee2d --account dask-tutorials 13 | 14 | exec "$@" 15 | -------------------------------------------------------------------------------- /images/nyc-taxi-scatter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mrocklin/dask-tutorial/98eefba7373e4828535b09e0ad6aa02a0456b2f5/images/nyc-taxi-scatter.png -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | dask[complete] 2 | pyarrow 3 | coiled 4 | jupyterlab 5 | dask-labextension==6.0 6 | ipykernel 7 | s3fs 8 | --------------------------------------------------------------------------------