├── images
│   ├── 2020.png
│   ├── 2021.png
│   ├── nils.jpeg
│   ├── end2end.png
│   └── med-head.jpg
├── .dask
│   └── config.yaml
├── binder
│   ├── postBuild
│   ├── jupyterlab-workspace.json
│   ├── start
│   └── environment.yml
├── README.md
└── dask-sql-pycon.ipynb

/images/2020.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/adbreind/pycon2021-dask-sql/main/images/2020.png
--------------------------------------------------------------------------------
/images/2021.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/adbreind/pycon2021-dask-sql/main/images/2021.png
--------------------------------------------------------------------------------
/images/nils.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/adbreind/pycon2021-dask-sql/main/images/nils.jpeg
--------------------------------------------------------------------------------
/images/end2end.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/adbreind/pycon2021-dask-sql/main/images/end2end.png
--------------------------------------------------------------------------------
/images/med-head.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/adbreind/pycon2021-dask-sql/main/images/med-head.jpg
--------------------------------------------------------------------------------
/.dask/config.yaml:
--------------------------------------------------------------------------------
1 | distributed:
2 |   dashboard:
3 |     link: "{JUPYTERHUB_BASE_URL}user/{JUPYTERHUB_USER}/proxy/{port}/status"
4 | 
--------------------------------------------------------------------------------
/binder/postBuild:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | 
3 | # Install dask and ipywidgets JupyterLab extensions
4 | jupyter labextension install --minimize=False --clean \
5 |     dask-labextension \
6 |     @jupyter-widgets/jupyterlab-manager
7 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # pycon2021-dask-sql
2 | 
3 | 
4 | __Click here to launch:__
5 | 
6 | [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/adbreind/pycon2021-dask-sql.git/HEAD?urlpath=%2Fnotebooks%2Fdask-sql-pycon.ipynb)
--------------------------------------------------------------------------------
/binder/jupyterlab-workspace.json:
--------------------------------------------------------------------------------
1 | {
2 |     "data": {
3 |         "file-browser-filebrowser:cwd": {
4 |             "path": ""
5 |         },
6 |         "dask-dashboard-launcher": {
7 |             "url": "DASK_DASHBOARD_URL"
8 |         }
9 |     },
10 |     "metadata": {
11 |         "id": "/lab"
12 |     }
13 | }
--------------------------------------------------------------------------------
/binder/start:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | 
3 | # Replace DASK_DASHBOARD_URL with the proxy location
4 | sed -i -e "s|DASK_DASHBOARD_URL|${JUPYTERHUB_BASE_URL}user/${JUPYTERHUB_USER}/proxy/8787|g" binder/jupyterlab-workspace.json
5 | 
6 | # Import the workspace
7 | jupyter lab workspaces import binder/jupyterlab-workspace.json
8 | 
9 | exec "$@"
--------------------------------------------------------------------------------
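A quick aside (not part of the original repo): the `.dask/config.yaml` template above is ordinary Dask configuration, so the dashboard link that `binder/start` wires up can be sanity-checked from Python. A minimal sketch, assuming the JupyterHub environment variables are set:

```python
import dask

# Reads the template set in .dask/config.yaml; {JUPYTERHUB_BASE_URL}, {JUPYTERHUB_USER}
# and {port} are filled in from the Hub environment and the running scheduler.
print(dask.config.get("distributed.dashboard.link"))
```
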
/binder/environment.yml: -------------------------------------------------------------------------------- 1 | name: dask-micro-2021 2 | channels: 3 | - conda-forge 4 | dependencies: 5 | - python=3.8 6 | - bokeh 7 | - dask=2021.2.0 8 | - distributed=2021.2.0 9 | - dask-sql=0.3.2 10 | - jupyterlab 11 | - nodejs 12 | - tornado 13 | - pip 14 | - matplotlib 15 | - dask_labextension 16 | 17 | -------------------------------------------------------------------------------- /dask-sql-pycon.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "9bd73053-73d9-4b18-b009-bb597a23f3ab", 6 | "metadata": {}, 7 | "source": [ 8 | "# Dask-SQL: Empowering Pythonistas for
Scalable End-to-End Data Engineering and Data Science\n", 9 | "\n", 10 | "\n", 11 | "\n", 12 | "## Who Am I?\n", 13 | "\n", 14 | "### Adam Breindel\n", 15 | "\n", 16 | "__LinkedIn__ - https://www.linkedin.com/in/adbreind
\n", 17 | "__Email__ - adbreind@gmail.com
\n", 18 | "__Twitter__ - @adbreind\n", 19 | "\n", 20 | "__What Do I Do?__\n", 21 | "* Training Lead at Coiled Computing: https://coiled.io\n", 22 | " * Dask scales Python for data science and machine learning\n", 23 | " * Coiled makes it easy to scale on the cloud\n", 24 | "* Consulting on data engineering and machine learning\n", 25 | " * Development\n", 26 | " * Various advisory roles\n", 27 | "* 20+ years building systems for startups and large enterprises\n", 28 | "* 10+ years teaching front- and back-end technology\n", 29 | "\n", 30 | "__Fun large-scale data projects__\n", 31 | "* Streaming neural net + decision tree fraud scoring\n", 32 | "* Realtime & offline analytics for banking\n", 33 | "* Music synchronization and licensing for networked jukeboxes\n", 34 | "\n", 35 | "__Industries__\n", 36 | "* Finance / Insurance\n", 37 | "* Travel, Media / Entertainment\n", 38 | "* Energy, Government\n", 39 | "* Advertising/Social Media, & more" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "id": "5395ad04-4446-493a-a51f-3cceef4d40f5", 45 | "metadata": {}, 46 | "source": [ 47 | "
\n", 48 | "
\n", 49 | "\n", 50 | "---\n", 51 | "\n", 52 | "
\n", 53 | "
\n", 54 | "\n", 55 | "# Basic large-scale enterprise data processing pattern\n", 56 | "\n", 57 | "\n", 58 | "
\n", 59 | "
\n", 60 | "Yes, we're missing a lot of important upstream work (data aquisition, ingestion) and downstream (deploy, monitor), but today we're focusing on *SQL*\n", 61 | "\n", 62 | "
\n", 63 | "
\n", 64 | "\n", 65 | "---\n", 66 | "\n", 67 | "
\n", 68 | "
\n", 69 | "\n", 70 | "# Let's zoom in on extracting from a data lake/warehouse and transforming\n", 71 | "\n", 72 | "\n", 73 | "\n", 74 | "* There are __other__ tools (Presto/Trino, Spark, etc.) that can help\n", 75 | "* But we're *Pythonistas* and maybe not experts (or interested) in integrating complex JVM-based tools\n", 76 | "* And we'd like to ...\n", 77 | " * Use Python together with SQL at scale\n", 78 | " * Create services and tools for our company/team that use SQL\n", 79 | " * Because many more folks know SQL than Python! (I know it's hard to believe, but it's true :)\n", 80 | "\n", 81 | "
\n", 82 | "
\n", 83 | "\n", 84 | "---\n", 85 | "\n", 86 | "
\n", 87 | "
\n", 88 | "\n", 89 | "# We're all happy it's 2021\n", 90 | "\n", 91 | "\n", 92 | "\n", 93 | "\n", 94 | "
\n", 95 | "
\n", 96 | "\n", 97 | "---\n", 98 | "\n", 99 | "
\n", 100 | "
\n" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "id": "46cfd79c-1234-4d2a-83c5-d0bf633b3949", 106 | "metadata": {}, 107 | "source": [ 108 | "# Introducing Dask-SQL\n", 109 | "## Adding SQL execution and Hive access to Python!\n", 110 | "\n", 111 | "\n", 112 | "\n", 113 | "### Nils Braun\n", 114 | "* Data Engineer for Enabling: Bosch Center for Artificial Intelligence (BCAI)\n", 115 | "* https://www.linkedin.com/in/nlb/\n", 116 | "* https://github.com/nils-braun\n", 117 | "\n", 118 | "### Dask-SQL\n", 119 | "\n", 120 | "Core features\n", 121 | "\n", 122 | "* SQL parsing, optimization, planning, translation for Dask\n", 123 | "* Start with data from...\n", 124 | " * files in the cloud (e.g., S3)\n", 125 | " * any data in Python (e.g., Pandas or Dask Dataframe)\n", 126 | " * modern data catalog/aggregation like Intake (https://github.com/intake/intake)\n", 127 | " * __direct from enterprise data lakes/warehouses: Hive Metastore, Databricks, etc.__\n", 128 | " * Bring the SQL integration power of Spark right into the Python/Dask world\n", 129 | "* Query cached datasets to leverage the speed of a large distributed memory pool\n", 130 | "\n", 131 | "Bonus features\n", 132 | "* user-defined functions\n", 133 | "* a SQL server\n", 134 | "* ML in SQL\n", 135 | "* a command-line client\n", 136 | "* more in the works!\n", 137 | "\n", 138 | "Learn more...\n", 139 | "* Homepage: https://nils-braun.github.io/dask-sql/\n", 140 | "* Docs: https://dask-sql.readthedocs.io/en/latest/\n", 141 | "* Source: https://github.com/nils-braun/dask-sql\n", 142 | "\n", 143 | "
\n", 144 | "
\n", 145 | "\n", 146 | "---\n", 147 | "\n", 148 | "
\n", 149 | "
" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "id": "962139c9-72a8-4c1f-bba1-f44f6056776f", 155 | "metadata": {}, 156 | "source": [ 157 | "## Before we dive into code ... a little clarification: data lakes\n", 158 | "\n", 159 | "If you haven't worked a lot in the large-scale data space, it can be a bit confusing why we need a Dask-SQL project. Common questions include...\n", 160 | "\n", 161 | "How is this different from...\n", 162 | "* Dask `read_sql_table`? \n", 163 | "* Pandas `read_sql`, `read_sql_table`, or `read_sql_query`?\n", 164 | "* SQLAlchemy\n", 165 | "* etc.\n", 166 | "\n", 167 | "The fundamental difference is: __those other approaches pass your query to a database system which already understands SQL, can execute a query, and has control over your data__\n", 168 | "\n", 169 | "__In enterprise data lakes, that \"database\" likely does not exist.__ Instead, you may have huge collections of files, in a variety of formats, with no query engine, and no process which has \"control\" over your data.\n", 170 | "\n", 171 | "You may not even have a data catalog. In other cases, you may have a catalog, but it is tied to a Hadoop/JVM-based system like Hive or Spark.\n", 172 | "\n", 173 | "In these data lake systems, all of the `read_sql` techniques above may not work at all, or may require you to pass your logic through to Hive/Spark/etc., requiring you to understand, use, and tune those systems before you can even start your work in Python.\n", 174 | "\n", 175 | "The goal of Dask-SQL is to allow you to formulate a SQL query against arbitrary files & formats, and execute that query at large scale with Dask." 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "id": "3f4de9c3-c9a8-4d3e-a073-8ba439fcb807", 181 | "metadata": {}, 182 | "source": [ 183 | "
\n", 184 | "
\n", 185 | "\n", 186 | "---\n", 187 | "\n", 188 | "
\n", 189 | "
\n", 190 | "\n", 191 | "## It's coding time!\n", 192 | "\n", 193 | "We'll demo three key approaches here:\n", 194 | "\n", 195 | "1. Creating a Dask Dataframe -- a lazy, distributed datastructure -- over a set of files, and then using Dask-SQL to query the data\n", 196 | "\n", 197 | "2. Creating a Dask-SQL table completely within SQL, and querying that -- an approach that will be very helpful working your SQL analyst friends\n", 198 | "\n", 199 | "3. Using Dask-SQL to access tables *already defined in the Hive catalog (\"metastore\")* but querying the underlying files with Dask -- an incredibly valuable missing link for Python data folks working within orgs that rely on Hive to catalog their data." 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": null, 205 | "id": "c5cd2fd9-4793-4f1a-a7e8-b3740442e32e", 206 | "metadata": {}, 207 | "outputs": [], 208 | "source": [ 209 | "from dask.distributed import Client\n", 210 | "\n", 211 | "client = Client()\n", 212 | "\n", 213 | "client" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "id": "1c9e855f-3cb5-4764-ab85-e5eb47c0373d", 220 | "metadata": {}, 221 | "outputs": [], 222 | "source": [ 223 | "from dask_sql import Context\n", 224 | "\n", 225 | "c = Context()" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "id": "667aee4b-6c9c-42af-a07e-ea2b123c64aa", 232 | "metadata": {}, 233 | "outputs": [], 234 | "source": [ 235 | "import dask.dataframe as dd\n", 236 | "\n", 237 | "df = dd.read_csv('data/powerplant.csv')\n", 238 | "\n", 239 | "df" 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": null, 245 | "id": "6da3e6fb-ecef-48cd-8d97-cbc6c2fcaa99", 246 | "metadata": {}, 247 | "outputs": [], 248 | "source": [ 249 | "c.create_table(\"powerplant\", df)\n", 250 | "\n", 251 | "result = c.sql('SELECT * FROM powerplant')\n", 252 | "\n", 253 | "result" 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": null, 259 | "id": "9c168bfd-2ae3-44e4-bc43-bc960ecce869", 260 | "metadata": {}, 261 | "outputs": [], 262 | "source": [ 263 | "type(result)" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": null, 269 | "id": "5ec06cec-b204-44f6-9026-2f82a3aa8d3e", 270 | "metadata": {}, 271 | "outputs": [], 272 | "source": [ 273 | "result.compute()" 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": null, 279 | "id": "dd9abde4-e6e0-4ed9-8251-69c8652639c2", 280 | "metadata": {}, 281 | "outputs": [], 282 | "source": [ 283 | "c.sql('SELECT * FROM powerplant', return_futures=False) # run immediately -- beware of large result sets!" 
284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": null, 289 | "id": "d59a600c", 290 | "metadata": {}, 291 | "outputs": [], 292 | "source": [ 293 | "type(c.sql('SELECT * FROM powerplant', return_futures=False))" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": null, 299 | "id": "ce616b95-11a9-4631-a374-f9901632089d", 300 | "metadata": {}, 301 | "outputs": [], 302 | "source": [ 303 | "query = '''\n", 304 | "SELECT\n", 305 | " FLOOR(\"AT\") AS temp, AVG(\"PE\") AS output\n", 306 | "FROM\n", 307 | " powerplant\n", 308 | "GROUP BY \n", 309 | " FLOOR(\"AT\")\n", 310 | "'''\n", 311 | "\n", 312 | "result = c.sql(query)\n", 313 | "\n", 314 | "result" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": null, 320 | "id": "146351e1-7b08-4156-af87-2452e0c89827", 321 | "metadata": {}, 322 | "outputs": [], 323 | "source": [ 324 | "result.compute().plot.scatter('temp','output')\n", 325 | "\n", 326 | "# hint: if you're not totally convinced the computation is happening in Dask, look at the Dask Task Stream dashboard!" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "id": "0b750332-7c6b-4352-a138-66e9777ca021", 332 | "metadata": {}, 333 | "source": [ 334 | "Maybe we could build a successful model with this data ... in fact, we could do it with any combination of\n", 335 | "* Data prep in SQL, training/prediction in Python\n", 336 | "* Training in Python, prediction in SQL\n", 337 | "* Everything (!) in SQL\n", 338 | "* Sound interesting? Check it out: https://dask-sql.readthedocs.io/en/latest/pages/machine_learning.html\n", 339 | "\n", 340 | "### What about \"creating the table completely in SQL\"?\n", 341 | "\n", 342 | "First, let's go \"full SQL\" so we don't even need to wrap our queries in Python..." 
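(An illustrative aside, not in the original notebook.) Before going "full SQL", here is a hedged sketch of the first combination listed above -- data prep in SQL, training in Python. It assumes scikit-learn is installed (the binder environment above does not pin it) and reuses the `temp`/`output` aggregation `query` defined earlier:

```python
from sklearn.linear_model import LinearRegression

# Materialize the SQL-prepped aggregate as a small pandas DataFrame...
prepped = c.sql(query, return_futures=False)

# ...then hand the prepped result to any Python ML library for training.
model = LinearRegression().fit(prepped[["temp"]], prepped["output"])
```
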
343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": null, 348 | "id": "a50f01d0-5753-4f5e-91c4-9768211f77f8", 349 | "metadata": {}, 350 | "outputs": [], 351 | "source": [ 352 | "c.ipython_magic()" 353 | ] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "execution_count": null, 358 | "id": "e5e2f680-18af-49a5-b767-f7ba6adf1e34", 359 | "metadata": {}, 360 | "outputs": [], 361 | "source": [ 362 | "%%sql\n", 363 | "\n", 364 | "CREATE TABLE allsql WITH (\n", 365 | " format = 'csv',\n", 366 | " location = 'data/powerplant.csv' -- any Dask-accessible source or format (cloud/S3/..., parquet/ORC/...)\n", 367 | ")" 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": null, 373 | "id": "a189d158", 374 | "metadata": {}, 375 | "outputs": [], 376 | "source": [ 377 | "%%sql\n", 378 | "\n", 379 | "SELECT\n", 380 | " FLOOR(\"AT\") AS temp, AVG(\"PE\") AS output\n", 381 | "FROM\n", 382 | " allsql\n", 383 | "GROUP BY \n", 384 | " FLOOR(\"AT\")\n", 385 | "LIMIT 10" 386 | ] 387 | }, 388 | { 389 | "cell_type": "markdown", 390 | "id": "33c0b6a7", 391 | "metadata": {}, 392 | "source": [ 393 | "### Let's see that Hive catalog integration!\n", 394 | "\n", 395 | "*note: this demo will not run in the standalone binder notebook available after PyCon, as it relies on a Hive server which is not configured in that container*" 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": null, 401 | "id": "e0c27141", 402 | "metadata": {}, 403 | "outputs": [], 404 | "source": [ 405 | "from pyhive.hive import connect\n", 406 | "\n", 407 | "cursor = connect(\"localhost\", 10000).cursor()\n", 408 | "\n", 409 | "c.create_table(\"my_diamonds\", cursor, hive_table_name=\"diamonds\")" 410 | ] 411 | }, 412 | { 413 | "cell_type": "markdown", 414 | "id": "a5f9c10a", 415 | "metadata": {}, 416 | "source": [ 417 | "Here's the magic...\n", 418 | "* If you look at the Hive Server web UI, you'll see a query just ran to get schema info on the `Diamonds` table\n", 419 | "* But in the following queries\n", 420 | " * Data is accessed directly from the underlying files\n", 421 | " * No Hive queries are run\n", 422 | " * All compute is done in Dask/Python" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": null, 428 | "id": "404722db", 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [ 432 | "%%sql\n", 433 | "\n", 434 | "SELECT * FROM my_diamonds LIMIT 10" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": null, 440 | "id": "f7de52bb", 441 | "metadata": {}, 442 | "outputs": [], 443 | "source": [ 444 | "query = '''\n", 445 | "SELECT FLOOR(10*carat)/10 AS carat, AVG(price) AS price, COUNT(1) AS num \n", 446 | "FROM my_diamonds\n", 447 | "GROUP BY FLOOR(10*carat)\n", 448 | "'''\n", 449 | "\n", 450 | "data = c.sql(query).compute()\n", 451 | "\n", 452 | "data.plot.scatter('carat', 'price')\n", 453 | "data.plot.bar('carat', 'num')" 454 | ] 455 | }, 456 | { 457 | "cell_type": "markdown", 458 | "id": "e99da79e-953e-4eb4-b213-82875016cc31", 459 | "metadata": {}, 460 | "source": [ 461 | "## A Quick Look at How Dask-SQL Works\n", 462 | "\n", 463 | "* Locate the source data\n", 464 | " * Hive, Intake, Databricks catalog integration\n", 465 | " * Files or Python data provided by user\n", 466 | "\n", 467 | "\n", 468 | "* Prepare the query using Apache Calcite\n", 469 | " * Parse SQL\n", 470 | " * Analyze (check vs. 
schema, etc.)\n", 471 | " * Optimize\n", 472 | "\n", 473 | "\n", 474 | "* Create execution plan\n", 475 | " * Take logical relational operators (`SELECT`/project, `WHERE`/filter, `JOIN`, etc.) \n", 476 | " * Convert into Dask Dataframe API calls (`query`, `merge`, etc.)\n", 477 | "\n", 478 | "\n", 479 | "* Then either...\n", 480 | " * Return a handle to the Dask Dataframe of results (recall this is a virtual Dataframe, so no execution yet)\n", 481 | " * or\n", 482 | " * Compute (materialize) the resulting dataframe and return the result as a Pandas Dataframe\n", 483 | " \n", 484 | "More detail at https://dask-sql.readthedocs.io/en/latest/pages/how_does_it_work.html" 485 | ] 486 | }, 487 | { 488 | "cell_type": "markdown", 489 | "id": "f560c5e3-1688-4178-bb77-5cb6864d4b25", 490 | "metadata": {}, 491 | "source": [ 492 | "## Some Practical Details\n", 493 | "\n", 494 | "### Installing Dask-SQL\n", 495 | "\n", 496 | "Recommended approach is via conda and conda-forge -- this will include all dependencies like the JVM, and avoid conflicts by keeping everything within a conda environment.\n", 497 | "\n", 498 | "There are also a few other options: more details at https://dask-sql.readthedocs.io/en/latest/pages/installation.html\n", 499 | "\n", 500 | "### Supported SQL Operators\n", 501 | "\n", 502 | "Dask-SQL is a young project, so it does not yet support all of SQL\n", 503 | "\n", 504 | "More detail on\n", 505 | "* Query support https://dask-sql.readthedocs.io/en/latest/pages/sql/select.html\n", 506 | "* Table creation https://dask-sql.readthedocs.io/en/latest/pages/sql/creation.html\n", 507 | "* ML via SQL https://dask-sql.readthedocs.io/en/latest/pages/sql/ml.html\n", 508 | "\n", 509 | "### How to Contribute\n", 510 | "\n", 511 | "Source code and info on installing for development is at https://github.com/nils-braun/dask-sql\n", 512 | "\n", 513 | "Check issues -- or file a new bug -- at https://github.com/nils-braun/dask-sql/issues\n", 514 | "\n", 515 | "And there's even a \"good first issue\" list at https://github.com/nils-braun/dask-sql/contribute" 516 | ] 517 | }, 518 | { 519 | "cell_type": "markdown", 520 | "id": "f7488066", 521 | "metadata": {}, 522 | "source": [ 523 | "# Thank You!" 524 | ] 525 | }, 526 | { 527 | "cell_type": "code", 528 | "execution_count": null, 529 | "id": "4999eff9", 530 | "metadata": {}, 531 | "outputs": [], 532 | "source": [] 533 | } 534 | ], 535 | "metadata": { 536 | "kernelspec": { 537 | "display_name": "Python 3", 538 | "language": "python", 539 | "name": "python3" 540 | }, 541 | "language_info": { 542 | "codemirror_mode": { 543 | "name": "ipython", 544 | "version": 3 545 | }, 546 | "file_extension": ".py", 547 | "mimetype": "text/x-python", 548 | "name": "python", 549 | "nbconvert_exporter": "python", 550 | "pygments_lexer": "ipython3", 551 | "version": "3.8.0" 552 | } 553 | }, 554 | "nbformat": 4, 555 | "nbformat_minor": 5 556 | } 557 | --------------------------------------------------------------------------------
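Appendix (added here for convenience, not part of the original repository): the conda-forge install recommended in the notebook, plus a minimal sketch of the `Context` workflow it demonstrates. The table name and file path are placeholders.

```python
# Recommended install (per the docs linked above): conda install -c conda-forge dask-sql
from dask_sql import Context

c = Context()

# Register any Dask-readable source as a table (placeholder path; CSV, Parquet, S3, ... all work)
c.create_table("mytable", "data/some_file.csv")

# Calcite parses and optimizes the SQL, and Dask builds a lazy execution plan...
result = c.sql("SELECT COUNT(*) AS n FROM mytable")

# ...so the query itself only runs when the result is materialized.
print(result.compute())
```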