├── images
│   ├── 2020.png
│   ├── 2021.png
│   ├── nils.jpeg
│   ├── end2end.png
│   └── med-head.jpg
├── .dask
│   └── config.yaml
├── binder
│   ├── postBuild
│   ├── jupyterlab-workspace.json
│   ├── start
│   └── environment.yml
├── README.md
└── dask-sql-pycon.ipynb
/images/2020.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/adbreind/pycon2021-dask-sql/main/images/2020.png
--------------------------------------------------------------------------------
/images/2021.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/adbreind/pycon2021-dask-sql/main/images/2021.png
--------------------------------------------------------------------------------
/images/nils.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/adbreind/pycon2021-dask-sql/main/images/nils.jpeg
--------------------------------------------------------------------------------
/images/end2end.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/adbreind/pycon2021-dask-sql/main/images/end2end.png
--------------------------------------------------------------------------------
/images/med-head.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/adbreind/pycon2021-dask-sql/main/images/med-head.jpg
--------------------------------------------------------------------------------
/.dask/config.yaml:
--------------------------------------------------------------------------------
1 | distributed:
2 |   dashboard:
3 |     link: "{JUPYTERHUB_BASE_URL}user/{JUPYTERHUB_USER}/proxy/{port}/status"
4 |
--------------------------------------------------------------------------------
/binder/postBuild:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # Install dask and ipywidgets JupyterLab extensions
4 | jupyter labextension install --minimize=False --clean \
5 | dask-labextension \
6 | @jupyter-widgets/jupyterlab-manager
7 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # pycon2021-dask-sql
2 |
3 |
4 | __Click here to launch:__
5 |
6 | [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/adbreind/pycon2021-dask-sql.git/HEAD?urlpath=%2Fnotebooks%2Fdask-sql-pycon.ipynb)
--------------------------------------------------------------------------------
/binder/jupyterlab-workspace.json:
--------------------------------------------------------------------------------
1 | {
2 | "data": {
3 | "file-browser-filebrowser:cwd": {
4 | "path": ""
5 | },
6 | "dask-dashboard-launcher": {
7 | "url": "DASK_DASHBOARD_URL"
8 | }
9 | },
10 | "metadata": {
11 | "id": "/lab"
12 | }
13 | }
--------------------------------------------------------------------------------
/binder/start:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # Replace DASK_DASHBOARD_URL with the proxy location
4 | sed -i -e "s|DASK_DASHBOARD_URL|${JUPYTERHUB_BASE_URL}user/${JUPYTERHUB_USER}/proxy/8787|g" binder/jupyterlab-workspace.json
5 |
6 | # Import the workspace
7 | jupyter lab workspaces import binder/jupyterlab-workspace.json
8 |
9 | exec "$@"
--------------------------------------------------------------------------------
/binder/environment.yml:
--------------------------------------------------------------------------------
1 | name: dask-micro-2021
2 | channels:
3 | - conda-forge
4 | dependencies:
5 | - python=3.8
6 | - bokeh
7 | - dask=2021.2.0
8 | - distributed=2021.2.0
9 | - dask-sql=0.3.2
10 | - jupyterlab
11 | - nodejs
12 | - tornado
13 | - pip
14 | - matplotlib
15 | - dask_labextension
16 |
17 |
--------------------------------------------------------------------------------
/dask-sql-pycon.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "9bd73053-73d9-4b18-b009-bb597a23f3ab",
6 | "metadata": {},
7 | "source": [
8 | "# Dask-SQL: Empowering Pythonistas for
Scalable End-to-End Data Engineering and Data Science\n",
9 | "\n",
10 | "
\n",
11 | "\n",
12 | "## Who Am I?\n",
13 | "\n",
14 | "### Adam Breindel\n",
15 | "\n",
16 | "__LinkedIn__ - https://www.linkedin.com/in/adbreind
\n",
17 | "__Email__ - adbreind@gmail.com
\n",
18 | "__Twitter__ - @adbreind\n",
19 | "\n",
20 | "__What Do I Do?__\n",
21 | "* Training Lead at Coiled Computing: https://coiled.io\n",
22 | " * Dask scales Python for data science and machine learning\n",
23 | " * Coiled makes it easy to scale on the cloud\n",
24 | "* Consulting on data engineering and machine learning\n",
25 | " * Development\n",
26 | " * Various advisory roles\n",
27 | "* 20+ years building systems for startups and large enterprises\n",
28 | "* 10+ years teaching front- and back-end technology\n",
29 | "\n",
30 | "__Fun large-scale data projects__\n",
31 | "* Streaming neural net + decision tree fraud scoring\n",
32 | "* Realtime & offline analytics for banking\n",
33 | "* Music synchronization and licensing for networked jukeboxes\n",
34 | "\n",
35 | "__Industries__\n",
36 | "* Finance / Insurance\n",
37 | "* Travel, Media / Entertainment\n",
38 | "* Energy, Government\n",
39 | "* Advertising/Social Media, & more"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "id": "5395ad04-4446-493a-a51f-3cceef4d40f5",
45 | "metadata": {},
46 | "source": [
47 | "
\n",
48 | "
\n",
49 | "\n",
50 | "---\n",
51 | "\n",
52 | "
\n",
53 | "
\n",
54 | "\n",
55 | "# Basic large-scale enterprise data processing pattern\n",
56 | "\n",
57 | "
\n",
58 | "
\n",
59 | "
\n",
60 | "Yes, we're missing a lot of important upstream work (data aquisition, ingestion) and downstream (deploy, monitor), but today we're focusing on *SQL*\n",
61 | "\n",
62 | "
\n",
63 | "
\n",
64 | "\n",
65 | "---\n",
66 | "\n",
67 | "
\n",
68 | "
\n",
69 | "\n",
70 | "# Let's zoom in on extracting from a data lake/warehouse and transforming\n",
71 | "\n",
72 | "
\n",
73 | "\n",
74 | "* There are __other__ tools (Presto/Trino, Spark, etc.) that can help\n",
75 | "* But we're *Pythonistas* and maybe not experts (or interested) in integrating complex JVM-based tools\n",
76 | "* And we'd like to ...\n",
77 | " * Use Python together with SQL at scale\n",
78 | " * Create services and tools for our company/team that use SQL\n",
79 | " * Because many more folks know SQL than Python! (I know it's hard to believe, but it's true :)\n",
80 | "\n",
81 | "
\n",
82 | "
\n",
83 | "\n",
84 | "---\n",
85 | "\n",
86 | "
\n",
87 | "
\n",
88 | "\n",
89 | "# We're all happy it's 2021\n",
90 | "\n",
91 | "
\n",
92 | "\n",
93 | "\n",
94 | "
\n",
95 | "
\n",
96 | "\n",
97 | "---\n",
98 | "\n",
99 | "
\n",
100 | "
\n"
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "id": "46cfd79c-1234-4d2a-83c5-d0bf633b3949",
106 | "metadata": {},
107 | "source": [
108 | "# Introducing Dask-SQL\n",
109 | "## Adding SQL execution and Hive access to Python!\n",
110 | "\n",
111 | "
\n",
112 | "\n",
113 | "### Nils Braun\n",
114 | "* Data Engineer for Enabling: Bosch Center for Artificial Intelligence (BCAI)\n",
115 | "* https://www.linkedin.com/in/nlb/\n",
116 | "* https://github.com/nils-braun\n",
117 | "\n",
118 | "### Dask-SQL\n",
119 | "\n",
120 | "Core features\n",
121 | "\n",
122 | "* SQL parsing, optimization, planning, translation for Dask\n",
123 | "* Start with data from...\n",
124 | " * files in the cloud (e.g., S3)\n",
125 | " * any data in Python (e.g., Pandas or Dask Dataframe)\n",
126 | " * modern data catalog/aggregation like Intake (https://github.com/intake/intake)\n",
127 | " * __direct from enterprise data lakes/warehouses: Hive Metastore, Databricks, etc.__\n",
128 | " * Bring the SQL integration power of Spark right into the Python/Dask world\n",
129 | "* Query cached datasets to leverage the speed of a large distributed memory pool\n",
130 | "\n",
131 | "Bonus features\n",
132 | "* user-defined functions\n",
133 | "* a SQL server\n",
134 | "* ML in SQL\n",
135 | "* a command-line client\n",
136 | "* more in the works!\n",
137 | "\n",
138 | "Learn more...\n",
139 | "* Homepage: https://nils-braun.github.io/dask-sql/\n",
140 | "* Docs: https://dask-sql.readthedocs.io/en/latest/\n",
141 | "* Source: https://github.com/nils-braun/dask-sql\n",
142 | "\n",
143 | "
\n",
144 | "
\n",
145 | "\n",
146 | "---\n",
147 | "\n",
148 | "
\n",
149 | "
"
150 | ]
151 | },
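{
"cell_type": "markdown",
"id": "added-udf-sketch",
"metadata": {},
"source": [
"As a quick taste of the \"user-defined functions\" bonus feature mentioned above, here is a minimal sketch (not executed in this notebook). It assumes a `Context` named `c` with the `powerplant` table registered, as created later in this notebook; the `register_function` signature follows the dask-sql custom-function docs and may differ between versions, so check the docs for the release pinned in environment.yml.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def celsius_to_fahrenheit(t):\n",
"    return t * 9 / 5 + 32\n",
"\n",
"# expose the Python function to SQL: function, SQL name, parameter types, return type\n",
"c.register_function(celsius_to_fahrenheit, \"c_to_f\", [(\"t\", np.float64)], np.float64)\n",
"\n",
"c.sql('SELECT c_to_f(\"AT\") AS temp_f FROM powerplant')\n",
"```"
]
},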
152 | {
153 | "cell_type": "markdown",
154 | "id": "962139c9-72a8-4c1f-bba1-f44f6056776f",
155 | "metadata": {},
156 | "source": [
157 | "## Before we dive into code ... a little clarification: data lakes\n",
158 | "\n",
159 | "If you haven't worked a lot in the large-scale data space, it can be a bit confusing why we need a Dask-SQL project. Common questions include...\n",
160 | "\n",
161 | "How is this different from...\n",
162 | "* Dask `read_sql_table`? \n",
163 | "* Pandas `read_sql`, `read_sql_table`, or `read_sql_query`?\n",
164 | "* SQLAlchemy\n",
165 | "* etc.\n",
166 | "\n",
167 | "The fundamental difference is: __those other approaches pass your query to a database system which already understands SQL, can execute a query, and has control over your data__\n",
168 | "\n",
169 | "__In enterprise data lakes, that \"database\" likely does not exist.__ Instead, you may have huge collections of files, in a variety of formats, with no query engine, and no process which has \"control\" over your data.\n",
170 | "\n",
171 | "You may not even have a data catalog. In other cases, you may have a catalog, but it is tied to a Hadoop/JVM-based system like Hive or Spark.\n",
172 | "\n",
173 | "In these data lake systems, all of the `read_sql` techniques above may not work at all, or may require you to pass your logic through to Hive/Spark/etc., requiring you to understand, use, and tune those systems before you can even start your work in Python.\n",
174 | "\n",
175 | "The goal of Dask-SQL is to allow you to formulate a SQL query against arbitrary files & formats, and execute that query at large scale with Dask."
176 | ]
177 | },
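{
"cell_type": "markdown",
"id": "added-readsql-contrast",
"metadata": {},
"source": [
"To make that contrast concrete, here is a minimal sketch. The `read_sql` route needs a real database engine to execute the query (the connection string below is purely hypothetical), while the dask-sql route needs only the raw files plus Dask itself -- it mirrors the demo that follows.\n",
"\n",
"```python\n",
"import dask.dataframe as dd\n",
"from dask_sql import Context\n",
"\n",
"# Classic approach (for contrast): a database engine runs the SQL for you.\n",
"# The connection string is hypothetical -- in a data lake, no such engine exists:\n",
"# pd.read_sql('SELECT * FROM powerplant', 'postgresql://user:password@dbhost/warehouse')\n",
"\n",
"# Data-lake approach: Dask reads the raw files and dask-sql executes the translated query\n",
"c = Context()\n",
"c.create_table('powerplant', dd.read_csv('data/powerplant.csv'))\n",
"result = c.sql('SELECT * FROM powerplant')  # a lazy Dask dataframe -- no database involved\n",
"```"
]
},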
178 | {
179 | "cell_type": "markdown",
180 | "id": "3f4de9c3-c9a8-4d3e-a073-8ba439fcb807",
181 | "metadata": {},
182 | "source": [
183 | "
\n",
184 | "
\n",
185 | "\n",
186 | "---\n",
187 | "\n",
188 | "
\n",
189 | "
\n",
190 | "\n",
191 | "## It's coding time!\n",
192 | "\n",
193 | "We'll demo three key approaches here:\n",
194 | "\n",
195 | "1. Creating a Dask Dataframe -- a lazy, distributed datastructure -- over a set of files, and then using Dask-SQL to query the data\n",
196 | "\n",
197 | "2. Creating a Dask-SQL table completely within SQL, and querying that -- an approach that will be very helpful working your SQL analyst friends\n",
198 | "\n",
199 | "3. Using Dask-SQL to access tables *already defined in the Hive catalog (\"metastore\")* but querying the underlying files with Dask -- an incredibly valuable missing link for Python data folks working within orgs that rely on Hive to catalog their data."
200 | ]
201 | },
202 | {
203 | "cell_type": "code",
204 | "execution_count": null,
205 | "id": "c5cd2fd9-4793-4f1a-a7e8-b3740442e32e",
206 | "metadata": {},
207 | "outputs": [],
208 | "source": [
209 | "from dask.distributed import Client\n",
210 | "\n",
211 | "client = Client()\n",
212 | "\n",
213 | "client"
214 | ]
215 | },
216 | {
217 | "cell_type": "code",
218 | "execution_count": null,
219 | "id": "1c9e855f-3cb5-4764-ab85-e5eb47c0373d",
220 | "metadata": {},
221 | "outputs": [],
222 | "source": [
223 | "from dask_sql import Context\n",
224 | "\n",
225 | "c = Context()"
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": null,
231 | "id": "667aee4b-6c9c-42af-a07e-ea2b123c64aa",
232 | "metadata": {},
233 | "outputs": [],
234 | "source": [
235 | "import dask.dataframe as dd\n",
236 | "\n",
237 | "df = dd.read_csv('data/powerplant.csv')\n",
238 | "\n",
239 | "df"
240 | ]
241 | },
242 | {
243 | "cell_type": "code",
244 | "execution_count": null,
245 | "id": "6da3e6fb-ecef-48cd-8d97-cbc6c2fcaa99",
246 | "metadata": {},
247 | "outputs": [],
248 | "source": [
249 | "c.create_table(\"powerplant\", df)\n",
250 | "\n",
251 | "result = c.sql('SELECT * FROM powerplant')\n",
252 | "\n",
253 | "result"
254 | ]
255 | },
256 | {
257 | "cell_type": "code",
258 | "execution_count": null,
259 | "id": "9c168bfd-2ae3-44e4-bc43-bc960ecce869",
260 | "metadata": {},
261 | "outputs": [],
262 | "source": [
263 | "type(result)"
264 | ]
265 | },
266 | {
267 | "cell_type": "code",
268 | "execution_count": null,
269 | "id": "5ec06cec-b204-44f6-9026-2f82a3aa8d3e",
270 | "metadata": {},
271 | "outputs": [],
272 | "source": [
273 | "result.compute()"
274 | ]
275 | },
276 | {
277 | "cell_type": "code",
278 | "execution_count": null,
279 | "id": "dd9abde4-e6e0-4ed9-8251-69c8652639c2",
280 | "metadata": {},
281 | "outputs": [],
282 | "source": [
283 | "c.sql('SELECT * FROM powerplant', return_futures=False) # run immediately -- beware of large result sets!"
284 | ]
285 | },
286 | {
287 | "cell_type": "code",
288 | "execution_count": null,
289 | "id": "d59a600c",
290 | "metadata": {},
291 | "outputs": [],
292 | "source": [
293 | "type(c.sql('SELECT * FROM powerplant', return_futures=False))"
294 | ]
295 | },
296 | {
297 | "cell_type": "code",
298 | "execution_count": null,
299 | "id": "ce616b95-11a9-4631-a374-f9901632089d",
300 | "metadata": {},
301 | "outputs": [],
302 | "source": [
303 | "query = '''\n",
304 | "SELECT\n",
305 | " FLOOR(\"AT\") AS temp, AVG(\"PE\") AS output\n",
306 | "FROM\n",
307 | " powerplant\n",
308 | "GROUP BY \n",
309 | " FLOOR(\"AT\")\n",
310 | "'''\n",
311 | "\n",
312 | "result = c.sql(query)\n",
313 | "\n",
314 | "result"
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": null,
320 | "id": "146351e1-7b08-4156-af87-2452e0c89827",
321 | "metadata": {},
322 | "outputs": [],
323 | "source": [
324 | "result.compute().plot.scatter('temp','output')\n",
325 | "\n",
326 | "# hint: if you're not totally convinced the computation is happening in Dask, look at the Dask Task Stream dashboard!"
327 | ]
328 | },
329 | {
330 | "cell_type": "markdown",
331 | "id": "0b750332-7c6b-4352-a138-66e9777ca021",
332 | "metadata": {},
333 | "source": [
334 | "Maybe we could build a successful model with this data ... in fact, we could do it with any combination of\n",
335 | "* Data prep in SQL, training/prediction in Python\n",
336 | "* Training in Python, prediction in SQL\n",
337 | "* Everything (!) in SQL\n",
338 | "* Sound interesting? Check it out: https://dask-sql.readthedocs.io/en/latest/pages/machine_learning.html\n",
339 | "\n",
340 | "### What about \"creating the table completely in SQL\"?\n",
341 | "\n",
342 | "First, let's go \"full SQL\" so we don't even need to wrap our queries in Python..."
343 | ]
344 | },
345 | {
346 | "cell_type": "code",
347 | "execution_count": null,
348 | "id": "a50f01d0-5753-4f5e-91c4-9768211f77f8",
349 | "metadata": {},
350 | "outputs": [],
351 | "source": [
352 | "c.ipython_magic()"
353 | ]
354 | },
355 | {
356 | "cell_type": "code",
357 | "execution_count": null,
358 | "id": "e5e2f680-18af-49a5-b767-f7ba6adf1e34",
359 | "metadata": {},
360 | "outputs": [],
361 | "source": [
362 | "%%sql\n",
363 | "\n",
364 | "CREATE TABLE allsql WITH (\n",
365 | " format = 'csv',\n",
366 | " location = 'data/powerplant.csv' -- any Dask-accessible source or format (cloud/S3/..., parquet/ORC/...)\n",
367 | ")"
368 | ]
369 | },
370 | {
371 | "cell_type": "code",
372 | "execution_count": null,
373 | "id": "a189d158",
374 | "metadata": {},
375 | "outputs": [],
376 | "source": [
377 | "%%sql\n",
378 | "\n",
379 | "SELECT\n",
380 | " FLOOR(\"AT\") AS temp, AVG(\"PE\") AS output\n",
381 | "FROM\n",
382 | " allsql\n",
383 | "GROUP BY \n",
384 | " FLOOR(\"AT\")\n",
385 | "LIMIT 10"
386 | ]
387 | },
388 | {
389 | "cell_type": "markdown",
390 | "id": "33c0b6a7",
391 | "metadata": {},
392 | "source": [
393 | "### Let's see that Hive catalog integration!\n",
394 | "\n",
395 | "*note: this demo will not run in the standalone binder notebook available after PyCon, as it relies on a Hive server which is not configured in that container*"
396 | ]
397 | },
398 | {
399 | "cell_type": "code",
400 | "execution_count": null,
401 | "id": "e0c27141",
402 | "metadata": {},
403 | "outputs": [],
404 | "source": [
405 | "from pyhive.hive import connect\n",
406 | "\n",
407 | "cursor = connect(\"localhost\", 10000).cursor()\n",
408 | "\n",
409 | "c.create_table(\"my_diamonds\", cursor, hive_table_name=\"diamonds\")"
410 | ]
411 | },
412 | {
413 | "cell_type": "markdown",
414 | "id": "a5f9c10a",
415 | "metadata": {},
416 | "source": [
417 | "Here's the magic...\n",
418 | "* If you look at the Hive Server web UI, you'll see a query just ran to get schema info on the `Diamonds` table\n",
419 | "* But in the following queries\n",
420 | " * Data is accessed directly from the underlying files\n",
421 | " * No Hive queries are run\n",
422 | " * All compute is done in Dask/Python"
423 | ]
424 | },
425 | {
426 | "cell_type": "code",
427 | "execution_count": null,
428 | "id": "404722db",
429 | "metadata": {},
430 | "outputs": [],
431 | "source": [
432 | "%%sql\n",
433 | "\n",
434 | "SELECT * FROM my_diamonds LIMIT 10"
435 | ]
436 | },
437 | {
438 | "cell_type": "code",
439 | "execution_count": null,
440 | "id": "f7de52bb",
441 | "metadata": {},
442 | "outputs": [],
443 | "source": [
444 | "query = '''\n",
445 | "SELECT FLOOR(10*carat)/10 AS carat, AVG(price) AS price, COUNT(1) AS num \n",
446 | "FROM my_diamonds\n",
447 | "GROUP BY FLOOR(10*carat)\n",
448 | "'''\n",
449 | "\n",
450 | "data = c.sql(query).compute()\n",
451 | "\n",
452 | "data.plot.scatter('carat', 'price')\n",
453 | "data.plot.bar('carat', 'num')"
454 | ]
455 | },
456 | {
457 | "cell_type": "markdown",
458 | "id": "e99da79e-953e-4eb4-b213-82875016cc31",
459 | "metadata": {},
460 | "source": [
461 | "## A Quick Look at How Dask-SQL Works\n",
462 | "\n",
463 | "* Locate the source data\n",
464 | " * Hive, Intake, Databricks catalog integration\n",
465 | " * Files or Python data provided by user\n",
466 | "\n",
467 | "\n",
468 | "* Prepare the query using Apache Calcite\n",
469 | " * Parse SQL\n",
470 | " * Analyze (check vs. schema, etc.)\n",
471 | " * Optimize\n",
472 | "\n",
473 | "\n",
474 | "* Create execution plan\n",
475 | " * Take logical relational operators (`SELECT`/project, `WHERE`/filter, `JOIN`, etc.) \n",
476 | " * Convert into Dask Dataframe API calls (`query`, `merge`, etc.)\n",
477 | "\n",
478 | "\n",
479 | "* Then either...\n",
480 | " * Return a handle to the Dask Dataframe of results (recall this is a virtual Dataframe, so no execution yet)\n",
481 | " * or\n",
482 | " * Compute (materialize) the resulting dataframe and return the result as a Pandas Dataframe\n",
483 | " \n",
484 | "More detail at https://dask-sql.readthedocs.io/en/latest/pages/how_does_it_work.html"
485 | ]
486 | },
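{
"cell_type": "markdown",
"id": "added-explain-sketch",
"metadata": {},
"source": [
"To peek at the middle steps yourself: dask-sql's `Context` exposes an `explain` method (treat its availability and exact output as version-dependent -- check the docs for the release pinned in environment.yml), which returns the optimized relational plan produced by Calcite before it is translated into Dask dataframe calls.\n",
"\n",
"```python\n",
"# a minimal sketch, assuming Context.explain is available in your dask-sql version\n",
"print(c.explain('SELECT FLOOR(\"AT\") AS temp, AVG(\"PE\") AS output FROM powerplant GROUP BY FLOOR(\"AT\")'))\n",
"```"
]
},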
487 | {
488 | "cell_type": "markdown",
489 | "id": "f560c5e3-1688-4178-bb77-5cb6864d4b25",
490 | "metadata": {},
491 | "source": [
492 | "## Some Practical Details\n",
493 | "\n",
494 | "### Installing Dask-SQL\n",
495 | "\n",
496 | "Recommended approach is via conda and conda-forge -- this will include all dependencies like the JVM, and avoid conflicts by keeping everything within a conda environment.\n",
497 | "\n",
498 | "There are also a few other options: more details at https://dask-sql.readthedocs.io/en/latest/pages/installation.html\n",
499 | "\n",
500 | "### Supported SQL Operators\n",
501 | "\n",
502 | "Dask-SQL is a young project, so it does not yet support all of SQL\n",
503 | "\n",
504 | "More detail on\n",
505 | "* Query support https://dask-sql.readthedocs.io/en/latest/pages/sql/select.html\n",
506 | "* Table creation https://dask-sql.readthedocs.io/en/latest/pages/sql/creation.html\n",
507 | "* ML via SQL https://dask-sql.readthedocs.io/en/latest/pages/sql/ml.html\n",
508 | "\n",
509 | "### How to Contribute\n",
510 | "\n",
511 | "Source code and info on installing for development is at https://github.com/nils-braun/dask-sql\n",
512 | "\n",
513 | "Check issues -- or file a new bug -- at https://github.com/nils-braun/dask-sql/issues\n",
514 | "\n",
515 | "And there's even a \"good first issue\" list at https://github.com/nils-braun/dask-sql/contribute"
516 | ]
517 | },
518 | {
519 | "cell_type": "markdown",
520 | "id": "f7488066",
521 | "metadata": {},
522 | "source": [
523 | "# Thank You!"
524 | ]
525 | },
526 | {
527 | "cell_type": "code",
528 | "execution_count": null,
529 | "id": "4999eff9",
530 | "metadata": {},
531 | "outputs": [],
532 | "source": []
533 | }
534 | ],
535 | "metadata": {
536 | "kernelspec": {
537 | "display_name": "Python 3",
538 | "language": "python",
539 | "name": "python3"
540 | },
541 | "language_info": {
542 | "codemirror_mode": {
543 | "name": "ipython",
544 | "version": 3
545 | },
546 | "file_extension": ".py",
547 | "mimetype": "text/x-python",
548 | "name": "python",
549 | "nbconvert_exporter": "python",
550 | "pygments_lexer": "ipython3",
551 | "version": "3.8.0"
552 | }
553 | },
554 | "nbformat": 4,
555 | "nbformat_minor": 5
556 | }
557 |
--------------------------------------------------------------------------------