├── README.md
├── data
│   ├── amazon20k.parquet
│   ├── diamonds.csv
│   ├── pageviews-20170120-180000.gz
│   └── zips.json
└── notebooks
    ├── Intro-DF-DS-SQL.json
    ├── RDD-Mem-Execution.json
    ├── SQL-DF-DS-Execution.json
    ├── Spark Machine Learning.json
    └── Streaming.json
/README.md:
--------------------------------------------------------------------------------
1 | # Spark with Zeppelin
2 |
3 | ### Materials and setup instructions
4 |
5 | 1. Download Apache Zeppelin version 0.7.3 with "all" bundled interpreters. The following link looks like a direct link to the package, but it actually opens the Apache download page, where you can choose a mirror: http://www.apache.org/dyn/closer.cgi/zeppelin/zeppelin-0.7.3/zeppelin-0.7.3-bin-all.tgz
6 | 2. Install it, which should be as simple as unpacking the archive you just downloaded. This version of Zeppelin includes an embedded install of Apache Spark 2.1, which is tested to work with the class notebooks.
7 | 3. Test it out!
8 | * `cd` into the folder where you unpacked Zeppelin
9 | * If you run `ls` you should see the `bin` folder
10 | * on ~~MacOS/~~ Linux, run `bin/zeppelin-daemon.sh start`
11 | * __UPDATE For MacOS USERS__:
12 | * There is a known bug that makes Zeppelin run extremely(!) slowly on MacOS (https://issues.apache.org/jira/browse/ZEPPELIN-2948), so while it does work, I strongly recommend using Linux instead
13 | * __UPDATE FOR WINDOWS USERS__:
14 | * In theory, Zeppelin should run on Windows with minor changes (e.g., `zeppelin.cmd`), but your mileage may vary
15 | * Zeppelin claims support on Win 7 SP 1, but I have not tried that
16 | * I __did__ try it on Windows 10 (more or less latest Microsoft build) and it did *not* work out of the box
17 | * Of course, it's quite possible that with some monkeying around it can be made to work on Win 10
18 | * __HOWEVER__ my recommendation would be to install a Linux VM and run it there: free, easy, definitely works, and your "real" Spark work will almost certainly be on Linux anyway
19 | * Easiest path: Grab free VMware or VirtualBox, install ubuntu-16.04.2-desktop-amd64, add a JRE with `sudo apt install openjdk-8-jre-headless`, and then just unpack Zeppelin and you're all set.
20 | * Make sure you give this VM a good chunk of resources -- I'd recommend 8 GB RAM and 2 CPU cores if you have them
21 | * Point your browser to `localhost:8080` and you should see Zeppelin running!
22 | * Occasionally it needs an extra few seconds, a refresh, or even a cleared browser cache (more so than other web apps); not sure why, but don't sweat it
23 | * Click "create new note" to start a new notebook
24 | * Give it a name (like "Test"), leave the default interpreter set to "Spark", and click "Create Note"
25 | * In the first notebook cell, type in `1 + 1` and hit shift + enter
26 | * After a few seconds, you should see `res0: Int = 2` pop out
27 | * Just to make sure Spark can actually run a job, in the next cell, enter `sc.parallelize(1 to 100).sum` and hit shift + enter
28 | * You should see `res1: Double = 5050.0` pop out
29 | * Congrats, your Zeppelin + embedded Spark 2.1 is installed!
30 | 4. If you haven't yet cloned this GitHub repo to your machine, go ahead and do that. If you really don't want to use git, you can just "Download ZIP" the whole repo and use that, and you'll be fine.
31 | 5. Move or copy the "data" folder from the repo into your Zeppelin folder. It should be at the same "level" as `bin`, `conf`, `interpreter`, etc. Why? That way, if you `cd` into the Zeppelin folder and start it with `bin/zeppelin-daemon.sh start`, your data will be available under `data/` inside the notebook environment.
32 | * This is for simplicity; of course you are welcome to `%sh cd ...` from inside Zeppelin, or hack the paths to the datasets, or anything else you wish -- I'm just trying to make it as simple and consistent for everyone as possible.
33 | * You can check that everything is in the right place by punching in `%sh ls data` in a notebook cell; you should see the list of data files, with `amazon20k.parquet`, `diamonds.csv`, etc. (There is also a quick Spark read check sketched at the end of this README.)
34 | 6. Make note (for yourself) of where you put the `notebooks` folder from this repo. You don't need to do anything with them yet, but in class, we will load notebooks into Zeppelin, so it will be handy to know where to find the files.
35 | * If you want, you can try it now from the main Zeppelin page: click "Import Note", then "Choose a JSON here", and use the file picker to find the notebook.
36 | * When long notebooks like these import, they will look weird at first -- like a bunch of blank cells. Just wait about 30 seconds for Zeppelin to catch up and render them, and you should be all set.
37 |
38 |
39 |
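40 | ### Optional: a quick Spark read check
41 |
42 | If you want a slightly deeper sanity check than `%sh ls data`, here is a rough sketch of a notebook cell (default Spark/Scala interpreter) that tries to read each dataset. Paths assume you copied `data/` next to `bin` as in step 5; the header and JSON-lines options are my guesses about the file formats, so adjust them if a read complains.
43 |
44 | ```scala
45 | // rough sanity-check cell -- paths assume data/ sits inside the Zeppelin folder (step 5)
46 | val diamonds = spark.read.option("header", "true").option("inferSchema", "true").csv("data/diamonds.csv")
47 | println("diamonds rows: " + diamonds.count)
48 |
49 | // assumes zips.json is line-delimited JSON, which spark.read.json expects
50 | val zips = spark.read.json("data/zips.json")
51 | println("zip records: " + zips.count)
52 |
53 | val amazon = spark.read.parquet("data/amazon20k.parquet")
54 | println("amazon rows: " + amazon.count)
55 |
56 | // Spark reads gzipped text directly; note that the class notebooks refer to this
57 | // file without the .gz extension, so you may want to gunzip it before class
58 | val pageviews = sc.textFile("data/pageviews-20170120-180000.gz")
59 | pageviews.take(3).foreach(println)
60 | ```
61 |
62 | If all of those run without errors, both your embedded Spark and the data layout are good to go.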
--------------------------------------------------------------------------------
/data/amazon20k.parquet:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/adbreind/spark-zeppelin-17-1/c31a80b29d4b7dc5dad8ef3060d78adf9d6a2b4f/data/amazon20k.parquet
--------------------------------------------------------------------------------
/data/pageviews-20170120-180000.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/adbreind/spark-zeppelin-17-1/c31a80b29d4b7dc5dad8ef3060d78adf9d6a2b4f/data/pageviews-20170120-180000.gz
--------------------------------------------------------------------------------
/notebooks/RDD-Mem-Execution.json:
--------------------------------------------------------------------------------
1 | {"paragraphs":[{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456224733_-1880050817","id":"20170218-141704_979458167","dateCreated":"2017-02-18T14:17:04-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:12434","text":"%md # RDDs, Spark Memory, and Execution\n\n©2016, 2017 by Adam Breindel. All Rights Reserved.","dateUpdated":"2017-02-18T14:17:24-0800","dateFinished":"2017-02-18T14:17:24-0800","dateStarted":"2017-02-18T14:17:24-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n
RDDs, Spark Memory, and Execution
\n
©2016, 2017 by Adam Breindel. All Rights Reserved.
\n
"}]}},{"text":"%md #### Resilient Distributed Datasets (RDDs)\n\n* Purpose / motivation\n* Implementation in Spark\n* https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf\n * 2014 ACM Doctoral Dissertation Award (Matei Zaharia, Spark creator, Databricks cofounder)","user":"anonymous","dateUpdated":"2017-02-18T14:17:49-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456227626_49464917","id":"20170218-141707_1646601151","dateCreated":"2017-02-18T14:17:07-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:12494","dateFinished":"2017-02-18T14:17:49-0800","dateStarted":"2017-02-18T14:17:49-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Resilient Distributed Datasets (RDDs)
\n
\n
"}]}},{"text":"%md #### Why a \"new\" computation approach?\n\nLots of existing cluster computing models. Goal is to exploit availability of large amounts of RAM.\n\nChallenge: treating a cluster a large pool of memory and putting objects in it is not a new idea, but it is prone to many difficulties which have expensive mitigations.","user":"anonymous","dateUpdated":"2017-02-18T14:17:59-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456261920_-1371550689","id":"20170218-141741_1098885690","dateCreated":"2017-02-18T14:17:41-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:12613","dateFinished":"2017-02-18T14:17:59-0800","dateStarted":"2017-02-18T14:17:59-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Why a “new” computation approach?
\n
Lots of existing cluster computing models. Goal is to exploit availability of large amounts of RAM.
\n
Challenge: treating a cluster as a large pool of memory and putting objects in it is not a new idea, but it is prone to many difficulties which have expensive mitigations.
\n
"}]}},{"text":"%md #### Distributed computation = concurrent computation + guaranteed failures\n\nWe can simplify distributed computation by addressing both of these elements:\n\n* Functional programming model for concurrency\n* Metadata for reliability (but __not__ for concurrency)","user":"anonymous","dateUpdated":"2017-02-18T14:18:17-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456279207_2046204439","id":"20170218-141759_470456622","dateCreated":"2017-02-18T14:17:59-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:12705","dateFinished":"2017-02-18T14:18:17-0800","dateStarted":"2017-02-18T14:18:17-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Distributed computation = concurrent computation + guaranteed failures
\n
We can simplify distributed computation by addressing both of these elements:
\n
\n - Functional programming model for concurrency
\n - Metadata for reliability (but not for concurrency)
\n
\n
"}]}},{"text":"%md Most of the big data world is (was) on Java-based tools (e.g., Hadoop); Scala provides a functional language with direct Java interop.\n\n*Let's look at Scala collections* -- can we make something that looks/behaves like Scala collections but transparently operates at cluster scale?","user":"anonymous","dateUpdated":"2017-02-18T14:18:30-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456291055_-1369196223","id":"20170218-141811_706327086","dateCreated":"2017-02-18T14:18:11-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:12779","dateFinished":"2017-02-18T14:18:30-0800","dateStarted":"2017-02-18T14:18:30-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Most of the big data world is (was) on Java-based tools (e.g., Hadoop); Scala provides a functional language with direct Java interop.
\n
Let’s look at Scala collections – can we make something that looks/behaves like Scala collections but transparently operates at cluster scale?
\n
"}]}},{"text":"%md Here's some Scala (note this is __not__ Spark code, just plain Scala ...
it's also not a new or peculiar to Scala ... Java 8, C#, JavaScript, and other languages have had these patterns for a very long time)","user":"anonymous","dateUpdated":"2017-02-18T14:19:07-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456310255_496682701","id":"20170218-141830_1787067081","dateCreated":"2017-02-18T14:18:30-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:12871","dateFinished":"2017-02-18T14:19:07-0800","dateStarted":"2017-02-18T14:19:07-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Here’s some Scala (note this is not Spark code, just plain Scala …
it’s also not new or peculiar to Scala … Java 8, C#, JavaScript, and other languages have had these patterns for a very long time)
\n
"}]}},{"text":"// Basic Scala (functional) collections, 1 thread\n\nval list = List(\"Apples\", \"bananas\", \"APPLES\", \"pears\", \"Bananas\")","user":"anonymous","dateUpdated":"2017-02-18T14:19:41-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456322125_1070758708","id":"20170218-141842_1092216808","dateCreated":"2017-02-18T14:18:42-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:12945","dateFinished":"2017-02-18T14:19:38-0800","dateStarted":"2017-02-18T14:19:21-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456361493_-745994950","id":"20170218-141921_1948918755","dateCreated":"2017-02-18T14:19:21-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:13037","text":"list.map(_.toUpperCase) // transform each element","dateUpdated":"2017-02-18T14:19:37-0800","dateFinished":"2017-02-18T14:19:38-0800","dateStarted":"2017-02-18T14:19:37-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456377466_590084738","id":"20170218-141937_534808141","dateCreated":"2017-02-18T14:19:37-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:13107","text":"list.groupBy(word => word.toLowerCase).mapValues(_.size) // group and count","dateUpdated":"2017-02-18T14:20:03-0800","dateFinished":"2017-02-18T14:20:03-0800","dateStarted":"2017-02-18T14:20:03-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456403380_-496415507","id":"20170218-142003_1490913215","dateCreated":"2017-02-18T14:20:03-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:13231","text":"%md Since this approach avoids side effects and does not require the user (programmer) to manage any state
(e.g., position in a collection, or partial aggregation totals), it's easy to provide a declarative parallel API:","dateUpdated":"2017-02-18T14:20:31-0800","dateFinished":"2017-02-18T14:20:31-0800","dateStarted":"2017-02-18T14:20:31-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Since this approach avoids side effects and does not require the user (programmer) to manage any state
(e.g., position in a collection, or partial aggregation totals), it’s easy to provide a declarative parallel API:
\n
"}]}},{"text":"val parList = List(\"Apples\", \"bananas\", \"APPLES\", \"pears\", \"Bananas\").par\nparList.filter(_ contains \"p\")","user":"anonymous","dateUpdated":"2017-02-18T14:20:44-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456415709_198086906","id":"20170218-142015_332548807","dateCreated":"2017-02-18T14:20:15-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:13313","dateFinished":"2017-02-18T14:20:25-0800","dateStarted":"2017-02-18T14:20:24-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456424524_-1925342274","id":"20170218-142024_954601551","dateCreated":"2017-02-18T14:20:24-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:13387","text":"%md Look, ma! No [visible] threads! ... Of course underneath there are threads or processes or coroutines or
some other mechanism for the actual parallelism. Scala uses regular JVM threads. This is similar to the Java 8 Streams API using the default Fork-Join threadpool.","dateUpdated":"2017-02-18T14:21:12-0800","dateFinished":"2017-02-18T14:21:12-0800","dateStarted":"2017-02-18T14:21:12-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Look, ma! No [visible] threads! … Of course underneath there are threads or processes or coroutines or
some other mechanism for the actual parallelism. Scala uses regular JVM threads. This is similar to the Java 8 Streams API using the default Fork-Join threadpool.
\n
"}]}},{"text":"var parList = (1 to 100).par\nparList.sum","user":"anonymous","dateUpdated":"2017-02-18T14:21:25-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456450211_-365945115","id":"20170218-142050_1856554206","dateCreated":"2017-02-18T14:20:50-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:13503","dateFinished":"2017-02-18T14:21:23-0800","dateStarted":"2017-02-18T14:21:22-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456482626_145816895","id":"20170218-142122_165808736","dateCreated":"2017-02-18T14:21:22-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:13606","text":"%md Now let's take a look at how the Spark RDD API exposes similar functionality with similar syntax:","dateUpdated":"2017-02-18T14:21:41-0800","dateFinished":"2017-02-18T14:21:41-0800","dateStarted":"2017-02-18T14:21:41-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Now let’s take a look at how the Spark RDD API exposes similar functionality with similar syntax:
\n
"}]}},{"text":"val rdd = sc.parallelize(1 to 100)\nrdd.sum","user":"anonymous","dateUpdated":"2017-02-18T14:21:56-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456501273_-1343915522","id":"20170218-142141_1420349546","dateCreated":"2017-02-18T14:21:41-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:13711","dateFinished":"2017-02-18T14:21:53-0800","dateStarted":"2017-02-18T14:21:52-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456512297_-1072436040","id":"20170218-142152_1858774990","dateCreated":"2017-02-18T14:21:52-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:13785","text":"val rdd = sc.parallelize(List(\"Apples\", \"bananas\", \"APPLES\", \"pears\", \"Bananas\"))\nrdd.map(_.toUpperCase).collect","dateUpdated":"2017-02-18T14:22:11-0800","dateFinished":"2017-02-18T14:22:12-0800","dateStarted":"2017-02-18T14:22:11-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456531833_1589287499","id":"20170218-142211_2038460759","dateCreated":"2017-02-18T14:22:11-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:13882","text":"rdd.groupBy(word => word.toLowerCase).mapValues(_.size).collect","dateUpdated":"2017-02-18T14:22:30-0800","dateFinished":"2017-02-18T14:22:31-0800","dateStarted":"2017-02-18T14:22:30-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456550356_431978132","id":"20170218-142230_1012834600","dateCreated":"2017-02-18T14:22:30-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:13969","text":"%md Those functional patterns help with concurrency ... but what about resiliency? \n\nI.e., how is this different from putting Scala collections in raw memory somewhere
(e.g., via RMI)? How can we fix things if (when!) we lose a node and all the data that was in RAM?\n\nWhat if we had a minimal metadata object that describes, deterministically, how the dataset is distributed,
and how one step of the computation is derived from the parent.","dateUpdated":"2017-02-18T14:23:13-0800","dateFinished":"2017-02-18T14:23:13-0800","dateStarted":"2017-02-18T14:23:13-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Those functional patterns help with concurrency … but what about resiliency?
\n
I.e., how is this different from putting Scala collections in raw memory somewhere
(e.g., via RMI)? How can we fix things if (when!) we lose a node and all the data that was in RAM?
\n
What if we had a minimal metadata object that describes, deterministically, how the dataset is distributed,
and how one step of the computation is derived from the parent.
\n
"}]}},{"text":"%md #### Computation as a DAG\n\n
","user":"anonymous","dateUpdated":"2017-02-18T14:23:40-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456574643_-1627735972","id":"20170218-142254_642582143","dateCreated":"2017-02-18T14:22:54-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:14053","dateFinished":"2017-02-18T14:23:40-0800","dateStarted":"2017-02-18T14:23:40-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Computation as a DAG
\n

\n
"}]}},{"text":"%md Each RDD is a small piece of metadata describing a step of the computation\n\n* Parent(s)\n* Partitions\n* Compute\n* (Locality Preference)","user":"anonymous","dateUpdated":"2017-02-18T14:24:08-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456608650_-2075137646","id":"20170218-142328_417388631","dateCreated":"2017-02-18T14:23:28-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:14162","dateFinished":"2017-02-18T14:24:08-0800","dateStarted":"2017-02-18T14:24:08-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Each RDD is a small piece of metadata describing a step of the computation
\n
\n - Parent(s)
\n - Partitions
\n - Compute
\n - (Locality Preference)
\n
\n
"}]}},{"text":"val rdd = sc.parallelize(List(\"Apples\", \"bananas\", \"APPLES\", \"pears\", \"Bananas\"))\n\nprintln(\"total partitions: \" + rdd.getNumPartitions) // how many? why?\n\nrdd.dependencies // this is the source/root RDD","user":"anonymous","dateUpdated":"2017-02-18T14:25:35-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456641651_1014658177","id":"20170218-142401_1594891979","dateCreated":"2017-02-18T14:24:01-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:14254","dateFinished":"2017-02-18T14:25:00-0800","dateStarted":"2017-02-18T14:24:59-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456670377_1788803612","id":"20170218-142430_982285224","dateCreated":"2017-02-18T14:24:30-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:14346","text":"val nextRDD = rdd.map(_.toLowerCase)\n\nnextRDD.dependencies(0) // nextRDD depends on its parent (rdd) ... and the partitions have a 1:1 dependency\n\n// https://github.com/apache/spark/blob/v2.0.0/core/src/main/scala/org/apache/spark/Dependency.scala\n","dateUpdated":"2017-02-18T14:25:11-0800","dateFinished":"2017-02-18T14:25:11-0800","dateStarted":"2017-02-18T14:25:11-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456711146_-1779160117","id":"20170218-142511_607280323","dateCreated":"2017-02-18T14:25:11-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:14486","text":"nextRDD.dependencies(0).rdd ","dateUpdated":"2017-02-18T14:25:27-0800","dateFinished":"2017-02-18T14:25:28-0800","dateStarted":"2017-02-18T14:25:27-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456727975_335835235","id":"20170218-142527_1231240247","dateCreated":"2017-02-18T14:25:27-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:14567","text":"%md Let's grab a dataset and do some simple counting. The goal is to understand how Spark executes our jobs!","dateUpdated":"2017-02-18T14:30:15-0800","dateFinished":"2017-02-18T14:30:15-0800","dateStarted":"2017-02-18T14:30:15-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Let’s grab a dataset and do some simple counting. The goal is to understand how Spark executes our jobs!
\n
"}]}},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487456971497_-644167831","id":"20170218-142931_25680823","dateCreated":"2017-02-18T14:29:31-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:14681","text":"val rdd1 = sc.textFile(\"data/pageviews-20170120-180000\")","dateUpdated":"2017-02-18T15:16:06-0800","dateFinished":"2017-02-18T15:16:06-0800","dateStarted":"2017-02-18T15:16:06-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487457128904_-1523472309","id":"20170218-143208_1602192883","dateCreated":"2017-02-18T14:32:08-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:14768","text":"rdd1.take(5)","dateUpdated":"2017-02-18T14:32:19-0800","dateFinished":"2017-02-18T14:32:19-0800","dateStarted":"2017-02-18T14:32:19-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487457139688_198294343","id":"20170218-143219_1089890362","dateCreated":"2017-02-18T14:32:19-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:14849","text":"rdd1.partitions.size","dateUpdated":"2017-02-18T14:32:30-0800","dateFinished":"2017-02-18T14:32:30-0800","dateStarted":"2017-02-18T14:32:30-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487457150360_-33293847","id":"20170218-143230_2112261355","dateCreated":"2017-02-18T14:32:30-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:14930","text":"rdd1.count","dateUpdated":"2017-02-18T14:32:37-0800","dateFinished":"2017-02-18T14:32:38-0800","dateStarted":"2017-02-18T14:32:37-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487457157527_395917360","id":"20170218-143237_1022270964","dateCreated":"2017-02-18T14:32:37-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:15011","text":"%md What is the parallelism of that job? Why?","dateUpdated":"2017-02-18T14:33:35-0800","dateFinished":"2017-02-18T14:33:35-0800","dateStarted":"2017-02-18T14:33:35-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
What is the parallelism of that job? Why?
\n
"}]}},{"text":"rdd1.take(10) // Let's just have a look\n","user":"anonymous","dateUpdated":"2017-02-18T14:35:07-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487457215162_-501240454","id":"20170218-143335_99968549","dateCreated":"2017-02-18T14:33:35-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:15101","dateFinished":"2017-02-18T14:33:52-0800","dateStarted":"2017-02-18T14:33:52-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487457232162_668578448","id":"20170218-143352_1644408352","dateCreated":"2017-02-18T14:33:52-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:15175","text":"%md Why only one task?","dateUpdated":"2017-02-18T14:34:05-0800","dateFinished":"2017-02-18T14:34:05-0800","dateStarted":"2017-02-18T14:34:05-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":""}]}},{"text":"%md #### Caching RDDs\n\n* Options for serialization, local disk cache, replication, etc.\n* Caching, like computation, is always at the granularity of partitions! \n * Why? That's the level of detail about which we have metadata for recovery!\n\nLet's start with basic memory-only caching of raw Java/Scala objects:\n","user":"anonymous","dateUpdated":"2017-02-18T14:34:57-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487457245785_-31279802","id":"20170218-143405_788450771","dateCreated":"2017-02-18T14:34:05-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:15265","dateFinished":"2017-02-18T14:34:57-0800","dateStarted":"2017-02-18T14:34:57-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Caching RDDs
\n
\n - Options for serialization, local disk cache, replication, etc.
\n - Caching, like computation, is always at the granularity of partitions!
\n - Why? That’s the level of detail about which we have metadata for recovery!
\n
\n
Let’s start with basic memory-only caching of raw Java/Scala objects:
\n
"}]}},{"text":"val enPagesRDD = rdd1.map(_.split(\" \"))\n .filter(fields => fields(0) == \"en\" && fields(1) >= \"A\")\n .map(_(1))\n \nenPagesRDD.take(10)","user":"anonymous","dateUpdated":"2017-02-18T15:17:05-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487457324168_-4793424","id":"20170218-143524_1781944258","dateCreated":"2017-02-18T14:35:24-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:15548","dateFinished":"2017-02-18T15:16:58-0800","dateStarted":"2017-02-18T15:16:56-0800","errorMessage":""},{"text":"import org.apache.spark.storage.StorageLevel._\n\nenPagesRDD.setName(\"EN Pages\").cache.count","user":"anonymous","dateUpdated":"2017-02-18T14:40:31-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487457291132_-68322965","id":"20170218-143451_119496874","dateCreated":"2017-02-18T14:34:51-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:15339","dateFinished":"2017-02-18T14:40:28-0800","dateStarted":"2017-02-18T14:40:28-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487457305899_1253320155","id":"20170218-143505_809242403","dateCreated":"2017-02-18T14:35:05-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:15431","text":"enPagesRDD.map(identity).setName(\"EN Pages - SER\").persist(MEMORY_ONLY_SER).count // why map? try it without the map call!","dateUpdated":"2017-02-18T14:40:56-0800","dateFinished":"2017-02-18T14:40:56-0800","dateStarted":"2017-02-18T14:40:56-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487457656109_-447191195","id":"20170218-144056_1064605137","dateCreated":"2017-02-18T14:40:56-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:16027","text":"enPagesRDD.toDF.cache.count","dateUpdated":"2017-02-18T14:41:24-0800","dateFinished":"2017-02-18T14:41:27-0800","dateStarted":"2017-02-18T14:41:24-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487457684628_1412177646","id":"20170218-144124_75907407","dateCreated":"2017-02-18T14:41:24-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:16116","text":"%md RDD caching defaults to MEMORY\\_ONLY\n\ni.e., `myRDD.cache` is the same as `myRDD.persist(MEMORY_ONLY)`\n\n* Additional storage levels may be useful in special circumstances\n* http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence\n\nNote that\n1. 
DataFrame/Dataset default `.cache` as `persist(MEMORY_AND_DISK)` because the compressed encoded columnar cache is expensive to rebuild\n2. Some streaming sources default to stronger caching (e.g., `MEMORY_AND_DISK_SER_2`)\n\n`rdd.unpersist()` or `dataset.unpersist()` -- immediately remove from cache","dateUpdated":"2017-02-18T14:42:20-0800","dateFinished":"2017-02-18T14:42:20-0800","dateStarted":"2017-02-18T14:42:20-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
RDD caching defaults to MEMORY_ONLY
\n
i.e., myRDD.cache
is the same as myRDD.persist(MEMORY_ONLY)
\n
\n
Note that
1. DataFrame/Dataset default .cache
as persist(MEMORY_AND_DISK)
because the compressed encoded columnar cache is expensive to rebuild
2. Some streaming sources default to stronger caching (e.g., MEMORY_AND_DISK_SER_2
)
\n
rdd.unpersist()
or dataset.unpersist()
– immediately remove from cache
\n
"}]}},{"text":"%md Where does this memory pool fit into the executors (JVMs)? How does Spark allocate memory?\n\n
\n\nhttps://0x0fff.com/spark-memory-management/\n\nNote: these percentages are 1.6; they are slightly different in 2.x but the overall behavior is the same\n\nCache Eviction is due to\n\n* Least-recently-used (LRU) when space is needed for caching other data (per programmer requests)\n* or Spark needing memory for shuffle/execution","user":"anonymous","dateUpdated":"2017-02-18T14:43:38-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487457706060_69434633","id":"20170218-144146_1016781747","dateCreated":"2017-02-18T14:41:46-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:16203","dateFinished":"2017-02-18T14:43:38-0800","dateStarted":"2017-02-18T14:43:38-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Where does this memory pool fit into the executors (JVMs)? How does Spark allocate memory?
\n

\n
https://0x0fff.com/spark-memory-management/
\n
Note: these percentages are from Spark 1.6; they are slightly different in 2.x but the overall behavior is the same
\n
Cache Eviction is due to
\n
\n - Least-recently-used (LRU) when space is needed for caching other data (per programmer requests)
\n - or Spark needing memory for shuffle/execution
\n
\n
"}]}},{"text":"%md ### RDD Job Execution In-Depth\n\n* Application\n* Job \n* Stage\n* Task\n","user":"anonymous","dateUpdated":"2017-02-18T15:05:54-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487457748411_298573123","id":"20170218-144228_699818721","dateCreated":"2017-02-18T14:42:28-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:16363","dateFinished":"2017-02-18T15:05:54-0800","dateStarted":"2017-02-18T15:05:54-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
RDD Job Execution In-Depth
\n
\n - Application
\n - Job
\n - Stage
\n - Task
\n
\n
"}]}},{"text":"%md
","user":"anonymous","dateUpdated":"2017-02-18T15:07:25-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487459154080_-829890468","id":"20170218-150554_1502543511","dateCreated":"2017-02-18T15:05:54-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:16557","dateFinished":"2017-02-18T15:07:25-0800","dateStarted":"2017-02-18T15:07:25-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n

\n
"}]}},{"text":"enPagesRDD.sample(false, 0.02).flatMap(t => t.split(\"_\")).collect // let's see ... any shuffles? ... Pipelining!","user":"anonymous","dateUpdated":"2017-02-18T15:20:42-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487459245412_1673624653","id":"20170218-150725_34689135","dateCreated":"2017-02-18T15:07:25-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:16656","dateFinished":"2017-02-18T15:20:37-0800","dateStarted":"2017-02-18T15:20:36-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487459278742_737392661","id":"20170218-150758_1869341889","dateCreated":"2017-02-18T15:07:58-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:16730","text":"val smallPages = enPagesRDD.sample(false, 0.02).flatMap(t => t.split(\"_\")) // look for shuffles ... also notice sample does not affect ordering\nsmallPages.repartition(8).collect","dateUpdated":"2017-02-18T15:27:36-0800","dateFinished":"2017-02-18T15:27:38-0800","dateStarted":"2017-02-18T15:27:36-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487459950982_2143216404","id":"20170218-151910_1317123353","dateCreated":"2017-02-18T15:19:10-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:20588","text":"%md
","dateUpdated":"2017-02-18T15:19:13-0800","dateFinished":"2017-02-18T15:19:13-0800","dateStarted":"2017-02-18T15:19:13-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n

\n
"}]}},{"text":"smallPages.getNumPartitions","user":"anonymous","dateUpdated":"2017-02-18T15:29:04-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487460533483_58105584","id":"20170218-152853_1052909922","dateCreated":"2017-02-18T15:28:53-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:21668","dateFinished":"2017-02-18T15:29:04-0800","dateStarted":"2017-02-18T15:29:04-0800","results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"\nres18: Int = 7\n"}]}},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487459910013_638402159","id":"20170218-151830_1768269158","dateCreated":"2017-02-18T15:18:30-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:20501","text":"smallPages.map(w => (w, 1)).collect","dateUpdated":"2017-02-18T15:27:58-0800","dateFinished":"2017-02-18T15:27:59-0800","dateStarted":"2017-02-18T15:27:58-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487460020637_-1128795511","id":"20170218-152020_50164686","dateCreated":"2017-02-18T15:20:20-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:20663","text":"%md And, finally, to add up the word occurrences, we'll reduceByKey","dateUpdated":"2017-02-18T15:22:28-0800","dateFinished":"2017-02-18T15:22:28-0800","dateStarted":"2017-02-18T15:22:28-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
And, finally, to add up the word occurrences, we’ll reduceByKey
\n
"}]}},{"text":"smallPages.map(w => (w, 1)).reduceByKey(_ + _).collect","user":"anonymous","dateUpdated":"2017-02-18T15:29:20-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487460148308_61784507","id":"20170218-152228_1493969637","dateCreated":"2017-02-18T15:22:28-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:20940","dateFinished":"2017-02-18T15:29:10-0800","dateStarted":"2017-02-18T15:29:09-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487460170178_954802028","id":"20170218-152250_1922869985","dateCreated":"2017-02-18T15:22:50-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:21014","text":"%md What does it look like if we increase parallelism in that last stage?","dateUpdated":"2017-02-18T15:23:16-0800","dateFinished":"2017-02-18T15:23:16-0800","dateStarted":"2017-02-18T15:23:16-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
What does it look like if we increase parallelism in that last stage?
\n
"}]}},{"text":"smallPages.map(w => (w, 1)).reduceByKey(_ + _, 32).collect","user":"anonymous","dateUpdated":"2017-02-18T15:34:50-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487460196555_1951563835","id":"20170218-152316_1335332108","dateCreated":"2017-02-18T15:23:16-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:21122","dateFinished":"2017-02-18T15:29:26-0800","dateStarted":"2017-02-18T15:29:24-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487460208723_335618455","id":"20170218-152328_161554363","dateCreated":"2017-02-18T15:23:28-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:21196","text":"%md Let's see how cached data appears in the execution UI","dateUpdated":"2017-02-18T15:23:59-0800","dateFinished":"2017-02-18T15:23:59-0800","dateStarted":"2017-02-18T15:23:59-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Let’s see how cached data appears in the execution UI
\n
"}]}},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487460234235_-594673738","id":"20170218-152354_2052147108","dateCreated":"2017-02-18T15:23:54-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:21296","text":"val wordPairsWithDuplicates = smallPages.map(w => (w, 1)).cache\nwordPairsWithDuplicates.count","dateUpdated":"2017-02-18T15:25:15-0800","dateFinished":"2017-02-18T15:25:15-0800","dateStarted":"2017-02-18T15:25:15-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487460315371_119175034","id":"20170218-152515_524061505","dateCreated":"2017-02-18T15:25:15-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:21365","text":"wordPairsWithDuplicates.reduceByKey(_ + _).collect","dateUpdated":"2017-02-18T15:25:35-0800","dateFinished":"2017-02-18T15:25:35-0800","dateStarted":"2017-02-18T15:25:35-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487460335206_-1739916327","id":"20170218-152535_1626043615","dateCreated":"2017-02-18T15:25:35-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:21446","text":"%md Take a look in the UI and look at the green dot in the DAG visualization, representing the cached RDD `wordPairsWithDuplicates`","dateUpdated":"2017-02-18T15:26:34-0800","dateFinished":"2017-02-18T15:26:34-0800","dateStarted":"2017-02-18T15:26:34-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Take a look in the UI and look at the green dot in the DAG visualization, representing the cached RDD wordPairsWithDuplicates
\n
"}]}},{"text":"%md What if we've performed a shuffle with a data set, and we need the output of that shuffle again?\n\nSpark will re-use the existing shuffle output files -- this results in skipped stages.","user":"anonymous","dateUpdated":"2017-02-18T15:30:07-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487460394452_142918908","id":"20170218-152634_586127695","dateCreated":"2017-02-18T15:26:34-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:21533","dateFinished":"2017-02-18T15:30:07-0800","dateStarted":"2017-02-18T15:30:07-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
What if we’ve performed a shuffle with a data set, and we need the output of that shuffle again?
\n
Spark will re-use the existing shuffle output files – this results in skipped stages.
\n
"}]}},{"text":"val aShuffledRDD = smallPages.map(w => (w, 1)).reduceByKey(_ + _)\naShuffledRDD.count","user":"anonymous","dateUpdated":"2017-02-18T15:31:07-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487460607881_583980709","id":"20170218-153007_548933044","dateCreated":"2017-02-18T15:30:07-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:21821","dateFinished":"2017-02-18T15:31:05-0800","dateStarted":"2017-02-18T15:31:03-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487460655192_2030292340","id":"20170218-153055_2110154167","dateCreated":"2017-02-18T15:30:55-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:21898","text":"aShuffledRDD.collect","dateUpdated":"2017-02-18T15:31:13-0800","dateFinished":"2017-02-18T15:31:13-0800","dateStarted":"2017-02-18T15:31:13-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487460673320_-464788953","id":"20170218-153113_1426673163","dateCreated":"2017-02-18T15:31:13-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:22046","text":"%md If we need to perform a full sort, this involved a range partitioning, and so will generate an additional Spark job.","dateUpdated":"2017-02-18T15:31:52-0800","dateFinished":"2017-02-18T15:31:52-0800","dateStarted":"2017-02-18T15:31:52-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
If we need to perform a full sort, this involves a range partitioning, and so will generate an additional Spark job.
\n
"}]}},{"text":"smallPages.map(w => (w, 1)).reduceByKey(_ + _).sortBy(word => -word._2).collect","user":"anonymous","dateUpdated":"2017-02-18T15:33:34-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487460712151_1013278404","id":"20170218-153152_885908470","dateCreated":"2017-02-18T15:31:52-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:22145","dateFinished":"2017-02-18T15:33:18-0800","dateStarted":"2017-02-18T15:33:17-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487460731902_-950834083","id":"20170218-153211_2078169979","dateCreated":"2017-02-18T15:32:11-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:22219","text":"%md #### How Does the Shuffle Work?\n\n* Hash shuffle\n* Sort shuffle\n* \"Tungsten-Sort\"","dateUpdated":"2017-02-18T15:35:09-0800","dateFinished":"2017-02-18T15:35:09-0800","dateStarted":"2017-02-18T15:35:09-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
How Does the Shuffle Work?
\n
\n - Hash shuffle
\n - Sort shuffle
\n - “Tungsten-Sort”
\n
\n
"}]}},{"text":"%md \n\n##### Hash Shuffle\n\n
","user":"anonymous","dateUpdated":"2017-02-18T15:33:56-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487460826826_-1280240438","id":"20170218-153346_743694898","dateCreated":"2017-02-18T15:33:46-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:22364","dateFinished":"2017-02-18T15:33:56-0800","dateStarted":"2017-02-18T15:33:56-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Hash Shuffle
\n

\n
"}]}},{"text":"%md ##### Sort Shuffle\n\n
","user":"anonymous","dateUpdated":"2017-02-18T15:34:08-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487460836778_2135037631","id":"20170218-153356_600904597","dateCreated":"2017-02-18T15:33:56-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:22438","dateFinished":"2017-02-18T15:34:08-0800","dateStarted":"2017-02-18T15:34:08-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Sort Shuffle
\n

\n
"}]}},{"text":"%md\n","user":"anonymous","dateUpdated":"2017-02-18T15:34:08-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487460848418_709927705","id":"20170218-153408_1411257291","dateCreated":"2017-02-18T15:34:08-0800","status":"READY","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:22512"}],"name":"RDD-Mem-Execution","id":"2C9QHDEEY","angularObjects":{"2CAVTNYXD:shared_process":[],"2C7K2429Q:shared_process":[],"2C8SPPQZU:shared_process":[],"2C7UX75H1:shared_process":[],"2C8CZHBMK:shared_process":[],"2C8M1YA6S:shared_process":[],"2C8HBGBZH:shared_process":[],"2C8W1YSQF:shared_process":[],"2CA11FFZW:shared_process":[],"2C8GTCYUP:shared_process":[],"2C7JSZ74W:shared_process":[],"2C7NXAQD7:shared_process":[],"2C8BUSWZY:shared_process":[],"2C8BMRSH6:shared_process":[],"2CAD4S1XP:shared_process":[],"2C8DQ16J7:shared_process":[],"2C9YA18WV:shared_process":[],"2CA315FYH:shared_process":[],"2C915WF4P:shared_process":[]},"config":{"looknfeel":"default","personalizedMode":"false"},"info":{}}
--------------------------------------------------------------------------------
/notebooks/SQL-DF-DS-Execution.json:
--------------------------------------------------------------------------------
1 | {"paragraphs":[{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487461493903_2013559314","id":"20170218-154453_2108850861","dateCreated":"2017-02-18T15:44:53-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:29534","text":"%md ## SQL, DataFrame, and Dataset\n### Catalyst, Tungsten, and Modern Spark Execution\n\n©2016, 2017 by Adam Breindel. All Rights Reserved.","dateUpdated":"2017-02-18T15:45:27-0800","dateFinished":"2017-02-18T15:45:27-0800","dateStarted":"2017-02-18T15:45:27-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
SQL, DataFrame, and Dataset
\n
Catalyst, Tungsten, and Modern Spark Execution
\n
©2016, 2017 by Adam Breindel. All Rights Reserved.
\n
"}]}},{"text":"%md Let's take the list of page title words, and write it out to disk so we can easily play with it.\n","user":"anonymous","dateUpdated":"2017-02-18T15:53:54-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487461527443_-2091680802","id":"20170218-154527_1517181149","dateCreated":"2017-02-18T15:45:27-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:29594","dateFinished":"2017-02-18T15:53:54-0800","dateStarted":"2017-02-18T15:53:54-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Let’s take the list of page title words, and write it out to disk so we can easily play with it.
\n
"}]}},{"text":"val titles = spark.read.option(\"delimiter\", \" \").csv(\"data/pageviews-20170120-180000\").filter(\"_c0 = 'en' AND _c1 >= 'A'\").select(\"_c1\")\ntitles.show","user":"anonymous","dateUpdated":"2017-02-18T15:49:11-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487461604377_117855511","id":"20170218-154644_1372034880","dateCreated":"2017-02-18T15:46:44-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:29669","dateFinished":"2017-02-18T15:49:08-0800","dateStarted":"2017-02-18T15:49:06-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487461668127_-1127283875","id":"20170218-154748_1907974241","dateCreated":"2017-02-18T15:47:48-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:29743","text":"titles.as[String].flatMap(_.split(\"_\")).write.parquet(\"data/enTitleWords\")","dateUpdated":"2017-02-18T15:50:09-0800","dateFinished":"2017-02-18T15:50:14-0800","dateStarted":"2017-02-18T15:50:09-0800","results":{"code":"SUCCESS","msg":[]}},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487461769789_-2140788575","id":"20170218-154929_95398356","dateCreated":"2017-02-18T15:49:29-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:30012","text":"%md Ok, let's start with a simple query, and take a look at the execution pattern","dateUpdated":"2017-02-18T15:50:38-0800","dateFinished":"2017-02-18T15:50:38-0800","dateStarted":"2017-02-18T15:50:38-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Ok, let’s start with a simple query, and take a look at the execution pattern
\n
"}]}},{"text":"spark.read.parquet(\"data/enTitleWords\").count\n","user":"anonymous","dateUpdated":"2017-02-18T15:50:56-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487461838861_-916609520","id":"20170218-155038_1583110915","dateCreated":"2017-02-18T15:50:38-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:30108","dateFinished":"2017-02-18T15:50:53-0800","dateStarted":"2017-02-18T15:50:53-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487461853045_-726512915","id":"20170218-155053_8839942","dateCreated":"2017-02-18T15:50:53-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:30182","text":"%md Why are there 2 stages there? (Recall that counting an RDD was just one stage and an output)\n\nhttps://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala\n\n`count` is implemented as a \"regular\" aggregation, in this case with as `data.groupBy().count()`, with the actual result extracted for convenience\n\nIn other words, each partition is counted, then the counts are merged.","dateUpdated":"2017-02-18T15:52:48-0800","dateFinished":"2017-02-18T15:52:48-0800","dateStarted":"2017-02-18T15:52:48-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Why are there 2 stages there? (Recall that counting an RDD was just one stage and an output)
\n
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
\n
count
is implemented as a “regular” aggregation, in this case as data.groupBy().count()
, with the actual result extracted for convenience
\n
In other words, each partition is counted, then the counts are merged.
\n
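A small sketch to confirm this (assuming the `data/enTitleWords` parquet written above): `count` on a Dataset gives the same answer as an explicit global aggregation, which is why the UI shows a per-partition counting stage followed by a merge stage.

```scala
val words = spark.read.parquet("data/enTitleWords")

// Stage 1: count each partition; Stage 2: merge the partial counts.
val viaCount   = words.count
val viaGroupBy = words.groupBy().count().first.getLong(0)

assert(viaCount == viaGroupBy)
```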
"}]}},{"text":"%md Doing our word count with our Dataset, let's look at the UI and understand the high-level execution: Jobs and Stages","user":"anonymous","dateUpdated":"2017-02-18T16:12:19-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487461968738_1495163507","id":"20170218-155248_1504890380","dateCreated":"2017-02-18T15:52:48-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:30283","dateFinished":"2017-02-18T16:12:19-0800","dateStarted":"2017-02-18T16:12:19-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Doing our word count with our Dataset, let’s look at the UI and understand the high-level execution: Jobs and Stages
\n
"}]}},{"text":"spark.read.parquet(\"data/enTitleWords\").groupBy('value).count.collect","user":"anonymous","dateUpdated":"2017-02-18T15:54:46-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487461980193_41197414","id":"20170218-155300_731889392","dateCreated":"2017-02-18T15:53:00-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:30357","dateFinished":"2017-02-18T15:53:37-0800","dateStarted":"2017-02-18T15:53:33-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487462065897_-17715141","id":"20170218-155425_1570043783","dateCreated":"2017-02-18T15:54:25-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:30545","text":"%md As with RDDs, adding narrow operations doesn't change the number of stages/tasks:","dateUpdated":"2017-02-18T15:54:42-0800","dateFinished":"2017-02-18T15:54:42-0800","dateStarted":"2017-02-18T15:54:42-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
As with RDDs, adding narrow operations doesn’t change the number of stages/tasks:
\n
"}]}},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487462013920_-864639415","id":"20170218-155333_677018840","dateCreated":"2017-02-18T15:53:33-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:30437","text":"spark.read.parquet(\"data/enTitleWords\").sample(false, 0.1, 42).groupBy('value).count().collect\n","dateUpdated":"2017-02-18T15:55:01-0800","dateFinished":"2017-02-18T15:55:02-0800","dateStarted":"2017-02-18T15:55:01-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487462101036_1291062700","id":"20170218-155501_699431991","dateCreated":"2017-02-18T15:55:01-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:30625","text":"%md And sorting will require an extra job with a non-deterministic number of tasks (for the RangePartition)","dateUpdated":"2017-02-18T15:56:13-0800","dateFinished":"2017-02-18T15:56:13-0800","dateStarted":"2017-02-18T15:56:13-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
And sorting will require an extra job with a non-deterministic number of tasks (for the RangePartitioner, which samples the data to pick partition boundaries)
\n
"}]}},{"text":"spark.read.parquet(\"data/enTitleWords\").sample(false, 0.1, 42).groupBy('value).count().orderBy('count desc).collect","user":"anonymous","dateUpdated":"2017-02-18T15:56:37-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487462173855_765419081","id":"20170218-155613_1618479922","dateCreated":"2017-02-18T15:56:13-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:30709","dateFinished":"2017-02-18T15:56:32-0800","dateStarted":"2017-02-18T15:56:29-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487462189686_-1129069178","id":"20170218-155629_1418402288","dateCreated":"2017-02-18T15:56:29-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:30783","text":"%md\n\n#### Spark DataFrame/Dataset Execution Pipeline\n\n
","dateUpdated":"2017-02-18T15:57:02-0800","dateFinished":"2017-02-18T15:57:02-0800","dateStarted":"2017-02-18T15:57:02-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Spark DataFrame/Dataset Execution Pipeline
\n

\n
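You can inspect each stage of this pipeline directly on a Dataset through its `queryExecution` (a sketch using the title-words data from above):

```scala
// QueryExecution exposes the plan at each step of the pipeline:
// analyzed logical plan -> optimized logical plan -> physical plan -> executed plan.
val q = spark.read.parquet("data/enTitleWords")
  .groupBy('value).count()
  .queryExecution

println(q.analyzed)       // after analysis (resolution against the catalog)
println(q.optimizedPlan)  // after Catalyst's logical optimizations
println(q.sparkPlan)      // the physical plan chosen by the planner
println(q.executedPlan)   // final plan, with exchanges and whole-stage codegen inserted
```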
"}]}},{"text":"%md Ok, how can we ask for -- and understand -- the query plan?","user":"anonymous","dateUpdated":"2017-02-18T15:57:26-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487462222771_-1643665637","id":"20170218-155702_1359280060","dateCreated":"2017-02-18T15:57:02-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:30880","dateFinished":"2017-02-18T15:57:26-0800","dateStarted":"2017-02-18T15:57:26-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Ok, how can we ask for – and understand – the query plan?
\n
"}]}},{"text":"spark.read.parquet(\"data/enTitleWords\").sample(false, 0.1, 42).groupBy('value).count().orderBy('count desc).explain(true)","user":"anonymous","dateUpdated":"2017-02-18T15:57:51-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487462246971_1634825713","id":"20170218-155726_1623265222","dateCreated":"2017-02-18T15:57:26-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:30954","dateFinished":"2017-02-18T15:57:51-0800","dateStarted":"2017-02-18T15:57:51-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487462271434_-13976565","id":"20170218-155751_1108045271","dateCreated":"2017-02-18T15:57:51-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:31028","text":"%md Great info, but a little hard to read. Let's look in the SQL UI!","dateUpdated":"2017-02-18T15:58:26-0800","dateFinished":"2017-02-18T15:58:26-0800","dateStarted":"2017-02-18T15:58:26-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Great info, but a little hard to read. Let’s look in the SQL UI!
\n
"}]}},{"text":"%md #### What does Catalyst Logical Optmizer do?\n(i.e., what is the difference between analyzed logical plan and optimized logical plan)\n\n__*Make a more performant plan that produces the same output*__\n\nHere is an example to discuss:\n\n
\n","user":"anonymous","dateUpdated":"2017-02-18T16:07:00-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487462306057_1025660935","id":"20170218-155826_2049509830","dateCreated":"2017-02-18T15:58:26-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:31115","dateFinished":"2017-02-18T16:07:00-0800","dateStarted":"2017-02-18T16:07:00-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
What does Catalyst Logical Optimizer do?
\n
(i.e., what is the difference between analyzed logical plan and optimized logical plan)
\n
Make a more performant plan that produces the same output
\n
Here is an example to discuss:
\n

\n
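As a small stand-in example (not the one from the original figure), constant folding is easy to see for yourself: the analyzed logical plan still contains the literal arithmetic, while the optimized logical plan has pre-computed it.

```scala
// Constant folding: (1 + 2) appears in the analyzed plan,
// but the optimized plan replaces it with the literal 3.
spark.range(10)
  .selectExpr("id + (1 + 2) as idPlus")
  .explain(true)   // prints parsed, analyzed, optimized, and physical plans
```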
"}]}},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487462499800_1861410523","id":"20170218-160139_627008589","dateCreated":"2017-02-18T16:01:39-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:31191","text":"%md #### Demo/Exploration of 2 Simple Optimizations:\n* Combine filters\n* Push down predicate (filter), e.g., through most project operators","dateUpdated":"2017-02-18T16:07:30-0800","dateFinished":"2017-02-18T16:07:30-0800","dateStarted":"2017-02-18T16:07:30-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Demo/Exploration of 2 Simple Optimizations:
\n
\n - Combine filters
\n - Push down predicate (filter), e.g., through most project operators
\n
\n
"}]}},{"text":"spark.read.parquet(\"data/enTitleWords\").filter('value > \"A\").filter('value < \"Z\").explain(true)","user":"anonymous","dateUpdated":"2017-02-18T16:08:11-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487462841260_-1932140824","id":"20170218-160721_1014174430","dateCreated":"2017-02-18T16:07:21-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:31459","dateFinished":"2017-02-18T16:07:58-0800","dateStarted":"2017-02-18T16:07:58-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487462878620_919126005","id":"20170218-160758_47290636","dateCreated":"2017-02-18T16:07:58-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:31568","text":"import org.apache.spark.sql.functions._\n\nspark.read.parquet(\"data/enTitleWords\").select('value, upper('value)).filter('value > \"A\").filter('value < \"Z\").explain(true)","dateUpdated":"2017-02-18T16:08:31-0800","dateFinished":"2017-02-18T16:08:31-0800","dateStarted":"2017-02-18T16:08:31-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487462911505_856073158","id":"20170218-160831_185662418","dateCreated":"2017-02-18T16:08:31-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:31665","text":"spark.read.parquet(\"data/enTitleWords\").filter('value > \"A\").select('value, upper('value)).filter('value < \"Z\").explain(true)","dateUpdated":"2017-02-18T16:08:59-0800","dateFinished":"2017-02-18T16:09:00-0800","dateStarted":"2017-02-18T16:09:00-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487462939984_727798533","id":"20170218-160859_1172863548","dateCreated":"2017-02-18T16:08:59-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:31749","text":"%md __If you want to take a closer look at the logical optimizer code and rules__\n\n* Operators\n * https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala\n* Expressions\n * https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala\n* Transformation rules\n * https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala\n\nSpark 2.x even has a user-facing public (though experimental) API for writing our own optimizations and dynamically adding them: http://blog.madhukaraphatak.com/introduction-to-spark-two-part-6/","dateUpdated":"2017-02-18T16:09:46-0800","dateFinished":"2017-02-18T16:09:46-0800","dateStarted":"2017-02-18T16:09:46-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
If you want to take a closer look at the logical optimizer code and rules
\n
\n - Operators\n \n
\n - Expressions\n \n
\n - Transformation rules\n \n
\n
\n
Spark 2.x even has a user-facing public (though experimental) API for writing our own optimizations and dynamically adding them: http://blog.madhukaraphatak.com/introduction-to-spark-two-part-6/
\n
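A minimal sketch of that experimental hook, along the lines of the blog post above: register a custom `Rule[LogicalPlan]` via `spark.experimental.extraOptimizations`. The rule name and the multiply-by-one rewrite below are just illustrations; whether it actually fires on a given query depends on how the literal ends up typed/cast in the plan.

```scala
import org.apache.spark.sql.catalyst.expressions.{Literal, Multiply}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Toy optimization: rewrite "expr * 1" (or "1 * expr") to just "expr"
object RemoveMultiplyByOne extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Multiply(left, Literal(1, _))  => left
    case Multiply(Literal(1, _), right) => right
  }
}

spark.experimental.extraOptimizations = Seq(RemoveMultiplyByOne)

// Compare the optimized logical plan with and without the extra rule registered
spark.range(10).selectExpr("id * 1").explain(true)
```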
"}]}},{"text":"%md #### Physical Planning and Execution\n\n* Physical planner\n* Physical (SparkPlan) Optimizer\n* Code Generation","user":"anonymous","dateUpdated":"2017-02-18T16:10:28-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487462965430_-1002371299","id":"20170218-160925_1694435055","dateCreated":"2017-02-18T16:09:25-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:31831","dateFinished":"2017-02-18T16:10:28-0800","dateStarted":"2017-02-18T16:10:28-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Physical Planning and Execution
\n
\n - Physical planner
\n - Physical (SparkPlan) Optimizer
\n - Code Generation
\n
\n
"}]}},{"text":"%md
\n
\n
\n__Additional optimizations...__\n\nWhat if the parquet file is partitioned by age? (or we're reading from a database with key columns and/or indices?)","user":"anonymous","dateUpdated":"2017-02-18T16:10:14-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487462997325_-837131775","id":"20170218-160957_1575272477","dateCreated":"2017-02-18T16:09:57-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:31923","dateFinished":"2017-02-18T16:10:14-0800","dateStarted":"2017-02-18T16:10:14-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n

\n
\n
\n
Additional optimizations…
\n
What if the parquet file is partitioned by age? (or we’re reading from a database with key columns and/or indices?)
\n
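A hedged sketch of the idea using the diamonds data from this repo (the `data/diamondsByCut` path is just a scratch location for the illustration): if the parquet output is partitioned by a column, a filter on that column lets Spark prune entire partitions at planning time, so the pruned files are never read. Look for the PartitionFilters entry and the number of files read in the physical plan / SQL UI.

```scala
val diamonds = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv("data/diamonds.csv")

// Write the data partitioned by "cut": one directory per distinct value
diamonds.write.mode("overwrite").partitionBy("cut").parquet("data/diamondsByCut")

// A filter on the partition column can be resolved against directory names alone,
// so only the matching partition's files are scanned.
spark.read.parquet("data/diamondsByCut")
  .filter("cut = 'Ideal'")
  .explain(true)
```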
"}]}},{"text":"%md Why Whole-Stage Codegen?\n\n* Iterator/Volcano Model\n* Codegen'ed Operators vs. Interpreted Queries\n* Purpose-Built Code vs. General Framework (Freshman Paradox)\n * Avoid virtual calls, page faults, leverage cache etc.","user":"anonymous","dateUpdated":"2017-02-18T16:13:40-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487463014132_1200144300","id":"20170218-161014_1685219149","dateCreated":"2017-02-18T16:10:14-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:32015","dateFinished":"2017-02-18T16:13:40-0800","dateStarted":"2017-02-18T16:13:40-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Why Whole-Stage Codegen?
\n
\n - Iterator/Volcano Model
\n - Codegen’ed Operators vs. Interpreted Queries
\n - Purpose-Built Code vs. General Framework (Freshman Paradox)\n
\n - Avoid virtual calls, page faults, leverage cache etc.
\n
\n \n
\n
"}]}},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487463033284_-545815561","id":"20170218-161033_714671021","dateCreated":"2017-02-18T16:10:33-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:32106","text":"import org.apache.spark.sql.execution.debug._\n\nspark.range(10).select('id * 2).debugCodegen","dateUpdated":"2017-02-18T16:11:16-0800","dateFinished":"2017-02-18T16:11:16-0800","dateStarted":"2017-02-18T16:11:16-0800","errorMessage":""},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487463055761_1017666052","id":"20170218-161055_960553788","dateCreated":"2017-02-18T16:10:55-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:32203","text":"%md If you really want to dig in, look at the codegen design doc, framework, and basic physical operators...\n\nFilter is a good one to start with! look for `case class FilterExec`\n\nhttps://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala\n\nHINTS:\n\n1. IntelliJ has great, free Scala support and \"Go To Declaration...\" will help you learn your way around the code\n2. Building Spark in IntelliJ can be tricky, but if you build the repo using its own build script (easier), you can attach IntelliJ to the process via remote debugging.\n * To enable remote debugging on the driver, `export SPARK_JAVA_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005`","dateUpdated":"2017-02-18T16:28:02-0800","dateFinished":"2017-02-18T16:28:02-0800","dateStarted":"2017-02-18T16:28:02-0800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
If you really want to dig in, look at the codegen design doc, framework, and basic physical operators…
\n
Filter is a good one to start with! Look for case class FilterExec
\n
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala
\n
HINTS:
\n
\n - IntelliJ has great, free Scala support and “Go To Declaration…” will help you learn your way around the code
\n - Building Spark in IntelliJ can be tricky, but if you build the repo using its own build script (easier), you can attach IntelliJ to the process via remote debugging.\n
\n - To enable remote debugging on the driver,
export SPARK_JAVA_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005
\n
\n \n
\n
"}]}},{"text":"%md\n","user":"anonymous","dateUpdated":"2017-02-18T16:18:16-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487463496934_1080994734","id":"20170218-161816_1413843304","dateCreated":"2017-02-18T16:18:16-0800","status":"READY","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:32414"}],"name":"SQL-DF-DS-Execution","id":"2C9SRQV8G","angularObjects":{"2CAVTNYXD:shared_process":[],"2C7K2429Q:shared_process":[],"2C8SPPQZU:shared_process":[],"2C7UX75H1:shared_process":[],"2C8CZHBMK:shared_process":[],"2C8M1YA6S:shared_process":[],"2C8HBGBZH:shared_process":[],"2C8W1YSQF:shared_process":[],"2CA11FFZW:shared_process":[],"2C8GTCYUP:shared_process":[],"2C7JSZ74W:shared_process":[],"2C7NXAQD7:shared_process":[],"2C8BUSWZY:shared_process":[],"2C8BMRSH6:shared_process":[],"2CAD4S1XP:shared_process":[],"2C8DQ16J7:shared_process":[],"2C9YA18WV:shared_process":[],"2CA315FYH:shared_process":[],"2C915WF4P:shared_process":[]},"config":{"looknfeel":"default","personalizedMode":"false"},"info":{}}
--------------------------------------------------------------------------------
/notebooks/Spark Machine Learning.json:
--------------------------------------------------------------------------------
1 | {"paragraphs":[{"text":"%md ### Machine Learning with Spark\n\n©2016, 2017 by Adam Breindel. All Rights Reserved.\n\n#### Legacy API vs. Modern API\n\n* ML Pipelines\n * DataFrame\n * Transformer / Estimator / Pipeline\n * CrossValidator / ParamGridBuilder / Evaluator\n* RDD API in maintenance mode, will be deprecated, then removed (~ 3.0?)\n * Still contains some features not present in new API (e.g., SVD, covariance matrix)\n\n#### Example with (R/ggplot) Diamonds dataset\n\n* Data manipulation (DataFrame, Transformer, Estimator)\n* Feature selection\n* Model building (Pipeline)\n* Evaluation (Evaluator)\n* Tuning\n* Crossvalidation (CrossValidator, ParamGridBuilder)","user":"anonymous","dateUpdated":"2017-02-19T12:15:01-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487532591326_907480540","id":"20170219-112951_1854979224","dateCreated":"2017-02-19T11:29:51-0800","dateStarted":"2017-02-19T12:15:01-0800","dateFinished":"2017-02-19T12:15:01-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:7669","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Machine Learning with Spark
\n
©2016, 2017 by Adam Breindel. All Rights Reserved.
\n
Legacy API vs. Modern API
\n
\n - ML Pipelines\n
\n - DataFrame
\n - Transformer / Estimator / Pipeline
\n - CrossValidator / ParamGridBuilder / Evaluator
\n
\n \n - RDD API in maintenance mode, will be deprecated, then removed (~ 3.0?)\n
\n - Still contains some features not present in new API (e.g., SVD, covariance matrix)
\n
\n \n
\n
Example with (R/ggplot) Diamonds dataset
\n
\n - Data manipulation (DataFrame, Transformer, Estimator)
\n - Feature selection
\n - Model building (Pipeline)
\n - Evaluation (Evaluator)
\n - Tuning
\n - Crossvalidation (CrossValidator, ParamGridBuilder)
\n
\n
"}]}},{"text":"spark.read.option(\"header\", true).csv(\"data/diamonds.csv\").show","user":"anonymous","dateUpdated":"2017-02-19T11:41:53-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487533117212_1110766400","id":"20170219-113837_1424131170","dateCreated":"2017-02-19T11:38:37-0800","dateStarted":"2017-02-19T11:41:50-0800","dateFinished":"2017-02-19T11:41:50-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7670"},{"text":"spark.read.option(\"header\", true).csv(\"data/diamonds.csv\").printSchema","user":"anonymous","dateUpdated":"2017-02-19T11:42:04-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487533316425_1727578539","id":"20170219-114156_497727897","dateCreated":"2017-02-19T11:41:56-0800","dateStarted":"2017-02-19T11:42:04-0800","dateFinished":"2017-02-19T11:42:04-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7671"},{"text":"val data = spark.read.option(\"header\", true)\n .option(\"inferSchema\", true)\n .csv(\"data/diamonds.csv\")\ndata.show","user":"anonymous","dateUpdated":"2017-02-19T11:42:08-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487533276943_849704143","id":"20170219-114116_1632005710","dateCreated":"2017-02-19T11:41:16-0800","dateStarted":"2017-02-19T11:42:08-0800","dateFinished":"2017-02-19T11:42:09-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7672"},{"text":"data.printSchema","user":"anonymous","dateUpdated":"2017-02-19T11:42:19-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487533328697_-1838643691","id":"20170219-114208_638207268","dateCreated":"2017-02-19T11:42:08-0800","dateStarted":"2017-02-19T11:42:19-0800","dateFinished":"2017-02-19T11:42:19-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7673"},{"text":"%md We'll look at the features in more detail ... but right away we see we'll have to do something about string-typed features. \n\nThe price (label) is an integer, not a double. In many cases, an integer can be auto-widened to a double, but there may be some places we'll have to watch out.\n\nAlso, that \"\\_c0\" (a.k.a. the row number or row ID) ... not only is it not a feature, but it can leak irrelevant data:","user":"anonymous","dateUpdated":"2017-02-19T12:15:05-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487533339615_-68398678","id":"20170219-114219_563237401","dateCreated":"2017-02-19T11:42:19-0800","dateStarted":"2017-02-19T12:15:05-0800","dateFinished":"2017-02-19T12:15:05-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:7674","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
We’ll look at the features in more detail … but right away we see we’ll have to do something about string-typed features.
\n
The price (label) is an integer, not a double. In many cases, an integer can be auto-widened to a double, but there may be some places we’ll have to watch out.
\n
Also, that “_c0” (a.k.a. the row number or row ID) … not only is it not a feature, but it can leak irrelevant data:
\n
"}]}},{"text":"z.show(data.select(\"_c0\", \"price\").sample(false, 0.01, 42))","user":"anonymous","dateUpdated":"2017-02-19T12:01:13-0800","config":{"colWidth":12,"enabled":true,"results":{"0":{"graph":{"mode":"scatterChart","height":300,"optionOpen":false},"helium":{}}},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487533349774_1881524072","id":"20170219-114229_63907511","dateCreated":"2017-02-19T11:42:29-0800","dateStarted":"2017-02-19T12:01:02-0800","dateFinished":"2017-02-19T12:01:02-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7675"},{"text":"// We'd better get rid of the row number and fix price:\n\nimport org.apache.spark.sql.types._\n\nval data2 = data.drop(\"_c0\").withColumn(\"label\", 'price cast DoubleType).drop(\"price\")\ndata2.show","user":"anonymous","dateUpdated":"2017-02-19T11:44:59-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487533432866_-1861374525","id":"20170219-114352_1415218590","dateCreated":"2017-02-19T11:43:52-0800","dateStarted":"2017-02-19T11:44:59-0800","dateFinished":"2017-02-19T11:45:00-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7676"},{"text":"data2.describe(\"carat\", \"label\", \"x\", \"y\", \"z\").show","user":"anonymous","dateUpdated":"2017-02-19T11:47:32-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487533499374_1800204137","id":"20170219-114459_1151417012","dateCreated":"2017-02-19T11:44:59-0800","dateStarted":"2017-02-19T11:47:32-0800","dateFinished":"2017-02-19T11:47:32-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7677"},{"text":"data2.filter(\"x <= 1 or y <= 1 or z <= 1\").show(100)","user":"anonymous","dateUpdated":"2017-02-19T11:48:51-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487533526757_266392517","id":"20170219-114526_172033791","dateCreated":"2017-02-19T11:45:26-0800","dateStarted":"2017-02-19T11:48:51-0800","dateFinished":"2017-02-19T11:48:51-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7678"},{"text":"%md Now we need to do some processing with the categorical features: cut, color, and clarity.","user":"anonymous","dateUpdated":"2017-02-19T12:15:09-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487533684375_-877234455","id":"20170219-114804_1000009167","dateCreated":"2017-02-19T11:48:04-0800","dateStarted":"2017-02-19T12:15:09-0800","dateFinished":"2017-02-19T12:15:09-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:7679","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Now we need to do some processing with the categorical features: cut, color, and clarity.
\n
"}]}},{"text":"z.show(data2.select(\"cut\").distinct)","user":"anonymous","dateUpdated":"2017-02-19T12:15:24-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487533766540_942951294","id":"20170219-114926_309428837","dateCreated":"2017-02-19T11:49:26-0800","dateStarted":"2017-02-19T12:15:21-0800","dateFinished":"2017-02-19T12:15:22-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7680"},{"text":"data2.groupBy(\"cut\").count.show","user":"anonymous","dateUpdated":"2017-02-19T11:50:15-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487533801417_1019329146","id":"20170219-115001_205712247","dateCreated":"2017-02-19T11:50:01-0800","dateStarted":"2017-02-19T11:50:15-0800","dateFinished":"2017-02-19T11:50:15-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7681"},{"text":"%md __In this example, we'll build a Linear Regression model.__ For this type of model, \nwe will want to convert these categorical variables into a one-hot, or \"dummy variable,\" representation,\nso we want to create a OneHotEncoder\n\n* https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.OneHotEncoder\n\nThe one-hot encoder takes a numeric value, so we need to convert the categorical values to numbers.\n\nWe can do that with a StringIndexer\n\n* https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer\n\n---\n\nThis is a good time to think about how `Transformers` and `Estimators` work and gain some design intuition.\n\nFirst, we want to be clear what the `StringIndexer` and `OneHotEncoder` are supposed to do.\n\nWhy might one be a transformer and the other an estimator? These two helpers could be built either as Transformers or Estimators. What factors would argue for/against using one pattern versus the other? Hint: there is a Spark JIRA to \"fix\" OneHotEncoder, perhaps in 2.2, to be an Estimator.\n\nNote that neither can be performed as a pure map operation. They both need to accumulate some bit of state, via a reduce, to then use in the map.\n\nEXTRA CREDIT: Try to locate these bits in the source code. Can you find where the \"reduce\" job is hidden in the OneHotEncoder (even though it's a transformer)?\n\n---\n\nSee if you can manually run these two helpers against the data.\n\n","user":"anonymous","dateUpdated":"2017-02-19T12:15:30-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487533811224_825061600","id":"20170219-115011_934534280","dateCreated":"2017-02-19T11:50:11-0800","dateStarted":"2017-02-19T12:15:30-0800","dateFinished":"2017-02-19T12:15:30-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:7682","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
In this example, we’ll build a Linear Regression model. For this type of model,
we will want to convert these categorical variables into a one-hot, or “dummy variable,” representation,
so we want to create a OneHotEncoder
\n
\n
The one-hot encoder takes a numeric value, so we need to convert the categorical values to numbers.
\n
We can do that with a StringIndexer
\n
\n
\n
This is a good time to think about how Transformers
and Estimators
work and gain some design intuition.
\n
First, we want to be clear what the StringIndexer
and OneHotEncoder
are supposed to do.
\n
Why might one be a transformer and the other an estimator? These two helpers could be built either as Transformers or Estimators. What factors would argue for/against using one pattern versus the other? Hint: there is a Spark JIRA to “fix” OneHotEncoder, perhaps in 2.2, to be an Estimator.
\n
Note that neither can be performed as a pure map operation. They both need to accumulate some bit of state, via a reduce, to then use in the map.
\n
EXTRA CREDIT: Try to locate these bits in the source code. Can you find where the “reduce” job is hidden in the OneHotEncoder (even though it’s a transformer)?
\n
\n
See if you can manually run these two helpers against the data.
\n
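One possible answer to the exercise (a minimal sketch, assuming the `data2` DataFrame from the cells above and using the `cut` column; the `cutIndex`/`cutVec` names are just for illustration):

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// StringIndexer is an Estimator: fit() makes a pass over the data to learn the
// category -> index mapping, producing a StringIndexerModel (a Transformer).
val cutIndexed = new StringIndexer()
  .setInputCol("cut")
  .setOutputCol("cutIndex")
  .fit(data2)
  .transform(data2)

// OneHotEncoder (in this Spark version) is a plain Transformer: it maps each index
// to a sparse 0/1 vector, relying on column metadata for the number of categories.
val cutEncoded = new OneHotEncoder()
  .setInputCol("cutIndex")
  .setOutputCol("cutVec")
  .transform(cutIndexed)

cutEncoded.select("cut", "cutIndex", "cutVec").show(5)
```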
"}]}},{"text":"// try it here","user":"anonymous","dateUpdated":"2017-02-19T11:53:37-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487533911695_267715330","id":"20170219-115151_209900794","dateCreated":"2017-02-19T11:51:51-0800","dateStarted":"2017-02-19T11:53:35-0800","dateFinished":"2017-02-19T11:53:35-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7683"},{"text":"%md Now let's automate this work a bit: we'll use functional collections to create our feature helpers, and a pipeline to wrap them","user":"anonymous","dateUpdated":"2017-02-19T12:15:34-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487534015139_731691855","id":"20170219-115335_1690778894","dateCreated":"2017-02-19T11:53:35-0800","dateStarted":"2017-02-19T12:15:34-0800","dateFinished":"2017-02-19T12:15:34-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:7684","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Now let’s automate this work a bit: we’ll use functional collections to create our feature helpers, and a pipeline to wrap them
\n
"}]}},{"text":"import org.apache.spark.ml._\nimport org.apache.spark.ml.feature._\n\nval categoricalFields = Seq(\"cut\", \"color\", \"clarity\")\n\nval indexers = categoricalFields.map(f => new StringIndexer().setInputCol(f).setOutputCol(f + \"Index\"))\n\nval encoders = categoricalFields.map(f => new OneHotEncoder().setInputCol(f + \"Index\").setOutputCol(f + \"Vec\"))\n\nval pipeline = new Pipeline().setStages( (indexers ++ encoders).toArray ).fit(data2)\n\npipeline.transform(data2).show","user":"anonymous","dateUpdated":"2017-02-19T11:54:03-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487534033565_-2018514064","id":"20170219-115353_845492268","dateCreated":"2017-02-19T11:53:53-0800","dateStarted":"2017-02-19T11:54:00-0800","dateFinished":"2017-02-19T11:54:02-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7685"},{"text":"%md That looks pretty good. Next, we need to bring all of our features together into a single vector. \nThe `VectorAssembler` class, a `Transformer` (why?) does exactly that:","user":"anonymous","dateUpdated":"2017-02-19T12:15:36-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487534040226_-1873063656","id":"20170219-115400_1179469190","dateCreated":"2017-02-19T11:54:00-0800","dateStarted":"2017-02-19T12:15:36-0800","dateFinished":"2017-02-19T12:15:36-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:7686","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
That looks pretty good. Next, we need to bring all of our features together into a single vector.
The VectorAssembler
class, a Transformer
(why?) does exactly that:
\n
"}]}},{"text":"val assembler = new VectorAssembler()\n .setInputCols( (categoricalFields.map(_ + \"Vec\") ++ Seq(\"carat\", \"depth\", \"table\", \"x\", \"y\", \"z\")).toArray )\n .setOutputCol(\"features\")\n\nnew Pipeline()\n .setStages( ((indexers ++ encoders) :+ assembler).toArray )\n .fit(data2)\n .transform(data2)\n .drop(\"carat\", \"depth\", \"table\", \"x\", \"y\", \"z\") // to save space on the screen!\n .show","user":"anonymous","dateUpdated":"2017-02-19T11:57:21-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487534094992_-1025785281","id":"20170219-115454_1035763807","dateCreated":"2017-02-19T11:54:54-0800","dateStarted":"2017-02-19T11:57:15-0800","dateFinished":"2017-02-19T11:57:16-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7687"},{"text":"%md Let's finish the pipeline by adding the Linear Regression algorithm","user":"anonymous","dateUpdated":"2017-02-19T12:15:38-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487534104624_-2077842015","id":"20170219-115504_655775644","dateCreated":"2017-02-19T11:55:04-0800","dateStarted":"2017-02-19T12:15:38-0800","dateFinished":"2017-02-19T12:15:38-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:7688","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Let’s finish the pipeline by adding the Linear Regression algorithm
\n
"}]}},{"text":"import org.apache.spark.ml.regression._\n\nval lr = new LinearRegression()\n\nval completePipeline = new Pipeline().setStages( ((indexers ++ encoders) :+ assembler :+ lr).toArray )","user":"anonymous","dateUpdated":"2017-02-19T11:57:42-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487534252858_-1002237333","id":"20170219-115732_467292383","dateCreated":"2017-02-19T11:57:32-0800","dateStarted":"2017-02-19T11:57:39-0800","dateFinished":"2017-02-19T11:57:40-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7689"},{"text":"%md Now we're ready to train and do an initial test","user":"anonymous","dateUpdated":"2017-02-19T12:15:41-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487534259602_1229306288","id":"20170219-115739_2099637335","dateCreated":"2017-02-19T11:57:39-0800","dateStarted":"2017-02-19T12:15:41-0800","dateFinished":"2017-02-19T12:15:41-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:7690","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Now we’re ready to train and do an initial test
\n
"}]}},{"text":"val Array(train, test) = data2.randomSplit(Array(0.75, 0.25), 42)\nval model = completePipeline.fit(train)\nval predictions = model.transform(test).select('label, 'prediction)\nz.show(predictions.sample(false, 0.05, 42))","user":"anonymous","dateUpdated":"2017-02-19T12:00:32-0800","config":{"colWidth":12,"enabled":true,"results":{"1":{"graph":{"mode":"scatterChart","height":300,"optionOpen":false,"setting":{"scatterChart":{"xAxis":{"name":"label","index":0,"aggr":"sum"},"yAxis":{"name":"prediction","index":1,"aggr":"sum"}}}},"helium":{}}},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487534276033_-811386554","id":"20170219-115756_2109335244","dateCreated":"2017-02-19T11:57:56-0800","dateStarted":"2017-02-19T11:58:46-0800","dateFinished":"2017-02-19T11:58:52-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7691"},{"text":"%md Looks plausible ... but let's get some numbers using the `Evaluator` class.\n\n`Evaluator` is useful for \n\n* Obtaining performance statistics on our models\n\n* Enabling SparkML to obtains statistics on models that it build as part of automated hyperparameter tuning and cross validations","user":"anonymous","dateUpdated":"2017-02-19T12:15:42-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487534326449_-317125424","id":"20170219-115846_704049924","dateCreated":"2017-02-19T11:58:46-0800","dateStarted":"2017-02-19T12:15:42-0800","dateFinished":"2017-02-19T12:15:42-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:7692","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Looks plausible … but let’s get some numbers using the Evaluator
class.
\n
Evaluator
is useful for
\n
\n
"}]}},{"text":"import org.apache.spark.ml.evaluation._\nval eval = new RegressionEvaluator()\neval.evaluate(predictions)","user":"anonymous","dateUpdated":"2017-02-19T12:03:18-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487534586717_-1139381440","id":"20170219-120306_725391657","dateCreated":"2017-02-19T12:03:06-0800","dateStarted":"2017-02-19T12:03:14-0800","dateFinished":"2017-02-19T12:03:15-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7693"},{"text":"%md Ok ... let's see if we can wire Spark up to tune the model. In particular, \n\n1. we'll give Spark a set of hyperparamters to adjust, and some values to try\n2. let Spark build a series of tentative models using k-fold crossvalidation\n3. tell Spark what evaluator to use, so it knows which tentative models are better","user":"anonymous","dateUpdated":"2017-02-19T12:15:45-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487534594725_-36291020","id":"20170219-120314_1557023944","dateCreated":"2017-02-19T12:03:14-0800","dateStarted":"2017-02-19T12:15:45-0800","dateFinished":"2017-02-19T12:15:45-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:7694","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Ok … let’s see if we can wire Spark up to tune the model. In particular,
\n
\n - we’ll give Spark a set of hyperparameters to adjust, and some values to try
\n - let Spark build a series of tentative models using k-fold cross-validation
\n - tell Spark what evaluator to use, so it knows which tentative models are better
\n
\n
"}]}},{"text":"import org.apache.spark.ml.evaluation.RegressionEvaluator\nimport org.apache.spark.ml.tuning._\n\nval paramGrid = new ParamGridBuilder()\n .addGrid(lr.elasticNetParam, Array(0.3, 0.7))\n .addGrid(lr.regParam, Array(0.01, 0.1))\n .build()\n\nval cv = new CrossValidator()\n .setEstimator(completePipeline)\n .setEvaluator(new RegressionEvaluator)\n .setEstimatorParamMaps(paramGrid)\n .setNumFolds(3)\n\nval cvModel = cv.fit(train)","user":"anonymous","dateUpdated":"2017-02-19T12:06:07-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487534758015_-1851473389","id":"20170219-120558_532949922","dateCreated":"2017-02-19T12:05:58-0800","dateStarted":"2017-02-19T12:06:04-0800","dateFinished":"2017-02-19T12:06:20-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7695"},{"text":"%md How different was the performance across the different parameter sets?","user":"anonymous","dateUpdated":"2017-02-19T12:15:48-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487534764606_-1668317589","id":"20170219-120604_1585803515","dateCreated":"2017-02-19T12:06:04-0800","dateStarted":"2017-02-19T12:15:48-0800","dateFinished":"2017-02-19T12:15:48-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:7696","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
How different was the performance across the different parameter sets?
\n
"}]}},{"text":"cvModel.getEstimatorParamMaps.zip(cvModel.avgMetrics)","user":"anonymous","dateUpdated":"2017-02-19T12:15:55-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487534793112_1685984619","id":"20170219-120633_688046917","dateCreated":"2017-02-19T12:06:33-0800","dateStarted":"2017-02-19T12:15:50-0800","dateFinished":"2017-02-19T12:15:50-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7697"},{"text":"%md After training the CrossValidatorModel, `cvModel.bestModel` will contain a model trained on all of the training data using the best hyperparams.\n\nHowever, we could also train that \"best model\" ourselves:","user":"anonymous","dateUpdated":"2017-02-19T12:15:59-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487534816022_1797192376","id":"20170219-120656_1325278688","dateCreated":"2017-02-19T12:06:56-0800","dateStarted":"2017-02-19T12:15:59-0800","dateFinished":"2017-02-19T12:15:59-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:7698","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
After training the CrossValidatorModel, cvModel.bestModel
will contain a model trained on all of the training data using the best hyperparams.
\n
However, we could also train that “best model” ourselves:
\n
"}]}},{"text":"val lr = new LinearRegression().setRegParam(0.1).setElasticNetParam(0.7)\nval completePipeline = new Pipeline().setStages( ((indexers ++ encoders) :+ assembler :+ lr).toArray )\nval finalModel = completePipeline.fit(train)","user":"anonymous","dateUpdated":"2017-02-19T12:07:22-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487534833683_1996689876","id":"20170219-120713_1880334707","dateCreated":"2017-02-19T12:07:13-0800","dateStarted":"2017-02-19T12:07:19-0800","dateFinished":"2017-02-19T12:07:23-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7699"},{"text":"%md Run the final model against the test set","user":"anonymous","dateUpdated":"2017-02-19T12:16:02-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487534839675_-1254838403","id":"20170219-120719_85872729","dateCreated":"2017-02-19T12:07:19-0800","dateStarted":"2017-02-19T12:16:02-0800","dateFinished":"2017-02-19T12:16:02-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:7700","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Run the final model against the test set
\n
"}]}},{"text":"val predictions = finalModel.transform(test).select('label, 'prediction)\nval eval = new RegressionEvaluator()\neval.evaluate(predictions)","user":"anonymous","dateUpdated":"2017-02-19T12:07:41-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487534853027_708181133","id":"20170219-120733_197164508","dateCreated":"2017-02-19T12:07:33-0800","dateStarted":"2017-02-19T12:07:39-0800","dateFinished":"2017-02-19T12:07:40-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7701"},{"text":"%md We can examine the parameters of the final model:","user":"anonymous","dateUpdated":"2017-02-19T12:16:05-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487534888561_102833196","id":"20170219-120808_499925378","dateCreated":"2017-02-19T12:08:08-0800","dateStarted":"2017-02-19T12:16:05-0800","dateFinished":"2017-02-19T12:16:05-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:7702","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
We can examine the parameters of the final model:
\n
"}]}},{"text":"finalModel.stages.last.asInstanceOf[LinearRegressionModel].coefficients","user":"anonymous","dateUpdated":"2017-02-19T12:08:03-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487534859067_1017149824","id":"20170219-120739_777219916","dateCreated":"2017-02-19T12:07:39-0800","dateStarted":"2017-02-19T12:08:03-0800","dateFinished":"2017-02-19T12:08:03-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7703"},{"text":"finalModel.stages.last.asInstanceOf[LinearRegressionModel].intercept","user":"anonymous","dateUpdated":"2017-02-19T12:09:03-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487534883554_-1034869302","id":"20170219-120803_1553554723","dateCreated":"2017-02-19T12:08:03-0800","dateStarted":"2017-02-19T12:09:03-0800","dateFinished":"2017-02-19T12:09:03-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7704"},{"text":"%md One area where Spark ML really shines is in the pluggability of the different components. \nLet's swap in a Gradient Boosted Tree regressor and see if we can do better:","user":"anonymous","dateUpdated":"2017-02-19T12:16:08-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487534943234_-1044411403","id":"20170219-120903_508219683","dateCreated":"2017-02-19T12:09:03-0800","dateStarted":"2017-02-19T12:16:08-0800","dateFinished":"2017-02-19T12:16:08-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:7705","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
One area where Spark ML really shines is in the pluggability of the different components.
Let’s swap in a Gradient Boosted Tree regressor and see if we can do better:
\n
"}]}},{"text":"val baseGBTPipeline = new Pipeline().setStages( ((indexers ++ encoders) :+ assembler :+ new GBTRegressor).toArray )\n\nval baseGBTPipelineModel = baseGBTPipeline.fit(train)","user":"anonymous","dateUpdated":"2017-02-19T12:13:24-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487535031455_-17361703","id":"20170219-121031_1700959407","dateCreated":"2017-02-19T12:10:31-0800","dateStarted":"2017-02-19T12:13:24-0800","dateFinished":"2017-02-19T12:13:35-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7706"},{"text":"eval.evaluate(baseGBTPipelineModel.transform(test))","user":"anonymous","dateUpdated":"2017-02-19T12:13:37-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487535130060_1134563577","id":"20170219-121210_699263768","dateCreated":"2017-02-19T12:12:10-0800","dateStarted":"2017-02-19T12:13:37-0800","dateFinished":"2017-02-19T12:13:37-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7707"},{"text":"%md Not surprisingly, we did significantly better with the GBT model. See if you can add tuning and crossvalidation,\nand improve the GBT model even further!","user":"anonymous","dateUpdated":"2017-02-19T12:16:11-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487535217115_227065160","id":"20170219-121337_1921596346","dateCreated":"2017-02-19T12:13:37-0800","dateStarted":"2017-02-19T12:16:11-0800","dateFinished":"2017-02-19T12:16:11-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:7708","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Not surprisingly, we did significantly better with the GBT model. See if you can add tuning and cross-validation,
and improve the GBT model even further!
\n
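One possible starting point for that exercise (a sketch only, reusing the `indexers`, `encoders`, `assembler`, `train`, `test`, and `eval` values defined earlier in this notebook; the grid values are illustrative, not tuned):

```
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.regression.GBTRegressor
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Keep a reference to the GBT stage so its params can go into the grid
val gbt = new GBTRegressor()
val gbtPipeline = new Pipeline().setStages( ((indexers ++ encoders) :+ assembler :+ gbt).toArray )

val gbtGrid = new ParamGridBuilder()
  .addGrid(gbt.maxDepth, Array(3, 5))
  .addGrid(gbt.maxIter, Array(10, 20))
  .build()

val gbtCV = new CrossValidator()
  .setEstimator(gbtPipeline)
  .setEvaluator(new RegressionEvaluator)   // RMSE by default
  .setEstimatorParamMaps(gbtGrid)
  .setNumFolds(3)

val gbtCVModel = gbtCV.fit(train)
eval.evaluate(gbtCVModel.transform(test))
```

Expect this to take a while: cross-validation fits numFolds times the number of grid points, and each GBT pipeline is considerably more expensive to train than linear regression.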
"}]}},{"text":"%md\n","user":"anonymous","dateUpdated":"2017-02-19T12:14:33-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487535273612_-1822030861","id":"20170219-121433_827534889","dateCreated":"2017-02-19T12:14:33-0800","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:7709"}],"name":"Spark Machine Learning","id":"2CANQG3YX","angularObjects":{"2CAVTNYXD:shared_process":[],"2C7K2429Q:shared_process":[],"2C8SPPQZU:shared_process":[],"2C7UX75H1:shared_process":[],"2C8CZHBMK:shared_process":[],"2C8M1YA6S:shared_process":[],"2C8HBGBZH:shared_process":[],"2C8W1YSQF:shared_process":[],"2CA11FFZW:shared_process":[],"2C8GTCYUP:shared_process":[],"2C7JSZ74W:shared_process":[],"2C7NXAQD7:shared_process":[],"2C8BUSWZY:shared_process":[],"2C8BMRSH6:shared_process":[],"2CAD4S1XP:shared_process":[],"2C8DQ16J7:shared_process":[],"2C9YA18WV:shared_process":[],"2CA315FYH:shared_process":[],"2C915WF4P:shared_process":[]},"config":{"looknfeel":"default","personalizedMode":"false"},"info":{}}
--------------------------------------------------------------------------------
/notebooks/Streaming.json:
--------------------------------------------------------------------------------
1 | {"paragraphs":[{"text":"%md ## Spark Streaming\n\n©2016, 2017 by Adam Breindel. All Rights Reserved.","user":"anonymous","dateUpdated":"2017-02-19T11:24:20-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487528716707_1493906859","id":"20170219-102516_867721242","dateCreated":"2017-02-19T10:25:16-0800","dateStarted":"2017-02-19T11:24:20-0800","dateFinished":"2017-02-19T11:24:20-0800","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:41779","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Spark Streaming
\n
©2016, 2017 by Adam Breindel. All Rights Reserved.
\n
"}]}},{"text":"%md #### Structured Streaming vs. DStream Streaming\n\n* DStream (\"classic\")\n * Programming model and workarounds\n * Operational models (receivers, receiverless)\n * Fault-tolerance considerations\n* Structured Streaming (\"streaming DataFrames\")\n * Goals\n * Advantages and Limitations\n\n#### Common Factors\n\n * High throughput\n * Microbatch processing\n * *Not* low latency (due to job scheduling)\n * Ideal for bulk computation, not per-event isolated computation","user":"anonymous","dateUpdated":"2017-02-19T11:24:22-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487528818769_-2045889767","id":"20170219-102658_1868081552","dateCreated":"2017-02-19T10:26:58-0800","dateStarted":"2017-02-19T11:24:22-0800","dateFinished":"2017-02-19T11:24:22-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41780","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Structured Streaming vs. DStream Streaming
\n
\n - DStream (“classic”)\n
\n - Programming model and workarounds
\n - Operational models (receivers, receiverless)
\n - Fault-tolerance considerations
\n
\n \n - Structured Streaming (“streaming DataFrames”)\n
\n - Goals
\n - Advantages and Limitations
\n
\n \n
\n
Common Factors
\n
\n - High throughput
\n - Microbatch processing
\n - Not low latency (due to job scheduling)
\n - Ideal for bulk computation, not per-event isolated computation
\n
\n
"}]}},{"text":"%md __DStreams__\n\n
","user":"anonymous","dateUpdated":"2017-02-19T11:24:25-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487528858026_1923842115","id":"20170219-102738_2063400792","dateCreated":"2017-02-19T10:27:38-0800","dateStarted":"2017-02-19T11:24:25-0800","dateFinished":"2017-02-19T11:24:25-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41781","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
DStreams
\n

\n
"}]}},{"text":"%md __General Receiver-Based Model__\n\n(run this next cell just to set up a data source; don't worry about the code)","user":"anonymous","dateUpdated":"2017-02-19T11:24:26-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487529050844_-1476509012","id":"20170219-103050_1264893885","dateCreated":"2017-02-19T10:30:50-0800","dateStarted":"2017-02-19T11:24:26-0800","dateFinished":"2017-02-19T11:24:27-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41782","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
General Receiver-Based Model
\n
(run this next cell just to set up a data source; don’t worry about the code)
\n
"}]}},{"text":"import org.apache.spark.streaming._ \n\nclass DemoSource() extends org.apache.spark.streaming.receiver.Receiver[(Int,Int)]( org.apache.spark.storage.StorageLevel.MEMORY_ONLY) { \n def onStart() { \n new Thread(\"Demo Receiver\") {\n override def run() { genData() }\n }.start()\n }\n\n def onStop() { }\n \n private def genData() { \n val r = scala.util.Random\n val customers = 100 to 120\n while(!isStopped) {\n store( (customers(r.nextInt(customers.size)), 40 + r.nextInt(50)) )\n Thread.sleep(20 + r.nextInt(40))\n }\n }\n}","user":"anonymous","dateUpdated":"2017-02-19T11:24:29-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487529075627_2059379352","id":"20170219-103115_1642824979","dateCreated":"2017-02-19T10:31:15-0800","dateStarted":"2017-02-19T11:24:29-0800","dateFinished":"2017-02-19T11:24:30-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41783","results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"\nimport org.apache.spark.streaming._\n\ndefined class DemoSource\n"}]}},{"text":"%md What does that source produce? Key-value data, where the key represents a user ID and the value a piece of sensor data (freeway speed).\n\nThe records look like this:\n```\n(101, 55)\n(110, 60)\n(103, 63)\n(101, 59)\n```\netc.","user":"anonymous","dateUpdated":"2017-02-19T11:24:32-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487529085883_-298361907","id":"20170219-103125_1719703299","dateCreated":"2017-02-19T10:31:25-0800","dateStarted":"2017-02-19T11:24:32-0800","dateFinished":"2017-02-19T11:24:32-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41784","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
What does that source produce? Key-value data, where the key represents a user ID and the value a piece of sensor data (freeway speed).
\n
The records look like this:
\n
(101, 55)\n(110, 60)\n(103, 63)\n(101, 59)\n
\n
etc.
\n
"}]}},{"text":"%md Our first streaming code:","user":"anonymous","dateUpdated":"2017-02-19T11:24:35-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487529357610_-300687370","id":"20170219-103557_183939056","dateCreated":"2017-02-19T10:35:57-0800","dateStarted":"2017-02-19T11:24:35-0800","dateFinished":"2017-02-19T11:24:35-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41785","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Our first streaming code:
\n
"}]}},{"text":"val ssc = new StreamingContext(sc, Seconds(1))\nval stream = ssc.receiverStream(new DemoSource())\n\nssc.start\nThread.sleep(4000)\nssc.stop(false)","user":"anonymous","dateUpdated":"2017-02-19T10:36:16-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487529364394_-93292388","id":"20170219-103604_684175870","dateCreated":"2017-02-19T10:36:04-0800","dateStarted":"2017-02-19T10:36:11-0800","dateFinished":"2017-02-19T10:36:13-0800","status":"ERROR","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:41786"},{"text":"%md Doh!","user":"anonymous","dateUpdated":"2017-02-19T11:24:37-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487529371473_-185616800","id":"20170219-103611_558192802","dateCreated":"2017-02-19T10:36:11-0800","dateStarted":"2017-02-19T11:24:37-0800","dateFinished":"2017-02-19T11:24:37-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41787","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":""}]}},{"text":"val ssc = new StreamingContext(sc, Seconds(1))\nval stream = ssc.receiverStream(new DemoSource())\n\nstream.foreachRDD { //foreachRDD is a DStream action\n rdd => println(\"Received \" + rdd.count /* RDD action */ + \" data records at \" + java.util.Calendar.getInstance.getTimeInMillis)\n}\n\nssc.start\nThread.sleep(4000)\nssc.stop(false)","user":"anonymous","dateUpdated":"2017-02-19T10:36:47-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487529395817_2013854744","id":"20170219-103635_1921904302","dateCreated":"2017-02-19T10:36:35-0800","dateStarted":"2017-02-19T10:36:38-0800","dateFinished":"2017-02-19T10:36:45-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:41788"},{"text":"%md Ok, that was better ... can we add just a little more logic to the stream?","user":"anonymous","dateUpdated":"2017-02-19T11:24:40-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487529398889_-311198183","id":"20170219-103638_1788059295","dateCreated":"2017-02-19T10:36:38-0800","dateStarted":"2017-02-19T11:24:40-0800","dateFinished":"2017-02-19T11:24:40-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41789","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Ok, that was better … can we add just a little more logic to the stream?
\n
"}]}},{"text":"stream.foreachRDD { //foreachRDD is a DStream action\n rdd => println(\"First record was \" + rdd.first)\n}","user":"anonymous","dateUpdated":"2017-02-19T10:37:15-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487529416968_-1199183321","id":"20170219-103656_1251482232","dateCreated":"2017-02-19T10:36:56-0800","dateStarted":"2017-02-19T10:37:10-0800","dateFinished":"2017-02-19T10:37:11-0800","status":"ERROR","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:41790"},{"text":"%md Nope ... but what went wrong?\n\nIs it\n1. a stream can only feed a single action or transformation? or \n2. stream was already stopped?\n\nPer the error message, the problem is #2.\n\nWe *are* allowed to attach any number of transformations and/or actions to streams, although there are performance implications.\n\nWhy? each action will require one or more jobs, and the jobs overhead is a big factor in streaming performance. That's why Spark does great with high throughput, provided the jobs are not too frequent (i.e., no low-latency).","user":"anonymous","dateUpdated":"2017-02-19T11:24:42-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487529430608_-1193766188","id":"20170219-103710_323841711","dateCreated":"2017-02-19T10:37:10-0800","dateStarted":"2017-02-19T11:24:42-0800","dateFinished":"2017-02-19T11:24:42-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41791","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Nope … but what went wrong?
\n
Is it that
1. a stream can only feed a single action or transformation? or
2. the stream was already stopped?
\n
Per the error message, the problem is #2.
\n
We are allowed to attach any number of transformations and/or actions to streams, although there are performance implications.
\n
Why? Each action requires one or more jobs, and job-scheduling overhead is a big factor in streaming performance. That’s why Spark does great with high throughput, provided the jobs are not too frequent (i.e., it is not a low-latency system).
\n
"}]}},{"text":"val ssc = new StreamingContext(sc, Seconds(1))\nval stream = ssc.receiverStream(new DemoSource())\n\nstream.foreachRDD { //foreachRDD is a DStream action\n rdd => println(\"Received \" + rdd.count /* RDD action */ + \" data records at \" + java.util.Calendar.getInstance.getTimeInMillis)\n}\n\nstream.foreachRDD { //foreachRDD is a DStream action\n rdd => println(\"First record was \" + rdd.first)\n}\n\nssc.start\nThread.sleep(4000)\nssc.stop(false)","user":"anonymous","dateUpdated":"2017-02-19T11:24:55-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487529484173_1997233921","id":"20170219-103804_766593924","dateCreated":"2017-02-19T10:38:04-0800","dateStarted":"2017-02-19T11:24:45-0800","dateFinished":"2017-02-19T11:24:51-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:41792"},{"text":"%md It does work ... how does the performance compare? Run the following cell with and without the comment block ...
It's set to run for 30 seconds so you have time to look at the Spark Streaming GUI","user":"anonymous","dateUpdated":"2017-02-19T11:24:59-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487529509933_-1162077216","id":"20170219-103829_1746440337","dateCreated":"2017-02-19T10:38:29-0800","dateStarted":"2017-02-19T11:24:59-0800","dateFinished":"2017-02-19T11:24:59-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41793","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
It does work … how does the performance compare? Run the following cell with and without the comment block …
It’s set to run for 30 seconds so you have time to look at the Spark Streaming GUI
\n
"}]}},{"text":"val ssc = new StreamingContext(sc, Seconds(1))\nval stream = ssc.receiverStream(new DemoSource())\n\nstream.foreachRDD { //foreachRDD is a DStream action\n rdd => println(\"Received \" + rdd.count /* RDD action */ + \" data records at \" + java.util.Calendar.getInstance.getTimeInMillis)\n}\n\n/*\nstream.foreachRDD { //foreachRDD is a DStream action\n rdd => println(\"First record was \" + rdd.first)\n}\n*/\n\nssc.start\nThread.sleep(30000)\nssc.stop(false)","user":"anonymous","dateUpdated":"2017-02-19T11:25:05-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487529522293_1541568630","id":"20170219-103842_977029022","dateCreated":"2017-02-19T10:38:42-0800","dateStarted":"2017-02-19T11:25:02-0800","dateFinished":"2017-02-19T11:25:34-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:41794"},{"text":"%md __DStream Transformations__\n\n
\n\nTake a close look at the following cell. How is it different from the earlier counting code?\n\nIt uses the .count() stream transformation!","user":"anonymous","dateUpdated":"2017-02-19T11:25:28-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487529536316_-937737959","id":"20170219-103856_1195992770","dateCreated":"2017-02-19T10:38:56-0800","dateStarted":"2017-02-19T11:25:28-0800","dateFinished":"2017-02-19T11:25:28-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41795","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
DStream Transformations
\n

\n
Take a close look at the following cell. How is it different from the earlier counting code?
\n
It uses the .count() stream transformation!
\n
"}]}},{"text":"val ssc = new StreamingContext(sc, Seconds(1))\nval stream = ssc.receiverStream(new DemoSource())\n\nstream.count.foreachRDD { // count *DStream transformation* + foreachRDD *DStream action*\n rdd => println(\"Received \" + rdd.first /* rdd *action* */ + \" data records at \" + java.util.Calendar.getInstance.getTimeInMillis)\n}\n\nssc.start\nThread.sleep(4000)\nssc.stop(false)","user":"anonymous","dateUpdated":"2017-02-19T11:25:31-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487529571097_141143930","id":"20170219-103931_146087602","dateCreated":"2017-02-19T10:39:31-0800","dateStarted":"2017-02-19T10:44:55-0800","dateFinished":"2017-02-19T10:45:01-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:41796"},{"text":"%md Before going any further with the Streaming API, let's take a quick look at a receiverless \"advanced\" source example, just so we can look for the similarities and differences.\n\n*Note: this code will not work unless you install a library (.jar). If you want an Uber-jar (i.e., JAR with dependencies), look for spark-streaming-kafka-XXXX\n\nWhere XXXX matches your versions of Spark, Scala, and Kafka. If you're running this with Zeppelin 0.7, that will be Spark 2.1, Scala 2.11, Kafka 0.8\n\n__Even if you don't want to wire up the libs to run this example, let's take a quick look at the code. It's not any more complex than what we have been doing, but with KafkaDirect, we get a ton of features that make our application more robust *and* easier to operate at the same time__\n\n1. Automatic parallelization across Kafka topic partitions\n2. No dedicated receiver cores \n3. Easy recovery in the case of brief failure, because Kafka keeps the records and we can just ask for them again (by offset range)\n * This means we need to track offsets, but don't need to journal the actual data","user":"anonymous","dateUpdated":"2017-02-19T11:25:43-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487529895737_-609077554","id":"20170219-104455_1054520696","dateCreated":"2017-02-19T10:44:55-0800","dateStarted":"2017-02-19T11:25:43-0800","dateFinished":"2017-02-19T11:25:43-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41797","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Before going any further with the Streaming API, let’s take a quick look at a receiverless “advanced” source example, just so we can look for the similarities and differences.
\n
Note: this code will not work unless you install a library (.jar). If you want an uber-JAR (i.e., a JAR with dependencies), look for spark-streaming-kafka-XXXX
\n
Where XXXX matches your versions of Spark, Scala, and Kafka. If you’re running this with Zeppelin 0.7, that will be Spark 2.1, Scala 2.11, Kafka 0.8
\n
Even if you don’t want to wire up the libs to run this example, let’s take a quick look at the code. It’s not any more complex than what we have been doing, but with KafkaDirect, we get a ton of features that make our application more robust and easier to operate at the same time
\n
\n - Automatic parallelization across Kafka topic partitions
\n - No dedicated receiver cores
\n - Easy recovery in the case of brief failure, because Kafka keeps the records and we can just ask for them again (by offset range)\n
\n - This means we need to track offsets, but don’t need to journal the actual data
\n
\n \n
\n
"}]}},{"text":"import _root_.kafka.serializer.StringDecoder\nimport org.apache.spark.streaming._\nimport org.apache.spark.streaming.kafka._\n\nval ssc = new StreamingContext(sc, Seconds(1))\nval kafkaParams = Map(\"metadata.broker.list\" -> \"ec2-54-201-178-163.us-west-2.compute.amazonaws.com:9092\")\nval k = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set(\"purchase\"))\n\nk.count.foreachRDD {\n rdd => println(\"Received \" + rdd.first + \" data records at \" + java.util.Calendar.getInstance.getTimeInMillis)\n}\n\nk.foreachRDD {\n rdd => println(\"Data is \" + rdd.collect.mkString(\", \") + \" at \" + java.util.Calendar.getInstance.getTimeInMillis)\n}\n\nssc.start\nThread sleep 4000\nssc.stop(false)","user":"anonymous","dateUpdated":"2017-02-19T10:50:20-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487530199477_-943963984","id":"20170219-104959_1782267802","dateCreated":"2017-02-19T10:49:59-0800","dateStarted":"2017-02-19T10:50:18-0800","dateFinished":"2017-02-19T10:50:18-0800","status":"ERROR","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:41798"},{"text":"%md The DStream API includes additional operations like reduceByKey, union, and join. \nIf we need a DStream transformation which doesn't exist in the API, we call dstream.transform(...) \nand supply an arbitrary function that takes a RDD (for each batch) and returns a transformed RDD.\n","user":"anonymous","dateUpdated":"2017-02-19T11:25:49-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487530218428_1930525697","id":"20170219-105018_239122416","dateCreated":"2017-02-19T10:50:18-0800","dateStarted":"2017-02-19T11:25:49-0800","dateFinished":"2017-02-19T11:25:50-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41799","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
The DStream API includes additional operations like reduceByKey, union, and join.
If we need a DStream transformation which doesn’t exist in the API, we call dstream.transform(…)
and supply an arbitrary function that takes an RDD (for each batch) and returns a transformed RDD.
\n
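As a minimal sketch (not part of the original notebook), here is `transform` used to join each batch against a static lookup RDD, something the DStream API itself does not offer directly; the lookup table is made up for illustration:

```
import org.apache.spark.streaming._

val ssc = new StreamingContext(sc, Seconds(1))
val stream = ssc.receiverStream(new DemoSource())          // (customerId, speed) pairs

// Hypothetical static lookup table: customer ID -> label
val customerNames = sc.parallelize( (100 to 120).map(id => (id, "customer-" + id)) )

stream.transform { rdd =>
  rdd.join(customerNames)                                  // (customerId, (speed, name)) per batch
}.print()

ssc.start
Thread.sleep(4000)
ssc.stop(false)
```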
"}]}},{"text":"%md #### Stateful Streams\n\nSo far, all of the streams we've used involve computation only on the current batch (in one or more streams).\n\nSome applications require more complex computation, such as calculating moving averages over several batched, or \"sessionization\" -- collected all of the data for a user over an indefinite period of time until some business rule indicates the end of the \"session\" (e.g., user logout or idle time).\n\nTo accommodate these scenarios, Spark has support for two different kinds of stateful streams.","user":"anonymous","dateUpdated":"2017-02-19T11:25:53-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487530336194_-1084581222","id":"20170219-105216_1655620219","dateCreated":"2017-02-19T10:52:16-0800","dateStarted":"2017-02-19T11:25:53-0800","dateFinished":"2017-02-19T11:25:53-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41800","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Stateful Streams
\n
So far, all of the streams we’ve used involve computation only on the current batch (in one or more streams).
\n
Some applications require more complex computation, such as calculating moving averages over several batches, or “sessionization” – collecting all of the data for a user over an indefinite period of time until some business rule indicates the end of the “session” (e.g., user logout or idle time).
\n
To accommodate these scenarios, Spark has support for two different kinds of stateful streams.
\n
"}]}},{"text":"%md __Windowed Streams__\n\n* .window(...)\n* \"andWindow\" methods: reduceByKeyAndWindow etc\n\n
","user":"anonymous","dateUpdated":"2017-02-19T11:25:58-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487530357001_2118884092","id":"20170219-105237_1084108607","dateCreated":"2017-02-19T10:52:37-0800","dateStarted":"2017-02-19T11:25:58-0800","dateFinished":"2017-02-19T11:25:58-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41801","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Windowed Streams
\n
\n - .window(…)
\n - “andWindow” methods: reduceByKeyAndWindow etc
\n
\n

\n
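The next cell demonstrates `.window(...)`; since `reduceByKeyAndWindow` is only mentioned above, here is a small sketch (not in the original notebook) that computes per-customer average speed over a sliding window, assuming the keyed `DemoSource` stream:

```
val ssc = new StreamingContext(sc, Seconds(1))
val keyed = ssc.receiverStream(new DemoSource())            // (customerId, speed)

keyed.mapValues(speed => (speed, 1))
  .reduceByKeyAndWindow(
    (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2),  // sum of speeds, count
    Seconds(4),                                                    // window length
    Seconds(2))                                                    // slide interval
  .mapValues { case (sum, count) => sum.toDouble / count }         // average speed per customer
  .print()

ssc.start
Thread.sleep(12000)
ssc.stop(false)
```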
"}]}},{"text":"val ssc = new StreamingContext(sc, Seconds(1))\nval stream = ssc.receiverStream(new DemoSource()).map(_._2)\n\nstream.window(Seconds(3)).count.foreachRDD {\n rdd => println(\"Received \" + rdd.first + \" data records at \" + java.util.Calendar.getInstance.getTimeInMillis)\n}\n\nstream.window(Seconds(4), Seconds(4)).foreachRDD {\n rdd => println(\"Average speed is \" + rdd.mean + \" at \" + java.util.Calendar.getInstance.getTimeInMillis)\n}\n\nssc.start\nThread.sleep(12000)\nssc.stop(false)","user":"anonymous","dateUpdated":"2017-02-19T10:53:30-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487530369769_1580417878","id":"20170219-105249_1399117207","dateCreated":"2017-02-19T10:52:49-0800","dateStarted":"2017-02-19T10:53:27-0800","dateFinished":"2017-02-19T10:53:40-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:41802"},{"text":"%md __Long-Term Stateful Streams__\n\n* Collect data on per key (user, device, etc.) basis\n* Any type of data\n* Any type of collection structure\n* (Almost) any business rules\n\nupdateStateByKey:","user":"anonymous","dateUpdated":"2017-02-19T11:26:01-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487530378560_-1703413985","id":"20170219-105258_820870918","dateCreated":"2017-02-19T10:52:58-0800","dateStarted":"2017-02-19T11:26:01-0800","dateFinished":"2017-02-19T11:26:01-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41803","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Long-Term Stateful Streams
\n
\n - Collect data on per key (user, device, etc.) basis
\n - Any type of data
\n - Any type of collection structure
\n - (Almost) any business rules
\n
\n
updateStateByKey:
\n
"}]}},{"text":"val ssc = new StreamingContext(sc, Seconds(1))\nval stream = ssc.receiverStream(new DemoSource())\n\nssc.checkpoint(\"/tmp\")\n\nstream.updateStateByKey( (newData:Seq[Int], oldData:Option[Iterable[Any]]) => {\n oldData match {\n case None => Some(newData.toVector)\n case Some(existingData) => Some(existingData ++ newData)\n }\n})\n .foreachRDD {\n rdd => println(\"Time: \" + java.util.Calendar.getInstance.getTimeInMillis + \"\\nData: \" + rdd.collect.mkString(\", \") + \"\\n------------\\n\")\n}\n\nssc.start\nThread.sleep(4000)\nssc.stop(false)","user":"anonymous","dateUpdated":"2017-02-19T10:53:54-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487530424615_-1808789211","id":"20170219-105344_809976093","dateCreated":"2017-02-19T10:53:44-0800","dateStarted":"2017-02-19T10:53:52-0800","dateFinished":"2017-02-19T10:53:58-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:41804"},{"text":"%md Similar, slightly more complex and possibly more performant, call .mapWithState is also available\n\nhttps://docs.cloud.databricks.com/docs/latest/databricks_guide/07%20Spark%20Streaming/12%20Global%20Aggregations%20-%20mapWithState.html","user":"anonymous","dateUpdated":"2017-02-19T11:26:06-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487530432047_1038968906","id":"20170219-105352_615285554","dateCreated":"2017-02-19T10:53:52-0800","dateStarted":"2017-02-19T11:26:06-0800","dateFinished":"2017-02-19T11:26:06-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41805","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":""}]}},{"text":"%md __Combining Legacy Streaming with SQL/DataFrame__\n\nConvert each batch into a DataFrame\n\nThis example is stateless in the sense that the DataFrame is replaced anew for each batch -- simple but limited","user":"anonymous","dateUpdated":"2017-02-19T11:26:10-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487530449222_498796775","id":"20170219-105409_1017960737","dateCreated":"2017-02-19T10:54:09-0800","dateStarted":"2017-02-19T11:26:10-0800","dateFinished":"2017-02-19T11:26:10-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41806","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Combining Legacy Streaming with SQL/DataFrame
\n
Convert each batch into a DataFrame
\n
This example is stateless in the sense that the DataFrame is replaced anew for each batch – simple but limited
\n
"}]}},{"text":"val ssc = new StreamingContext(sc, Seconds(1))\nval stream = ssc.receiverStream(new DemoSource())\n\nstream.foreachRDD {\n rdd => rdd.map( speed => (speed._1, speed._2, java.util.Calendar.getInstance.getTimeInMillis) )\n .toDF(\"customer\", \"speed\", \"processingTime\").createOrReplaceTempView(\"speeds\")\n}\n\nssc.start\nThread.sleep(4000)\nssc.stop(false)\n\nspark.table(\"speeds\").show(100)","user":"anonymous","dateUpdated":"2017-02-19T10:54:37-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487530463590_1984727675","id":"20170219-105423_2111190231","dateCreated":"2017-02-19T10:54:23-0800","dateStarted":"2017-02-19T10:54:31-0800","dateFinished":"2017-02-19T10:54:38-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:41807"},{"text":"%md This approach can be useful; if we want to query over a larger (longer-term) collection of data ... say, cumulative collection of records,\nor hours of records, it usually a good idea to write the data out to another system. Depending on your requirements, this might be a filesystem (HDFS, S3, etc.) or a database that supports queries, like Cassandra.","user":"anonymous","dateUpdated":"2017-02-19T11:26:12-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487530471894_-1099419508","id":"20170219-105431_309429721","dateCreated":"2017-02-19T10:54:31-0800","dateStarted":"2017-02-19T11:26:12-0800","dateFinished":"2017-02-19T11:26:12-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41808","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
This approach can be useful; however, if we want to query over a larger (longer-term) collection of data … say, a cumulative collection of records,
or hours of records, it is usually a good idea to write the data out to another system. Depending on your requirements, this might be a filesystem (HDFS, S3, etc.) or a database that supports queries, like Cassandra.
\n
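For example, here is a sketch (not part of the notebook; the output path is illustrative) that appends each micro-batch to Parquet files, so longer-term queries no longer depend on an in-memory temp view:

```
val ssc = new StreamingContext(sc, Seconds(1))
val stream = ssc.receiverStream(new DemoSource())

stream.foreachRDD { rdd =>
  rdd.map { case (customer, speed) =>
        (customer, speed, java.util.Calendar.getInstance.getTimeInMillis) }
     .toDF("customer", "speed", "processingTime")
     .write.mode("append").parquet("/tmp/speeds-archive")   // illustrative path
}

ssc.start
Thread.sleep(4000)
ssc.stop(false)

// Later, query everything written so far (in practice you'd batch/compact rather than
// write one tiny file per second -- this is only for demonstration)
spark.read.parquet("/tmp/speeds-archive").count
```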
"}]}},{"text":"%md ### What about all of the operational considerations?\n\n#### Fault Tolerance\n* Reliable Receivers\n* Write-Ahead Log\n* Data Checkpointing\n\n#### High Availability\n* Driver Recovery\n* Metadata Checkpointing\n\n#### Exactly-Once End-to-End Message Processing\n* https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html\n* http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/","user":"anonymous","dateUpdated":"2017-02-19T11:26:17-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487530637214_828201485","id":"20170219-105717_551511967","dateCreated":"2017-02-19T10:57:17-0800","dateStarted":"2017-02-19T11:26:17-0800","dateFinished":"2017-02-19T11:26:17-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41809","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
What about all of the operational considerations?
\n
Fault Tolerance
\n
\n - Reliable Receivers
\n - Write-Ahead Log
\n - Data Checkpointing
\n
\n
High Availability
\n
\n - Driver Recovery
\n - Metadata Checkpointing
\n
\n
Exactly-Once End-to-End Message Processing
\n
\n
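As a rough illustration of the metadata-checkpointing and driver-recovery pattern (this is the shape used in a standalone driver program; the checkpoint directory is a placeholder, and the receiver write-ahead log is enabled separately via `spark.streaming.receiver.writeAheadLog.enable` in the SparkConf):

```
import org.apache.spark.streaming._

val checkpointDir = "/tmp/streaming-checkpoint"    // placeholder; use a reliable store (e.g., HDFS) in production

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(sc, Seconds(1))
  ssc.checkpoint(checkpointDir)                    // metadata (and state) checkpointing
  val stream = ssc.receiverStream(new DemoSource())
  stream.count.print()                             // define the DStream graph before returning
  ssc
}

// On restart, rebuild the context from the checkpoint if one exists; otherwise create it fresh.
// (Recovery inside Zeppelin's shared SparkContext is more awkward than in a standalone app.)
val recoverableSsc = StreamingContext.getOrCreate(checkpointDir, createContext _)
```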
"}]}},{"text":"%md #### Hmm...\n\n#### What if we could combine streaming with DataFrames, optimized state tracking for aggregations, and simplified fault tolerance with end-to-end guarantees?\n","user":"anonymous","dateUpdated":"2017-02-19T11:26:21-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487530694708_351390329","id":"20170219-105814_1183750754","dateCreated":"2017-02-19T10:58:14-0800","dateStarted":"2017-02-19T11:26:21-0800","dateFinished":"2017-02-19T11:26:21-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41810","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Hmm…
\nWhat if we could combine streaming with DataFrames, optimized state tracking for aggregations, and simplified fault tolerance with end-to-end guarantees?
\n"}]}},{"text":"%md ### Structured Streaming\n\n* Fault tolerance\n* Available source/sink strategies\n* Incremental query optimization\n\n\"Easiest way to reason about streaming is to ... not reason about streaming\"\n\n* Treat sources as \"append-only\" tables\n* Read and manipulate with DataFrame API\n* Internal resilient state management for aggregations over time\n* Output to sinks","user":"anonymous","dateUpdated":"2017-02-19T11:26:24-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487530819506_-1232636217","id":"20170219-110019_800101349","dateCreated":"2017-02-19T11:00:19-0800","dateStarted":"2017-02-19T11:26:24-0800","dateFinished":"2017-02-19T11:26:24-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41811","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Structured Streaming
\n
\n - Fault tolerance
\n - Available source/sink strategies
\n - Incremental query optimization
\n
\n
“Easiest way to reason about streaming is to … not reason about streaming”
\n
\n - Treat sources as “append-only” tables
\n - Read and manipulate with DataFrame API
\n - Internal resilient state management for aggregations over time
\n - Output to sinks
\n
\n
"}]}},{"text":"%md To keep the environment simple for this class, and save time,\nwe'll use a socket source and a memory sink to demonstrate Structured Streaming. \nIn production, it is essential to use fault-tolerant sources and sinks such as the Kafka source and the filesystem sink.\n","user":"anonymous","dateUpdated":"2017-02-19T11:26:28-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487531153021_220421340","id":"20170219-110553_1684115231","dateCreated":"2017-02-19T11:05:53-0800","dateStarted":"2017-02-19T11:26:28-0800","dateFinished":"2017-02-19T11:26:28-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41812","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
To keep the environment simple for this class, and save time,
we’ll use a socket source and a memory sink to demonstrate Structured Streaming.
In production, it is essential to use fault-tolerant sources and sinks such as the Kafka source and the filesystem sink.
\n
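For reference, a sketch (not runnable in this class setup) of a more production-oriented pipeline: a Kafka source feeding a file sink with a checkpoint location. It assumes the spark-sql-kafka-0-10 package is on the classpath; the broker address, topic, and paths are placeholders:

```
val kafkaLines = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder broker
  .option("subscribe", "wikipedia-edits")              // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING) AS value")

val fileQuery = kafkaLines.writeStream
  .format("parquet")
  .option("path", "/data/edits")                       // placeholder output directory
  .option("checkpointLocation", "/data/edits-checkpoint")
  .outputMode("append")
  .start()
```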
"}]}},{"text":"%sql SET spark.sql.shuffle.partitions = 3 ","user":"anonymous","dateUpdated":"2017-02-19T11:07:44-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"sql","editOnDblClick":true},"editorMode":"ace/mode/sql","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487530861030_-1584019232","id":"20170219-110101_959914080","dateCreated":"2017-02-19T11:01:01-0800","dateStarted":"2017-02-19T11:07:33-0800","dateFinished":"2017-02-19T11:07:34-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:41813"},{"text":"%sql\n","user":"anonymous","dateUpdated":"2017-02-19T11:21:33-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"sql"},"editorMode":"ace/mode/sql"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487532093913_784440802","id":"20170219-112133_1386192119","dateCreated":"2017-02-19T11:21:33-0800","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:41814"},{"text":"val lines = spark.readStream\n .format(\"socket\")\n .option(\"host\", \"54.213.33.240\")\n .option(\"port\", 9002)\n .load()\n \nval edits = lines.select(json_tuple('value, \"channel\", \"timestamp\", \"isRobot\", \"isAnonymous\"))\n .selectExpr(\"c0 as channel\", \"c1 as time\", \"c2 as robot\", \"c3 as anon\")\n\nval query = edits.writeStream\n .queryName(\"demo\")\n .outputMode(\"append\")\n .format(\"memory\")\n .start()","user":"anonymous","dateUpdated":"2017-02-19T11:22:03-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487531005864_177715857","id":"20170219-110325_954460005","dateCreated":"2017-02-19T11:03:25-0800","dateStarted":"2017-02-19T11:22:03-0800","dateFinished":"2017-02-19T11:22:04-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:41815"},{"text":"%sql SELECT * FROM demo","user":"anonymous","dateUpdated":"2017-02-19T11:22:13-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"sql"},"editorMode":"ace/mode/sql"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487531306339_2046686451","id":"20170219-110826_1839713258","dateCreated":"2017-02-19T11:08:26-0800","dateStarted":"2017-02-19T11:22:13-0800","dateFinished":"2017-02-19T11:22:13-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:41816"},{"text":"query.stop","user":"anonymous","dateUpdated":"2017-02-19T11:22:18-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487531325427_684305920","id":"20170219-110845_611901439","dateCreated":"2017-02-19T11:08:45-0800","dateStarted":"2017-02-19T11:22:18-0800","dateFinished":"2017-02-19T11:22:19-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:41817"},{"text":"%md We'll try something a little more practical.\n\n* Interpret the time as SQL timestamp (it was a raw string earlier)\n* Aggregate over a time window\n* Transform the stream by grouping by channel and time, then counting edits\n* Specfiy a trigger 
interval","user":"anonymous","dateUpdated":"2017-02-19T11:26:46-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487531346768_-1722451692","id":"20170219-110906_528567220","dateCreated":"2017-02-19T11:09:06-0800","dateStarted":"2017-02-19T11:26:46-0800","dateFinished":"2017-02-19T11:26:46-0800","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:41818","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
We’ll try something a little more practical.
\n
\n - Interpret the time as SQL timestamp (it was a raw string earlier)
\n - Aggregate over a time window
\n - Transform the stream by grouping by channel and time, then counting edits
\n - Specify a trigger interval
\n
\n
"}]}},{"text":"val lines2 = spark.readStream\n .format(\"socket\")\n .option(\"host\", \"54.213.33.240\")\n .option(\"port\", 9002)\n .load()\n\nval edits2 = lines2\n .select(json_tuple('value, \"channel\", \"timestamp\", \"page\"))\n .selectExpr(\"c0 as channel\", \"cast(c1 as timestamp) as time\", \"c2 as page\")\n .groupBy(window($\"time\", \"10 seconds\"), $\"channel\").count() \n\nval query2 = edits2.writeStream\n .queryName(\"demo2\")\n .outputMode(\"complete\")\n .format(\"memory\")\n .trigger(ProcessingTime(\"10 seconds\"))\n .start()","user":"anonymous","dateUpdated":"2017-02-19T11:23:27-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487531412880_2038727673","id":"20170219-111012_1407364053","dateCreated":"2017-02-19T11:10:12-0800","dateStarted":"2017-02-19T11:23:24-0800","dateFinished":"2017-02-19T11:23:24-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:41819"},{"text":"%sql SELECT * FROM demo2","user":"anonymous","dateUpdated":"2017-02-19T11:23:33-0800","config":{"colWidth":12,"enabled":true,"results":{"0":{"graph":{"mode":"table","height":300,"optionOpen":true,"setting":{"lineChart":{}},"commonSetting":{},"keys":[{"name":"window","index":0,"aggr":"sum"}],"groups":[{"name":"channel","index":1,"aggr":"sum"}],"values":[{"name":"channel","index":1,"aggr":"sum"}]},"helium":{}}},"editorSetting":{"language":"sql"},"editorMode":"ace/mode/sql"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487531430006_1026468641","id":"20170219-111030_1173652410","dateCreated":"2017-02-19T11:10:30-0800","dateStarted":"2017-02-19T11:23:33-0800","dateFinished":"2017-02-19T11:23:33-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:41820"},{"text":"%sql SELECT *, date_format(window.start, \"HH:mm:ss\") as time FROM demo2 ORDER BY time, 
channel","user":"anonymous","dateUpdated":"2017-02-19T11:23:50-0800","config":{"colWidth":12,"enabled":true,"results":{"0":{"graph":{"mode":"lineChart","height":300,"optionOpen":false,"setting":{"lineChart":{}},"commonSetting":{},"keys":[{"name":"time","index":3,"aggr":"sum"}],"groups":[{"name":"channel","index":1,"aggr":"sum"}],"values":[{"name":"count","index":2,"aggr":"sum"}]},"helium":{}}},"editorSetting":{"language":"sql"},"editorMode":"ace/mode/sql"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487531444766_-1933389323","id":"20170219-111044_1995656321","dateCreated":"2017-02-19T11:10:44-0800","dateStarted":"2017-02-19T11:23:50-0800","dateFinished":"2017-02-19T11:23:50-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:41821"},{"text":"query2.stop","user":"anonymous","dateUpdated":"2017-02-19T11:23:58-0800","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487531486182_-1514905113","id":"20170219-111126_1972350781","dateCreated":"2017-02-19T11:11:26-0800","dateStarted":"2017-02-19T11:23:58-0800","dateFinished":"2017-02-19T11:23:58-0800","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:41822"},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1487531645667_-2011769185","id":"20170219-111405_42341523","dateCreated":"2017-02-19T11:14:05-0800","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:41823"}],"name":"Streaming","id":"2C9F93KZA","angularObjects":{"2CAVTNYXD:shared_process":[],"2C7K2429Q:shared_process":[],"2C8SPPQZU:shared_process":[],"2C7UX75H1:shared_process":[],"2C8CZHBMK:shared_process":[],"2C8M1YA6S:shared_process":[],"2C8HBGBZH:shared_process":[],"2C8W1YSQF:shared_process":[],"2CA11FFZW:shared_process":[],"2C8GTCYUP:shared_process":[],"2C7JSZ74W:shared_process":[],"2C7NXAQD7:shared_process":[],"2C8BUSWZY:shared_process":[],"2C8BMRSH6:shared_process":[],"2CAD4S1XP:shared_process":[],"2C8DQ16J7:shared_process":[],"2C9YA18WV:shared_process":[],"2CA315FYH:shared_process":[],"2C915WF4P:shared_process":[]},"config":{"looknfeel":"default","personalizedMode":"false"},"info":{}}
--------------------------------------------------------------------------------