├── Dockerfile
├── README.md
└── notebooks
├── Data Science with Scala.snb
├── Why Spark Notebook.snb
├── WhyScala.md
├── WhyScala.pdf
├── WhyScala.snb
├── airports.json
└── images
├── JavaMemory.jpg
└── TungstenMemory.jpg
/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM andypetrella/spark-notebook-demo:master-2.0.0-preview
2 |
3 | # Data Fellas
4 | MAINTAINER Data Fellas info@data-fellas.guru
5 |
6 | USER root
7 |
8 | ENV HOME /root
9 |
10 | ENV NOTEBOOKS_DIR /root/spark-notebook/notebooks/scala-ds
11 |
12 | ENV ADD_JARS /root/spark-notebook/lib/common.common-0.7.0-SNAPSHOT-scala-2.10.6-spark-2.0.0-preview-hadoop-2.2.0-with-hive-with-parquet.jar
13 |
14 | ADD notebooks /root/spark-notebook/notebooks/scala-ds
15 |
16 | WORKDIR /root/demo-base
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Scala for Data Science
2 |
3 | The enclosed notebooks and other materials are for the [Scala Days 2016](http://www.scaladays.org/) and [Strata London 2016](http://conferences.oreilly.com/strata/hadoop-big-data-eu/public/schedule/detail/49739) talks by [Andy Petrella](mailto:noootsab@data-fellas.guru) and [Dean Wampler](mailto:dean.wampler@lightbend.com) on why Scala is a great language for Data Science.
4 |
5 | The talk includes a notebook for [Spark Notebook](http://spark-notebook.io/), which provides a notebook metaphor for interactive Spark development using Scala. If you aren't familiar with the idea of a notebook interface, think of it as an enhanced REPL that makes it easy to edit and run (or rerun) code, plot results, mix in markdown-based documentation, etc.
6 |
7 | However, if you don't want to go to the trouble of installing and using [Spark Notebook](http://spark-notebook.io/), there are Markdown and PDF versions of the same content in the `notebooks` directory.
8 |
9 | ## Use Existing Docker
10 | A Docker image is available with the Spark Notebook and the current notebooks preinstalled.
11 |
12 | ### Pull it from Docker Hub
13 | ```
14 | docker pull datafellas/scala-for-data-science:1.0-spark2
15 | ```
16 |
17 | ### Run it
18 | ```
19 | docker run --rm -it --net=host -m 8g datafellas/scala-for-data-science:1.0-spark2 bash
20 | ```
21 |
22 | ### Start the services
23 | ```
24 | source start.sh
25 | ```
26 |
27 | ### Use it
28 | On Linux, go to [http://localhost:9000](http://localhost:9000).
29 |
30 | On Mac/Windows, you'll probably have to use the Docker VM's IP address or hostname instead (for example, the address reported by `docker-machine ip` if you use Docker Toolbox).
31 |
32 |
33 | ## Install manually
34 |
35 | Otherwise, install [Spark Notebook](http://spark-notebook.io/), version 0.6.3 or later. You can use either Scala 2.10 or 2.11. In the commands below, we'll assume the root directory of this installation is `/path/to/spark-notebook`. Just use your real path instead. Due to a bug in library path handling, **you must start Spark Notebook from this directory**.
36 |
37 | We'll also use `/path/to/scala-for-data-science` as the path to your local clone of this Git repo. Again, substitute the real path...
38 |
39 | There is one environment variable that you **must** define, `NOTEBOOKS_DIR`. Run the following commands to define this variable and start Spark Notebook.
40 |
41 | For Linux or OSX, use the following:
42 | ```
43 | export NOTEBOOKS_DIR=/path/to/scala-for-data-science/notebooks
44 | cd /path/to/spark-notebook
45 | bin/spark-notebook
46 | ```
47 |
48 | For Windows, use the following:
49 | ```
50 | set NOTEBOOKS_DIR=c:\path\to\scala-for-data-science\notebooks
51 | cd \path\to\spark-notebook
52 | bin\spark-notebook
53 | ```
54 |
55 | Open a browser window to [localhost:9000](http://localhost:9000). Then click the link to open the notebook [WhyScala](http://localhost:9000/notebooks/WhyScala.snb).
56 |
57 | To evaluate all the cells in a notebook, use the _Cell > Run All_ menu item. You can evaluate one cell at a time with the ▶︎ button on the toolbar, or use "shift+return". Both options run the currently-selected cell and advance to the next cell. Note that the notebook copy in the repo includes the output from a run.
58 |
59 | Grab the slides for the rest of the presentation [here](https://docs.google.com/a/data-fellas.guru/presentation/d/1d7vT3mgo4ppHXHtKRQjcVW8SsMs3PeRAkq3PHRgWKaQ/edit?usp=sharing).
60 |
--------------------------------------------------------------------------------
/notebooks/Data Science with Scala.snb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata" : {
3 | "name" : "Data Science with Scala",
4 | "user_save_timestamp" : "1970-01-01T01:00:00.000Z",
5 | "auto_save_timestamp" : "1970-01-01T01:00:00.000Z",
6 | "language_info" : {
7 | "name" : "scala",
8 | "file_extension" : "scala",
9 | "codemirror_mode" : "text/x-scala"
10 | },
11 | "trusted" : true,
12 | "customLocalRepo" : "/tmp/repo",
13 | "customRepos" : [ "spartakus % default % http://dl.bintray.com/spark-clustering-notebook/maven % maven" ],
14 | "customDeps" : [ "com.github.haifengl % smile-core % 1.0.4", "org.deeplearning4j % deeplearning4j-core % 0.4-rc3.9", "org.deeplearning4j % deeplearning4j-nlp % 0.4-rc3.9", "batchstream %% batchstream % 1.0" ],
15 | "customImports" : null,
16 | "customArgs" : null,
17 | "customSparkConf" : null
18 | },
19 | "cells" : [ {
20 | "metadata" : {
21 | "id" : "B0A64FACA3AE41FD8A8CC61106CDB042"
22 | },
23 | "cell_type" : "markdown",
24 | "source" : "# What is available?"
25 | }, {
26 | "metadata" : {
27 | "id" : "88B84167E7E9422C869BA7FF63F07099"
28 | },
29 | "cell_type" : "markdown",
30 | "source" : "## Libraries: Awesome-Scala"
31 | }, {
32 | "metadata" : {
33 | "id" : "E9751047EE5F49CA8E79F8D865F5E7BB"
34 | },
35 | "cell_type" : "markdown",
36 |     "source" : "The _Awesome Scala_ project by [@lauris](https://github.com/lauris) lists \"all\" available Scala libraries. There is an awesome (of course) sublist for Data Science related topics, which you can check [here](https://github.com/lauris/awesome-scala#science-and-data-analysis)."
37 | }, {
38 | "metadata" : {
39 | "id" : "FAAD385D942E4423B869F3D6A4B1EB31"
40 | },
41 | "cell_type" : "markdown",
42 | "source" : "## Models: in Scala/JVM"
43 | }, {
44 | "metadata" : {
45 | "id" : "FBBF02F2BD594437A5A2CF68B01A29A5"
46 | },
47 | "cell_type" : "markdown",
48 |     "source" : "The JVM is getting ready for the new world it is entering, and we can count on its many millions of users (roughly 10M, based on a mix of Wikipedia and other sources) to keep adding new models and improving existing implementations."
49 | }, {
50 | "metadata" : {
51 | "id" : "E93B500E5A174D268D7EF4E86FC30B1E"
52 | },
53 | "cell_type" : "markdown",
54 | "source" : "So we will show a few of them in the following section (help wanted :-D)"
55 | }, {
56 | "metadata" : {
57 | "id" : "FBF9AD0483E6481790E8111314F7D4B8"
58 | },
59 | "cell_type" : "markdown",
60 | "source" : "## Smile [GitHub](https://github.com/haifengl/smile)"
61 | }, {
62 | "metadata" : {
63 | "id" : "4E0845C51EEE484A8CF7C1B606E0AA4E"
64 | },
65 | "cell_type" : "markdown",
66 |     "source" : "Probably the most complete project available in Scala in terms of implementations, with more than 90 (98 at the time of writing) methods/models."
67 | }, {
68 | "metadata" : {
69 | "id" : "091F7CD7443B491C8C482BC3A8D82286"
70 | },
71 | "cell_type" : "markdown",
72 | "source" : "* Classification\n * Support Vector Machines\n * Decision Trees\n * AdaBoost\n * Gradient Boosting\n * Random Forest\n * Logistic Regression\n * Neural Networks\n * RBF Networks\n * Maximum Entropy Classifier\n * Naïve Bayesian\n * Fisher / Linear / Quadratic / Regularized Discriminant Analysis\n* Regression\n * Support Vector Regression\n * Gaussian Process\n * Regression Trees\n * Gradient Boosting\n * Random Forest\n * RBF Networks\n * Linear Regression\n * LASSO\n * Ridge Regression\n* Feature Selection\n * Genetic Algorithm based Feature Selection\n * Ensemble Learning based Feature Selection\n * Signal Noise ratio\n * Sum Squares ratio\n* Dimension Reduction\n * PCA\n * Kernel PCA\n * Probabilistic PCA\n * Generalized Hebbian Algorithm\n * Random Project\n* Model Validation\n * Cross Validation\n * Leave-One-Out Validation\n * Bootstrap\n * Confusion Matrix\n * AUC\n * Fallout\n * FDR\n * F-Score\n * Precision\n * Recall\n * Sensitivity\n * Specificity\n * MSE\n * RMSE\n * RSS\n * Absolute Deviation\n * Rand Index\n * Adjusted Rand Index\n* Clustering\n * BIRCH\n * CLARANS\n * DBScan\n * DENCLUE\n * Deterministic Annealing\n * K-Means\n * X-Means\n * G-Means\n * Neural Gas\n * Growing Neural Gas\n * Hierarchical Clustering\n * Sequential Information Bottleneck\n * Self-Organizing Maps\n * Spectral Clustering\n * Minimum Entropy Clustering\n* Association Rules\n * Frequent Itemset Mining\n * Association Rule Mining\n* Manifold learning\n * IsoMap\n * LLE\n * Laplacian Eigenmap\n* Multi-Dimensional Scaling\n * Classical MDS\n * Isotonic MDS\n * Sammon Mapping\n* Nearest Neighbor Search\n * BK-Tree\n * Cover Tree\n * KD-Tree\n * Locality-Sensitive Hashing\n* Sequence Learning\n * Hidden Markov Model\n * Conditional Random Field\n* Natural Language Processing\n * Sentence Splitter\n * Tokenizer\n * Bigram Statistical Test\n * Phrase Extractor\n * Keyword Extractor\n * Porter Stemmer\n * Lancaster Stemmer\n * POS Tagging\n * Relevance Ranking\n* Interpolation\n * Linear\n * Bilinear\n * Cubic\n * Bicubic\n * Kriging\n * Laplace\n * Shepard\n * RBF\n* Wavelet\n * Discrete Wavelet Transform\n * Wavelet Shrinkage Haar Daubechies D4 Best Localized Wavelet\n * Coiflet\n * Symmlet"
73 | }, {
74 | "metadata" : {
75 | "id" : "8C49D8D0D60B4DD5956FE2DAC8F2808F"
76 | },
77 | "cell_type" : "markdown",
78 |     "source" : "However, this is local only.\n> Haifeng Li (the main author) provides a quick benchmark where Smile outperforms R/Python/Spark/H2O, and somewhat hastily concludes that training the model locally is all we need. This doesn't hold if the data keeps growing, or if we simply want to run ensembles or many algorithms -- a cluster would still be worth considering."
79 | }, {
80 | "metadata" : {
81 | "id" : "3E0F0919691340D3B3A946411500D586"
82 | },
83 | "cell_type" : "markdown",
84 | "source" : "### Example of Maximum Entropy Classifier (Maxent) using Smile"
85 | }, {
86 | "metadata" : {
87 | "id" : "114AA350182B48A0ABDBFBA19C441111"
88 | },
89 | "cell_type" : "markdown",
90 |     "source" : "Maximum entropy is a technique for learning probability distributions from data. \n\nIn maximum entropy models, the observed data itself is assumed to be the testable information. Maximum entropy models don't assume anything about the probability distribution other than what has been observed, and always choose the most uniform distribution subject to the observed constraints."
91 | }, {
92 | "metadata" : {
93 | "id" : "9700BA7AFD894F729780D0F0BB0E8AE3"
94 | },
95 | "cell_type" : "markdown",
96 | "source" : "```scala\ndef maxent(x: Array[Array[Int]], y: Array[Int], p: Int, lambda: Double = 0.1, tol: Double = 1E-5, maxIter: Int = 500): Maxent\n```"
97 | }, {
98 | "metadata" : {
99 | "id" : "446B6E618681440CAA93A50975B4C7C6"
100 | },
101 | "cell_type" : "markdown",
102 |     "source" : "where `x` contains the sparse training samples. Each sample is represented by a set of sparse binary features, stored in an integer array whose entries are the indices of the nonzero features. \n\nThe parameter `p` is the dimension of the feature space, and `lambda` is the regularization factor.\n\nBasically, a maximum entropy classifier is another name for multinomial logistic regression applied to categorical independent variables, which are converted to binary dummy variables. \n\nMaximum entropy models are widely used in natural language processing. Therefore, Smile's implementation assumes that the **binary features** are stored in a sparse array whose entries are the indices of the nonzero features."
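  }, {
    "metadata" : {
      "id" : "A1B2C3D4E5F60718293A4B5C6D7E8F90"
    },
    "cell_type" : "markdown",
    "source" : "As a tiny added sketch (for illustration only, not part of the original example), the next cell shows how one such sparse sample could be encoded; the names `sampleFeatures` and `sampleLabel` are hypothetical."
  }, {
    "metadata" : {
      "trusted" : true,
      "input_collapsed" : false,
      "collapsed" : true,
      "id" : "B1C2D3E4F5A60718293A4B5C6D7E8F91"
    },
    "cell_type" : "code",
    "source" : "// Sketch only: a maxent training sample is encoded as the indices of its nonzero binary features.\nval sampleFeatures: Array[Int] = Array(3, 7, 42) // features 3, 7 and 42 are \"on\" for this sample\nval sampleLabel: Int = 1 // its class label",
    "outputs" : [ ]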
103 | }, {
104 | "metadata" : {
105 | "trusted" : true,
106 | "input_collapsed" : false,
107 | "collapsed" : true,
108 | "id" : "4699144082CF486A804DCED6326066BD"
109 | },
110 | "cell_type" : "code",
111 | "source" : "import sys.process._\nimport scala.language.postfixOps\n\n\"wget https://raw.githubusercontent.com/haifengl/smile/master/shell/src/universal/data/sequence/sparse.hyphen.6.train -O /tmp/sparse.hyphen.6.train \"!!\n\n\"wget https://raw.githubusercontent.com/haifengl/smile/master/shell/src/universal/data/sequence/sparse.hyphen.6.test -O /tmp/sparse.hyphen.6.test \"!!",
112 | "outputs" : [ ]
113 | }, {
114 | "metadata" : {
115 | "trusted" : true,
116 | "input_collapsed" : false,
117 | "collapsed" : true,
118 | "id" : "DD548922D7734331811948F0FF6946BF"
119 | },
120 | "cell_type" : "code",
121 | "source" : "case class SmileDataset(\n x:Array[Array[Int]],\n y:Array[Int],\n p:Int\n)",
122 | "outputs" : [ ]
123 | }, {
124 | "metadata" : {
125 | "trusted" : true,
126 | "input_collapsed" : false,
127 | "collapsed" : true,
128 | "id" : "D76F34DAF0934FA29142026AA82BDBB9"
129 | },
130 | "cell_type" : "code",
131 | "source" : "def load(resource:String):SmileDataset = {\n val xs = scala.collection.mutable.ArrayBuffer.empty[Array[Int]]\n val ys = scala.collection.mutable.ArrayBuffer.empty[Int]\n \n val head :: content = scala.io.Source.fromFile(new java.io.File(resource)).getLines.toList\n \n val Array(nseq, k, p) = head.split(\" \").map(_.trim.toInt)\n \n content.foreach{ line =>\n val seqid :: pos :: len :: featureAndY = line.split(\" \").map(_.trim.toInt).toList\n val (feature, y) = (featureAndY.init, featureAndY.last)\n xs += feature.toArray\n ys += y\n }\n \n SmileDataset(xs.toArray, ys.toArray, p)\n}",
132 | "outputs" : [ ]
133 | }, {
134 | "metadata" : {
135 | "trusted" : true,
136 | "input_collapsed" : false,
137 | "collapsed" : true,
138 | "id" : "9CC0506B62854016892AE78C94D7A9F5"
139 | },
140 | "cell_type" : "code",
141 | "source" : "import smile.classification.Maxent\nval train = load(\"/tmp/sparse.hyphen.6.train\")\nval test = load(\"/tmp/sparse.hyphen.6.test\")\n\nval maxent = new Maxent(train.p, train.x, train.y, 0.1, 1E-5, 500);\n\nval error = (test.x zip test.y).filter{ case (x,y) => maxent.predict(x) != y }.size",
142 | "outputs" : [ ]
143 | }, {
144 | "metadata" : {
145 | "trusted" : true,
146 | "input_collapsed" : false,
147 | "collapsed" : true,
148 | "id" : "0AD1F07DFE6C45A68AE15AF4001C5A18"
149 | },
150 | "cell_type" : "code",
151 | "source" : ":markdown \nHyphen error is $error of ${test.x.size}",
152 | "outputs" : [ ]
153 | }, {
154 | "metadata" : {
155 | "trusted" : true,
156 | "input_collapsed" : false,
157 | "collapsed" : true,
158 | "id" : "3504A2996BBB43128F077F9DC97E588D"
159 | },
160 | "cell_type" : "code",
161 | "source" : ":markdown\nHyphen error rate = ${100.0 * error / test.x.length}",
162 | "outputs" : [ ]
163 | }, {
164 | "metadata" : {
165 | "id" : "7D761878E55148ACBC6545F0E0B80332"
166 | },
167 | "cell_type" : "markdown",
168 | "source" : "## DeepLearning4J [GitHub](https://github.com/deeplearning4j/deeplearning4j)"
169 | }, {
170 | "metadata" : {
171 | "id" : "E43F83384B284FCD86C3B6BF392026B1"
172 | },
173 | "cell_type" : "markdown",
174 |     "source" : "Probably the ultimate library to follow in terms of local optimization (CPU/GPU) and, obviously, for Deep Learning models (both local and distributed, using Spark for instance)."
175 | }, {
176 | "metadata" : {
177 | "id" : "9EFEFD14676049348478B8BB8C029C75"
178 | },
179 | "cell_type" : "markdown",
180 | "source" : "### Example of LSTM"
181 | }, {
182 | "metadata" : {
183 | "trusted" : true,
184 | "input_collapsed" : false,
185 | "collapsed" : true,
186 | "id" : "8B2DEFC72CBB47C2A5080209171F1676"
187 | },
188 | "cell_type" : "code",
189 | "source" : "import org.deeplearning4j.datasets.iterator._\nimport org.deeplearning4j.eval.Evaluation\nimport org.deeplearning4j.models.embeddings.loader.WordVectorSerializer\nimport org.deeplearning4j.models.embeddings.wordvectors.WordVectors\n\nimport org.deeplearning4j.nn.api.OptimizationAlgorithm\nimport org.deeplearning4j.nn.conf._\nimport org.deeplearning4j.nn.conf.layers._\nimport org.deeplearning4j.nn.multilayer.MultiLayerNetwork\nimport org.deeplearning4j.nn.weights.WeightInit\n\nimport org.nd4j.linalg.api.ndarray.INDArray\nimport org.nd4j.linalg.dataset.DataSet\nimport org.nd4j.linalg.lossfunctions.LossFunctions",
190 | "outputs" : [ ]
191 | }, {
192 | "metadata" : {
193 | "id" : "46A4DEB01C024725B8DD7D6F045F1A2D"
194 | },
195 | "cell_type" : "markdown",
196 | "source" : "Using Word2Vec feature space"
197 | }, {
198 | "metadata" : {
199 | "trusted" : true,
200 | "input_collapsed" : false,
201 | "collapsed" : true,
202 | "id" : "458E3AC613CA433A8F1533F97DAEC48E"
203 | },
204 | "cell_type" : "code",
205 |     "source" : "// WORD_VECTORS_PATH is assumed to point to the pre-downloaded Google News word2vec binary\nval wordVectors: WordVectors = WordVectorSerializer.loadGoogleModel(WORD_VECTORS_PATH, true, false)",
206 | "outputs" : [ ]
207 | }, {
208 | "metadata" : {
209 | "id" : "7F7660A14B044A628CB8DE7115BACA23"
210 | },
211 | "cell_type" : "markdown",
212 | "source" : "LSTM: The solution to exploding and vanishing gradients"
213 | }, {
214 | "metadata" : {
215 | "trusted" : true,
216 | "input_collapsed" : false,
217 | "collapsed" : true,
218 | "id" : "B2A8B81473564238836CEC096829FADE"
219 | },
220 | "cell_type" : "code",
221 |     "source" : "// vectorSize is assumed to be the word-vector dimensionality (300 for the Google News model)\nval lstmLayer: GravesLSTM = new GravesLSTM.Builder()\n .nIn(vectorSize)\n .nOut(200) // 200 hidden units\n .activation(\"softsign\")\n .build()",
222 | "outputs" : [ ]
223 | }, {
224 | "metadata" : {
225 | "id" : "7FD8C1AA70E24BDA8E8749AF93AB5545"
226 | },
227 | "cell_type" : "markdown",
228 | "source" : "Output Layer"
229 | }, {
230 | "metadata" : {
231 | "trusted" : true,
232 | "input_collapsed" : false,
233 | "collapsed" : true,
234 | "id" : "D88CDED53ECC4A7590DC676501BFA58C"
235 | },
236 | "cell_type" : "code",
237 | "source" : "val rnnLayer:RnnOutputLayer = new RnnOutputLayer.Builder()\n .activation(\"softmax\")\n .lossFunction(LossFunctions.LossFunction.MCXENT)\n .nIn(200)\n .nOut(2)\n .build()",
238 | "outputs" : [ ]
239 | }, {
240 | "metadata" : {
241 | "id" : "6257296C93C04E7DB5CADE461299DE36"
242 | },
243 | "cell_type" : "markdown",
244 | "source" : "Model"
245 | }, {
246 | "metadata" : {
247 | "trusted" : true,
248 | "input_collapsed" : false,
249 | "collapsed" : true,
250 | "id" : "3A57CF5E0C344AF087F877349F9F41B0"
251 | },
252 | "cell_type" : "code",
253 | "source" : "//Set up network configuration\nval conf = new NeuralNetConfiguration.Builder()\n .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)\n .iterations(1) // 1 iteration per mini-batch\n .updater(Updater.RMSPROP) // How to propagate the \"errors\"\n .regularization(true).l2(1e-5)\n .weightInit(WeightInit.XAVIER)\n .gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue)\n .gradientNormalizationThreshold(1.0)\n .learningRate(0.0018)\n .list()\n .layer(0, lstmLayer)\n .layer(1, rnnLayer)\n .pretrain(false) \n .backprop(true)\n .build()",
254 | "outputs" : [ ]
255 | }, {
256 | "metadata" : {
257 | "trusted" : true,
258 | "input_collapsed" : false,
259 | "collapsed" : true,
260 | "id" : "4473F552C0A642508471FF7D984A9924"
261 | },
262 | "cell_type" : "code",
263 | "source" : "val net = new MultiLayerNetwork(conf)\nnet.init()",
264 | "outputs" : [ ]
265 | }, {
266 | "metadata" : {
267 | "id" : "5C730159197C42C89CE5BBE82C0E6F82"
268 | },
269 | "cell_type" : "markdown",
270 | "source" : "Spark"
271 | }, {
272 | "metadata" : {
273 | "trusted" : true,
274 | "input_collapsed" : false,
275 | "collapsed" : true,
276 | "id" : "C7D7D7F652F043B99834E71974D82220"
277 | },
278 | "cell_type" : "code",
279 | "source" : "import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer\nval sparkNetwork = new SparkDl4jMultiLayer(sparkContext, net)",
280 | "outputs" : [ ]
281 | }, {
282 | "metadata" : {
283 | "id" : "B62E83CBB7C44969A0E13EC42F8D4E9C"
284 | },
285 | "cell_type" : "markdown",
286 | "source" : "Load distributed data"
287 | }, {
288 | "metadata" : {
289 | "trusted" : true,
290 | "input_collapsed" : false,
291 | "collapsed" : true,
292 | "id" : "237B2AA9241D49AA83597DF4C5DABA78"
293 | },
294 | "cell_type" : "code",
295 | "source" : "val rdd = ???",
296 | "outputs" : [ ]
297 | }, {
298 | "metadata" : {
299 | "id" : "3292362830C546C29B03E54F9DEBD6D0"
300 | },
301 | "cell_type" : "markdown",
302 | "source" : "Train on Spark"
303 | }, {
304 | "metadata" : {
305 | "trusted" : true,
306 | "input_collapsed" : false,
307 | "collapsed" : true,
308 | "id" : "118F303E945C4D2DA09873A3BDF0A503"
309 | },
310 | "cell_type" : "code",
311 | "source" : "val trainedNetwork = sparkNetwork.fitDataSet(rdd)",
312 | "outputs" : [ ]
313 | }, {
314 | "metadata" : {
315 | "id" : "2471B6E4297D40458575C054905F548C"
316 | },
317 | "cell_type" : "markdown",
318 | "source" : "## MLlib [Guide](https://spark.apache.org/docs/latest/mllib-guide.html)"
319 | }, {
320 | "metadata" : {
321 | "id" : "CF6CAE27D5A74DA68AB9667A521BE940"
322 | },
323 | "cell_type" : "markdown",
324 |     "source" : "Apache Spark's machine learning library, focused on scalability and distributed datasets."
325 | }, {
326 | "metadata" : {
327 | "id" : "7CBAE40AF08D41ED8098FD53F5BD07C1"
328 | },
329 | "cell_type" : "markdown",
330 |     "source" : "MLlib has more than 20 optimized and distributed methods/models implementations available (at the time of writing)."
331 | }, {
332 | "metadata" : {
333 | "id" : "DDA47801A31A4A9C8915F40C7FA2954C"
334 | },
335 | "cell_type" : "markdown",
336 | "source" : "* Basic statistics\n * summary statistics\n * correlations\n * stratified sampling\n * hypothesis testing\n * streaming significance testing\n * random data generation\n* Classification and regression\n * linear models (SVMs, logistic regression, linear regression)\n * naive Bayes\n * decision trees\n * ensembles of trees (Random Forests and Gradient-Boosted Trees)\n * isotonic regression\n* Collaborative filtering\n * alternating least squares (ALS)\n* Clustering\n * k-means\n * Gaussian mixture\n * power iteration clustering (PIC)\n * latent Dirichlet allocation (LDA)\n * bisecting k-means\n * streaming k-means\n* Dimensionality reduction\n * singular value decomposition (SVD)\n * principal component analysis (PCA)\n* Feature extraction and transformation\n* Frequent pattern mining\n * FP-growth\n * association rules\n * PrefixSpan\n* Evaluation metrics\n* Optimization (developer)\n * stochastic gradient descent\n * limited-memory BFGS (L-BFGS)"
337 | }, {
338 | "metadata" : {
339 | "id" : "2806EB46987A49798EE2A58E251317E8"
340 | },
341 | "cell_type" : "markdown",
342 | "source" : "### Example of Random Forest"
343 | }, {
344 | "metadata" : {
345 | "id" : "AA31064E508843B3832F8B23078A4CDA"
346 | },
347 | "cell_type" : "markdown",
348 |     "source" : "The MLlib guide is really good and presents the models with examples plus their theoretical and practical foundations.\n\nHence, the following Random Forest example is shamelessly stolen from the guide :-)."
349 | }, {
350 | "metadata" : {
351 | "id" : "E084EA0943974E67BE8199AA2DE7DE47"
352 | },
353 | "cell_type" : "markdown",
354 |     "source" : "As usual, we first import the required classes: the model, the algorithm, and a utility class to load predefined data types."
355 | }, {
356 | "metadata" : {
357 | "trusted" : true,
358 | "input_collapsed" : false,
359 | "collapsed" : true,
360 | "id" : "D641028ABE2B47748513B9406574D5E2"
361 | },
362 | "cell_type" : "code",
363 | "source" : "import org.apache.spark.mllib.tree.RandomForest\nimport org.apache.spark.mllib.tree.model.RandomForestModel\nimport org.apache.spark.mllib.util.MLUtils",
364 | "outputs" : [ ]
365 | }, {
366 | "metadata" : {
367 | "id" : "D9EB6B5A237E4B64877E8C7097EC82DB"
368 | },
369 | "cell_type" : "markdown",
370 | "source" : "Download the dataset"
371 | }, {
372 | "metadata" : {
373 | "trusted" : true,
374 | "input_collapsed" : false,
375 | "collapsed" : true,
376 | "id" : "8442F74EAC064B6A8C7FA7540AD7C756"
377 | },
378 | "cell_type" : "code",
379 | "source" : ":sh wget https://raw.githubusercontent.com/apache/spark/master/data/mllib/sample_libsvm_data.txt -O /tmp/sample_libsvm_data.txt",
380 | "outputs" : [ ]
381 | }, {
382 | "metadata" : {
383 | "id" : "50FF68EFD4B44B8983BF132B78B0FCE5"
384 | },
385 | "cell_type" : "markdown",
386 | "source" : "Load and parse the data file."
387 | }, {
388 | "metadata" : {
389 | "trusted" : true,
390 | "input_collapsed" : false,
391 | "collapsed" : true,
392 | "id" : "23E9B4A6AD6040CE96A32B2B7DB76BD4"
393 | },
394 | "cell_type" : "code",
395 | "source" : "val data = MLUtils.loadLibSVMFile(sc, \"/tmp/sample_libsvm_data.txt\")",
396 | "outputs" : [ ]
397 | }, {
398 | "metadata" : {
399 | "id" : "A0602B0895DC460F8B5D7AC660E4C1C5"
400 | },
401 | "cell_type" : "markdown",
402 | "source" : "Split the data into training and test sets (30% held out for testing)"
403 | }, {
404 | "metadata" : {
405 | "trusted" : true,
406 | "input_collapsed" : false,
407 | "collapsed" : true,
408 | "id" : "65FC2F1EA6884FA991ECA28A962FA127"
409 | },
410 | "cell_type" : "code",
411 | "source" : "val splits = data.randomSplit(Array(0.7, 0.3))\nval (trainingData, testData) = (splits(0), splits(1))",
412 | "outputs" : [ ]
413 | }, {
414 | "metadata" : {
415 | "id" : "8A72D49D5A1849DE92CB23BCB60A3C81"
416 | },
417 | "cell_type" : "markdown",
418 | "source" : "Train a RandomForest model.\n\nEmpty categoricalFeaturesInfo indicates all features are continuous."
419 | }, {
420 | "metadata" : {
421 | "trusted" : true,
422 | "input_collapsed" : false,
423 | "collapsed" : true,
424 | "id" : "A8DC8E18B8AC48B3901230FB25D5E180"
425 | },
426 | "cell_type" : "code",
427 | "source" : "val numClasses = 2\nval categoricalFeaturesInfo = Map[Int, Int]()\nval numTrees = 3 // Use more in practice.\nval featureSubsetStrategy = \"auto\" // Let the algorithm choose.\nval impurity = \"gini\"\nval maxDepth = 4\nval maxBins = 32",
428 | "outputs" : [ ]
429 | }, {
430 | "metadata" : {
431 | "trusted" : true,
432 | "input_collapsed" : false,
433 | "collapsed" : true,
434 | "id" : "7A58234C49814CFC80B6609DD12AE88D"
435 | },
436 | "cell_type" : "code",
437 | "source" : "val model = RandomForest.trainClassifier(trainingData, \n numClasses, \n categoricalFeaturesInfo,\n numTrees, \n featureSubsetStrategy, \n impurity, \n maxDepth, \n maxBins)",
438 | "outputs" : [ ]
439 | }, {
440 | "metadata" : {
441 | "id" : "3FC587E4A580450794BA4704EF7A6766"
442 | },
443 | "cell_type" : "markdown",
444 | "source" : "Evaluate model on test instances and compute test error"
445 | }, {
446 | "metadata" : {
447 | "trusted" : true,
448 | "input_collapsed" : false,
449 | "collapsed" : true,
450 | "id" : "C6AF1D4F9B5F421F9F9443E5DF6B812A"
451 | },
452 | "cell_type" : "code",
453 | "source" : "val labelAndPreds = testData.map { point =>\n val prediction = model.predict(point.features)\n (point.label, prediction)\n}",
454 | "outputs" : [ ]
455 | }, {
456 | "metadata" : {
457 | "trusted" : true,
458 | "input_collapsed" : false,
459 | "collapsed" : true,
460 | "id" : "69A330F4C5544F628C744C623AE280C5"
461 | },
462 | "cell_type" : "code",
463 | "source" : "val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()\ntext(\"Test Error = \" + testErr) ++ html( ) ++ text(\"Learned classification forest model:\\n\") ++ html(
{model.toDebugString}
)",
464 | "outputs" : [ ]
465 | }, {
466 | "metadata" : {
467 | "id" : "6EC433E37604464C863DFB3F850F9103"
468 | },
469 | "cell_type" : "markdown",
470 | "source" : "Save and load model"
471 | }, {
472 | "metadata" : {
473 | "trusted" : true,
474 | "input_collapsed" : false,
475 | "collapsed" : true,
476 | "id" : "DE9462E48A9C4D198385A017A8EDA776"
477 | },
478 | "cell_type" : "code",
479 | "source" : "model.save(sc, \"/tmp/myRandomForestClassificationModel\")\nval sameModel = RandomForestModel.load(sc, \"/tmp/myRandomForestClassificationModel\")",
480 | "outputs" : [ ]
481 | }, {
482 | "metadata" : {
483 | "id" : "E4855C423EA64BD18589BF8EEE93670C"
484 | },
485 | "cell_type" : "markdown",
486 | "source" : "## Spark (Online) Clustering [GitHub](https://github.com/Spark-clustering-notebook/)"
487 | }, {
488 | "metadata" : {
489 | "id" : "5486D0036B6340EE88270E8BBDF7F48A"
490 | },
491 | "cell_type" : "markdown",
492 |     "source" : "A project started at the LIPN (a University of Paris 13 lab), led by Mustapha Lebbah's team and focusing on online algorithms (mainly classification) on distributed computing (mainly Spark)."
493 | }, {
494 | "metadata" : {
495 | "id" : "FE8C0729A38546CB83ECF9625E4725EC"
496 | },
497 | "cell_type" : "markdown",
498 | "source" : "### Example G-Stream"
499 | }, {
500 | "metadata" : {
501 | "id" : "897E8ACAB74649338FDA86D5A01B510D"
502 | },
503 | "cell_type" : "markdown",
504 | "source" : "Publications:\n\n1. Mohammed Ghesmoune, Mustapha Lebbah, Hanene Azzag: Micro-Batching Growing Neural Gas for Clustering Data Streams Using Spark Streaming. INNS Conference on Big Data 2015: 158-166.\n\n2. Mohammed Ghesmoune, Mustapha Lebbah, Hanene Azzag, Tarn Duong: Streaming Data Clustering using Spark Streaming: Application to Big-Data of Insurance. KDD 2016 (** Paper under submission **)."
505 | }, {
506 | "metadata" : {
507 | "id" : "A6E0E11FCD2447098FF286A3B767846C"
508 | },
509 | "cell_type" : "markdown",
510 | "source" : "Prepare spark streaming context"
511 | }, {
512 | "metadata" : {
513 | "trusted" : true,
514 | "input_collapsed" : false,
515 | "collapsed" : true,
516 | "id" : "694A2EB24B4D4B5CBEF45FC658145100"
517 | },
518 | "cell_type" : "code",
519 |     "source" : "import org.apache.spark.streaming.{Seconds, StreamingContext, Milliseconds}\n// 'm' is not defined in this excerpt; it is assumed to hold the streaming parameters (e.g. the batch interval in ms)\n@transient val ssc:StreamingContext = {\n StreamingContext.getActive.foreach(_.stop(false))\n new StreamingContext(sparkContext, Milliseconds(m(\"intervalMs\").toInt))\n}",
520 | "outputs" : [ ]
521 | }, {
522 | "metadata" : {
523 | "id" : "384DC3DC983A44548EADF09121029996"
524 | },
525 | "cell_type" : "markdown",
526 | "source" : "Init first data and connect to stream (see https://github.com/Spark-clustering-notebook/coliseum/blob/master/notebooks/coliseum/G-Stream.snb)"
527 | }, {
528 | "metadata" : {
529 | "trusted" : true,
530 | "input_collapsed" : false,
531 | "collapsed" : true,
532 | "id" : "6F14D2AF06164782A886D500EEE1C2ED"
533 | },
534 | "cell_type" : "code",
535 | "source" : "val separator = \" \"\n// 'points2' contains the first two data-points used for initialising the model\n@transient val points2 = sc.textFile(s\"$expDir/data0\").map(x => x.split(separator).map(_.toDouble))\n\n// Create a DStreams that reads batch files from dirData\n@transient val stream = ssc.textFileStream(expDir).map(x => x.split(separator).map(_.toDouble))\n// Create a DStreams that will connect to a socket hostname:port\n//val stream = ssc.socketTextStream(\"localhost\", 9999).map(x => x.split(separator).map(_.toDouble)) //localhost or 10.32.2.153 for Teralab",
536 | "outputs" : [ ]
537 | }, {
538 | "metadata" : {
539 | "id" : "95D4808CE17242E48044C38D76D3AC77"
540 | },
541 | "cell_type" : "markdown",
542 | "source" : "Transform data as feature vectors"
543 | }, {
544 | "metadata" : {
545 | "trusted" : true,
546 | "input_collapsed" : false,
547 | "collapsed" : true,
548 | "id" : "BC36ED5CC523493B83D96470EE159096"
549 | },
550 | "cell_type" : "code",
551 | "source" : "stream.foreachRDD{r => \n val d = r.take(10).map(_.toList.toString)\n datalist.appendAll(d)\n } \nval labId = 2 //TODO: change -1 to -2 when you add the id to the file (last column) //-2 because the last 2 columns represent label & id\nval dim = points2.take(1)(0).size - labId",
552 | "outputs" : [ ]
553 | }, {
554 | "metadata" : {
555 | "id" : "6C07FE40130D4BAA87CB9C7FF9F45AE1"
556 | },
557 | "cell_type" : "markdown",
558 | "source" : "Import G-Stream model"
559 | }, {
560 | "metadata" : {
561 | "trusted" : true,
562 | "input_collapsed" : false,
563 | "collapsed" : true,
564 | "id" : "E8798372293841EB828EF2DCCF0F4479"
565 | },
566 | "cell_type" : "code",
567 | "source" : "import org.lipn.clustering.batchStream.batchStream",
568 | "outputs" : [ ]
569 | }, {
570 | "metadata" : {
571 | "id" : "277F3A7B08D74ED888D549206E7CD6B6"
572 | },
573 | "cell_type" : "markdown",
574 | "source" : "Configure G-Stream"
575 | }, {
576 | "metadata" : {
577 | "trusted" : true,
578 | "input_collapsed" : false,
579 | "collapsed" : true,
580 | "id" : "990EDC710D5A40229303981047872201"
581 | },
582 | "cell_type" : "code",
583 | "source" : "val decayFactor = 0.9\n val lambdaAge = 1.2\n val nbNodesToAdd = 3\n val nbWind = 5\n val DSname = \"dsname\"\n\n@transient var gstream = new batchStream()\n .setDecayFactor(decayFactor)\n .setLambdaAge(lambdaAge)\n .setMaxInsert(nbNodesToAdd)\n\n// converting each point into an object\n@transient val dstreamObj = stream.map( e =>\n gstream.model.pointToObjet(e, dim, labId)\n)",
584 | "outputs" : [ ]
585 | }, {
586 | "metadata" : {
587 | "id" : "923B04DCECB841278D5681F17B3967CA"
588 | },
589 | "cell_type" : "markdown",
590 |     "source" : "Init the model with the first 2 points"
591 | }, {
592 | "metadata" : {
593 | "trusted" : true,
594 | "input_collapsed" : false,
595 | "collapsed" : true,
596 | "id" : "45202B93E2764B1988D53EDC368BB1B4"
597 | },
598 | "cell_type" : "code",
599 | "source" : "// initialization of the model by creating a graph of two nodes (the first 2 data-points)\ngstream.initModelObj(points2, dim)",
600 | "outputs" : [ ]
601 | }, {
602 | "metadata" : {
603 | "id" : "59FB6C24C1AF490C8DF9B3194C975649"
604 | },
605 | "cell_type" : "markdown",
606 | "source" : "Train the model online with new data coming in the `DStream`"
607 | }, {
608 | "metadata" : {
609 | "trusted" : true,
610 | "input_collapsed" : false,
611 | "collapsed" : true,
612 | "id" : "CC9724E021EB4B3787BD16595224274E"
613 | },
614 | "cell_type" : "code",
615 | "source" : "// training on the model\ngstream.trainOnObj(dstreamObj, gstream, outputDir+\"/\"+DSname+\"-\"+nbNodesToAdd, dim, nbWind)",
616 | "outputs" : [ ]
617 | }, {
618 | "metadata" : {
619 | "id" : "0BD046386C0940A995943F4CB978C202"
620 | },
621 | "cell_type" : "markdown",
622 |     "source" : "This will create a new dataset (file) for each batch of data (RDD), containing the new `prototypes` (~ `clusters`), which are linked as a _Self-Organizing Map_."
623 | }, {
624 | "metadata" : {
625 | "id" : "77BBFF186BC94896B20C0B703DA3AF69"
626 | },
627 | "cell_type" : "markdown",
628 | "source" : "# TO BE CONTINUED"
629 | }, {
630 | "metadata" : {
631 | "id" : "A697180C935B4BE99AA332927D2B507E"
632 | },
633 | "cell_type" : "markdown",
634 | "source" : "For instance,\n\n* H2O\n* OptiML (stanford)\n* Figaro (https://github.com/p2t2/figaro)\n* sysml?\n* Factorie (http://factorie.cs.umass.edu/)\n* OscaR (https://bitbucket.org/oscarlib/oscar/wiki/Home)\n* Chalk for NLP (https://github.com/scalanlp/chalk)\n* Bayes Scala (https://github.com/danielkorzekwa/bayes-scala)"
635 | } ],
636 | "nbformat" : 4
637 | }
--------------------------------------------------------------------------------
/notebooks/Why Spark Notebook.snb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "Why Spark Notebook",
4 | "user_save_timestamp": "1970-01-01T01:00:00.000Z",
5 | "auto_save_timestamp": "1970-01-01T01:00:00.000Z",
6 | "language_info": {
7 | "name": "scala",
8 | "file_extension": "scala",
9 | "codemirror_mode": "text/x-scala"
10 | },
11 | "trusted": true,
12 | "customLocalRepo": "/tmp/repo",
13 | "customRepos": null,
14 | "customDeps": [
15 | "org.apache.spark %% spark-streaming-kafka-0-8 % _",
16 | "com.datastax.spark %% spark-cassandra-connector-java % 1.6.0-M1",
17 | "- org.scala-lang % _ % _"
18 | ],
19 | "customImports": null,
20 | "customArgs": null,
21 | "customSparkConf": {
22 | "spark.default.parallelism": "4"
23 | }
24 | },
25 | "cells": [
26 | {
27 | "metadata": {
28 | "id": "7EDB55950A0C4BA585619E38C5D106FB"
29 | },
30 | "cell_type": "markdown",
31 | "source": "# Spark Notebook fills the gap"
32 | },
33 | {
34 | "metadata": {
35 | "id": "FD1F7891FDCA4BF9898E243432930B07"
36 | },
37 | "cell_type": "markdown",
38 |    "source": "The Spark Notebook is an open source notebook focused on productive and enterprise environments.\n\nTo that end, it is based only on JVM components and has no other dependencies.\n\nScala is the only supported language, for the reasons mentioned above, and where the language alone might still lack something for doing data science efficiently, the Spark Notebook is there to help; hereafter you'll see what is available."
39 | },
40 | {
41 | "metadata": {
42 | "id": "9D4B9FB7CE8547AF8E22DEFB2F004560"
43 | },
44 | "cell_type": "markdown",
45 |    "source": "Before getting started, note that the Spark Notebook is also known for \n* its great community, with ~1500 stars on GitHub,\n* its very active [gitter](https://gitter.im/andypetrella/spark-notebook) channel: \n * 540+ participants and \n * 750+ messages per month.\n\n(as of 17 August 2016)"
46 | },
47 | {
48 | "metadata": {
49 | "id": "816B3C0AC858439AB425EE1BBCF3F553"
50 | },
51 | "cell_type": "markdown",
52 | "source": ""
53 | },
54 | {
55 | "metadata": {
56 | "id": "4FB943930C654421BF949217B72BC5CE"
57 | },
58 | "cell_type": "markdown",
59 | "source": "---\n## Skin to control the flow visually"
60 | },
61 | {
62 | "metadata": {
63 | "trusted": true,
64 | "input_collapsed": false,
65 | "collapsed": false,
66 | "id": "E11445B1CCCE465B83A27E374CCF5BED"
67 | },
68 | "cell_type": "code",
69 | "source": ":html\n",
70 | "outputs": []
71 | },
72 | {
73 | "metadata": {
74 | "id": "ADA0BB6570A04F9A832BC2A4A1827441"
75 | },
76 | "cell_type": "markdown",
77 | "source": "---\n## Multiple Spark Contexts"
78 | },
79 | {
80 | "metadata": {
81 | "id": "24F6BEA80211494580BE0C17D62C995E"
82 | },
83 | "cell_type": "markdown",
84 |    "source": "One of the most useful features brought by the Spark Notebook is its separation of the running notebooks.\n\nIndeed, each started notebook spawns a new JVM with its own `SparkSession` instance. This allows maximal flexibility to:\n* manage dependencies without clashes\n* access different clusters\n* tune each notebook differently\n* schedule externally (on the roadmap)"
85 | },
86 | {
87 | "metadata": {
88 | "id": "E092B8282D04469697399B855136522D"
89 | },
90 | "cell_type": "markdown",
91 |    "source": "You can easily recognize the spawned processes using `ps` (*unix* only): search for the main class `ChildProcessMain` and verify that each process contains the name of one of the started notebooks."
92 | },
93 | {
94 | "metadata": {
95 | "trusted": true,
96 | "input_collapsed": false,
97 | "collapsed": false,
98 | "presentation": {
99 | "tabs_state": "{\n \"tab_id\": \"#tab1592853846-0\"\n}",
100 | "pivot_chart_state": "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [],\n \"rows\": [],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}"
101 | },
102 | "id": "8BCC90AE74CF4C00811E64FD82287703"
103 | },
104 | "cell_type": "code",
105 | "source": "import sys.process._\nimport scala.language.postfixOps\n\"ps aux\" #| \"grep ChildProcessMain\" lines_!",
106 | "outputs": []
107 | },
108 | {
109 | "metadata": {
110 | "id": "3638F8A33FBB4FD28408513DCBA49F75"
111 | },
112 | "cell_type": "markdown",
113 | "source": "So this notebook declares the variables `sparkSession` and `sparkContext` (with its alias `sc`)."
114 | },
115 | {
116 | "metadata": {
117 | "trusted": true,
118 | "input_collapsed": false,
119 | "collapsed": false,
120 | "presentation": {
121 | "tabs_state": "{\n \"tab_id\": \"#tab53254654-0\"\n}",
122 | "pivot_chart_state": "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [],\n \"rows\": [],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}"
123 | },
124 | "id": "22D8EFAF7D8C4463A064B8C90F720F63"
125 | },
126 | "cell_type": "code",
127 | "source": "val context = (sparkSession, sparkContext, sc)",
128 | "outputs": []
129 | },
130 | {
131 | "metadata": {
132 | "id": "6D96AEA2A64E48E383FCEF609B635966"
133 | },
134 | "cell_type": "markdown",
135 | "source": "---\n## Metadata"
136 | },
137 | {
138 | "metadata": {
139 | "id": "1B7ACC1DDB8348688F7FF4D7895727B3"
140 | },
141 | "cell_type": "markdown",
142 |    "source": "A notebook has a context enriched via its metadata; here are a few important entries."
143 | },
144 | {
145 | "metadata": {
146 | "id": "22E00CB49C524B559F3D04E871023582"
147 | },
148 | "cell_type": "markdown",
149 | "source": "### Spark Configuration"
150 | },
151 | {
152 | "metadata": {
153 | "id": "E12EF4FE0D6B4BED84C15ADEAA087968"
154 | },
155 | "cell_type": "markdown",
156 |    "source": "The metadata can define a JSON object (String to String!) to declare extra configuration for Spark."
157 | },
158 | {
159 | "metadata": {
160 | "trusted": true,
161 | "input_collapsed": false,
162 | "collapsed": false,
163 | "id": "D96B3008B9E440408F5200F325A4FADA"
164 | },
165 | "cell_type": "code",
166 | "source": ":javascript \nalert(JSON.stringify(IPython.notebook.metadata.customSparkConf, null, 2))",
167 | "outputs": []
168 | },
169 | {
170 | "metadata": {
171 | "trusted": true,
172 | "input_collapsed": false,
173 | "collapsed": false,
174 | "id": "79A7AA0CF09B446B883FB07360684127"
175 | },
176 | "cell_type": "code",
177 | "source": "sparkSession.conf.get(\"spark.default.parallelism\")",
178 | "outputs": []
179 | },
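  {
   "metadata": {
    "id": "0F1E2D3C4B5A69788796A5B4C3D2E1F0"
   },
   "cell_type": "markdown",
   "source": "As a small added sketch (not part of the original notebook), the custom `spark.default.parallelism` above should also be reflected in the context's default parallelism and in the number of partitions of a freshly parallelized collection."
  },
  {
   "metadata": {
    "trusted": true,
    "input_collapsed": false,
    "collapsed": false,
    "id": "1122334455667788990AABBCCDDEEFF0"
   },
   "cell_type": "code",
   "source": "// Sketch: both values below are expected to match the configured parallelism (4)\n(sc.defaultParallelism, sc.parallelize(1 to 100).getNumPartitions)",
   "outputs": []
  },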
180 | {
181 | "metadata": {
182 | "id": "99E5DFFCBA0A4110AE870FE3482667FD"
183 | },
184 | "cell_type": "markdown",
185 | "source": "### Dependencies"
186 | },
187 | {
188 | "metadata": {
189 | "id": "286794277F8848CB8CFF763AD90DBD84"
190 | },
191 | "cell_type": "markdown",
192 | "source": "This notebook has injected a few dependencies from the [datastax cassandra connector](https://github.com/datastax/spark-cassandra-connector/)."
193 | },
194 | {
195 | "metadata": {
196 | "trusted": true,
197 | "input_collapsed": false,
198 | "collapsed": false,
199 | "id": "215600D86BCB41669835C288F4FFC9CB"
200 | },
201 | "cell_type": "code",
202 | "source": ":javascript \nalert(JSON.stringify(IPython.notebook.metadata.customDeps, null, 2))",
203 | "outputs": []
204 | },
205 | {
206 | "metadata": {
207 | "id": "113E0EC89B7E40198DF7AE174D067328"
208 | },
209 | "cell_type": "markdown",
210 | "source": "Hence, this code compiles."
211 | },
212 | {
213 | "metadata": {
214 | "trusted": true,
215 | "input_collapsed": false,
216 | "collapsed": false,
217 | "id": "1BE83A0B7B6744FF83D8513BECC55DB9"
218 | },
219 | "cell_type": "code",
220 | "source": "import com.datastax.spark.connector._ ",
221 | "outputs": []
222 | },
223 | {
224 | "metadata": {
225 | "id": "14FF64D0BFD44D00816058D236828C39"
226 | },
227 | "cell_type": "markdown",
228 |    "source": "It also includes the Kafka external module for the current Scala version (using `%%`) and the current Spark version (using `_`)."
229 | },
230 | {
231 | "metadata": {
232 | "trusted": true,
233 | "input_collapsed": false,
234 | "collapsed": false,
235 | "id": "E5CABC64B14B442D8A824994EC18C4EA"
236 | },
237 | "cell_type": "code",
238 | "source": "import org.apache.spark.streaming.kafka",
239 | "outputs": []
240 | },
241 | {
242 | "metadata": {
243 | "id": "15A133EFC2AF4EE0A3E160E468961066"
244 | },
245 | "cell_type": "markdown",
246 |    "source": "We can also see that we can exclude dependencies by prepending `-` to the definition. This way we avoid downloading any extra libraries from `org.scala-lang` (the Scala language itself)."
247 | },
248 | {
249 | "metadata": {
250 | "id": "A7EABF28DB654A27800804D88C1DC546"
251 | },
252 | "cell_type": "markdown",
253 | "source": "### Change the metadata"
254 | },
255 | {
256 | "metadata": {
257 | "id": "82F81FDC8E9B473D94A7B3479E7BB550"
258 | },
259 | "cell_type": "markdown",
260 |    "source": "There are a few metadata entries available, and you can configure them from the editor via the menu: _Edit > Edit Notebook Metadata_."
261 | },
262 | {
263 | "metadata": {
264 | "trusted": true,
265 | "input_collapsed": false,
266 | "collapsed": false,
267 | "id": "75D9790EC99F4F23ADE7434CE9DC72EC"
268 | },
269 | "cell_type": "code",
270 | "source": ":javascript \nIPython.notebook.edit_metadata()",
271 | "outputs": []
272 | },
273 | {
274 | "metadata": {
275 | "id": "7F3F63814584413E817A6AE5B56C0FAD"
276 | },
277 | "cell_type": "markdown",
278 | "source": "---\n## Logs"
279 | },
280 | {
281 | "metadata": {
282 | "id": "D4B5803CA95D4E6B9497CA86DE6C866F"
283 | },
284 | "cell_type": "markdown",
285 |    "source": "Checking logs is always painful when using a notebook, since it is simply a web client on top of the remote REPL running in the server. \n\nHence the logs are quite far away, or even worse, inaccessible!\n\nSo, the Spark Notebook forwards **all logs going through slf4j** to the browser console → go check it: press `F12` and open the _console_ tab!"
286 | },
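  {
   "metadata": {
    "id": "2233445566778899AABBCCDDEEFF0011"
   },
   "cell_type": "markdown",
   "source": "A quick way to see this in action (sketch added for illustration): log something through slf4j from a cell and watch it show up in the browser console."
  },
  {
   "metadata": {
    "trusted": true,
    "input_collapsed": false,
    "collapsed": false,
    "id": "33445566778899AABBCCDDEEFF001122"
   },
   "cell_type": "code",
   "source": "// Sketch: messages logged via slf4j should be forwarded to the browser console (F12 → console tab)\nval demoLogger = org.slf4j.LoggerFactory.getLogger(\"notebook-demo\")\ndemoLogger.info(\"Hello from a Spark Notebook cell\")",
   "outputs": []
  },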
287 | {
288 | "metadata": {
289 | "id": "A187C1B705A644B794DDEDB33AE7B223"
290 | },
291 | "cell_type": "markdown",
292 | "source": "---\n## Side pane"
293 | },
294 | {
295 | "metadata": {
296 | "id": "9870E3C823B048CC8D558F2B538CC83F"
297 | },
298 | "cell_type": "markdown",
299 |    "source": "In the **View** menu, you can open the side pane, which contains many interesting panels:\n* `terms` : a table listing the defined _functions_, _variables_ and _types_!\n* `error logs` : displays and brings back any errors thrown in the server\n* `chat room` : a fancy chat room available for the current notebook (see below in the synchronized section)"
300 | },
301 | {
302 | "metadata": {
303 | "trusted": true,
304 | "input_collapsed": false,
305 | "collapsed": false,
306 | "id": "90AF0CFA143E4F6283502273EF5DB351"
307 | },
308 | "cell_type": "code",
309 | "source": ":javascript\njQuery('a#toggle-sidebar').click()",
310 | "outputs": []
311 | },
312 | {
313 | "metadata": {
314 | "id": "3F098E2A54C446E486EE544C75C756F9"
315 | },
316 | "cell_type": "markdown",
317 | "source": "---\n## Plotting"
318 | },
319 | {
320 | "metadata": {
321 | "id": "39F23F6B83C6465D8DE2CD5BD18C2C41"
322 | },
323 | "cell_type": "markdown",
324 |    "source": "There are many predefined `Chart`s that you can use directly on any kind of **Scala** container that can be iterated."
325 | },
326 | {
327 | "metadata": {
328 | "id": "099B21AF5E2B468E8C7A65C4DD2AAC66"
329 | },
330 | "cell_type": "markdown",
331 |    "source": "If the last statement of a cell isn't an assignment or a definition, then the Spark Notebook will automatically try to plot it the best way it can."
332 | },
333 | {
334 | "metadata": {
335 | "trusted": true,
336 | "input_collapsed": false,
337 | "collapsed": false,
338 | "id": "3B03DFAD401D44B08D0C8592C5246AF0"
339 | },
340 | "cell_type": "code",
341 | "source": "case class Example(id:Int, category:String, value:Long, advanced:Boolean)\nimport scala.util.Random._\n\nval categories = List.fill(5)(List.fill(10)(nextPrintableChar).mkString)\ndef category:String = shuffle(categories).head\nval examples = List.fill(100)(Example(nextInt(2000), category, nextLong, nextBoolean))",
342 | "outputs": []
343 | },
344 | {
345 | "metadata": {
346 | "id": "9B375D9DBD51403E8BE8E838F1734300"
347 | },
348 | "cell_type": "markdown",
349 |    "source": "The above cell doesn't plot anything since it terminates with an assignment."
350 | },
351 | {
352 | "metadata": {
353 | "trusted": true,
354 | "input_collapsed": false,
355 | "collapsed": true,
356 | "presentation": {
357 | "tabs_state": "{\n \"tab_id\": \"#tab1337288181-1\"\n}",
358 | "pivot_chart_state": "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [\n \"category\"\n ],\n \"rows\": [\n \"id\"\n ],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}"
359 | },
360 | "id": "2E61401D7F6B4DCEA0D5F52DE96FB973"
361 | },
362 | "cell_type": "code",
363 | "source": "examples",
364 | "outputs": []
365 | },
366 | {
367 | "metadata": {
368 | "id": "0B51A8C208074B7C9ABADED8494A61F5"
369 | },
370 | "cell_type": "markdown",
371 |    "source": "Now we have `TableChart` and `PivotChart` tabs for the data, which we can use to get a better feel for the data."
372 | },
373 | {
374 | "metadata": {
375 | "id": "9588CA4E38C74D1C8BD20F16CB218D55"
376 | },
377 | "cell_type": "markdown",
378 | "source": "We can of course create them ourselves:"
379 | },
380 | {
381 | "metadata": {
382 | "trusted": true,
383 | "input_collapsed": false,
384 | "collapsed": false,
385 | "presentation": {
386 | "pivot_chart_state": "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [],\n \"rows\": [],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}"
387 | },
388 | "id": "595E9615FC9A410AA16F64B33FF80912"
389 | },
390 | "cell_type": "code",
391 | "source": "TableChart(examples)",
392 | "outputs": []
393 | },
394 | {
395 | "metadata": {
396 | "id": "670BE2E3168D47298CC6B17A30A4FF41"
397 | },
398 | "cell_type": "markdown",
399 | "source": "### Grouping plots"
400 | },
401 | {
402 | "metadata": {
403 | "id": "D2895E0887E14A588731F1FAB389B875"
404 | },
405 | "cell_type": "markdown",
406 |    "source": "Among the available charts, you have for instance the pretty common ones like:\n* `LineChart`\n* `ScatterChart`\n* `BarChart`\n\nThese accept at least two other parameters: \n* `fields`: the two field names to plot \n* `groupField`: the field used to group the data"
407 | },
408 | {
409 | "metadata": {
410 | "trusted": true,
411 | "input_collapsed": false,
412 | "collapsed": false,
413 | "id": "A993AFEA7F2945A184B760EE6251EBAE"
414 | },
415 | "cell_type": "code",
416 | "source": "LineChart(examples, fields=Some((\"id\", \"value\")), groupField=Some(\"advanced\"))",
417 | "outputs": []
418 | },
419 | {
420 | "metadata": {
421 | "trusted": true,
422 | "input_collapsed": false,
423 | "collapsed": false,
424 | "id": "A993AFEA7F2945A184B760EE6251EBAE"
425 | },
426 | "cell_type": "code",
427 | "source": "ScatterChart(examples, fields=Some((\"id\", \"value\")), groupField=Some(\"advanced\"))",
428 | "outputs": []
429 | },
430 | {
431 | "metadata": {
432 | "trusted": true,
433 | "input_collapsed": false,
434 | "collapsed": false,
435 | "id": "A993AFEA7F2945A184B760EE6251EBAE"
436 | },
437 | "cell_type": "code",
438 | "source": "BarChart(examples, fields=Some((\"id\", \"value\")), groupField=Some(\"advanced\"))",
439 | "outputs": []
440 | },
441 | {
442 | "metadata": {
443 | "id": "4153FEFD93464E86A0F7FF113FB1139F"
444 | },
445 | "cell_type": "markdown",
446 | "source": "---\n## Graphs"
447 | },
448 | {
449 | "metadata": {
450 | "id": "F91D80C4BFD0408DA915051BC3AEE740"
451 | },
452 | "cell_type": "markdown",
453 |    "source": "A graph is a common way to represent data where connections matter. Hence the Spark Notebook defines an API easing the definition of `Node`s and `Edge`s.\n\n* `Graph[T]`: abstract class defining a graph component with an id of type `T`, a value of type `Any` and a color\n* `Node[T]`: defines a node as a circle, for which a radius and a position ($x$, $y$) can be specified (initial, or static if the node is fixed)\n* `Edge[T]`: defines an edge using the ids of both ends"
454 | },
455 | {
456 | "metadata": {
457 | "trusted": true,
458 | "input_collapsed": false,
459 | "collapsed": false,
460 | "id": "5B96F1C40FA74E55ADBB6071A1B2FB0C"
461 | },
462 | "cell_type": "code",
463 | "source": "case class GraphExample(id:Int, cluster:Char, value:Long)\nval clusters = ('A' to 'D').toList\nval cluCol = clusters.zip(List(\"#000\", \"#478\", \"#127\", \"#984\", \"#F5A\")).toMap\nval gexamples = List.tabulate(10, 4)((i,j) => GraphExample(i*4+j, clusters(j), nextLong)).flatten\n\nval nodes = gexamples.map(e => notebook.front.widgets.magic.Node(e.id, e, cluCol(e.cluster), 5))\n\nval clustered = gexamples.groupBy(_.cluster).toList\nval connectedClusters = clustered.flatMap { case (c, cl) => \n for {\n a <- cl\n b <- cl if a != b\n } yield notebook.front.widgets.magic.Edge[Int](400+nextInt(400)+a.id+b.id, (a.id, b.id), \"intra\", \"red\")\n }",
464 | "outputs": []
465 | },
466 | {
467 | "metadata": {
468 | "trusted": true,
469 | "input_collapsed": false,
470 | "collapsed": false,
471 | "id": "E31A5ACA4F8D442383289F17C620C5B4"
472 | },
473 | "cell_type": "code",
474 | "source": "val singleConnections = {\n val s = gexamples.take(4)\n \n for (a <- s; b <- s if a != b) \n yield Edge(800+nextInt(400)+a.id+b.id, (a.id, b.id), \"inter\", \"green\")\n}\n\nval all = nodes ::: connectedClusters ::: singleConnections",
475 | "outputs": []
476 | },
477 | {
478 | "metadata": {
479 | "trusted": true,
480 | "input_collapsed": false,
481 | "collapsed": false,
482 | "presentation": {
483 | "tabs_state": "{\n \"tab_id\": \"#tab1249479791-0\"\n}",
484 | "pivot_chart_state": "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [],\n \"rows\": [],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}"
485 | },
486 | "id": "1F073EEB99CE4B1781AD74C59E208805"
487 | },
488 | "cell_type": "code",
489 | "source": "GraphChart(all, maxPoints = 1000, sizes=(600, 600))",
490 | "outputs": []
491 | },
492 | {
493 | "metadata": {
494 | "id": "E14D74FA3B164544A79D22800BB84BDD"
495 | },
496 | "cell_type": "markdown",
497 | "source": "---\n## Geo charts"
498 | },
499 | {
500 | "metadata": {
501 | "id": "E78EF394884248DF87A887AFE627B13F"
502 | },
503 | "cell_type": "markdown",
504 |    "source": "There are two types of geo charts:\n* `GeoPointsChart` for simple lat/long points\n* `GeoChart` for _GeoJSON_ or _opengis_ data\n"
505 | },
506 | {
507 | "metadata": {
508 | "id": "41C5ABBEC5754FC58AD56ADD7E799E5C"
509 | },
510 | "cell_type": "markdown",
511 | "source": "### GeoPointsChart"
512 | },
513 | {
514 | "metadata": {
515 | "id": "BB38DC199B5E4DAA9ECFFBF4FBD9B3AB"
516 | },
517 | "cell_type": "markdown",
518 | "source": "Let's load some airports data with latitude and longitude coordinates"
519 | },
520 | {
521 | "metadata": {
522 | "trusted": true,
523 | "input_collapsed": false,
524 | "collapsed": false,
525 | "id": "B23EDAF0D8204B7D845248EE24E20F30"
526 | },
527 | "cell_type": "code",
528 | "source": "val root = sys.env(\"NOTEBOOKS_DIR\")\nval airportsDF = sparkSession.read.json(s\"$root/notebooks/airports.json\")\nairportsDF.cache\nairportsDF",
529 | "outputs": []
530 | },
531 | {
532 | "metadata": {
533 | "trusted": true,
534 | "input_collapsed": false,
535 | "collapsed": false,
536 | "presentation": {
537 | "tabs_state": "{\n \"tab_id\": \"#tab1529529486-0\"\n}",
538 | "pivot_chart_state": "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [],\n \"rows\": [],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}"
539 | },
540 | "id": "1707B838D5494D1F8476AADCA120EFC5"
541 | },
542 | "cell_type": "code",
543 | "source": "val statsDF = airportsDF.groupBy(\"state\").count.orderBy($\"count\".desc).limit(5)",
544 | "outputs": []
545 | },
546 | {
547 | "metadata": {
548 | "id": "0A7F973B66494A349A505FAE9A4454ED"
549 | },
550 | "cell_type": "markdown",
551 | "source": "Convert to Dataset (for the fun)."
552 | },
553 | {
554 | "metadata": {
555 | "trusted": true,
556 | "input_collapsed": false,
557 | "collapsed": false,
558 | "id": "88DD55147E984B7F98349BFA935C2572"
559 | },
560 | "cell_type": "code",
561 | "source": "case class StateStat(state:String, count:Long)",
562 | "outputs": []
563 | },
564 | {
565 | "metadata": {
566 | "trusted": true,
567 | "input_collapsed": false,
568 | "collapsed": false,
569 | "id": "575DBAFD9887415EB6BBD0D1D63B8194"
570 | },
571 | "cell_type": "code",
572 | "source": "statsDF.as[StateStat]",
573 | "outputs": []
574 | },
575 | {
576 | "metadata": {
577 | "id": "335F1C282CE94D4B88DD6A2D3D5C3651"
578 | },
579 | "cell_type": "markdown",
580 | "source": "Plot the dataframe with dedicated colors for each state"
581 | },
582 | {
583 | "metadata": {
584 | "trusted": true,
585 | "input_collapsed": false,
586 | "collapsed": false,
587 | "id": "FF8F94DA1063463A8A38978BABA5A782"
588 | },
589 | "cell_type": "code",
590 | "source": "import org.apache.spark.sql.functions._",
591 | "outputs": []
592 | },
593 | {
594 | "metadata": {
595 | "trusted": true,
596 | "input_collapsed": false,
597 | "collapsed": false,
598 | "id": "6B6635CEBAC349EC87BAC809E5F21F78"
599 | },
600 | "cell_type": "code",
601 | "source": "def forStates[A](xs:List[A]) = when($\"state\" === \"AK\", xs(0))\n .when($\"state\" === \"TX\", xs(1))\n .when($\"state\" === \"CA\", xs(2))\n .when($\"state\" === \"OK\", xs(3))\n .when($\"state\" === \"OH\", xs(4))\n .otherwise(xs(5))\nval airportsDFWithStyles = airportsDF.withColumn(\"r\", forStates(List(10,9,8,7,6,1)))\n .withColumn(\"c\", forStates(List(\"red\",\"orange\",\"blue\",\"green\",\"yellow\",\"white\")))\nGeoPointsChart(airportsDFWithStyles, latLonFields=Some((\"lat\", \"long\")), rField = Some(\"r\"), colorField = Some(\"c\"))",
602 | "outputs": []
603 | },
604 | {
605 | "metadata": {
606 | "id": "B4273E6F1E8A4ADA8866AB9A067533E4"
607 | },
608 | "cell_type": "markdown",
609 | "source": "---\n## GeoChart"
610 | },
611 | {
612 | "metadata": {
613 | "id": "0E6BA873EB35476B8A6CC133F42ACDF4"
614 | },
615 | "cell_type": "markdown",
616 | "source": "Fetch some data on the web about parks and gardens"
617 | },
618 | {
619 | "metadata": {
620 | "trusted": true,
621 | "input_collapsed": false,
622 | "collapsed": false,
623 | "id": "ACD54EC9C7414EE98C568AD260A377ED"
624 | },
625 | "cell_type": "code",
626 | "source": ":sh wget http://data.cyc.opendata.arcgis.com/datasets/57fa576e5e8149b0b744f768e01e5ce1_0.geojson -O Parks_and_Gardens.geojson",
627 | "outputs": []
628 | },
629 | {
630 | "metadata": {
631 | "id": "7C6FD417F57A47098A685A3E4DDC721C"
632 | },
633 | "cell_type": "markdown",
634 | "source": "Parse it as GeoJSON using provided `widgets.parseGeoJSON`"
635 | },
636 | {
637 | "metadata": {
638 | "trusted": true,
639 | "input_collapsed": false,
640 | "collapsed": false,
641 | "id": "750EC1E117C24C28851D00C89298F5A8"
642 | },
643 | "cell_type": "code",
644 | "source": "val geoJSONRepr = widgets.parseGeoJSON(scala.io.Source.fromFile(\"Parks_and_Gardens.geojson\").getLines.mkString(\"\"))",
645 | "outputs": []
646 | },
647 | {
648 | "metadata": {
649 | "id": "A4544FA65C0F480EB4FFAF28E27F5DBB"
650 | },
651 | "cell_type": "markdown",
652 | "source": "Fetch some more vectorial information of the same area."
653 | },
654 | {
655 | "metadata": {
656 | "trusted": true,
657 | "input_collapsed": false,
658 | "collapsed": false,
659 | "id": "012789D6569C4DC6A15579070B0B17B6"
660 | },
661 | "cell_type": "code",
662 | "source": ":sh wget http://data.cyc.opendata.arcgis.com/datasets/9b212b7af275438ca9088ff868bda139_9.geojson -O airqual.geojson",
663 | "outputs": []
664 | },
665 | {
666 | "metadata": {
667 | "id": "86A9DEA8BDD843B08DC2985638799232"
668 | },
669 | "cell_type": "markdown",
670 | "source": "And parse it..."
671 | },
672 | {
673 | "metadata": {
674 | "trusted": true,
675 | "input_collapsed": false,
676 | "collapsed": false,
677 | "id": "E5314FF973454E0B96877C2A6690B914"
678 | },
679 | "cell_type": "code",
680 | "source": "val ng = widgets.parseGeoJSON(scala.io.Source.fromFile(\"airqual.geojson\").getLines.mkString(\"\"))",
681 | "outputs": []
682 | },
683 | {
684 | "metadata": {
685 | "id": "9F1F3B348DEA40BFA087795342006AEE"
686 | },
687 | "cell_type": "markdown",
688 | "source": "Create a `GeoChart` instance on `GeoJSON` representation of the first dataset."
689 | },
690 | {
691 | "metadata": {
692 | "trusted": true,
693 | "input_collapsed": false,
694 | "collapsed": false,
695 | "id": "273E8A681B4942838D2E0DB37C215F9E"
696 | },
697 | "cell_type": "code",
698 | "source": "val gc = GeoChart(Seq(geoJSONRepr), sizes=(800, 800))\ngc",
699 | "outputs": []
700 | },
701 | {
702 | "metadata": {
703 | "id": "898788F161C94A898F20FD85328526C8"
704 | },
705 | "cell_type": "markdown",
706 | "source": "We can now add the linear features into the same chart using the helpful function `addAndApply` which adds information to the existing chart."
707 | },
708 | {
709 | "metadata": {
710 | "trusted": true,
711 | "input_collapsed": false,
712 | "collapsed": false,
713 | "id": "1E08A11C501C4E618FDFD7E97F340D79"
714 | },
715 | "cell_type": "code",
716 | "source": "gc.addAndApply(Seq(ng))",
717 | "outputs": []
718 | },
719 | {
720 | "metadata": {
721 | "id": "2DDE82591B3E4DE1B09C71CC9292E148"
722 | },
723 | "cell_type": "markdown",
724 | "source": "---\n## Fancy charts"
725 | },
726 | {
727 | "metadata": {
728 | "id": "415502CC4E5E44D7BF7F08978A4C80BD"
729 | },
730 | "cell_type": "markdown",
731 | "source": "### Radar"
732 | },
733 | {
734 | "metadata": {
735 | "id": "F164A3ACA35A46638B4E82313A59E88E"
736 | },
737 | "cell_type": "markdown",
738 | "source": "Let's grab some data from http://www.basketball-reference.com/teams/SAS/2016.html (31st May 2016)."
739 | },
740 | {
741 | "metadata": {
742 | "trusted": true,
743 | "input_collapsed": false,
744 | "collapsed": false,
745 | "id": "DA236FFF7873453397F47AED00A55C87"
746 | },
747 | "cell_type": "code",
748 | "source": "case class TeamMember(Player:String, Age:Int, FG_pc:Double, _3P_pc:Double, _2P_pc:Double, eFG_pc:Double, FT_pc:Double)\nval team = \n s\"\"\"\n 1\tKawhi Leonard\t24\t72\t72\t2380\t551\t1090\t.506\t129\t291\t.443\t422\t799\t.528\t.565\t292\t334\t.874\t95\t398\t493\t186\t128\t71\t105\t133\t1523\n 2\tLaMarcus Aldridge\t30\t74\t74\t2261\t536\t1045\t.513\t0\t16\t.000\t536\t1029\t.521\t.513\t259\t302\t.858\t176\t456\t632\t110\t38\t81\t99\t151\t1331\n 3\tDanny Green\t28\t79\t79\t2062\t211\t561\t.376\t116\t349\t.332\t95\t212\t.448\t.480\t34\t46\t.739\t48\t255\t303\t141\t79\t64\t75\t141\t572\n 4\tTony Parker\t33\t72\t72\t1980\t350\t710\t.493\t27\t65\t.415\t323\t645\t.501\t.512\t130\t171\t.760\t17\t159\t176\t379\t54\t11\t131\t114\t857\n 5\tPatrick Mills\t27\t81\t3\t1662\t260\t612\t.425\t123\t320\t.384\t137\t292\t.469\t.525\t47\t58\t.810\t27\t131\t158\t226\t59\t6\t76\t102\t690\n 6\tTim Duncan\t39\t61\t60\t1536\t215\t441\t.488\t0\t2\t.000\t215\t439\t.490\t.488\t92\t131\t.702\t115\t332\t447\t163\t47\t78\t90\t125\t522\n 7\tDavid West\t35\t78\t19\t1404\t244\t448\t.545\t3\t7\t.429\t241\t441\t.546\t.548\t63\t80\t.788\t72\t237\t309\t143\t44\t55\t68\t142\t554\n 8\tBoris Diaw\t33\t76\t4\t1386\t202\t383\t.527\t25\t69\t.362\t177\t314\t.564\t.560\t56\t76\t.737\t58\t175\t233\t176\t26\t21\t97\t102\t485\n 9\tKyle Anderson\t22\t78\t11\t1245\t138\t295\t.468\t12\t37\t.324\t126\t258\t.488\t.488\t62\t83\t.747\t25\t219\t244\t123\t60\t29\t59\t97\t350\n 10\tManu Ginobili\t38\t58\t0\t1134\t197\t435\t.453\t70\t179\t.391\t127\t256\t.496\t.533\t91\t112\t.813\t26\t120\t146\t177\t66\t11\t99\t99\t555\n 11\tJonathon Simmons\t26\t55\t2\t813\t122\t242\t.504\t18\t47\t.383\t104\t195\t.533\t.541\t69\t92\t.750\t16\t80\t96\t58\t24\t5\t53\t103\t331\n 12\tBoban Marjanovic\t27\t54\t4\t508\t105\t174\t.603\t0\t0\t.0\t105\t174\t.603\t.603\t87\t114\t.763\t73\t121\t194\t21\t12\t23\t29\t54\t297\n 13\tRasual Butler\t36\t46\t0\t432\t49\t104\t.471\t15\t49\t.306\t34\t55\t.618\t.543\t11\t16\t.688\t3\t53\t56\t24\t13\t23\t8\t11\t124\n 14\tKevin Martin\t32\t16\t1\t261\t30\t85\t.353\t11\t33\t.333\t19\t52\t.365\t.418\t28\t30\t.933\t4\t25\t29\t12\t9\t2\t13\t15\t99\n 15\tRay McCallum\t24\t31\t3\t256\t27\t67\t.403\t5\t16\t.313\t22\t51\t.431\t.440\t9\t10\t.900\t6\t25\t31\t33\t5\t4\t11\t14\t68\n 16\tMatt Bonner\t35\t30\t2\t206\t29\t57\t.509\t15\t34\t.441\t14\t23\t.609\t.640\t3\t4\t.750\t3\t24\t27\t9\t6\t1\t3\t16\t76\n 17\tAndre Miller\t39\t13\t4\t181\t23\t48\t.479\t1\t4\t.250\t22\t44\t.500\t.490\t9\t13\t.692\t6\t21\t27\t29\t7\t0\t12\t14\t56\n \"\"\".trim.split(\"\\n\").map(s => s.trim.split(\"\\t\").drop(1).toList).map(x => (x.head, x(1).trim.toInt) → x.drop(2).filter(_.startsWith(\".\"))\n .map(_.trim.toDouble * 100)).map { case ((p, a), stats) =>\n TeamMember(p, a, stats(0), stats(1), stats(2), stats(3), stats(4))\n }",
749 | "outputs": []
750 | },
751 | {
752 | "metadata": {
753 | "trusted": true,
754 | "input_collapsed": false,
755 | "collapsed": false,
756 | "id": "CAC0438C302A4C5388A9B71E5C246FD8"
757 | },
758 | "cell_type": "code",
759 | "source": "RadarChart(shuffle(team.toList).take(5), labelField=Some(\"Player\"), sizes=(800, 600))",
760 | "outputs": []
761 | },
762 | {
763 | "metadata": {
764 | "id": "75EF85B5D6534BCF873488D9D092F54A"
765 | },
766 | "cell_type": "markdown",
767 | "source": "### Pivot"
768 | },
769 | {
770 | "metadata": {
771 | "trusted": true,
772 | "input_collapsed": false,
773 | "collapsed": false,
774 | "presentation": {
775 | "pivot_chart_state": "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [\n \"_2P_pc\",\n \"_3P_pc\"\n ],\n \"rows\": [\n \"Player\"\n ],\n \"vals\": [\n \"_3P_pc\"\n ],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Sum\",\n \"rendererName\": \"Line Chart\"\n}"
776 | },
777 | "id": "259688ADDF334ACE85203BF5AE070ED7"
778 | },
779 | "cell_type": "code",
780 | "source": "PivotChart(team)",
781 | "outputs": []
782 | },
783 | {
784 | "metadata": {
785 | "id": "77F0A0B2C4AB4A5B8AB689EB14AF40D8"
786 | },
787 | "cell_type": "markdown",
788 | "source": "### Parallel coordinates"
789 | },
790 | {
791 | "metadata": {
792 | "trusted": true,
793 | "input_collapsed": false,
794 | "collapsed": false,
795 | "id": "536D63D7B3A64281BD62B95E7FBB4E76"
796 | },
797 | "cell_type": "code",
798 | "source": "ParallelCoordChart(team, sizes=(800, 500))",
799 | "outputs": []
800 | },
801 | {
802 | "metadata": {
803 | "id": "CD0C49DF563340A38E82471242EF4F32"
804 | },
805 | "cell_type": "markdown",
806 | "source": "### Timeseries"
807 | },
808 | {
809 | "metadata": {
810 | "trusted": true,
811 | "input_collapsed": false,
812 | "collapsed": false,
813 | "id": "76CC0E2ED70D458689FF7F75297D8E77"
814 | },
815 | "cell_type": "code",
816 | "source": ":sh wget http://www.ncdc.noaa.gov/cag/time-series/global/globe/land_ocean/p12/12/1880-2015.csv -O /tmp/1880-2015.csv",
817 | "outputs": []
818 | },
819 | {
820 | "metadata": {
821 | "trusted": true,
822 | "input_collapsed": false,
823 | "collapsed": false,
824 | "id": "52DBD8124E374C25A4F336A6A26D245C"
825 | },
826 | "cell_type": "code",
827 | "source": "import java.util.Calendar\nimport java.util.Calendar._\nval cal = Calendar.getInstance\ncal.set(DAY_OF_MONTH, 0)\ncal.set(HOUR, 0)\ncal.set(MINUTE, 0)\ncal.set(SECOND, 0)\ncal.set(MILLISECOND, 0)\nval ts = scala.io.Source.fromFile(new File(\"/tmp/1880-2015.csv\")).getLines.drop(4)\n .map(_.split(\",\").toList.map(_.trim))\n .map{case List(y,c) => (y.take(4).toInt, y.drop(4).take(2).dropWhile(_ == '0').toInt-1, c.toDouble)}\n .map{ case (y, m, c) => \n cal.set(YEAR, y)\n cal.set(MONTH, m)\n cal.getTime → c\n }.toList",
828 | "outputs": []
829 | },
830 | {
831 | "metadata": {
832 | "trusted": true,
833 | "input_collapsed": false,
834 | "collapsed": false,
835 | "id": "2B91343B09F041B19FD683E0C04F99D8"
836 | },
837 | "cell_type": "code",
838 | "source": "val tc = TimeseriesChart(ts)\ntc",
839 | "outputs": []
840 | },
841 | {
842 | "metadata": {
843 | "id": "7C30152109AD4727827599D6949592F0"
844 | },
845 | "cell_type": "markdown",
846 | "source": "---\n## Everything is Dynamic and Reactive"
847 | },
848 | {
849 | "metadata": {
850 | "id": "3AADA3697ABE44BF85FFB47209E06036"
851 | },
852 | "cell_type": "markdown",
853 | "source": "Since data can come live in a system or you want to log vizualy some events or perhaps you need to have two visual components to interact... what you don't want to do is to write the html, js, server code and who knows what else you'll need to master...\n\nFor that, the spark notebook comes with dynamicity of charts and most (if not all) components can be listened and react to events."
854 | },
855 | {
856 | "metadata": {
857 | "id": "83B7B4F029324AADA94E34497C395F35"
858 | },
859 | "cell_type": "markdown",
860 | "source": "### Dynamic Line Chart"
861 | },
862 | {
863 | "metadata": {
864 | "trusted": true,
865 | "input_collapsed": false,
866 | "collapsed": false,
867 | "id": "CE563B405582434AAFA53E4C4AE8C893"
868 | },
869 | "cell_type": "code",
870 | "source": "val tsH :: tsS = ts.sliding(100, 100).toList\nval dynTC = TimeseriesChart(tsH, maxPoints = ts.size)\ndynTC",
871 | "outputs": []
872 | },
873 | {
874 | "metadata": {
875 | "trusted": true,
876 | "input_collapsed": false,
877 | "collapsed": true,
878 | "id": "FA0E02EBBD7D467385D7C17B4708DFBB"
879 | },
880 | "cell_type": "code",
881 | "source": "var cont = true\nnew Thread() {\n override def run = \n tsS.foreach { l =>\n if (cont) {\n Thread.sleep(1000)\n dynTC.addAndApply(l)\n }\n }\n}.start",
882 | "outputs": []
883 | },
884 | {
885 | "metadata": {
886 | "id": "9BD0C8C4788049C78D4C6E90059E3A83"
887 | },
888 | "cell_type": "markdown",
889 | "source": "### Components"
890 | },
891 | {
892 | "metadata": {
893 | "trusted": true,
894 | "input_collapsed": false,
895 | "collapsed": true,
896 | "id": "C3A5386580284B5D9C8D4BE907F6FA8C"
897 | },
898 | "cell_type": "code",
899 | "source": "val rteam = shuffle(team.toList).take(5)\nval dd = new DropDown(\"All\" :: rteam.map(_.Player))\nval rc = RadarChart(rteam, labelField=Some(\"Player\"), sizes=(800, 600))\nval bout = out\n\ndd.selected --> Connection.fromObserver { p =>\n bout(p + \" is selected\")\n rc.applyOn(rc.originalData.filter(_.Player == p || p == \"All\"))\n}\n\ndd ++ bout ++ rc",
900 | "outputs": []
901 | },
902 | {
903 | "metadata": {
904 | "id": "42D57AE9A76F4FA1A114BA95C2E1FADB"
905 | },
906 | "cell_type": "markdown",
907 | "source": "---\n## Synchronization"
908 | },
909 | {
910 | "metadata": {
911 | "id": "7460856EFF64425A83DD0DD4B5814360"
912 | },
913 | "cell_type": "markdown",
914 | "source": "Oh... notebooks are synchronized!\n\nOpen another browser window and relaunch the timeseries example."
915 | },
916 | {
917 | "metadata": {
918 | "id": "593AD5B17B854593B20EA88F8B18E338"
919 | },
920 | "cell_type": "markdown",
921 | "source": "---\n## Create new chart type... live"
922 | },
923 | {
924 | "metadata": {
925 | "id": "E51CB610568E4936ADA024549386E8E3"
926 | },
927 | "cell_type": "markdown",
928 | "source": "If you want/need to exerce your js fu, you can always use the `Chart` (for instance) API to create new dynamic widgets types.\n\nIn the following, we'll create a widget that can plot duration bars based for given operations (only a name):\n* a `js` string which is the javascript to execute for the new chart. It:\n * has to be a function with 3 params\n * `dataO` a knockout observable wich can be listened for new incoming data, see the `subscribe` call\n * `container` is the div element where you can add new elements\n * `options` is an extra object passed to the widget which defines additional configuration options (like width or a specific color or whatever)\n * has a `this` object containing:\n * `dataInit` this is the JSON representation of the Scala data as an array of objects having the same schema as the Scala type\n * `genId` a unique id that you can use for a high level element for instance"
929 | },
930 | {
931 | "metadata": {
932 | "trusted": true,
933 | "input_collapsed": false,
934 | "collapsed": true,
935 | "id": "81349F9C871E416093E496CEA8A9F534"
936 | },
937 | "cell_type": "code",
938 | "source": "val js = \"\"\"\nfunction progressgraph (dataO, container, options) {\n var css = 'div.prog {position: relative; overflow: hidden; } span.pp {display: inline-block; position: absolute; height: 16px;} span.prog {display: inline-block; position: absolute; height: 16px; }' +\n '.progs {border: solid 1px #ccc; background: #eee; } .progs .pv {background: #3182bd; }',\n head = document.head || document.getElementsByTagName('head')[0],\n style = document.createElement('style');\n\n style.type = 'text/css';\n if (style.styleSheet){\n style.styleSheet.cssText = css;\n } else {\n style.appendChild(document.createTextNode(css));\n }\n\n head.appendChild(style);\n\n\n var width = options.width||600\n var height = options.height||400\n \n function create(name, duration) {\n var div = d3.select(container).append(\"div\").attr(\"class\", \"prog\");\n\n div.append(\"span\").attr(\"class\", \"pp prog\")\n .style(\"width\", \"74px\")\n .style(\"text-align\", \"right\")\n .style(\"z-index\", \"2000\")\n .text(name);\n\n div.append(\"span\")\n .attr(\"class\", \"progs\")\n .style(\"width\", \"240px\")\n .style(\"left\", \"80px\")\n .append(\"span\")\n .attr(\"class\", \"pp pv\")\n .transition()\n .duration(duration)\n .ease(\"linear\")\n .style(\"width\", \"350px\");\n\n div.transition()\n .style(\"height\", \"20px\")\n .transition()\n .delay(duration)\n .style(\"height\", \"0px\")\n .remove();\n\n }\n\n function onData(data) {\n _.each(data, function(d) {\n create(d[options.name], 5000 + d[options.duration])\n });\n }\n\n onData(this.dataInit);\n dataO.subscribe(onData);\n}\n\"\"\".trim",
939 | "outputs": []
940 | },
941 | {
942 | "metadata": {
943 | "id": "D652E9BA1D884CDB98B8DF06D77DEB35"
944 | },
945 | "cell_type": "markdown",
946 | "source": "Now we can create the widget extending `notebook.front.widgets.charts.Chart[C]`, where `C` is any Scala type, it'll be converted to JS using the implicit instance of `ToPoints`.\n\nIt has to declare the original dataset which needs to be a wrapper (`List`, `Array`, ...) of the `C` instances we want to plot. But it can also define other things like below:\n* `sizes` are the $w \\times h$ dimension of the chart\n* `maxPoints` the number of points to plot, the way to select them is defined in the implicitly available instance of `Sampler`.\n* `scripts` a list of references to existing javascript scripts\n* `snippets` a list of string that represent snippets to execute in JS, they take the form of a JSON object with\n * `f` the function to call when the snippet will be executed\n * `o` a JSON object that will be provided to the above function at execution time. Here we define which field has to be used for the name and duration."
947 | },
948 | {
949 | "metadata": {
950 | "trusted": true,
951 | "input_collapsed": false,
952 | "collapsed": true,
953 | "id": "FA51FED8558140FEB37478F0594DAB4E"
954 | },
955 | "cell_type": "code",
956 | "source": "import notebook.front.widgets._\nimport notebook.front.widgets.magic._\nimport notebook.front.widgets.magic.Implicits._\nimport notebook.front.widgets.magic.SamplerImplicits._\ncase class ProgChart[C:ToPoints:Sampler](\n originalData:C,\n override val sizes:(Int, Int)=(600, 400),\n maxPoints:Int = 1000,\n name:String,\n duration:String\n) extends notebook.front.widgets.charts.Chart[C](originalData, maxPoints) {\n def mToSeq(t:MagicRenderPoint):Seq[(String, Any)] = t.data.toSeq\n\n\n override val snippets = List(s\"\"\"|{\n | \"f\": $js, \n | \"o\": {\n | \"name\": \"$name\",\n | \"duration\": \"$duration\"\n | }\n |}\n \"\"\".stripMargin)\n \n override val scripts = Nil\n}",
957 | "outputs": []
958 | },
959 | {
960 | "metadata": {
961 | "id": "1A5FD93B14BD42E99F49806BDEB16AC5"
962 | },
963 | "cell_type": "markdown",
964 | "source": "We can define the type of data we'll use for this example"
965 | },
966 | {
967 | "metadata": {
968 | "trusted": true,
969 | "input_collapsed": false,
970 | "collapsed": true,
971 | "id": "4A344133173D40158A4091AAC56F8AD2"
972 | },
973 | "cell_type": "code",
974 | "source": "case class ProgData(n:String, v:Int)",
975 | "outputs": []
976 | },
977 | {
978 | "metadata": {
979 | "id": "9DA8C6053A214FB2A01FBCE9C25EBE7F"
980 | },
981 | "cell_type": "markdown",
982 | "source": "Here we generate a bunch of data bucketized by 10, and we create an instance of the new widget giving it the first bucket of data and specifying the right field names for `name` and `duration`."
983 | },
984 | {
985 | "metadata": {
986 | "trusted": true,
987 | "input_collapsed": false,
988 | "collapsed": true,
989 | "id": "4BB2C6EA4B634DFC97B7D5D9845DF0AD"
990 | },
991 | "cell_type": "code",
992 | "source": "val pdata = for {\n c1 <- 'a' to 'e'\n c2 <- 'a' to 'e'\n} yield ProgData(\"\"+c1+c2, (nextDouble * 10000).toInt)\nval pdataH :: pdataS = pdata.toList.sliding(10, 10).toList\n\nval pc = ProgChart(pdataH, name = \"n\", duration = \"v\")\npc",
993 | "outputs": []
994 | },
995 | {
996 | "metadata": {
997 | "id": "35435A5C31AF4F67BF4508910220FA92"
998 | },
999 | "cell_type": "markdown",
1000 | "source": "We update the chart by passing the value using the `addAndApply` approach."
1001 | },
1002 | {
1003 | "metadata": {
1004 | "trusted": true,
1005 | "input_collapsed": false,
1006 | "collapsed": true,
1007 | "id": "15392CC271B04DA5842904489A447B1C"
1008 | },
1009 | "cell_type": "code",
1010 | "source": "var pcont = true\nnew Thread() {\n override def run = \n pdataS.foreach { l =>\n if (pcont) {\n Thread.sleep(9000)\n pc.addAndApply(l, true)\n }\n }\n}.start",
1011 | "outputs": []
1012 | },
1013 | {
1014 | "metadata": {
1015 | "id": "787984960C10439D81F5DA7FBD00505A"
1016 | },
1017 | "cell_type": "markdown",
1018 | "source": "---\n## Contexts with interpolation"
1019 | },
1020 | {
1021 | "metadata": {
1022 | "trusted": true,
1023 | "input_collapsed": false,
1024 | "collapsed": true,
1025 | "id": "09F41F85FD34409C837320D904108BE6"
1026 | },
1027 | "cell_type": "code",
1028 | "source": ":sh ls ${sys.env(\"NOTEBOOKS_DIR\")}",
1029 | "outputs": []
1030 | },
1031 | {
1032 | "metadata": {
1033 | "trusted": true,
1034 | "input_collapsed": false,
1035 | "collapsed": true,
1036 | "id": "C924A6AA52B94822B64718240069CFB0"
1037 | },
1038 | "cell_type": "code",
1039 | "source": "val ok = \"$\\\\LaTeX$ interpolated in Scala is $\\\\Re$\"",
1040 | "outputs": []
1041 | },
1042 | {
1043 | "metadata": {
1044 | "trusted": true,
1045 | "input_collapsed": false,
1046 | "collapsed": true,
1047 | "id": "416D1E2D54154AE7845CE8FC14ECF334"
1048 | },
1049 | "cell_type": "code",
1050 | "source": ":markdown \nYup, **$ok** in Spark Notebook",
1051 | "outputs": []
1052 | },
1053 | {
1054 | "metadata": {
1055 | "trusted": true,
1056 | "input_collapsed": false,
1057 | "collapsed": true,
1058 | "id": "7B10ABEE96784E458F34447A8C9E7C46"
1059 | },
1060 | "cell_type": "code",
1061 | "source": ":javascript\nalert(\"I am ${(\"whoami\".!!).trim}\")",
1062 | "outputs": []
1063 | }
1064 | ],
1065 | "nbformat": 4
1066 | }
--------------------------------------------------------------------------------
/notebooks/WhyScala.md:
--------------------------------------------------------------------------------
1 | # Scala: the Unpredicted Lingua Franca for Data Science
2 |
3 | **Andy Petrella**
4 | [noootsab@data-fellas.guru](mailto:noootsab@data-fellas.guru)
5 | **Dean Wampler**
6 | [dean.wampler@lightbend.com](mailto:dean.wampler@lightbend.com)
7 |
8 | * Scala Days NYC, May 5th, 2016
9 | * GOTO Chicago, May 24, 2016
10 | * Strata + Hadoop World London, June 3, 2016
11 | * Scala Days Berlin, June 16th, 2016
12 |
13 | See also the [Spark Notebook](http://spark-notebook.io) version of this content, available at [github.com/data-fellas/scala-for-data-science](https://github.com/data-fellas/scala-for-data-science).
14 |
15 | ## Why Scala for Data Science with Spark?
16 |
17 | While Python and R are traditional languages of choice for Data Science, [Spark](http://spark.apache.org) also supports Scala (the language in which it's written) and Java.
18 |
19 | However, using one language for all your work has advantages: it simplifies the software development process, including building, testing, and deployment, as well as coding conventions, etc.
20 |
21 | If you want a thorough introduction to Scala, see [Dean's book](http://shop.oreilly.com/product/0636920033073.do).
22 |
23 | So, what are the advantages, as well as disadvantages of Scala?
24 |
25 | ## 1. Functional Programming Plus Objects
26 |
27 | Scala is a _multi-paradigm_ language. Code can look a lot like traditional Java code using _Object-Oriented Programming_ (OOP), but it also embraces _Functional Programming_ (FP), which emphasizes the virtues of:
28 |
29 | 1. **Immutable values:** Mutability is a common source of bugs.
30 | 1. **Functions with no _side effects_:** All the information they need is passed in and all the "work" is returned. No external state is modified.
31 | 1. **Referential transparency:** You can replace a function call with a cached value that was returned from a previous invocation with the same arguments. (This is a benefit enabled by functions without side effects.)
32 | 1. **Higher-order functions:** Functions that take other functions as arguments or return functions as results.
33 | 1. **Structure separated from operations:** A core set of collections meets most needs. An operation applicable to one data structure is applicable to all.
34 |
35 | However, objects are still useful as an _encapsulation_ mechanism. This is valuable for projects with large teams and code bases.
36 | Scala also implements some _functional_ features using _object-oriented inheritance_ (e.g., "abstract data types" and "type classes", for you experts...).
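For the curious, here's a minimal sketch of the "type class" pattern just mentioned (the `Show` trait and `describe` method are our own illustrative names, not a library API): a trait defines some behavior, implicit instances provide it per type, and a generic method picks up the right instance automatically.

```scala
// A tiny, illustrative type class; the names are ours, not from a library.
trait Show[A] {                      // the type class: behavior for some type A
  def show(a: A): String
}

object Show {                        // instances in the companion are in implicit scope
  implicit val intShow: Show[Int] = new Show[Int] {
    def show(a: Int): String = s"Int($a)"
  }
  implicit val stringShow: Show[String] = new Show[String] {
    def show(a: String): String = "\"" + a + "\""
  }
}

// A generic method that works for any A with a Show instance in scope.
def describe[A](a: A)(implicit s: Show[A]): String = s.show(a)

describe(42)     // "Int(42)"
describe("foo")  // the string "foo" wrapped in quotes
```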
37 |
38 | ### What about the other languages?
39 | * **Python:** Supports mixed FP-OOP programming, too, but isn't as "rigorous".
40 | * **R:** As a Statistics language, R is more functional than object-oriented.
41 | * **Java:** An object-oriented language, but with recently introduced functional constructs, _lambdas_ (anonymous functions) and collection operations that follow a more _functional_ style, rather than _imperative_ (i.e., where mutating the collection is embraced).
42 |
43 | A few differences between Java's and Scala's approaches to OOP and FP are worth mentioning specifically:
44 |
45 | ### 1a. Traits vs. Interfaces
46 | Scala's object model adds a _trait_ feature, which is a more powerful concept than Java 8 interfaces. Before Java 8, there was no [mixin composition](https://en.wikipedia.org/wiki/Mixin) capability in Java, where composition is generally [preferred over inheritance](https://en.wikipedia.org/wiki/Composition_over_inheritance).
47 |
48 | Imagine that you want to define reusable logging code and mix it into other classes declaratively. Before Java 8, you could define the abstraction for logging in an interface, but you had to use some ad hoc mechanism to implement it (like implementing all methods to delegate to a helper object). Java 8 added the ability to provide default method definitions, as well as declarations in interfaces. This makes mixin composition easier, but you still can't add fields (for state), so the capability is limited.
49 |
50 | Scala traits fully support mixin composition by supporting both field and method definitions with flexibility rules for overriding behavior, once the traits are mixed into classes.
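Here's a small sketch of what that looks like (the `Logging` trait and `OrderService` class are hypothetical names we made up for illustration):

```scala
// A reusable logging trait carrying both state (a buffer) and behavior (methods).
trait Logging {
  private val messages = scala.collection.mutable.ArrayBuffer.empty[String]
  def log(message: String): Unit = messages += message
  def logged: Seq[String] = messages.toSeq
}

// Mix it in declaratively; no delegation boilerplate required.
class OrderService extends Logging {
  def placeOrder(id: Int): Unit = log(s"placed order $id")
}

val service = new OrderService
service.placeOrder(42)
service.logged  // Seq("placed order 42")
```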
51 |
52 | ### 1b. Java Streams
53 | When you use the Java 8 collections, you can convert the traditional collections to a "stream", which is lazy and gives you more functional operations. However, the conversions back and forth can sometimes be tedious, e.g., converting to a stream for functional processing, then converting back to pass the results to older APIs, etc. Scala collections are more consistently functional.
54 |
55 | ### The Virtue of Functional Collections:
56 | Let's examine how concisely we can operate on a collection of values in Scala and Spark.
57 |
58 | First, let's define a helper function: is an integer a prime? (Naïve algorithm from [Wikipedia](https://en.wikipedia.org/wiki/Primality_test).)
59 |
60 | ```scala
61 | def isPrime(n: Int): Boolean = {
62 | def test(i: Int, n2: Int): Boolean = {
63 | if (i*i > n2) true
64 | else if (n2 % i == 0 || n2 % (i + 2) == 0) false
65 | else test(i+6, n2)
66 | }
67 | if (n <= 1) false
68 | else if (n <= 3) true
69 | else if (n % 2 == 0 || n % 3 == 0) false
70 | else test(5, n)
71 | }
72 | ```
73 |
74 | Note that no values are mutated here ("virtue" #1 listed above) and `isPrime` has no side effects (#2), which means we could cache previous invocations for a given `n` for better performance if we called this a lot (#3)!
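For instance, here's a small memoization sketch of our own (not part of the original example; `primeCache` and `isPrimeCached` are names we just made up) that exploits that referential transparency:

```scala
// Because isPrime is pure, caching its results by argument is always safe.
val primeCache = scala.collection.mutable.Map.empty[Int, Boolean]
def isPrimeCached(n: Int): Boolean = primeCache.getOrElseUpdate(n, isPrime(n))

isPrimeCached(97)  // computed the first time...
isPrimeCached(97)  // ...served from the cache afterwards
```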
75 |
76 | ### Scala Collections Example
77 | Let's compare a Scala collections calculation vs. the same thing in Spark; how many prime numbers are there between 1 and 100, inclusive?
78 |
79 | ```scala
80 | (1 to 100). // Range of integers from 1 to 100, inclusive.
81 | map(i => (i, isPrime(i))). // `map` is a higher-order method; we pass it
82 | // a function (#4)
83 | groupBy(tuple => tuple._2). // ... and so is `groupBy`, etc.
84 | map(tuple => (tuple._1, tuple._2.size))
85 | ```
86 |
87 | This produces the results:
88 | ```scala
89 | res16: scala.collection.immutable.Map[Boolean,Int] = Map(
90 | false -> 75, true -> 25)
91 | ```
92 | Note that for the numbers between 1 and 100, inclusive, exactly 1/4 of them are prime!
93 |
94 | ### Spark Example
95 |
96 | Note how similar the following code is to the previous example. After constructing the data set, the "core" three lines are _identical_, even though they are operating on completely different underlying collections (#5 above).
97 |
98 | However, because Spark collections are "lazy" by default (i.e., not evaluated until we ask for results), we explicitly collect the results so Spark evaluates them!
99 |
100 | ```scala
101 | val rddPrimes = sparkContext.parallelize(1 to 100).
102 | map(i => (i, isPrime(i))).
103 | groupBy(tuple => tuple._2).
104 | map(tuple => (tuple._1, tuple._2.size))
105 | rddPrimes.collect
106 | ```
107 |
108 | This produces the result:
109 | ```scala
110 | rddPrimes: org.apache.spark.rdd.RDD[(Boolean, Int)] =
111 | MapPartitionsRDD[4] at map at <console>:61
112 | res18: Array[(Boolean, Int)] = Array((false,75), (true,25))
113 | ```
114 |
115 | Note the inferred type, an `RDD` with records of type `(Boolean, Int)`, meaning two-element tuples.
116 |
117 | Spark's RDD API is inspired by the Scala collections API, which is inspired by classic _functional programming_ operations on data collections, i.e., using a series of transformations from one form to the next, without mutating any of the collections. (Spark is very efficient about avoiding the materialization of intermediate outputs.)
118 |
119 | Once you know these operations, it's quick and effective to implement robust, non-trivial transformations.
120 |
121 | What about the other languages?
122 |
123 | * **Python:** Supports very similar functional programming. In fact, Spark Python code looks very similar to Spark Scala code.
124 | * **R:** More idiomatic (see below).
125 | * **Java:** Looks similar when _lambdas_ are used, but missing features (see below) limit concision and flexibility.
126 |
127 | ## 2. Interpreter (REPL)
128 |
129 | In the notebook, we've been using the Scala interpreter (a.k.a., the REPL - Read Eval, Print, Loop) already behind the scenes. It makes notebooks like this one possible!
130 |
131 | What about the other languages?
132 |
133 | * **Python:** Also has an interpreter and [iPython/Jupyter](https://ipython.org/) was one of the first, widely-used notebook environments.
134 | * **R:** Also has an interpreter and notebook/IDE environments.
135 | * **Java:** Does _not_ have an interpreter and can't be programmed in a notebook environment. However, Java 9 will have a REPL, after 20+ years!
136 |
137 | ## 3. Tuple Syntax
138 | In data, you work with records of `n` fields (for some value of `n`) all the time. Support for `n`-element _tuples_ is very convenient and Scala has a shorthand syntax for instantiating tuples. We used it twice previously to return two-element tuples in the anonymous functions passed to the `map` methods above:
139 |
140 | ```scala
141 | sparkContext.parallelize(1 to 100).
142 | map(i => (i, isPrime(i))). // <-- here
143 | groupBy(tuple => tuple._2).
144 | map(tuple => (tuple._1, tuple._2.size)) // <-- here
145 | ```
146 |
147 | As before, the REPL prints the following:
148 |
149 | ```scala
150 | res20: org.apache.spark.rdd.RDD[(Boolean, Int)] =
151 | MapPartitionsRDD[9] at map at <console>:63
152 | ```
153 |
154 | **Tuples are used all the time** in Spark Scala RDD code, where it's common to use key-value pairs.
155 |
156 | What about the other languages?
157 |
158 | * **Python:** Also has some support for the same tuple syntax.
159 | * **R:** Also has tuple types, but a less convenient syntax for instantiating them.
160 | * **Java:** Does _not_ have tuple types, not even the special case of two-element tuples (pairs), much less a convenient syntax for them. However, Spark defines a [MutablePair](http://spark.apache.org/docs/latest/api/java/org/apache/spark/util/MutablePair.html) type for this purpose:
161 |
162 | ```scala
163 | // Using Scala syntax here:
164 | import org.apache.spark.util.MutablePair
165 | val pair = new MutablePair[Int,String](1, "one")
166 | ```
167 |
168 | The REPL prints:
169 | ```scala
170 | import org.apache.spark.util.MutablePair
171 | pair: org.apache.spark.util.MutablePair[Int,String] = (1,one)
172 | ```
173 |
174 | ## 4. Pattern Matching
175 | This is one of the most powerful features you'll find in most functional languages, Scala included. It has no equivalent in Python, R, or Java.
176 |
177 | Let's rewrite our previous primes example:
178 |
179 | ```scala
180 | sparkContext.parallelize(1 to 100).
181 | map(i => (i, isPrime(i))).
182 | groupBy{ case (_, primality) => primality}. // Syntax: { case pattern => body }
183 | map{ case (primality, values) => (primality, values.size) } . // same here
184 | foreach(println)
185 | ```
186 |
187 | The output is:
188 | ```scala
189 | (true,25)
190 | (false,75)
191 | ```
192 |
193 | Note the `case` keyword and `=>` separating the pattern from the body to execute if the pattern matches.
194 |
195 | In the first pattern, `(_, primality)`, we didn't need the first tuple element, so we used the "don't care" placeholder, `_`. Note also that `{...}` must be used instead of `(...)`. (The extra whitespace after the `{` and before the `}` is not required; it's here for legibility.)
196 |
197 | Pattern matching is much richer, yet more concise, than the `if ... else ...` constructs in the other languages. We can use it on nearly anything to match what it is and then decompose it into its constituent parts, which are assigned to variables with meaningful names, e.g., `primality`, `values`, etc.
198 |
199 | Here's another example, where we _deconstruct_ a nested tuple. We also show that you can use pattern matching for assignment, too!
200 |
201 | ```scala
202 | val (a, (b1, (b21, b22)), c) = ("A", ("B1", ("B21", "B22")), "C")
203 | ```
204 |
205 | Note the output of the REPL:
206 | ```scala
207 | a: String = A
208 | b1: String = B1
209 | b21: String = B21
210 | b22: String = B22
211 | c: String = C
212 | ```
213 |
214 | ## 5. Case Classes
215 | Now is a good time to introduce _case classes_, a convenient way to declare classes that encapsulate state composed of a few values.
216 |
217 | ```scala
218 | case class Person(firstName: String, lastName: String, age: Int)
219 | ```
220 |
221 | The `case` keyword tells the compiler to:
222 |
223 | * Make immutable instance fields out of the constructor arguments (the list after the name).
224 | * Add `equals`, `hashCode`, and `toString` methods (which you can explicitly define yourself, if you want).
225 | * Add a _companion object_ with the same name, which holds methods for constructing instances and "destructuring" instances through pattern matching.
226 | * etc.
227 |
228 | Case classes are useful for implementing records in RDDs.
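As a quick, hedged sketch of those generated pieces in action (reusing the `Person` class above):

```scala
val p1 = Person("Dean", "Wampler", 39)  // no `new` needed: the companion's apply method
val p2 = Person("Dean", "Wampler", 39)

p1.toString                    // "Person(Dean,Wampler,39)"  (generated toString)
p1 == p2                       // true: structural equals (with a consistent hashCode)
val older = p1.copy(age = 40)  // generated copy method for "modified" copies
val Person(first, _, _) = p1   // the companion's unapply, used by pattern matching
```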
229 |
230 | Let's see case class pattern matching in action:
231 |
232 | ```scala
233 | sparkContext.parallelize(
234 | Seq(Person("Dean", "Wampler", 39),
235 | Person("Andy", "Petrella", 29))).
236 | map {
237 | // Convert Person instances to tuples
238 | case Person(first, last, age) => (first, last, age)
239 | }.
240 | foreach(println)
241 | ```
242 |
243 | Output:
244 | ```scala
245 | (Andy,Petrella,29)
246 | (Dean,Wampler,39)
247 | ```
248 |
249 | What about the other languages?
250 |
251 | * **Python:** Regular expression matching for strings is built in. Pattern matching as shown requires a third-party library with an idiomatic syntax. Nothing like case classes.
252 | * **R:** Only supports regular expression matching for strings. Nothing like case classes.
253 | * **Java:** Only supports regular expression matching for strings. Nothing like case classes.
254 |
255 | ## 6. Type Inference
256 | Most languages associate a type with values, but, crudely speaking, they fall into two categories: those which determine the type of expressions and variables at compile time (like Scala and Java) and those which do so at runtime (Python and R). These are called _static typing_ and _dynamic typing_, respectively.
257 |
258 | So, languages with static typing either have to be told the type of every expression or variable, or they can _infer_ types in some or all cases. Scala can infer types most of the time, while Java can do so only in limited cases. Here are some examples for Scala. Note the results shown for each expression:
259 |
260 | ```scala
261 | val i = 100 // <- infer that i is an integer
262 | val j = i*i % 27 // <- since i is an integer, j must be one, too.
263 | ```
264 |
265 | The compiler infers the following:
266 | ```scala
267 | i: Int = 100
268 | j: Int = 10
269 | ```
270 |
271 | Recall our previous Spark example, where we wrote nothing about types, but they were inferred:
272 |
273 | ```scala
274 | sparkContext.parallelize(1 to 100).
275 | map(i => (i, isPrime(i))).
276 | groupBy{ case(_, primality) => primality }.
277 | map{ case (primality, values) => (primality, values.size) }
278 | ```
279 |
280 | Output:
281 | ```scala
282 | res30: org.apache.spark.rdd.RDD[(Boolean, Int)] =
283 | MapPartitionsRDD[21] at map at <console>:66
284 | ```
285 |
286 | So this long expression (and it is a four-line expression - note the "."'s) returns an `RDD[(Boolean, Int)]`. Note that we can also express a tuple _type_ with the `(...)` syntax, just like for tuple _instances_. This type could also be written `RDD[Tuple2[Boolean, Int]]`.
287 |
288 | Put another way, we have an `RDD` where the records are key-value pairs of `Booleans` and `Ints`.
289 |
290 | I really like the extra safety that static typing provides, without the hassle of writing the types for almost everything, compared to Java. Furthermore, when I'm using an API with the Scala interpreter or a notebook like this one, the return value's type is shown, as in the previous example, so I know exactly what "kinds of things" I have. That also means I don't have to know _in advance_ what a method will return in order to explicitly add a required type, as in Java.
291 |
292 | What about the other languages?
293 |
294 | * **Python:** Uses dynamic typing, so no types are written explicitly, but you also don't get the feedback type inference provides, as in our `RDD[(Boolean, Int)]` example.
295 | * **R:** Also dynamically typed.
296 | * **Java:** Statically typed with explicit types required almost everywhere.
297 |
298 | ## 7. Unification of Primitives and Types
299 | In Java, there is a clear distinction between primitives, which are nice for performance (you can put them in registers, you can pass them on the stack, you don't heap allocate them), and instances of classes, which give you the expressiveness of OOP, but with the overhead of heap allocation, etc.
300 |
301 | Scala unifies the syntax, but in most cases, compiles optimal code. So, for example, `Float` acts like any other type, e.g., `String`, with methods you call, but the compiler uses JVM `float` primitives. `Float` and the other primitives are subtypes of `AnyVal` and include `Byte`, `Short`, `Int`, `Long`, `Float`, `Double`, `Char`, `Boolean`, and `Unit`.
302 |
303 | Another benefit is that the uniformity extends to parameterized types, like collections. If you implement your own `Tree[T]` type, `T` can be `Float`, `String`, `MyMassiveClass`, whatever. There's no mental burden of explicitly boxing and unboxing primitives.
304 |
305 | However, the downside is that your primitives will be boxed when used in a context like this. Scala does have an annotation, `@specialized(a,b,c)`, that tells the compiler to generate optimal implementations for the primitive types listed as `a,b,c`, but it's not a perfect solution.
306 |
307 | ```scala
308 | val listString: List[String] = List("one", "two", "three")
309 | val listInt: List[Int] = List(1, 2, 3) // No need to use Integer.
310 | ```
311 |
312 | Output:
313 | ```scala
314 | listString: List[String] = List(one, two, three)
315 | listInt: List[Int] = List(1, 2, 3)
316 | ```
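As a hedged sketch of the annotation mentioned above (the `Box` class is a made-up example, not a standard type):

```scala
// Without @specialized, Box[Int] would store a boxed java.lang.Integer on the heap.
// The annotation asks the compiler to also generate Int- and Double-based variants.
class Box[@specialized(Int, Double) T](val value: T) {
  def map[U](f: T => U): Box[U] = new Box(f(value))
}

val intBox = new Box(3)        // uses the Int-specialized variant (no boxing of 3)
val strBox = new Box("three")  // falls back to the generic variant
intBox.value + 1               // 4
```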
317 |
318 | See also **Value Classes** below.
319 |
320 | ## 8. Elegant Tools to Create "Domain Specific Languages"
321 | The Spark [DataFrame](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame) API is a good example of a DSL that mimics the original Python and R DataFrame APIs for single-node use.
322 |
323 | First, set up the API:
324 | ```scala
325 | import org.apache.spark.sql.SQLContext
326 | val sqlContext = new SQLContext(sparkContext)
327 | import sqlContext.implicits._
328 | import org.apache.spark.sql.functions._ // for min, max, etc. column operations
329 | ```
330 |
331 | Get the root directory of the notebooks:
332 | ```scala
333 | val root = sys.env("NOTEBOOKS_DIR")
334 | ```
335 |
336 | Load the airports data into a [DataFrame](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame).
337 |
338 | ```scala
339 | val airportsDF = sqlContext.read.json(s"$root/airports.json")
340 | ```
341 |
342 | Note the "schema" is inferred from the JSON and shown by the REPL (by calling `DataFrame.toString`).
343 |
344 | ```scala
345 | airportsDF: org.apache.spark.sql.DataFrame = [airport: string, city: string, country: string, iata: string, lat: double, long: double, state: string]
346 | ```
347 |
348 | We cache the results, so Spark will keep the data in memory since we'll run a few queries over it. `DataFrame.show` is convenient for displaying the first `N` records (20 by default).
349 |
350 | ```scala
351 | airportsDF.cache
352 | airportsDF.show
353 | ```
354 |
355 | Here's the output of `show`:
356 | ```
357 | +--------------------+------------------+-------+----+-----------+------------+-----+
358 | | airport| city|country|iata| lat| long|state|
359 | +--------------------+------------------+-------+----+-----------+------------+-----+
360 | | Thigpen | Bay Springs| USA| 00M|31.95376472|-89.23450472| MS|
361 | |Livingston Municipal| Livingston| USA| 00R|30.68586111|-95.01792778| TX|
362 | | Meadow Lake| Colorado Springs| USA| 00V|38.94574889|-104.5698933| CO|
363 | | Perry-Warsaw| Perry| USA| 01G|42.74134667|-78.05208056| NY|
364 | | Hilliard Airpark| Hilliard| USA| 01J| 30.6880125|-81.90594389| FL|
365 | | Tishomingo County| Belmont| USA| 01M|34.49166667|-88.20111111| MS|
366 | | Gragg-Wade | Clanton| USA| 02A|32.85048667|-86.61145333| AL|
367 | | Capitol| Brookfield| USA| 02C| 43.08751|-88.17786917| WI|
368 | | Columbiana County| East Liverpool| USA| 02G|40.67331278|-80.64140639| OH|
369 | | Memphis Memorial| Memphis| USA| 03D|40.44725889|-92.22696056| MO|
370 | | Calhoun County| Pittsboro| USA| 04M|33.93011222|-89.34285194| MS|
371 | | Hawley Municipal| Hawley| USA| 04Y|46.88384889|-96.35089861| MN|
372 | |Griffith-Merrillv...| Griffith| USA| 05C|41.51961917|-87.40109333| IN|
373 | |Gatesville - City...| Gatesville| USA| 05F|31.42127556|-97.79696778| TX|
374 | | Eureka| Eureka| USA| 05U|39.60416667|-116.0050597| NV|
375 | | Moton Municipal| Tuskegee| USA| 06A|32.46047167|-85.68003611| AL|
376 | | Schaumburg|Chicago/Schaumburg| USA| 06C|41.98934083|-88.10124278| IL|
377 | | Rolla Municipal| Rolla| USA| 06D|48.88434111|-99.62087694| ND|
378 | | Eupora Municipal| Eupora| USA| 06M|33.53456583|-89.31256917| MS|
379 | | Randall | Middletown| USA| 06N|41.43156583|-74.39191722| NY|
380 | +--------------------+------------------+-------+----+-----------+------------+-----+
381 | only showing top 20 rows
382 | ```
383 |
384 | Now we can show the idiomatic DataFrame API (DSL) in action:
385 |
386 | ```scala
387 | val grouped = airportsDF.groupBy($"state", $"country").count.orderBy($"count".desc)
388 | grouped.printSchema
389 | grouped.show(100) // 50 states + territories < 100
390 | ```
391 |
392 | Here is the output:
393 |
394 | ```
395 | root
396 | |-- state: string (nullable = true)
397 | |-- country: string (nullable = true)
398 | |-- count: long (nullable = false)
399 |
400 | +-----+-------+-----+
401 | |state|country|count|
402 | +-----+-------+-----+
403 | | AK| USA| 263|
404 | | TX| USA| 209|
405 | | CA| USA| 205|
406 | | OK| USA| 102|
407 | | FL| USA| 100|
408 | | OH| USA| 100|
409 | | NY| USA| 97|
410 | | GA| USA| 97|
411 | | MI| USA| 94|
412 | | MN| USA| 89|
413 | | IL| USA| 88|
414 | | WI| USA| 84|
415 | | KS| USA| 78|
416 | | IA| USA| 78|
417 | | AR| USA| 74|
418 | | MO| USA| 74|
419 | | NE| USA| 73|
420 | | AL| USA| 73|
421 | | MS| USA| 72|
422 | | NC| USA| 72|
423 | | PA| USA| 71|
424 | | MT| USA| 71|
425 | | TN| USA| 70|
426 | | WA| USA| 65|
427 | | IN| USA| 65|
428 | | AZ| USA| 59|
429 | | SD| USA| 57|
430 | | OR| USA| 57|
431 | | LA| USA| 55|
432 | | ND| USA| 52|
433 | | SC| USA| 52|
434 | | NM| USA| 51|
435 | | KY| USA| 50|
436 | | CO| USA| 49|
437 | | VA| USA| 47|
438 | | ID| USA| 37|
439 | | UT| USA| 35|
440 | | NJ| USA| 35|
441 | | ME| USA| 34|
442 | | WY| USA| 32|
443 | | NV| USA| 32|
444 | | MA| USA| 30|
445 | | WV| USA| 24|
446 | | MD| USA| 18|
447 | | HI| USA| 16|
448 | | CT| USA| 15|
449 | | NH| USA| 14|
450 | | VT| USA| 13|
451 | | PR| USA| 11|
452 | | RI| USA| 6|
453 | | DE| USA| 5|
454 | | VI| USA| 5|
455 | | CQ| USA| 4|
456 | | AS| USA| 3|
457 | | GU| USA| 1|
458 | | DC| USA| 1|
459 | +-----+-------+-----+
460 |
461 | grouped: org.apache.spark.sql.DataFrame = [state: string, country: string, count: bigint]
462 | ```
463 |
464 | By the way, this DSL is essentially a programmatic version of SQL:
465 |
466 | ```scala
467 | airportsDF.registerTempTable("airports")
468 | val grouped2 = sqlContext.sql("""
469 | SELECT state, country, COUNT(*) AS cnt FROM airports
470 | GROUP BY state, country
471 | ORDER BY cnt DESC
472 | """)
473 | ```
474 |
475 | What about the other languages?
476 |
477 | * **Python:** Dynamically-typed languages often have features that make idiomatic DSLs easy to define. The Spark DataFrame API is inspired by the [Pandas DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) API.
478 | * **R:** Less flexible for idiomatic DSLs, but syntax is designed for Mathematics. The Pandas DataFrame API is inspired by the [R Data Frame](http://www.r-tutor.com/r-introduction/data-frame) API.
479 | * **Java:** Limited to so-called _fluent_ APIs, similar to our collections and RDD examples above.
480 |
481 | ## 9. And a Few Other Things...
482 | There are many more Scala features that the other languages don't have or don't support as nicely. Some are actually quite significant for general programming tasks, but they are used less frequently in Spark code. Here they are, for completeness.
483 |
484 | ### 9A. Singletons Are a Built-in Feature
485 | Implement the _Singleton Design Pattern_ without special logic to ensure there's only one instance.
486 |
487 | ```scala
488 | object Foo {
489 | def main(args: Array[String]):Unit = {
490 | args.foreach(arg => println(s"arg = $arg"))
491 | }
492 | }
493 | Foo.main(Array("Scala", "is", "great!"))
494 | ```
495 |
496 | The output is:
497 | ```
498 | arg = Scala
499 | arg = is
500 | arg = great!
501 | defined object Foo
502 | ```
503 |
504 | ### 9B. Named and Default Arguments
505 | Does a method have a long argument list? Provide defaults for some of them. Name the arguments when calling the method to document what you're doing.
506 |
507 | ```scala
508 | val airportsRDD = grouped.select($"count", $"state").
509 | map(row => (row.getLong(0), row.getString(1)))
510 |
511 | val rdd1 = airportsRDD.sortByKey() // defaults: ascending = true, numPartitions = current # of partitions
512 | val rdd2 = airportsRDD.sortByKey(ascending = false) // name the ascending argument explicitly
513 | val rdd3 = airportsRDD.sortByKey(numPartitions = 4) // name the numPartitions argument explicitly
514 | val rdd4 = airportsRDD.sortByKey(ascending = false, numPartitions = 4) // Okay to do both...
515 | ```
516 |
517 | All four variants return the same type:
518 | ```scala
519 | rdd1: org.apache.spark.rdd.RDD[(Long, String)] = ShuffledRDD[60] at sortByKey at <console>:74
520 | rdd2: org.apache.spark.rdd.RDD[(Long, String)] = ShuffledRDD[63] at sortByKey at <console>:75
521 | rdd3: org.apache.spark.rdd.RDD[(Long, String)] = ShuffledRDD[66] at sortByKey at <console>:76
522 | rdd4: org.apache.spark.rdd.RDD[(Long, String)] = ShuffledRDD[69] at sortByKey at <console>:77
523 | ```
524 |
525 | To see the impacts of the arguments:
526 | ```scala
527 | Seq(rdd1, rdd2, rdd3, rdd4).foreach { rdd =>
528 | println(s"RDD (#partitions = ${rdd.partitions.length}):")
529 | rdd.take(10).foreach(println)
530 | }
531 | ```
532 |
533 | Output:
534 | ```
535 | RDD (#partitions = 41):
536 | (1,GU)
537 | (1,DC)
538 | (3,AS)
539 | (4,CQ)
540 | (5,VI)
541 | (5,DE)
542 | (6,RI)
543 | (11,PR)
544 | (13,VT)
545 | (14,NH)
546 | RDD (#partitions = 41):
547 | (263,AK)
548 | (209,TX)
549 | (205,CA)
550 | (102,OK)
551 | (100,OH)
552 | (100,FL)
553 | (97,NY)
554 | (97,GA)
555 | (94,MI)
556 | (89,MN)
557 | RDD (#partitions = 4):
558 | (1,GU)
559 | (1,DC)
560 | (3,AS)
561 | (4,CQ)
562 | (5,VI)
563 | (5,DE)
564 | (6,RI)
565 | (11,PR)
566 | (13,VT)
567 | (14,NH)
568 | RDD (#partitions = 4):
569 | (263,AK)
570 | (209,TX)
571 | (205,CA)
572 | (102,OK)
573 | (100,OH)
574 | (100,FL)
575 | (97,NY)
576 | (97,GA)
577 | (94,MI)
578 | (89,MN)
579 | ```
580 |
581 | ### 9C. String Interpolation
582 | You've seen it used already:
583 |
584 | ```scala
585 | println(s"RDD #partitions = ${rdd4.partitions.length}")
586 | // prints: RDD #partitions = 4
587 | ```
588 |
589 | ### 9D. Few Semicolons
590 | Semicolons are inferred, making your code just that much more concise. You can use them if you want to write more than one expression on a line:
591 |
592 | ```scala
593 | val result = "foo" match {
594 | case "foo" => println("Found foo!"); true
595 | case _ => false
596 | }
597 | // prints: Found foo!
598 | ```
599 |
600 | ### 9E. Tail Recursion Optimization
601 | Recursion isn't used much in user code for Spark, but for general programming it's a powerful technique. Unfortunately, most OO languages (like Java) do not optimize [tail call recursion](https://en.wikipedia.org/wiki/Tail_call) by converting the recursion into a loop. Without this optimization, recursion is risky because of the potential for stack overflow. Scala's compiler implements this optimization.
602 |
603 | ```scala
604 | def printSeq[T](seq: Seq[T]): Unit = seq match {
605 | case head +: tail => println(head); printSeq(tail)
606 | case Nil => // done
607 | }
608 | printSeq(Seq(1,2,3,4))
609 | // prints:
610 | // 1
611 | // 2
612 | // 3
613 | // 4
614 | ```
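As a small aside (not part of the original example), you can ask the compiler to _verify_ that the optimization applies by annotating the method; compilation fails if a recursive call is ever not in tail position:

```scala
import scala.annotation.tailrec

@tailrec
def printSeqChecked[T](seq: Seq[T]): Unit = seq match {
  case head +: tail => println(head); printSeqChecked(tail)  // still a tail call
  case _            => // empty: done
}
printSeqChecked(Seq(1, 2, 3, 4))
```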
615 |
616 | ### 9F. Everything Is an Expression
617 | Some constructs are _statements_ (meaning they return nothing) in some languages, like `if ... then ... else`, `for` loops, etc. Almost everything is an expression in Scala, which means you can assign the result of an `if` or `for` expression. The alternative in the other languages is to declare a mutable variable, then set its value inside the statement.
618 |
619 | ```scala
620 | val worldRocked = if (true == false) "yes!" else "no"
621 | ```
622 |
623 | As you might expect, the output is:
624 | ```
625 | worldRocked: String = no
626 | ```
627 |
628 | ```scala
629 | val primes = for {
630 | i <- 0 until 100
631 | if isPrime(i)
632 | } yield i
633 | ```
634 |
635 | The output is:
636 | ```scala
637 | primes: scala.collection.immutable.IndexedSeq[Int] =
638 | Vector(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37,
639 | 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97)
640 | ```
641 |
642 | ### 9G. Implicits
643 | One of Scala's most powerful features is the _implicits_ mechanism. It's used (or misused) for several capabilities, but one of the most useful is the ability to "add" methods to existing types that don't already have the methods. What actually happens is the compiler invokes an _implicit conversion_ from an instance of the type to a wrapper type that has the desired method.
644 |
645 | For example, suppose I want to add a `toJSON` method to my `Person` type above, but I don't want this added to the class itself. Maybe it's from a library that I can't modify. Maybe I only want this method in some contexts, but I don't want its baggage everywhere. Here's how to do it.
646 |
647 | ```scala
648 | // repeat definition of Person:
649 | case class Person(firstName: String, lastName: String, age: Int)
650 |
651 | implicit class PersonToJSON(person: Person) {
652 | // Just return a JSON-formatted string, for simplicity of the example:
653 | def toJSON: String =
654 | s"""{ "firstName": ${person.firstName}, "lastName": ${person.lastName}, "age": ${person.age} }"""
655 | }
656 |
657 | val p = Person("Dean", "Wampler", 39)
658 | p.toJSON // Like magic!!
659 | // returns: { "firstName": Dean, "lastName": Wampler, "age": 39 }
660 | ```
661 |
662 | The `implicit` keyword tells the compiler to consider `PersonToJSON` when I attempt to call `toJSON` on a `Person` instance. The compiler finds this implicit class and does the conversion implicitly, then calls the `toJSON` method.
663 |
664 | There are many other uses for implicits. They are a powerful implementation tool for various design problems, but they have to be used wisely, because it can be difficult for the reader to know what's going on.
665 |
666 | ### 9H. Sealed Type Hierarchies
667 | An important concept in modern languages is _sum types_, where there is a finite set of possible instances. Two examples from Scala are `Option[T]` and its allowed subtypes `Some[T]` and `None`, and `Either[L,R]` and its subtypes `Left[L]` and `Right[R]`.
668 |
669 | Note that `Option[T]` represents two and only two possible states: either I have something, a `T` inside a `Some[T]`, or I don't have anything, a `None`. There are no additional "states" that are logically possible for the `Option[T]` "abstraction". Similarly, `Either[L,R]` encapsulates a similar dichotomy, often used for "failure" (e.g., `Left[Throwable]` by convention) and "successful result" (`Right[T]` for some "expected" `T`).
670 |
671 | The term _sum type_ comes from an analogy between types and arithmetic. For `Option`, the number of allowed instances (ignoring the type parameter `T`) is just the sum, _two_. Similarly for `Either`.
672 |
673 | There are also _product types_, like tuples, where combining types together _multiplies_ the number of instances. For example, a tuple of `(Option,Either)` would have 2*2 instances. A tuple `(Boolean,Option,HTTP_Commands)` has 2*2*7 possible instances (there are 7 HTTP 1.1 commands, like `GET`, `POST`, etc.)
674 |
675 | Scala uses type hierarchies for sum types, where an abstract _sealed_ trait or class is used for the base type, e.g., `Option[T]` and `Either[L,R]`, and subtypes represent the concrete types. The `sealed` keyword is used on the base type and it is crucial; it tells the compiler to only allow subtypes to be defined in the same _file_, which means users can't add their own subtypes, breaking the logic of the type.
676 |
677 | Some other languages implement sum types using a variation of _enumerations_. Java has that, but it's a much more limited concept than true subtypes.
678 |
679 | Here's an example, sort of like `Either`, but oriented more towards encapsulating success or failure. However, we'll put "success" on the left instead of on the right, which is where it goes by convention when using `Either`.
680 |
681 | We'll have one type parameter `Result`. On `Success`, it will hold an instance of the type `Result`. On Failure, it will hold no successful result, so we'll use the "bottom" type `Nothing` for the type parameter,
682 | and expect the error information to be returned in a `RuntimeException`.
683 |
684 | ```scala
685 | import scala.util.control.NonFatal
686 |
687 | // The + means "covariant"; we can use subtypes of the declared "Result".
688 | // See also the **Definition-site Variance...** section below.
689 | sealed trait SuccessOrFailure[+Result]
690 | case class Success[Result](result: Result) extends SuccessOrFailure[Result]
691 | case class Failure(error: RuntimeException) extends SuccessOrFailure[Nothing]
692 | ```
693 |
694 | The `sealed` keyword is actually less useful in the context of this notebook; we can keep on defining subclasses below. However, in library code, you would put the three declarations in a separate file and then the compiler would prevent anyone from defining a third subclass in a different location.
695 |
696 | Let's try it out.
697 |
698 | ```scala
699 | def parseInt(string: String): SuccessOrFailure[Int] = try {
700 | Success(Integer.parseInt(string))
701 | } catch {
702 | case nfe: NumberFormatException => Failure(new RuntimeException(s"""Invalid integer string: "$string" """))
703 | }
704 |
705 | Seq("1", "202", "three").map(parseInt)
706 | // Seq[SuccessOrFailure[Int]] = List(
707 | // Success(1),
708 | // Success(202),
709 | // Failure(java.lang.RuntimeException: Invalid integer string: "three" ))
710 | ```
711 |
712 | ### 9I. Option Type Broken in Java
713 | Speaking of `Option[T]`, Java 8 introduced a similar type called `Optional`. (The name `Option` was already used for something else.) However, its design has some subtleties that make the behavior not straightforward when `nulls` are involved. For details, see [this blog post](https://developer.atlassian.com/blog/2015/08/optional-broken/).
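For contrast, here's a quick sketch (ours) of the one predictable rule Scala's `Option` follows for `null`:

```scala
// The Option(...) factory converts null to None; Some(...) wraps whatever you give it.
val fromNull: Option[String] = Option(null)  // None
val explicit: Option[String] = Some(null)    // Some(null), because you asked for a wrapper

fromNull.map(_.length)         // None: the map is simply skipped
fromNull.getOrElse("default")  // "default"
```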
714 |
715 | ### 9J. Definition-site Variance vs. Call-site Variance
716 | This is a technical point. In Java, when you define a type with a type parameter, like our `SuccessOrFailure[T]` previously, to hold items of some type `T`, you can't specify in the declaration whether it's okay to use a `SuccessOrFailure` parameterized with a _subtype_ of `T` where a `SuccessOrFailure[T]` is expected. For example, is the following okay?:
717 |
718 | ```java
719 | // Java
720 | SuccessOrFailure