├── Dockerfile ├── README.md └── notebooks ├── Data Science with Scala.snb ├── Why Spark Notebook.snb ├── WhyScala.md ├── WhyScala.pdf ├── WhyScala.snb ├── airports.json └── images ├── JavaMemory.jpg └── TungstenMemory.jpg /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM andypetrella/spark-notebook-demo:master-2.0.0-preview 2 | 3 | # Data Fellas 4 | MAINTAINER Data Fellas info@data-fellas.guru 5 | 6 | USER root 7 | 8 | ENV HOME /root 9 | 10 | ENV NOTEBOOKS_DIR /root/spark-notebook/notebooks/scala-ds 11 | 12 | ENV ADD_JARS /root/spark-notebook/lib/common.common-0.7.0-SNAPSHOT-scala-2.10.6-spark-2.0.0-preview-hadoop-2.2.0-with-hive-with-parquet.jar 13 | 14 | ADD notebooks /root/spark-notebook/notebooks/scala-ds 15 | 16 | WORKDIR /root/demo-base -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Scala for Data Science 2 | 3 | The enclosed notebooks and other materials are for the [Scala Days 2016](http://www.scaladays.org/) and [Strata London 2016](http://conferences.oreilly.com/strata/hadoop-big-data-eu/public/schedule/detail/49739) talks by [Andy Petrella](mailto:noootsab@data-fellas.guru) and [Dean Wampler](dean.wampler@lightbend.com) on why Scala is a great language for Data Science. 4 | 5 | The talk includes a notebook for [Spark Notebook](http://spark-notebook.io/), which provides a notebook metaphor for interactive Spark development using Scala. If you aren't familiar with the idea of a notebook interface, think of it as an enhanced REPL that makes it easy to edit and run (or rerun) code, plot results, mix in markdown-based documentation, etc. 6 | 7 | However, if you don't want to go to the trouble of installing and using [Spark Notebook](http://spark-notebook.io/), there are Markdown and PDF versions of the same content in the `notebooks` directory. 8 | 9 | ## Use Existing Docker 10 | A docker container exists with the Spark Notebook available with the current notebooks. 11 | 12 | ### Pull it from docker hub: 13 | ``` 14 | docker pull datafellas/scala-for-data-science:1.0-spark2 15 | ``` 16 | 17 | ### Run it 18 | ``` 19 | docker run --rm -it --net=host -m 8g datafellas/scala-for-data-science:1.0-spark2 bash 20 | ``` 21 | 22 | ### Start the services 23 | ``` 24 | source start.sh 25 | ``` 26 | 27 | ### Use it 28 | On Linux, go to [http://localhost:9000](http://localhost:9000). 29 | 30 | On Mac/Win, you'll probably have to use the VM's IP/Name. 31 | 32 | 33 | ## Install manually 34 | 35 | Otherwise, install [Spark Notebook](http://spark-notebook.io/), version 0.6.3 or later. You can use either Scala 2.10 or 2.11. In the commands below, we'll assume the root directory of this installation is `/path/to/spark-notebook`. Just use your real path instead. Due to a bug in library path handling, **you must start Spark Notebook from this directory**. 36 | 37 | We'll also use `/path/to/scala-for-data-science` as the path to your local clone of this Git repo. Again, substitute the real path... 38 | 39 | There is one environment variable that you **must** define, `NOTEBOOKS_DIR`. Run the following commands to define this variable and start Spark Notebook. 
40 | 41 | For Linux or OSX, use the following: 42 | ``` 43 | export NOTEBOOKS_DIR=/path/to/scala-for-data-science/notebooks 44 | cd /path/to/spark-notebook 45 | bin/spark-notebook 46 | ``` 47 | 48 | For Windows, use the following: 49 | ``` 50 | set NOTEBOOKS_DIR=c:\path\to\scala-for-data-science\notebooks 51 | cd \path\to\spark-notebook 52 | bin\spark-notebook 53 | ``` 54 | 55 | Open a browser window to [localhost:9000](http://localhost:9000). Then click the link to open the notebook [WhyScala](http://localhost:9000/notebooks/WhyScala.snb). 56 | 57 | To evaluate all the cells in a notebook, use the _Cell > Run All_ menu item. You can evaluate one cell at a time with the ▶︎ button on the toolbar, or use "shift+return". Both options run the currently-selected cell and advance to the next cell. Note that the notebook copy in the repo includes the output from a run. 58 | 59 | Grab the slides for the rest of the presentation [here](https://docs.google.com/a/data-fellas.guru/presentation/d/1d7vT3mgo4ppHXHtKRQjcVW8SsMs3PeRAkq3PHRgWKaQ/edit?usp=sharing). 60 | -------------------------------------------------------------------------------- /notebooks/Data Science with Scala.snb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata" : { 3 | "name" : "Data Science with Scala", 4 | "user_save_timestamp" : "1970-01-01T01:00:00.000Z", 5 | "auto_save_timestamp" : "1970-01-01T01:00:00.000Z", 6 | "language_info" : { 7 | "name" : "scala", 8 | "file_extension" : "scala", 9 | "codemirror_mode" : "text/x-scala" 10 | }, 11 | "trusted" : true, 12 | "customLocalRepo" : "/tmp/repo", 13 | "customRepos" : [ "spartakus % default % http://dl.bintray.com/spark-clustering-notebook/maven % maven" ], 14 | "customDeps" : [ "com.github.haifengl % smile-core % 1.0.4", "org.deeplearning4j % deeplearning4j-core % 0.4-rc3.9", "org.deeplearning4j % deeplearning4j-nlp % 0.4-rc3.9", "batchstream %% batchstream % 1.0" ], 15 | "customImports" : null, 16 | "customArgs" : null, 17 | "customSparkConf" : null 18 | }, 19 | "cells" : [ { 20 | "metadata" : { 21 | "id" : "B0A64FACA3AE41FD8A8CC61106CDB042" 22 | }, 23 | "cell_type" : "markdown", 24 | "source" : "# What is available?" 25 | }, { 26 | "metadata" : { 27 | "id" : "88B84167E7E9422C869BA7FF63F07099" 28 | }, 29 | "cell_type" : "markdown", 30 | "source" : "## Libraries: Awesome-Scala" 31 | }, { 32 | "metadata" : { 33 | "id" : "E9751047EE5F49CA8E79F8D865F5E7BB" 34 | }, 35 | "cell_type" : "markdown", 36 | "source" : "The _Awesome Scala_ project by [@lauris](https://github.com/lauris) is listing \"all\" available libraries in Scala, there is an awesome (of course) sublist for Data Science related stuff, you can check it [here](https://github.com/lauris/awesome-scala#science-and-data-analysis)." 37 | }, { 38 | "metadata" : { 39 | "id" : "FAAD385D942E4423B869F3D6A4B1EB31" 40 | }, 41 | "cell_type" : "markdown", 42 | "source" : "## Models: in Scala/JVM" 43 | }, { 44 | "metadata" : { 45 | "id" : "FBBF02F2BD594437A5A2CF68B01A29A5" 46 | }, 47 | "cell_type" : "markdown", 48 | "source" : "The JVM is getting ready for the new world it is entering into, and we can count on the multi millions users (10M based on a mix of wikipedia and other sources) to continue the addition of new models or to improve existing implementations." 
49 | }, { 50 | "metadata" : { 51 | "id" : "E93B500E5A174D268D7EF4E86FC30B1E" 52 | }, 53 | "cell_type" : "markdown", 54 | "source" : "So we will show a few of them in the following section (help wanted :-D)" 55 | }, { 56 | "metadata" : { 57 | "id" : "FBF9AD0483E6481790E8111314F7D4B8" 58 | }, 59 | "cell_type" : "markdown", 60 | "source" : "## Smile [GitHub](https://github.com/haifengl/smile)" 61 | }, { 62 | "metadata" : { 63 | "id" : "4E0845C51EEE484A8CF7C1B606E0AA4E" 64 | }, 65 | "cell_type" : "markdown", 66 | "source" : "Probably the most complete project available in Scala in terms of implementation, with more than 90 (98 at the time writing) methods/models." 67 | }, { 68 | "metadata" : { 69 | "id" : "091F7CD7443B491C8C482BC3A8D82286" 70 | }, 71 | "cell_type" : "markdown", 72 | "source" : "* Classification\n * Support Vector Machines\n * Decision Trees\n * AdaBoost\n * Gradient Boosting\n * Random Forest\n * Logistic Regression\n * Neural Networks\n * RBF Networks\n * Maximum Entropy Classifier\n * Naïve Bayesian\n * Fisher / Linear / Quadratic / Regularized Discriminant Analysis\n* Regression\n * Support Vector Regression\n * Gaussian Process\n * Regression Trees\n * Gradient Boosting\n * Random Forest\n * RBF Networks\n * Linear Regression\n * LASSO\n * Ridge Regression\n* Feature Selection\n * Genetic Algorithm based Feature Selection\n * Ensemble Learning based Feature Selection\n * Signal Noise ratio\n * Sum Squares ratio\n* Dimension Reduction\n * PCA\n * Kernel PCA\n * Probabilistic PCA\n * Generalized Hebbian Algorithm\n * Random Project\n* Model Validation\n * Cross Validation\n * Leave-One-Out Validation\n * Bootstrap\n * Confusion Matrix\n * AUC\n * Fallout\n * FDR\n * F-Score\n * Precision\n * Recall\n * Sensitivity\n * Specificity\n * MSE\n * RMSE\n * RSS\n * Absolute Deviation\n * Rand Index\n * Adjusted Rand Index\n* Clustering\n * BIRCH\n * CLARANS\n * DBScan\n * DENCLUE\n * Deterministic Annealing\n * K-Means\n * X-Means\n * G-Means\n * Neural Gas\n * Growing Neural Gas\n * Hierarchical Clustering\n * Sequential Information Bottleneck\n * Self-Organizing Maps\n * Spectral Clustering\n * Minimum Entropy Clustering\n* Association Rules\n * Frequent Itemset Mining\n * Association Rule Mining\n* Manifold learning\n * IsoMap\n * LLE\n * Laplacian Eigenmap\n* Multi-Dimensional Scaling\n * Classical MDS\n * Isotonic MDS\n * Sammon Mapping\n* Nearest Neighbor Search\n * BK-Tree\n * Cover Tree\n * KD-Tree\n * Locality-Sensitive Hashing\n* Sequence Learning\n * Hidden Markov Model\n * Conditional Random Field\n* Natural Language Processing\n * Sentence Splitter\n * Tokenizer\n * Bigram Statistical Test\n * Phrase Extractor\n * Keyword Extractor\n * Porter Stemmer\n * Lancaster Stemmer\n * POS Tagging\n * Relevance Ranking\n* Interpolation\n * Linear\n * Bilinear\n * Cubic\n * Bicubic\n * Kriging\n * Laplace\n * Shepard\n * RBF\n* Wavelet\n * Discrete Wavelet Transform\n * Wavelet Shrinkage Haar Daubechies D4 Best Localized Wavelet\n * Coiflet\n * Symmlet" 73 | }, { 74 | "metadata" : { 75 | "id" : "8C49D8D0D60B4DD5956FE2DAC8F2808F" 76 | }, 77 | "cell_type" : "markdown", 78 | "source" : "However, this is local only.\n> Haifeng Li (the main author) is providing a quick benchmark where Smile outperforms R/Python/Spark/H2O and claims too quickly that we can train the model locally only. This wouldn't work if the data is getting bigger or if we simply want to run ensembles or many algorithms -- a cluster would still be worth considering." 
79 |   }, { 80 |     "metadata" : { 81 |       "id" : "3E0F0919691340D3B3A946411500D586" 82 |     }, 83 |     "cell_type" : "markdown", 84 |     "source" : "### Example of Maximum Entropy Classifier (Maxent) using Smile" 85 |   }, { 86 |     "metadata" : { 87 |       "id" : "114AA350182B48A0ABDBFBA19C441111" 88 |     }, 89 |     "cell_type" : "markdown", 90 |     "source" : "Maximum entropy is a technique for learning probability distributions from data. \n\nIn maximum entropy models, the observed data itself is assumed to be the testable information. Maximum entropy models don't assume anything about the probability distribution other than what has been observed, and always choose the most uniform distribution subject to the observed constraints." 91 |   }, { 92 |     "metadata" : { 93 |       "id" : "9700BA7AFD894F729780D0F0BB0E8AE3" 94 |     }, 95 |     "cell_type" : "markdown", 96 |     "source" : "```scala\ndef maxent(x: Array[Array[Int]], y: Array[Int], p: Int, lambda: Double = 0.1, tol: Double = 1E-5, maxIter: Int = 500): Maxent\n```" 97 |   }, { 98 |     "metadata" : { 99 |       "id" : "446B6E618681440CAA93A50975B4C7C6" 100 |     }, 101 |     "cell_type" : "markdown", 102 |     "source" : "where `x` holds the sparse training samples. Each sample is represented by a set of sparse binary features. The features are stored in an integer array whose entries are the indices of the nonzero features. \n\nThe parameter `p` is the dimension of the feature space, and `lambda` is the regularization factor.\n\nBasically, a maximum entropy classifier is another name for multinomial logistic regression applied to categorical independent variables, \nwhich are converted to binary dummy variables. \n\nMaximum entropy models are widely used in natural language processing. Therefore, Smile's implementation assumes that **binary features** are stored in a sparse array whose entries are the indices of the nonzero features."
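  }, {
    "metadata" : {
      "id" : "00000000000000000000000000000001"
    },
    "cell_type" : "markdown",
    "source" : "For instance (a tiny made-up illustration, not part of the dataset used below), a training sample whose binary features 1, 4 and 7 are active is encoded simply as the array of those indices:"
  }, {
    "metadata" : {
      "trusted" : true,
      "input_collapsed" : false,
      "collapsed" : true,
      "id" : "00000000000000000000000000000002"
    },
    "cell_type" : "code",
    "source" : "// Illustrative only: a sparse binary sample whose nonzero features sit at indices 1, 4 and 7.\n// Every other dimension of the p-dimensional feature space is implicitly zero.\nval sampleFeatures: Array[Int] = Array(1, 4, 7)",
    "outputs" : [ ]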
103 | }, { 104 | "metadata" : { 105 | "trusted" : true, 106 | "input_collapsed" : false, 107 | "collapsed" : true, 108 | "id" : "4699144082CF486A804DCED6326066BD" 109 | }, 110 | "cell_type" : "code", 111 | "source" : "import sys.process._\nimport scala.language.postfixOps\n\n\"wget https://raw.githubusercontent.com/haifengl/smile/master/shell/src/universal/data/sequence/sparse.hyphen.6.train -O /tmp/sparse.hyphen.6.train \"!!\n\n\"wget https://raw.githubusercontent.com/haifengl/smile/master/shell/src/universal/data/sequence/sparse.hyphen.6.test -O /tmp/sparse.hyphen.6.test \"!!", 112 | "outputs" : [ ] 113 | }, { 114 | "metadata" : { 115 | "trusted" : true, 116 | "input_collapsed" : false, 117 | "collapsed" : true, 118 | "id" : "DD548922D7734331811948F0FF6946BF" 119 | }, 120 | "cell_type" : "code", 121 | "source" : "case class SmileDataset(\n x:Array[Array[Int]],\n y:Array[Int],\n p:Int\n)", 122 | "outputs" : [ ] 123 | }, { 124 | "metadata" : { 125 | "trusted" : true, 126 | "input_collapsed" : false, 127 | "collapsed" : true, 128 | "id" : "D76F34DAF0934FA29142026AA82BDBB9" 129 | }, 130 | "cell_type" : "code", 131 | "source" : "def load(resource:String):SmileDataset = {\n val xs = scala.collection.mutable.ArrayBuffer.empty[Array[Int]]\n val ys = scala.collection.mutable.ArrayBuffer.empty[Int]\n \n val head :: content = scala.io.Source.fromFile(new java.io.File(resource)).getLines.toList\n \n val Array(nseq, k, p) = head.split(\" \").map(_.trim.toInt)\n \n content.foreach{ line =>\n val seqid :: pos :: len :: featureAndY = line.split(\" \").map(_.trim.toInt).toList\n val (feature, y) = (featureAndY.init, featureAndY.last)\n xs += feature.toArray\n ys += y\n }\n \n SmileDataset(xs.toArray, ys.toArray, p)\n}", 132 | "outputs" : [ ] 133 | }, { 134 | "metadata" : { 135 | "trusted" : true, 136 | "input_collapsed" : false, 137 | "collapsed" : true, 138 | "id" : "9CC0506B62854016892AE78C94D7A9F5" 139 | }, 140 | "cell_type" : "code", 141 | "source" : "import smile.classification.Maxent\nval train = load(\"/tmp/sparse.hyphen.6.train\")\nval test = load(\"/tmp/sparse.hyphen.6.test\")\n\nval maxent = new Maxent(train.p, train.x, train.y, 0.1, 1E-5, 500);\n\nval error = (test.x zip test.y).filter{ case (x,y) => maxent.predict(x) != y }.size", 142 | "outputs" : [ ] 143 | }, { 144 | "metadata" : { 145 | "trusted" : true, 146 | "input_collapsed" : false, 147 | "collapsed" : true, 148 | "id" : "0AD1F07DFE6C45A68AE15AF4001C5A18" 149 | }, 150 | "cell_type" : "code", 151 | "source" : ":markdown \nHyphen error is $error of ${test.x.size}", 152 | "outputs" : [ ] 153 | }, { 154 | "metadata" : { 155 | "trusted" : true, 156 | "input_collapsed" : false, 157 | "collapsed" : true, 158 | "id" : "3504A2996BBB43128F077F9DC97E588D" 159 | }, 160 | "cell_type" : "code", 161 | "source" : ":markdown\nHyphen error rate = ${100.0 * error / test.x.length}", 162 | "outputs" : [ ] 163 | }, { 164 | "metadata" : { 165 | "id" : "7D761878E55148ACBC6545F0E0B80332" 166 | }, 167 | "cell_type" : "markdown", 168 | "source" : "## DeepLearning4J [GitHub](https://github.com/deeplearning4j/deeplearning4j)" 169 | }, { 170 | "metadata" : { 171 | "id" : "E43F83384B284FCD86C3B6BF392026B1" 172 | }, 173 | "cell_type" : "markdown", 174 | "source" : "Probably the Ultimate library to follow in terms of local optimization (CPU/GPU) and obviously for Deep Learning models (both local and distributed using Spark for instance)." 
175 | }, { 176 | "metadata" : { 177 | "id" : "9EFEFD14676049348478B8BB8C029C75" 178 | }, 179 | "cell_type" : "markdown", 180 | "source" : "### Example of LSTM" 181 | }, { 182 | "metadata" : { 183 | "trusted" : true, 184 | "input_collapsed" : false, 185 | "collapsed" : true, 186 | "id" : "8B2DEFC72CBB47C2A5080209171F1676" 187 | }, 188 | "cell_type" : "code", 189 | "source" : "import org.deeplearning4j.datasets.iterator._\nimport org.deeplearning4j.eval.Evaluation\nimport org.deeplearning4j.models.embeddings.loader.WordVectorSerializer\nimport org.deeplearning4j.models.embeddings.wordvectors.WordVectors\n\nimport org.deeplearning4j.nn.api.OptimizationAlgorithm\nimport org.deeplearning4j.nn.conf._\nimport org.deeplearning4j.nn.conf.layers._\nimport org.deeplearning4j.nn.multilayer.MultiLayerNetwork\nimport org.deeplearning4j.nn.weights.WeightInit\n\nimport org.nd4j.linalg.api.ndarray.INDArray\nimport org.nd4j.linalg.dataset.DataSet\nimport org.nd4j.linalg.lossfunctions.LossFunctions", 190 | "outputs" : [ ] 191 | }, { 192 | "metadata" : { 193 | "id" : "46A4DEB01C024725B8DD7D6F045F1A2D" 194 | }, 195 | "cell_type" : "markdown", 196 | "source" : "Using Word2Vec feature space" 197 | }, { 198 | "metadata" : { 199 | "trusted" : true, 200 | "input_collapsed" : false, 201 | "collapsed" : true, 202 | "id" : "458E3AC613CA433A8F1533F97DAEC48E" 203 | }, 204 | "cell_type" : "code", 205 | "source" : "val wordVectors: WordVectors =WordVectorSerializer.loadGoogleModel(WORD_VECTORS_PATH, true, false)", 206 | "outputs" : [ ] 207 | }, { 208 | "metadata" : { 209 | "id" : "7F7660A14B044A628CB8DE7115BACA23" 210 | }, 211 | "cell_type" : "markdown", 212 | "source" : "LSTM: The solution to exploding and vanishing gradients" 213 | }, { 214 | "metadata" : { 215 | "trusted" : true, 216 | "input_collapsed" : false, 217 | "collapsed" : true, 218 | "id" : "B2A8B81473564238836CEC096829FADE" 219 | }, 220 | "cell_type" : "code", 221 | "source" : "val lstmLayer:GravesLSTM = new GravesLSTM.Builder()\n .nIn(vectorSize)\n .nOut(200) // 200 hidden units\n .activation(\"softsign\")\n .build()", 222 | "outputs" : [ ] 223 | }, { 224 | "metadata" : { 225 | "id" : "7FD8C1AA70E24BDA8E8749AF93AB5545" 226 | }, 227 | "cell_type" : "markdown", 228 | "source" : "Output Layer" 229 | }, { 230 | "metadata" : { 231 | "trusted" : true, 232 | "input_collapsed" : false, 233 | "collapsed" : true, 234 | "id" : "D88CDED53ECC4A7590DC676501BFA58C" 235 | }, 236 | "cell_type" : "code", 237 | "source" : "val rnnLayer:RnnOutputLayer = new RnnOutputLayer.Builder()\n .activation(\"softmax\")\n .lossFunction(LossFunctions.LossFunction.MCXENT)\n .nIn(200)\n .nOut(2)\n .build()", 238 | "outputs" : [ ] 239 | }, { 240 | "metadata" : { 241 | "id" : "6257296C93C04E7DB5CADE461299DE36" 242 | }, 243 | "cell_type" : "markdown", 244 | "source" : "Model" 245 | }, { 246 | "metadata" : { 247 | "trusted" : true, 248 | "input_collapsed" : false, 249 | "collapsed" : true, 250 | "id" : "3A57CF5E0C344AF087F877349F9F41B0" 251 | }, 252 | "cell_type" : "code", 253 | "source" : "//Set up network configuration\nval conf = new NeuralNetConfiguration.Builder()\n .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)\n .iterations(1) // 1 iteration per mini-batch\n .updater(Updater.RMSPROP) // How to propagate the \"errors\"\n .regularization(true).l2(1e-5)\n .weightInit(WeightInit.XAVIER)\n .gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue)\n .gradientNormalizationThreshold(1.0)\n .learningRate(0.0018)\n .list()\n .layer(0, lstmLayer)\n .layer(1, 
rnnLayer)\n .pretrain(false) \n .backprop(true)\n .build()", 254 | "outputs" : [ ] 255 | }, { 256 | "metadata" : { 257 | "trusted" : true, 258 | "input_collapsed" : false, 259 | "collapsed" : true, 260 | "id" : "4473F552C0A642508471FF7D984A9924" 261 | }, 262 | "cell_type" : "code", 263 | "source" : "val net = new MultiLayerNetwork(conf)\nnet.init()", 264 | "outputs" : [ ] 265 | }, { 266 | "metadata" : { 267 | "id" : "5C730159197C42C89CE5BBE82C0E6F82" 268 | }, 269 | "cell_type" : "markdown", 270 | "source" : "Spark" 271 | }, { 272 | "metadata" : { 273 | "trusted" : true, 274 | "input_collapsed" : false, 275 | "collapsed" : true, 276 | "id" : "C7D7D7F652F043B99834E71974D82220" 277 | }, 278 | "cell_type" : "code", 279 | "source" : "import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer\nval sparkNetwork = new SparkDl4jMultiLayer(sparkContext, net)", 280 | "outputs" : [ ] 281 | }, { 282 | "metadata" : { 283 | "id" : "B62E83CBB7C44969A0E13EC42F8D4E9C" 284 | }, 285 | "cell_type" : "markdown", 286 | "source" : "Load distributed data" 287 | }, { 288 | "metadata" : { 289 | "trusted" : true, 290 | "input_collapsed" : false, 291 | "collapsed" : true, 292 | "id" : "237B2AA9241D49AA83597DF4C5DABA78" 293 | }, 294 | "cell_type" : "code", 295 | "source" : "val rdd = ???", 296 | "outputs" : [ ] 297 | }, { 298 | "metadata" : { 299 | "id" : "3292362830C546C29B03E54F9DEBD6D0" 300 | }, 301 | "cell_type" : "markdown", 302 | "source" : "Train on Spark" 303 | }, { 304 | "metadata" : { 305 | "trusted" : true, 306 | "input_collapsed" : false, 307 | "collapsed" : true, 308 | "id" : "118F303E945C4D2DA09873A3BDF0A503" 309 | }, 310 | "cell_type" : "code", 311 | "source" : "val trainedNetwork = sparkNetwork.fitDataSet(rdd)", 312 | "outputs" : [ ] 313 | }, { 314 | "metadata" : { 315 | "id" : "2471B6E4297D40458575C054905F548C" 316 | }, 317 | "cell_type" : "markdown", 318 | "source" : "## MLlib [Guide](https://spark.apache.org/docs/latest/mllib-guide.html)" 319 | }, { 320 | "metadata" : { 321 | "id" : "CF6CAE27D5A74DA68AB9667A521BE940" 322 | }, 323 | "cell_type" : "markdown", 324 | "source" : "Apache Spark's machine learning library, focused on scalability and distributed dataset." 325 | }, { 326 | "metadata" : { 327 | "id" : "7CBAE40AF08D41ED8098FD53F5BD07C1" 328 | }, 329 | "cell_type" : "markdown", 330 | "source" : "MLlib has more than 20 optimized and distributed methods/models implementation available (at the time writing)." 
331 |   }, { 332 |     "metadata" : { 333 |       "id" : "DDA47801A31A4A9C8915F40C7FA2954C" 334 |     }, 335 |     "cell_type" : "markdown", 336 |     "source" : "* Basic statistics\n * summary statistics\n * correlations\n * stratified sampling\n * hypothesis testing\n * streaming significance testing\n * random data generation\n* Classification and regression\n * linear models (SVMs, logistic regression, linear regression)\n * naive Bayes\n * decision trees\n * ensembles of trees (Random Forests and Gradient-Boosted Trees)\n * isotonic regression\n* Collaborative filtering\n * alternating least squares (ALS)\n* Clustering\n * k-means\n * Gaussian mixture\n * power iteration clustering (PIC)\n * latent Dirichlet allocation (LDA)\n * bisecting k-means\n * streaming k-means\n* Dimensionality reduction\n * singular value decomposition (SVD)\n * principal component analysis (PCA)\n* Feature extraction and transformation\n* Frequent pattern mining\n * FP-growth\n * association rules\n * PrefixSpan\n* Evaluation metrics\n* Optimization (developer)\n * stochastic gradient descent\n * limited-memory BFGS (L-BFGS)" 337 |   }, { 338 |     "metadata" : { 339 |       "id" : "2806EB46987A49798EE2A58E251317E8" 340 |     }, 341 |     "cell_type" : "markdown", 342 |     "source" : "### Example of Random Forest" 343 |   }, { 344 |     "metadata" : { 345 |       "id" : "AA31064E508843B3832F8B23078A4CDA" 346 |     }, 347 |     "cell_type" : "markdown", 348 |     "source" : "The MLlib guide is really good and presents the models with examples plus their theoretical and practical foundations.\n\nHence the following Random Forest example is shamelessly stolen from the guide :-)." 349 |   }, { 350 |     "metadata" : { 351 |       "id" : "E084EA0943974E67BE8199AA2DE7DE47" 352 |     }, 353 |     "cell_type" : "markdown", 354 |     "source" : "As usual, we first import the required classes: the model, the algorithm and a utility class to load predefined data types." 355 |   }, { 356 |     "metadata" : { 357 |       "trusted" : true, 358 |       "input_collapsed" : false, 359 |       "collapsed" : true, 360 |       "id" : "D641028ABE2B47748513B9406574D5E2" 361 |     }, 362 |     "cell_type" : "code", 363 |     "source" : "import org.apache.spark.mllib.tree.RandomForest\nimport org.apache.spark.mllib.tree.model.RandomForestModel\nimport org.apache.spark.mllib.util.MLUtils", 364 |     "outputs" : [ ] 365 |   }, { 366 |     "metadata" : { 367 |       "id" : "D9EB6B5A237E4B64877E8C7097EC82DB" 368 |     }, 369 |     "cell_type" : "markdown", 370 |     "source" : "Download the dataset" 371 |   }, { 372 |     "metadata" : { 373 |       "trusted" : true, 374 |       "input_collapsed" : false, 375 |       "collapsed" : true, 376 |       "id" : "8442F74EAC064B6A8C7FA7540AD7C756" 377 |     }, 378 |     "cell_type" : "code", 379 |     "source" : ":sh wget https://raw.githubusercontent.com/apache/spark/master/data/mllib/sample_libsvm_data.txt -O /tmp/sample_libsvm_data.txt", 380 |     "outputs" : [ ] 381 |   }, { 382 |     "metadata" : { 383 |       "id" : "50FF68EFD4B44B8983BF132B78B0FCE5" 384 |     }, 385 |     "cell_type" : "markdown", 386 |     "source" : "Load and parse the data file."
387 | }, { 388 | "metadata" : { 389 | "trusted" : true, 390 | "input_collapsed" : false, 391 | "collapsed" : true, 392 | "id" : "23E9B4A6AD6040CE96A32B2B7DB76BD4" 393 | }, 394 | "cell_type" : "code", 395 | "source" : "val data = MLUtils.loadLibSVMFile(sc, \"/tmp/sample_libsvm_data.txt\")", 396 | "outputs" : [ ] 397 | }, { 398 | "metadata" : { 399 | "id" : "A0602B0895DC460F8B5D7AC660E4C1C5" 400 | }, 401 | "cell_type" : "markdown", 402 | "source" : "Split the data into training and test sets (30% held out for testing)" 403 | }, { 404 | "metadata" : { 405 | "trusted" : true, 406 | "input_collapsed" : false, 407 | "collapsed" : true, 408 | "id" : "65FC2F1EA6884FA991ECA28A962FA127" 409 | }, 410 | "cell_type" : "code", 411 | "source" : "val splits = data.randomSplit(Array(0.7, 0.3))\nval (trainingData, testData) = (splits(0), splits(1))", 412 | "outputs" : [ ] 413 | }, { 414 | "metadata" : { 415 | "id" : "8A72D49D5A1849DE92CB23BCB60A3C81" 416 | }, 417 | "cell_type" : "markdown", 418 | "source" : "Train a RandomForest model.\n\nEmpty categoricalFeaturesInfo indicates all features are continuous." 419 | }, { 420 | "metadata" : { 421 | "trusted" : true, 422 | "input_collapsed" : false, 423 | "collapsed" : true, 424 | "id" : "A8DC8E18B8AC48B3901230FB25D5E180" 425 | }, 426 | "cell_type" : "code", 427 | "source" : "val numClasses = 2\nval categoricalFeaturesInfo = Map[Int, Int]()\nval numTrees = 3 // Use more in practice.\nval featureSubsetStrategy = \"auto\" // Let the algorithm choose.\nval impurity = \"gini\"\nval maxDepth = 4\nval maxBins = 32", 428 | "outputs" : [ ] 429 | }, { 430 | "metadata" : { 431 | "trusted" : true, 432 | "input_collapsed" : false, 433 | "collapsed" : true, 434 | "id" : "7A58234C49814CFC80B6609DD12AE88D" 435 | }, 436 | "cell_type" : "code", 437 | "source" : "val model = RandomForest.trainClassifier(trainingData, \n numClasses, \n categoricalFeaturesInfo,\n numTrees, \n featureSubsetStrategy, \n impurity, \n maxDepth, \n maxBins)", 438 | "outputs" : [ ] 439 | }, { 440 | "metadata" : { 441 | "id" : "3FC587E4A580450794BA4704EF7A6766" 442 | }, 443 | "cell_type" : "markdown", 444 | "source" : "Evaluate model on test instances and compute test error" 445 | }, { 446 | "metadata" : { 447 | "trusted" : true, 448 | "input_collapsed" : false, 449 | "collapsed" : true, 450 | "id" : "C6AF1D4F9B5F421F9F9443E5DF6B812A" 451 | }, 452 | "cell_type" : "code", 453 | "source" : "val labelAndPreds = testData.map { point =>\n val prediction = model.predict(point.features)\n (point.label, prediction)\n}", 454 | "outputs" : [ ] 455 | }, { 456 | "metadata" : { 457 | "trusted" : true, 458 | "input_collapsed" : false, 459 | "collapsed" : true, 460 | "id" : "69A330F4C5544F628C744C623AE280C5" 461 | }, 462 | "cell_type" : "code", 463 | "source" : "val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()\ntext(\"Test Error = \" + testErr) ++ html(
<br/>) ++ text(\"Learned classification forest model:\\n\") ++ html(<pre>{model.toDebugString}</pre>
)", 464 | "outputs" : [ ] 465 | }, { 466 | "metadata" : { 467 | "id" : "6EC433E37604464C863DFB3F850F9103" 468 | }, 469 | "cell_type" : "markdown", 470 | "source" : "Save and load model" 471 | }, { 472 | "metadata" : { 473 | "trusted" : true, 474 | "input_collapsed" : false, 475 | "collapsed" : true, 476 | "id" : "DE9462E48A9C4D198385A017A8EDA776" 477 | }, 478 | "cell_type" : "code", 479 | "source" : "model.save(sc, \"/tmp/myRandomForestClassificationModel\")\nval sameModel = RandomForestModel.load(sc, \"/tmp/myRandomForestClassificationModel\")", 480 | "outputs" : [ ] 481 | }, { 482 | "metadata" : { 483 | "id" : "E4855C423EA64BD18589BF8EEE93670C" 484 | }, 485 | "cell_type" : "markdown", 486 | "source" : "## Spark (Online) Clustering [GitHub](https://github.com/Spark-clustering-notebook/)" 487 | }, { 488 | "metadata" : { 489 | "id" : "5486D0036B6340EE88270E8BBDF7F48A" 490 | }, 491 | "cell_type" : "markdown", 492 | "source" : "Project started at the LIPN (University of Paris 13 lab), team leaded by Mustapha Lebbah and focusing on online algorithms (mainly classification) on distributed computing (mainly Spark)." 493 | }, { 494 | "metadata" : { 495 | "id" : "FE8C0729A38546CB83ECF9625E4725EC" 496 | }, 497 | "cell_type" : "markdown", 498 | "source" : "### Example G-Stream" 499 | }, { 500 | "metadata" : { 501 | "id" : "897E8ACAB74649338FDA86D5A01B510D" 502 | }, 503 | "cell_type" : "markdown", 504 | "source" : "Publications:\n\n1. Mohammed Ghesmoune, Mustapha Lebbah, Hanene Azzag: Micro-Batching Growing Neural Gas for Clustering Data Streams Using Spark Streaming. INNS Conference on Big Data 2015: 158-166.\n\n2. Mohammed Ghesmoune, Mustapha Lebbah, Hanene Azzag, Tarn Duong: Streaming Data Clustering using Spark Streaming: Application to Big-Data of Insurance. KDD 2016 (** Paper under submission **)." 
505 | }, { 506 | "metadata" : { 507 | "id" : "A6E0E11FCD2447098FF286A3B767846C" 508 | }, 509 | "cell_type" : "markdown", 510 | "source" : "Prepare spark streaming context" 511 | }, { 512 | "metadata" : { 513 | "trusted" : true, 514 | "input_collapsed" : false, 515 | "collapsed" : true, 516 | "id" : "694A2EB24B4D4B5CBEF45FC658145100" 517 | }, 518 | "cell_type" : "code", 519 | "source" : "import org.apache.spark.streaming.{Seconds, StreamingContext, Milliseconds}\n@transient val ssc:StreamingContext = {\n StreamingContext.getActive.foreach(_.stop(false))\n new StreamingContext(sparkContext, Milliseconds(m(\"intervalMs\").toInt))\n}", 520 | "outputs" : [ ] 521 | }, { 522 | "metadata" : { 523 | "id" : "384DC3DC983A44548EADF09121029996" 524 | }, 525 | "cell_type" : "markdown", 526 | "source" : "Init first data and connect to stream (see https://github.com/Spark-clustering-notebook/coliseum/blob/master/notebooks/coliseum/G-Stream.snb)" 527 | }, { 528 | "metadata" : { 529 | "trusted" : true, 530 | "input_collapsed" : false, 531 | "collapsed" : true, 532 | "id" : "6F14D2AF06164782A886D500EEE1C2ED" 533 | }, 534 | "cell_type" : "code", 535 | "source" : "val separator = \" \"\n// 'points2' contains the first two data-points used for initialising the model\n@transient val points2 = sc.textFile(s\"$expDir/data0\").map(x => x.split(separator).map(_.toDouble))\n\n// Create a DStreams that reads batch files from dirData\n@transient val stream = ssc.textFileStream(expDir).map(x => x.split(separator).map(_.toDouble))\n// Create a DStreams that will connect to a socket hostname:port\n//val stream = ssc.socketTextStream(\"localhost\", 9999).map(x => x.split(separator).map(_.toDouble)) //localhost or 10.32.2.153 for Teralab", 536 | "outputs" : [ ] 537 | }, { 538 | "metadata" : { 539 | "id" : "95D4808CE17242E48044C38D76D3AC77" 540 | }, 541 | "cell_type" : "markdown", 542 | "source" : "Transform data as feature vectors" 543 | }, { 544 | "metadata" : { 545 | "trusted" : true, 546 | "input_collapsed" : false, 547 | "collapsed" : true, 548 | "id" : "BC36ED5CC523493B83D96470EE159096" 549 | }, 550 | "cell_type" : "code", 551 | "source" : "stream.foreachRDD{r => \n val d = r.take(10).map(_.toList.toString)\n datalist.appendAll(d)\n } \nval labId = 2 //TODO: change -1 to -2 when you add the id to the file (last column) //-2 because the last 2 columns represent label & id\nval dim = points2.take(1)(0).size - labId", 552 | "outputs" : [ ] 553 | }, { 554 | "metadata" : { 555 | "id" : "6C07FE40130D4BAA87CB9C7FF9F45AE1" 556 | }, 557 | "cell_type" : "markdown", 558 | "source" : "Import G-Stream model" 559 | }, { 560 | "metadata" : { 561 | "trusted" : true, 562 | "input_collapsed" : false, 563 | "collapsed" : true, 564 | "id" : "E8798372293841EB828EF2DCCF0F4479" 565 | }, 566 | "cell_type" : "code", 567 | "source" : "import org.lipn.clustering.batchStream.batchStream", 568 | "outputs" : [ ] 569 | }, { 570 | "metadata" : { 571 | "id" : "277F3A7B08D74ED888D549206E7CD6B6" 572 | }, 573 | "cell_type" : "markdown", 574 | "source" : "Configure G-Stream" 575 | }, { 576 | "metadata" : { 577 | "trusted" : true, 578 | "input_collapsed" : false, 579 | "collapsed" : true, 580 | "id" : "990EDC710D5A40229303981047872201" 581 | }, 582 | "cell_type" : "code", 583 | "source" : "val decayFactor = 0.9\n val lambdaAge = 1.2\n val nbNodesToAdd = 3\n val nbWind = 5\n val DSname = \"dsname\"\n\n@transient var gstream = new batchStream()\n .setDecayFactor(decayFactor)\n .setLambdaAge(lambdaAge)\n .setMaxInsert(nbNodesToAdd)\n\n// converting each point 
into an object\n@transient val dstreamObj = stream.map( e =>\n gstream.model.pointToObjet(e, dim, labId)\n)", 584 | "outputs" : [ ] 585 | }, { 586 | "metadata" : { 587 | "id" : "923B04DCECB841278D5681F17B3967CA" 588 | }, 589 | "cell_type" : "markdown", 590 | "source" : "Init the model wiht the first 2 points" 591 | }, { 592 | "metadata" : { 593 | "trusted" : true, 594 | "input_collapsed" : false, 595 | "collapsed" : true, 596 | "id" : "45202B93E2764B1988D53EDC368BB1B4" 597 | }, 598 | "cell_type" : "code", 599 | "source" : "// initialization of the model by creating a graph of two nodes (the first 2 data-points)\ngstream.initModelObj(points2, dim)", 600 | "outputs" : [ ] 601 | }, { 602 | "metadata" : { 603 | "id" : "59FB6C24C1AF490C8DF9B3194C975649" 604 | }, 605 | "cell_type" : "markdown", 606 | "source" : "Train the model online with new data coming in the `DStream`" 607 | }, { 608 | "metadata" : { 609 | "trusted" : true, 610 | "input_collapsed" : false, 611 | "collapsed" : true, 612 | "id" : "CC9724E021EB4B3787BD16595224274E" 613 | }, 614 | "cell_type" : "code", 615 | "source" : "// training on the model\ngstream.trainOnObj(dstreamObj, gstream, outputDir+\"/\"+DSname+\"-\"+nbNodesToAdd, dim, nbWind)", 616 | "outputs" : [ ] 617 | }, { 618 | "metadata" : { 619 | "id" : "0BD046386C0940A995943F4CB978C202" 620 | }, 621 | "cell_type" : "markdown", 622 | "source" : "This will create a new dataset (File) for each batch of data (RDD) which will contain the new `protoptypes` (~ `clusters`) which are linked as a _Self Organized Map_ " 623 | }, { 624 | "metadata" : { 625 | "id" : "77BBFF186BC94896B20C0B703DA3AF69" 626 | }, 627 | "cell_type" : "markdown", 628 | "source" : "# TO BE CONTINUED" 629 | }, { 630 | "metadata" : { 631 | "id" : "A697180C935B4BE99AA332927D2B507E" 632 | }, 633 | "cell_type" : "markdown", 634 | "source" : "For instance,\n\n* H2O\n* OptiML (stanford)\n* Figaro (https://github.com/p2t2/figaro)\n* sysml?\n* Factorie (http://factorie.cs.umass.edu/)\n* OscaR (https://bitbucket.org/oscarlib/oscar/wiki/Home)\n* Chalk for NLP (https://github.com/scalanlp/chalk)\n* Bayes Scala (https://github.com/danielkorzekwa/bayes-scala)" 635 | } ], 636 | "nbformat" : 4 637 | } -------------------------------------------------------------------------------- /notebooks/Why Spark Notebook.snb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "Why Spark Notebook", 4 | "user_save_timestamp": "1970-01-01T01:00:00.000Z", 5 | "auto_save_timestamp": "1970-01-01T01:00:00.000Z", 6 | "language_info": { 7 | "name": "scala", 8 | "file_extension": "scala", 9 | "codemirror_mode": "text/x-scala" 10 | }, 11 | "trusted": true, 12 | "customLocalRepo": "/tmp/repo", 13 | "customRepos": null, 14 | "customDeps": [ 15 | "org.apache.spark %% spark-streaming-kafka-0-8 % _", 16 | "com.datastax.spark %% spark-cassandra-connector-java % 1.6.0-M1", 17 | "- org.scala-lang % _ % _" 18 | ], 19 | "customImports": null, 20 | "customArgs": null, 21 | "customSparkConf": { 22 | "spark.default.parallelism": "4" 23 | } 24 | }, 25 | "cells": [ 26 | { 27 | "metadata": { 28 | "id": "7EDB55950A0C4BA585619E38C5D106FB" 29 | }, 30 | "cell_type": "markdown", 31 | "source": "# Spark Notebook fills the gap" 32 | }, 33 | { 34 | "metadata": { 35 | "id": "FD1F7891FDCA4BF9898E243432930B07" 36 | }, 37 | "cell_type": "markdown", 38 | "source": "The Spark Notebook is the open source notebook focusing on productive and enterprise environments.\n\nFor that, it is only based on JVM components 
and has no other dependencies.\n\nScala is the only supported language, for the reasons mentioned above, and where the language itself still lacks features needed to do data science efficiently, the Spark Notebook is there to help; hereafter you'll be shown what is available." 39 |   }, 40 |   { 41 |     "metadata": { 42 |       "id": "9D4B9FB7CE8547AF8E22DEFB2F004560" 43 |     }, 44 |     "cell_type": "markdown", 45 |     "source": "Before getting started, note that the Spark Notebook is also famous for \n* its great community, built around its ~1500 stars on GitHub,\n* its very active [gitter](https://gitter.im/andypetrella/spark-notebook) channel: \n * 540+ participants and \n * 750+ messages per month.\n\n(17th August 2016)" 46 |   }, 47 |   { 48 |     "metadata": { 49 |       "id": "816B3C0AC858439AB425EE1BBCF3F553" 50 |     }, 51 |     "cell_type": "markdown", 52 |     "source": "\"Gitter\"" 53 |   }, 54 |   { 55 |     "metadata": { 56 |       "id": "4FB943930C654421BF949217B72BC5CE" 57 |     }, 58 |     "cell_type": "markdown", 59 |     "source": "---\n## Skin to control the flow visually" 60 |   }, 61 |   { 62 |     "metadata": { 63 |       "trusted": true, 64 |       "input_collapsed": false, 65 |       "collapsed": false, 66 |       "id": "E11445B1CCCE465B83A27E374CCF5BED" 67 |     }, 68 |     "cell_type": "code", 69 |     "source": ":html\n", 70 |     "outputs": [] 71 |   }, 72 |   { 73 |     "metadata": { 74 |       "id": "ADA0BB6570A04F9A832BC2A4A1827441" 75 |     }, 76 |     "cell_type": "markdown", 77 |     "source": "---\n## Multiple Spark Contexts" 78 |   }, 79 |   { 80 |     "metadata": { 81 |       "id": "24F6BEA80211494580BE0C17D62C995E" 82 |     }, 83 |     "cell_type": "markdown", 84 |     "source": "One of the most useful features brought by the Spark Notebook is its separation of the running notebooks.\n\nIndeed, each started notebook will spawn a new JVM with its own `SparkSession` instance. This allows maximal flexibility for:\n* dependencies without clashes\n* accessing different clusters\n* tuning each notebook differently\n* external scheduling (on the roadmap)" 85 |   }, 86 |   { 87 |     "metadata": { 88 |       "id": "E092B8282D04469697399B855136522D" 89 |     }, 90 |     "cell_type": "markdown", 91 |     "source": "You can easily recognize the spawned processes using `ps` (*unix* only): search for the main class `ChildProcessMain` and verify that each process contains the name of the started notebook." 92 |   }, 93 |   { 94 |     "metadata": { 95 |       "trusted": true, 96 |       "input_collapsed": false, 97 |       "collapsed": false, 98 |       "presentation": { 99 |         "tabs_state": "{\n \"tab_id\": \"#tab1592853846-0\"\n}", 100 |         "pivot_chart_state": "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [],\n \"rows\": [],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}" 101 |       }, 102 |       "id": "8BCC90AE74CF4C00811E64FD82287703" 103 |     }, 104 |     "cell_type": "code", 105 |     "source": "import sys.process._\nimport scala.language.postfixOps\n\"ps aux\" #| \"grep ChildProcessMain\" lines_!", 106 |     "outputs": [] 107 |   }, 108 |   { 109 |     "metadata": { 110 |       "id": "3638F8A33FBB4FD28408513DCBA49F75" 111 |     }, 112 |     "cell_type": "markdown", 113 |     "source": "So this notebook declares the variables `sparkSession` and `sparkContext` (with its alias `sc`)."
114 | }, 115 | { 116 | "metadata": { 117 | "trusted": true, 118 | "input_collapsed": false, 119 | "collapsed": false, 120 | "presentation": { 121 | "tabs_state": "{\n \"tab_id\": \"#tab53254654-0\"\n}", 122 | "pivot_chart_state": "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [],\n \"rows\": [],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}" 123 | }, 124 | "id": "22D8EFAF7D8C4463A064B8C90F720F63" 125 | }, 126 | "cell_type": "code", 127 | "source": "val context = (sparkSession, sparkContext, sc)", 128 | "outputs": [] 129 | }, 130 | { 131 | "metadata": { 132 | "id": "6D96AEA2A64E48E383FCEF609B635966" 133 | }, 134 | "cell_type": "markdown", 135 | "source": "---\n## Metadata" 136 | }, 137 | { 138 | "metadata": { 139 | "id": "1B7ACC1DDB8348688F7FF4D7895727B3" 140 | }, 141 | "cell_type": "markdown", 142 | "source": "A notebook has a context enricheded via its metadata, here are a few important ones." 143 | }, 144 | { 145 | "metadata": { 146 | "id": "22E00CB49C524B559F3D04E871023582" 147 | }, 148 | "cell_type": "markdown", 149 | "source": "### Spark Configuration" 150 | }, 151 | { 152 | "metadata": { 153 | "id": "E12EF4FE0D6B4BED84C15ADEAA087968" 154 | }, 155 | "cell_type": "markdown", 156 | "source": "The metadata can define a JSON object (String to String!) to declare extra configuration for spark." 157 | }, 158 | { 159 | "metadata": { 160 | "trusted": true, 161 | "input_collapsed": false, 162 | "collapsed": false, 163 | "id": "D96B3008B9E440408F5200F325A4FADA" 164 | }, 165 | "cell_type": "code", 166 | "source": ":javascript \nalert(JSON.stringify(IPython.notebook.metadata.customSparkConf, null, 2))", 167 | "outputs": [] 168 | }, 169 | { 170 | "metadata": { 171 | "trusted": true, 172 | "input_collapsed": false, 173 | "collapsed": false, 174 | "id": "79A7AA0CF09B446B883FB07360684127" 175 | }, 176 | "cell_type": "code", 177 | "source": "sparkSession.conf.get(\"spark.default.parallelism\")", 178 | "outputs": [] 179 | }, 180 | { 181 | "metadata": { 182 | "id": "99E5DFFCBA0A4110AE870FE3482667FD" 183 | }, 184 | "cell_type": "markdown", 185 | "source": "### Dependencies" 186 | }, 187 | { 188 | "metadata": { 189 | "id": "286794277F8848CB8CFF763AD90DBD84" 190 | }, 191 | "cell_type": "markdown", 192 | "source": "This notebook has injected a few dependencies from the [datastax cassandra connector](https://github.com/datastax/spark-cassandra-connector/)." 193 | }, 194 | { 195 | "metadata": { 196 | "trusted": true, 197 | "input_collapsed": false, 198 | "collapsed": false, 199 | "id": "215600D86BCB41669835C288F4FFC9CB" 200 | }, 201 | "cell_type": "code", 202 | "source": ":javascript \nalert(JSON.stringify(IPython.notebook.metadata.customDeps, null, 2))", 203 | "outputs": [] 204 | }, 205 | { 206 | "metadata": { 207 | "id": "113E0EC89B7E40198DF7AE174D067328" 208 | }, 209 | "cell_type": "markdown", 210 | "source": "Hence, this code compiles." 
211 | }, 212 | { 213 | "metadata": { 214 | "trusted": true, 215 | "input_collapsed": false, 216 | "collapsed": false, 217 | "id": "1BE83A0B7B6744FF83D8513BECC55DB9" 218 | }, 219 | "cell_type": "code", 220 | "source": "import com.datastax.spark.connector._ ", 221 | "outputs": [] 222 | }, 223 | { 224 | "metadata": { 225 | "id": "14FF64D0BFD44D00816058D236828C39" 226 | }, 227 | "cell_type": "markdown", 228 | "source": "Also it includes the kafka external modules for the current scala version (using `%%`) and the current spark version (using `_`)" 229 | }, 230 | { 231 | "metadata": { 232 | "trusted": true, 233 | "input_collapsed": false, 234 | "collapsed": false, 235 | "id": "E5CABC64B14B442D8A824994EC18C4EA" 236 | }, 237 | "cell_type": "code", 238 | "source": "import org.apache.spark.streaming.kafka", 239 | "outputs": [] 240 | }, 241 | { 242 | "metadata": { 243 | "id": "15A133EFC2AF4EE0A3E160E468961066" 244 | }, 245 | "cell_type": "markdown", 246 | "source": "We can also see that we can remove dependencies by prepending `-` to the definition. So we avoid downloading any extra libraries from the scala language." 247 | }, 248 | { 249 | "metadata": { 250 | "id": "A7EABF28DB654A27800804D88C1DC546" 251 | }, 252 | "cell_type": "markdown", 253 | "source": "### Change the metadata" 254 | }, 255 | { 256 | "metadata": { 257 | "id": "82F81FDC8E9B473D94A7B3479E7BB550" 258 | }, 259 | "cell_type": "markdown", 260 | "source": "There are a few metadata available and you can configure them from the editor in the menu: _Edit > Edit Notebook Metadata_." 261 | }, 262 | { 263 | "metadata": { 264 | "trusted": true, 265 | "input_collapsed": false, 266 | "collapsed": false, 267 | "id": "75D9790EC99F4F23ADE7434CE9DC72EC" 268 | }, 269 | "cell_type": "code", 270 | "source": ":javascript \nIPython.notebook.edit_metadata()", 271 | "outputs": [] 272 | }, 273 | { 274 | "metadata": { 275 | "id": "7F3F63814584413E817A6AE5B56C0FAD" 276 | }, 277 | "cell_type": "markdown", 278 | "source": "---\n## Logs" 279 | }, 280 | { 281 | "metadata": { 282 | "id": "D4B5803CA95D4E6B9497CA86DE6C866F" 283 | }, 284 | "cell_type": "markdown", 285 | "source": "Checking logs is always painful when using a notebook since this is simply a web client on the remote REPL in the server. \n\nHence the logs are quite far, or even worse inaccessible!\n\nSo, the spark notebook will forwards **all logs using slf4j** to the browser console → go check it, use the `F12` key and open the _console_ tab!" 
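  },
  {
    "metadata": {
      "id": "00000000000000000000000000000003"
    },
    "cell_type": "markdown",
    "source": "For instance (an illustrative sketch; the logger name `notebook-demo` is arbitrary), any message emitted through an slf4j logger in a cell should end up in the browser's console:"
  },
  {
    "metadata": {
      "trusted": true,
      "input_collapsed": false,
      "collapsed": true,
      "id": "00000000000000000000000000000004"
    },
    "cell_type": "code",
    "source": "import org.slf4j.LoggerFactory\n\n// any slf4j logger will do, the name is arbitrary\nval demoLogger = LoggerFactory.getLogger(\"notebook-demo\")\n\n// this message should show up in the browser's javascript console (F12)\ndemoLogger.info(\"Hello from the Spark Notebook logs\")",
    "outputs": []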
286 | }, 287 | { 288 | "metadata": { 289 | "id": "A187C1B705A644B794DDEDB33AE7B223" 290 | }, 291 | "cell_type": "markdown", 292 | "source": "---\n## Side pane" 293 | }, 294 | { 295 | "metadata": { 296 | "id": "9870E3C823B048CC8D558F2B538CC83F" 297 | }, 298 | "cell_type": "markdown", 299 | "source": "In the **View** menu, you'll can open the side pane which contains many interesting panels:\n* `terms` : a table listing the defined _functions_, _variables_ and _types_ !\n* `error logs` : displaying and bringing back any errors thrown in the server\n* `chat room` : a fancy chat room available for the current notebook (see below in the synchronized section)" 300 | }, 301 | { 302 | "metadata": { 303 | "trusted": true, 304 | "input_collapsed": false, 305 | "collapsed": false, 306 | "id": "90AF0CFA143E4F6283502273EF5DB351" 307 | }, 308 | "cell_type": "code", 309 | "source": ":javascript\njQuery('a#toggle-sidebar').click()", 310 | "outputs": [] 311 | }, 312 | { 313 | "metadata": { 314 | "id": "3F098E2A54C446E486EE544C75C756F9" 315 | }, 316 | "cell_type": "markdown", 317 | "source": "---\n## Plotting" 318 | }, 319 | { 320 | "metadata": { 321 | "id": "39F23F6B83C6465D8DE2CD5BD18C2C41" 322 | }, 323 | "cell_type": "markdown", 324 | "source": "There exist many predefined `Chart` that you can use directly on any kind of **Scala** container that can be iterated." 325 | }, 326 | { 327 | "metadata": { 328 | "id": "099B21AF5E2B468E8C7A65C4DD2AAC66" 329 | }, 330 | "cell_type": "markdown", 331 | "source": "If the last statement of a cell isn't an assignment or a definition, then the spark notebook will try to plot it the best way it can automatically." 332 | }, 333 | { 334 | "metadata": { 335 | "trusted": true, 336 | "input_collapsed": false, 337 | "collapsed": false, 338 | "id": "3B03DFAD401D44B08D0C8592C5246AF0" 339 | }, 340 | "cell_type": "code", 341 | "source": "case class Example(id:Int, category:String, value:Long, advanced:Boolean)\nimport scala.util.Random._\n\nval categories = List.fill(5)(List.fill(10)(nextPrintableChar).mkString)\ndef category:String = shuffle(categories).head\nval examples = List.fill(100)(Example(nextInt(2000), category, nextLong, nextBoolean))", 342 | "outputs": [] 343 | }, 344 | { 345 | "metadata": { 346 | "id": "9B375D9DBD51403E8BE8E838F1734300" 347 | }, 348 | "cell_type": "markdown", 349 | "source": "The above cell doesn't plot anything since it terminates with a assignement." 350 | }, 351 | { 352 | "metadata": { 353 | "trusted": true, 354 | "input_collapsed": false, 355 | "collapsed": true, 356 | "presentation": { 357 | "tabs_state": "{\n \"tab_id\": \"#tab1337288181-1\"\n}", 358 | "pivot_chart_state": "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [\n \"category\"\n ],\n \"rows\": [\n \"id\"\n ],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}" 359 | }, 360 | "id": "2E61401D7F6B4DCEA0D5F52DE96FB973" 361 | }, 362 | "cell_type": "code", 363 | "source": "examples", 364 | "outputs": [] 365 | }, 366 | { 367 | "metadata": { 368 | "id": "0B51A8C208074B7C9ABADED8494A61F5" 369 | }, 370 | "cell_type": "markdown", 371 | "source": "Now we have a `TableChart` and a `PivotChart` tabs for the data, which we can use to have a better feeling of the data." 
372 | }, 373 | { 374 | "metadata": { 375 | "id": "9588CA4E38C74D1C8BD20F16CB218D55" 376 | }, 377 | "cell_type": "markdown", 378 | "source": "We can of course create them ourselves:" 379 | }, 380 | { 381 | "metadata": { 382 | "trusted": true, 383 | "input_collapsed": false, 384 | "collapsed": false, 385 | "presentation": { 386 | "pivot_chart_state": "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [],\n \"rows\": [],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}" 387 | }, 388 | "id": "595E9615FC9A410AA16F64B33FF80912" 389 | }, 390 | "cell_type": "code", 391 | "source": "TableChart(examples)", 392 | "outputs": [] 393 | }, 394 | { 395 | "metadata": { 396 | "id": "670BE2E3168D47298CC6B17A30A4FF41" 397 | }, 398 | "cell_type": "markdown", 399 | "source": "### Grouping plots" 400 | }, 401 | { 402 | "metadata": { 403 | "id": "D2895E0887E14A588731F1FAB389B875" 404 | }, 405 | "cell_type": "markdown", 406 | "source": "Among available charts, you have for instance the pretty common ones like:\n* `LineChart`\n* `ScatterChart`\n* `BarChart`\n\nWhich accept at least two other parameters: \n* `fields`: the two field names to use to plot \n* `groupField`: the field used to group the data" 407 | }, 408 | { 409 | "metadata": { 410 | "trusted": true, 411 | "input_collapsed": false, 412 | "collapsed": false, 413 | "id": "A993AFEA7F2945A184B760EE6251EBAE" 414 | }, 415 | "cell_type": "code", 416 | "source": "LineChart(examples, fields=Some((\"id\", \"value\")), groupField=Some(\"advanced\"))", 417 | "outputs": [] 418 | }, 419 | { 420 | "metadata": { 421 | "trusted": true, 422 | "input_collapsed": false, 423 | "collapsed": false, 424 | "id": "A993AFEA7F2945A184B760EE6251EBAE" 425 | }, 426 | "cell_type": "code", 427 | "source": "ScatterChart(examples, fields=Some((\"id\", \"value\")), groupField=Some(\"advanced\"))", 428 | "outputs": [] 429 | }, 430 | { 431 | "metadata": { 432 | "trusted": true, 433 | "input_collapsed": false, 434 | "collapsed": false, 435 | "id": "A993AFEA7F2945A184B760EE6251EBAE" 436 | }, 437 | "cell_type": "code", 438 | "source": "BarChart(examples, fields=Some((\"id\", \"value\")), groupField=Some(\"advanced\"))", 439 | "outputs": [] 440 | }, 441 | { 442 | "metadata": { 443 | "id": "4153FEFD93464E86A0F7FF113FB1139F" 444 | }, 445 | "cell_type": "markdown", 446 | "source": "---\n## Graphs" 447 | }, 448 | { 449 | "metadata": { 450 | "id": "F91D80C4BFD0408DA915051BC3AEE740" 451 | }, 452 | "cell_type": "markdown", 453 | "source": "Graph is generally a common way to represent data where connections matter. 
Hence the Spark Notebook defines an API easing the definition of `Node` and `Edge`.\n\n* `Graph[T]`: abstract class defining a graph component with an id of type `T`, a value of type `Any` and a color\n* `Node[T]`: defines a node as a circle which can be specified a radius and its position ($x$, $y$) (initial or static if it's fixed)\n* `Edge[T]`: defines an edge using the ids of both ends" 454 | }, 455 | { 456 | "metadata": { 457 | "trusted": true, 458 | "input_collapsed": false, 459 | "collapsed": false, 460 | "id": "5B96F1C40FA74E55ADBB6071A1B2FB0C" 461 | }, 462 | "cell_type": "code", 463 | "source": "case class GraphExample(id:Int, cluster:Char, value:Long)\nval clusters = ('A' to 'D').toList\nval cluCol = clusters.zip(List(\"#000\", \"#478\", \"#127\", \"#984\", \"#F5A\")).toMap\nval gexamples = List.tabulate(10, 4)((i,j) => GraphExample(i*4+j, clusters(j), nextLong)).flatten\n\nval nodes = gexamples.map(e => notebook.front.widgets.magic.Node(e.id, e, cluCol(e.cluster), 5))\n\nval clustered = gexamples.groupBy(_.cluster).toList\nval connectedClusters = clustered.flatMap { case (c, cl) => \n for {\n a <- cl\n b <- cl if a != b\n } yield notebook.front.widgets.magic.Edge[Int](400+nextInt(400)+a.id+b.id, (a.id, b.id), \"intra\", \"red\")\n }", 464 | "outputs": [] 465 | }, 466 | { 467 | "metadata": { 468 | "trusted": true, 469 | "input_collapsed": false, 470 | "collapsed": false, 471 | "id": "E31A5ACA4F8D442383289F17C620C5B4" 472 | }, 473 | "cell_type": "code", 474 | "source": "val singleConnections = {\n val s = gexamples.take(4)\n \n for (a <- s; b <- s if a != b) \n yield Edge(800+nextInt(400)+a.id+b.id, (a.id, b.id), \"inter\", \"green\")\n}\n\nval all = nodes ::: connectedClusters ::: singleConnections", 475 | "outputs": [] 476 | }, 477 | { 478 | "metadata": { 479 | "trusted": true, 480 | "input_collapsed": false, 481 | "collapsed": false, 482 | "presentation": { 483 | "tabs_state": "{\n \"tab_id\": \"#tab1249479791-0\"\n}", 484 | "pivot_chart_state": "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [],\n \"rows\": [],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}" 485 | }, 486 | "id": "1F073EEB99CE4B1781AD74C59E208805" 487 | }, 488 | "cell_type": "code", 489 | "source": "GraphChart(all, maxPoints = 1000, sizes=(600, 600))", 490 | "outputs": [] 491 | }, 492 | { 493 | "metadata": { 494 | "id": "E14D74FA3B164544A79D22800BB84BDD" 495 | }, 496 | "cell_type": "markdown", 497 | "source": "---\n## Geo charts" 498 | }, 499 | { 500 | "metadata": { 501 | "id": "E78EF394884248DF87A887AFE627B13F" 502 | }, 503 | "cell_type": "markdown", 504 | "source": "There are two types of geo charts:\n* `GeoPointsChart` for simple points lat long points\n* `GeoChart` for _GeoJSON_ or _opengis_ data\n" 505 | }, 506 | { 507 | "metadata": { 508 | "id": "41C5ABBEC5754FC58AD56ADD7E799E5C" 509 | }, 510 | "cell_type": "markdown", 511 | "source": "### GeoPointsChart" 512 | }, 513 | { 514 | "metadata": { 515 | "id": "BB38DC199B5E4DAA9ECFFBF4FBD9B3AB" 516 | }, 517 | "cell_type": "markdown", 518 | "source": "Let's load some airports data with latitude and longitude coordinates" 519 | }, 520 | { 521 | "metadata": { 522 | "trusted": true, 523 | "input_collapsed": false, 524 | "collapsed": false, 525 | "id": "B23EDAF0D8204B7D845248EE24E20F30" 526 | }, 527 | "cell_type": "code", 528 | "source": "val root = sys.env(\"NOTEBOOKS_DIR\")\nval 
airportsDF = sparkSession.read.json(s\"$root/notebooks/airports.json\")\nairportsDF.cache\nairportsDF", 529 | "outputs": [] 530 | }, 531 | { 532 | "metadata": { 533 | "trusted": true, 534 | "input_collapsed": false, 535 | "collapsed": false, 536 | "presentation": { 537 | "tabs_state": "{\n \"tab_id\": \"#tab1529529486-0\"\n}", 538 | "pivot_chart_state": "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [],\n \"rows\": [],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}" 539 | }, 540 | "id": "1707B838D5494D1F8476AADCA120EFC5" 541 | }, 542 | "cell_type": "code", 543 | "source": "val statsDF = airportsDF.groupBy(\"state\").count.orderBy($\"count\".desc).limit(5)", 544 | "outputs": [] 545 | }, 546 | { 547 | "metadata": { 548 | "id": "0A7F973B66494A349A505FAE9A4454ED" 549 | }, 550 | "cell_type": "markdown", 551 | "source": "Convert to Dataset (for the fun)." 552 | }, 553 | { 554 | "metadata": { 555 | "trusted": true, 556 | "input_collapsed": false, 557 | "collapsed": false, 558 | "id": "88DD55147E984B7F98349BFA935C2572" 559 | }, 560 | "cell_type": "code", 561 | "source": "case class StateStat(state:String, count:Long)", 562 | "outputs": [] 563 | }, 564 | { 565 | "metadata": { 566 | "trusted": true, 567 | "input_collapsed": false, 568 | "collapsed": false, 569 | "id": "575DBAFD9887415EB6BBD0D1D63B8194" 570 | }, 571 | "cell_type": "code", 572 | "source": "statsDF.as[StateStat]", 573 | "outputs": [] 574 | }, 575 | { 576 | "metadata": { 577 | "id": "335F1C282CE94D4B88DD6A2D3D5C3651" 578 | }, 579 | "cell_type": "markdown", 580 | "source": "Plot the dataframe with dedicated colors for each state" 581 | }, 582 | { 583 | "metadata": { 584 | "trusted": true, 585 | "input_collapsed": false, 586 | "collapsed": false, 587 | "id": "FF8F94DA1063463A8A38978BABA5A782" 588 | }, 589 | "cell_type": "code", 590 | "source": "import org.apache.spark.sql.functions._", 591 | "outputs": [] 592 | }, 593 | { 594 | "metadata": { 595 | "trusted": true, 596 | "input_collapsed": false, 597 | "collapsed": false, 598 | "id": "6B6635CEBAC349EC87BAC809E5F21F78" 599 | }, 600 | "cell_type": "code", 601 | "source": "def forStates[A](xs:List[A]) = when($\"state\" === \"AK\", xs(0))\n .when($\"state\" === \"TX\", xs(1))\n .when($\"state\" === \"CA\", xs(2))\n .when($\"state\" === \"OK\", xs(3))\n .when($\"state\" === \"OH\", xs(4))\n .otherwise(xs(5))\nval airportsDFWithStyles = airportsDF.withColumn(\"r\", forStates(List(10,9,8,7,6,1)))\n .withColumn(\"c\", forStates(List(\"red\",\"orange\",\"blue\",\"green\",\"yellow\",\"white\")))\nGeoPointsChart(airportsDFWithStyles, latLonFields=Some((\"lat\", \"long\")), rField = Some(\"r\"), colorField = Some(\"c\"))", 602 | "outputs": [] 603 | }, 604 | { 605 | "metadata": { 606 | "id": "B4273E6F1E8A4ADA8866AB9A067533E4" 607 | }, 608 | "cell_type": "markdown", 609 | "source": "---\n## GeoChart" 610 | }, 611 | { 612 | "metadata": { 613 | "id": "0E6BA873EB35476B8A6CC133F42ACDF4" 614 | }, 615 | "cell_type": "markdown", 616 | "source": "Fetch some data on the web about parks and gardens" 617 | }, 618 | { 619 | "metadata": { 620 | "trusted": true, 621 | "input_collapsed": false, 622 | "collapsed": false, 623 | "id": "ACD54EC9C7414EE98C568AD260A377ED" 624 | }, 625 | "cell_type": "code", 626 | "source": ":sh wget http://data.cyc.opendata.arcgis.com/datasets/57fa576e5e8149b0b744f768e01e5ce1_0.geojson -O 
Parks_and_Gardens.geojson", 627 | "outputs": [] 628 | }, 629 | { 630 | "metadata": { 631 | "id": "7C6FD417F57A47098A685A3E4DDC721C" 632 | }, 633 | "cell_type": "markdown", 634 | "source": "Parse it as GeoJSON using provided `widgets.parseGeoJSON`" 635 | }, 636 | { 637 | "metadata": { 638 | "trusted": true, 639 | "input_collapsed": false, 640 | "collapsed": false, 641 | "id": "750EC1E117C24C28851D00C89298F5A8" 642 | }, 643 | "cell_type": "code", 644 | "source": "val geoJSONRepr = widgets.parseGeoJSON(scala.io.Source.fromFile(\"Parks_and_Gardens.geojson\").getLines.mkString(\"\"))", 645 | "outputs": [] 646 | }, 647 | { 648 | "metadata": { 649 | "id": "A4544FA65C0F480EB4FFAF28E27F5DBB" 650 | }, 651 | "cell_type": "markdown", 652 | "source": "Fetch some more vectorial information of the same area." 653 | }, 654 | { 655 | "metadata": { 656 | "trusted": true, 657 | "input_collapsed": false, 658 | "collapsed": false, 659 | "id": "012789D6569C4DC6A15579070B0B17B6" 660 | }, 661 | "cell_type": "code", 662 | "source": ":sh wget http://data.cyc.opendata.arcgis.com/datasets/9b212b7af275438ca9088ff868bda139_9.geojson -O airqual.geojson", 663 | "outputs": [] 664 | }, 665 | { 666 | "metadata": { 667 | "id": "86A9DEA8BDD843B08DC2985638799232" 668 | }, 669 | "cell_type": "markdown", 670 | "source": "And parse it..." 671 | }, 672 | { 673 | "metadata": { 674 | "trusted": true, 675 | "input_collapsed": false, 676 | "collapsed": false, 677 | "id": "E5314FF973454E0B96877C2A6690B914" 678 | }, 679 | "cell_type": "code", 680 | "source": "val ng = widgets.parseGeoJSON(scala.io.Source.fromFile(\"airqual.geojson\").getLines.mkString(\"\"))", 681 | "outputs": [] 682 | }, 683 | { 684 | "metadata": { 685 | "id": "9F1F3B348DEA40BFA087795342006AEE" 686 | }, 687 | "cell_type": "markdown", 688 | "source": "Create a `GeoChart` instance on `GeoJSON` representation of the first dataset." 689 | }, 690 | { 691 | "metadata": { 692 | "trusted": true, 693 | "input_collapsed": false, 694 | "collapsed": false, 695 | "id": "273E8A681B4942838D2E0DB37C215F9E" 696 | }, 697 | "cell_type": "code", 698 | "source": "val gc = GeoChart(Seq(geoJSONRepr), sizes=(800, 800))\ngc", 699 | "outputs": [] 700 | }, 701 | { 702 | "metadata": { 703 | "id": "898788F161C94A898F20FD85328526C8" 704 | }, 705 | "cell_type": "markdown", 706 | "source": "We can now add the linear features into the same chart using the helpful function `addAndApply` which adds information to the existing chart." 707 | }, 708 | { 709 | "metadata": { 710 | "trusted": true, 711 | "input_collapsed": false, 712 | "collapsed": false, 713 | "id": "1E08A11C501C4E618FDFD7E97F340D79" 714 | }, 715 | "cell_type": "code", 716 | "source": "gc.addAndApply(Seq(ng))", 717 | "outputs": [] 718 | }, 719 | { 720 | "metadata": { 721 | "id": "2DDE82591B3E4DE1B09C71CC9292E148" 722 | }, 723 | "cell_type": "markdown", 724 | "source": "---\n## Fancy charts" 725 | }, 726 | { 727 | "metadata": { 728 | "id": "415502CC4E5E44D7BF7F08978A4C80BD" 729 | }, 730 | "cell_type": "markdown", 731 | "source": "### Radar" 732 | }, 733 | { 734 | "metadata": { 735 | "id": "F164A3ACA35A46638B4E82313A59E88E" 736 | }, 737 | "cell_type": "markdown", 738 | "source": "Let's grab some data from http://www.basketball-reference.com/teams/SAS/2016.html (31st May 2016)." 
739 | }, 740 | { 741 | "metadata": { 742 | "trusted": true, 743 | "input_collapsed": false, 744 | "collapsed": false, 745 | "id": "DA236FFF7873453397F47AED00A55C87" 746 | }, 747 | "cell_type": "code", 748 | "source": "case class TeamMember(Player:String, Age:Int, FG_pc:Double, _3P_pc:Double, _2P_pc:Double, eFG_pc:Double, FT_pc:Double)\nval team = \n s\"\"\"\n 1\tKawhi Leonard\t24\t72\t72\t2380\t551\t1090\t.506\t129\t291\t.443\t422\t799\t.528\t.565\t292\t334\t.874\t95\t398\t493\t186\t128\t71\t105\t133\t1523\n 2\tLaMarcus Aldridge\t30\t74\t74\t2261\t536\t1045\t.513\t0\t16\t.000\t536\t1029\t.521\t.513\t259\t302\t.858\t176\t456\t632\t110\t38\t81\t99\t151\t1331\n 3\tDanny Green\t28\t79\t79\t2062\t211\t561\t.376\t116\t349\t.332\t95\t212\t.448\t.480\t34\t46\t.739\t48\t255\t303\t141\t79\t64\t75\t141\t572\n 4\tTony Parker\t33\t72\t72\t1980\t350\t710\t.493\t27\t65\t.415\t323\t645\t.501\t.512\t130\t171\t.760\t17\t159\t176\t379\t54\t11\t131\t114\t857\n 5\tPatrick Mills\t27\t81\t3\t1662\t260\t612\t.425\t123\t320\t.384\t137\t292\t.469\t.525\t47\t58\t.810\t27\t131\t158\t226\t59\t6\t76\t102\t690\n 6\tTim Duncan\t39\t61\t60\t1536\t215\t441\t.488\t0\t2\t.000\t215\t439\t.490\t.488\t92\t131\t.702\t115\t332\t447\t163\t47\t78\t90\t125\t522\n 7\tDavid West\t35\t78\t19\t1404\t244\t448\t.545\t3\t7\t.429\t241\t441\t.546\t.548\t63\t80\t.788\t72\t237\t309\t143\t44\t55\t68\t142\t554\n 8\tBoris Diaw\t33\t76\t4\t1386\t202\t383\t.527\t25\t69\t.362\t177\t314\t.564\t.560\t56\t76\t.737\t58\t175\t233\t176\t26\t21\t97\t102\t485\n 9\tKyle Anderson\t22\t78\t11\t1245\t138\t295\t.468\t12\t37\t.324\t126\t258\t.488\t.488\t62\t83\t.747\t25\t219\t244\t123\t60\t29\t59\t97\t350\n 10\tManu Ginobili\t38\t58\t0\t1134\t197\t435\t.453\t70\t179\t.391\t127\t256\t.496\t.533\t91\t112\t.813\t26\t120\t146\t177\t66\t11\t99\t99\t555\n 11\tJonathon Simmons\t26\t55\t2\t813\t122\t242\t.504\t18\t47\t.383\t104\t195\t.533\t.541\t69\t92\t.750\t16\t80\t96\t58\t24\t5\t53\t103\t331\n 12\tBoban Marjanovic\t27\t54\t4\t508\t105\t174\t.603\t0\t0\t.0\t105\t174\t.603\t.603\t87\t114\t.763\t73\t121\t194\t21\t12\t23\t29\t54\t297\n 13\tRasual Butler\t36\t46\t0\t432\t49\t104\t.471\t15\t49\t.306\t34\t55\t.618\t.543\t11\t16\t.688\t3\t53\t56\t24\t13\t23\t8\t11\t124\n 14\tKevin Martin\t32\t16\t1\t261\t30\t85\t.353\t11\t33\t.333\t19\t52\t.365\t.418\t28\t30\t.933\t4\t25\t29\t12\t9\t2\t13\t15\t99\n 15\tRay McCallum\t24\t31\t3\t256\t27\t67\t.403\t5\t16\t.313\t22\t51\t.431\t.440\t9\t10\t.900\t6\t25\t31\t33\t5\t4\t11\t14\t68\n 16\tMatt Bonner\t35\t30\t2\t206\t29\t57\t.509\t15\t34\t.441\t14\t23\t.609\t.640\t3\t4\t.750\t3\t24\t27\t9\t6\t1\t3\t16\t76\n 17\tAndre Miller\t39\t13\t4\t181\t23\t48\t.479\t1\t4\t.250\t22\t44\t.500\t.490\t9\t13\t.692\t6\t21\t27\t29\t7\t0\t12\t14\t56\n \"\"\".trim.split(\"\\n\").map(s => s.trim.split(\"\\t\").drop(1).toList).map(x => (x.head, x(1).trim.toInt) → x.drop(2).filter(_.startsWith(\".\"))\n .map(_.trim.toDouble * 100)).map { case ((p, a), stats) =>\n TeamMember(p, a, stats(0), stats(1), stats(2), stats(3), stats(4))\n }", 749 | "outputs": [] 750 | }, 751 | { 752 | "metadata": { 753 | "trusted": true, 754 | "input_collapsed": false, 755 | "collapsed": false, 756 | "id": "CAC0438C302A4C5388A9B71E5C246FD8" 757 | }, 758 | "cell_type": "code", 759 | "source": "RadarChart(shuffle(team.toList).take(5), labelField=Some(\"Player\"), sizes=(800, 600))", 760 | "outputs": [] 761 | }, 762 | { 763 | "metadata": { 764 | "id": "75EF85B5D6534BCF873488D9D092F54A" 765 | }, 766 | "cell_type": "markdown", 767 | "source": "### Pivot" 768 | }, 769 | { 770 | "metadata": 
{ 771 | "trusted": true, 772 | "input_collapsed": false, 773 | "collapsed": false, 774 | "presentation": { 775 | "pivot_chart_state": "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [\n \"_2P_pc\",\n \"_3P_pc\"\n ],\n \"rows\": [\n \"Player\"\n ],\n \"vals\": [\n \"_3P_pc\"\n ],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Sum\",\n \"rendererName\": \"Line Chart\"\n}" 776 | }, 777 | "id": "259688ADDF334ACE85203BF5AE070ED7" 778 | }, 779 | "cell_type": "code", 780 | "source": "PivotChart(team)", 781 | "outputs": [] 782 | }, 783 | { 784 | "metadata": { 785 | "id": "77F0A0B2C4AB4A5B8AB689EB14AF40D8" 786 | }, 787 | "cell_type": "markdown", 788 | "source": "### Parallel coordinates" 789 | }, 790 | { 791 | "metadata": { 792 | "trusted": true, 793 | "input_collapsed": false, 794 | "collapsed": false, 795 | "id": "536D63D7B3A64281BD62B95E7FBB4E76" 796 | }, 797 | "cell_type": "code", 798 | "source": "ParallelCoordChart(team, sizes=(800, 500))", 799 | "outputs": [] 800 | }, 801 | { 802 | "metadata": { 803 | "id": "CD0C49DF563340A38E82471242EF4F32" 804 | }, 805 | "cell_type": "markdown", 806 | "source": "### Timeseries" 807 | }, 808 | { 809 | "metadata": { 810 | "trusted": true, 811 | "input_collapsed": false, 812 | "collapsed": false, 813 | "id": "76CC0E2ED70D458689FF7F75297D8E77" 814 | }, 815 | "cell_type": "code", 816 | "source": ":sh wget http://www.ncdc.noaa.gov/cag/time-series/global/globe/land_ocean/p12/12/1880-2015.csv -O /tmp/1880-2015.csv", 817 | "outputs": [] 818 | }, 819 | { 820 | "metadata": { 821 | "trusted": true, 822 | "input_collapsed": false, 823 | "collapsed": false, 824 | "id": "52DBD8124E374C25A4F336A6A26D245C" 825 | }, 826 | "cell_type": "code", 827 | "source": "import java.util.Calendar\nimport java.util.Calendar._\nval cal = Calendar.getInstance\ncal.set(DAY_OF_MONTH, 0)\ncal.set(HOUR, 0)\ncal.set(MINUTE, 0)\ncal.set(SECOND, 0)\ncal.set(MILLISECOND, 0)\nval ts = scala.io.Source.fromFile(new File(\"/tmp/1880-2015.csv\")).getLines.drop(4)\n .map(_.split(\",\").toList.map(_.trim))\n .map{case List(y,c) => (y.take(4).toInt, y.drop(4).take(2).dropWhile(_ == '0').toInt-1, c.toDouble)}\n .map{ case (y, m, c) => \n cal.set(YEAR, y)\n cal.set(MONTH, m)\n cal.getTime → c\n }.toList", 828 | "outputs": [] 829 | }, 830 | { 831 | "metadata": { 832 | "trusted": true, 833 | "input_collapsed": false, 834 | "collapsed": false, 835 | "id": "2B91343B09F041B19FD683E0C04F99D8" 836 | }, 837 | "cell_type": "code", 838 | "source": "val tc = TimeseriesChart(ts)\ntc", 839 | "outputs": [] 840 | }, 841 | { 842 | "metadata": { 843 | "id": "7C30152109AD4727827599D6949592F0" 844 | }, 845 | "cell_type": "markdown", 846 | "source": "---\n## Everything is Dynamic and Reactive" 847 | }, 848 | { 849 | "metadata": { 850 | "id": "3AADA3697ABE44BF85FFB47209E06036" 851 | }, 852 | "cell_type": "markdown", 853 | "source": "Since data can come live in a system or you want to log vizualy some events or perhaps you need to have two visual components to interact... what you don't want to do is to write the html, js, server code and who knows what else you'll need to master...\n\nFor that, the spark notebook comes with dynamicity of charts and most (if not all) components can be listened and react to events." 
854 | }, 855 | { 856 | "metadata": { 857 | "id": "83B7B4F029324AADA94E34497C395F35" 858 | }, 859 | "cell_type": "markdown", 860 | "source": "### Dynamic Line Chart" 861 | }, 862 | { 863 | "metadata": { 864 | "trusted": true, 865 | "input_collapsed": false, 866 | "collapsed": false, 867 | "id": "CE563B405582434AAFA53E4C4AE8C893" 868 | }, 869 | "cell_type": "code", 870 | "source": "val tsH :: tsS = ts.sliding(100, 100).toList\nval dynTC = TimeseriesChart(tsH, maxPoints = ts.size)\ndynTC", 871 | "outputs": [] 872 | }, 873 | { 874 | "metadata": { 875 | "trusted": true, 876 | "input_collapsed": false, 877 | "collapsed": true, 878 | "id": "FA0E02EBBD7D467385D7C17B4708DFBB" 879 | }, 880 | "cell_type": "code", 881 | "source": "var cont = true\nnew Thread() {\n override def run = \n tsS.foreach { l =>\n if (cont) {\n Thread.sleep(1000)\n dynTC.addAndApply(l)\n }\n }\n}.start", 882 | "outputs": [] 883 | }, 884 | { 885 | "metadata": { 886 | "id": "9BD0C8C4788049C78D4C6E90059E3A83" 887 | }, 888 | "cell_type": "markdown", 889 | "source": "### Components" 890 | }, 891 | { 892 | "metadata": { 893 | "trusted": true, 894 | "input_collapsed": false, 895 | "collapsed": true, 896 | "id": "C3A5386580284B5D9C8D4BE907F6FA8C" 897 | }, 898 | "cell_type": "code", 899 | "source": "val rteam = shuffle(team.toList).take(5)\nval dd = new DropDown(\"All\" :: rteam.map(_.Player))\nval rc = RadarChart(rteam, labelField=Some(\"Player\"), sizes=(800, 600))\nval bout = out\n\ndd.selected --> Connection.fromObserver { p =>\n bout(p + \" is selected\")\n rc.applyOn(rc.originalData.filter(_.Player == p || p == \"All\"))\n}\n\ndd ++ bout ++ rc", 900 | "outputs": [] 901 | }, 902 | { 903 | "metadata": { 904 | "id": "42D57AE9A76F4FA1A114BA95C2E1FADB" 905 | }, 906 | "cell_type": "markdown", 907 | "source": "---\n## Synchronization" 908 | }, 909 | { 910 | "metadata": { 911 | "id": "7460856EFF64425A83DD0DD4B5814360" 912 | }, 913 | "cell_type": "markdown", 914 | "source": "Oh... notebooks are synchronized!\n\nOpen another browser window and relaunch the timeseries example." 915 | }, 916 | { 917 | "metadata": { 918 | "id": "593AD5B17B854593B20EA88F8B18E338" 919 | }, 920 | "cell_type": "markdown", 921 | "source": "---\n## Create new chart type... live" 922 | }, 923 | { 924 | "metadata": { 925 | "id": "E51CB610568E4936ADA024549386E8E3" 926 | }, 927 | "cell_type": "markdown", 928 | "source": "If you want/need to exerce your js fu, you can always use the `Chart` (for instance) API to create new dynamic widgets types.\n\nIn the following, we'll create a widget that can plot duration bars based for given operations (only a name):\n* a `js` string which is the javascript to execute for the new chart. 
It:\n * has to be a function with 3 params\n * `dataO` a knockout observable wich can be listened for new incoming data, see the `subscribe` call\n * `container` is the div element where you can add new elements\n * `options` is an extra object passed to the widget which defines additional configuration options (like width or a specific color or whatever)\n * has a `this` object containing:\n * `dataInit` this is the JSON representation of the Scala data as an array of objects having the same schema as the Scala type\n * `genId` a unique id that you can use for a high level element for instance" 929 | }, 930 | { 931 | "metadata": { 932 | "trusted": true, 933 | "input_collapsed": false, 934 | "collapsed": true, 935 | "id": "81349F9C871E416093E496CEA8A9F534" 936 | }, 937 | "cell_type": "code", 938 | "source": "val js = \"\"\"\nfunction progressgraph (dataO, container, options) {\n var css = 'div.prog {position: relative; overflow: hidden; } span.pp {display: inline-block; position: absolute; height: 16px;} span.prog {display: inline-block; position: absolute; height: 16px; }' +\n '.progs {border: solid 1px #ccc; background: #eee; } .progs .pv {background: #3182bd; }',\n head = document.head || document.getElementsByTagName('head')[0],\n style = document.createElement('style');\n\n style.type = 'text/css';\n if (style.styleSheet){\n style.styleSheet.cssText = css;\n } else {\n style.appendChild(document.createTextNode(css));\n }\n\n head.appendChild(style);\n\n\n var width = options.width||600\n var height = options.height||400\n \n function create(name, duration) {\n var div = d3.select(container).append(\"div\").attr(\"class\", \"prog\");\n\n div.append(\"span\").attr(\"class\", \"pp prog\")\n .style(\"width\", \"74px\")\n .style(\"text-align\", \"right\")\n .style(\"z-index\", \"2000\")\n .text(name);\n\n div.append(\"span\")\n .attr(\"class\", \"progs\")\n .style(\"width\", \"240px\")\n .style(\"left\", \"80px\")\n .append(\"span\")\n .attr(\"class\", \"pp pv\")\n .transition()\n .duration(duration)\n .ease(\"linear\")\n .style(\"width\", \"350px\");\n\n div.transition()\n .style(\"height\", \"20px\")\n .transition()\n .delay(duration)\n .style(\"height\", \"0px\")\n .remove();\n\n }\n\n function onData(data) {\n _.each(data, function(d) {\n create(d[options.name], 5000 + d[options.duration])\n });\n }\n\n onData(this.dataInit);\n dataO.subscribe(onData);\n}\n\"\"\".trim", 939 | "outputs": [] 940 | }, 941 | { 942 | "metadata": { 943 | "id": "D652E9BA1D884CDB98B8DF06D77DEB35" 944 | }, 945 | "cell_type": "markdown", 946 | "source": "Now we can create the widget extending `notebook.front.widgets.charts.Chart[C]`, where `C` is any Scala type, it'll be converted to JS using the implicit instance of `ToPoints`.\n\nIt has to declare the original dataset which needs to be a wrapper (`List`, `Array`, ...) of the `C` instances we want to plot. But it can also define other things like below:\n* `sizes` are the $w \\times h$ dimension of the chart\n* `maxPoints` the number of points to plot, the way to select them is defined in the implicitly available instance of `Sampler`.\n* `scripts` a list of references to existing javascript scripts\n* `snippets` a list of string that represent snippets to execute in JS, they take the form of a JSON object with\n * `f` the function to call when the snippet will be executed\n * `o` a JSON object that will be provided to the above function at execution time. Here we define which field has to be used for the name and duration." 
947 | }, 948 | { 949 | "metadata": { 950 | "trusted": true, 951 | "input_collapsed": false, 952 | "collapsed": true, 953 | "id": "FA51FED8558140FEB37478F0594DAB4E" 954 | }, 955 | "cell_type": "code", 956 | "source": "import notebook.front.widgets._\nimport notebook.front.widgets.magic._\nimport notebook.front.widgets.magic.Implicits._\nimport notebook.front.widgets.magic.SamplerImplicits._\ncase class ProgChart[C:ToPoints:Sampler](\n originalData:C,\n override val sizes:(Int, Int)=(600, 400),\n maxPoints:Int = 1000,\n name:String,\n duration:String\n) extends notebook.front.widgets.charts.Chart[C](originalData, maxPoints) {\n def mToSeq(t:MagicRenderPoint):Seq[(String, Any)] = t.data.toSeq\n\n\n override val snippets = List(s\"\"\"|{\n | \"f\": $js, \n | \"o\": {\n | \"name\": \"$name\",\n | \"duration\": \"$duration\"\n | }\n |}\n \"\"\".stripMargin)\n \n override val scripts = Nil\n}", 957 | "outputs": [] 958 | }, 959 | { 960 | "metadata": { 961 | "id": "1A5FD93B14BD42E99F49806BDEB16AC5" 962 | }, 963 | "cell_type": "markdown", 964 | "source": "We can define the type of data we'll use for this example" 965 | }, 966 | { 967 | "metadata": { 968 | "trusted": true, 969 | "input_collapsed": false, 970 | "collapsed": true, 971 | "id": "4A344133173D40158A4091AAC56F8AD2" 972 | }, 973 | "cell_type": "code", 974 | "source": "case class ProgData(n:String, v:Int)", 975 | "outputs": [] 976 | }, 977 | { 978 | "metadata": { 979 | "id": "9DA8C6053A214FB2A01FBCE9C25EBE7F" 980 | }, 981 | "cell_type": "markdown", 982 | "source": "Here we generate a bunch of data bucketized by 10, and we create an instance of the new widget giving it the first bucket of data and specifying the right field names for `name` and `duration`." 983 | }, 984 | { 985 | "metadata": { 986 | "trusted": true, 987 | "input_collapsed": false, 988 | "collapsed": true, 989 | "id": "4BB2C6EA4B634DFC97B7D5D9845DF0AD" 990 | }, 991 | "cell_type": "code", 992 | "source": "val pdata = for {\n c1 <- 'a' to 'e'\n c2 <- 'a' to 'e'\n} yield ProgData(\"\"+c1+c2, (nextDouble * 10000).toInt)\nval pdataH :: pdataS = pdata.toList.sliding(10, 10).toList\n\nval pc = ProgChart(pdataH, name = \"n\", duration = \"v\")\npc", 993 | "outputs": [] 994 | }, 995 | { 996 | "metadata": { 997 | "id": "35435A5C31AF4F67BF4508910220FA92" 998 | }, 999 | "cell_type": "markdown", 1000 | "source": "We update the chart by passing the value using the `addAndApply` approach." 
1001 | }, 1002 | { 1003 | "metadata": { 1004 | "trusted": true, 1005 | "input_collapsed": false, 1006 | "collapsed": true, 1007 | "id": "15392CC271B04DA5842904489A447B1C" 1008 | }, 1009 | "cell_type": "code", 1010 | "source": "var pcont = true\nnew Thread() {\n override def run = \n pdataS.foreach { l =>\n if (pcont) {\n Thread.sleep(9000)\n pc.addAndApply(l, true)\n }\n }\n}.start", 1011 | "outputs": [] 1012 | }, 1013 | { 1014 | "metadata": { 1015 | "id": "787984960C10439D81F5DA7FBD00505A" 1016 | }, 1017 | "cell_type": "markdown", 1018 | "source": "---\n## Contexts with interpolation" 1019 | }, 1020 | { 1021 | "metadata": { 1022 | "trusted": true, 1023 | "input_collapsed": false, 1024 | "collapsed": true, 1025 | "id": "09F41F85FD34409C837320D904108BE6" 1026 | }, 1027 | "cell_type": "code", 1028 | "source": ":sh ls ${sys.env(\"NOTEBOOKS_DIR\")}", 1029 | "outputs": [] 1030 | }, 1031 | { 1032 | "metadata": { 1033 | "trusted": true, 1034 | "input_collapsed": false, 1035 | "collapsed": true, 1036 | "id": "C924A6AA52B94822B64718240069CFB0" 1037 | }, 1038 | "cell_type": "code", 1039 | "source": "val ok = \"$\\\\LaTeX$ interpolated in Scala is $\\\\Re$\"", 1040 | "outputs": [] 1041 | }, 1042 | { 1043 | "metadata": { 1044 | "trusted": true, 1045 | "input_collapsed": false, 1046 | "collapsed": true, 1047 | "id": "416D1E2D54154AE7845CE8FC14ECF334" 1048 | }, 1049 | "cell_type": "code", 1050 | "source": ":markdown \nYup, **$ok** in Spark Notebook", 1051 | "outputs": [] 1052 | }, 1053 | { 1054 | "metadata": { 1055 | "trusted": true, 1056 | "input_collapsed": false, 1057 | "collapsed": true, 1058 | "id": "7B10ABEE96784E458F34447A8C9E7C46" 1059 | }, 1060 | "cell_type": "code", 1061 | "source": ":javascript\nalert(\"I am ${(\"whoami\".!!).trim}\")", 1062 | "outputs": [] 1063 | } 1064 | ], 1065 | "nbformat": 4 1066 | } -------------------------------------------------------------------------------- /notebooks/WhyScala.md: -------------------------------------------------------------------------------- 1 | # Scala: the Unpredicted Lingua Franca for Data Science 2 | 3 | **Andy Petrella** 4 | [noootsab@data-fellas.guru](mailto:noootsab@data-fellas.guru)
5 | **Dean Wampler** 6 | [dean.wampler@lightbend.com](mailto:dean.wampler@lightbend.com) 7 | 8 | * Scala Days NYC, May 5th, 2016 9 | * GOTO Chicago, May 24, 2016 10 | * Strata + Hadoop World London, June 3, 2016 11 | * Scala Days Berlin, June 16th, 2016 12 | 13 | See also the [Spark Notebook](http://spark-notebook.io) version of this content, available at [github.com/data-fellas/scala-for-data-science](https://github.com/data-fellas/scala-for-data-science). 14 | 15 | ## Why Scala for Data Science with Spark? 16 | 17 | While Python and R are traditional languages of choice for Data Science, [Spark](http://spark.apache.org) also supports Scala (the language in which it's written) and Java. 18 | 19 | However, using one language for all work has advantages, such as simplifying the software development process: shared build, test, and deployment techniques, common coding conventions, etc. 20 | 21 | If you want a thorough introduction to Scala, see [Dean's book](http://shop.oreilly.com/product/0636920033073.do). 22 | 23 | So, what are the advantages, as well as the disadvantages, of Scala? 24 | 25 | ## 1. Functional Programming Plus Objects 26 | 27 | Scala is a _multi-paradigm_ language. Code can look a lot like traditional Java code using _Object-Oriented Programming_ (OOP), but it also embraces _Functional Programming_ (FP), which emphasizes the virtues of: 28 | 29 | 1. **Immutable values:** Mutability is a common source of bugs. 30 | 1. **Functions with no _side effects_:** All the information they need is passed in and all the "work" is returned. No external state is modified. 31 | 1. **Referential transparency:** You can replace a function call with a cached value that was returned from a previous invocation with the same arguments. (This is a benefit enabled by functions without side effects.) 32 | 1. **Higher-order functions:** Functions that take other functions as arguments or return functions as results. 33 | 1. **Structure separated from operations:** A core set of collections meets most needs. An operation applicable to one data structure is applicable to all. 34 | 35 | However, objects are still useful as an _encapsulation_ mechanism. This is valuable for projects with large teams and code bases. 36 | Scala also implements some _functional_ features using _object-oriented inheritance_ (e.g., "abstract data types" and "type classes", for you experts...). 37 | 38 | ### What about the other languages? 39 | * **Python:** Supports mixed FP-OOP programming, too, but isn't as "rigorous". 40 | * **R:** As a Statistics language, R is more functional than object-oriented. 41 | * **Java:** An object-oriented language, but with recently introduced functional constructs, _lambdas_ (anonymous functions) and collection operations that follow a more _functional_ style, rather than _imperative_ (i.e., where mutating the collection is embraced). 42 | 43 | There are a few differences between Java's and Scala's approaches to OOP and FP that are worth mentioning specifically: 44 | 45 | ### 1a. Traits vs. Interfaces 46 | Scala's object model adds a _trait_ feature, which is a more powerful concept than Java 8 interfaces. Before Java 8, there was no [mixin composition](https://en.wikipedia.org/wiki/Mixin) capability in Java, where composition is generally [preferred over inheritance](https://en.wikipedia.org/wiki/Composition_over_inheritance). 47 | 48 | Imagine that you want to define reusable logging code and mix it into other classes declaratively.
Before Java 8, you could define the abstraction for logging in an interface, but you had to use some ad hoc mechanism to implement it (like implementing all methods to delegate to a helper object). Java 8 added the ability to provide default method definitions, as well as declarations in interfaces. This makes mixin composition easier, but you still can't add fields (for state), so the capability is limited. 49 | 50 | Scala traits fully support mixin composition by supporting both field and method definitions, with flexible rules for overriding behavior once the traits are mixed into classes. 51 | 52 | ### 1b. Java Streams 53 | When you use the Java 8 collections, you can convert the traditional collections to a "stream", which is lazy and gives you more functional operations. However, sometimes, the conversions back and forth can be tedious, e.g., converting to a stream for functional processing, then converting back to pass the results to older APIs, etc. Scala collections are more consistently functional. 54 | 55 | ### The Virtue of Functional Collections: 56 | Let's examine how concisely we can operate on a collection of values in Scala and Spark. 57 | 58 | First, let's define a helper function: is an integer a prime? (Naïve algorithm from [Wikipedia](https://en.wikipedia.org/wiki/Primality_test).) 59 | 60 | ```scala 61 | def isPrime(n: Int): Boolean = { 62 | def test(i: Int, n2: Int): Boolean = { 63 | if (i*i > n2) true 64 | else if (n2 % i == 0 || n2 % (i + 2) == 0) false 65 | else test(i+6, n2) 66 | } 67 | if (n <= 1) false 68 | else if (n <= 3) true 69 | else if (n % 2 == 0 || n % 3 == 0) false 70 | else test(5, n) 71 | } 72 | ``` 73 | 74 | Note that no values are mutated here ("virtue" #1 listed above) and `isPrime` has no side effects (#2), which means we could cache previous invocations for a given `n` for better performance if we called this a lot (#3)! 75 | 76 | ### Scala Collections Example 77 | Let's compare a Scala collections calculation vs. the same thing in Spark; how many prime numbers are there between 1 and 100, inclusive? 78 | 79 | ```scala 80 | (1 to 100). // Range of integers from 1 to 100, inclusive. 81 | map(i => (i, isPrime(i))). // `map` is a higher-order method; we pass it 82 | // a function (#4) 83 | groupBy(tuple => tuple._2). // ... and so is `groupBy`, etc. 84 | map(tuple => (tuple._1, tuple._2.size)) 85 | ``` 86 | 87 | This produces the results: 88 | ```scala 89 | res16: scala.collection.immutable.Map[Boolean,Int] = Map( 90 | false -> 75, true -> 25) 91 | ``` 92 | Note that for the numbers between 1 and 100, inclusive, exactly 1/4 of them are prime! 93 | 94 | ### Spark Example 95 | 96 | Note how similar the following code is to the previous example. After constructing the data set, the "core" three lines are _identical_, even though they are operating on completely different underlying collections (#5 above). 97 | 98 | However, because Spark collections are "lazy" by default (i.e., not evaluated until we ask for results), we explicitly collect the results so Spark evaluates them! 99 | 100 | ```scala 101 | val rddPrimes = sparkContext.parallelize(1 to 100). 102 | map(i => (i, isPrime(i))). 103 | groupBy(tuple => tuple._2).
104 | map(tuple => (tuple._1, tuple._2.size)) 105 | rddPrimes.collect 106 | ``` 107 | 108 | This produces the result: 109 | ```scala 110 | rddPrimes: org.apache.spark.rdd.RDD[(Boolean, Int)] = 111 | MapPartitionsRDD[4] at map at :61 112 | res18: Array[(Boolean, Int)] = Array((false,75), (true,25)) 113 | ``` 114 | 115 | Note the inferred type, an `RDD` with records of type `(Boolean, Int)`, meaning two-element tuples. 116 | 117 | Spark's RDD API is inspired by the Scala collections API, which is inspired by classic _functional programming_ operations on data collections, i.e., using a series of transformations from one form to the next, without mutating any of the collections. (Spark is very efficient about avoiding the materialization of intermediate outputs.) 118 | 119 | Once you know these operations, it's quick and effective to implement robust, non-trivial transformations. 120 | 121 | What about the other languages? 122 | 123 | * **Python:** Supports very similar functional programming. In fact, Spark Python code looks very similar to Spark Scala code. 124 | * **R:** More idiomatic (see below). 125 | * **Java:** Looks similar when _lambdas_ are used, but missing features (see below) limit concision and flexibility. 126 | 127 | ## 2. Interpreter (REPL) 128 | 129 | In the notebook, we've been using the Scala interpreter (a.k.a., the REPL - Read Eval, Print, Loop) already behind the scenes. It makes notebooks like this one possible! 130 | 131 | What about the other languages? 132 | 133 | * **Python:** Also has an interpreter and [iPython/Jupyter](https://ipython.org/) was one of the first, widely-used notebook environments. 134 | * **R:** Also has an interpreter and notebook/IDE environments. 135 | * **Java:** Does _not_ have an interpreter and can't be programmed in a notebook environment. However, Java 9 will have a REPL, after 20+ years! 136 | 137 | ## 3. Tuple Syntax 138 | In data, you work with records of `n` fields (for some value of `n`) all the time. Support for `n`-element _tuples_ is very convenient and Scala has a shorthand syntax for instantiating tuples. We used it twice previously to return two-element tuples in the anonymous functions passed to the `map` methods above: 139 | 140 | ```scala 141 | sparkContext.parallelize(1 to 100). 142 | map(i => (i, isPrime(i))). // <-- here 143 | groupBy(tuple => tuple._2). 144 | map(tuple => (tuple._1, tuple._2.size)) // <-- here 145 | ``` 146 | 147 | As before, REPL prints the following: 148 | 149 | ```scala 150 | res20: org.apache.spark.rdd.RDD[(Boolean, Int)] = 151 | MapPartitionsRDD[9] at map at :63 152 | ``` 153 | 154 | **Tuples are used all the time** in Spark Scala RDD code, where it's common to use key-value pairs. 155 | 156 | What about the other languages? 157 | 158 | * **Python:** Also has some support for the same tuple syntax. 159 | * **R:** Also has tuple types, but a less convenient syntax for instantiating them. 160 | * **Java:** Does _not_ have tuple types, not even the special case of two-element tuples (pairs), much less a convenient syntax for them. 
However, Spark defines a [MutablePair](http://spark.apache.org/docs/latest/api/java/org/apache/spark/util/MutablePair.html) type for this purpose: 161 | 162 | ```scala 163 | // Using Scala syntax here: 164 | import org.apache.spark.util.MutablePair 165 | val pair = new MutablePair[Int,String](1, "one") 166 | ``` 167 | 168 | The REPL prints: 169 | ```scala 170 | import org.apache.spark.util.MutablePair 171 | pair: org.apache.spark.util.MutablePair[Int,String] = (1,one) 172 | ``` 173 | 174 | ## 4. Pattern Matching 175 | This is one of the most powerful features you'll find in most functional languages, Scala included. It has no equivalent in Python, R, or Java. 176 | 177 | Let's rewrite our previous primes example: 178 | 179 | ```scala 180 | sparkContext.parallelize(1 to 100). 181 | map(i => (i, isPrime(i))). 182 | groupBy{ case (_, primality) => primality}. // Syntax: { case pattern => body } 183 | map{ case (primality, values) => (primality, values.size) }. // same here 184 | foreach(println) 185 | ``` 186 | 187 | The output is: 188 | ```scala 189 | (true,25) 190 | (false,75) 191 | ``` 192 | 193 | Note the `case` keyword and `=>` separating the pattern from the body to execute if the pattern matches. 194 | 195 | In the first pattern, `(_, primality)`, we didn't need the first tuple element, so we used the "don't care" placeholder, `_`. Note also that `{...}` must be used instead of `(...)`. (The extra whitespace after the `{` and before the `}` is not required; it's here for legibility.) 196 | 197 | Pattern matching is much richer, yet more concise, than `if ... else ...` constructs in the other languages. We can use it on nearly anything to match what it is and then decompose it into its constituent parts, which are assigned to variables with meaningful names, e.g., `primality`, `values`, etc. 198 | 199 | Here's another example, where we _deconstruct_ a nested tuple. We also show that you can use pattern matching for assignment, too! 200 | 201 | ```scala 202 | val (a, (b1, (b21, b22)), c) = ("A", ("B1", ("B21", "B22")), "C") 203 | ``` 204 | 205 | Note the output of the REPL: 206 | ```scala 207 | a: String = A 208 | b1: String = B1 209 | b21: String = B21 210 | b22: String = B22 211 | c: String = C 212 | ``` 213 | 214 | ## 5. Case Classes 215 | Now is a good time to introduce a convenient way to declare classes that encapsulate some state that is composed of some values, called _case classes_. 216 | 217 | ```scala 218 | case class Person(firstName: String, lastName: String, age: Int) 219 | ``` 220 | 221 | The `case` keyword tells the compiler to: 222 | 223 | * Make immutable instance fields out of the constructor arguments (the list after the name). 224 | * Add `equals`, `hashCode`, and `toString` methods (which you can explicitly define yourself, if you want). 225 | * Add a _companion object_ with the same name, which holds methods for constructing instances and "destructuring" instances through pattern matching. 226 | * etc. 227 | 228 | Case classes are useful for implementing records in RDDs. 229 | 230 | Let's see case class pattern matching in action: 231 | 232 | ```scala 233 | sparkContext.parallelize( 234 | Seq(Person("Dean", "Wampler", 39), 235 | Person("Andy", "Petrella", 29))). 236 | map { 237 | // Convert Person instances to tuples 238 | case Person(first, last, age) => (first, last, age) 239 | }. 240 | foreach(println) 241 | ``` 242 | 243 | Output: 244 | ```scala 245 | (Andy,Petrella,29) 246 | (Dean,Wampler,39) 247 | ``` 248 | 249 | What about the other languages?
250 | 251 | * **Python:** Regular expression matching for strings is built in. Pattern matching as shown requires a third-party library with an idiomatic syntax. Nothing like case classes. 252 | * **R:** Only supports regular expression matching for strings. Nothing like case classes. 253 | * **Java:** Only supports regular expression matching for strings. Nothing like case classes. 254 | 255 | ## 6. Type Inference 256 | Most languages associate a type with values but, crudely speaking, they fall into two categories: those which evaluate the types of expressions and variables at compile time (like Scala and Java) and those which do so at runtime (Python and R). This is called _static typing_ and _dynamic typing_, respectively. 257 | 258 | So, languages with static typing either have to be told the type of every expression or variable, or they can _infer_ types in some or all cases. Scala can infer types most of the time, while Java can do so only in limited cases. Here are some examples for Scala. Note the results shown for each expression: 259 | 260 | ```scala 261 | val i = 100 // <- infer that i is an integer 262 | val j = i*i % 27 // <- since i is an integer, j must be one, too. 263 | ``` 264 | 265 | The compiler infers the following: 266 | ```scala 267 | i: Int = 100 268 | j: Int = 10 269 | ``` 270 | 271 | Recall our previous Spark example, where we wrote nothing about types, but they were inferred: 272 | 273 | ```scala 274 | sparkContext.parallelize(1 to 100). 275 | map(i => (i, isPrime(i))). 276 | groupBy{ case(_, primality) => primality }. 277 | map{ case (primality, values) => (primality, values.size) } 278 | ``` 279 | 280 | Output: 281 | ```scala 282 | res30: org.apache.spark.rdd.RDD[(Boolean, Int)] = 283 | MapPartitionsRDD[21] at map at :66 284 | ``` 285 | 286 | So this long expression (and it is a four-line expression - note the "."'s) returns an `RDD[(Boolean, Int)]`. Note that we can also express a tuple _type_ with the `(...)` syntax, just like for tuple _instances_. This type could also be written `RDD[Tuple2[Boolean, Int]]`. 287 | 288 | Put another way, we have an `RDD` where the records are key-value pairs of `Booleans` and `Ints`. 289 | 290 | I really like the extra safety that static typing provides, without the hassle of writing the types for almost everything, compared to Java. Furthermore, when I'm using an API with the Scala interpreter or a notebook like this one, the return value's type is shown, as in the previous example, so I know exactly what "kinds of things" I have. That also means I don't have to know _in advance_ what a method will return in order to explicitly add a required type, as in Java. 291 | 292 | What about the other languages? 293 | 294 | * **Python:** Uses dynamic typing, so no types are written explicitly, but you also don't get the feedback type inference provides, as in our `RDD[(Boolean, Int)]` example. 295 | * **R:** Also dynamically typed. 296 | * **Java:** Statically typed with explicit types required almost everywhere. 297 | 298 | ## 7. Unification of Primitives and Types 299 | In Java, there is a clear distinction between primitives, which are nice for performance (you can put them in registers, you can pass them on the stack, you don't heap allocate them), and instances of classes, which give you the expressiveness of OOP, but with the overhead of heap allocation, etc. 300 | 301 | Scala unifies the syntax, but in most cases, compiles optimal code.
So, for example, `Float` acts like any other type, e.g., `String`, with methods you call, but the compiler uses JVM `float` primitives. `Float` and the other primitives are subtypes of `AnyVal` and include `Byte`, `Short`, `Int`, `Long`, `Float`, `Double`, `Char`, `Boolean`, and `Unit`. 302 | 303 | Another benefit is that the uniformity extends to parameterized types, like collections. If you implement your own `Tree[T]` type, `T` can be `Float`, `String`, `MyMassiveClass`, whatever. There's no mental burden of explicitly boxing and unboxing primitives. 304 | 305 | However, the downside is that your primitives will be boxed when used in a context like this. Scala does have an annotation `@specialized(a,b,c)` that's used to tell the compiler to generate optimal implementations for the primitives listed for `a,b,c`, but it's not a perfect solution. 306 | 307 | ```scala 308 | val listString: List[String] = List("one", "two", "three") 309 | val listInt: List[Int] = List(1, 2, 3) // No need to use Integer. 310 | ``` 311 | 312 | Output: 313 | ```scala 314 | listString: List[String] = List(one, two, three) 315 | listInt: List[Int] = List(1, 2, 3) 316 | ``` 317 | 318 | See also **Value Classes** below. 319 | 320 | ## 8. Elegant Tools to Create "Domain Specific Languages" 321 | The Spark [DataFrame](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame) API is a good example of DSL that mimics the original Python and R DataFrame APIs for single-node use. 322 | 323 | First, set up the API: 324 | ```scala 325 | import org.apache.spark.sql.SQLContext 326 | val sqlContext = new SQLContext(sparkContext) 327 | import sqlContext.implicits._ 328 | import org.apache.spark.sql.functions._ // for min, max, etc. column operations 329 | ``` 330 | 331 | Get the root directory of the notebooks: 332 | ```scala 333 | val root = sys.env("NOTEBOOKS_DIR") 334 | ``` 335 | 336 | Load the airports data into a [DataFrame](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame). 337 | 338 | ```scala 339 | val airportsDF = sqlContext.read.json(s"$root/airports.json") 340 | ``` 341 | 342 | Note the "schema" is inferred from the JSON and shown by the REPL (by calling `DataFrame.toString`). 343 | 344 | ```scala 345 | airportsDF: org.apache.spark.sql.DataFrame = [airport: string, city: string, country: string, iata: string, lat: double, long: double, state: string] 346 | ``` 347 | 348 | We cache the results, so Spark will keep the data in memory since we'll run a few queries over it. `DataFrame.show` is convenient for displaying the first `N` records (20 by default). 
349 | 350 | ```scala 351 | airportsDF.cache 352 | airportsDF.show 353 | ``` 354 | 355 | Here's the output of `show`: 356 | ``` 357 | +--------------------+------------------+-------+----+-----------+------------+-----+ 358 | | airport| city|country|iata| lat| long|state| 359 | +--------------------+------------------+-------+----+-----------+------------+-----+ 360 | | Thigpen | Bay Springs| USA| 00M|31.95376472|-89.23450472| MS| 361 | |Livingston Municipal| Livingston| USA| 00R|30.68586111|-95.01792778| TX| 362 | | Meadow Lake| Colorado Springs| USA| 00V|38.94574889|-104.5698933| CO| 363 | | Perry-Warsaw| Perry| USA| 01G|42.74134667|-78.05208056| NY| 364 | | Hilliard Airpark| Hilliard| USA| 01J| 30.6880125|-81.90594389| FL| 365 | | Tishomingo County| Belmont| USA| 01M|34.49166667|-88.20111111| MS| 366 | | Gragg-Wade | Clanton| USA| 02A|32.85048667|-86.61145333| AL| 367 | | Capitol| Brookfield| USA| 02C| 43.08751|-88.17786917| WI| 368 | | Columbiana County| East Liverpool| USA| 02G|40.67331278|-80.64140639| OH| 369 | | Memphis Memorial| Memphis| USA| 03D|40.44725889|-92.22696056| MO| 370 | | Calhoun County| Pittsboro| USA| 04M|33.93011222|-89.34285194| MS| 371 | | Hawley Municipal| Hawley| USA| 04Y|46.88384889|-96.35089861| MN| 372 | |Griffith-Merrillv...| Griffith| USA| 05C|41.51961917|-87.40109333| IN| 373 | |Gatesville - City...| Gatesville| USA| 05F|31.42127556|-97.79696778| TX| 374 | | Eureka| Eureka| USA| 05U|39.60416667|-116.0050597| NV| 375 | | Moton Municipal| Tuskegee| USA| 06A|32.46047167|-85.68003611| AL| 376 | | Schaumburg|Chicago/Schaumburg| USA| 06C|41.98934083|-88.10124278| IL| 377 | | Rolla Municipal| Rolla| USA| 06D|48.88434111|-99.62087694| ND| 378 | | Eupora Municipal| Eupora| USA| 06M|33.53456583|-89.31256917| MS| 379 | | Randall | Middletown| USA| 06N|41.43156583|-74.39191722| NY| 380 | +--------------------+------------------+-------+----+-----------+------------+-----+ 381 | only showing top 20 rows 382 | ``` 383 | 384 | Now we can show the idiomatic DataFrame API (DSL) in action: 385 | 386 | ```scala 387 | val grouped = airportsDF.groupBy($"state", $"country").count.orderBy($"count".desc) 388 | grouped.printSchema 389 | grouped.show(100) // 50 states + territories < 100 390 | ``` 391 | 392 | Here is the output: 393 | 394 | ``` 395 | root 396 | |-- state: string (nullable = true) 397 | |-- country: string (nullable = true) 398 | |-- count: long (nullable = false) 399 | 400 | +-----+-------+-----+ 401 | |state|country|count| 402 | +-----+-------+-----+ 403 | | AK| USA| 263| 404 | | TX| USA| 209| 405 | | CA| USA| 205| 406 | | OK| USA| 102| 407 | | FL| USA| 100| 408 | | OH| USA| 100| 409 | | NY| USA| 97| 410 | | GA| USA| 97| 411 | | MI| USA| 94| 412 | | MN| USA| 89| 413 | | IL| USA| 88| 414 | | WI| USA| 84| 415 | | KS| USA| 78| 416 | | IA| USA| 78| 417 | | AR| USA| 74| 418 | | MO| USA| 74| 419 | | NE| USA| 73| 420 | | AL| USA| 73| 421 | | MS| USA| 72| 422 | | NC| USA| 72| 423 | | PA| USA| 71| 424 | | MT| USA| 71| 425 | | TN| USA| 70| 426 | | WA| USA| 65| 427 | | IN| USA| 65| 428 | | AZ| USA| 59| 429 | | SD| USA| 57| 430 | | OR| USA| 57| 431 | | LA| USA| 55| 432 | | ND| USA| 52| 433 | | SC| USA| 52| 434 | | NM| USA| 51| 435 | | KY| USA| 50| 436 | | CO| USA| 49| 437 | | VA| USA| 47| 438 | | ID| USA| 37| 439 | | UT| USA| 35| 440 | | NJ| USA| 35| 441 | | ME| USA| 34| 442 | | WY| USA| 32| 443 | | NV| USA| 32| 444 | | MA| USA| 30| 445 | | WV| USA| 24| 446 | | MD| USA| 18| 447 | | HI| USA| 16| 448 | | CT| USA| 15| 449 | | NH| USA| 14| 450 | | VT| USA| 13| 451 | | PR| USA| 11| 
452 | | RI| USA| 6| 453 | | DE| USA| 5| 454 | | VI| USA| 5| 455 | | CQ| USA| 4| 456 | | AS| USA| 3| 457 | | GU| USA| 1| 458 | | DC| USA| 1| 459 | +-----+-------+-----+ 460 | 461 | grouped: org.apache.spark.sql.DataFrame = [state: string, country: string, count: bigint] 462 | ``` 463 | 464 | By the way, this DSL is essentially a programmatic version of SQL: 465 | 466 | ```scala 467 | airportsDF.registerTempTable("airports") 468 | val grouped2 = sqlContext.sql(""" 469 | SELECT state, country, COUNT(*) AS cnt FROM airports 470 | GROUP BY state, country 471 | ORDER BY cnt DESC 472 | """) 473 | ``` 474 | 475 | What about the other languages? 476 | 477 | * **Python:** Dynamically-typed languages often have features that make idiomatic DSLs easy to define. The Spark DataFrame API is inspired by the [Pandas DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) API. 478 | * **R:** Less flexible for idiomatic DSLs, but syntax is designed for Mathematics. The Pandas DataFrame API is inspired by the [R Data Frame](http://www.r-tutor.com/r-introduction/data-frame) API. 479 | * **Java:** Limited to so-called _fluent_ APIs, similar to our collections and RDD examples above. 480 | 481 | ## 9. And a Few Other Things... 482 | There are many more Scala features that the other languages don't have or don't support as nicely. Some are actually quite significant for general programming tasks, but they are used less frequently in Spark code. Here they are, for completeness. 483 | 484 | ### 9A. Singletons Are a Built-in Feature 485 | Implement the _Singleton Design Pattern_ without special logic to ensure there's only one instance. 486 | 487 | ```scala 488 | object Foo { 489 | def main(args: Array[String]):Unit = { 490 | args.foreach(arg => println(s"arg = $arg")) 491 | } 492 | } 493 | Foo.main(Array("Scala", "is", "great!")) 494 | ``` 495 | 496 | The output is: 497 | ``` 498 | arg = Scala 499 | arg = is 500 | arg = great! 501 | defined object Foo 502 | ``` 503 | 504 | ### 9B. Named and Default Arguments 505 | Does a method have a long argument list? Provide defaults for some of them. Name the arguments when calling the method to document what you're doing. 506 | 507 | ```scala 508 | val airportsRDD = grouped.select($"count", $"state"). 509 | map(row => (row.getLong(0), row.getString(1))) 510 | 511 | val rdd1 = airportsRDD.sortByKey() // defaults: ascending = true, numPartitions = current # of partitions 512 | val rdd2 = airportsRDD.sortByKey(ascending = false) // name the ascending argument explicitly 513 | val rdd3 = airportsRDD.sortByKey(numPartitions = 4) // name the numPartitions argument explicitly 514 | val rdd4 = airportsRDD.sortByKey(ascending = false, numPartitions = 4) // Okay to do both... 
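// An added sketch (not in the original notebook): you can also define your own
// methods with default argument values, which callers may then override
// positionally or by name, as above. `firstAirports` is a hypothetical helper,
// not part of the Spark API.
def firstAirports(n: Int = 10, ascending: Boolean = true): Array[(Long, String)] =
  airportsRDD.sortByKey(ascending).take(n)
// firstAirports()                          // first 10 pairs, smallest counts first
// firstAirports(n = 5, ascending = false)  // the 5 states with the most airports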
515 | ``` 516 | 517 | All four variants return the same type: 518 | ```scala 519 | rdd1: org.apache.spark.rdd.RDD[(Long, String)] = ShuffledRDD[60] at sortByKey at :74 520 | rdd2: org.apache.spark.rdd.RDD[(Long, String)] = ShuffledRDD[63] at sortByKey at :75 521 | rdd3: org.apache.spark.rdd.RDD[(Long, String)] = ShuffledRDD[66] at sortByKey at :76 522 | rdd4: org.apache.spark.rdd.RDD[(Long, String)] = ShuffledRDD[69] at sortByKey at :77 523 | ``` 524 | 525 | To see the impacts of the arguments: 526 | ```scala 527 | Seq(rdd1, rdd2, rdd3, rdd4).foreach { rdd => 528 | println(s"RDD (#partitions = ${rdd.partitions.length}):") 529 | rdd.take(10).foreach(println) 530 | } 531 | ``` 532 | 533 | Output: 534 | ``` 535 | RDD (#partitions = 41): 536 | (1,GU) 537 | (1,DC) 538 | (3,AS) 539 | (4,CQ) 540 | (5,VI) 541 | (5,DE) 542 | (6,RI) 543 | (11,PR) 544 | (13,VT) 545 | (14,NH) 546 | RDD (#partitions = 41): 547 | (263,AK) 548 | (209,TX) 549 | (205,CA) 550 | (102,OK) 551 | (100,OH) 552 | (100,FL) 553 | (97,NY) 554 | (97,GA) 555 | (94,MI) 556 | (89,MN) 557 | RDD (#partitions = 4): 558 | (1,GU) 559 | (1,DC) 560 | (3,AS) 561 | (4,CQ) 562 | (5,VI) 563 | (5,DE) 564 | (6,RI) 565 | (11,PR) 566 | (13,VT) 567 | (14,NH) 568 | RDD (#partitions = 4): 569 | (263,AK) 570 | (209,TX) 571 | (205,CA) 572 | (102,OK) 573 | (100,OH) 574 | (100,FL) 575 | (97,NY) 576 | (97,GA) 577 | (94,MI) 578 | (89,MN) 579 | ``` 580 | 581 | ### 9C. String Interpolation 582 | You've seen it used already: 583 | 584 | ```scala 585 | println(s"RDD #partitions = ${rdd4.partitions.length}") 586 | // prints: RDD #partitions = 4 587 | ``` 588 | 589 | ### 9D. Few Semicolons 590 | Semicolons are inferred, making your code just that much more concise. You can use them if you want to write more than one expression on a line: 591 | 592 | ```scala 593 | val result = "foo" match { 594 | case "foo" => println("Found foo!"); true 595 | case _ => false 596 | } 597 | // prints: Found foo! 598 | ``` 599 | 600 | ### 9E. Tail Recursion Optimization 601 | Recursion isn't used much in user code for Spark, but for general programming it's a powerful technique. Unfortunately, most OO languages (like Java) do not optimize [tail call recursion](https://en.wikipedia.org/wiki/Tail_call) by converting the recursion into a loop. Without this optimization, use of recursion is risky, because of the risk of stack overflow. Scala's compiler implements this optimization. 602 | 603 | ```scala 604 | def printSeq[T](seq: Seq[T]): Unit = seq match { 605 | case head +: tail => println(head); printSeq(tail) 606 | case Nil => // done 607 | } 608 | printSeq(Seq(1,2,3,4)) 609 | // prints: 610 | // 1 611 | // 2 612 | // 3 613 | // 4 614 | ``` 615 | 616 | ### 9F. Everything Is an Expression 617 | Some constructs are _statements_ (meaning they return nothing) in some languages, like `if ... then ... else`, `for` loops, etc. Almost everything is an expression in Scala which means you can assign results of the `if` or `for` expression. The alternative in the other languages is that you have to declare a mutable variable, then set its value inside the statement. 618 | 619 | ```scala 620 | val worldRocked = if (true == false) "yes!" 
else "no" 621 | ``` 622 | 623 | As you might expect, the output is: 624 | ``` 625 | worldRocked: String = no 626 | ``` 627 | 628 | ```scala 629 | val primes = for { 630 | i <- 0 until 100 631 | if isPrime(i) 632 | } yield i 633 | ``` 634 | 635 | The output is: 636 | ```scala 637 | primes: scala.collection.immutable.IndexedSeq[Int] = 638 | Vector(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 639 | 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97) 640 | ``` 641 | 642 | ### 9G. Implicits 643 | One of Scala's most powerful features is the _implicits_ mechanism. It's used (or misused) for several capabilities, but one of the most useful is the ability to "add" methods to existing types that don't already have the methods. What actually happens is the compiler invokes an _implicit conversion_ from an instance of the type to a wrapper type that has the desired method. 644 | 645 | For example, suppose I want to add a `toJSON` method to my `Person` type above, but I don't want this added to the class itself. Maybe it's from a library that I can't modify. Maybe I only want this method in some contexts, but I don't want its baggage everywhere. Here's how to do it. 646 | 647 | ```scala 648 | // repeat definition of Person: 649 | case class Person(firstName: String, lastName: String, age: Int) 650 | 651 | implicit class PersonToJSON(person: Person) { 652 | // Just return a JSON-formatted string, for simplicity of the example: 653 | def toJSON: String = 654 | s"""{ "firstName": ${person.firstName}, "lastName": ${person.lastName}, "age": ${person.age} }""" 655 | } 656 | 657 | val p = Person("Dean", "Wampler", 39) 658 | p.toJSON // Like magic!! 659 | // returns: { "firstName": Dean, "lastName": Wampler, "age": 39 } 660 | ``` 661 | 662 | The `implicit` keyword tells the compiler to consider `PersonToJSON` when I attempt to call `toJSON` on a `Person` instance. The compiler finds this implicit class and does the conversion implicitly, then calls the `toJSON` method. 663 | 664 | There are many other uses for implicits. They are a powerful implementation tool for various design problems, but they have to be used wisely, because it can be difficult for the reader to know what's going on. 665 | 666 | ### 9H. Sealed Type Hierarchies 667 | An important concept in modern languages is _sum types_, where there is a finite set of possible instances. Two examples from Scala are `Option[T]` and its allowed subtypes `Some[T]` and `None`, and `Either[L,R]` and its subtypes `Left[L]` and `Right`[R]`. 668 | 669 | Note that `Option[T]` represents two and only two possible states, either I have something, a `T` inside a `Some[T]`, or I don't anything, a `None`. There are no additional "states" that are logically possible for the `Option[T]` "abstraction". Similarly, `Either[L,R]` encapsulates a similar dichotomy, often used for "failure" (e.g., `Left[Throwable]` by convention) and "successful result" (`Right[T]` for some "expected" `T`). 670 | 671 | The term _sum type_ comes from an analog between types and arithmetic. For `Option`, the number of allowed intances (ignoring the type parameter `T`) is just the sum, _two_. Similarly for `Either`. 672 | 673 | There are also _product types_, like tuples, where combining types together _multiplies_ the number of instances. For example, a tuple of `(Option,Either)` would have 2*2 instances. A tuple `(Boolean,Option,HTTP_Commands)` has 2*2*7 possible instances (there are 7 HTTP 1.1 commands, like `GET`, `POST`, etc.) 
674 | 675 | Scala uses type hierarchies for sum types, where an abstract _sealed_ trait or class is used for the base type, e.g., `Option[T]` and `Either[L,R]`, and subtypes represent the concrete types. The `sealed` keyword is used on the base type and it is crucial; it tells the compiler to only allow subtypes to be defined in the same _file_, which means users can't add their own subtypes, breaking the logic of the type. 676 | 677 | Some other languages implement sum types using a variation of _enumerations_. Java has that, but it's a much more limited concept than true subtypes. 678 | 679 | Here's an example, sort of like `Either`, but oriented more towards the usage of encapsulating success or failure. However, we'll put "success" on the left instead of the right, which is the convention when using `Either`. 680 | 681 | We'll have one type parameter `Result`. On `Success`, it will hold an instance of the type `Result`. On `Failure`, it will hold no successful result, so we'll use the "bottom" type `Nothing` for the type parameter, 682 | and expect the error information to be returned in a `RuntimeException`. 683 | 684 | ```scala 685 | import scala.util.control.NonFatal 686 | 687 | // The + means "covariant"; we can use subtypes of the declared "Result". 688 | // See also the **Definition-site Variance vs. Call-site Variance** section below. 689 | sealed trait SuccessOrFailure[+Result] 690 | case class Success[Result](result: Result) extends SuccessOrFailure[Result] 691 | case class Failure(error: RuntimeException) extends SuccessOrFailure[Nothing] 692 | ``` 693 | 694 | The `sealed` keyword is actually less useful in the context of this notebook; we can keep on defining subclasses below. However, in library code, you would put the three declarations in a separate file and then the compiler would prevent anyone from defining a third subclass in a different location. 695 | 696 | Let's try it out. 697 | 698 | ```scala 699 | def parseInt(string: String): SuccessOrFailure[Int] = try { 700 | Success(Integer.parseInt(string)) 701 | } catch { 702 | case nfe: NumberFormatException => Failure(new RuntimeException(s"""Invalid integer string: "$string" """)) 703 | } 704 | 705 | Seq("1", "202", "three").map(parseInt) 706 | // Seq[SuccessOrFailure[Int]] = List( 707 | // Success(1), 708 | // Success(202), 709 | // Failure(java.lang.RuntimeException: Invalid integer string: "three" )) 710 | ``` 711 | 712 | ### 9I. Option Type Broken in Java 713 | Speaking of `Option[T]`, Java 8 introduced a similar type called `Optional`. (The name `Option` was already used for something else.) However, its design has some subtleties that make the behavior not straightforward when `null`s are involved. For details, see [this blog post](https://developer.atlassian.com/blog/2015/08/optional-broken/). 714 | 715 | ### 9J: Definition-site Variance vs. Call-site Variance 716 | This is a technical point. In Java, when you define a type with a type parameter, like our `SuccessOrFailure[T]` previously, to hold items of some type `T`, you can't specify in the declaration whether it's okay to substitute a subtype of `SuccessOrFailure` parameterized with a subtype of `T`. For example, is the following okay?: 717 | 718 | ```java 719 | // Java 720 | SuccessOrFailure<Object> sof = null; 721 | ... 722 | sof = new Success<String>("foo"); 723 | ``` 724 | 725 | This substitutability is called _variance_, referring to the variance allowed in `T` if we use a subtype of the outer type, `SuccessOrFailure`.
Notice that we want to assign a subclass of `SuccessOrFailure` _and_ a subtype of `Object`. In this case, we're doing _covariant substitution_, because the subtyping "moves in the same direction", from parent to child for both types. There's also _contravariant_ typing, where the type parameter moves "up" while the outer type moves "down", and _invariant_ typing, where you can't change the type parameter. That is, in the invariant case, we could only assign `Success(...)` to `sof`. 726 | 727 | Java does not let the type _designer_ specify the correct behavior. This means Java forces the _user_ of the type to specify the variance at the _call site_, using a wildcard: 728 | 729 | ```java 730 | SuccessOrFailure<? extends Object> sof = null; // "? extends" requests covariance at the call site 731 | ... 732 | sof = new Success<String>("foo"); 733 | 734 | ``` 735 | 736 | This is harder for the user, who has to understand what's okay in this case, both what the designer intended and some technical rules of type theory. 737 | 738 | It's much better if the _designer_ of `SuccessOrFailure[T]`, who understands the desired behavior, defines the allowed variance at the _definition site_, which Scala supports. Recall from above: 739 | 740 | ```scala 741 | // Scala 742 | sealed trait SuccessOrFailure[+Result] 743 | case class Success[Result](result: Result) extends SuccessOrFailure[Result] 744 | case class Failure(error: RuntimeException) extends SuccessOrFailure[Nothing] 745 | 746 | ... 747 | // usage: 748 | val sof: SuccessOrFailure[AnyRef] = new Success[String]("Yea!") 749 | ``` 750 | 751 | ### 9K. Value Classes 752 | Scala's built-in _value types_ `Int`, `Long`, `Float`, `Double`, `Boolean`, and `Unit` are implemented with the corresponding JVM primitive values, eliminating the overhead of allocating an instance on the heap. What if you define a class that wraps _one_ of these values? 753 | ```scala 754 | class Celsius(val value: Float) { 755 | // methods 756 | } 757 | ``` 758 | 759 | Unfortunately, instances are allocated on the heap, even though all instance "state" is held by a single primitive `Float`. Scala now lets you extend the `AnyVal` type in your own classes. If you use it as a parent of types like `Celsius`, they will enjoy the same optimization that the built-in value types enjoy. That is, the single primitive field (`value` here) will be pushed on the stack, etc., and no instance of `Celsius` will be heap allocated, _in most cases._ 760 | 761 | ```scala 762 | class Celsius(val value: Float) extends AnyVal { 763 | // methods 764 | } 765 | ``` 766 | 767 | So, why doesn't Scala make this optimization automatically? There are some limitations, which are described [here](http://docs.scala-lang.org/overviews/core/value-classes.html) and in [my book](http://shop.oreilly.com/product/0636920033073.do). 768 | 769 | ### 9L. Lazy Vals 770 | Sometimes you don't want to initialize a value if doing so is expensive and you won't always need it. Or, sometimes you just want to delay the "hit" so you're up and running more quickly. For example, a database connection is expensive. 771 | 772 | ```scala 773 | lazy val jdbcConnection = new JDBCConnection(...) 774 | ``` 775 | 776 | Use the `lazy` keyword to delay initialization until it's actually needed (if ever). This feature can also be used to solve some tricky "order of initialization" problems. It has one drawback: there is extra overhead on every access to check whether the value has already been initialized, so don't do this if the value will be read a lot. A future version of Scala may reduce this overhead. 777 | 778 | What about the other languages? 
779 | 780 | * **Python:** Offers equivalents for some of these features. 781 | * **R:** Supports some of these features. 782 | * **Java:** Supports none of these features. 783 | 784 | # But Scala Has Some Disadvantages... 785 | 786 | All of the advantages discussed above make Scala code quite concise, especially compared to Java code. There are lots of nifty features available to solve particular design problems. 787 | 788 | However, no language is perfect. You should know about the disadvantages of Scala, too. 789 | 790 | Here, I'll briefly summarize some Scala and JVM issues, especially for Spark, but Dean's talk at [Strata San Jose](http://conferences.oreilly.com/strata/hadoop-big-data-ca/public/schedule/detail/47105) ([extended slides](http://deanwampler.github.io/polyglotprogramming/papers/ScalaJVMBigData-SparkLessons-extended.pdf)) goes into more detail. 791 | 792 | ## 1. Data-centric Tools and Libraries 793 | 794 | The R and Python communities have a much wider selection of data-centric tools and libraries. Python is great for general data science. R was developed by statisticians, so it has a very rich library of statistical algorithms and rich options for charting, like [ggplot2](http://ggplot2.org/). 795 | 796 | ## 2. The JVM Has Some Issues 797 | 798 | Big Data has pushed the limits of the JVM in interesting ways. 799 | 800 | ### 2a. Integer indexing of arrays 801 | 802 | Because Java has _signed_ integers only and because arrays are indexed by integers instead of longs, array sizes are limited to about 2 billion elements. Therefore, _byte_ arrays, which are often used for holding serialized data, are limited to 2GB. This is in an era when _terabyte_ (TB) heaps are becoming viable! 803 | 804 | There's no real workaround when you want the efficiency of arrays, except to implement logic that can split a large object into "chunks" and manage them accordingly. 805 | 806 | ### 2b. Inefficiency of the JVM Memory Model 807 | The JVM has a very flexible, general-purpose model of organizing data in memory and managing garbage collection. However, for massive data sets of records with the same or nearly the same schema, the model is very inefficient. Spark's [Tungsten Project](https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html) is addressing this problem by introducing custom object layouts and managed memory, as well as code generation for other performance bottlenecks. 808 | 809 | Here is an example of how Java typically lays out objects in memory. Note the references to small, discontiguous chunks of memory. Now imagine billions of these little bits of memory. That means a lot of garbage to manage. Also, the discontinuities cause poor CPU cache performance. 810 | 811 | Typical Java Object Layout 812 | 813 | Instead, Tungsten uses a more efficient, cache-friendly encoding in a contiguous byte array. The first few bytes are bit flags that indicate which fields, if any, are null. Then come 8 bytes per field for the non-null fields. If the field's value fits in 8 bytes (e.g., longs and doubles), the value is inlined here. Otherwise, the value holds an offset into the final section, a variable-length sequence of bytes where longer objects, like ASCII strings, are stored. 814 | 815 | Tungsten Object Layout 816 | 817 | ## 3. Scala REPL Weirdness 818 | 819 | The way the Scala REPL (interpreter) compiles code leads to memory leaks, which cause problems when working with big data sets and long sessions.
Imagine you write the following code in the REPL: 820 | 821 | ```scala 822 | scala> val massiveArray = get8GBarray(...) 823 | scala> // do some work 824 | scala> massiveArray = getDifferent8GBarray(...) 825 | ``` 826 | 827 | You might think that the first "8GB array" will be nicely garbage collected when you reassign `massiveArray`. Not so. Here's a simplified view of the code the REPL generates for the _last_ line to pass to the compiler. 828 | 829 | ```scala 830 | class LineN { 831 | class LineN_minus_1 { 832 | class LineN_minus_2 { 833 | ... 834 | class Line1 { 835 | val massiveArray = get8GBarray(...) 836 | } 837 | ... 838 | } 839 | } 840 | val massiveArray = getDifferent8GBarray(...) 841 | } 842 | ``` 843 | 844 | Why? The JVM expects classes to be compiled into byte code, so the REPL synthesizes classes for each line you evaluate (or group of lines when you use the `:paste ... ^D` feature). 845 | 846 | Note that the overridden `massiveArray` shadows the original one, which is the trick the REPL uses to let you redefine variables, which would be prohibited by the compiler otherwise. Unfortunately, that leaves the shadowed reference attached to old data, so it can't be garbage collected, even though the REPL provides no way to ever refer to it again! 847 | -------------------------------------------------------------------------------- /notebooks/WhyScala.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kensuio-oss/scala-for-data-science/88a896f4ce4b2584a3b0b5f49a53bc2f861982ef/notebooks/WhyScala.pdf -------------------------------------------------------------------------------- /notebooks/WhyScala.snb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "WhyScala", 4 | "user_save_timestamp": "1969-12-31T18:00:00.000Z", 5 | "auto_save_timestamp": "1969-12-31T18:00:00.000Z", 6 | "language_info": { 7 | "name": "scala", 8 | "file_extension": "scala", 9 | "codemirror_mode": "text/x-scala" 10 | }, 11 | "trusted": true, 12 | "customLocalRepo": "/tmp/repo", 13 | "customRepos": null, 14 | "customDeps": null, 15 | "customImports": null, 16 | "customArgs": null, 17 | "customSparkConf": null 18 | }, 19 | "cells": [ 20 | { 21 | "metadata": { 22 | "id": "FFD1A3B3C876454DA66D87714D2AE706" 23 | }, 24 | "cell_type": "markdown", 25 | "source": "# Scala: the Unpredicted Lingua Franca for Data Science" 26 | }, 27 | { 28 | "metadata": { 29 | "id": "A032770187B947768ADE83D1C76C6D8B" 30 | }, 31 | "cell_type": "markdown", 32 | "source": " **Andy Petrella**
[noootsab@data-fellas.guru](mailto:noootsab@data-fellas.guru)
\n **Dean Wampler**
[dean.wampler@lightbend.com](mailto:dean.wampler@lightbend.com)\n\n* Scala Days NYC, May 5th, 2016\n* GOTO Chicago, May 24, 2016\n* Strata + Hadoop World London, June 3, 2016\n* Scala Days Berlin, June 16th, 2016\n\nThis notebook available at [github.com/data-fellas/scala-for-data-science](https://github.com/data-fellas/scala-for-data-science)." 33 | }, 34 | { 35 | "metadata": { 36 | "id": "89E60F5356A24745AD7FFC3ED747D7E4" 37 | }, 38 | "cell_type": "markdown", 39 | "source": "## Why Scala for Data Science with Spark?" 40 | }, 41 | { 42 | "metadata": { 43 | "id": "3B4411F9D439469AB4992A6E8D178757" 44 | }, 45 | "cell_type": "markdown", 46 | "source": "While Python and R are traditional languages of choice for Data Science, [Spark](http://spark.apache.org) also supports Scala (the language in which it's written) and Java.\n\nHowever, using one language for all work has advantages like simplifying the software development process, such as build and deployment tools, coding conventions, etc.\n\nIf you want a thorough introduction to Scala, see [Dean's book](http://shop.oreilly.com/product/0636920033073.do).\n\nSo, what are the advantages, as well as disadvantages of Scala?" 47 | }, 48 | { 49 | "metadata": { 50 | "id": "058C9660033E42EAB7089F7B95B9441A" 51 | }, 52 | "cell_type": "markdown", 53 | "source": "## 1. Functional Programming Plus Objects\n\nScala is a _multi-paradigm_ language. Code can look a lot like traditional Java code using _Object-Oriented Programming_ (OOP), but it also embraces _Function Programming_ (FP), which emphasizes the virtues of:\n1. **Immutable values:** Mutability is a common source of bugs.\n1. **Functions with no _side effects_:** All the information they need is passed in and all the \"work\" is returned. No external state is modified.\n1. **Referential transparency:** You can replace a function call with a cached value that was returned from a previous invocation with the same arguments. (This is a benefit enabled by functions without side effects.)\n1. **Higher-order functions:** Functions that take other functions as arguments are return functions as results.\n1. **Structure separated from operations:** A core set of collections meets most needs. An operation applicable to one data structure is applicable to all." 54 | }, 55 | { 56 | "metadata": { 57 | "id": "0DDBE39BA9264FE387FA4721718608F2" 58 | }, 59 | "cell_type": "markdown", 60 | "source": "However, objects are still useful as an _encapsulation_ mechanism. This is valuable for projects with large teams and code bases. \nScala also implements some _functional_ features using _object-oriented inheritance_ (e.g., \"abstract data types\" and \"type classes\", for you experts...)." 61 | }, 62 | { 63 | "metadata": { 64 | "id": "F76E6ACFA660449C8B4CEDE9289C0DD9" 65 | }, 66 | "cell_type": "markdown", 67 | "source": "What about the other languages? \n* **Python:** Supports mixed FP-OOP programming, too, but isn't as \"rigorous\". \n* **R:** As a Statistics language, R is more functional than object-oriented.\n* **Java:** An object-oriented language, but with recently introduced functional constructs, _lambdas_ (anonymous functions) and collection operations that follow a more _functional_ style, rather than _imperative_ (i.e., where mutating the collection is embraced)." 68 | }, 69 | { 70 | "metadata": { 71 | "id": "84C5C921657D4A0B821CD106634135A4" 72 | }, 73 | "cell_type": "markdown", 74 | "source": "There are a few differences with Java's vs. 
Scala's approaches to OOP and FP that are worth mentioning specifically:" 75 | }, 76 | { 77 | "metadata": { 78 | "id": "584F4A92CB454F0586F5EBC2F4ACD135" 79 | }, 80 | "cell_type": "markdown", 81 | "source": "### 1a. Traits vs. Interfaces\nScala's object model adds a _trait_ feature, which is a more powerful concept than Java 8 interfaces. Before Java 8, there was no [mixin composition](https://en.wikipedia.org/wiki/Mixin) capability in Java, where composition is generally [preferred over inheritance](https://en.wikipedia.org/wiki/Composition_over_inheritance). \n\nImagine that you want to define reusable logging code and mix it into other classes declaratively. Before Java 8, you could define the abstraction for logging in an interface, but you had to use some ad hoc mechanism to implement it (like implementing all methods to delegate to a helper object). Java 8 added the ability to provide default method definitions, as well as declarations in interfaces. This makes mixin composition easier, but you still can't add fields (for state), so the capability is limited. \n\nScala traits fully support mixin composition by supporting both field and method definitions with flexibility rules for overriding behavior, once the traits are mixed into classes." 82 | }, 83 | { 84 | "metadata": { 85 | "id": "DC6C322EAB2244B9844CB0FDABCE6BA3" 86 | }, 87 | "cell_type": "markdown", 88 | "source": "### 1b. Java Streams\nWhen you use the Java 8 collections, you can convert the traditional collections to a \"stream\", which is lazy and gives you more functional operations. However, sometimes, the conversions back and forth can be tedious, e.g., converting to a stream for functional processing, then converting pass them to older APIs, etc. Scala collections are more consistently functional." 89 | }, 90 | { 91 | "metadata": { 92 | "id": "E706C1FC94B34C5E86C25040BDF9263C" 93 | }, 94 | "cell_type": "markdown", 95 | "source": "### The Virtue of Functional Collections\nLet's examine how concisely we can operate on a collection of values in Scala and Spark." 96 | }, 97 | { 98 | "metadata": { 99 | "id": "358539665C7D40479DC6FB0910C08FBE" 100 | }, 101 | "cell_type": "markdown", 102 | "source": "First, let's define a helper function: is an integer a prime? (Naïve algorithm from [Wikipedia](https://en.wikipedia.org/wiki/Primality_test).)" 103 | }, 104 | { 105 | "metadata": { 106 | "trusted": true, 107 | "input_collapsed": false, 108 | "collapsed": false, 109 | "id": "EB02D7EAB7F8406A849C22CE2528B443" 110 | }, 111 | "cell_type": "code", 112 | "source": "def isPrime(n: Int): Boolean = {\n def test(i: Int, n2: Int): Boolean = {\n if (i*i > n2) true\n else if (n2 % i == 0 || n2 % (i + 2) == 0) false\n else test(i+6, n2)\n }\n if (n <= 1) false\n else if (n <= 3) true\n else if (n % 2 == 0 || n % 3 == 0) false\n else test(5, n)\n}", 113 | "outputs": [] 114 | }, 115 | { 116 | "metadata": { 117 | "id": "81C67BC34835425582115F8FD36BCEC3" 118 | }, 119 | "cell_type": "markdown", 120 | "source": "Note that no values are mutated here (\"virtue\" #1 listed above) and `isPrime` has no side effects (#2), which means we could cache previous invocations for a given `n` for better performance if we called this a lot (#3)!" 121 | }, 122 | { 123 | "metadata": { 124 | "id": "1B606E0028B245438CA7BE870A1AF082" 125 | }, 126 | "cell_type": "markdown", 127 | "source": "#### Scala Collections Example\nLet's compare a Scala collections calculation vs. the same thing in Spark; how many prime numbers are there between 1 and 100, inclusive?" 
128 | }, 129 | { 130 | "metadata": { 131 | "trusted": true, 132 | "input_collapsed": false, 133 | "collapsed": false, 134 | "presentation": { 135 | "tabs_state": "{\n \"tab_id\": \"#tab239229674-0\"\n}", 136 | "pivot_chart_state": "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [],\n \"rows\": [],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}" 137 | }, 138 | "id": "5E9DCA78D8DB4E5E8362EB9138F380F5" 139 | }, 140 | "cell_type": "code", 141 | "source": "(1 to 100). // Range of integers from 1 to 100, inclusive.\n map(i => (i, isPrime(i))). // `map` is a higher-order method; we pass it a function (#4)\n groupBy(tuple => tuple._2). // ... and so is `groupBy`, etc.\n map(tuple => (tuple._1, tuple._2.size))", 142 | "outputs": [] 143 | }, 144 | { 145 | "metadata": { 146 | "id": "EA0A325F36A14A3489F3652278A1E199" 147 | }, 148 | "cell_type": "markdown", 149 | "source": "Note that for the numbers between 1 and 100, inclusive, exactly 1/4 of them are prime!" 150 | }, 151 | { 152 | "metadata": { 153 | "id": "5638A2D79A1E4073AE3DBA7342623D09" 154 | }, 155 | "cell_type": "markdown", 156 | "source": "#### Spark Example\n\nNote how similar the following code is to the previous example. After constructing the data set, the \"core\" three lines are _identical_, even though they are operating on completely different underlying collections (#5 above). \n\nHowever, because Spark collections are \"lazy\" by default (i.e., not evaluated until we ask for results), we explicitly print the results so Spark evaluates them!" 157 | }, 158 | { 159 | "metadata": { 160 | "trusted": true, 161 | "input_collapsed": false, 162 | "collapsed": false, 163 | "presentation": { 164 | "tabs_state": "{\n \"tab_id\": \"#tab1714336521-0\"\n}", 165 | "pivot_chart_state": "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [],\n \"rows\": [],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}" 166 | }, 167 | "id": "5B884B3F73BF4D0699067F8850A499AD" 168 | }, 169 | "cell_type": "code", 170 | "source": "val rddPrimes = sparkContext.parallelize(1 to 100).\n map(i => (i, isPrime(i))).\n groupBy(tuple => tuple._2).\n map(tuple => (tuple._1, tuple._2.size))\nrddPrimes.collect", 171 | "outputs": [] 172 | }, 173 | { 174 | "metadata": { 175 | "id": "DBAEF93047F04B438C18674E104C0BA0" 176 | }, 177 | "cell_type": "markdown", 178 | "source": "Note the inferred type, an `RDD` with records of type `(Boolean, Int)`, meaning two-element tuples.\n\nSpark's RDD API is inspired by the Scala collections API, which is inspired by classic _functional programming_ operations on data collections, i.e., using a series of transformations from one form to the next, without mutating any of the collections. (Spark is very efficient about avoiding the materialization of intermediate outputs.)\n\nOnce you know these operations, it's quick and effective to implement robust, non-trivial transformations." 179 | }, 180 | { 181 | "metadata": { 182 | "id": "59060F32E3C34640857B9E323B062DB6" 183 | }, 184 | "cell_type": "markdown", 185 | "source": "What about the other languages? \n* **Python:** Supports very similar functional programming. In fact, Spark Python code looks very similar to Spark Scala code. 
\n* **R:** More idiomatic (see below).\n* **Java:** Looks similar when _lambdas_ are used, but missing features (see below) limit concision and flexibility." 186 | }, 187 | { 188 | "metadata": { 189 | "id": "898023452F2442E38BD08B35E2F3B05A" 190 | }, 191 | "cell_type": "markdown", 192 | "source": "## 2. Interpreter (REPL)" 193 | }, 194 | { 195 | "metadata": { 196 | "id": "72A05DBF93454EAB840C4D9E25929017" 197 | }, 198 | "cell_type": "markdown", 199 | "source": "We've been using the Scala interpreter (a.k.a., the REPL - Read Eval, Print, Loop) already behind the scenes. It makes notebooks like this one possible!" 200 | }, 201 | { 202 | "metadata": { 203 | "id": "DA296C929D1B43A191C22FAEB538458F" 204 | }, 205 | "cell_type": "markdown", 206 | "source": "What about the other languages? \n* **Python:** Also has an interpreter and [iPython/Jupyter](https://ipython.org/) was one of the first, widely-used notebook environments.\n* **R:** Also has an interpreter and notebook/IDE environments.\n* **Java:** Does _not_ have an interpreter and can't be programmed in a notebook environment. However, Java 9 will have a REPL, after 20+ years!" 207 | }, 208 | { 209 | "metadata": { 210 | "id": "32F89F305EFE4D2B9FF61F0B357A0BA7" 211 | }, 212 | "cell_type": "markdown", 213 | "source": "## 3. Tuple Syntax\nIn data, you work with records of `n` fields (for some value of `n`) all the time. Support for `n`-element _tuples_ is very convenient and Scala has a shorthand syntax for instantiating tuples. We used it twice previously to return two-element tuples in the anonymous functions passed to the `map` methods above:" 214 | }, 215 | { 216 | "metadata": { 217 | "trusted": true, 218 | "input_collapsed": false, 219 | "collapsed": false, 220 | "id": "2C28CB38977246F8AC8F202459D99B43" 221 | }, 222 | "cell_type": "code", 223 | "source": "sparkContext.parallelize(1 to 100).\n map(i => (i, isPrime(i))). // <-- here\n groupBy(tuple => tuple._2).\n map(tuple => (tuple._1, tuple._2.size)) // <-- here", 224 | "outputs": [] 225 | }, 226 | { 227 | "metadata": { 228 | "id": "4BA733B4AA944F969B5F1DE83B7D3220" 229 | }, 230 | "cell_type": "markdown", 231 | "source": "**Tuples are used all the time** in Spark Scala RDD code, where it's common to use key-value pairs." 232 | }, 233 | { 234 | "metadata": { 235 | "id": "445D28CB9F5347088A038275A7390447" 236 | }, 237 | "cell_type": "markdown", 238 | "source": "What about the other languages? \n* **Python:** Also has some support for the same tuple syntax.\n* **R:** Also has tuple types, but a less convenient syntax for instantiating them.\n* **Java:** Does _not_ have tuple types, not even the special case of two-element tuples (pairs), much less a convenient syntax for them. However, Spark defines a [MutablePair](http://spark.apache.org/docs/latest/api/java/org/apache/spark/util/MutablePair.html) type for this purpose:" 239 | }, 240 | { 241 | "metadata": { 242 | "trusted": true, 243 | "input_collapsed": false, 244 | "collapsed": false, 245 | "id": "C4162770FDDA49438DE4E54638979322" 246 | }, 247 | "cell_type": "code", 248 | "source": "// Using Scala syntax here:\nimport org.apache.spark.util.MutablePair\nval pair = new MutablePair[Int,String](1, \"one\")", 249 | "outputs": [] 250 | }, 251 | { 252 | "metadata": { 253 | "id": "787A952B7EDD45329BFE71BDF82617B4" 254 | }, 255 | "cell_type": "markdown", 256 | "source": "## 4. Pattern Matching\nThis is one of the most powerful features you'll find in most functional languages, Scala included. 
It has no equivalent in Python, R, or Java.\n\nLet's rewrite our previous primes example:" 257 | }, 258 | { 259 | "metadata": { 260 | "trusted": true, 261 | "input_collapsed": false, 262 | "collapsed": false, 263 | "id": "48179E0BC6C3404289378596471F1D11" 264 | }, 265 | "cell_type": "code", 266 | "source": "sparkContext.parallelize(1 to 100).\n map(i => (i, isPrime(i))).\n groupBy{ case (_, primality) => primality}. // Syntax: { case pattern => body }\n map{ case (primality, values) => (primality, values.size) } . // used here, too\n foreach(println)", 267 | "outputs": [] 268 | }, 269 | { 270 | "metadata": { 271 | "id": "8E8F39DFB50F44118B9095470C560E2B" 272 | }, 273 | "cell_type": "markdown", 274 | "source": "Note the `case` keyword and `=>` separating the pattern from the body to execute if the pattern matches.\n\nIn the first pattern, `(_, primality)`, we didn't need the first tuple element, so we used the \"don't care\" placeholder, `_`. Note also that `{...}` must be used instead of `(...)`. (The extra whitespace after the `{` and before the `}` is not required; it's here for legibility.)\n\nPattern matching is much richer, while more concise than `if ... else ...` constructs in the other languages and we can use it on nearly anything to match what it is and then decompose it into its constituent parts, which are assigned to variables with meaningful names, e.g., `primality`, `values`, etc. " 275 | }, 276 | { 277 | "metadata": { 278 | "id": "398B4FFC54784BD18F7C2F38EF8F9223" 279 | }, 280 | "cell_type": "markdown", 281 | "source": "Here's another example, where we _deconstruct_ a nested tuple. We also show that you can use pattern matching for assignment, too!" 282 | }, 283 | { 284 | "metadata": { 285 | "trusted": true, 286 | "input_collapsed": false, 287 | "collapsed": false, 288 | "id": "722539C191BC4E09B50334373DE61C7E" 289 | }, 290 | "cell_type": "code", 291 | "source": "val (a, (b1, (b21, b22)), c) = (\"A\", (\"B1\", (\"B21\", \"B22\")), \"C\")", 292 | "outputs": [] 293 | }, 294 | { 295 | "metadata": { 296 | "id": "288036612A4D4267890DC7DDA156C8CA" 297 | }, 298 | "cell_type": "markdown", 299 | "source": "## 5. Case Classes \nNow is a good time to introduce a convenient way to declare classes that encapsulate some state that is composed of some values, called _case classes_." 
300 | }, 301 | { 302 | "metadata": { 303 | "trusted": true, 304 | "input_collapsed": false, 305 | "collapsed": false, 306 | "id": "BFE7303C687B473885B74C2784C73B72" 307 | }, 308 | "cell_type": "code", 309 | "source": "case class Person(firstName: String, lastName: String, age: Int)", 310 | "outputs": [] 311 | }, 312 | { 313 | "metadata": { 314 | "id": "9901608E6BBA45CCBBAEC1F0EDDD8B99" 315 | }, 316 | "cell_type": "markdown", 317 | "source": "The `case` keyword tells the compiler to:\n* Make immutable instance fields out of the constructor arguments (the list after the name).\n* Add `equals`, `hashCode`, and `toString` methods (which you can explicitly define yourself, if you want).\n* Add a _companion object_ with the same name, which holds methods for constructing instances and \"destructuring\" instances through patterning matching.\n* Add `copy` (constructor-)methods with default values for each field being their current value (for `this` instance).\n* etc.\n\nCase classes are useful for implementing records in RDDs.\n\nLet's see case class pattern matching in action:" 318 | }, 319 | { 320 | "metadata": { 321 | "trusted": true, 322 | "input_collapsed": false, 323 | "collapsed": false, 324 | "id": "D3745DD75E024E7C97A653963381E756" 325 | }, 326 | "cell_type": "code", 327 | "source": "sparkSession.\n createDataFrame(Seq(Person(\"Dean\", \"Wampler\", 39), Person(\"Andy\", \"Petrella\", 29))).\n as[Person].\n map { \n case c@Person(first, last, a) => c.copy(age = a + 1) // happy birthday\n }", 328 | "outputs": [] 329 | }, 330 | { 331 | "metadata": { 332 | "id": "8758F66233004E7A9DDA3CF46A6D586D" 333 | }, 334 | "cell_type": "markdown", 335 | "source": "What about the other languages? \n* **Python:** Regular expression matching for strings is built in. Pattern matching as shown requires a third-party library with an idiomatic syntax. Nothing like case classes.\n* **R:** Only supports regular expression matching for strings. Nothing like case classes.\n* **Java:** Only supports regular expression matching for strings. Nothing like case classes." 336 | }, 337 | { 338 | "metadata": { 339 | "id": "DB4A6293E81B4C748AE7C26E2B393EA2" 340 | }, 341 | "cell_type": "markdown", 342 | "source": "## 6. Type Inference\nMost languages associate a type with values, but they fall into two categories, crudely speaking, those which evaluate the type of expressions and variables at compile time (like Scala and Java) and those which do so at runtime (Python and R). This is called _static typing_ and _dynamic typing_, respectively.\n\nSo, languages with static typing either have to be told the type of every expression or variable, or they can _infer_ types in some or all cases. Scala can infer types most of the time, while Java can do so only in limited cases. Here are some examples for Scala. 
Note the results shown for each expression:" 343 | }, 344 | { 345 | "metadata": { 346 | "trusted": true, 347 | "input_collapsed": false, 348 | "collapsed": false, 349 | "id": "BB8E2DBE4DB649138913CFA2BE9016D5" 350 | }, 351 | "cell_type": "code", 352 | "source": "val i = 100 // <- infer that i is an integer\nval j = i*i % 27 // <- since i is an integer, j must be one, too.", 353 | "outputs": [] 354 | }, 355 | { 356 | "metadata": { 357 | "id": "E5A4DAF4682144A7B07B20E2B334B422" 358 | }, 359 | "cell_type": "markdown", 360 | "source": "Recall our previous Spark example, where we wrote nothing about types, but they were inferred: " 361 | }, 362 | { 363 | "metadata": { 364 | "trusted": true, 365 | "input_collapsed": false, 366 | "collapsed": false, 367 | "id": "261F068BDF344DA6B96DC784AEB86F19" 368 | }, 369 | "cell_type": "code", 370 | "source": "sparkContext.parallelize(1 to 100).\n map(i => (i, isPrime(i))).\n groupBy{ case(_, primality) => primality }. // Syntax: { case pattern => body }\n map{ case (primality, values) => (primality, values.size) } // used here, too", 371 | "outputs": [] 372 | }, 373 | { 374 | "metadata": { 375 | "id": "B62A3D05041F4E24860434EE37D1AE81" 376 | }, 377 | "cell_type": "markdown", 378 | "source": "So this long expression (and it is a four-line expression - note the \".\"'s) returns an `RDD[(Boolean, Int)]`. Note that we can also express a tuple _type_ with the `(...)` syntax, just like for tuple _instances_. This type could also be written `RDD[Tuple2[Boolean, Int]]`.\n\nPut another way, we have an `RDD` where the records are key-value pairs of `Booleans` and `Ints`." 379 | }, 380 | { 381 | "metadata": { 382 | "id": "D2DA2C79231F4685908D8F9A211FD8E2" 383 | }, 384 | "cell_type": "markdown", 385 | "source": "I really like the extra safety that static typing provides, without the hassle of writing the types for almost everything, compared to Java. Furthermore, when I'm using an API with the Scala interpreter or a notebook like this one, the return value's type is shown, as in the previous example, so I know exactly what \"kinds of things\" I have. That also means I don't have to know _in advance_ what a method will return, in order to explicit add a required type, as in Java." 386 | }, 387 | { 388 | "metadata": { 389 | "id": "AF7FAC15A7984A878800DC544E1D663F" 390 | }, 391 | "cell_type": "markdown", 392 | "source": "What about the other languages? \n* **Python:** Uses dynamic typing, so no types are written explicitly, but you also don't get the feedback type inference provides, as in our `RDD[(Boolean, Int)]` example.\n* **R:** Also dynamically typed.\n* **Java:** Statically typed with explicit types required almost everywhere." 393 | }, 394 | { 395 | "metadata": { 396 | "id": "115056A879A9452F97AECC2A32A2CEBA" 397 | }, 398 | "cell_type": "markdown", 399 | "source": "## 7. Unification of Primitives and Types\nIn Java, there is a clear distinction between primitives, which are nice for performance (you can put them in registers, you can pass them on the stack, you don't heap allocate them), and instances of classes, which give you the expressiveness of OOP, but with the overhead of heap allocation, etc.\n\nScala unifies the syntax, but in most cases, compiles optimal code. So, for example, `Float` acts like any other type, e.g., `String`, with methods you call, but the compiler uses JVM `float` primitives. 
`Float` and the other primitives are subtypes of `AnyVal` and include `Byte`, `Short`, `Int`, `Long`, `Float`, `Double`, `Char`, `Boolean`, and `Unit`.\n\nAnother benefit is that the uniformity extends to parameterized types, like collections. If you implement your own `Tree[T]` type, `T` can be `Float`, `String`, `MyMassiveClass`, whatever. There's no mental burden of explicitly boxing and unboxing primitives.\n\nHowever, the downside is that your primitives will be boxed when used in a context like this. Scala does have an annotation `@specialized(a,b,c)` that's used to tell the compiler to generate optimal implementations for the primitives listed for `a,b,c`, but it's not a perfect solution." 400 | }, 401 | { 402 | "metadata": { 403 | "trusted": true, 404 | "input_collapsed": false, 405 | "collapsed": false, 406 | "id": "FE1190855EE24235B43678F98CA1D3FB" 407 | }, 408 | "cell_type": "code", 409 | "source": "val listString: List[String] = List(\"one\", \"two\", \"three\")\nval listInt: List[Int] = List(1, 2, 3) // No need to use Integer.", 410 | "outputs": [] 411 | }, 412 | { 413 | "metadata": { 414 | "id": "37172DAD447A41DC8AB60C19A92B64E5" 415 | }, 416 | "cell_type": "markdown", 417 | "source": "See also **Value Classes** below." 418 | }, 419 | { 420 | "metadata": { 421 | "id": "F9B2730EB3D54267B924086416959197" 422 | }, 423 | "cell_type": "markdown", 424 | "source": "## 8. Elegant Tools to Create \"Domain Specific Languages\"\nThe Spark [DataFrame](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame) API is a good example of DSL that mimics the original Python and R DataFrame APIs for single-node use. \n\nFirst, set up the API:" 425 | }, 426 | { 427 | "metadata": { 428 | "id": "C4E214979E53400AA22437CAD15B8801" 429 | }, 430 | "cell_type": "markdown", 431 | "source": "Get the root directory of the notebooks:" 432 | }, 433 | { 434 | "metadata": { 435 | "trusted": true, 436 | "input_collapsed": false, 437 | "collapsed": false, 438 | "id": "B32809D4A5EE4F29846A6CCFD9B19BCE" 439 | }, 440 | "cell_type": "code", 441 | "source": "val root = sys.env(\"NOTEBOOKS_DIR\")", 442 | "outputs": [] 443 | }, 444 | { 445 | "metadata": { 446 | "id": "F57FBE6DCC34418B8AD158423F1A602C" 447 | }, 448 | "cell_type": "markdown", 449 | "source": "Load the airports data into a [DataFrame](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame). The cell returns Scala \"Unit\", `()`, which is sort of like `void`, to avoid an annoying bug in the output:" 450 | }, 451 | { 452 | "metadata": { 453 | "trusted": true, 454 | "input_collapsed": false, 455 | "collapsed": false, 456 | "id": "78D3E11A8F8F45EE86E884812BCE9564" 457 | }, 458 | "cell_type": "code", 459 | "source": "val airportsDF = sparkSession.read.json(s\"$root/notebooks/airports.json\")", 460 | "outputs": [] 461 | }, 462 | { 463 | "metadata": { 464 | "id": "6974DAF449AB43158E79982E35D3C100" 465 | }, 466 | "cell_type": "markdown", 467 | "source": "Note the \"schema\" is inferred from the JSON and shown by the REPL (by calling `DataFrame.toString`).\n\nWe cache the results, so Spark will keep the data in memory since we'll run a few queries over it. `DataFrame.show` is convenient for displaying the first `N` records (20 by default)." 
468 | }, 469 | { 470 | "metadata": { 471 | "trusted": true, 472 | "input_collapsed": false, 473 | "collapsed": false, 474 | "id": "C4943349BC644D468A72E623BE10F17A" 475 | }, 476 | "cell_type": "code", 477 | "source": "airportsDF.cache\nairportsDF", 478 | "outputs": [] 479 | }, 480 | { 481 | "metadata": { 482 | "id": "654B4596BCFE41FD99E64DCFDC9314C7" 483 | }, 484 | "cell_type": "markdown", 485 | "source": "Now we can show the idiomatic DataFrame API (DSL) in action:" 486 | }, 487 | { 488 | "metadata": { 489 | "trusted": true, 490 | "input_collapsed": false, 491 | "collapsed": false, 492 | "id": "2DF9D1306DAB41319E03DD8322721545" 493 | }, 494 | "cell_type": "code", 495 | "source": "val grouped = airportsDF.groupBy($\"state\", $\"country\").count.orderBy($\"count\".desc)\ngrouped.printSchema\ngrouped.limit(100) // all 50 states + territories < 100", 496 | "outputs": [] 497 | }, 498 | { 499 | "metadata": { 500 | "id": "62D7351CF14F4C668DD1FA5AB4B2E600" 501 | }, 502 | "cell_type": "markdown", 503 | "source": "By the way, this DSL is essentially a programmatic version of SQL:" 504 | }, 505 | { 506 | "metadata": { 507 | "trusted": true, 508 | "input_collapsed": false, 509 | "collapsed": false, 510 | "id": "322C5E92AF6741E68445C244E43A5F20" 511 | }, 512 | "cell_type": "code", 513 | "source": "airportsDF.registerTempTable(\"airports\")\nval grouped2 = sparkSession.sqlContext.sql(\"\"\"\n SELECT state, country, COUNT(*) AS cnt FROM airports\n GROUP BY state, country\n ORDER BY cnt DESC\n\"\"\")", 514 | "outputs": [] 515 | }, 516 | { 517 | "metadata": { 518 | "trusted": true, 519 | "input_collapsed": false, 520 | "collapsed": false, 521 | "id": "D30F0C37ECDA4E7CBE020A874AC53E68" 522 | }, 523 | "cell_type": "code", 524 | "source": "grouped2.printSchema\ngrouped2", 525 | "outputs": [] 526 | }, 527 | { 528 | "metadata": { 529 | "id": "5B5B48609FAB4F18BA0AD49AE547F028" 530 | }, 531 | "cell_type": "markdown", 532 | "source": "What about the other languages? \n* **Python:** Dynamically-typed languages often have features that make idiomatic DSLs easy to define. The Spark DataFrame API is inspired by the [Pandas DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) API.\n* **R:** Less flexible for idiomatic DSLs, but syntax is designed for Mathematics. The Pandas DataFrame API is inspired by the [R Data Frame](http://www.r-tutor.com/r-introduction/data-frame) API.\n* **Java:** Limited to so-called _fluent_ APIs, similar to our collections and RDD examples above." 533 | }, 534 | { 535 | "metadata": { 536 | "id": "D098E3AAA94942EDBEC629B643C53F4A" 537 | }, 538 | "cell_type": "markdown", 539 | "source": "## 9. And a Few Other Things...\nThere are many more Scala features that the other languages don't have or don't support as nicely. Some are actually quite significant for general programming tasks, but they are used less frequently in Spark code. Here they are, for completeness." 540 | }, 541 | { 542 | "metadata": { 543 | "id": "563FAC0C5F894273839564136EFD4287" 544 | }, 545 | "cell_type": "markdown", 546 | "source": "### 9A. Singletons Are a Built-in Feature\nImplement the _Singleton Design Pattern_ without special logic to ensure there's only one instance." 
547 | }, 548 | { 549 | "metadata": { 550 | "trusted": true, 551 | "input_collapsed": false, 552 | "collapsed": false, 553 | "id": "D5EFC6E128DC4BD9840CAE8A94EE1DA7" 554 | }, 555 | "cell_type": "code", 556 | "source": "object Foo {\n def main(args: Array[String]):Unit = {\n args.foreach(arg => println(s\"arg = $arg\"))\n }\n}\nFoo.main(Array(\"Scala\", \"is\", \"great!\"))", 557 | "outputs": [] 558 | }, 559 | { 560 | "metadata": { 561 | "id": "CF2FC2CC60AA4D1982EF940CDDF28A67" 562 | }, 563 | "cell_type": "markdown", 564 | "source": "### 9B. Named and Default Arguments\nDoes a method have a long argument list? Provide defaults for some of them. Name the arguments when calling the method to document what you're doing." 565 | }, 566 | { 567 | "metadata": { 568 | "trusted": true, 569 | "input_collapsed": false, 570 | "collapsed": false, 571 | "id": "63781B531D29467EAA44AC81CF96E7DF" 572 | }, 573 | "cell_type": "code", 574 | "source": "val airportsRDD = grouped.select($\"count\", $\"state\").map(row => (row.getLong(0), row.getString(1))).rdd", 575 | "outputs": [] 576 | }, 577 | { 578 | "metadata": { 579 | "trusted": true, 580 | "input_collapsed": false, 581 | "collapsed": false, 582 | "id": "4AB3F2B4D4DD4E70A20D31E4568CE074" 583 | }, 584 | "cell_type": "code", 585 | "source": "val rdd1 = airportsRDD.sortByKey() // defaults: ascending = true, numPartitions = \nval rdd2 = airportsRDD.sortByKey(ascending = false) // name the ascending argument explicitly\nval rdd3 = airportsRDD.sortByKey(numPartitions = 4) // name the numPartitions argument explicitly\nval rdd4 = airportsRDD.sortByKey(ascending = false, numPartitions = 4) // Okay to do both...", 586 | "outputs": [] 587 | }, 588 | { 589 | "metadata": { 590 | "trusted": true, 591 | "input_collapsed": false, 592 | "collapsed": false, 593 | "presentation": { 594 | "tabs_state": "{\n \"tab_id\": \"#tab1954373211-0\"\n}", 595 | "pivot_chart_state": "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [],\n \"rows\": [],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}" 596 | }, 597 | "id": "5D959F19FE7E44588FDC45A18368A06C" 598 | }, 599 | "cell_type": "code", 600 | "source": "containerFluid(List(\n List(rdd1, rdd2, rdd3, rdd4).map { rdd => \n text(s\"#partitions = ${rdd.partitions.length}\") ++\n TableChart(rdd.take(10), sizes=(150,150))//.foreach(println)\n }.map(table => (table, 3))\n))", 601 | "outputs": [] 602 | }, 603 | { 604 | "metadata": { 605 | "id": "6F439C6371434BD39E75E6753164273A" 606 | }, 607 | "cell_type": "markdown", 608 | "source": "### 9C. String Interpolation\nYou've seen it used already:" 609 | }, 610 | { 611 | "metadata": { 612 | "trusted": true, 613 | "input_collapsed": false, 614 | "collapsed": false, 615 | "id": "6BEA262B490540E79A21B1A1282F54C3" 616 | }, 617 | "cell_type": "code", 618 | "source": "s\"RDD #partitions = ${rdd4.partitions.length}\"", 619 | "outputs": [] 620 | }, 621 | { 622 | "metadata": { 623 | "id": "17123565F5964C05A302350950C80E2A" 624 | }, 625 | "cell_type": "markdown", 626 | "source": "### 9D. Few Semicolons\nSemicolons are inferred, making your code just that much more concise. 
You can use them if you want to write more than one expression on a line:" 627 | }, 628 | { 629 | "metadata": { 630 | "trusted": true, 631 | "input_collapsed": false, 632 | "collapsed": false, 633 | "id": "BF39CCF4418042FCBDE25C3E3CFD7008" 634 | }, 635 | "cell_type": "code", 636 | "source": "val result = \"foo\" match {\n case \"foo\" => println(\"Found foo!\"); true\n case _ => false\n}", 637 | "outputs": [] 638 | }, 639 | { 640 | "metadata": { 641 | "id": "565ED52438AC46ADAC0BC8C82FDC8212" 642 | }, 643 | "cell_type": "markdown", 644 | "source": "### 9E. Tail Recursion Optimization\n\nRecursion isn't used much in user code for Spark, but for general programming it's a powerful technique. Unfortunately, most OO languages (like Java) do not optimize [tail call recursion](https://en.wikipedia.org/wiki/Tail_call) by converting the recursion into a loop. Without this optimization, use of recursion is risky, because of the risk of stack overflow. Scala's compiler implements this optimization. " 645 | }, 646 | { 647 | "metadata": { 648 | "trusted": true, 649 | "input_collapsed": false, 650 | "collapsed": false, 651 | "id": "ECAEC625BFBD4B82957F5D81EE524341" 652 | }, 653 | "cell_type": "code", 654 | "source": "def printSeq[T](seq: Seq[T]): Unit = seq match {\n case head +: tail => println(head); printSeq(tail)\n case Nil => // done\n}\nprintSeq(Seq(1,2,3,4))", 655 | "outputs": [] 656 | }, 657 | { 658 | "metadata": { 659 | "id": "FAD283C29E41425E89594C7DBC3122E7" 660 | }, 661 | "cell_type": "markdown", 662 | "source": "### 9F. Everything Is an Expression\nSome constructs are _statements_ (meaning they return nothing) in some languages, like `if ... then ... else`, `for` loops, etc. Almost everything is an expression in Scala which means you can assign results of the `if` or `for` expression. The alternative in the other languages is that you have to declare a mutable variable, then set its value inside the statement." 663 | }, 664 | { 665 | "metadata": { 666 | "trusted": true, 667 | "input_collapsed": false, 668 | "collapsed": false, 669 | "id": "EF9F87A5DF6A4589AE09E2F8FD05279E" 670 | }, 671 | "cell_type": "code", 672 | "source": "val worldRocked = if (true == false) \"yes!\" else \"no\"", 673 | "outputs": [] 674 | }, 675 | { 676 | "metadata": { 677 | "trusted": true, 678 | "input_collapsed": false, 679 | "collapsed": false, 680 | "presentation": { 681 | "tabs_state": "{\n \"tab_id\": \"#tab478386907-0\"\n}" 682 | }, 683 | "id": "9351539BEF5F42369638265DCD40C0FD" 684 | }, 685 | "cell_type": "code", 686 | "source": "val primes = for {\n i <- 0 until 100\n if isPrime(i)\n} yield i", 687 | "outputs": [] 688 | }, 689 | { 690 | "metadata": { 691 | "id": "BA85DCC039794BFC804D47C1D0CBBFEF" 692 | }, 693 | "cell_type": "markdown", 694 | "source": "### 9G. Implicits\nOne of Scala's most powerful features is the _implicits_ mechanism. It's used (or misused) for several capabilities, but one of the most useful is the ability to \"add\" methods to existing types that don't already have the methods. What actually happens is the compiler invokes an _implicit conversion_ from an instance of the type to a wrapper type that has the desired method. \n\nFor example, suppose I want to add a `toJSON` method to my `Person` type above, but I don't want this added to the class itself. Maybe it's from a library that I can't modify. Maybe I only want this method in some contexts, but I don't want its baggage everywhere. Here's how to do it." 
695 | }, 696 | { 697 | "metadata": { 698 | "trusted": true, 699 | "input_collapsed": false, 700 | "collapsed": false, 701 | "id": "3A3561A859AB4205BABA7D1C2EE76CEF" 702 | }, 703 | "cell_type": "code", 704 | "source": "// repeat definition of Person: \ncase class Person(firstName: String, lastName: String, age: Int)\n\nimplicit class PersonToJSON(person: Person) {\n // Just return a JSON-formatted string, for simplicity of the example:\n def toJSON: String = \n s\"\"\"{ \"firstName\": ${person.firstName}, \"lastName\": ${person.lastName}, \"age\": ${person.age} }\"\"\"\n}\n\nval p = Person(\"Dean\", \"Wampler\", 39)\np.toJSON // Like magic!!", 705 | "outputs": [] 706 | }, 707 | { 708 | "metadata": { 709 | "id": "12E785FDCA514870A0DE2CFA8F739388" 710 | }, 711 | "cell_type": "markdown", 712 | "source": "The `implicit` keyword tells the compiler to consider `PersonToJSON` when I attempt to call `toJSON` on a `Person` instance. The compiler finds this implicit class and does the conversion implicitly, then calls the `toJSON` method.\n\nThere are many other uses for implicits. They are a powerful implementation tool for various design problems, but they have to be used wisely, because it can be difficult for the reader to know what's going on." 713 | }, 714 | { 715 | "metadata": { 716 | "id": "43D51BE94F464AE8B794F673544EF5D3" 717 | }, 718 | "cell_type": "markdown", 719 | "source": "### 9H. Sealed Type Hierarchies\nAn important concept in modern languages is _sum types_, where there is a finite set of possible instances. Two examples from Scala are `Option[T]` and its allowed subtypes `Some[T]` and `None`, and `Either[L,R]` and its subtypes `Left[L]` and `Right`[R]`.\n\nNote that `Option[T]` represents two and only two possible states, either I have something, a `T` inside a `Some[T]`, or I don't anything, a `None`. There are no additional \"states\" that are logically possible for the `Option[T]` \"abstraction\". Similarly, `Either[L,R]` encapsulates a similar dichotomy, often used for \"failure\" (e.g., `Left[Throwable]` by convention) and \"successful result\" (`Right[T]` for some \"expected\" `T`).\n\nThe term _sum type_ comes from an analog between types and arithmetic. For `Option`, the number of allowed intances (ignoring the type parameter `T`) is just the sum, _two_. Similarly for `Either`.\n\nThere are also _product types_, like tuples, where combining types together _multiplies_ the number of instances. For example, a tuple of `(Option,Either)` would have 2*2 instances. A tuple `(Boolean,Option,HTTP_Commands)` has 2*2*7 possible instances (there are 7 HTTP 1.1 commands, like `GET`, `POST`, etc.)\n\nScala uses type hierarchies for sum types, where an abstract _sealed_ trait or class is used for the base type, e.g., `Option[T]` and `Either[L,R]`, and subtypes represent the concrete types. The `sealed` keyword is used on the base type and it is crucial; it tells the compiler to only allow subtypes to be defined in the same _file_, which means users can't add their own subtypes, breaking the logic of the type.\n\nSome other languages implement sum types using a variation of _enumerations_. Java has that, but it's a much more limited concept than true subtypes." 
720 | }, 721 | { 722 | "metadata": { 723 | "id": "ED60872A44504B0C8B3726352EBD3CD4" 724 | }, 725 | "cell_type": "markdown", 726 | "source": "Here's an example, sort of like `Either`, but oriented more towards the usage of encapsulating success or failure.\nHowever, we'll put \"success\" on the left instead of the right, which is the convention when using `Either`.\n\nWe'll have one type parameter `Result`; on `Success`, it will hold an instance of the type `Result`.\nOn Failure, it will hold no successful result, so we'll use the \"bottom\" type `Nothing` for the type parameter,\nand expect the error information to be returned in a `RuntimeException`." 727 | }, 728 | { 729 | "metadata": { 730 | "trusted": true, 731 | "input_collapsed": false, 732 | "collapsed": false, 733 | "id": "7A40CA303C9443A3828EEAFBAC0186AA" 734 | }, 735 | "cell_type": "code", 736 | "source": "// The + means \"contravariant\"; we can use subtypes of the declared \"Result\". \n// See also the **Definition Site Invariance...** section below.\nsealed trait SuccessOrFailure[+Result] \ncase class Success[Result](result: Result) extends SuccessOrFailure[Result]\ncase class Failure(error: RuntimeException) extends SuccessOrFailure[Nothing]", 737 | "outputs": [] 738 | }, 739 | { 740 | "metadata": { 741 | "id": "BF40808FF5E5411A871EF3E75080DCF6" 742 | }, 743 | "cell_type": "markdown", 744 | "source": "The `sealed` keyword is actually less useful in the context of this notebook; we can keep on defining subclasses below. However, in library code, you would put the three declarations in a separate file and then the compiler would prevent anyone from defining a third subclass in a different location." 745 | }, 746 | { 747 | "metadata": { 748 | "id": "A77A46BDA4A24B1ABA20F6CE9D824156" 749 | }, 750 | "cell_type": "markdown", 751 | "source": "Let's try it out." 752 | }, 753 | { 754 | "metadata": { 755 | "trusted": true, 756 | "input_collapsed": false, 757 | "collapsed": false, 758 | "id": "674289AA57C64BF08E71F8EEF3F222D6" 759 | }, 760 | "cell_type": "code", 761 | "source": "def parseInt(string: String): SuccessOrFailure[Int] = try {\n Success(Integer.parseInt(string))\n} catch {\n case nfe: NumberFormatException => Failure(new RuntimeException(s\"\"\"Invalid integer string: \"$string\" \"\"\"))\n}", 762 | "outputs": [] 763 | }, 764 | { 765 | "metadata": { 766 | "trusted": true, 767 | "input_collapsed": false, 768 | "collapsed": false, 769 | "presentation": { 770 | "tabs_state": "{\n \"tab_id\": \"#tab1369006273-0\"\n}" 771 | }, 772 | "id": "30C338D63C5641BC820C4E19C4FC74F0" 773 | }, 774 | "cell_type": "code", 775 | "source": "Seq(\"1\", \"202\", \"three\").map(parseInt)", 776 | "outputs": [] 777 | }, 778 | { 779 | "metadata": { 780 | "id": "EEA7210546F049D58D92A8C97F4A69E7" 781 | }, 782 | "cell_type": "markdown", 783 | "source": "### 9I. Option Type Broken in Java\nSpeaking of `Option[T]`, Java 8 introduced a similar type called `Optional`. (The name `Option` was already used for something else.) However, its design has some subtleties that make the behavior not straightforward when `nulls` are involved. For details, see [this blog post](https://developer.atlassian.com/blog/2015/08/optional-broken/)." 784 | }, 785 | { 786 | "metadata": { 787 | "id": "7C1D0E4BCDD24C9287BF14D74FDFE97D" 788 | }, 789 | "cell_type": "markdown", 790 | "source": "### 9J: Definition-site Variance vs. Call-site Variance\nThis is a technical point. 
In Java, when you define a type with a type parameter, like our `SuccessOrFailure[T]` previously, to hold items of some type `T`, you can't specify in the declaration whether it's okay to substitute a subtype of `SuccessOrFailure` with a subtype of `T`. For example, is the following okay?:\n\n```java\n// Java\nSuccessOrFailure sof = null;\n...\nsof = new Success(\"foo\");\n```\n\nThis substitutability is called _variance_, referring to the variance allowed in `T` if we use a subtype of the outer type, `SuccessOrFailure`. Notice that we want to assign a subclass of `SuccessOrFailure` _and_ a subtype of `Object`. In this case, we're doing _covariant substitution_, because the subtyping \"moves in the same direction\", from parent to child for both types. There's also _contravariant_, where the type parameter moves \"up\" while the outer type moves \"down\", and _invariant_ typing, where you can't change the type parameter. That is, in the invariant case, we could only assign `Success(...)` to `sof`.\n\nJava does not let the type _designer_ specify the correct behavior. This means Java forces the _user_ of the type to specify the variance at the _call site_:\n\n```java\nSuccessOrFailure sof = null;\n...\nsof = new Success(\"foo\");\n\n```\nThis is harder for the user, who has to understand what's okay in this case, both what the designer intended and some technical rules of type theory. \n\nIt's much better if the _designer_ of `SuccessOrFailure[T]`, who understands the desired behavior, defines the allowed variance behavior at the _definition site_, which Scala supports. Recall from above:\n\n```scala\n// Scala\nsealed trait SuccessOrFailure[+Result] \ncase class Success[Result](result: Result) extends SuccessOrFailure[Result]\ncase class Failure(error: RuntimeException) extends SuccessOrFailure[Nothing]\n\n...\n// usage:\nval sof: SuccessOrFailure[AnyRef] = new Success[String](\"Yea!\")\n```" 791 | }, 792 | { 793 | "metadata": { 794 | "id": "069B350306444D689D66BE85E7C96B21" 795 | }, 796 | "cell_type": "markdown", 797 | "source": "### 9K: Value Classes\nScala's built-in _value types_ `Int`, `Long`, `Float`, `Double`, `Boolean`, and `Unit` are implemented with the corresponding JVM primitive values, eliminating the overhead of allocating an instance on the heap. What if you define a class that wraps _one_ of these values?\n```scala\nclass Celsius(value: Float) {\n // methods\n}\n```\n Unfortunately, instances are allocated on the heap, even though all instance \"state\" is held by a single primitive `Float`. Scala now has an `AnyVal` trait. If you use it as a parent of types like `Celsius`, they will enjoy the same optimization that the built-in value types enjoy. That is, the single primitive field (`value` here) will be pushed on the stack, etc., and no instance of `Celsius` will be heap allocated, _in most cases._\n```scala\nclass Celsius(value: Float) extends AnyVal {\n // methods\n}\n```\n\nSo, why doesn't Scala make this optimization automatically? There are some limitations, which are described [here](http://docs.scala-lang.org/overviews/core/value-classes.html) and in [my book](http://shop.oreilly.com/product/0636920033073.do)." 798 | }, 799 | { 800 | "metadata": { 801 | "id": "BF296220FD9B4E1987610CDFE7D86C91" 802 | }, 803 | "cell_type": "markdown", 804 | "source": "### 9L. Lazy Vals\nSometimes you don't want to initialize a value if doing so is expensive and you won't always need it. Or, sometimes you just want to delay the \"hit\" so you're up and running more quickly. 
For example, a database connection is expensive.\n```scala\nlazy val jdbcConnection = new JDBCConnection(...)\n```\nUse the `lazy` keyword to delay initialization until it's actually needed (if ever). This feature can also be used to solve some tricky \"order of initialization\" problems. It has one drawback; there will be extra overhead for every access to check if it has already been initialized, so don't do this if the value will be read a lot. A future version of Scala will remove this overhead." 805 | }, 806 | { 807 | "metadata": { 808 | "id": "E1AC9A4F004249AAAD14813AF7AAC1BC" 809 | }, 810 | "cell_type": "markdown", 811 | "source": "What about the other languages? \n* **Python:** Offers equivalents for some these features.\n* **R:** Supports some of these features.\n* **Java:** Supports none of these features." 812 | }, 813 | { 814 | "metadata": { 815 | "id": "455487BFC1784B8292289A40B70BFAA5" 816 | }, 817 | "cell_type": "markdown", 818 | "source": "# But Scala Has Some Disadvantages...\n\nAll of the advantages discussed above make Scala code quite concise, especially compared to Java code. There are lots of nifty features available to solve particular design problems.\n\nHowever, no language is perfect. You should know about the disadvantages of Scala, too." 819 | }, 820 | { 821 | "metadata": { 822 | "id": "9EF11FF9CCDA43CE838105419BACAA0C" 823 | }, 824 | "cell_type": "markdown", 825 | "source": "Here, I'll briefly summarize some Scala and JVM issues, especially for Spark, but Dean's talk at [Strata San Jose](http://conferences.oreilly.com/strata/hadoop-big-data-ca/public/schedule/detail/47105) ([extended slides](http://deanwampler.github.io/polyglotprogramming/papers/ScalaJVMBigData-SparkLessons-extended.pdf)) goes into more details." 826 | }, 827 | { 828 | "metadata": { 829 | "id": "D8DD3106FCBF435FAC27E3D75850AA16" 830 | }, 831 | "cell_type": "markdown", 832 | "source": "## 1. Data-centric Tools and Libraries\n\nThe R and Python communities have a much wider selection of data-centric tools and libraries. Python is great for general data science. R was developed by statisticians, so it has a very rich library of statistical algorithms and rich options for charting, like [ggplot2](http://ggplot2.org/)." 833 | }, 834 | { 835 | "metadata": { 836 | "id": "E23284F1547540108DEC21B524BF8F3C" 837 | }, 838 | "cell_type": "markdown", 839 | "source": "## 2. The JVM Has Some Issues \nBig Data has pushed the limits of the JVM in interesting ways." 840 | }, 841 | { 842 | "metadata": { 843 | "id": "AA07596741F84C4DBDBD223B5D47B02C" 844 | }, 845 | "cell_type": "markdown", 846 | "source": "### 2a. Integer indexing of arrays\n\nBecause Java has _signed_ integers only and because arrays are indexed by integers instead of longs, array sizes are limited to 2 billion elements. Therefore, _byte_ arrays, which are often used for holding serialized data, are limited to 2GB. This is in an era when _terabyte_ heaps (TB) are becoming viable!\n\nThere's no real workaround when you want the efficiency of arrays, except to implement logic that can split a large object into \"chunks\" and manage them accordingly." 847 | }, 848 | { 849 | "metadata": { 850 | "id": "66A59310099B46E985F8B0761076E1E3" 851 | }, 852 | "cell_type": "markdown", 853 | "source": "### 2b. Inefficiency of the JVM Memory Model\nThe JVM has a very flexible, general-purpose model of organizing data into memory and managing garbage collection. 
However, for massive data sets of records with the same or nearly the same schema, the model is very inefficient. Spark's [Tungsten Project](https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html) is addressing this problem by introducing custom object layouts and managed memory, as well as code generation for other performance bottlenecks." 854 | }, 855 | { 856 | "metadata": { 857 | "id": "75724F7135BA48E985A5D39CB00CEB42" 858 | }, 859 | "cell_type": "markdown", 860 | "source": "Here is an example of how Java typically lays out objects in memory. Note the references to small, discontiguous chunks of memory. Now imagine billions of these little bits of memory. That means a lot of garbage to manage. Also, the discontinuities cause poor CPU cache performance." 861 | }, 862 | { 863 | "metadata": { 864 | "id": "64B3B23B30AF4BE78CF40AC489846064" 865 | }, 866 | "cell_type": "markdown", 867 | "source": "Typical Java Object Layout" 868 | }, 869 | { 870 | "metadata": { 871 | "id": "283CC816BE0B4CA280FC20E2D2A7F412" 872 | }, 873 | "cell_type": "markdown", 874 | "source": "Instead, Tungsten uses a more efficient, cache-friendly encoding in a contiguous byte array. The first few bytes are bit flags that indicate which fields, if any, are null. Then come 8 bytes per field for the non-null fields. If the field's value fits in 8 bytes (e.g., longs and doubles), the value is inlined here. Otherwise, the value holds an offset into the final section, a variable-length sequence of bytes where longer objects, like ASCII strings, are stored." 875 | }, 876 | { 877 | "metadata": { 878 | "id": "E42E5F36D8714FDBA039F6179D45DFA8" 879 | }, 880 | "cell_type": "markdown", 881 | "source": "Tungsten Object Layout" 882 | }, 883 | { 884 | "metadata": { 885 | "id": "9FFE7D0E67BF422C8D0995383E1F23BF" 886 | }, 887 | "cell_type": "markdown", 888 | "source": "## 3. Scala REPL Weirdness\n\nThe way the Scala REPL (interpreter) compiles code leads to memory leaks, which cause problems when working with big data sets and long sessions. Imagine you write the following code in the REPL:\n```scala\nscala> val massiveArray = get8GBarray(...)\nscala> // do some work\nscala> val massiveArray = getDifferent8GBarray(...)\n```\nYou might think that the first \"8GB array\" will be nicely garbage collected when you redefine `massiveArray`. Not so. Here's a simplified view of the code the REPL generates for the _last_ line to pass to the compiler:\n\n```scala\nclass LineN {\n class LineN_minus_1 {\n class LineN_minus_2 {\n ...\n class Line1 {\n val massiveArray = get8GBarray(...)\n }\n ...\n }\n }\n val massiveArray = getDifferent8GBarray(...)\n}\n```\nWhy? The JVM expects classes to be compiled into byte code, so the REPL synthesizes classes for each line you evaluate (or group of lines when you use the `:paste ... ^D` feature).\n\nNote that the new `massiveArray` shadows the original one, which is the trick the REPL uses to let you redefine variables, which the compiler would otherwise prohibit. Unfortunately, that leaves the shadowed reference attached to the old data, so it can't be garbage collected, even though the REPL provides no way to ever refer to it again!"
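}, { "metadata": { "id": "1C2D3E4F5A6B47C8890A1B2C3D4E5F60" }, "cell_type": "markdown", "source": "A possible workaround (a rough sketch; the exact behavior depends on the Scala REPL version, and `bigBuffer` and `summarize` are just placeholder names): hold a large intermediate result in a `var` and overwrite it, because assignment updates the field generated for the original definition instead of shadowing it with a new one, or keep large temporaries local to a function so they never become fields of the generated line classes at all.\n```scala\n// Reassigning a var updates the originally generated field, so the previous\n// array becomes unreachable and can be garbage collected.\nvar bigBuffer: Array[Byte] = new Array[Byte](256 * 1024 * 1024)\n// ... do some work ...\nbigBuffer = new Array[Byte](256 * 1024 * 1024) // the old buffer is now garbage\n\n// Alternatively, keep large temporaries inside a function; locals never become\n// fields of the REPL's line classes, so they are collectible after the call returns.\ndef summarize(): Long = {\n  val tmp = new Array[Byte](256 * 1024 * 1024)\n  tmp.length.toLong\n}\n```"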
889 | } 890 | ], 891 | "nbformat": 4 892 | } -------------------------------------------------------------------------------- /notebooks/images/JavaMemory.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kensuio-oss/scala-for-data-science/88a896f4ce4b2584a3b0b5f49a53bc2f861982ef/notebooks/images/JavaMemory.jpg -------------------------------------------------------------------------------- /notebooks/images/TungstenMemory.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kensuio-oss/scala-for-data-science/88a896f4ce4b2584a3b0b5f49a53bc2f861982ef/notebooks/images/TungstenMemory.jpg --------------------------------------------------------------------------------