├── .gitignore ├── LICENSE ├── README.md ├── chapter-13 ├── flatmap_groups_with_state.snb.ipynb └── map_groups_with_state.snb.ipynb ├── chapter-15 ├── occupancy_detection_model.snb.ipynb └── occupancy_streaming_prediction.snb.ipynb ├── chapter-16 └── simple_network_stream_example.repl ├── chapter-23 ├── TrackStateByKey_1.snb ├── better-joins-streaming-data.snb ├── enriching-streaming-data.snb ├── reference-data-generator.snb ├── refresh-reference-data-streaming.snb └── streaming-data-to-parquet.snb ├── chapter-25 └── streaming-listener-example.snb.ipynb ├── chapter-27 └── counting-unique-users.snb ├── chapter-7 ├── Structured-Streaming-in-action.snb.ipynb ├── batch_weblogs.snb.ipynb ├── streaming_weblogs.snb.ipynb └── weblogs_TCP_server.snb.ipynb ├── chapter-9 ├── Structured-Streaming-in-action.snb.ipynb ├── kafka-sensor-data-generator.snb.ipynb └── reference-data-generator.snb ├── datasets └── NASA-weblogs │ └── nasa_dataset_july_1995.tgz └── extras-twitter ├── LearningStreaming-GeoTwitter.snb └── LearningStreaming.snb /.gitignore: -------------------------------------------------------------------------------- 1 | *.class 2 | *.log 3 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. 
For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "{}" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright {yyyy} {name of copyright owner} 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Notebooks 2 | This repository contains interactive notebooks that support the code materials found in the book. 3 | Notebooks provide us with an interactive environment where we can directly execute code and inspect the result of our operations. 4 | 5 | --- 6 | **NOTE** 7 | 8 | We are aware of an issue with the Spark Notebook not correctly loading the Kafka dependencies. 
9 | We are working on a solution for this. 10 | 11 | --- 12 | 13 | ## Using the Spark Notebook 14 | The Spark Notebook is a reactive notebook implementation based on the IPython UI. 15 | It provides dedicated support for Spark using Scala, making it an easy-to-use choice to work with Spark and Scala. 16 | 17 | To install the Spark Notebook to use with the notebooks in this repository, download a `master` version of the notebook from the distribution site: http://spark-notebook.io/ 18 | Clone this repository somewhere on your Linux or Mac machine (Windows support is precarious; when using Windows, we recommend using a Linux VM instead). 19 | With this repository cloned, set an environment variable to point to it: 20 | ``` 21 | export NOTEBOOKS_DIR=`pwd`/notebooks 22 | ``` 23 | and start the Spark Notebook: 24 | ``` 25 | cd <spark-notebook-installation-dir> 26 | ./bin/spark-notebook 27 | ``` 28 | 29 | If you need to ask a question or require additional support, please open an [issue](https://github.com/stream-processing-with-spark/notebooks/issues) in this project. 30 | -------------------------------------------------------------------------------- /chapter-15/occupancy_detection_model.snb.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata" : { 3 | "id" : "f55628a5-d546-45fa-b827-56d8e0317801", 4 | "name" : "occupancy_detection_model", 5 | "user_save_timestamp" : "1970-01-01T01:00:00.000Z", 6 | "auto_save_timestamp" : "1970-01-01T01:00:00.000Z", 7 | "language_info" : { 8 | "name" : "scala", 9 | "file_extension" : "scala", 10 | "codemirror_mode" : "text/x-scala" 11 | }, 12 | "trusted" : true, 13 | "sparkNotebook" : null, 14 | "customLocalRepo" : null, 15 | "customRepos" : null, 16 | "customDeps" : null, 17 | "customImports" : null, 18 | "customArgs" : null, 19 | "customSparkConf" : null, 20 | "customVars" : null 21 | }, 22 | "cells" : [ { 23 | "metadata" : { 24 | "id" : "F7BA3781875B40E98F4E3031A06C4EA4" 25 | }, 26 | "cell_type" : "markdown", 27 | "source" : "# Inference of Room Occupancy using Environment Sensors\n## Part I. Training a Model" 28 | }, { 29 | "metadata" : { 30 | "id" : "A4B2A3EAD73342F89B3D44BA0FFF8C2C" 31 | }, 32 | "cell_type" : "markdown", 33 | "source" : "In this notebook, we are going to explore the use of machine learning techniques \nto estimate the occupancy of rooms based on environment sensors present in the rooms.\n\nFor this exercise we are going to use the dataset available at: https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+\n\nFirst, we will build a model that learns the relationships between the environment factors and the occupancy state by using a dataset of labeled observations.\nThis is also known as _supervised learning_.\n\nOur dataset consists of timestamped records with the _humidity_, _light levels_, _temperature_ and _CO2 levels_. \nOur training dataset also contains a label that indicates whether the room was occupied or not at the moment of the measurements." 34 | }, { 35 | "metadata" : { 36 | "id" : "BE9D03E4D6474A7C842A164ECE2944B5" 37 | }, 38 | "cell_type" : "markdown", 39 | "source" : "### Loading and Parsing the Data" 40 | }, { 41 | "metadata" : { 42 | "id" : "B02DCE024FC4489691085544F9435DBA" 43 | }, 44 | "cell_type" : "markdown", 45 | "source" : "In preparation for using this notebook, download the zip file containing the dataset to a local folder on your machine. \n\nWe will call this folder `dataDir` in the notebook." 
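A minimal, optional sketch of that preparation step (not part of the original notebook), assuming a Unix-like machine with `wget` and `unzip` on the PATH; the direct zip link is not reproduced here and should be taken from the UCI dataset page cited above:
```
// Hypothetical helper: fetch and unpack the occupancy dataset into the folder
// that the next cell refers to as dataDir. datasetZipUrl is a placeholder.
import sys.process._
val dataDir = "/tmp/data"
val datasetZipUrl = "<zip link from the UCI Occupancy Detection page>"
Seq("mkdir", "-p", dataDir).!
Seq("wget", "-qO", s"$dataDir/occupancy_data.zip", datasetZipUrl).!
Seq("unzip", "-o", s"$dataDir/occupancy_data.zip", "-d", dataDir).!
```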
46 | }, { 47 | "metadata" : { 48 | "trusted" : true, 49 | "input_collapsed" : false, 50 | "collapsed" : false, 51 | "id" : "007E35315190479B84C2C284D4A9EFE8" 52 | }, 53 | "cell_type" : "code", 54 | "source" : [ "val baseDir = \"/tmp\" // Change this to an appropriate location\n", "val dataDir = s\"$baseDir/data\"\n", "val modelDir = s\"$baseDir/model\"\n", "val modelFile = s\"$modelDir/occupancy-lg.model\"" ], 55 | "outputs" : [ { 56 | "name" : "stdout", 57 | "output_type" : "stream", 58 | "text" : "baseDir: String = /tmp\ndataDir: String = /tmp/data\nmodelDir: String = /tmp/model\nmodelFile: String = /tmp/model/occupancy-lg.model\n" 59 | }, { 60 | "metadata" : { }, 61 | "data" : { 62 | "text/html" : "" 63 | }, 64 | "output_type" : "execute_result", 65 | "execution_count" : 1, 66 | "time" : "Took: 1.209s, at 2018-08-29 14:43" 67 | } ] 68 | }, { 69 | "metadata" : { 70 | "id" : "6466825D76E54E8281F5D1F7C3DEEA22" 71 | }, 72 | "cell_type" : "markdown", 73 | "source" : "By observing the first lines of the data, we can appreciate that it is in CSV format and has a header. We can use the CSV reader to load the data.\n```\n\"id\",\"date\",\"Temperature\",\"Humidity\",\"Light\",\"CO2\",\"HumidityRatio\",\"Occupancy\"\n\"1\",\"2015-02-04 17:51:00\",23.18,27.272,426,721.25,0.00479298817650529,1\n```\n(Note that the original dataset misses the \"id\" field in the header. To make the process easier, edit the file to add `<\"id\",>` at the beginning)" 74 | }, { 75 | "metadata" : { 76 | "trusted" : true, 77 | "input_collapsed" : false, 78 | "collapsed" : false, 79 | "id" : "133729DDD9CA44D98B2406F157B10BCD" 80 | }, 81 | "cell_type" : "code", 82 | "source" : [ "val sensorData = sparkSession.read\n", " .option(\"header\",true)\n", " .option(\"inferSchema\", true)\n", " .csv(s\"$dataDir/datatraining.txt\")" ], 83 | "outputs" : [ { 84 | "name" : "stdout", 85 | "output_type" : "stream", 86 | "text" : "sensorData: org.apache.spark.sql.DataFrame = [id: int, date: timestamp ... 6 more fields]\n" 87 | }, { 88 | "metadata" : { }, 89 | "data" : { 90 | "text/html" : "" 91 | }, 92 | "output_type" : "execute_result", 93 | "execution_count" : 1, 94 | "time" : "Took: 1.044s, at 2018-08-29 14:45" 95 | } ] 96 | }, { 97 | "metadata" : { 98 | "trusted" : true, 99 | "input_collapsed" : false, 100 | "collapsed" : false, 101 | "id" : "23F385965316440F8CE52ECB7F7D4953" 102 | }, 103 | "cell_type" : "code", 104 | "source" : [ "// check that the inferred schema corresponds to the expected types\n", "sensorData.printSchema" ], 105 | "outputs" : [ { 106 | "name" : "stdout", 107 | "output_type" : "stream", 108 | "text" : "root\n |-- id: integer (nullable = true)\n |-- date: timestamp (nullable = true)\n |-- Temperature: double (nullable = true)\n |-- Humidity: double (nullable = true)\n |-- Light: double (nullable = true)\n |-- CO2: double (nullable = true)\n |-- HumidityRatio: double (nullable = true)\n |-- Occupancy: integer (nullable = true)\n\n" 109 | }, { 110 | "metadata" : { }, 111 | "data" : { 112 | "text/html" : "" 113 | }, 114 | "output_type" : "execute_result", 115 | "execution_count" : 2, 116 | "time" : "Took: 0.962s, at 2018-08-29 14:45" 117 | } ] 118 | }, { 119 | "metadata" : { 120 | "id" : "D3B58EDDA35C494D92EF75DB1C1D5A67" 121 | }, 122 | "cell_type" : "markdown", 123 | "source" : "## Building a Logistic Regression Model" 124 | }, { 125 | "metadata" : { 126 | "id" : "B1425DB6CAC946759EEDEE4EBFA4A1D2" 127 | }, 128 | "cell_type" : "markdown", 129 | "source" : "To train our model, we are going to build a ML Pipeline. 
" 130 | }, { 131 | "metadata" : { 132 | "trusted" : true, 133 | "input_collapsed" : false, 134 | "collapsed" : false, 135 | "id" : "B4F822F660E24517878AE2C1007B1EB1" 136 | }, 137 | "cell_type" : "code", 138 | "source" : [ "import org.apache.spark.ml.Pipeline\n", "import org.apache.spark.ml.classification.LogisticRegression\n", "val lr = new LogisticRegression()\n", " .setMaxIter(10)\n", " .setRegParam(0.1)\n", " .setElasticNetParam(0.8)\n", " \n" ], 139 | "outputs" : [ { 140 | "name" : "stdout", 141 | "output_type" : "stream", 142 | "text" : "import org.apache.spark.ml.Pipeline\nimport org.apache.spark.ml.classification.LogisticRegression\nlr: org.apache.spark.ml.classification.LogisticRegression = logreg_9b02575b4033\n" 143 | }, { 144 | "metadata" : { }, 145 | "data" : { 146 | "text/html" : "" 147 | }, 148 | "output_type" : "execute_result", 149 | "execution_count" : 3, 150 | "time" : "Took: 0.721s, at 2018-08-29 14:45" 151 | } ] 152 | }, { 153 | "metadata" : { 154 | "trusted" : true, 155 | "input_collapsed" : false, 156 | "collapsed" : false, 157 | "id" : "D7BB26D91E7F45AB83800603786814A5" 158 | }, 159 | "cell_type" : "code", 160 | "source" : [ "import org.apache.spark.ml.feature.VectorAssembler\n", "val assembler = new VectorAssembler()\n", " .setInputCols(Array(\"Temperature\", \"Humidity\", \"Light\", \"CO2\", \"HumidityRatio\"))\n", "//.setInputCols(Array(\"Temperature\", \"Humidity\", \"Light\", \"CO2\", \"HumidityRatio\"))\n", " .setOutputCol(\"features\")" ], 161 | "outputs" : [ { 162 | "name" : "stdout", 163 | "output_type" : "stream", 164 | "text" : "import org.apache.spark.ml.feature.VectorAssembler\nassembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_530fd58b21e5\n" 165 | }, { 166 | "metadata" : { }, 167 | "data" : { 168 | "text/html" : "" 169 | }, 170 | "output_type" : "execute_result", 171 | "execution_count" : 4, 172 | "time" : "Took: 0.671s, at 2018-08-29 14:45" 173 | } ] 174 | }, { 175 | "metadata" : { 176 | "trusted" : true, 177 | "input_collapsed" : false, 178 | "collapsed" : false, 179 | "id" : "6640A8AE3C2B47CAA7A4B20DFA64F16C" 180 | }, 181 | "cell_type" : "code", 182 | "source" : [ "val labeledData = sensorData.withColumn(\"label\", $\"Occupancy\".cast(\"Double\"))" ], 183 | "outputs" : [ { 184 | "name" : "stdout", 185 | "output_type" : "stream", 186 | "text" : "labeledData: org.apache.spark.sql.DataFrame = [id: int, date: timestamp ... 
7 more fields]\n" 187 | }, { 188 | "metadata" : { }, 189 | "data" : { 190 | "text/html" : "" 191 | }, 192 | "output_type" : "execute_result", 193 | "execution_count" : 5, 194 | "time" : "Took: 0.680s, at 2018-08-29 14:45" 195 | } ] 196 | }, { 197 | "metadata" : { 198 | "trusted" : true, 199 | "input_collapsed" : false, 200 | "collapsed" : false, 201 | "id" : "91E8109D6FCE43A58F7C67980E409D01" 202 | }, 203 | "cell_type" : "code", 204 | "source" : [ "labeledData.printSchema" ], 205 | "outputs" : [ { 206 | "name" : "stdout", 207 | "output_type" : "stream", 208 | "text" : "root\n |-- id: integer (nullable = true)\n |-- date: timestamp (nullable = true)\n |-- Temperature: double (nullable = true)\n |-- Humidity: double (nullable = true)\n |-- Light: double (nullable = true)\n |-- CO2: double (nullable = true)\n |-- HumidityRatio: double (nullable = true)\n |-- Occupancy: integer (nullable = true)\n |-- label: double (nullable = true)\n\n" 209 | }, { 210 | "metadata" : { }, 211 | "data" : { 212 | "text/html" : "" 213 | }, 214 | "output_type" : "execute_result", 215 | "execution_count" : 6, 216 | "time" : "Took: 0.804s, at 2018-08-29 14:45" 217 | } ] 218 | }, { 219 | "metadata" : { 220 | "id" : "862B5C38FD0D4FAAACEC27D580DE5A36" 221 | }, 222 | "cell_type" : "markdown", 223 | "source" : "We define the Pipeline as a sequence of stages. \nIn our case, the assembler, which brings the features together into a `Vector` and the parameterized _Logistic Regression_ `Estimator` that we instantiated earlier." 224 | }, { 225 | "metadata" : { 226 | "trusted" : true, 227 | "input_collapsed" : false, 228 | "collapsed" : false, 229 | "id" : "EC24EE52015D490EB6C6FB84AF16E73E" 230 | }, 231 | "cell_type" : "code", 232 | "source" : [ "import org.apache.spark.ml.Pipeline\n", "val pipeline = new Pipeline().setStages(Array(assembler, lr))" ], 233 | "outputs" : [ { 234 | "name" : "stdout", 235 | "output_type" : "stream", 236 | "text" : "import org.apache.spark.ml.Pipeline\npipeline: org.apache.spark.ml.Pipeline = pipeline_026fbfe8ca4c\n" 237 | }, { 238 | "metadata" : { }, 239 | "data" : { 240 | "text/html" : "" 241 | }, 242 | "output_type" : "execute_result", 243 | "execution_count" : 7, 244 | "time" : "Took: 0.769s, at 2018-08-29 14:45" 245 | } ] 246 | }, { 247 | "metadata" : { 248 | "id" : "79C81BC30E6340B8A51CFE994EEF54E6" 249 | }, 250 | "cell_type" : "markdown", 251 | "source" : "The `fit` method in a `Pipeline` lets us train the model on a dataset and produces\na `model` that we can use to make predictions on new data." 
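As a complement to the save step at the end of this notebook, here is a brief hedged sketch of how the persisted pipeline could later be reloaded and applied; it assumes the `modelFile` path and the `labeledData` DataFrame defined in the cells above.
```
import org.apache.spark.ml.PipelineModel

// Sketch only: reload the pipeline persisted with model.write.overwrite.save(modelFile)
// and apply it to any DataFrame that carries the same feature columns.
val reloadedModel = PipelineModel.load(modelFile)
val rescored = reloadedModel.transform(labeledData)
rescored.select("Occupancy", "prediction").show(5)
```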
252 | }, { 253 | "metadata" : { 254 | "trusted" : true, 255 | "input_collapsed" : false, 256 | "collapsed" : false, 257 | "id" : "025AA6212956453F8A152C6F84A8BA26" 258 | }, 259 | "cell_type" : "code", 260 | "source" : [ "val model = pipeline.fit(labeledData)" ], 261 | "outputs" : [ { 262 | "name" : "stdout", 263 | "output_type" : "stream", 264 | "text" : "model: org.apache.spark.ml.PipelineModel = pipeline_026fbfe8ca4c\n" 265 | }, { 266 | "metadata" : { }, 267 | "data" : { 268 | "text/html" : "" 269 | }, 270 | "output_type" : "execute_result", 271 | "execution_count" : 8, 272 | "time" : "Took: 2.919s, at 2018-08-29 14:45" 273 | } ] 274 | }, { 275 | "metadata" : { 276 | "id" : "B9EF97F9597949848CA59694AD6CC0D4" 277 | }, 278 | "cell_type" : "markdown", 279 | "source" : "## Validating the Model" 280 | }, { 281 | "metadata" : { 282 | "id" : "E79AFD7A9F9A47FB920997F66618C6A8" 283 | }, 284 | "cell_type" : "markdown", 285 | "source" : "To validate the model, we use data for which we know the expected outcome.\n\nThat way, we can compare the real with the predicted value and evaluate how well our model is performing." 286 | }, { 287 | "metadata" : { 288 | "trusted" : true, 289 | "input_collapsed" : false, 290 | "collapsed" : false, 291 | "id" : "BD841317241C49EE89AF15B50F4D272B" 292 | }, 293 | "cell_type" : "code", 294 | "source" : [ "val testData = sparkSession.read\n", " .option(\"header\",true)\n", " .option(\"inferSchema\", true)\n", " .csv(s\"$dataDir/datatest.txt\")\n", " .withColumn(\"label\", $\"Occupancy\".cast(\"Double\"))" ], 295 | "outputs" : [ { 296 | "name" : "stdout", 297 | "output_type" : "stream", 298 | "text" : "testData: org.apache.spark.sql.DataFrame = [id: int, date: timestamp ... 7 more fields]\n" 299 | }, { 300 | "metadata" : { }, 301 | "data" : { 302 | "text/html" : "" 303 | }, 304 | "output_type" : "execute_result", 305 | "execution_count" : 9, 306 | "time" : "Took: 0.931s, at 2018-08-29 14:45" 307 | } ] 308 | }, { 309 | "metadata" : { 310 | "trusted" : true, 311 | "input_collapsed" : false, 312 | "collapsed" : false, 313 | "id" : "440B4A22231341DFAB245B14BCD26B3C" 314 | }, 315 | "cell_type" : "code", 316 | "source" : [ "val predictions = model.transform(testData)" ], 317 | "outputs" : [ { 318 | "name" : "stdout", 319 | "output_type" : "stream", 320 | "text" : "predictions: org.apache.spark.sql.DataFrame = [id: int, date: timestamp ... 
11 more fields]\n" 321 | }, { 322 | "metadata" : { }, 323 | "data" : { 324 | "text/html" : "" 325 | }, 326 | "output_type" : "execute_result", 327 | "execution_count" : 10, 328 | "time" : "Took: 0.713s, at 2018-08-29 14:45" 329 | } ] 330 | }, { 331 | "metadata" : { 332 | "trusted" : true, 333 | "input_collapsed" : false, 334 | "collapsed" : false, 335 | "id" : "A27E17D4A1104629877D5B98B562F75B" 336 | }, 337 | "cell_type" : "code", 338 | "source" : [ "predictions.select($\"Occupancy\", $\"rawPrediction\",$\"probability\", $\"prediction\").show(10, truncate=false )" ], 339 | "outputs" : [ { 340 | "name" : "stdout", 341 | "output_type" : "stream", 342 | "text" : "+---------+----------------------------------------+----------------------------------------+----------+\n|Occupancy|rawPrediction |probability |prediction|\n+---------+----------------------------------------+----------------------------------------+----------+\n|1 |[-1.165992741474785,1.165992741474785] |[0.2375800787231101,0.76241992127689] |1.0 |\n|1 |[-1.134749462216978,1.134749462216978] |[0.24328566985645464,0.7567143301435453]|1.0 |\n|1 |[-1.1084831249923008,1.1084831249923008]|[0.24815378909083458,0.7518462109091655]|1.0 |\n|1 |[-0.5854424488314132,0.5854424488314132]|[0.35768125007101514,0.6423187499289849]|1.0 |\n|1 |[-0.5536855193052468,0.5536855193052468]|[0.3650097638103235,0.6349902361896765] |1.0 |\n|1 |[-1.1082268289859558,1.1082268289859558]|[0.24820161021665613,0.7517983897833439]|1.0 |\n|1 |[-0.9054518379194303,0.9054518379194303]|[0.2879314327729891,0.712068567227011] |1.0 |\n|1 |[-0.7174372358135888,0.7174372358135888]|[0.3279575705150612,0.6720424294849388] |1.0 |\n|1 |[-0.5043690269606838,0.5043690269606838]|[0.37651448191486914,0.6234855180851309]|1.0 |\n|1 |[-0.7436762950278142,0.7436762950278142]|[0.32220076291467276,0.6777992370853272]|1.0 |\n+---------+----------------------------------------+----------------------------------------+----------+\nonly showing top 10 rows\n\n" 343 | }, { 344 | "metadata" : { }, 345 | "data" : { 346 | "text/html" : "" 347 | }, 348 | "output_type" : "execute_result", 349 | "execution_count" : 11, 350 | "time" : "Took: 1.169s, at 2018-08-29 14:45" 351 | } ] 352 | }, { 353 | "metadata" : { 354 | "id" : "6F60600DC69541088F3C957C97672F0A" 355 | }, 356 | "cell_type" : "markdown", 357 | "source" : "### Model Evaluation" 358 | }, { 359 | "metadata" : { 360 | "trusted" : true, 361 | "input_collapsed" : false, 362 | "collapsed" : false, 363 | "id" : "3E955E8AF73C45E8B345333233247091" 364 | }, 365 | "cell_type" : "code", 366 | "source" : [ "import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator\n", "val evaluator = new BinaryClassificationEvaluator()\n", " .setLabelCol(\"label\")\n", " .setRawPredictionCol(\"rawPrediction\")\n", " .setMetricName(\"areaUnderROC\")\n", "// Evaluates predictions and returns AUC (Area Under ROC Curve - larger is better, 1 is perfect).\n", "val accuracy = evaluator.evaluate(predictions)" ], 367 | "outputs" : [ { 368 | "name" : "stdout", 369 | "output_type" : "stream", 370 | "text" : "import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator\nevaluator: org.apache.spark.ml.evaluation.BinaryClassificationEvaluator = binEval_4164a6bf8fcc\naccuracy: Double = 0.9917154635767224\n" 371 | }, { 372 | "metadata" : { }, 373 | "data" : { 374 | "text/html" : "" 375 | }, 376 | "output_type" : "execute_result", 377 | "execution_count" : 12, 378 | "time" : "Took: 1.289s, at 2018-08-29 14:45" 379 | } ] 380 | }, { 381 | "metadata" : { 382 | "id" : 
"39829315F4684BB78F2BD76FA0B5FEAF" 383 | }, 384 | "cell_type" : "markdown", 385 | "source" : "## Store the model for later use\nWe store the trained model on disk.\nIt can be read back from disk and applied at a later stage." 386 | }, { 387 | "metadata" : { 388 | "trusted" : true, 389 | "input_collapsed" : false, 390 | "collapsed" : false, 391 | "id" : "A6CCE6F6B7144B2B8889DE735ECDF5B5" 392 | }, 393 | "cell_type" : "code", 394 | "source" : [ "model.write.overwrite.save(modelFile)" ], 395 | "outputs" : [ { 396 | "metadata" : { }, 397 | "data" : { 398 | "text/html" : "" 399 | }, 400 | "output_type" : "execute_result", 401 | "execution_count" : 13, 402 | "time" : "Took: 2.136s, at 2018-08-29 14:46" 403 | } ] 404 | }, { 405 | "metadata" : { 406 | "trusted" : true, 407 | "input_collapsed" : false, 408 | "collapsed" : true, 409 | "id" : "09BCE508408A43D1873E3259DCC1167F" 410 | }, 411 | "cell_type" : "code", 412 | "source" : [ "" ], 413 | "outputs" : [ ] 414 | } ], 415 | "nbformat" : 4 416 | } -------------------------------------------------------------------------------- /chapter-16/simple_network_stream_example.repl: -------------------------------------------------------------------------------- 1 | import org.apache.spark.streaming._ 2 | val ssc = new StreamingContext(sc, Seconds(2)) 3 | val dstream = ssc.socketTextStream("localhost", 8088) 4 | val countStream = dstream.count() 5 | countStream.print() 6 | ssc.start() -------------------------------------------------------------------------------- /chapter-23/better-joins-streaming-data.snb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "id": "057c00fd-2c79-4279-aba7-9fef1db5d27d", 4 | "name": "better-joins-streaming-data", 5 | "user_save_timestamp": "1970-01-01T01:00:00.000Z", 6 | "auto_save_timestamp": "1970-01-01T01:00:00.000Z", 7 | "language_info": { 8 | "name": "scala", 9 | "file_extension": "scala", 10 | "codemirror_mode": "text/x-scala" 11 | }, 12 | "trusted": true, 13 | "sparkNotebook": null, 14 | "customLocalRepo": null, 15 | "customRepos": null, 16 | "customDeps": null, 17 | "customImports": null, 18 | "customArgs": null, 19 | "customSparkConf": null, 20 | "customVars": null 21 | }, 22 | "cells": [ 23 | { 24 | "metadata": { 25 | "id": "40AAFF23B9654D2A9B2E0C59515BAF8A" 26 | }, 27 | "cell_type": "markdown", 28 | "source": "# Improving Joins with Streaming Data" 29 | }, 30 | { 31 | "metadata": { 32 | "id": "D5BA242BD89A48F3816A0D7F53206849" 33 | }, 34 | "cell_type": "markdown", 35 | "source": "In this notebook improves on the previous exercise where we started using joins:\n\nWe have a process that is able to generate random data and we created a streaming job that is able to consume, \nparse and save this data to a parquet file.\n\nWe are now interested in loading a parquet file containig the configuration for each sensor in our system and use that data to enrich the incoming sensor data. We would also like to avoid losing records for which we don't have a registered id.\n\nOn top of that, given that the amount of sensors known in our reference file is quite limited, we can improve performance by hinting Spark to use a broadcast join instead of the heavier shuffle-join." 
36 | }, 37 | { 38 | "metadata": { 39 | "id": "79CD566650664F04A8515F55A809A44D" 40 | }, 41 | "cell_type": "markdown", 42 | "source": "## Our Streaming dataset will consist of sensor information, containing the sensorId, a timestamp, and a value.\nFor the sake of simplicity in this self-contained example, we are going to generate a randomized dataset, using an scenario that simulates a real IoT use case.\nThe timestamp will be the time of execution and each record will be formatted as a string coming from \"the field\" of comma separated values.\n\nWe also add a bit of real-world chaos to the data: Due to weather conditions, some sensors publish corrupt data. " 43 | }, 44 | { 45 | "metadata": { 46 | "trusted": true, 47 | "input_collapsed": false, 48 | "collapsed": false, 49 | "id": "1BA7507DA513491F8D3A1AED38F087CB" 50 | }, 51 | "cell_type": "code", 52 | "source": "val sensorCount = 100000\nval unknownSensors = 1000\nval workDir = \"/tmp/learningsparkstreaming/\"\nval referenceFile = \"sensor-records.parquet\"\nval targetFile = \"enrichedIoTStream.parquet\"\nval unknownSensorsTargetFile = \"unknownSensorsStream.parquet\"", 53 | "outputs": [] 54 | }, 55 | { 56 | "metadata": { 57 | "trusted": true, 58 | "input_collapsed": false, 59 | "collapsed": false, 60 | "id": "9CA4FD2CB34F47AF8097D5E2A2BF97DA" 61 | }, 62 | "cell_type": "code", 63 | "source": "import scala.util.Random\nval sensorId: () => Int = () => Random.nextInt(sensorCount+unknownSensors)\nval data: () => Double = () => Random.nextDouble\nval timestamp: () => Long = () => System.currentTimeMillis\nval recordFunction: () => String = { () => \n if (Random.nextDouble < 0.9) {\n Seq(sensorId().toString, timestamp(), data()).mkString(\",\")\n } else {\n \"!!~corrupt~^&##$\" \n }\n }", 64 | "outputs": [] 65 | }, 66 | { 67 | "metadata": { 68 | "id": "5806B402E3314F15A0B21F46BC97FF4A" 69 | }, 70 | "cell_type": "markdown", 71 | "source": "### We use a particular trick that requires a moment of attention\nInstead of creating an RDD of text records, we create an RDD of record-generating functions. \nThen, each time the RDD is evaluated, the record function will generate a new random record. \nThis way we can simulate a realistic load of data that delivers a different set on each batch." 
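A tiny standalone illustration of that trick (not part of the notebook; the variable names are ours): because the derived RDD is recomputed on every evaluation, each pass invokes the generator functions again and produces fresh values.
```
import scala.util.Random

// Illustration only: an RDD whose elements are functions. Each evaluation of
// the mapped RDD calls the functions again, so successive passes yield
// different random values (as long as the RDD is not cached).
val generators = sparkContext.parallelize(1 to 5).map(_ => () => Random.nextInt(100))
val firstPass  = generators.map(gen => gen()).collect()
val secondPass = generators.map(gen => gen()).collect()  // generally differs from firstPass
```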
72 | }, 73 | { 74 | "metadata": { 75 | "trusted": true, 76 | "input_collapsed": false, 77 | "collapsed": false, 78 | "id": "A9AADE76D9DA471B9708B80B3E765890" 79 | }, 80 | "cell_type": "code", 81 | "source": "val sensorDataGenerator = sparkContext.parallelize(1 to 100).map(_ => recordFunction)\nval sensorData = sensorDataGenerator.map(recordFun => recordFun())", 82 | "outputs": [] 83 | }, 84 | { 85 | "metadata": { 86 | "id": "62A5D12E89B24EFB8C411495942032B1" 87 | }, 88 | "cell_type": "markdown", 89 | "source": "# Load the reference data from a parquet file\nWe also cache the data to keep it in memory and improve the performance of our steaming application" 90 | }, 91 | { 92 | "metadata": { 93 | "trusted": true, 94 | "input_collapsed": false, 95 | "collapsed": false, 96 | "id": "78E6ABA2D88B47F38905441B7D79E65D" 97 | }, 98 | "cell_type": "code", 99 | "source": "val sensorRef = sparkSession.read.parquet(s\"$workDir/$referenceFile\")\nsensorRef.cache()", 100 | "outputs": [] 101 | }, 102 | { 103 | "metadata": { 104 | "id": "9CBAE4C0C1EB48DBB802A0CC2960B2DB" 105 | }, 106 | "cell_type": "markdown", 107 | "source": "(Parquet files preserve the schema information, which we can retrieve from the DataFrame)" 108 | }, 109 | { 110 | "metadata": { 111 | "trusted": true, 112 | "input_collapsed": false, 113 | "collapsed": false, 114 | "presentation": { 115 | "tabs_state": "{\n \"tab_id\": \"#tab886805482-0\"\n}", 116 | "pivot_chart_state": "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [],\n \"rows\": [],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}" 117 | }, 118 | "id": "774E09B823774494979FC8FE6AE3078F" 119 | }, 120 | "cell_type": "code", 121 | "source": "sensorRef.schema", 122 | "outputs": [] 123 | }, 124 | { 125 | "metadata": { 126 | "id": "BFF9E80BAB5C4D468154BCD6FB5C6501" 127 | }, 128 | "cell_type": "markdown", 129 | "source": "## We create our Streaming Context" 130 | }, 131 | { 132 | "metadata": { 133 | "trusted": true, 134 | "input_collapsed": false, 135 | "collapsed": false, 136 | "id": "83B2B58C552240BBA6FF518B1AD274EB" 137 | }, 138 | "cell_type": "code", 139 | "source": "import org.apache.spark.streaming.StreamingContext\nimport org.apache.spark.streaming.Seconds\n\nval streamingContext = new StreamingContext(sparkContext, Seconds(2))", 140 | "outputs": [] 141 | }, 142 | { 143 | "metadata": { 144 | "id": "4E6DBF82BB6F4B6584233CD460F67263" 145 | }, 146 | "cell_type": "markdown", 147 | "source": "## Our stream source will be a ConstantInputDStream fed by the record-generating RDD.\nBy combining a constant input DStream with the record generating RDD, we create a changing stream of data to process in our example.\n(This method makes the example self-contained. 
It removes the need of an external stream generating process)\n" 148 | }, 149 | { 150 | "metadata": { 151 | "trusted": true, 152 | "input_collapsed": false, 153 | "collapsed": false, 154 | "id": "DF03F66BDDE0447B8202D39F2C0202E2" 155 | }, 156 | "cell_type": "code", 157 | "source": "import org.apache.spark.streaming.dstream.ConstantInputDStream\nval rawDStream = new ConstantInputDStream(streamingContext, sensorData)\n", 158 | "outputs": [] 159 | }, 160 | { 161 | "metadata": { 162 | "id": "CCCB597031E7451FB59D18BA85C0E4A4" 163 | }, 164 | "cell_type": "markdown", 165 | "source": "# Providing Schema information for our streaming data\nNow that we have a DStream of fresh data processed in a 2-second interval, we can start focusing on the gist of this example.\nFirst, we want to define and apply a schema to the data we are receiving.\nIn Scala, we can define a schema with a `case class`" 166 | }, 167 | { 168 | "metadata": { 169 | "trusted": true, 170 | "input_collapsed": false, 171 | "collapsed": false, 172 | "id": "E7A917C393654969812E6E38223BBA52" 173 | }, 174 | "cell_type": "code", 175 | "source": "case class SensorData(sensorId: Int, timestamp: Long, value: Double)", 176 | "outputs": [] 177 | }, 178 | { 179 | "metadata": { 180 | "id": "9AD1ACAD450E44DA8C046EB48CD4EE5A" 181 | }, 182 | "cell_type": "markdown", 183 | "source": "Now we apply that schema to the dstream, using the `flatMap` function.\n\nWe use `flatMap` instead of a `map` because there might be cases when the incoming data is incomplete or corrupted.\nIf we would use `map`, we would have to provide a resulting value for each transformed record. \nThat is something we cannot do for invalid records.\nWith `flatMap` in combination with `Option`, we can represent valid records as `Some(recordValue)` and invalid records as `None`.\nBy the virtue of `flatMap` the internal `Option` container gets flattend and our resulting stream will only contain valid `recordValue`s.\n\nDuring the parsing of the comma separated records, we not only protect ourselves against missing fields, but also parse the numeric values to their expected types. The surrounding `Try` captures any `NumberFormatException` that might arise from invalid records." 184 | }, 185 | { 186 | "metadata": { 187 | "trusted": true, 188 | "input_collapsed": false, 189 | "collapsed": false, 190 | "id": "5285C2BBC1854F059AB8E1D0244AE1C7" 191 | }, 192 | "cell_type": "code", 193 | "source": "import scala.util.Try\nval schemaStream = rawDStream.flatMap{record => \n val fields = record.split(\",\")\n if (fields.size == 3) {\n Try (SensorData(fields(0).toInt, fields(1).toLong, fields(2).toDouble)).toOption\n } else { None }\n }", 194 | "outputs": [] 195 | }, 196 | { 197 | "metadata": { 198 | "id": "7A7C144384904E96BE66A649BD193C15" 199 | }, 200 | "cell_type": "markdown", 201 | "source": "# Enrich the streaming data, without dropping records.\nWith the schema stream in place, we can proceed to transform the underlying RDDs into DataFrames.\n\nAs in the previous notebook, we are going to use the reference data to add the specific sensor information.\nPreviously, we used the default 'join', which is an inner-join that requires the join key to be available on both sides of the join.\nThis causes us to drop all data records for which we don't know the id. 
Given that new sensors might become available or misconfigured sensors might be sending an incorrect id, we would like to preserve all records in order to reconcile them in a latter stage.\n\nAs before, we do this in the context of the general-purpose action `foreachRDD`. " 202 | }, 203 | { 204 | "metadata": { 205 | "trusted": true, 206 | "input_collapsed": false, 207 | "collapsed": false, 208 | "id": "99696609577F49DB809AF94C319CB449" 209 | }, 210 | "cell_type": "code", 211 | "source": "val stableSparkSession = sparkSession\nimport stableSparkSession.implicits._\nimport org.apache.spark.sql.SaveMode.Append\nschemaStream.foreachRDD{rdd => \n val sensorDF = rdd.toDF()\n val sensorWithInfo = sensorRef.join(broadcast(sensorDF), Seq(\"sensorId\"), \"rightouter\")\n val unknownSensors = sensorWithInfo.filter($\"sensorType\".isNull) \n val knownSensors = sensorWithInfo.filter(!$\"sensorType\".isNull) \n val denormalizedSensorData =\n knownSensors.withColumn(\"dnvalue\", $\"value\"*($\"maxRange\"-$\"minRange\")+$\"minRange\")\n val sensorRecords = denormalizedSensorData.drop(\"value\", \"maxRange\", \"minRange\")\n sensorRecords.write.format(\"parquet\").mode(Append).save(s\"$workDir/$targetFile\")\n unknownSensors.write.format(\"parquet\").mode(Append).save(s\"$workDir/$unknownSensorsTargetFile\")\n }", 212 | "outputs": [] 213 | }, 214 | { 215 | "metadata": { 216 | "trusted": true, 217 | "input_collapsed": false, 218 | "collapsed": false, 219 | "id": "F366201F2275412F818532AB671A55BC" 220 | }, 221 | "cell_type": "code", 222 | "source": "streamingContext.start()", 223 | "outputs": [] 224 | }, 225 | { 226 | "metadata": { 227 | "trusted": true, 228 | "input_collapsed": false, 229 | "collapsed": false, 230 | "id": "B6F0075E9BB04467858CABAA000489EF" 231 | }, 232 | "cell_type": "code", 233 | "source": "// Be careful not to stop the context if you want the streaming process to continue\nstreamingContext.stop(false)", 234 | "outputs": [] 235 | }, 236 | { 237 | "metadata": { 238 | "id": "5BF4B4ECDC794A769ED429A2D35B8A38" 239 | }, 240 | "cell_type": "markdown", 241 | "source": "#Inspect the result\nWe can use the current Spark Session concurrently with the running Spark Streaming job in order to inspect the resulting data.\n" 242 | }, 243 | { 244 | "metadata": { 245 | "trusted": true, 246 | "input_collapsed": false, 247 | "collapsed": false, 248 | "id": "87973510A2E544B88D0825533CB24BC5" 249 | }, 250 | "cell_type": "code", 251 | "source": "val enrichedRecords = sparkSession.read.parquet(s\"$workDir/$targetFile\")\nenrichedRecords", 252 | "outputs": [] 253 | }, 254 | { 255 | "metadata": { 256 | "trusted": true, 257 | "input_collapsed": false, 258 | "collapsed": false, 259 | "id": "9224E4A6B30943FD8AC88441973B14C9" 260 | }, 261 | "cell_type": "code", 262 | "source": "val unknownRecords = sparkSession.read.parquet(s\"$workDir/$unknownSensorsTargetFile\")\nunknownRecords.count\nunknownRecords", 263 | "outputs": [] 264 | }, 265 | { 266 | "metadata": { 267 | "trusted": true, 268 | "input_collapsed": false, 269 | "collapsed": false, 270 | "id": "03C77BDE93904F3A8BDD12B66B427E5A" 271 | }, 272 | "cell_type": "code", 273 | "source": "enrichedRecords.count", 274 | "outputs": [] 275 | }, 276 | { 277 | "metadata": { 278 | "trusted": true, 279 | "input_collapsed": false, 280 | "collapsed": false, 281 | "id": "0560260836FA43538229F37AF9503EA5" 282 | }, 283 | "cell_type": "code", 284 | "source": "unknownRecords.count", 285 | "outputs": [] 286 | } 287 | ], 288 | "nbformat": 4 289 | } 290 | 
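The notebook above deliberately preserves the unknown-sensor records for later reconciliation. As a hypothetical follow-up (not part of the original code), once the missing sensors appear in the reference file, those records could be reconciled with a plain batch join, reusing the paths defined in the notebook:
```
import org.apache.spark.sql.functions.{broadcast, col}
import org.apache.spark.sql.SaveMode.Append

// Hypothetical batch reconciliation: re-read the (updated) reference data,
// keep only the raw fields of the preserved records, join, de-normalize the
// value as in the streaming job, and append the result to the enriched dataset.
val refreshedRef = sparkSession.read.parquet(s"$workDir/$referenceFile")
val pending = sparkSession.read.parquet(s"$workDir/$unknownSensorsTargetFile")
  .select("sensorId", "timestamp", "value") // drop the null reference columns
val reconciled = pending
  .join(broadcast(refreshedRef), "sensorId")
  .withColumn("dnvalue", col("value") * (col("maxRange") - col("minRange")) + col("minRange"))
  .drop("value", "maxRange", "minRange")
reconciled.write.format("parquet").mode(Append).save(s"$workDir/$targetFile")
```
Broadcasting the reference side keeps this join shuffle-free, mirroring the broadcast hint used in the streaming job above.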
-------------------------------------------------------------------------------- /chapter-23/enriching-streaming-data.snb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata" : { 3 | "id" : "81c89972-6415-4fac-ad50-72dcfc5f7587", 4 | "name" : "enriching-streaming-data", 5 | "user_save_timestamp" : "1970-01-01T01:00:00.000Z", 6 | "auto_save_timestamp" : "1970-01-01T01:00:00.000Z", 7 | "language_info" : { 8 | "name" : "scala", 9 | "file_extension" : "scala", 10 | "codemirror_mode" : "text/x-scala" 11 | }, 12 | "trusted" : true, 13 | "sparkNotebook" : null, 14 | "customLocalRepo" : null, 15 | "customRepos" : null, 16 | "customDeps" : null, 17 | "customImports" : null, 18 | "customArgs" : null, 19 | "customSparkConf" : null, 20 | "customVars" : null 21 | }, 22 | "cells" : [ { 23 | "metadata" : { 24 | "id" : "40AAFF23B9654D2A9B2E0C59515BAF8A" 25 | }, 26 | "cell_type" : "markdown", 27 | "source" : "# Enriching Streaming Data" 28 | }, { 29 | "metadata" : { 30 | "id" : "D5BA242BD89A48F3816A0D7F53206849" 31 | }, 32 | "cell_type" : "markdown", 33 | "source" : "In this notebook we start where our previous example left us:\n\nWe have a process that is able to generate random data and we created a streaming job that is able to consume, \nparse and save this data to a parquet file.\n\nWe are now interested in loading a parquet file containig the configuration for each sensor in our system and use that data to enrich the incoming sensor data." 34 | }, { 35 | "metadata" : { 36 | "id" : "79CD566650664F04A8515F55A809A44D" 37 | }, 38 | "cell_type" : "markdown", 39 | "source" : "## Our Streaming dataset will consist of sensor information, containing the sensorId, a timestamp, and a value.\nFor the sake of simplicity in this self-contained example, we are going to generate a randomized dataset, using an scenario that simulates a real IoT use case.\nThe timestamp will be the time of execution and each record will be formatted as a string coming from \"the field\" of comma separated values.\n\nWe also add a bit of real-world chaos to the data: Due to weather conditions, some sensors publish corrupt data. " 40 | }, { 41 | "metadata" : { 42 | "trusted" : true, 43 | "input_collapsed" : false, 44 | "collapsed" : false, 45 | "id" : "1BA7507DA513491F8D3A1AED38F087CB" 46 | }, 47 | "cell_type" : "code", 48 | "source" : [ "val sensorCount = 100000\n", "val workDir = \"/tmp/learningsparkstreaming/\"\n", "val referenceFile = \"sensor-records.parquet\"\n", "val targetFile = \"enrichedIoTStream.parquet\"" ], 49 | "outputs" : [ ] 50 | }, { 51 | "metadata" : { 52 | "trusted" : true, 53 | "input_collapsed" : false, 54 | "collapsed" : false, 55 | "id" : "9CA4FD2CB34F47AF8097D5E2A2BF97DA" 56 | }, 57 | "cell_type" : "code", 58 | "source" : [ "import scala.util.Random\n", "val sensorId: () => Int = () => Random.nextInt(sensorCount)\n", "val data: () => Double = () => Random.nextDouble\n", "val timestamp: () => Long = () => System.currentTimeMillis\n", "val recordFunction: () => String = { () => \n", " if (Random.nextDouble < 0.9) {\n", " Seq(sensorId().toString, timestamp(), data()).mkString(\",\")\n", " } else {\n", " \"!!~corrupt~^&##$\" \n", " }\n", " }" ], 59 | "outputs" : [ ] 60 | }, { 61 | "metadata" : { 62 | "id" : "5806B402E3314F15A0B21F46BC97FF4A" 63 | }, 64 | "cell_type" : "markdown", 65 | "source" : "### We use a particular trick that requires a moment of attention\nInstead of creating an RDD of text records, we create an RDD of record-generating functions. 
\nThen, each time the RDD is evaluated, the record function will generate a new random record. \nThis way we can simulate a realistic load of data that delivers a different set on each batch." 66 | }, { 67 | "metadata" : { 68 | "trusted" : true, 69 | "input_collapsed" : false, 70 | "collapsed" : false, 71 | "id" : "A9AADE76D9DA471B9708B80B3E765890" 72 | }, 73 | "cell_type" : "code", 74 | "source" : [ "val sensorDataGenerator = sparkContext.parallelize(1 to 100).map(_ => recordFunction)\n", "val sensorData = sensorDataGenerator.map(recordFun => recordFun())" ], 75 | "outputs" : [ ] 76 | }, { 77 | "metadata" : { 78 | "id" : "62A5D12E89B24EFB8C411495942032B1" 79 | }, 80 | "cell_type" : "markdown", 81 | "source" : "# Load the reference data from a parquet file\nWe also cache the data to keep it in memory and improve the performance of our steaming application" 82 | }, { 83 | "metadata" : { 84 | "trusted" : true, 85 | "input_collapsed" : false, 86 | "collapsed" : false, 87 | "id" : "78E6ABA2D88B47F38905441B7D79E65D" 88 | }, 89 | "cell_type" : "code", 90 | "source" : [ "val sensorRef = sparkSession.read.parquet(s\"$workDir/$referenceFile\")\n", "sensorRef.cache()" ], 91 | "outputs" : [ ] 92 | }, { 93 | "metadata" : { 94 | "id" : "9CBAE4C0C1EB48DBB802A0CC2960B2DB" 95 | }, 96 | "cell_type" : "markdown", 97 | "source" : "(Parquet files preserve the schema information, which we can retrieve from the DataFrame)" 98 | }, { 99 | "metadata" : { 100 | "trusted" : true, 101 | "input_collapsed" : false, 102 | "collapsed" : false, 103 | "presentation" : { 104 | "tabs_state" : "{\n \"tab_id\": \"#tab484749271-0\"\n}", 105 | "pivot_chart_state" : "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [],\n \"rows\": [],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}" 106 | }, 107 | "id" : "774E09B823774494979FC8FE6AE3078F" 108 | }, 109 | "cell_type" : "code", 110 | "source" : [ "sensorRef.schema" ], 111 | "outputs" : [ ] 112 | }, { 113 | "metadata" : { 114 | "id" : "BFF9E80BAB5C4D468154BCD6FB5C6501" 115 | }, 116 | "cell_type" : "markdown", 117 | "source" : "## We create our Streaming Context" 118 | }, { 119 | "metadata" : { 120 | "trusted" : true, 121 | "input_collapsed" : false, 122 | "collapsed" : false, 123 | "id" : "83B2B58C552240BBA6FF518B1AD274EB" 124 | }, 125 | "cell_type" : "code", 126 | "source" : [ "import org.apache.spark.streaming.StreamingContext\n", "import org.apache.spark.streaming.Seconds\n", "\n", "@transient val streamingContext = new StreamingContext(sparkContext, Seconds(2))" ], 127 | "outputs" : [ ] 128 | }, { 129 | "metadata" : { 130 | "id" : "4E6DBF82BB6F4B6584233CD460F67263" 131 | }, 132 | "cell_type" : "markdown", 133 | "source" : "## Our stream source will be a ConstantInputDStream fed by the record-generating RDD.\nBy combining a constant input DStream with the record generating RDD, we create a changing stream of data to process in our example.\n(This method makes the example self-contained. 
It removes the need of an external stream generating process)\n" 134 | }, { 135 | "metadata" : { 136 | "trusted" : true, 137 | "input_collapsed" : false, 138 | "collapsed" : false, 139 | "id" : "DF03F66BDDE0447B8202D39F2C0202E2" 140 | }, 141 | "cell_type" : "code", 142 | "source" : [ "import org.apache.spark.streaming.dstream.ConstantInputDStream\n", "val rawDStream = new ConstantInputDStream(streamingContext, sensorData)\n" ], 143 | "outputs" : [ ] 144 | }, { 145 | "metadata" : { 146 | "id" : "CCCB597031E7451FB59D18BA85C0E4A4" 147 | }, 148 | "cell_type" : "markdown", 149 | "source" : "# Providing Schema information for our streaming data\nNow that we have a DStream of fresh data processed in a 2-second interval, we can start focusing on the gist of this example.\nFirst, we want to define and apply a schema to the data we are receiving.\nIn Scala, we can define a schema with a `case class`" 150 | }, { 151 | "metadata" : { 152 | "trusted" : true, 153 | "input_collapsed" : false, 154 | "collapsed" : false, 155 | "id" : "E7A917C393654969812E6E38223BBA52" 156 | }, 157 | "cell_type" : "code", 158 | "source" : [ "case class SensorData(sensorId: Int, timestamp: Long, value: Double)" ], 159 | "outputs" : [ ] 160 | }, { 161 | "metadata" : { 162 | "id" : "9AD1ACAD450E44DA8C046EB48CD4EE5A" 163 | }, 164 | "cell_type" : "markdown", 165 | "source" : "Now we apply that schema to the dstream, using the `flatMap` function.\n\nWe use `flatMap` instead of a `map` because there might be cases when the incoming data is incomplete or corrupted.\nIf we would use `map`, we would have to provide a resulting value for each transformed record. \nThat is something we cannot do for invalid records.\nWith `flatMap` in combination with `Option`, we can represent valid records as `Some(recordValue)` and invalid records as `None`.\nBy the virtue of `flatMap` the internal `Option` container gets flattend and our resulting stream will only contain valid `recordValue`s.\n\nDuring the parsing of the comma separated records, we not only protect ourselves against missing fields, but also parse the numeric values to their expected types. The surrounding `Try` captures any `NumberFormatException` that might arise from invalid records." 166 | }, { 167 | "metadata" : { 168 | "trusted" : true, 169 | "input_collapsed" : false, 170 | "collapsed" : false, 171 | "id" : "5285C2BBC1854F059AB8E1D0244AE1C7" 172 | }, 173 | "cell_type" : "code", 174 | "source" : [ "import scala.util.Try\n", "val schemaStream = rawDStream.flatMap{record => \n", " val fields = record.split(\",\")\n", " if (fields.size == 3) {\n", " Try (SensorData(fields(0).toInt, fields(1).toLong, fields(2).toDouble)).toOption\n", " } else { None }\n", " }" ], 175 | "outputs" : [ ] 176 | }, { 177 | "metadata" : { 178 | "id" : "7A7C144384904E96BE66A649BD193C15" 179 | }, 180 | "cell_type" : "markdown", 181 | "source" : "# Enrich the streaming data\nWith the schema stream in place, we can proceed to transform the underlying RDDs in DataFrames.\nThis time, we are going to use the reference data to add the specific sensor information. \n\nWe are also going to de-normalize the recorded value according to the sensor range so that we don't have to repeat that data in the resulting dataset.\n\nAs before, we do this in the context of the general-purpose action `foreachRDD`. 
" 182 | }, { 183 | "metadata" : { 184 | "trusted" : true, 185 | "input_collapsed" : false, 186 | "collapsed" : false, 187 | "id" : "99696609577F49DB809AF94C319CB449" 188 | }, 189 | "cell_type" : "code", 190 | "source" : [ "val stableSparkSession = sparkSession\n", "import stableSparkSession.implicits._\n", "import org.apache.spark.sql.SaveMode.Append\n", "schemaStream.foreachRDD{rdd => \n", " val sensorDF = rdd.toDF()\n", " val sensorWithInfo = sensorDF.join(broadcast(sensorRef), \"sensorId\")\n", " val denormalizedSensorData =\n", " sensorWithInfo.withColumn(\"dnvalue\", $\"value\"*($\"maxRange\"-$\"minRange\")+$\"minRange\")\n", " val sensorRecords = denormalizedSensorData.drop(\"value\", \"maxRange\", \"minRange\")\n", " sensorRecords.write.format(\"parquet\").mode(Append).save(s\"$workDir/$targetFile\")\n", " }" ], 191 | "outputs" : [ ] 192 | }, { 193 | "metadata" : { 194 | "trusted" : true, 195 | "input_collapsed" : false, 196 | "collapsed" : false, 197 | "id" : "F366201F2275412F818532AB671A55BC" 198 | }, 199 | "cell_type" : "code", 200 | "source" : [ "streamingContext.start()" ], 201 | "outputs" : [ ] 202 | }, { 203 | "metadata" : { 204 | "trusted" : true, 205 | "input_collapsed" : false, 206 | "collapsed" : false, 207 | "id" : "B6F0075E9BB04467858CABAA000489EF" 208 | }, 209 | "cell_type" : "code", 210 | "source" : [ "// Be careful not to stop the context if you want the streaming process to continue\n", "// uncomment the statement below and execute this cell to stop the streaming process\n", "// streamingContext.stop(false)" ], 211 | "outputs" : [ ] 212 | }, { 213 | "metadata" : { 214 | "id" : "5BF4B4ECDC794A769ED429A2D35B8A38" 215 | }, 216 | "cell_type" : "markdown", 217 | "source" : "#Inspect the result\nWe can use the current Spark Session concurrently with the running Spark Streaming job in order to inspect the resulting data.\n" 218 | }, { 219 | "metadata" : { 220 | "trusted" : true, 221 | "input_collapsed" : false, 222 | "collapsed" : false, 223 | "id" : "87973510A2E544B88D0825533CB24BC5" 224 | }, 225 | "cell_type" : "code", 226 | "source" : [ "val enrichedRecords = sparkSession.read.parquet(s\"$workDir/$targetFile\")\n", "enrichedRecords" ], 227 | "outputs" : [ ] 228 | }, { 229 | "metadata" : { 230 | "trusted" : true, 231 | "input_collapsed" : false, 232 | "collapsed" : false, 233 | "id" : "03C77BDE93904F3A8BDD12B66B427E5A" 234 | }, 235 | "cell_type" : "code", 236 | "source" : [ "enrichedRecords.count" ], 237 | "outputs" : [ ] 238 | }, { 239 | "metadata" : { 240 | "trusted" : true, 241 | "input_collapsed" : false, 242 | "collapsed" : false, 243 | "id" : "2DF364CB97B8419095105116FC4D1897" 244 | }, 245 | "cell_type" : "code", 246 | "source" : [ "enrichedRecords.count" ], 247 | "outputs" : [ ] 248 | }, { 249 | "metadata" : { 250 | "trusted" : true, 251 | "input_collapsed" : false, 252 | "collapsed" : true, 253 | "id" : "EE4C39F5EBCF44BB81021B4B50015C68" 254 | }, 255 | "cell_type" : "code", 256 | "source" : [ "" ], 257 | "outputs" : [ ] 258 | } ], 259 | "nbformat" : 4 260 | } -------------------------------------------------------------------------------- /chapter-23/reference-data-generator.snb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata" : { 3 | "id" : "bac1b48e-b675-41c4-83cb-14cfddef6ac7", 4 | "name" : "reference-data-generator", 5 | "user_save_timestamp" : "1970-01-01T01:00:00.000Z", 6 | "auto_save_timestamp" : "1970-01-01T01:00:00.000Z", 7 | "language_info" : { 8 | "name" : "scala", 9 | "file_extension" : "scala", 10 
| "codemirror_mode" : "text/x-scala" 11 | }, 12 | "trusted" : true, 13 | "sparkNotebook" : null, 14 | "customLocalRepo" : null, 15 | "customRepos" : null, 16 | "customDeps" : null, 17 | "customImports" : null, 18 | "customArgs" : null, 19 | "customSparkConf" : null, 20 | "customVars" : null 21 | }, 22 | "cells" : [ { 23 | "metadata" : { 24 | "id" : "05BA7B120BD24B44A94495B3B13DB04C" 25 | }, 26 | "cell_type" : "markdown", 27 | "source" : "#Sensor Reference Data Generator\n\nThis notebook generates the fixed reference dataset used through in the IoT examples.\n\nTo use, specify the target directory in the cell below and execute until the end." 28 | }, { 29 | "metadata" : { 30 | "trusted" : true, 31 | "input_collapsed" : false, 32 | "collapsed" : false, 33 | "id" : "40032AE70D9B4ABABFBC6EAA0A38F3B0" 34 | }, 35 | "cell_type" : "code", 36 | "source" : [ "val sensorCount = 100000\n", "val workDir = \"/tmp/streaming-with-spark\"\n", "val referenceFile = \"sensor-records.parquet\"" ], 37 | "outputs" : [ { 38 | "name" : "stdout", 39 | "output_type" : "stream", 40 | "text" : "sensorCount: Int = 100000\nworkDir: String = /tmp/streaming-with-spark/\nreferenceFile: String = sensor-records.parquet\n" 41 | }, { 42 | "metadata" : { }, 43 | "data" : { 44 | "text/html" : "" 45 | }, 46 | "output_type" : "execute_result", 47 | "execution_count" : 1, 48 | "time" : "Took: 1.024s, at 2019-05-12 17:47" 49 | } ] 50 | }, { 51 | "metadata" : { 52 | "trusted" : true, 53 | "input_collapsed" : false, 54 | "collapsed" : false, 55 | "id" : "C9A22043B45445E48F2DD2839DB0116B" 56 | }, 57 | "cell_type" : "code", 58 | "source" : [ "case class SensorType(sensorType: String, unit: String, minRange: Double, maxRange: Double)" ], 59 | "outputs" : [ { 60 | "name" : "stdout", 61 | "output_type" : "stream", 62 | "text" : "defined class SensorType\n" 63 | }, { 64 | "metadata" : { }, 65 | "data" : { 66 | "text/html" : "" 67 | }, 68 | "output_type" : "execute_result", 69 | "execution_count" : 2, 70 | "time" : "Took: 0.755s, at 2019-05-12 17:47" 71 | } ] 72 | }, { 73 | "metadata" : { 74 | "trusted" : true, 75 | "input_collapsed" : false, 76 | "collapsed" : false, 77 | "id" : "D28261212443425C8938630D60098A8C" 78 | }, 79 | "cell_type" : "code", 80 | "source" : [ "case class SensorReference(sensorId: Long, sensorType: String, unit: String, minRange: Double, maxRange: Double)" ], 81 | "outputs" : [ { 82 | "name" : "stdout", 83 | "output_type" : "stream", 84 | "text" : "defined class SensorReference\n" 85 | }, { 86 | "metadata" : { }, 87 | "data" : { 88 | "text/html" : "" 89 | }, 90 | "output_type" : "execute_result", 91 | "execution_count" : 3, 92 | "time" : "Took: 0.650s, at 2019-05-12 17:47" 93 | } ] 94 | }, { 95 | "metadata" : { 96 | "trusted" : true, 97 | "input_collapsed" : false, 98 | "collapsed" : false, 99 | "id" : "FBDD084D0A4648FC80B7E8BC4F14A3F9" 100 | }, 101 | "cell_type" : "code", 102 | "source" : [ "val sensorTypes = List (\n", " SensorType(\"humidity\", \"%Rh\", 0, 100),\n", " SensorType(\"temperature\", \"oC\", -100, 100),\n", " SensorType(\"brightness\", \"lux\", 0, 100000),\n", " SensorType(\"rainfall\",\"mm/day\",0, 5000),\n", " SensorType(\"windspeed\",\"m/s\", 0, 50),\n", " SensorType(\"pressure\", \"mmHg\", 800, 1100),\n", " SensorType(\"magnetism\", \"T\", 0, 1000),\n", " SensorType(\"Radiation\", \"mSv\", 0.01, 10000)\n", ")\n", "\n", " " ], 103 | "outputs" : [ { 104 | "name" : "stdout", 105 | "output_type" : "stream", 106 | "text" : "sensorTypes: List[SensorType] = List(SensorType(humidity,%Rh,0.0,100.0), 
SensorType(temperature,oC,-100.0,100.0), SensorType(brightness,lux,0.0,100000.0), SensorType(rainfall,mm/day,0.0,5000.0), SensorType(windspeed,m/s,0.0,50.0), SensorType(pressure,mmHg,800.0,1100.0), SensorType(magnetism,T,0.0,1000.0), SensorType(Radiation,mSv,0.01,10000.0))\n" 107 | }, { 108 | "metadata" : { }, 109 | "data" : { 110 | "text/html" : "" 111 | }, 112 | "output_type" : "execute_result", 113 | "execution_count" : 4, 114 | "time" : "Took: 0.696s, at 2019-05-12 17:47" 115 | } ] 116 | }, { 117 | "metadata" : { 118 | "trusted" : true, 119 | "input_collapsed" : false, 120 | "collapsed" : false, 121 | "id" : "B3C135CC5E0340928B1567700315991A" 122 | }, 123 | "cell_type" : "code", 124 | "source" : [ "val sensorIds = sparkSession.range(0, sensorCount)" ], 125 | "outputs" : [ { 126 | "name" : "stdout", 127 | "output_type" : "stream", 128 | "text" : "sensorIds: org.apache.spark.sql.Dataset[Long] = [id: bigint]\n" 129 | }, { 130 | "metadata" : { }, 131 | "data" : { 132 | "text/html" : "" 133 | }, 134 | "output_type" : "execute_result", 135 | "execution_count" : 5, 136 | "time" : "Took: 1.471s, at 2019-05-12 17:47" 137 | } ] 138 | }, { 139 | "metadata" : { 140 | "trusted" : true, 141 | "input_collapsed" : false, 142 | "collapsed" : false, 143 | "id" : "91C4D164711748C0A48D06854234213C" 144 | }, 145 | "cell_type" : "code", 146 | "source" : [ "import scala.util.Random\n", "val sensors = sensorIds.map{id => \n", " val sensorType = sensorTypes(Random.nextInt(sensorTypes.size))\n", " SensorReference(id, sensorType.sensorType, sensorType.unit, sensorType.minRange, sensorType.maxRange)\n", " }" ], 147 | "outputs" : [ { 148 | "name" : "stdout", 149 | "output_type" : "stream", 150 | "text" : "import scala.util.Random\nsensors: org.apache.spark.sql.Dataset[SensorReference] = [sensorId: bigint, sensorType: string ... 
3 more fields]\n" 151 | }, { 152 | "metadata" : { }, 153 | "data" : { 154 | "text/html" : "" 155 | }, 156 | "output_type" : "execute_result", 157 | "execution_count" : 6, 158 | "time" : "Took: 1.084s, at 2019-05-12 17:47" 159 | } ] 160 | }, { 161 | "metadata" : { 162 | "trusted" : true, 163 | "input_collapsed" : false, 164 | "collapsed" : false, 165 | "id" : "BB72B2989AE14D628301293C03B135F9" 166 | }, 167 | "cell_type" : "code", 168 | "source" : [ "sensors.show()" ], 169 | "outputs" : [ { 170 | "name" : "stdout", 171 | "output_type" : "stream", 172 | "text" : "+--------+-----------+------+--------+--------+\n|sensorId| sensorType| unit|minRange|maxRange|\n+--------+-----------+------+--------+--------+\n| 0| pressure| mmHg| 800.0| 1100.0|\n| 1| rainfall|mm/day| 0.0| 5000.0|\n| 2| brightness| lux| 0.0|100000.0|\n| 3|temperature| oC| -100.0| 100.0|\n| 4| pressure| mmHg| 800.0| 1100.0|\n| 5| humidity| %Rh| 0.0| 100.0|\n| 6| brightness| lux| 0.0|100000.0|\n| 7| pressure| mmHg| 800.0| 1100.0|\n| 8| windspeed| m/s| 0.0| 50.0|\n| 9| brightness| lux| 0.0|100000.0|\n| 10|temperature| oC| -100.0| 100.0|\n| 11| brightness| lux| 0.0|100000.0|\n| 12| pressure| mmHg| 800.0| 1100.0|\n| 13| humidity| %Rh| 0.0| 100.0|\n| 14| windspeed| m/s| 0.0| 50.0|\n| 15| magnetism| T| 0.0| 1000.0|\n| 16| pressure| mmHg| 800.0| 1100.0|\n| 17| Radiation| mSv| 0.01| 10000.0|\n| 18| pressure| mmHg| 800.0| 1100.0|\n| 19| Radiation| mSv| 0.01| 10000.0|\n+--------+-----------+------+--------+--------+\nonly showing top 20 rows\n\n" 173 | }, { 174 | "metadata" : { }, 175 | "data" : { 176 | "text/html" : "" 177 | }, 178 | "output_type" : "execute_result", 179 | "execution_count" : 7, 180 | "time" : "Took: 2.251s, at 2019-05-12 17:47" 181 | } ] 182 | }, { 183 | "metadata" : { 184 | "trusted" : true, 185 | "input_collapsed" : false, 186 | "collapsed" : false, 187 | "id" : "B88A683EFFDF47128101120929470552" 188 | }, 189 | "cell_type" : "code", 190 | "source" : [ "sensors.write.mode(\"overwrite\").parquet(s\"$workDir/$referenceFile\")\n" ], 191 | "outputs" : [ { 192 | "metadata" : { }, 193 | "data" : { 194 | "text/html" : "" 195 | }, 196 | "output_type" : "execute_result", 197 | "execution_count" : 8, 198 | "time" : "Took: 1.767s, at 2019-05-12 17:47" 199 | } ] 200 | }, { 201 | "metadata" : { 202 | "trusted" : true, 203 | "input_collapsed" : false, 204 | "collapsed" : true, 205 | "id" : "BC543FC3AFAD4B5CA57F180F23981707" 206 | }, 207 | "cell_type" : "code", 208 | "source" : [ "" ], 209 | "outputs" : [ ] 210 | } ], 211 | "nbformat" : 4 212 | } -------------------------------------------------------------------------------- /chapter-23/refresh-reference-data-streaming.snb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "id": "7ebf5557-84ed-4f34-8bf1-fdacb6839242", 4 | "name": "refresh-reference-data-streaming", 5 | "user_save_timestamp": "1970-01-01T01:00:00.000Z", 6 | "auto_save_timestamp": "1970-01-01T01:00:00.000Z", 7 | "language_info": { 8 | "name": "scala", 9 | "file_extension": "scala", 10 | "codemirror_mode": "text/x-scala" 11 | }, 12 | "trusted": true, 13 | "sparkNotebook": null, 14 | "customLocalRepo": null, 15 | "customRepos": null, 16 | "customDeps": null, 17 | "customImports": null, 18 | "customArgs": null, 19 | "customSparkConf": null, 20 | "customVars": null 21 | }, 22 | "cells": [ 23 | { 24 | "metadata": { 25 | "id": "40AAFF23B9654D2A9B2E0C59515BAF8A" 26 | }, 27 | "cell_type": "markdown", 28 | "source": "# Refresh Reference Data" 29 | }, 30 | { 31 | 
"metadata": { 32 | "id": "D5BA242BD89A48F3816A0D7F53206849" 33 | }, 34 | "cell_type": "markdown", 35 | "source": "In this notebook we continue our journey to improve our IoT streaming application.\nUp to now, we have a streaming job that processes randomly generated data as a stream. This stream is enriched using a fixed reference dataset that we load at start.\n\nThere is still an issue: We cannot add new sensors to our system. Once the reference dataset has been loaded and cached it cannot be changed from the outside world. A naive approach would be to remove the caching on the reference dataset and load the complete reference file on each streaming interval. Although this approach would work, it would not scale past certain file size. Loading data from secondary storage is costly and will consume computing resources that we would rather invest in processing the incoming data.\n\nIn this notebook we are going to explore a technique to amortize this cost. Instead of refreshing the dataset on each interval, we will cache the reference data for few interval and then we will refresh it. This process amortizes the cost over several streaming cycles, making it more \"affordable\".\n\nWe will be building on top of the previous notebooks and adding the new refresh logic. Refer to the *Learning Spark Streaming* book for a discussion of the delta changes in this notebook. " 36 | }, 37 | { 38 | "metadata": { 39 | "id": "79CD566650664F04A8515F55A809A44D" 40 | }, 41 | "cell_type": "markdown", 42 | "source": "## Our Streaming dataset will consist of sensor information, containing the sensorId, a timestamp, and a value.\nFor the sake of simplicity in this self-contained example, we are going to generate a randomized dataset, using an scenario that simulates a real IoT use case.\nThe timestamp will be the time of execution and each record will be formatted as a string coming from \"the field\" of comma separated values.\n\nWe also add a bit of real-world chaos to the data: Due to weather conditions, some sensors publish corrupt data. " 43 | }, 44 | { 45 | "metadata": { 46 | "trusted": true, 47 | "input_collapsed": false, 48 | "collapsed": false, 49 | "id": "1BA7507DA513491F8D3A1AED38F087CB" 50 | }, 51 | "cell_type": "code", 52 | "source": "val sensorCount = 100000\nval workDir = \"/tmp/learningsparkstreaming/\"\nval referenceFile = \"sensor-records.parquet\"\nval targetFile = \"enrichedIoTStream.parquet\"", 53 | "outputs": [] 54 | }, 55 | { 56 | "metadata": { 57 | "trusted": true, 58 | "input_collapsed": false, 59 | "collapsed": false, 60 | "id": "9CA4FD2CB34F47AF8097D5E2A2BF97DA" 61 | }, 62 | "cell_type": "code", 63 | "source": "import scala.util.Random\nval sensorId: () => Int = () => Random.nextInt(sensorCount)\nval data: () => Double = () => Random.nextDouble\nval timestamp: () => Long = () => System.currentTimeMillis\nval recordFunction: () => String = { () => \n if (Random.nextDouble < 0.9) {\n Seq(sensorId().toString, timestamp(), data()).mkString(\",\")\n } else {\n \"!!~corrupt~^&##$\" \n }\n }", 64 | "outputs": [] 65 | }, 66 | { 67 | "metadata": { 68 | "id": "5806B402E3314F15A0B21F46BC97FF4A" 69 | }, 70 | "cell_type": "markdown", 71 | "source": "### We use a particular trick that requires a moment of attention\nInstead of creating an RDD of text records, we create an RDD of record-generating functions. \nThen, each time the RDD is evaluated, the record function will generate a new random record. \nThis way we can simulate a realistic load of data that delivers a different set on each batch." 
72 | }, 73 | { 74 | "metadata": { 75 | "trusted": true, 76 | "input_collapsed": false, 77 | "collapsed": false, 78 | "id": "A9AADE76D9DA471B9708B80B3E765890" 79 | }, 80 | "cell_type": "code", 81 | "source": "val sensorDataGenerator = sparkContext.parallelize(1 to 100).map(_ => recordFunction)\nval sensorData = sensorDataGenerator.map(recordFun => recordFun())", 82 | "outputs": [] 83 | }, 84 | { 85 | "metadata": { 86 | "id": "BFF9E80BAB5C4D468154BCD6FB5C6501" 87 | }, 88 | "cell_type": "markdown", 89 | "source": "## We create our Streaming Context" 90 | }, 91 | { 92 | "metadata": { 93 | "trusted": true, 94 | "input_collapsed": false, 95 | "collapsed": false, 96 | "id": "83B2B58C552240BBA6FF518B1AD274EB" 97 | }, 98 | "cell_type": "code", 99 | "source": "import org.apache.spark.streaming.StreamingContext\nimport org.apache.spark.streaming.Seconds\n\nval streamingContext = new StreamingContext(sparkContext, Seconds(2))", 100 | "outputs": [] 101 | }, 102 | { 103 | "metadata": { 104 | "id": "4E6DBF82BB6F4B6584233CD460F67263" 105 | }, 106 | "cell_type": "markdown", 107 | "source": "## Our stream source will be a ConstantInputDStream fed by the record-generating RDD.\nBy combining a constant input DStream with the record generating RDD, we create a changing stream of data to process in our example.\n(This method makes the example self-contained. It removes the need of an external stream generating process)\n" 108 | }, 109 | { 110 | "metadata": { 111 | "trusted": true, 112 | "input_collapsed": false, 113 | "collapsed": false, 114 | "id": "DF03F66BDDE0447B8202D39F2C0202E2" 115 | }, 116 | "cell_type": "code", 117 | "source": "import org.apache.spark.streaming.dstream.ConstantInputDStream\nval rawDStream = new ConstantInputDStream(streamingContext, sensorData)\n", 118 | "outputs": [] 119 | }, 120 | { 121 | "metadata": { 122 | "id": "62A5D12E89B24EFB8C411495942032B1" 123 | }, 124 | "cell_type": "markdown", 125 | "source": "# Load the initial reference data from a parquet file\nWe load the initial state of our reference data in the same way we did for the case of the static file. The only difference is that the reference is held in a mutable variable." 126 | }, 127 | { 128 | "metadata": { 129 | "trusted": true, 130 | "input_collapsed": false, 131 | "collapsed": false, 132 | "id": "2CEA3423DAA1415584BC10519D2ACE9D" 133 | }, 134 | "cell_type": "code", 135 | "source": "var sensorRef = sparkSession.read.parquet(s\"$workDir/$referenceFile\")\nsensorRef.cache()", 136 | "outputs": [] 137 | }, 138 | { 139 | "metadata": { 140 | "trusted": true, 141 | "input_collapsed": false, 142 | "collapsed": true, 143 | "id": "E9EE391DF94E4FB79FCD1D9801BBF330" 144 | }, 145 | "cell_type": "markdown", 146 | "source": "# Let the Spark Streaming Scheduler refresh the data\n\nIn order to periodically load the reference data, we are going to 'hook' onto the Spark Streaming scheduler.\n\nAs we have mentioned before, at its heart, Spark Streaming is a high-performance scheduling framework on top of the Spark engine. We can take advantage of the Spark Streaming scheduling capabilities, to have it refresh our reference data in periodic intervals. We express that refresh interval as a `window` over the base `batch interval`. In practical terms, every `x` batches we are going to refresh our reference data. \n\nWe use a `ConstantInputDStream` with an empty `RDD`. 
This ensures that, at all times, we have an empty `DStream` whose only function will be to give us access to the scheduler through the `foreachRDD` function.\n\nAt each `window` interval, we will update the variable that points to the current `DataFrame`. This is a safe construction as the Spark Streaming scheduler will linearly execute the scheduled operations that are due at each `batch interval`. Therefore, the new data will be available for the upstream operations that make use of it.\n\nWe use caching to ensure that the reference dataset is only loaded once over the intervals that it's used in the streaming application.\nIt's also important to `cache` the expiring data that was previously cached in order to free resources in the cluster and ensure that we have a stable system from the perspective of resource consumption." 147 | }, 148 | { 149 | "metadata": { 150 | "trusted": true, 151 | "input_collapsed": false, 152 | "collapsed": false, 153 | "id": "E15017FBFBB547FF888612C60F62F851" 154 | }, 155 | "cell_type": "code", 156 | "source": "import org.apache.spark.rdd.RDD\nval emptyRDD: RDD[Int] = sparkContext.emptyRDD\nval refreshDStream = new ConstantInputDStream(streamingContext, emptyRDD)\nval refreshIntervalDStream = refreshDStream.window(Seconds(60), Seconds(60))\nrefreshIntervalDStream.foreachRDD{ _ =>\n sensorRef.unpersist(false)\n sensorRef = sparkSession.read.parquet(s\"$workDir/$referenceFile\")\n sensorRef.cache()\n}\n ", 157 | "outputs": [] 158 | }, 159 | { 160 | "metadata": { 161 | "id": "CCCB597031E7451FB59D18BA85C0E4A4" 162 | }, 163 | "cell_type": "markdown", 164 | "source": "# Providing Schema information for our streaming data\nAs before, we want to define and apply a schema to the data we are receiving.\nIn Scala, we can define a schema with a `case class`" 165 | }, 166 | { 167 | "metadata": { 168 | "trusted": true, 169 | "input_collapsed": false, 170 | "collapsed": false, 171 | "id": "E7A917C393654969812E6E38223BBA52" 172 | }, 173 | "cell_type": "code", 174 | "source": "case class SensorData(sensorId: Int, timestamp: Long, value: Double)", 175 | "outputs": [] 176 | }, 177 | { 178 | "metadata": { 179 | "id": "9AD1ACAD450E44DA8C046EB48CD4EE5A" 180 | }, 181 | "cell_type": "markdown", 182 | "source": "Now we apply that schema to the dstream, using the `flatMap` function.\n\nWe use `flatMap` instead of a `map` because there might be cases when the incoming data is incomplete or corrupted.\nIf we would use `map`, we would have to provide a resulting value for each transformed record. \nThat is something we cannot do for invalid records.\nWith `flatMap` in combination with `Option`, we can represent valid records as `Some(recordValue)` and invalid records as `None`.\nBy the virtue of `flatMap` the internal `Option` container gets flattend and our resulting stream will only contain valid `recordValue`s.\n\nDuring the parsing of the comma separated records, we not only protect ourselves against missing fields, but also parse the numeric values to their expected types. The surrounding `Try` captures any `NumberFormatException` that might arise from invalid records." 
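// A minimal sketch exercising the same Try/Option parsing on plain strings, so the
// behaviour on well-formed and corrupt records can be checked without a running stream
// (SensorData is the case class defined above).
import scala.util.Try
def parseRecord(record: String): Option[SensorData] = {
  val fields = record.split(",")
  if (fields.size == 3)
    Try(SensorData(fields(0).toInt, fields(1).toLong, fields(2).toDouble)).toOption
  else None
}
parseRecord("17,1557676800000,0.42")   // Some(SensorData(17,1557676800000,0.42))
parseRecord("!!~corrupt~^&##$")        // None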
183 | }, 184 | { 185 | "metadata": { 186 | "trusted": true, 187 | "input_collapsed": false, 188 | "collapsed": false, 189 | "id": "5285C2BBC1854F059AB8E1D0244AE1C7" 190 | }, 191 | "cell_type": "code", 192 | "source": "import scala.util.Try\nval schemaStream = rawDStream.flatMap{record => \n val fields = record.split(\",\")\n if (fields.size == 3) {\n Try (SensorData(fields(0).toInt, fields(1).toLong, fields(2).toDouble)).toOption\n } else { None }\n }", 193 | "outputs": [] 194 | }, 195 | { 196 | "metadata": { 197 | "id": "7A7C144384904E96BE66A649BD193C15" 198 | }, 199 | "cell_type": "markdown", 200 | "source": "# Enrich the streaming data\nWith the schema stream in place, we can proceed to transform the underlying RDDs using DataFrames.\nWe are going to use the reference data to add the specific sensor information. Note that from the perspective of the core algorithm in this example, nothing has changed. We use the reference data as before to enrich our sensor information.\n\nAs before, we do this in the context of the general-purpose action `foreachRDD`. In this context, we can convert the incoming data to a `DataFrame` and use the `join` operation defined on `DataFrame`s to \n" 201 | }, 202 | { 203 | "metadata": { 204 | "trusted": true, 205 | "input_collapsed": false, 206 | "collapsed": false, 207 | "id": "99696609577F49DB809AF94C319CB449" 208 | }, 209 | "cell_type": "code", 210 | "source": "val stableSparkSession = sparkSession\nimport stableSparkSession.implicits._\nimport org.apache.spark.sql.SaveMode.Append\nschemaStream.foreachRDD{rdd => \n val sensorDF = rdd.toDF()\n val sensorWithInfo = sensorDF.join(sensorRef, \"sensorId\")\n val denormalizedSensorData =\n sensorWithInfo.withColumn(\"dnvalue\", $\"value\"*($\"maxRange\"-$\"minRange\")+$\"minRange\")\n val sensorRecords = denormalizedSensorData.drop(\"value\", \"maxRange\", \"minRange\")\n sensorRecords.write.format(\"parquet\").mode(Append).save(s\"$workDir/$targetFile\")\n }", 211 | "outputs": [] 212 | }, 213 | { 214 | "metadata": { 215 | "trusted": true, 216 | "input_collapsed": false, 217 | "collapsed": false, 218 | "id": "F366201F2275412F818532AB671A55BC" 219 | }, 220 | "cell_type": "code", 221 | "source": "streamingContext.start()", 222 | "outputs": [] 223 | }, 224 | { 225 | "metadata": { 226 | "trusted": true, 227 | "input_collapsed": false, 228 | "collapsed": true, 229 | "id": "B6F0075E9BB04467858CABAA000489EF" 230 | }, 231 | "cell_type": "markdown", 232 | "source": "Be careful not to stop the context if you want the streaming process to continue. 
\n" 233 | }, 234 | { 235 | "metadata": { 236 | "trusted": true, 237 | "input_collapsed": false, 238 | "collapsed": false, 239 | "id": "1169F38EFAB44E5C89BE6D8C85035CCA" 240 | }, 241 | "cell_type": "code", 242 | "source": "streamingContext.stop(stopSparkContext=false, stopGracefully=true )", 243 | "outputs": [] 244 | }, 245 | { 246 | "metadata": { 247 | "id": "5BF4B4ECDC794A769ED429A2D35B8A38" 248 | }, 249 | "cell_type": "markdown", 250 | "source": "#Inspect the result\nWe can use the current Spark Session concurrently with the running Spark Streaming job in order to inspect the resulting data.\n" 251 | }, 252 | { 253 | "metadata": { 254 | "trusted": true, 255 | "input_collapsed": false, 256 | "collapsed": false, 257 | "id": "87973510A2E544B88D0825533CB24BC5" 258 | }, 259 | "cell_type": "code", 260 | "source": "val enrichedRecords = sparkSession.read.parquet(s\"$workDir/$targetFile\")\nenrichedRecords", 261 | "outputs": [] 262 | }, 263 | { 264 | "metadata": { 265 | "trusted": true, 266 | "input_collapsed": false, 267 | "collapsed": false, 268 | "id": "03C77BDE93904F3A8BDD12B66B427E5A" 269 | }, 270 | "cell_type": "code", 271 | "source": "enrichedRecords.count", 272 | "outputs": [] 273 | }, 274 | { 275 | "metadata": { 276 | "trusted": true, 277 | "input_collapsed": false, 278 | "collapsed": true, 279 | "id": "EE4C39F5EBCF44BB81021B4B50015C68" 280 | }, 281 | "cell_type": "code", 282 | "source": "", 283 | "outputs": [] 284 | } 285 | ], 286 | "nbformat": 4 287 | } 288 | -------------------------------------------------------------------------------- /chapter-23/streaming-data-to-parquet.snb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata" : { 3 | "id" : "a6321d91-1183-4799-8d64-3b9deecfb255", 4 | "name" : "streaming-data-to-parquet", 5 | "user_save_timestamp" : "1970-01-01T01:00:00.000Z", 6 | "auto_save_timestamp" : "1970-01-01T01:00:00.000Z", 7 | "language_info" : { 8 | "name" : "scala", 9 | "file_extension" : "scala", 10 | "codemirror_mode" : "text/x-scala" 11 | }, 12 | "trusted" : true, 13 | "sparkNotebook" : null, 14 | "customLocalRepo" : null, 15 | "customRepos" : null, 16 | "customDeps" : null, 17 | "customImports" : null, 18 | "customArgs" : null, 19 | "customSparkConf" : null, 20 | "customVars" : null 21 | }, 22 | "cells" : [ { 23 | "metadata" : { 24 | "id" : "40AAFF23B9654D2A9B2E0C59515BAF8A" 25 | }, 26 | "cell_type" : "markdown", 27 | "source" : "# Writing Streaming Data to Parquet" 28 | }, { 29 | "metadata" : { 30 | "id" : "79CD566650664F04A8515F55A809A44D" 31 | }, 32 | "cell_type" : "markdown", 33 | "source" : "## Our Streaming dataset will consist of sensor information, containing the sensorId, a timestamp, and a value.\nFor the sake of simplicity in this self-contained example, we are going to generate a randomized dataset, using an scenario that simulates a real IoT use case.\nThe timestamp will be the time of execution and each record will be formatted as a string coming from \"the field\" of comma separated values.\n\nWe also add a bit of real-world chaos to the data: Due to weather conditions, some sensors publish corrupt data. 
" 34 | }, { 35 | "metadata" : { 36 | "trusted" : true, 37 | "input_collapsed" : false, 38 | "collapsed" : true, 39 | "id" : "6074E008266C4FDF92DFB936D47A7CBE" 40 | }, 41 | "cell_type" : "code", 42 | "source" : "val sensorCount = 100000\nval workDir = \"/tmp/learningsparkstreaming/\"", 43 | "outputs" : [ ] 44 | }, { 45 | "metadata" : { 46 | "trusted" : true, 47 | "input_collapsed" : false, 48 | "collapsed" : true, 49 | "id" : "9CA4FD2CB34F47AF8097D5E2A2BF97DA" 50 | }, 51 | "cell_type" : "code", 52 | "source" : "import scala.util.Random\n\nval sensorId: () => Int = () => Random.nextInt(sensorCount) // 100K sensors in our system\nval data: () => Double = () => Random.nextDouble\nval timestamp: () => Long = () => System.currentTimeMillis\nval recordFunction: () => String = { () => \n if (Random.nextDouble < 0.9) {\n Seq(sensorId().toString, timestamp(), data()).mkString(\",\")\n } else {\n \"!!~corrupt~^&##$\" \n }\n }", 53 | "outputs" : [ ] 54 | }, { 55 | "metadata" : { 56 | "id" : "5806B402E3314F15A0B21F46BC97FF4A" 57 | }, 58 | "cell_type" : "markdown", 59 | "source" : "### We use a particular trick that requires a moment of attention\nInstead of creating an RDD of text records, we create an RDD of record-generating functions. \nThen, each time the RDD is evaluated, the record function will generate a new random record. \nThis way we can simulate a realistic load of data that delivers a different set on each batch." 60 | }, { 61 | "metadata" : { 62 | "trusted" : true, 63 | "input_collapsed" : false, 64 | "collapsed" : true, 65 | "id" : "A9AADE76D9DA471B9708B80B3E765890" 66 | }, 67 | "cell_type" : "code", 68 | "source" : "val sensorDataGenerator = sparkContext.parallelize(1 to 100).map(_ => recordFunction)\nval sensorData = sensorDataGenerator.map(recordFun => recordFun())", 69 | "outputs" : [ ] 70 | }, { 71 | "metadata" : { 72 | "id" : "9954BB2935D8408C9FC5F563BB108504" 73 | }, 74 | "cell_type" : "markdown", 75 | "source" : "## Lets sample some data" 76 | }, { 77 | "metadata" : { 78 | "trusted" : true, 79 | "input_collapsed" : false, 80 | "collapsed" : true, 81 | "presentation" : { 82 | "tabs_state" : "{\n \"tab_id\": \"#tab1859881927-0\"\n}", 83 | "pivot_chart_state" : "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [],\n \"rows\": [],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}" 84 | }, 85 | "id" : "0D10D927B8F74EB88ECD6DB3A60A6064" 86 | }, 87 | "cell_type" : "code", 88 | "source" : "sensorData.take(5)", 89 | "outputs" : [ ] 90 | }, { 91 | "metadata" : { 92 | "id" : "BFF9E80BAB5C4D468154BCD6FB5C6501" 93 | }, 94 | "cell_type" : "markdown", 95 | "source" : "## We create our Streaming Context" 96 | }, { 97 | "metadata" : { 98 | "trusted" : true, 99 | "input_collapsed" : false, 100 | "collapsed" : true, 101 | "id" : "83B2B58C552240BBA6FF518B1AD274EB" 102 | }, 103 | "cell_type" : "code", 104 | "source" : "import org.apache.spark.streaming.StreamingContext\nimport org.apache.spark.streaming.Seconds\n\nval streamingContext = new StreamingContext(sparkContext, Seconds(2))", 105 | "outputs" : [ ] 106 | }, { 107 | "metadata" : { 108 | "id" : "4E6DBF82BB6F4B6584233CD460F67263" 109 | }, 110 | "cell_type" : "markdown", 111 | "source" : "## Our stream source will be a ConstantInputDStream fed by the record-generating RDD.\nBy combining a constant input DStream with the record generating RDD, we create a changing 
stream of data to process in our example.\n(This method makes the example self-contained. It removes the need of an external stream generating process)\n" 112 | }, { 113 | "metadata" : { 114 | "trusted" : true, 115 | "input_collapsed" : false, 116 | "collapsed" : true, 117 | "id" : "DF03F66BDDE0447B8202D39F2C0202E2" 118 | }, 119 | "cell_type" : "code", 120 | "source" : "import org.apache.spark.streaming.dstream.ConstantInputDStream\nval rawDStream = new ConstantInputDStream(streamingContext, sensorData)\n", 121 | "outputs" : [ ] 122 | }, { 123 | "metadata" : { 124 | "id" : "CCCB597031E7451FB59D18BA85C0E4A4" 125 | }, 126 | "cell_type" : "markdown", 127 | "source" : "# Providing Schema information for our streaming data\nNow that we have a DStream of fresh data processed in a 2-second interval, we can start focusing on the gist of this example.\nFirst, we want to define and apply a schema to the data we are receiving.\nIn Scala, we can define a schema with a `case class`" 128 | }, { 129 | "metadata" : { 130 | "trusted" : true, 131 | "input_collapsed" : false, 132 | "collapsed" : true, 133 | "id" : "E7A917C393654969812E6E38223BBA52" 134 | }, 135 | "cell_type" : "code", 136 | "source" : "case class SensorData(sensorId: Int, timestamp: Long, value: Double)", 137 | "outputs" : [ ] 138 | }, { 139 | "metadata" : { 140 | "id" : "9AD1ACAD450E44DA8C046EB48CD4EE5A" 141 | }, 142 | "cell_type" : "markdown", 143 | "source" : "Now we apply that schema to the dstream, using the `flatMap` function.\n\nWe use `flatMap` instead of a `map` because there might be cases when the incoming data is incomplete or corrupted.\nIf we would use `map`, we would have to provide a resulting value for each transformed record. \nThat is something we cannot do for invalid records.\nWith `flatMap` in combination with `Option`, we can represent valid records as `Some(recordValue)` and invalid records as `None`.\nBy the virtue of `flatMap` the internal `Option` container gets flattend and our resulting stream will only contain valid `recordValue`s.\n\nDuring the parsing of the comma separated records, we not only protect ourselves against missing fields, but also parse the numeric values to their expected types. The surrounding `Try` captures any `NumberFormatException` that might arise from invalid records." 144 | }, { 145 | "metadata" : { 146 | "trusted" : true, 147 | "input_collapsed" : false, 148 | "collapsed" : true, 149 | "id" : "5285C2BBC1854F059AB8E1D0244AE1C7" 150 | }, 151 | "cell_type" : "code", 152 | "source" : "import scala.util.Try\nval schemaStream = rawDStream.flatMap{record => \n val fields = record.split(\",\")\n if (fields.size == 3) {\n Try (SensorData(fields(0).toInt, fields(1).toLong, fields(2).toDouble)).toOption\n } else { None }\n }", 153 | "outputs" : [ ] 154 | }, { 155 | "metadata" : { 156 | "id" : "7A7C144384904E96BE66A649BD193C15" 157 | }, 158 | "cell_type" : "markdown", 159 | "source" : "# Saving DataFrames\nWith the schema stream in place, we can proceed to transform the underlying RDDs in DataFrames.\nWe do this in the context of the general-purpose action `foreachRDD`. \nIt's impossible to use transformation at this point because a `DStream[DataFrame]` is undefined.\nThis also means that any further operations we would like to apply to the DataFrame (or DataSet) needs to be contained in the scope of the \n`foreachRDD` closure." 
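// A sketch of how the sink can be inspected from this notebook once the job below is
// running: the regular batch reader sees the appended files, and re-running the count
// shows it grow batch after batch (assumes the workDir/iotstream.parquet path used next).
val written = sparkSession.read.parquet(s"$workDir/iotstream.parquet")
written.printSchema()
written.count()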
160 | }, { 161 | "metadata" : { 162 | "trusted" : true, 163 | "input_collapsed" : false, 164 | "collapsed" : true, 165 | "id" : "99696609577F49DB809AF94C319CB449" 166 | }, 167 | "cell_type" : "code", 168 | "source" : "import org.apache.spark.sql.SaveMode.Append\nschemaStream.foreachRDD{rdd => \n val df = rdd.toDF()\n df.write.format(\"parquet\").mode(Append).save(s\"$workDir/iotstream.parquet\")\n }", 169 | "outputs" : [ ] 170 | }, { 171 | "metadata" : { 172 | "trusted" : true, 173 | "input_collapsed" : false, 174 | "collapsed" : true, 175 | "id" : "F366201F2275412F818532AB671A55BC" 176 | }, 177 | "cell_type" : "code", 178 | "source" : "streamingContext.start()", 179 | "outputs" : [ ] 180 | }, { 181 | "metadata" : { 182 | "trusted" : true, 183 | "input_collapsed" : false, 184 | "collapsed" : true, 185 | "id" : "B6F0075E9BB04467858CABAA000489EF" 186 | }, 187 | "cell_type" : "code", 188 | "source" : "streamingContext.stop(false)", 189 | "outputs" : [ ] 190 | }, { 191 | "metadata" : { 192 | "trusted" : true, 193 | "input_collapsed" : false, 194 | "collapsed" : true, 195 | "id" : "87973510A2E544B88D0825533CB24BC5" 196 | }, 197 | "cell_type" : "code", 198 | "source" : "", 199 | "outputs" : [ ] 200 | } ], 201 | "nbformat" : 4 202 | } 203 | -------------------------------------------------------------------------------- /chapter-25/streaming-listener-example.snb.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata" : { 3 | "id" : "033cd6c0-1dcb-4a13-9367-c83c4629f49d", 4 | "name" : "streaming-listener-example.snb.ipynb", 5 | "user_save_timestamp" : "2019-05-13T00:30:13.957Z", 6 | "auto_save_timestamp" : "1970-01-01T01:00:00.000Z", 7 | "language_info" : { 8 | "name" : "scala", 9 | "file_extension" : "scala", 10 | "codemirror_mode" : "text/x-scala" 11 | }, 12 | "trusted" : true, 13 | "sparkNotebook" : null, 14 | "customLocalRepo" : null, 15 | "customRepos" : null, 16 | "customDeps" : [ "org.apache.spark %% spark-streaming-kafka-0-8 % 2.1.0" ], 17 | "customImports" : null, 18 | "customArgs" : null, 19 | "customSparkConf" : { 20 | "jars" : "" 21 | }, 22 | "customVars" : null 23 | }, 24 | "cells" : [ { 25 | "metadata" : { 26 | "id" : "40AAFF23B9654D2A9B2E0C59515BAF8A" 27 | }, 28 | "cell_type" : "markdown", 29 | "source" : "# Streaming Data with Kafka - With added Custom Listener" 30 | }, { 31 | "metadata" : { 32 | "id" : "D5BA242BD89A48F3816A0D7F53206849" 33 | }, 34 | "cell_type" : "markdown", 35 | "source" : "In this notebook extends our Kafka job with an custom Listener that will let us visualize the lifecycle of the streaming job.\n\nAs we have explored the application logic of this job already, we can head to towards the bottom of the notebook where we implement a custom StreamingListener and register it to the StreamingContext.\n\nThen, when we start our streaming job, we can observe the different streaming events reported in a reactive UI Table widget in the notebook." 
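// A minimal sketch of the listener mechanism before the full table-widget implementation
// further down: only one callback is overridden, and registration works the same way
// (streamingContext.addStreamingListener(new MinimalLoggingListener())).
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}
class MinimalLoggingListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"batch ${info.batchTime} completed, total delay: ${info.totalDelay.getOrElse(-1L)} ms")
  }
}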
36 | }, { 37 | "metadata" : { 38 | "id" : "79CD566650664F04A8515F55A809A44D" 39 | }, 40 | "cell_type" : "markdown", 41 | "source" : "\n## Our Streaming dataset will consist of sensor information, containing the sensorId, a timestamp, and a value.\nFor the sake of simplicity in this self-contained example, we are going to generate a randomized dataset, using an scenario that simulates a real IoT use case.\nThe timestamp will be the time of execution and each record will be formatted as a string coming from \"the field\" of comma separated values.\n\nWe also add a bit of real-world chaos to the data: Due to weather conditions, some sensors publish corrupt data. " 42 | }, { 43 | "metadata" : { 44 | "trusted" : true, 45 | "input_collapsed" : false, 46 | "collapsed" : true, 47 | "id" : "1BA7507DA513491F8D3A1AED38F087CB" 48 | }, 49 | "cell_type" : "code", 50 | "source" : [ "val topic = \"iot-data\"\n", "val workDir = \"/tmp/learningsparkstreaming/\"\n", "val referenceFile = \"sensor-records.parquet\"\n", "val targetFile = \"enrichedIoTStream.parquet\"\n", "val unknownSensorsTargetFile = \"unknownSensorsStream.parquet\"\n", "val kafkaBootstrapServer = \"127.0.0.1:9092\"" ], 51 | "outputs" : [ ] 52 | }, { 53 | "metadata" : { 54 | "id" : "62A5D12E89B24EFB8C411495942032B1" 55 | }, 56 | "cell_type" : "markdown", 57 | "source" : "# Load the reference data from a parquet file\nWe also cache the data to keep it in memory and improve the performance of our steaming application" 58 | }, { 59 | "metadata" : { 60 | "trusted" : true, 61 | "input_collapsed" : false, 62 | "collapsed" : true, 63 | "id" : "78E6ABA2D88B47F38905441B7D79E65D" 64 | }, 65 | "cell_type" : "code", 66 | "source" : [ "val sensorRef = sparkSession.read.parquet(s\"$workDir/$referenceFile\")\n", "sensorRef.cache()" ], 67 | "outputs" : [ ] 68 | }, { 69 | "metadata" : { 70 | "id" : "9CBAE4C0C1EB48DBB802A0CC2960B2DB" 71 | }, 72 | "cell_type" : "markdown", 73 | "source" : "(Parquet files preserve the schema information, which we can retrieve from the DataFrame)" 74 | }, { 75 | "metadata" : { 76 | "trusted" : true, 77 | "input_collapsed" : false, 78 | "collapsed" : true, 79 | "presentation" : { 80 | "tabs_state" : "{\n \"tab_id\": \"#tab188158199-0\"\n}", 81 | "pivot_chart_state" : "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [],\n \"rows\": [],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}" 82 | }, 83 | "id" : "774E09B823774494979FC8FE6AE3078F" 84 | }, 85 | "cell_type" : "code", 86 | "source" : [ "sensorRef.schema" ], 87 | "outputs" : [ ] 88 | }, { 89 | "metadata" : { 90 | "id" : "BFF9E80BAB5C4D468154BCD6FB5C6501" 91 | }, 92 | "cell_type" : "markdown", 93 | "source" : "## We create our Streaming Context" 94 | }, { 95 | "metadata" : { 96 | "trusted" : true, 97 | "input_collapsed" : false, 98 | "collapsed" : true, 99 | "id" : "83B2B58C552240BBA6FF518B1AD274EB" 100 | }, 101 | "cell_type" : "code", 102 | "source" : [ "import org.apache.spark.streaming.StreamingContext\n", "import org.apache.spark.streaming.Seconds\n", "\n", "val streamingContext = new StreamingContext(sparkContext, Seconds(2))" ], 103 | "outputs" : [ ] 104 | }, { 105 | "metadata" : { 106 | "id" : "4E6DBF82BB6F4B6584233CD460F67263" 107 | }, 108 | "cell_type" : "markdown", 109 | "source" : "## Our stream source will be a a Direct Kafka Stream\n" 110 | }, { 111 | "metadata" : { 112 | "trusted" : 
true, 113 | "input_collapsed" : false, 114 | "collapsed" : true, 115 | "id" : "DF03F66BDDE0447B8202D39F2C0202E2" 116 | }, 117 | "cell_type" : "code", 118 | "source" : [ "import org.apache.kafka.clients.consumer.ConsumerRecord\n", "import kafka.serializer.StringDecoder\n", "import org.apache.spark.streaming.kafka._\n", "\n", "val kafkaParams = Map[String, String](\n", " \"metadata.broker.list\" -> kafkaBootstrapServer,\n", " \"group.id\" -> \"iot-data-group\",\n", " \"auto.offset.reset\" -> \"largest\",\n", " \"enable.auto.commit\" -> (false: java.lang.Boolean).toString\n", ")\n", "\n", "val topics = Set(topic)\n", "@transient val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](\n", " streamingContext, kafkaParams, topics)\n", "\n", "// @transient val stream = KafkaUtils.createDirectStream[String, String](\n", "// streamingContext,\n", "// PreferConsistent,\n", "// Subscribe[String, String](topics, kafkaParams)\n", "// )\n", "\n" ], 119 | "outputs" : [ ] 120 | }, { 121 | "metadata" : { 122 | "id" : "CCCB597031E7451FB59D18BA85C0E4A4" 123 | }, 124 | "cell_type" : "markdown", 125 | "source" : "# Providing Schema information for our streaming data\nNow that we have a DStream of fresh data processed in a 2-second interval, we can start focusing on the gist of this example.\nFirst, we want to define and apply a schema to the data we are receiving.\nIn Scala, we can define a schema with a `case class`" 126 | }, { 127 | "metadata" : { 128 | "trusted" : true, 129 | "input_collapsed" : false, 130 | "collapsed" : true, 131 | "id" : "E7A917C393654969812E6E38223BBA52" 132 | }, 133 | "cell_type" : "code", 134 | "source" : [ "case class SensorData(sensorId: Int, timestamp: Long, value: Double)" ], 135 | "outputs" : [ ] 136 | }, { 137 | "metadata" : { 138 | "id" : "9AD1ACAD450E44DA8C046EB48CD4EE5A" 139 | }, 140 | "cell_type" : "markdown", 141 | "source" : "Now we apply that schema to the dstream, using the `flatMap` function.\n\nWe use `flatMap` instead of a `map` because there might be cases when the incoming data is incomplete or corrupted.\nIf we would use `map`, we would have to provide a resulting value for each transformed record. \nThat is something we cannot do for invalid records.\nWith `flatMap` in combination with `Option`, we can represent valid records as `Some(recordValue)` and invalid records as `None`.\nBy the virtue of `flatMap` the internal `Option` container gets flattend and our resulting stream will only contain valid `recordValue`s.\n\nDuring the parsing of the comma separated records, we not only protect ourselves against missing fields, but also parse the numeric values to their expected types. The surrounding `Try` captures any `NumberFormatException` that might arise from invalid records." 
142 | }, { 143 | "metadata" : { 144 | "trusted" : true, 145 | "input_collapsed" : false, 146 | "collapsed" : true, 147 | "id" : "5285C2BBC1854F059AB8E1D0244AE1C7" 148 | }, 149 | "cell_type" : "code", 150 | "source" : [ "import scala.util.Try\n", "val schemaStream = stream.flatMap{case (id, record) => \n", " val fields = record.split(\",\")\n", " if (fields.size == 3) {\n", " Try (SensorData(fields(0).toInt, fields(1).toLong, fields(2).toDouble)).toOption\n", " } else { None }\n", " }" ], 151 | "outputs" : [ ] 152 | }, { 153 | "metadata" : { 154 | "id" : "7A7C144384904E96BE66A649BD193C15" 155 | }, 156 | "cell_type" : "markdown", 157 | "source" : "# Enrich the streaming data, without dropping records.\nWith the schema stream in place, we can proceed to transform the underlying RDDs into DataFrames.\n\nAs in the previous notebook, we are going to use the reference data to add the specific sensor information.\nPreviously, we used the default 'join', which is an inner-join that requires the join key to be available on both sides of the join.\nThis causes us to drop all data records for which we don't know the id. Given that new sensors might become available or misconfigured sensors might be sending an incorrect id, we would like to preserve all records in order to reconcile them in a latter stage.\n\nAs before, we do this in the context of the general-purpose action `foreachRDD`. " 158 | }, { 159 | "metadata" : { 160 | "trusted" : true, 161 | "input_collapsed" : false, 162 | "collapsed" : true, 163 | "id" : "99696609577F49DB809AF94C319CB449" 164 | }, 165 | "cell_type" : "code", 166 | "source" : [ "val stableSparkSession = sparkSession\n", "import stableSparkSession.implicits._\n", "import org.apache.spark.sql.SaveMode.Append\n", "schemaStream.foreachRDD{rdd => \n", " val sensorDF = rdd.toDF()\n", " val sensorWithInfo = sensorRef.join(broadcast(sensorDF), Seq(\"sensorId\"), \"rightouter\")\n", " val unknownSensors = sensorWithInfo.filter($\"sensorType\".isNull) \n", " val knownSensors = sensorWithInfo.filter(!$\"sensorType\".isNull) \n", " val denormalizedSensorData =\n", " knownSensors.withColumn(\"dnvalue\", $\"value\"*($\"maxRange\"-$\"minRange\")+$\"minRange\")\n", " val sensorRecords = denormalizedSensorData.drop(\"value\", \"maxRange\", \"minRange\")\n", " val ts= System.currentTimeMillis\n", " sensorRecords.write.format(\"parquet\").mode(Append).save(s\"$workDir/$targetFile\")\n", " unknownSensors.write.format(\"parquet\").mode(Append).save(s\"$workDir/$unknownSensorsTargetFile\")\n", " }" ], 167 | "outputs" : [ ] 168 | }, { 169 | "metadata" : { 170 | "id" : "DEAD82F07F53459D8D7213998F106D0F" 171 | }, 172 | "cell_type" : "markdown", 173 | "source" : "# Custom Streaming Listener\nThis sample custom listener shows how to implement a Streaming Custom Listener to receive updates about our streaming application evetns and progress.\nWe have opted for a UI data display: This custom listener produces a Notebook Widget that reactively receives and displays the notified data from the `StreamingListener` interface.\nThat way we can visually explore the execution lifecycle of this Spark Streaming Job" 174 | }, { 175 | "metadata" : { 176 | "trusted" : true, 177 | "input_collapsed" : false, 178 | "collapsed" : true, 179 | "id" : "4D17FFA32D9049A58CB55FD17BDD4755" 180 | }, 181 | "cell_type" : "code", 182 | "source" : [ "import org.apache.spark.streaming.scheduler._\n", "import scala.collection.immutable.Queue\n", "class NotebookTableStreamingListener() extends StreamingListener {\n", " case class 
TableEntry(timestamp: Long, operation: String, target: String, duration: Option[Long])\n", " object TableEntry {\n", " def now() = System.currentTimeMillis()\n", " def apply(operation: String, target: String, duration: Option[Long] = None): TableEntry = {\n", " this(now(), operation, target, duration)\n", " }\n", " }\n", " val dummyEntry = Seq(TableEntry(\"-\",\"-\"))\n", " val table = new notebook.front.widgets.charts.TableChart[Seq[TableEntry]](dummyEntry)\n", " var entries: List[TableEntry] = List() \n", " val EventLimit = 40\n", " def add(tableEntry: TableEntry) = {\n", " entries = (tableEntry :: entries).take(EventLimit)\n", " table.applyOn(entries)\n", " }\n", " \n", " def batchName(batchInfo: BatchInfo):String = {\n", " \"batch-\" + batchInfo.batchTime\n", " }\n", " def shortOutputOperationDescription(outOp: OutputOperationInfo) : String = {\n", " outOp.description.split(\"\\n\").headOption.getOrElse(\"-\")\n", " }\n", " \n", " /** Called when the streaming has been started */\n", " override def onStreamingStarted(streamingStarted: StreamingListenerStreamingStarted): Unit = {\n", " add(TableEntry(\"stream started\", \"-\"))\n", " }\n", "\n", " /** Called when a receiver has been started */\n", " override def onReceiverStarted(receiverStarted: StreamingListenerReceiverStarted): Unit = {\n", " add(TableEntry(\"receiver started\", receiverStarted.receiverInfo.name))\n", " }\n", "\n", " /** Called when a receiver has reported an error */\n", " override def onReceiverError(receiverError: StreamingListenerReceiverError): Unit = {\n", " add(TableEntry(\"receiver error\", receiverError.receiverInfo.lastError))\n", " }\n", "\n", " /** Called when a receiver has been stopped */\n", " override def onReceiverStopped(receiverStopped: StreamingListenerReceiverStopped) = {\n", " add(TableEntry(\"receiver stopped\", receiverStopped.receiverInfo.name))\n", " }\n", "\n", " /** Called when a batch of jobs has been submitted for processing. */\n", " override def onBatchSubmitted(batchSubmitted: StreamingListenerBatchSubmitted) = {\n", " add(TableEntry(\"batch submitted\", batchName(batchSubmitted.batchInfo)))\n", " }\n", "\n", " /** Called when processing of a batch of jobs has started. */\n", " override def onBatchStarted(batchStarted: StreamingListenerBatchStarted): Unit = {\n", " add(TableEntry(\"batch started\", batchName(batchStarted.batchInfo)))\n", " }\n", "\n", " /** Called when processing of a batch of jobs has completed. */\n", " override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {\n", " add(TableEntry(\"batch completed\", batchName(batchCompleted.batchInfo), batchCompleted.batchInfo.totalDelay))\n", " }\n", "\n", " /** Called when processing of a job of a batch has started. */\n", " override def onOutputOperationStarted(outputOperationStarted: StreamingListenerOutputOperationStarted): Unit = {\n", " add(TableEntry(\"output operation started\", shortOutputOperationDescription(outputOperationStarted.outputOperationInfo)))\n", " }\n", "\n", " /** Called when processing of a job of a batch has completed. 
*/\n", " override def onOutputOperationCompleted(outputOperationCompleted: StreamingListenerOutputOperationCompleted): Unit = {\n", " add(TableEntry(\"output operation completed\", shortOutputOperationDescription(outputOperationCompleted.outputOperationInfo), outputOperationCompleted.outputOperationInfo.duration))\n", " }\n", "\n", "}\n" ], 183 | "outputs" : [ ] 184 | }, { 185 | "metadata" : { 186 | "id" : "85D8972DB3584F748E467AAEF68ED742" 187 | }, 188 | "cell_type" : "markdown", 189 | "source" : "## Create an instance of the listener that we just defined" 190 | }, { 191 | "metadata" : { 192 | "trusted" : true, 193 | "input_collapsed" : false, 194 | "collapsed" : true, 195 | "id" : "841F9D154A9841D7AC4846CE84100E41" 196 | }, 197 | "cell_type" : "code", 198 | "source" : [ "val customTableListener = new NotebookTableStreamingListener()" ], 199 | "outputs" : [ ] 200 | }, { 201 | "metadata" : { 202 | "id" : "256641FD99D245678304F0BA2EEFCC18" 203 | }, 204 | "cell_type" : "markdown", 205 | "source" : "## Add the listener to the streaming context so that it can receive callbacks from the different lifecycle points" 206 | }, { 207 | "metadata" : { 208 | "trusted" : true, 209 | "input_collapsed" : false, 210 | "collapsed" : true, 211 | "id" : "E6285DF57D0F44F4804924505C29A0E7" 212 | }, 213 | "cell_type" : "code", 214 | "source" : [ "streamingContext.addStreamingListener(customTableListener)" ], 215 | "outputs" : [ ] 216 | }, { 217 | "metadata" : { 218 | "id" : "8F6454492ABD4E81A5BB81E2FDB8BB7B" 219 | }, 220 | "cell_type" : "markdown", 221 | "source" : "## We add the table widget to our notebook to render it" 222 | }, { 223 | "metadata" : { 224 | "trusted" : true, 225 | "input_collapsed" : false, 226 | "collapsed" : true, 227 | "id" : "CE2B2D0436CB45B08D3DAF05C1A6AFCD" 228 | }, 229 | "cell_type" : "code", 230 | "source" : [ "customTableListener.table" ], 231 | "outputs" : [ ] 232 | }, { 233 | "metadata" : { 234 | "id" : "6511488CAD4041658603F0CF485BC548" 235 | }, 236 | "cell_type" : "markdown", 237 | "source" : "## Start the streaming context so that the streaming process can start.\nWatch the table for updates with data about the execution of our streaming context." 
238 | }, { 239 | "metadata" : { 240 | "trusted" : true, 241 | "input_collapsed" : false, 242 | "collapsed" : true, 243 | "id" : "F366201F2275412F818532AB671A55BC" 244 | }, 245 | "cell_type" : "code", 246 | "source" : [ "streamingContext.start()" ], 247 | "outputs" : [ ] 248 | }, { 249 | "metadata" : { 250 | "trusted" : true, 251 | "input_collapsed" : false, 252 | "collapsed" : true, 253 | "id" : "B6F0075E9BB04467858CABAA000489EF" 254 | }, 255 | "cell_type" : "code", 256 | "source" : [ "// Be careful not to stop the context if you want the streaming process to continue\n", "streamingContext.stop(false)" ], 257 | "outputs" : [ ] 258 | } ], 259 | "nbformat" : 4 260 | } -------------------------------------------------------------------------------- /chapter-27/counting-unique-users.snb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "id": "49b7a4a3-bec7-4bd4-a926-fef53506cf9b", 4 | "name": "CountingUniqueIds", 5 | "user_save_timestamp": "1970-01-01T01:00:00.000Z", 6 | "auto_save_timestamp": "1970-01-01T01:00:00.000Z", 7 | "language_info": { 8 | "name": "scala", 9 | "file_extension": "scala", 10 | "codemirror_mode": "text/x-scala" 11 | }, 12 | "trusted": true, 13 | "sparkNotebook": null, 14 | "customLocalRepo": "/home/maasg/.ivy2/local", 15 | "customRepos": null, 16 | "customDeps": [ 17 | "learning.spark.streaming %% hllaccumulator % 0.1.1-SNAPSHOT" 18 | ], 19 | "customImports": null, 20 | "customArgs": null, 21 | "customSparkConf": null, 22 | "customVars": null 23 | }, 24 | "cells": [ 25 | { 26 | "metadata": { 27 | "id": "9D292F715E104E7B88D7817FC94C7E1B" 28 | }, 29 | "cell_type": "markdown", 30 | "source": "#Approximate Counting Unique Users in a Stream\nIn this notebook we are going to explore the use of a probabilistic data structure, HyperLogLog, to approximate the count of unique visitors in a website.\n\nFor that purpose, we are going look at the logs of an imaginary popular weblog.\nIn this job, the core logic is to track which articles are popular, by maintaing a register of the clicks received by URL and partitioned by path.\n To keep track of the unique visitors, we will make use of accumulators, which are maintained as a parallel channel to the main streaming logic. 
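// For comparison, a sketch using Spark SQL's built-in approximate distinct count, which is
// based on the same HyperLogLog idea; the accumulator used in this notebook serves the same
// purpose but lives on a side channel, outside the main stream transformations.
import org.apache.spark.sql.functions.{approx_count_distinct, countDistinct}
val demoUsers = sparkSession.range(0, 100000)
  .selectExpr("concat('user', cast(id % 25000 as string)) as userId")
demoUsers.agg(countDistinct("userId"), approx_count_distinct("userId", 0.02)).show()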
" 31 | }, 32 | { 33 | "metadata": { 34 | "id": "448179780A5F48A4922403BD69B5510D" 35 | }, 36 | "cell_type": "markdown", 37 | "source": "__Note that, to run this notebook, the project \"hll-accumulator\" needs to be published in the local system.\nRefer to [hll-accumulator](https://github.com/LearningSparkStreaming/HLLAccumulator)__" 38 | }, 39 | { 40 | "metadata": { 41 | "trusted": true, 42 | "input_collapsed": false, 43 | "collapsed": false, 44 | "id": "AFA92D14BAD64A768C419526E46EA888" 45 | }, 46 | "cell_type": "code", 47 | "source": "// These are the main- and sub- categories for our blog site\nval categories = Map(\"health\" -> List(\"sleep\",\"training\",\"diet\",\"meditation\"), \n \"science\" -> List(\"physics\", \"mechanics\", \"nature\", \"ecology\"), \n \"diy\" -> List(\"home_improvement\", \"robotics\", \"drones\", \"home_automation\")\n )\n// All topics merged\nval topics = (for {\n mainCategory <- categories.keys\n subCategory <- categories(mainCategory)\n} yield (s\"$mainCategory/$subCategory\")).toList\n", 48 | "outputs": [] 49 | }, 50 | { 51 | "metadata": { 52 | "id": "9046E37AF5BF443A88FEC82578163283" 53 | }, 54 | "cell_type": "markdown", 55 | "source": "## A Simplified Domain Model\nFor this example we are going to obviate ancilliary aspects to log processing, such as user agent, ip-address, HTTP/codes, etc... Our simplified model will consist of `timestamp, userId, pathVisited`\n\nThis \"stream setup\" code can be mostly only glanced over as its only function is to generate a weighted-randomized stream of user clicks. In a \"real-life\" scenario we would get this data from a production web server." 56 | }, 57 | { 58 | "metadata": { 59 | "trusted": true, 60 | "input_collapsed": false, 61 | "collapsed": false, 62 | "id": "907E46E518F749898C2A307B1C03A4FD" 63 | }, 64 | "cell_type": "code", 65 | "source": "case class BlogHit(timestamp: Long, userId: String, path: String)\n", 66 | "outputs": [] 67 | }, 68 | { 69 | "metadata": { 70 | "id": "0BB085CC5B434B3C856D85BED7AD84F5" 71 | }, 72 | "cell_type": "markdown", 73 | "source": "We are going to generate a series of random articles, some of which will become very popular" 74 | }, 75 | { 76 | "metadata": { 77 | "trusted": true, 78 | "input_collapsed": false, 79 | "collapsed": false, 80 | "id": "B80C865329864BD686C72A97898C8355" 81 | }, 82 | "cell_type": "code", 83 | "source": "import scala.util.Random\nval generatedContent: List[(String, Double)] = {\n val qualifiers = \"\"\"good, new, first, last, long, great, little, own, other, old, \n right, big, high, different, small, large, next, early, young, important, few, public, bad, same, able\"\"\".split(\",\").map(_.trim)\n val themas = \"\"\"strawberry, screw, rice, chocolate, lettuce, sleep, wood, frame, electricity, discharge, voltage, shock, distress, \n anxiety, processor, memory\"\"\".split(\",\").map(_.trim)\n val titles = for {\n qualifier <- qualifiers\n thema <- themas\n } yield s\"${qualifier}_${thema}\"\n\n val paths = for {\n topic <- topics\n title <- titles\n popularity <- Some(Random.nextDouble)\n _ <- if (Random.nextDouble() > 0.98) Some(()) else None // only 2% chance of generating this combination\n } yield (s\"$topic/$title\",popularity) \n paths\n}\n", 84 | "outputs": [] 85 | }, 86 | { 87 | "metadata": { 88 | "id": "50EA138D2A834EB5860957721CD48565" 89 | }, 90 | "cell_type": "markdown", 91 | "source": "We will simulate a user visit with a randomized event. 
Some user with a userId visits some content if a random generated value is within the \"popularity\" \nindex, which is also represented with a probability `0<= x <=1`" 92 | }, 93 | { 94 | "metadata": { 95 | "trusted": true, 96 | "input_collapsed": false, 97 | "collapsed": false, 98 | "presentation": { 99 | "tabs_state": "{\n \"tab_id\": \"#tab711025873-0\"\n}", 100 | "pivot_chart_state": "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [],\n \"rows\": [],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}" 101 | }, 102 | "id": "826129FAD2764D36AF51FE934902E807" 103 | }, 104 | "cell_type": "code", 105 | "source": "val userVisit: () => Option[BlogHit] = () => {\n val generatedContextSize = generatedContent.size\n for {\n userId <- Some(\"user\"+Random.nextInt(100000))\n (content, popularity) <- generatedContent.lift(Random.nextInt(generatedContextSize))\n chance = Random.nextDouble\n _ <- if (chance < popularity) Some(()) else None // weight popularity of the content \n } yield {\n BlogHit(System.currentTimeMillis, userId, content)\n }\n}\n ", 106 | "outputs": [] 107 | }, 108 | { 109 | "metadata": { 110 | "trusted": true, 111 | "input_collapsed": false, 112 | "collapsed": false, 113 | "id": "74A3FBA2502D406DBE989FEF933CABDA" 114 | }, 115 | "cell_type": "code", 116 | "source": "val visitRDD = sparkContext.parallelize(List.fill(100)(userVisit))\n", 117 | "outputs": [] 118 | }, 119 | { 120 | "metadata": { 121 | "id": "AC3352665869439FACA47804E169C25A" 122 | }, 123 | "cell_type": "markdown", 124 | "source": "## Create a stream of simulated data \n" 125 | }, 126 | { 127 | "metadata": { 128 | "trusted": true, 129 | "input_collapsed": false, 130 | "collapsed": false, 131 | "id": "4ABC1788747542FB9CCEBDBBCD1B517E" 132 | }, 133 | "cell_type": "code", 134 | "source": "import org.apache.spark.streaming.{StreamingContext, Seconds, Minutes}\n@transient val ssc = new StreamingContext(sparkContext, Seconds(2))", 135 | "outputs": [] 136 | }, 137 | { 138 | "metadata": { 139 | "id": "1A256D51684C49A8BB7D5961504C348C" 140 | }, 141 | "cell_type": "markdown", 142 | "source": "Using the click generation function we defined above, on each interval we are going to generate a series of \"clicks\" on content by some users. The clicks are weighted by popularity so that popular content gets more visits." 
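// A small sketch that materialises a handful of simulated visits directly, to see the shape
// of the events before wiring them into a DStream (userVisit and BlogHit come from the cells
// above; visits filtered out by the popularity weighting simply yield no element).
val sampleVisits = List.fill(20)(userVisit()).flatten
sampleVisits.take(5).foreach(println)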
143 | }, 144 | { 145 | "metadata": { 146 | "trusted": true, 147 | "input_collapsed": false, 148 | "collapsed": false, 149 | "id": "2A67DD8D44344414B920CEBA40DECD5C" 150 | }, 151 | "cell_type": "code", 152 | "source": "import org.apache.spark.streaming.dstream.ConstantInputDStream\n@transient val clickStream = new ConstantInputDStream(ssc, visitRDD).flatMap(e=>e())", 153 | "outputs": [] 154 | }, 155 | { 156 | "metadata": { 157 | "trusted": true, 158 | "input_collapsed": false, 159 | "collapsed": false, 160 | "id": "8751ECA71C7F45298494986D64917B35" 161 | }, 162 | "cell_type": "code", 163 | "source": "@transient val topUrlsChart = CustomPlotlyChart(Seq[(String,Int)](), \n dataOptions=\"{type: 'bar'}\",\n dataSources=\"{x: '_1', y: '_2'}\")\n@transient val uniqueUsersChart = CustomPlotlyChart(Seq[(Long, Long)](), \n dataOptions=\"{type: 'line'}\",\n dataSources=\"{x: '_1', y: '_2'}\")\n", 164 | "outputs": [] 165 | }, 166 | { 167 | "metadata": { 168 | "id": "9A78424A92A74DCFB336CBAF570A4C2B" 169 | }, 170 | "cell_type": "markdown", 171 | "source": "# Define the unique visitors accumulator\n" 172 | }, 173 | { 174 | "metadata": { 175 | "trusted": true, 176 | "input_collapsed": false, 177 | "collapsed": false, 178 | "id": "33EF4283CC73440C8679B6343383EF39" 179 | }, 180 | "cell_type": "code", 181 | "source": "import learning.spark.streaming.HLLAccumulator\nval uniqueVisitorsAccumulator = new HLLAccumulator[String]()\nsc.register(uniqueVisitorsAccumulator, \"unique-visitors\")", 182 | "outputs": [] 183 | }, 184 | { 185 | "metadata": { 186 | "id": "F5B5D311DB8641E988ADBD30737E68DD" 187 | }, 188 | "cell_type": "markdown", 189 | "source": "# Windowed Trends\n\nWe will use the knowledge we recently acquired about sliding windows to create a view of recent trends of our website traffic.\nBefore decomposing the Click info into a URL count, we will also add the userId to our accumulator to register the userId of the click." 
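The cells that follow compute the windowed URL counts with the full-recomputation form of `reduceByKeyAndWindow`. As a side note, a sketch of the invertible variant (not part of the original notebook, and requiring a checkpoint directory, here a hypothetical path) shows how the same window could be maintained incrementally:

```scala
// Sketch only: the invertible form of reduceByKeyAndWindow. Instead of re-reducing the whole
// 5-minute window every 2 seconds, Spark adds the counts of the batch entering the window and
// subtracts the counts of the batch leaving it. This variant requires checkpointing.
ssc.checkpoint("/tmp/streaming-with-spark/unique-users-checkpoint") // hypothetical path

@transient val incrementalTrendStream = clickStream
  .map { case BlogHit(ts, user, url) => (url, 1) }
  .reduceByKeyAndWindow(
    (agg: Int, count: Int) => agg + count, // data entering the window
    (agg: Int, count: Int) => agg - count, // data leaving the window
    Minutes(5),
    Seconds(2))
```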
190 | }, 191 | { 192 | "metadata": { 193 | "trusted": true, 194 | "input_collapsed": false, 195 | "collapsed": false, 196 | "id": "11D7AC5D84FE4578AD30F21D181B030A" 197 | }, 198 | "cell_type": "code", 199 | "source": "clickStream.foreachRDD{rdd => \n rdd.foreach{case BlogHit(ts, user, url) => uniqueVisitorsAccumulator.add(user)}\n val currentTime = (System.currentTimeMillis / 1000) % (60*60*24) // get the hour part of the current timestamp in seconds\n val currentUniqueVisitors = uniqueVisitorsAccumulator.value\n uniqueUsersChart.addAndApply(Seq((currentTime, currentUniqueVisitors)))\n }", 200 | "outputs": [] 201 | }, 202 | { 203 | "metadata": { 204 | "trusted": true, 205 | "input_collapsed": false, 206 | "collapsed": false, 207 | "id": "B42F6DD790644EB09ACEC4D4188429A3" 208 | }, 209 | "cell_type": "code", 210 | "source": "@transient val trendStream = clickStream.map{case BlogHit(ts, user, url) => (url,1)}\n .reduceByKeyAndWindow((count:Int, agg:Int) => count + agg, Minutes(5), Seconds(2)) ", 211 | "outputs": [] 212 | }, 213 | { 214 | "metadata": { 215 | "trusted": true, 216 | "input_collapsed": false, 217 | "collapsed": false, 218 | "id": "ABFFE3FC45404A398252A46A1DBEC3D5" 219 | }, 220 | "cell_type": "code", 221 | "source": "trendStream.foreachRDD{rdd => \n val top10URLs = rdd.map(_.swap).sortByKey(ascending = false).take(10).map(_.swap)\n topUrlsChart.applyOn(top10URLs)\n }", 222 | "outputs": [] 223 | }, 224 | { 225 | "metadata": { 226 | "id": "D5BF21497EC3462D81668776014C6671" 227 | }, 228 | "cell_type": "markdown", 229 | "source": "## Top URLs Chart\n(charts will update once the Streaming job is started)" 230 | }, 231 | { 232 | "metadata": { 233 | "trusted": true, 234 | "input_collapsed": false, 235 | "collapsed": false, 236 | "id": "1B54BDC1D9734536B61540F7275B52EE" 237 | }, 238 | "cell_type": "code", 239 | "source": "topUrlsChart", 240 | "outputs": [] 241 | }, 242 | { 243 | "metadata": { 244 | "id": "8707383E034B4D9D9155BD6301E2C6A4" 245 | }, 246 | "cell_type": "markdown", 247 | "source": "## Unique Users Chart\n(charts will update once the Streaming job is started)" 248 | }, 249 | { 250 | "metadata": { 251 | "trusted": true, 252 | "input_collapsed": false, 253 | "collapsed": false, 254 | "id": "7ED5D91408CE4B4883FC1F1659C68206" 255 | }, 256 | "cell_type": "code", 257 | "source": "uniqueUsersChart", 258 | "outputs": [] 259 | }, 260 | { 261 | "metadata": { 262 | "trusted": true, 263 | "input_collapsed": false, 264 | "collapsed": false, 265 | "id": "0B654D4F097E4D9E89FEA1582F775ABE" 266 | }, 267 | "cell_type": "code", 268 | "source": "ssc.start()", 269 | "outputs": [] 270 | }, 271 | { 272 | "metadata": { 273 | "trusted": true, 274 | "input_collapsed": false, 275 | "collapsed": false, 276 | "id": "C3837F2D182C4F8C840126C7C33D2C5E" 277 | }, 278 | "cell_type": "code", 279 | "source": "// only stop the content after we are done exploring the data stream\n// ssc.stop(false)", 280 | "outputs": [ 281 | { 282 | "metadata": {}, 283 | "data": { 284 | "text/html": "" 285 | }, 286 | "output_type": "execute_result", 287 | "execution_count": 109, 288 | "time": "Took: 0.622s, at 2017-07-3 03:01" 289 | } 290 | ] 291 | }, 292 | { 293 | "metadata": { 294 | "trusted": true, 295 | "input_collapsed": true, 296 | "collapsed": true, 297 | "id": "9916A8DD876E45379869BFFD5FED5D19" 298 | }, 299 | "cell_type": "code", 300 | "source": "", 301 | "outputs": [] 302 | } 303 | ], 304 | "nbformat": 4 305 | } -------------------------------------------------------------------------------- 
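The `HLLAccumulator` used in the notebook above comes from the separately published `hll-accumulator` project. As a rough, exact-count stand-in (an illustration of the `AccumulatorV2` contract only, not the real HyperLogLog implementation), a distinct-user accumulator could be sketched as follows:

```scala
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

// Exact-count stand-in for the HLLAccumulator: add() runs on the executors, merge() combines
// the partial results on the driver. The real project replaces the mutable.Set with a
// HyperLogLog sketch so that memory stays bounded no matter how many distinct users are seen.
class UniqueCountAccumulator extends AccumulatorV2[String, Long] {
  private val seen = mutable.Set.empty[String]

  override def isZero: Boolean = seen.isEmpty
  override def copy(): UniqueCountAccumulator = {
    val acc = new UniqueCountAccumulator
    acc.seen ++= seen
    acc
  }
  override def reset(): Unit = seen.clear()
  override def add(v: String): Unit = seen += v
  override def merge(other: AccumulatorV2[String, Long]): Unit = other match {
    case o: UniqueCountAccumulator => seen ++= o.seen
    case _ => throw new UnsupportedOperationException("Cannot merge with a different accumulator type")
  }
  override def value: Long = seen.size.toLong
}

// Usage mirrors the notebook: register once on the driver, add from within foreachRDD.
// val acc = new UniqueCountAccumulator
// sc.register(acc, "unique-visitors-exact")
```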
/chapter-7/Structured-Streaming-in-action.snb.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata" : { 3 | "id" : "75765162-2d2f-4070-b5b4-9818cac60853", 4 | "name" : "Structured-Streaming-in-action", 5 | "user_save_timestamp" : "1970-01-01T01:00:00.000Z", 6 | "auto_save_timestamp" : "1970-01-01T01:00:00.000Z", 7 | "language_info" : { 8 | "name" : "scala", 9 | "file_extension" : "scala", 10 | "codemirror_mode" : "text/x-scala" 11 | }, 12 | "trusted" : true, 13 | "sparkNotebook" : null, 14 | "customLocalRepo" : null, 15 | "customRepos" : null, 16 | "customDeps" : [ "org.apache.spark %% spark-sql-kafka-0-10 % 2.3.0" ], 17 | "customImports" : null, 18 | "customArgs" : null, 19 | "customSparkConf" : null, 20 | "customVars" : null 21 | }, 22 | "cells" : [ { 23 | "metadata" : { 24 | "id" : "6E2995E02B244E978E1327B7A60484F0" 25 | }, 26 | "cell_type" : "markdown", 27 | "source" : "# Structured Streaming - Kafka Example\n\nThe intention of this example is to explore the main aspects of the Structured Streaming API.\n\n - We will the Kafka `source` to consume the `iot-data` topic.\n - We use a file `sink` to store the data into a _Parquet_ file." 28 | }, { 29 | "metadata" : { 30 | "trusted" : true, 31 | "input_collapsed" : false, 32 | "collapsed" : false, 33 | "id" : "3B783C2DA5A9409E85DF2CE7F061AECB" 34 | }, 35 | "cell_type" : "code", 36 | "source" : [ "import java.io.File\n", "val topic = \"iot-data\"\n", "val workDir = \"/tmp/streaming-with-spark\"\n", "val referenceFile = \"sensor-records.parquet\"\n", "val targetFile = \"structured_enrichedIoTStream.parquet\"\n", "val targetPath = new File(workDir, targetFile).getAbsolutePath\n", "val unknownSensorsTargetFile = \"unknownSensorsStream.parquet\"\n", "val unknownSensorsTargetPath = new File(workDir, unknownSensorsTargetFile).getAbsolutePath\n", "val kafkaBootstrapServer = \"127.0.0.1:9092\"" ], 37 | "outputs" : [ { 38 | "name" : "stdout", 39 | "output_type" : "stream", 40 | "text" : "import java.io.File\ntopic: String = iot-data\nworkDir: String = /tmp/streaming-with-spark\nreferenceFile: String = sensor-records.parquet\ntargetFile: String = structured_enrichedIoTStream.parquet\ntargetPath: String = /tmp/streaming-with-spark/structured_enrichedIoTStream.parquet\nunknownSensorsTargetFile: String = unknownSensorsStream.parquet\nunknownSensorsTargetPath: String = /tmp/streaming-with-spark/unknownSensorsStream.parquet\nkafkaBootstrapServer: String = 127.0.0.1:9092\n" 41 | }, { 42 | "metadata" : { }, 43 | "data" : { 44 | "text/html" : "" 45 | }, 46 | "output_type" : "execute_result", 47 | "execution_count" : 1, 48 | "time" : "Took: 1.076s, at 2019-03-02 20:46" 49 | } ] 50 | }, { 51 | "metadata" : { 52 | "trusted" : true, 53 | "input_collapsed" : false, 54 | "collapsed" : false, 55 | "id" : "3225F1A3939642B087F1BE6A37EB9D03" 56 | }, 57 | "cell_type" : "code", 58 | "source" : [ "val rawData = sparkSession.readStream\n", " .format(\"kafka\")\n", " .option(\"kafka.bootstrap.servers\", kafkaBootstrapServer)\n", " .option(\"subscribe\", topic)\n", " .option(\"enable.auto.commit\", true)\n", " .option(\"group.id\", \"iot-data-consumer\")\n", " .option(\"startingOffsets\", \"earliest\")\n", " .load()" ], 59 | "outputs" : [ { 60 | "name" : "stdout", 61 | "output_type" : "stream", 62 | "text" : "rawData: org.apache.spark.sql.DataFrame = [key: binary, value: binary ... 
5 more fields]\n" 63 | }, { 64 | "metadata" : { }, 65 | "data" : { 66 | "text/html" : "" 67 | }, 68 | "output_type" : "execute_result", 69 | "execution_count" : 3, 70 | "time" : "Took: 0.972s, at 2017-08-09 18:00" 71 | } ] 72 | }, { 73 | "metadata" : { 74 | "trusted" : true, 75 | "input_collapsed" : false, 76 | "collapsed" : false, 77 | "id" : "497CEEFAB7DF40F884CC7A8139C3DA5F" 78 | }, 79 | "cell_type" : "code", 80 | "source" : [ "rawData.isStreaming" ], 81 | "outputs" : [ { 82 | "name" : "stdout", 83 | "output_type" : "stream", 84 | "text" : "res7: Boolean = true\n" 85 | }, { 86 | "metadata" : { }, 87 | "data" : { 88 | "text/html" : "true" 89 | }, 90 | "output_type" : "execute_result", 91 | "execution_count" : 5, 92 | "time" : "Took: 0.841s, at 2017-08-09 18:00" 93 | } ] 94 | }, { 95 | "metadata" : { 96 | "trusted" : true, 97 | "input_collapsed" : false, 98 | "collapsed" : false, 99 | "id" : "A74BF086DCC240168F21E57797088678" 100 | }, 101 | "cell_type" : "code", 102 | "source" : [ "rawData.printSchema()" ], 103 | "outputs" : [ { 104 | "name" : "stdout", 105 | "output_type" : "stream", 106 | "text" : "root\n |-- key: binary (nullable = true)\n |-- value: binary (nullable = true)\n |-- topic: string (nullable = true)\n |-- partition: integer (nullable = true)\n |-- offset: long (nullable = true)\n |-- timestamp: timestamp (nullable = true)\n |-- timestampType: integer (nullable = true)\n\n" 107 | }, { 108 | "metadata" : { }, 109 | "data" : { 110 | "text/html" : "" 111 | }, 112 | "output_type" : "execute_result", 113 | "execution_count" : 9, 114 | "time" : "Took: 0.764s, at 2017-08-09 14:33" 115 | } ] 116 | }, { 117 | "metadata" : { 118 | "trusted" : true, 119 | "input_collapsed" : false, 120 | "collapsed" : false, 121 | "id" : "A39D7FB1A9AC496F8DFC7502EB0A4C29" 122 | }, 123 | "cell_type" : "code", 124 | "source" : [ "case class SensorData(sensorId: Int, timestamp: Long, value: Double)" ], 125 | "outputs" : [ { 126 | "name" : "stdout", 127 | "output_type" : "stream", 128 | "text" : "defined class SensorData\n" 129 | }, { 130 | "metadata" : { }, 131 | "data" : { 132 | "text/html" : "" 133 | }, 134 | "output_type" : "execute_result", 135 | "execution_count" : 10, 136 | "time" : "Took: 0.541s, at 2017-08-09 14:33" 137 | } ] 138 | }, { 139 | "metadata" : { 140 | "trusted" : true, 141 | "input_collapsed" : false, 142 | "collapsed" : false, 143 | "id" : "6E729EE3495E4AA88FC9E347BDEE3210" 144 | }, 145 | "cell_type" : "code", 146 | "source" : [ "val iotData = rawData.select(\"value\").as[String].map{r =>\n", " val Array(id, timestamp, value) = r.split(\",\")\n", " SensorData(id.toInt, timestamp.toLong, value.toDouble)\n", "}" ], 147 | "outputs" : [ { 148 | "name" : "stdout", 149 | "output_type" : "stream", 150 | "text" : "iotData: org.apache.spark.sql.Dataset[SensorData] = [sensorId: int, timestamp: bigint ... 
1 more field]\n" 151 | }, { 152 | "metadata" : { }, 153 | "data" : { 154 | "text/html" : "" 155 | }, 156 | "output_type" : "execute_result", 157 | "execution_count" : 11, 158 | "time" : "Took: 0.799s, at 2017-08-09 14:33" 159 | } ] 160 | }, { 161 | "metadata" : { 162 | "id" : "CD174F6E3EF1475AA51899B5931B7415" 163 | }, 164 | "cell_type" : "markdown", 165 | "source" : "## Load the reference data from a parquet file¶\nWe also cache the data to keep it in memory and improve the performance of our steaming application" 166 | }, { 167 | "metadata" : { 168 | "trusted" : true, 169 | "input_collapsed" : false, 170 | "collapsed" : false, 171 | "id" : "0314D4CF7E88448EB2A5BDBED5839282" 172 | }, 173 | "cell_type" : "code", 174 | "source" : [ "val sensorRef = sparkSession.read.parquet(s\"$workDir/$referenceFile\")\n", "sensorRef.cache()" ], 175 | "outputs" : [ { 176 | "name" : "stdout", 177 | "output_type" : "stream", 178 | "text" : "org.apache.spark.sql.AnalysisException: Path does not exist: file:/tmp/learningsparkstreaming/sensor-records.parquet;\n at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:360)\n at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:348)\n at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n at scala.collection.immutable.List.foreach(List.scala:381)\n at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n at scala.collection.immutable.List.flatMap(List.scala:344)\n at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:348)\n at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)\n at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:559)\n at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:543)\n ... 
63 elided\n" 179 | } ] 180 | }, { 181 | "metadata" : { 182 | "trusted" : true, 183 | "input_collapsed" : false, 184 | "collapsed" : false, 185 | "id" : "967DA9CD00034B93A9F9077ACCABBF69" 186 | }, 187 | "cell_type" : "code", 188 | "source" : [ "val query = iotData.writeStream\n", " .outputMode(\"append\")\n", " .format(\"parquet\")\n", " .option(\"path\", workDir)\n", " .option(\"checkpointLocation\", \"/tmp/checkpoint\")\n", " .start()" ], 189 | "outputs" : [ { 190 | "name" : "stdout", 191 | "output_type" : "stream", 192 | "text" : "queryDef: org.apache.spark.sql.streaming.DataStreamWriter[SensorData] = org.apache.spark.sql.streaming.DataStreamWriter@58df37bb\n" 193 | }, { 194 | "metadata" : { }, 195 | "data" : { 196 | "text/html" : "" 197 | }, 198 | "output_type" : "execute_result", 199 | "execution_count" : 27, 200 | "time" : "Took: 0.430s, at 2017-08-09 13:58" 201 | } ] 202 | }, { 203 | "metadata" : { 204 | "trusted" : true, 205 | "input_collapsed" : false, 206 | "collapsed" : false, 207 | "presentation" : { 208 | "tabs_state" : "{\n \"tab_id\": \"#tab636027600-0\"\n}", 209 | "pivot_chart_state" : "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [],\n \"rows\": [],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}" 210 | }, 211 | "id" : "C60D42C6AA7C43108AF52A4EEA7B407E" 212 | }, 213 | "cell_type" : "code", 214 | "source" : [ "\n", "query.recentProgress" ], 215 | "outputs" : [ { 216 | "name" : "stdout", 217 | "output_type" : "stream", 218 | "text" : "res38: Array[org.apache.spark.sql.streaming.StreamingQueryProgress] =\nArray({\n \"id\" : \"ce29c1eb-bebc-45cb-abd8-3e6437c16518\",\n \"runId\" : \"dcfd946a-097a-41c1-b810-f848baaddc10\",\n \"name\" : null,\n \"timestamp\" : \"2017-08-09T11:52:44.721Z\",\n \"numInputRows\" : 21421,\n \"processedRowsPerSecond\" : 56819.62864721486,\n \"durationMs\" : {\n \"addBatch\" : 236,\n \"getBatch\" : 4,\n \"getOffset\" : 109,\n \"queryPlanning\" : 2,\n \"triggerExecution\" : 377,\n \"walCommit\" : 13\n },\n \"stateOperators\" : [ ],\n \"sources\" : [ {\n \"description\" : \"KafkaSource[Subscribe[iot-data]]\",\n \"startOffset\" : {\n \"iot-data\" : {\n \"0\" : 3468807\n }\n },\n \"endOffset\" : {\n \"iot-data\" : {\n \"0\" : 3490228\n }\n },\n \"numInputRows\" : 21421,\n \"processedRowsPerSecond\" : 5..." 219 | }, { 220 | "metadata" : { }, 221 | "data" : { 222 | "text/html" : "
" 223 | }, 224 | "output_type" : "execute_result", 225 | "execution_count" : 26, 226 | "time" : "Took: 0.572s, at 2017-08-09 13:52" 227 | } ] 228 | }, { 229 | "metadata" : { 230 | "trusted" : true, 231 | "input_collapsed" : false, 232 | "collapsed" : true, 233 | "id" : "A8292D947D7E4A889B416F5C2BB158C7" 234 | }, 235 | "cell_type" : "code", 236 | "source" : [ "" ], 237 | "outputs" : [ ] 238 | } ], 239 | "nbformat" : 4 240 | } -------------------------------------------------------------------------------- /chapter-7/batch_weblogs.snb.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata" : { 3 | "id" : "c8749075-25ae-48aa-a102-495a26a92d54", 4 | "name" : "batch_weblogs", 5 | "user_save_timestamp" : "1970-01-01T01:00:00.000Z", 6 | "auto_save_timestamp" : "1970-01-01T01:00:00.000Z", 7 | "language_info" : { 8 | "name" : "scala", 9 | "file_extension" : "scala", 10 | "codemirror_mode" : "text/x-scala" 11 | }, 12 | "trusted" : true, 13 | "sparkNotebook" : null, 14 | "customLocalRepo" : null, 15 | "customRepos" : null, 16 | "customDeps" : null, 17 | "customImports" : null, 18 | "customArgs" : null, 19 | "customSparkConf" : null, 20 | "customVars" : null 21 | }, 22 | "cells" : [ { 23 | "metadata" : { 24 | "id" : "A7ECD7E2EBF047A283DBADDCA0BB55E1" 25 | }, 26 | "cell_type" : "markdown", 27 | "source" : "# Simple Weblog Analytics - The Batch Way\nIn this notebook, we are going to quickly visit a batch process of a series of weblog files to obtain the top trending pages per day." 28 | }, { 29 | "metadata" : { 30 | "trusted" : true, 31 | "input_collapsed" : false, 32 | "collapsed" : false, 33 | "id" : "4D776A6E557446488681874D1D710705" 34 | }, 35 | "cell_type" : "code", 36 | "source" : [ "// This is the location of the unpackaged files. Update accordingly\n", "// You can unpack the provided dataset with:\n", "// tar xvf datasets/NASA-weblogs/nasa_dataset_july_1995.tgz -C /tmp/data/\n", "val logsDirectory = \"/tmp/data/nasa_dataset_july_1995\"\n", "val rawLogs = sparkSession.read.json(logsDirectory)" ], 37 | "outputs" : [ ] 38 | }, { 39 | "metadata" : { 40 | "id" : "A60148A17B904A72A421640BB8E87474" 41 | }, 42 | "cell_type" : "markdown", 43 | "source" : "## We define a schema for the data in the logs\nFollowing the formal description of the dataset (at: [NASA-HTTP](http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html) ), the log is structured as follows:\n\n>The logs are an ASCII file with one line per request, with the following columns:\n- host making the request. A hostname when possible, otherwise the Internet address if the name could not be looked up.\n- timestamp in the format \"DAY MON DD HH:MM:SS YYYY\", where DAY is the day of the week, MON is the name of the month, DD is the day of the month, HH:MM:SS is the time of day using a 24-hour clock, and YYYY is the year. 
The timezone is -0400.\n- request given in quotes.\n- HTTP reply code.\n- bytes in the reply.\n\nThe dataset provided for this exercise offers this data in JSON format" 44 | }, { 45 | "metadata" : { 46 | "trusted" : true, 47 | "input_collapsed" : false, 48 | "collapsed" : false, 49 | "id" : "82F9E2879A1747EF8CC36B5E5D64E701" 50 | }, 51 | "cell_type" : "code", 52 | "source" : [ "import java.sql.Timestamp\n", "case class WebLog(host:String, \n", " timestamp: Timestamp, \n", " request: String, \n", " http_reply:Int, \n", " bytes: Long\n", " )" ], 53 | "outputs" : [ ] 54 | }, { 55 | "metadata" : { 56 | "id" : "CB0651C3AB3747EF9CB38477A6E59201" 57 | }, 58 | "cell_type" : "markdown", 59 | "source" : "## We convert the raw data to structured logs" 60 | }, { 61 | "metadata" : { 62 | "trusted" : true, 63 | "input_collapsed" : false, 64 | "collapsed" : false, 65 | "id" : "7AF0F566B58C49F195A3D1E4990BA3E6" 66 | }, 67 | "cell_type" : "code", 68 | "source" : [ "import org.apache.spark.sql.functions._\n", "import org.apache.spark.sql.types.IntegerType\n", "val preparedLogs = rawLogs.withColumn(\"http_reply\", $\"http_reply\".cast(IntegerType))" ], 69 | "outputs" : [ ] 70 | }, { 71 | "metadata" : { 72 | "trusted" : true, 73 | "input_collapsed" : false, 74 | "collapsed" : false, 75 | "id" : "833438CD8CA1471181AC509553B749C2" 76 | }, 77 | "cell_type" : "code", 78 | "source" : [ "val weblogs = preparedLogs.as[WebLog]" ], 79 | "outputs" : [ ] 80 | }, { 81 | "metadata" : { 82 | "trusted" : true, 83 | "input_collapsed" : false, 84 | "collapsed" : true, 85 | "id" : "9313FDE7CE7548B180AE2A47CDBED56E" 86 | }, 87 | "cell_type" : "markdown", 88 | "source" : "## Now, we have the data in a structured format and we can start asking the questions that interest us.\nWe have imported the data and transformed it using a known schema. We can use this 'structured' data to create queries that provide insights in the behavior of the users. " 89 | }, { 90 | "metadata" : { 91 | "id" : "144FB0F8C3AE42328435A2A8F3C150AB" 92 | }, 93 | "cell_type" : "markdown", 94 | "source" : "### As a first step, we would like to know how many records are contained in our dataset." 95 | }, { 96 | "metadata" : { 97 | "trusted" : true, 98 | "input_collapsed" : false, 99 | "collapsed" : false, 100 | "id" : "FBE9E30417204C9D83E8F1FE65AB6471" 101 | }, 102 | "cell_type" : "code", 103 | "source" : [ "val recordCount = weblogs.count" ], 104 | "outputs" : [ ] 105 | }, { 106 | "metadata" : { 107 | "id" : "BEA0D41E58324A479D2B91093864E6C7" 108 | }, 109 | "cell_type" : "markdown", 110 | "source" : "### A common question would be, what was the most popular URL per day?\nWe first reduce the timestamp to the day of the year. We then group by this new 'day of year' column and the request url and we count over this aggregate. We finally order using descending order to get this top URLs first." 
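The next cell expresses this aggregation with the Dataset API. As a side note (not part of the original notebook), the same ranking can be written in Spark SQL against a temporary view of the `weblogs` Dataset defined above:

```scala
// Equivalent Spark SQL formulation of the "top URLs per day of month" query below.
weblogs.createOrReplaceTempView("weblogs")

val topDailyURLsSql = sparkSession.sql("""
  SELECT dayofmonth(timestamp) AS dayOfMonth, request, count(*) AS count
  FROM weblogs
  GROUP BY dayofmonth(timestamp), request
  ORDER BY count DESC
""")
topDailyURLsSql.show(10, false)
```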
111 | }, { 112 | "metadata" : { 113 | "trusted" : true, 114 | "input_collapsed" : false, 115 | "collapsed" : false, 116 | "id" : "B65C7756F4EA47388884AB0DC4619CE8" 117 | }, 118 | "cell_type" : "code", 119 | "source" : [ "val topDailyURLs = weblogs.withColumn(\"dayOfMonth\", dayofmonth($\"timestamp\"))\n", " .select($\"request\", $\"dayOfMonth\")\n", " .groupBy($\"dayOfMonth\", $\"request\")\n", " .agg(count($\"request\").alias(\"count\"))\n", " .orderBy(desc(\"count\"))" ], 120 | "outputs" : [ ] 121 | }, { 122 | "metadata" : { 123 | "trusted" : true, 124 | "input_collapsed" : false, 125 | "collapsed" : false, 126 | "id" : "03E60D724B504A918E10873AEF0F0CF1" 127 | }, 128 | "cell_type" : "code", 129 | "source" : [ "topDailyURLs.show()" ], 130 | "outputs" : [ ] 131 | }, { 132 | "metadata" : { 133 | "trusted" : true, 134 | "input_collapsed" : false, 135 | "collapsed" : false, 136 | "presentation" : { 137 | "tabs_state" : "{\n \"tab_id\": \"#tab2096092733-0\"\n}", 138 | "pivot_chart_state" : "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [],\n \"rows\": [],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}" 139 | }, 140 | "id" : "F1CB4530E62149F08D39256D53BA1C07" 141 | }, 142 | "cell_type" : "code", 143 | "source" : [ "topDailyURLs.take(10)\n" ], 144 | "outputs" : [ ] 145 | }, { 146 | "metadata" : { 147 | "id" : "4F7AB40ADF1D4EFF9166CDB63A0DC77E" 148 | }, 149 | "cell_type" : "markdown", 150 | "source" : "### Top hits are all images. What now?\nIt's not unusual to see that the top URLs are images commonly used across a site.\n\nOur true interest lies in the content pages generating most traffic. To find those, we first filter on `html` content \nand then proceed to apply the top aggregation we just learned.\n\nThe request field is a quoted sequence of `[HTTP_VERB] URL [HTTP_VERSION]`. We will extract the url and preserve only those ending in `.html`, `.htm` or no extension (directories). This is a simplification for the purpose of the exercise. 
" 151 | }, { 152 | "metadata" : { 153 | "trusted" : true, 154 | "input_collapsed" : false, 155 | "collapsed" : false, 156 | "id" : "7191956637C34DF28D24D2947E6999A8" 157 | }, 158 | "cell_type" : "code", 159 | "source" : [ "val urlExtractor = \"\"\"^GET (.+) HTTP/\\d.\\d\"\"\".r\n", "val allowedExtensions = Set(\".html\",\".htm\", \"\")\n", "val contentPageLogs = weblogs.filter {log => \n", " log.request match { \n", " case urlExtractor(url) => \n", " val ext = url.takeRight(5).dropWhile(c => c != '.')\n", " allowedExtensions.contains(ext)\n", " case _ => false \n", " }\n", "}\n", " " ], 160 | "outputs" : [ ] 161 | }, { 162 | "metadata" : { 163 | "trusted" : true, 164 | "input_collapsed" : false, 165 | "collapsed" : false, 166 | "presentation" : { 167 | "tabs_state" : "{\n \"tab_id\": \"#tab1029193174-0\"\n}", 168 | "pivot_chart_state" : "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [],\n \"rows\": [],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}" 169 | }, 170 | "id" : "844604F254BF4450B2D60F09A766C444" 171 | }, 172 | "cell_type" : "code", 173 | "source" : [ "val topContentPages = contentPageLogs.withColumn(\"dayOfMonth\", dayofmonth($\"timestamp\"))\n", " .select($\"request\", $\"dayOfMonth\")\n", " .groupBy($\"dayOfMonth\", $\"request\")\n", " .agg(count($\"request\").alias(\"count\"))\n", " .orderBy(desc(\"count\"))" ], 174 | "outputs" : [ ] 175 | }, { 176 | "metadata" : { 177 | "trusted" : true, 178 | "input_collapsed" : false, 179 | "collapsed" : false, 180 | "id" : "392A9C52AEC34D9F8EF709F5B407F0E1" 181 | }, 182 | "cell_type" : "code", 183 | "source" : [ "topContentPages" ], 184 | "outputs" : [ ] 185 | }, { 186 | "metadata" : { 187 | "id" : "995C97DC2E9F4DBA95A832588B147843" 188 | }, 189 | "cell_type" : "markdown", 190 | "source" : "We can see that the most popular page that month was `liftoff.html ` corresponding to the coverage of the launch of the Discovery shuttle, as documented on the NASA archives: https://www.nasa.gov/mission_pages/shuttle/shuttlemissions/archives/sts-70.html.\n\nIt's closely followed by `countdown/` the days prior ot the launch." 
191 | }, { 192 | "metadata" : { 193 | "trusted" : true, 194 | "input_collapsed" : false, 195 | "collapsed" : true, 196 | "id" : "8DBF5B380B654300B6064149A1B63ABC" 197 | }, 198 | "cell_type" : "code", 199 | "source" : [ "" ], 200 | "outputs" : [ ] 201 | } ], 202 | "nbformat" : 4 203 | } -------------------------------------------------------------------------------- /chapter-7/streaming_weblogs.snb.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata" : { 3 | "id" : "5297db27-a8c6-40dc-b442-d4b6f279e3bb", 4 | "name" : "streaming_weblogs", 5 | "user_save_timestamp" : "1970-01-01T01:00:00.000Z", 6 | "auto_save_timestamp" : "1970-01-01T01:00:00.000Z", 7 | "language_info" : { 8 | "name" : "scala", 9 | "file_extension" : "scala", 10 | "codemirror_mode" : "text/x-scala" 11 | }, 12 | "trusted" : true, 13 | "sparkNotebook" : null, 14 | "customLocalRepo" : null, 15 | "customRepos" : null, 16 | "customDeps" : null, 17 | "customImports" : null, 18 | "customArgs" : null, 19 | "customSparkConf" : null, 20 | "customVars" : null 21 | }, 22 | "cells" : [ { 23 | "metadata" : { 24 | "trusted" : true, 25 | "input_collapsed" : false, 26 | "collapsed" : true, 27 | "id" : "A2D4928A0946433492E0431FB9AD1D66" 28 | }, 29 | "cell_type" : "markdown", 30 | "source" : "# Simple Weblog Analytics - The Streaming Way\nIn this notebook, we are going to explore the weblog use case using the stream 'as it happens'.\n\nThis notebook requires a local `TCP ` server that simulates the Web server sending data. \n\nPlease start the [weblogs_TCP_Server](./weblogs_TCP_server.snb.ipynb) notebook before running this one." 31 | }, { 32 | "metadata" : { 33 | "id" : "66DD67FAD0C24E00965CD26C0858D46B" 34 | }, 35 | "cell_type" : "markdown", 36 | "source" : "## To connect to a TCP source, we need the host and the port of the TCP server.\nHere we use the defaults used in the `weblogs_TCP_server` notebook. If you changed these parameters there, change them here accordingly" 37 | }, { 38 | "metadata" : { 39 | "trusted" : true, 40 | "input_collapsed" : false, 41 | "collapsed" : false, 42 | "id" : "35F22D17E25940618FFE838C99405B16" 43 | }, 44 | "cell_type" : "code", 45 | "source" : [ "val host = \"localhost\"\n", "val port = 9999" ], 46 | "outputs" : [ ] 47 | }, { 48 | "metadata" : { 49 | "id" : "C0E80D55C81340318BD2CD304E322F5B" 50 | }, 51 | "cell_type" : "markdown", 52 | "source" : "## We use the `TextSocketSource` in Structured Streaming to connect to the TCP server and consume the text stream.\nThis `Source` is called `socket` as the short name we can use as `format` to instantiate it.\n\nThe options needed to configure the `socket` `Source` are `host` and `port` to provide the configuration of our TCP server." 
53 | }, { 54 | "metadata" : { 55 | "trusted" : true, 56 | "input_collapsed" : false, 57 | "collapsed" : false, 58 | "id" : "1356EE7F9684454E89FA1F1F8D5CAA95" 59 | }, 60 | "cell_type" : "code", 61 | "source" : [ "val stream = sparkSession.readStream\n", " .format(\"socket\")\n", " .option(\"host\", host)\n", " .option(\"port\", port)\n", " .load()" ], 62 | "outputs" : [ ] 63 | }, { 64 | "metadata" : { 65 | "id" : "87702285726047CD8F3BE8FE0471C863" 66 | }, 67 | "cell_type" : "markdown", 68 | "source" : "## We define a schema for the data in the logs\nFollowing the formal description of the dataset (at: [NASA-HTTP](http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html) ), the log is structured as follows:\n\n>The logs are an ASCII file with one line per request, with the following columns:\n- host making the request. A hostname when possible, otherwise the Internet address if the name could not be looked up.\n- timestamp in the format \"DAY MON DD HH:MM:SS YYYY\", where DAY is the day of the week, MON is the name of the month, DD is the day of the month, HH:MM:SS is the time of day using a 24-hour clock, and YYYY is the year. The timezone is -0400.\n- request given in quotes.\n- HTTP reply code.\n- bytes in the reply.\n\nThe dataset provided for this exercise offers this data in JSON format" 69 | }, { 70 | "metadata" : { 71 | "trusted" : true, 72 | "input_collapsed" : false, 73 | "collapsed" : false, 74 | "id" : "727B6082C6C24A39B0D651583E62BFBC" 75 | }, 76 | "cell_type" : "code", 77 | "source" : [ "import java.sql.Timestamp\n", "case class WebLog(host:String, \n", " timestamp: Timestamp, \n", " request: String, \n", " http_reply:Int, \n", " bytes: Long\n", " )" ], 78 | "outputs" : [ ] 79 | }, { 80 | "metadata" : { 81 | "id" : "52B745BDE1A44B928CF66DB4118589A4" 82 | }, 83 | "cell_type" : "markdown", 84 | "source" : "##We convert the raw data to structured logs\nIn the batch analytics case we could load the data directly as JSON records. In the case of the `Socket` source, that data is received as plain text.\nTo transform our raw data to `WebLog` records, we first require a schema. The schema provides the necessary information to parse the text to a JSON object. It's the 'structure' when we talk about 'structured' streaming.\n\nAfter defining a schema for our data, we will:\n\n- Transform the text `value` to JSON using the JSON support built in the structured API of Spark\n- Use the `Dataset` API to transform the JSON records to `WebLog` objects\n\nAs result of this process, we will obtain a `Streaming Dataset` of `WebLog` records." 
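The next cells apply this conversion to the socket stream. Since a streaming Dataset cannot simply be shown before a query is started, a quick illustration on static data (one hypothetical sample record, using the same schema and `from_json` call) helps to see which columns the parsing step produces:

```scala
// Batch-mode illustration of the parsing step (the sample record is hypothetical).
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.from_json
import sparkSession.implicits._

val webLogSchema = Encoders.product[WebLog].schema

val sampleJson = Seq(
  """{"host":"example.nasa.gov","timestamp":"1995-07-01T00:00:11.000Z","request":"GET /ksc.html HTTP/1.0","http_reply":200,"bytes":7280}"""
).toDS()

sampleJson
  .select(from_json($"value", webLogSchema) as "record")
  .select("record.*")
  .show(false)
```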
85 | }, { 86 | "metadata" : { 87 | "trusted" : true, 88 | "input_collapsed" : false, 89 | "collapsed" : false, 90 | "id" : "DADD984537CC4AE3816E2BA42480F969" 91 | }, 92 | "cell_type" : "code", 93 | "source" : [ "val webLogSchema = Encoders.product[WebLog].schema " ], 94 | "outputs" : [ ] 95 | }, { 96 | "metadata" : { 97 | "trusted" : true, 98 | "input_collapsed" : false, 99 | "collapsed" : false, 100 | "id" : "C39C1D8FE59242F58953E8F72E95E2AD" 101 | }, 102 | "cell_type" : "code", 103 | "source" : [ "val jsonStream = stream.select(from_json($\"value\", webLogSchema) as \"record\")" ], 104 | "outputs" : [ ] 105 | }, { 106 | "metadata" : { 107 | "trusted" : true, 108 | "input_collapsed" : false, 109 | "collapsed" : false, 110 | "id" : "62E7E2F5CDC04C1185FD779141748500" 111 | }, 112 | "cell_type" : "code", 113 | "source" : [ "val webLogStream: Dataset[WebLog] = jsonStream.select(\"record.*\").as[WebLog]" ], 114 | "outputs" : [ ] 115 | }, { 116 | "metadata" : { 117 | "id" : "7BF49320BEDD4896882407FDF6CD3E37" 118 | }, 119 | "cell_type" : "markdown", 120 | "source" : "## We have a structured stream.\nThe `webLogStream` we just obtained is of type `Dataset[WebLog]` like we had in the batch analytics job.\nThe difference between this instance and the batch version is that `webLogStream` is a streaming `Dataset`.\n\nWe can observe this by querying the object.\n" 121 | }, { 122 | "metadata" : { 123 | "trusted" : true, 124 | "input_collapsed" : false, 125 | "collapsed" : false, 126 | "id" : "E18B9BF47F9644EE8EC52885827EF4EE" 127 | }, 128 | "cell_type" : "code", 129 | "source" : [ "webLogStream.isStreaming" ], 130 | "outputs" : [ ] 131 | }, { 132 | "metadata" : { 133 | "id" : "41E1A019CAEE4976B484B8A5D59D100E" 134 | }, 135 | "cell_type" : "markdown", 136 | "source" : "## Operations on Streaming Datasets\nAt this point in the batch job, we were creating the first query on our data: How many records are contained in our dataset?\nThis is a question that we can answer easily when we have access to all the data. But how to count records that are constantly arriving? \nThe answer is that some operations we consider usual on a static `Dataset`, like counting all records, do not have a defined meaning on a streaming Dataset.\n\nAs we can observe, attempting to execute the `count` query below will result in an `AnalysisException`. Queries in Structured Streaming are a continuous operation that needs to be scheduled. To start scheduling queries on a stream, we use the `writeStream.start()` operation. " 137 | }, { 138 | "metadata" : { 139 | "trusted" : true, 140 | "input_collapsed" : false, 141 | "collapsed" : false, 142 | "id" : "47C12878560C49089D4E4742729D9694" 143 | }, 144 | "cell_type" : "code", 145 | "source" : [ "// expect this call to fail!\n", "val count = webLogStream.count()" ], 146 | "outputs" : [ ] 147 | }, { 148 | "metadata" : { 149 | "id" : "4288478AA31B49928541E3795ED7217F" 150 | }, 151 | "cell_type" : "markdown", 152 | "source" : "## What are popular URLs? In what timeframe?\n\nNow that we have immediate analytic access to the stream of weblogs we don't need to wait for a day or a month to have a rank of the popular URLs. We can have that information as trends unfold on much shorter windows of time.\n\nTo define the period of our interest, we create a window over some timestamp. 
An interesting feature of Structured Streaming is that we can define that window on the timestamp when the data was produced, also known as 'event time' as opposed to the time when the data is processed.\n\nOur window definition is of 5 minutes of event data. Given that the TCP Server is replaying the logs in a simulated timeline, the 5 minutes might happen much faster or slower than the clock time. In this way, we can appreciate how Structured Streaming uses the timestamp information in the events to keep track of the event timeline.\n\nAs we learned from the batch analytics, we should extract the URLs and only select content pages, like `html`, `htm`, or directories. Let's apply that acquired knowledge first before proceeding to define our `window` query." 153 | }, { 154 | "metadata" : { 155 | "trusted" : true, 156 | "input_collapsed" : false, 157 | "collapsed" : false, 158 | "id" : "A033C0056A0F478B806143935073E3E4" 159 | }, 160 | "cell_type" : "code", 161 | "source" : [ "// A regex expression to extract the accessed URL from weblog.request \n", "val urlExtractor = \"\"\"^GET (.+) HTTP/\\d.\\d\"\"\".r\n", "val allowedExtensions = Set(\".html\",\".htm\", \"\")\n", "\n", "val contentPageLogs: String => Boolean = url => {\n", " val ext = url.takeRight(5).dropWhile(c => c != '.')\n", " allowedExtensions.contains(ext)\n", "}\n", "\n", "val urlWebLogStream = webLogStream.flatMap{ weblog => \n", " weblog.request match { \n", " case urlExtractor(url) if (contentPageLogs(url)) => Some(weblog.copy(request = url))\n", " case _ => None\n", " }\n", "}" ], 162 | "outputs" : [ ] 163 | }, { 164 | "metadata" : { 165 | "id" : "008F6271D01845028313B5DA01D7E2BB" 166 | }, 167 | "cell_type" : "markdown", 168 | "source" : "## Top Content Pages Query\nWe have converted the request to only contain the visited URL and filtered out all non-content pages. \nWe will now define the windowed query to compute the top trending URLs " 169 | }, { 170 | "metadata" : { 171 | "trusted" : true, 172 | "input_collapsed" : false, 173 | "collapsed" : false, 174 | "id" : "00D498E82EBB42DD8EB660C40F15D8F0" 175 | }, 176 | "cell_type" : "code", 177 | "source" : [ "val rankingURLStream = urlWebLogStream.groupBy($\"request\", window($\"timestamp\", \"5 minutes\", \"1 minute\")).count()" ], 178 | "outputs" : [ ] 179 | }, { 180 | "metadata" : { 181 | "id" : "D43F6DAFECBB416F9B5C73A9D16B6438" 182 | }, 183 | "cell_type" : "markdown", 184 | "source" : "## Start the stream processing\nAll the steps we have followed so far have been to define the process that the stream will undergo but no data has been processed yet. \n\nTo start a Structured Streaming job, we need to specify a `sink` and an `output mode`. \nThese are two new concepts introduced by Structured Streaming.\n\nA `sink` defines where we want to materialize the resulting data, like to a file in a file system, to an in-memory table or to another streaming system such as Kafka.\nThe `output mode` defines how we want the results to be delivered: Do we want to see all data every time, only updates or just the new records? \n\nThese options are given to a `writeStream` operation that creates the streaming query that starts the stream consumption, materializes the computations \ndeclared on the query and produces the result to the output `sink`.\n\nWe will visit all these concepts in detail later on. 
For now, we will use them empirically and observe the results.\n\nFor our query, we will use the `memory` `sink` and output mode `complete` to have a fully updated table each time new records are added to the result of keeping track of the URL ranking." 185 | }, { 186 | "metadata" : { 187 | "trusted" : true, 188 | "input_collapsed" : false, 189 | "collapsed" : false, 190 | "id" : "B01C2A1937204F6E8F001BF0ABFD0BDB" 191 | }, 192 | "cell_type" : "code", 193 | "source" : [ "val query = rankingURLStream.writeStream\n", " .queryName(\"urlranks\")\n", " .outputMode(\"complete\")\n", " .format(\"memory\")\n", " .start()" ], 194 | "outputs" : [ ] 195 | }, { 196 | "metadata" : { 197 | "id" : "8FB679DD8DD045E69BEF5DC74FB427E2" 198 | }, 199 | "cell_type" : "markdown", 200 | "source" : "### The memory sink outputs the data to a temporary table of the same name given in the queryName option." 201 | }, { 202 | "metadata" : { 203 | "trusted" : true, 204 | "input_collapsed" : false, 205 | "collapsed" : false, 206 | "id" : "F9F3A081AA0B4ED188828F2025139163" 207 | }, 208 | "cell_type" : "code", 209 | "source" : [ "sparkSession.sql(\"show tables\").show()" ], 210 | "outputs" : [ ] 211 | }, { 212 | "metadata" : { 213 | "id" : "95836A339A17498C8A35459222AE3BDF" 214 | }, 215 | "cell_type" : "markdown", 216 | "source" : "## Exploring the Data\nThe `memory` `sink` outputs the data to a temporary table of the same name given in the `queryName` option. We can create a `DataFrame` from that table to explore the results of the stream process. \n" 217 | }, { 218 | "metadata" : { 219 | "trusted" : true, 220 | "input_collapsed" : false, 221 | "collapsed" : false, 222 | "id" : "7BDA487B44AD4AFF830A9F7DF469B764" 223 | }, 224 | "cell_type" : "code", 225 | "source" : [ "val urlRanks = sparkSession.sql(\"select * from urlranks\")" ], 226 | "outputs" : [ ] 227 | }, { 228 | "metadata" : { 229 | "id" : "20C25E20A8354E0E8F21FDDBE666A6F0" 230 | }, 231 | "cell_type" : "markdown", 232 | "source" : "### Before we can see any materialized results, we need to wait for the window to complete.\nGiven that we are accelerating the log timeline on the producer side, after few seconds, we can execute the next command to see the result of the first windows." 
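While waiting, the running query itself can be inspected from another cell; a small monitoring sketch (using the `query` handle returned by `start()` above) is:

```scala
// Monitoring sketch: inspect the running query while the first event-time windows fill up.
println(query.status)                        // trigger state, whether new data is available
Option(query.lastProgress).foreach(println)  // null until the first trigger has completed

// Re-running this at intervals shows the ranking evolve as windows close.
sparkSession.sql("select * from urlranks").orderBy(desc("count")).show(10, false)
```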
233 | }, { 234 | "metadata" : { 235 | "trusted" : true, 236 | "input_collapsed" : false, 237 | "collapsed" : false, 238 | "id" : "A7AB127108814D1A9E76FC49E8893140" 239 | }, 240 | "cell_type" : "code", 241 | "source" : [ "urlRanks.select($\"request\", $\"window\", $\"count\").orderBy(desc(\"count\"))" ], 242 | "outputs" : [ ] 243 | }, { 244 | "metadata" : { 245 | "trusted" : true, 246 | "input_collapsed" : false, 247 | "collapsed" : true, 248 | "id" : "3BAF4A0860EE44548BD2D1D18882F142" 249 | }, 250 | "cell_type" : "code", 251 | "source" : [ "" ], 252 | "outputs" : [ ] 253 | } ], 254 | "nbformat" : 4 255 | } -------------------------------------------------------------------------------- /chapter-7/weblogs_TCP_server.snb.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata" : { 3 | "id" : "58751561-3431-418d-a455-6b41e9c00f91", 4 | "name" : "weblogs_TCP_server", 5 | "user_save_timestamp" : "2018-01-20T17:32:39.480Z", 6 | "auto_save_timestamp" : "1970-01-01T01:00:00.000Z", 7 | "language_info" : { 8 | "name" : "scala", 9 | "file_extension" : "scala", 10 | "codemirror_mode" : "text/x-scala" 11 | }, 12 | "trusted" : true, 13 | "sparkNotebook" : null, 14 | "customLocalRepo" : null, 15 | "customRepos" : null, 16 | "customDeps" : null, 17 | "customImports" : null, 18 | "customArgs" : null, 19 | "customSparkConf" : null, 20 | "customVars" : null 21 | }, 22 | "cells" : [ { 23 | "metadata" : { 24 | "id" : "5324899D4A1B41858D32AFA61DFF4FF7" 25 | }, 26 | "cell_type" : "markdown", 27 | "source" : "# TCP Weblog Server\nThis notebook simulates a log producer able to send web server logs to a client connected through a TCP connection." 28 | }, { 29 | "metadata" : { 30 | "id" : "F4C836C3A72F434EA3C3C719FE4DC497" 31 | }, 32 | "cell_type" : "markdown", 33 | "source" : "## Settings" 34 | }, { 35 | "metadata" : { 36 | "trusted" : true, 37 | "input_collapsed" : false, 38 | "collapsed" : false, 39 | "id" : "B1E5C26A4CF844CC8E7BB77B399126C5" 40 | }, 41 | "cell_type" : "code", 42 | "source" : [ "// This is the location of the unpackaged files. 
Update accordingly\n", "val serverPort = 9999\n", "val logsDirectory = \"/tmp/data/nasa_dataset_july_1995\"" ], 43 | "outputs" : [ ] 44 | }, { 45 | "metadata" : { 46 | "id" : "4B22602D64CB4E348E994064A0705D36" 47 | }, 48 | "cell_type" : "markdown", 49 | "source" : "## Let's reuse the `WebLog` definition used in the batch approach" 50 | }, { 51 | "metadata" : { 52 | "trusted" : true, 53 | "input_collapsed" : false, 54 | "collapsed" : false, 55 | "id" : "33C13BF6F85B425C80279008BBB6DEE0" 56 | }, 57 | "cell_type" : "code", 58 | "source" : [ "import java.sql.Timestamp\n", "case class WebLog(host:String, \n", " timestamp: Timestamp, \n", " request: String, \n", " http_reply:Int, \n", " bytes: Long\n", " )" ], 59 | "outputs" : [ ] 60 | }, { 61 | "metadata" : { 62 | "trusted" : true, 63 | "input_collapsed" : false, 64 | "collapsed" : false, 65 | "id" : "0F6108133CBB441685EDF095F4B52799" 66 | }, 67 | "cell_type" : "code", 68 | "source" : [ "val connectionWidget = ul(5)\n", "val dataWidget = ul(20)" ], 69 | "outputs" : [ ] 70 | }, { 71 | "metadata" : { 72 | "id" : "6664C6ACA92E463A857A72A8FF0F8943" 73 | }, 74 | "cell_type" : "markdown", 75 | "source" : "## A Simple TCP server implementation" 76 | }, { 77 | "metadata" : { 78 | "trusted" : true, 79 | "input_collapsed" : false, 80 | "collapsed" : false, 81 | "id" : "4BD44B38BD9C41B5A66E6B9EC8BF1914" 82 | }, 83 | "cell_type" : "code", 84 | "source" : [ "// Simple multithreaded server\n", "import java.net._\n", "import java.io._\n", "import java.sql.Timestamp\n", "import scala.concurrent.Future\n", "import scala.annotation.tailrec\n", "import scala.collection.JavaConverters._\n", "import org.apache.spark.sql.Dataset\n", "import org.apache.spark.sql.SparkSession\n", "import org.apache.spark.sql.functions._\n", "\n", "import scala.concurrent.ExecutionContext.Implicits.global\n", "\n", "class SocketHandler(sparkSession: SparkSession, port: Int, data: Dataset[WebLog]) {\n", " val logDelay = 500 // millis\n", " @volatile var active = false \n", " \n", " // non blocking start of the socket handler\n", " def start() : Unit = {\n", " active = true\n", " new Thread() {\n", " override def run() { \n", " connectionWidget.append(\"Server starting...\")\n", " acceptConnections()\n", " connectionWidget.append(\"Server stopped\")\n", " }\n", " }.start()\n", " } \n", " \n", " def stop() {\n", " active = false\n", " }\n", " \n", " @tailrec\n", " final def acceptConnections(): Unit = {\n", " val server: ServerSocket = new ServerSocket(port)\n", " val socket = server.accept()\n", " connectionWidget.append(\"Accepting connection from: \" + socket)\n", " serve(socket)\n", " if (active) {\n", " acceptConnections() \n", " } else {\n", " () // finish recursing for new connections\n", " }\n", " }\n", " \n", " // 1-thread per connection model for example purposes.\n", " def serve(socket: Socket) = {\n", " import sparkSession.implicits._\n", " val minTimestamp = data.select(min($\"timestamp\")).as[Timestamp].first\n", " val now = System.currentTimeMillis\n", " val offset = now - minTimestamp.getTime()\n", " val offsetData = data.map(weblog => weblog.copy(timestamp = new Timestamp(weblog.timestamp.getTime+ offset)))\n", " val jsonData = offsetData.toJSON\n", " val iter = jsonData.toLocalIterator.asScala\n", " new Thread() {\n", " override def run() {\n", " val out = new PrintStream(socket.getOutputStream())\n", " connectionWidget.append(\"Starting data stream for: \" + socket.getInetAddress() + \"]\")\n", " while(iter.hasNext && active) {\n", " val data = iter.next()\n", " 
out.println(data)\n", " dataWidget.append(s\"[${socket.getInetAddress()}] sending: ${data.take(40)}...\")\n", " out.flush()\n", " Thread.sleep(logDelay)\n", " }\n", " out.close()\n", " socket.close()\n", " }\n", " }.start()\n", " }\n", "}\n" ], 85 | "outputs" : [ ] 86 | }, { 87 | "metadata" : { 88 | "id" : "156E6DFCB6734B1C9995CC446102DA5A" 89 | }, 90 | "cell_type" : "markdown", 91 | "source" : "## We want to reuse the NASA weblog dataset with a Back-to-the-Future twist.\nWe are going to bring the timestamps to our current time." 92 | }, { 93 | "metadata" : { 94 | "trusted" : true, 95 | "input_collapsed" : false, 96 | "collapsed" : false, 97 | "id" : "C31A312D231442B5AA676B4A455C67A9" 98 | }, 99 | "cell_type" : "code", 100 | "source" : [ "val rawLogs = sparkSession.read.json(logsDirectory)" ], 101 | "outputs" : [ ] 102 | }, { 103 | "metadata" : { 104 | "trusted" : true, 105 | "input_collapsed" : false, 106 | "collapsed" : false, 107 | "id" : "14818332398A4EB79ADD52C0AECEAC78" 108 | }, 109 | "cell_type" : "code", 110 | "source" : [ "import org.apache.spark.sql.functions._\n", "import org.apache.spark.sql.types.IntegerType\n", "val preparedLogs = rawLogs.withColumn(\"http_reply\", $\"http_reply\".cast(IntegerType))\n", "val weblogs = preparedLogs.as[WebLog]" ], 111 | "outputs" : [ ] 112 | }, { 113 | "metadata" : { 114 | "trusted" : true, 115 | "input_collapsed" : false, 116 | "collapsed" : false, 117 | "id" : "639D7344557B4A4BB1FFB593762B7B20" 118 | }, 119 | "cell_type" : "code", 120 | "source" : [ "val server = new SocketHandler(sparkSession, serverPort, weblogs)" ], 121 | "outputs" : [ ] 122 | }, { 123 | "metadata" : { 124 | "id" : "BDD4A1D2C8B84E629D68DEC0DEB81EFC" 125 | }, 126 | "cell_type" : "markdown", 127 | "source" : "# Interactions Monitor\nThese two widgets will give us a view on connections and data being sent to a connecting client.\n\nWhen a client is connected, we should see the accepted connection under the `connectionWidget` and the data being sent in the `dataWidget`." 128 | }, { 129 | "metadata" : { 130 | "trusted" : true, 131 | "input_collapsed" : false, 132 | "collapsed" : false, 133 | "id" : "26F92DE4DB3442FD871F9D4E8DC2D610" 134 | }, 135 | "cell_type" : "code", 136 | "source" : [ "connectionWidget" ], 137 | "outputs" : [ ] 138 | }, { 139 | "metadata" : { 140 | "trusted" : true, 141 | "input_collapsed" : false, 142 | "collapsed" : false, 143 | "id" : "F11B11D5F0F347A4A3AADE231A898715" 144 | }, 145 | "cell_type" : "code", 146 | "source" : [ "dataWidget " ], 147 | "outputs" : [ ] 148 | }, { 149 | "metadata" : { 150 | "trusted" : true, 151 | "input_collapsed" : false, 152 | "collapsed" : false, 153 | "id" : "44680B5D303348F5850FA70FDBF0F433" 154 | }, 155 | "cell_type" : "markdown", 156 | "source" : "## Start the server accept process" 157 | }, { 158 | "metadata" : { 159 | "trusted" : true, 160 | "input_collapsed" : false, 161 | "collapsed" : false, 162 | "id" : "009428300B3C4FFCB4906427BA25FD3D" 163 | }, 164 | "cell_type" : "code", 165 | "source" : [ "server.start()" ], 166 | "outputs" : [ ] 167 | }, { 168 | "metadata" : { 169 | "trusted" : true, 170 | "input_collapsed" : false, 171 | "collapsed" : true, 172 | "id" : "941EFEA44D664DEC8C673FECA8F68C8B" 173 | }, 174 | "cell_type" : "markdown", 175 | "source" : "# Stop the server\nAfter experimenting with the TCP stream, execute the `close` method below to stop the data stream.\n\n*DO NOT* stop the server right after starting it. The command is commented out to prevent accidental execution. 
Uncomment and execute to stop this producer. " 176 | }, { 177 | "metadata" : { 178 | "trusted" : true, 179 | "input_collapsed" : false, 180 | "collapsed" : false, 181 | "id" : "71DA0D5613C943248184CA88AD7AE823" 182 | }, 183 | "cell_type" : "code", 184 | "source" : [ "//server.stop()" ], 185 | "outputs" : [ ] 186 | }, { 187 | "metadata" : { 188 | "trusted" : true, 189 | "input_collapsed" : false, 190 | "collapsed" : true, 191 | "id" : "0729E4DF9B1E47429D89DD2E4C00CEEB" 192 | }, 193 | "cell_type" : "code", 194 | "source" : [ "" ], 195 | "outputs" : [ ] 196 | } ], 197 | "nbformat" : 4 198 | } -------------------------------------------------------------------------------- /chapter-9/Structured-Streaming-in-action.snb.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata" : { 3 | "id" : "75765162-2d2f-4070-b5b4-9818cac60853", 4 | "name" : "Structured-Streaming-in-action", 5 | "user_save_timestamp" : "1970-01-01T01:00:00.000Z", 6 | "auto_save_timestamp" : "1970-01-01T01:00:00.000Z", 7 | "language_info" : { 8 | "name" : "scala", 9 | "file_extension" : "scala", 10 | "codemirror_mode" : "text/x-scala" 11 | }, 12 | "trusted" : true, 13 | "sparkNotebook" : null, 14 | "customLocalRepo" : null, 15 | "customRepos" : null, 16 | "customDeps" : [ "org.apache.spark %% spark-sql % 2.3.0", "org.apache.spark %% spark-sql-kafka-0-10 % 2.3.0" ], 17 | "customImports" : null, 18 | "customArgs" : null, 19 | "customSparkConf" : null, 20 | "customVars" : null 21 | }, 22 | "cells" : [ { 23 | "metadata" : { 24 | "id" : "6E2995E02B244E978E1327B7A60484F0" 25 | }, 26 | "cell_type" : "markdown", 27 | "source" : "# Structured Streaming - Kafka Example\n\nThe intention of this example is to explore the main aspects of the Structured Streaming API.\n\n - We use the Kafka `source` to consume the `iot-data` topic.\n - We use a file `sink` to store the data into a _Parquet_ file.\n \nTo run this example, you also require:\n\n- a running Kafka broker. We suggest to use the easy-to-run _dockerized_ version maintained by Spotify: https://hub.docker.com/r/spotify/kafka/\n- a reference file, listing parameters for each sensor. This file must be generated with [reference-data-generator](./reference-data-generator.snb.ipynb)\n- our data producer notebook: [kafka_sensor_data_producer](./kafka_sensor_data_producer.snb.ipynb)\n\nBecause Kafka acts as a broker between producer and consumer, you can choose to run the two notebooks in any order. \nNevertheless, we suggest that you run the producer first to have data available when we go through this example." 
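Before running the cells below, it can help to verify that the reference data is already in place (a hypothetical pre-flight check; the chapter-7 variant of this notebook shows the `AnalysisException` raised when the Parquet file is missing):

```scala
// Hypothetical pre-flight check: fail fast if the reference data has not been generated yet.
import java.io.File

val referenceParquet = new File("/tmp/streaming-with-spark", "sensor-records.parquet")
require(referenceParquet.exists,
  s"Reference data not found at $referenceParquet. Run reference-data-generator first.")
```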
28 | }, { 29 | "metadata" : { 30 | "trusted" : true, 31 | "input_collapsed" : false, 32 | "collapsed" : true, 33 | "id" : "447E7A8553BF47FD8416FC260B1A582C" 34 | }, 35 | "cell_type" : "code", 36 | "source" : [ "import org.apache.spark.sql.kafka010._" ], 37 | "outputs" : [ ] 38 | }, { 39 | "metadata" : { 40 | "trusted" : true, 41 | "input_collapsed" : false, 42 | "collapsed" : true, 43 | "id" : "3B783C2DA5A9409E85DF2CE7F061AECB" 44 | }, 45 | "cell_type" : "code", 46 | "source" : [ "import java.io.File\n", "// Kafka\n", "val kafkaBootstrapServer = \"127.0.0.1:9092\"\n", "val topic = \"iot-data\"\n", "\n", "// File system\n", "val workDir = \"/tmp/streaming-with-spark\"\n", "val referenceFile = \"sensor-records.parquet\"\n", "val targetFile = \"structured_enrichedIoTStream.parquet\"\n", "val targetPath = new File(workDir, targetFile).getAbsolutePath\n", "val unknownSensorsTargetFile = \"unknownSensorsStream.parquet\"\n", "val unknownSensorsTargetPath = new File(workDir, unknownSensorsTargetFile).getAbsolutePath\n" ], 47 | "outputs" : [ ] 48 | }, { 49 | "metadata" : { 50 | "trusted" : true, 51 | "input_collapsed" : false, 52 | "collapsed" : true, 53 | "id" : "3225F1A3939642B087F1BE6A37EB9D03" 54 | }, 55 | "cell_type" : "code", 56 | "source" : [ "val rawData = sparkSession.readStream\n", " .format(\"kafka\")\n", " .option(\"kafka.bootstrap.servers\", kafkaBootstrapServer)\n", " .option(\"subscribe\", topic)\n", " .option(\"startingOffsets\", \"earliest\")\n", " .load()" ], 57 | "outputs" : [ ] 58 | }, { 59 | "metadata" : { 60 | "trusted" : true, 61 | "input_collapsed" : false, 62 | "collapsed" : true, 63 | "id" : "497CEEFAB7DF40F884CC7A8139C3DA5F" 64 | }, 65 | "cell_type" : "code", 66 | "source" : [ "rawData.isStreaming" ], 67 | "outputs" : [ ] 68 | }, { 69 | "metadata" : { 70 | "trusted" : true, 71 | "input_collapsed" : false, 72 | "collapsed" : true, 73 | "id" : "A74BF086DCC240168F21E57797088678" 74 | }, 75 | "cell_type" : "code", 76 | "source" : [ "rawData.printSchema()" ], 77 | "outputs" : [ ] 78 | }, { 79 | "metadata" : { 80 | "trusted" : true, 81 | "input_collapsed" : false, 82 | "collapsed" : true, 83 | "id" : "A39D7FB1A9AC496F8DFC7502EB0A4C29" 84 | }, 85 | "cell_type" : "code", 86 | "source" : [ "case class SensorData(sensorId: Int, timestamp: Long, value: Double)" ], 87 | "outputs" : [ ] 88 | }, { 89 | "metadata" : { 90 | "trusted" : true, 91 | "input_collapsed" : false, 92 | "collapsed" : true, 93 | "id" : "6E729EE3495E4AA88FC9E347BDEE3210" 94 | }, 95 | "cell_type" : "code", 96 | "source" : [ "val iotData = rawData.select($\"value\").as[String].flatMap{record =>\n", " val fields = record.split(\",\")\n", " Try {\n", " SensorData(fields(0).toInt, fields(1).toLong, fields(2).toDouble)\n", " }.toOption\n", "}" ], 97 | "outputs" : [ ] 98 | }, { 99 | "metadata" : { 100 | "id" : "CD174F6E3EF1475AA51899B5931B7415" 101 | }, 102 | "cell_type" : "markdown", 103 | "source" : "## Load the reference data from a parquet file¶\nWe also cache the data to keep it in memory and improve the performance of our steaming application" 104 | }, { 105 | "metadata" : { 106 | "trusted" : true, 107 | "input_collapsed" : false, 108 | "collapsed" : true, 109 | "id" : "0314D4CF7E88448EB2A5BDBED5839282" 110 | }, 111 | "cell_type" : "code", 112 | "source" : [ "val sensorRef = sparkSession.read.parquet(s\"$workDir/$referenceFile\")\n", "sensorRef.cache()" ], 113 | "outputs" : [ ] 114 | }, { 115 | "metadata" : { 116 | "id" : "B3CA2A3BCDD14C4F8F8A51EF6345D8F2" 117 | }, 118 | "cell_type" : "markdown", 119 | "source" : 
"## Join the Reference Data with the Stream to Compute the Enriched Values" 120 | }, { 121 | "metadata" : { 122 | "trusted" : true, 123 | "input_collapsed" : false, 124 | "collapsed" : true, 125 | "id" : "3A7A1D84B28542CB81342CC1AA13B0E9" 126 | }, 127 | "cell_type" : "code", 128 | "source" : [ "val sensorWithInfo = sensorRef.join(iotData, Seq(\"sensorId\"), \"inner\")\n", "\n", "val knownSensors = sensorWithInfo\n", " .withColumn(\"dnvalue\", $\"value\"*($\"maxRange\"-$\"minRange\")+$\"minRange\")\n", " .drop(\"value\", \"maxRange\", \"minRange\")" ], 129 | "outputs" : [ ] 130 | }, { 131 | "metadata" : { 132 | "id" : "0A84FF9B41F0495C93E8D9D1B6AD721E" 133 | }, 134 | "cell_type" : "markdown", 135 | "source" : "## Write the Results to a Parquet File" 136 | }, { 137 | "metadata" : { 138 | "trusted" : true, 139 | "input_collapsed" : false, 140 | "collapsed" : true, 141 | "id" : "967DA9CD00034B93A9F9077ACCABBF69" 142 | }, 143 | "cell_type" : "code", 144 | "source" : [ "val query = knownSensors.writeStream\n", " .outputMode(\"append\")\n", " .format(\"parquet\")\n", " .option(\"path\", targetPath)\n", " .option(\"checkpointLocation\", workDir + \"/iot-checkpoint\")\n", " .start()" ], 145 | "outputs" : [ ] 146 | }, { 147 | "metadata" : { 148 | "trusted" : true, 149 | "input_collapsed" : false, 150 | "collapsed" : true, 151 | "presentation" : { 152 | "tabs_state" : "{\n \"tab_id\": \"#tab636027600-0\"\n}", 153 | "pivot_chart_state" : "{\n \"hiddenAttributes\": [],\n \"menuLimit\": 200,\n \"cols\": [],\n \"rows\": [],\n \"vals\": [],\n \"exclusions\": {},\n \"inclusions\": {},\n \"unusedAttrsVertical\": 85,\n \"autoSortUnusedAttrs\": false,\n \"inclusionsInfo\": {},\n \"aggregatorName\": \"Count\",\n \"rendererName\": \"Table\"\n}" 154 | }, 155 | "id" : "C60D42C6AA7C43108AF52A4EEA7B407E" 156 | }, 157 | "cell_type" : "code", 158 | "source" : [ "\n", "query.recentProgress" ], 159 | "outputs" : [ ] 160 | }, { 161 | "metadata" : { 162 | "trusted" : true, 163 | "input_collapsed" : false, 164 | "collapsed" : true, 165 | "id" : "A8292D947D7E4A889B416F5C2BB158C7" 166 | }, 167 | "cell_type" : "code", 168 | "source" : [ "" ], 169 | "outputs" : [ ] 170 | } ], 171 | "nbformat" : 4 172 | } -------------------------------------------------------------------------------- /chapter-9/kafka-sensor-data-generator.snb.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata" : { 3 | "id" : "2f2a280f-81ed-4e24-b17c-a081f920b660", 4 | "name" : "kafka-sensor-data-generator", 5 | "user_save_timestamp" : "1970-01-01T01:00:00.000Z", 6 | "auto_save_timestamp" : "1970-01-01T01:00:00.000Z", 7 | "language_info" : { 8 | "name" : "scala", 9 | "file_extension" : "scala", 10 | "codemirror_mode" : "text/x-scala" 11 | }, 12 | "trusted" : true, 13 | "sparkNotebook" : null, 14 | "customLocalRepo" : null, 15 | "customRepos" : null, 16 | "customDeps" : [ "org.apache.spark %% spark-sql % 2.3.0", "org.apache.spark %% spark-sql-kafka-0-10 % 2.3.0" ], 17 | "customImports" : null, 18 | "customArgs" : null, 19 | "customSparkConf" : null, 20 | "customVars" : null 21 | }, 22 | "cells" : [ { 23 | "metadata" : { 24 | "trusted" : true, 25 | "input_collapsed" : false, 26 | "collapsed" : true, 27 | "id" : "30BAF937A31247F095E8F815E0A130EA" 28 | }, 29 | "cell_type" : "markdown", 30 | "source" : "# Sensor Data Generator\nThis notebook serves as a sensor data simulator.\nIt generates a stream of random sensor readings for a given number of sensors.\n\nThe data is produced to a configurable Kafka topic." 
31 | }, { 32 | "metadata" : { 33 | "id" : "BE3667BF92C2474C968FB663897C2945" 34 | }, 35 | "cell_type" : "markdown", 36 | "source" : "### Configuration" 37 | }, { 38 | "metadata" : { 39 | "trusted" : true, 40 | "input_collapsed" : false, 41 | "collapsed" : false, 42 | "id" : "D5EADC5B952546EB89BABA94FEA6AC85" 43 | }, 44 | "cell_type" : "code", 45 | "source" : [ "// Kafka\n", "val kafkaBootstrapServer = \"172.17.0.2:9092\"\n", "val targetTopic = \"iot-data\"\n", "\n", "// File system\n", "val workDir = \"/tmp/streaming-with-spark\"\n", "\n", "// Generator\n", "val sensorCount = 100000" ], 46 | "outputs" : [ ] 47 | }, { 48 | "metadata" : { 49 | "id" : "CF7FDE26F16B476D88694FC731121CD1" 50 | }, 51 | "cell_type" : "markdown", 52 | "source" : "## Schema\nWe need a schema definition for the sensor data that we are going to generate." 53 | }, { 54 | "metadata" : { 55 | "trusted" : true, 56 | "input_collapsed" : false, 57 | "collapsed" : false, 58 | "id" : "E9D6009A13594B1294C2C55F63E923B1" 59 | }, 60 | "cell_type" : "code", 61 | "source" : [ "case class SensorData(sensorId: Int, timestamp: Long, value: Double)\n", "object SensorData {\n", " import scala.util.Random\n", " def randomGen(maxId:Int) = {\n", " SensorData(Random.nextInt(maxId), System.currentTimeMillis, Random.nextDouble())\n", " }\n", "}" ], 62 | "outputs" : [ ] 63 | }, { 64 | "metadata" : { 65 | "trusted" : true, 66 | "input_collapsed" : false, 67 | "collapsed" : false, 68 | "id" : "D432E889A5C64C1FB722CDAFBD6464FF" 69 | }, 70 | "cell_type" : "code", 71 | "source" : [ "case class Rate(timestamp: Long, value: Long)" ], 72 | "outputs" : [ ] 73 | }, { 74 | "metadata" : { 75 | "id" : "C13759524C1740209059997D9D1FE4C6" 76 | }, 77 | "cell_type" : "markdown", 78 | "source" : "## We use the built-in rate generator as the base stream for our data generator" 79 | }, { 80 | "metadata" : { 81 | "trusted" : true, 82 | "input_collapsed" : false, 83 | "collapsed" : false, 84 | "id" : "BCC7E1DD775A48CF80B51C96737A43B2" 85 | }, 86 | "cell_type" : "code", 87 | "source" : [ "val baseStream = sparkSession.readStream.format(\"rate\").option(\"recordsPerSecond\", 100).load()" ], 88 | "outputs" : [ ] 89 | }, { 90 | "metadata" : { 91 | "trusted" : true, 92 | "input_collapsed" : false, 93 | "collapsed" : false, 94 | "id" : "66DC058C3C904140836593CB97F088DE" 95 | }, 96 | "cell_type" : "code", 97 | "source" : [ "val sensorValues = baseStream.as[Rate].map(_ => SensorData.randomGen(sensorCount))" ], 98 | "outputs" : [ ] 99 | }, { 100 | "metadata" : { 101 | "trusted" : true, 102 | "input_collapsed" : false, 103 | "collapsed" : false, 104 | "id" : "AEF94D7C908641D681E7F21F22FA06F3" 105 | }, 106 | "cell_type" : "code", 107 | "source" : [ "import org.apache.spark.sql.kafka010._" ], 108 | "outputs" : [ ] 109 | }, { 110 | "metadata" : { 111 | "trusted" : true, 112 | "input_collapsed" : false, 113 | "collapsed" : false, 114 | "id" : "918CD1A8190D4BF6A9A23826E06B2933" 115 | }, 116 | "cell_type" : "code", 117 | "source" : [ "val query = sensorValues.writeStream.format(\"kafka\")\n", " .queryName(\"kafkaWriter\")\n", " .outputMode(\"append\")\n", " .option(\"kafka.bootstrap.servers\", kafkaBootstrapServer) // comma-separated list of host:port\n", " .option(\"topic\", targetTopic)\n", " .option(\"checkpointLocation\", workDir+\"/generator-checkpoint\")\n", " .option(\"failOnDataLoss\", \"false\") // use this option when testing\n", " .start()\n", "\n" ], 118 | "outputs" : [ ] 119 | }, { 120 | "metadata" : { 121 | "trusted" : true, 122 | "input_collapsed" : false, 123 | 
"collapsed" : true, 124 | "id" : "C7B26D61719145BF9D52A5B7605ADAAB" 125 | }, 126 | "cell_type" : "code", 127 | "source" : [ "" ], 128 | "outputs" : [ ] 129 | } ], 130 | "nbformat" : 4 131 | } -------------------------------------------------------------------------------- /chapter-9/reference-data-generator.snb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata" : { 3 | "id" : "bac1b48e-b675-41c4-83cb-14cfddef6ac7", 4 | "name" : "reference-data-generator", 5 | "user_save_timestamp" : "1970-01-01T01:00:00.000Z", 6 | "auto_save_timestamp" : "1970-01-01T01:00:00.000Z", 7 | "language_info" : { 8 | "name" : "scala", 9 | "file_extension" : "scala", 10 | "codemirror_mode" : "text/x-scala" 11 | }, 12 | "trusted" : true, 13 | "sparkNotebook" : null, 14 | "customLocalRepo" : null, 15 | "customRepos" : null, 16 | "customDeps" : null, 17 | "customImports" : null, 18 | "customArgs" : null, 19 | "customSparkConf" : null, 20 | "customVars" : null 21 | }, 22 | "cells" : [ { 23 | "metadata" : { 24 | "id" : "05BA7B120BD24B44A94495B3B13DB04C" 25 | }, 26 | "cell_type" : "markdown", 27 | "source" : "#Sensor Reference Data Generator\nLearning Spark Streaming - supporting material\n\nThis notebook generates the fixed reference data used through in the IoT examples" 28 | }, { 29 | "metadata" : { 30 | "trusted" : true, 31 | "input_collapsed" : false, 32 | "collapsed" : false, 33 | "id" : "40032AE70D9B4ABABFBC6EAA0A38F3B0" 34 | }, 35 | "cell_type" : "code", 36 | "source" : [ "val sensorCount = 100000\n", "val workDir = \"/tmp/streaming-with-spark/\"\n", "val referenceFile = \"sensor-records.parquet\"" ], 37 | "outputs" : [ { 38 | "name" : "stdout", 39 | "output_type" : "stream", 40 | "text" : "sensorCount: Int = 100000\nworkDir: String = /tmp/streaming-with-spark/\nreferenceFile: String = sensor-records.parquet\n" 41 | }, { 42 | "metadata" : { }, 43 | "data" : { 44 | "text/html" : "" 45 | }, 46 | "output_type" : "execute_result", 47 | "execution_count" : 1, 48 | "time" : "Took: 0.924s, at 2019-03-02 20:42" 49 | } ] 50 | }, { 51 | "metadata" : { 52 | "trusted" : true, 53 | "input_collapsed" : false, 54 | "collapsed" : false, 55 | "id" : "C9A22043B45445E48F2DD2839DB0116B" 56 | }, 57 | "cell_type" : "code", 58 | "source" : [ "case class SensorType(sensorType: String, unit: String, minRange: Double, maxRange: Double)" ], 59 | "outputs" : [ { 60 | "name" : "stdout", 61 | "output_type" : "stream", 62 | "text" : "defined class SensorType\n" 63 | }, { 64 | "metadata" : { }, 65 | "data" : { 66 | "text/html" : "" 67 | }, 68 | "output_type" : "execute_result", 69 | "execution_count" : 2, 70 | "time" : "Took: 0.764s, at 2017-06-30 16:25" 71 | } ] 72 | }, { 73 | "metadata" : { 74 | "trusted" : true, 75 | "input_collapsed" : false, 76 | "collapsed" : false, 77 | "id" : "D28261212443425C8938630D60098A8C" 78 | }, 79 | "cell_type" : "code", 80 | "source" : [ "case class SensorReference(sensorId: Long, sensorType: String, unit: String, minRange: Double, maxRange: Double)" ], 81 | "outputs" : [ { 82 | "name" : "stdout", 83 | "output_type" : "stream", 84 | "text" : "defined class SensorReference\n" 85 | }, { 86 | "metadata" : { }, 87 | "data" : { 88 | "text/html" : "" 89 | }, 90 | "output_type" : "execute_result", 91 | "execution_count" : 3, 92 | "time" : "Took: 0.690s, at 2017-06-30 16:25" 93 | } ] 94 | }, { 95 | "metadata" : { 96 | "trusted" : true, 97 | "input_collapsed" : false, 98 | "collapsed" : false, 99 | "id" : "FBDD084D0A4648FC80B7E8BC4F14A3F9" 100 | }, 101 | "cell_type" : 
"code", 102 | "source" : [ "val sensorTypes = List (\n", " SensorType(\"humidity\", \"%Rh\", 0, 100),\n", " SensorType(\"temperature\", \"oC\", -100, 100),\n", " SensorType(\"brightness\", \"lux\", 0, 100000),\n", " SensorType(\"rainfall\",\"mm/day\",0, 5000),\n", " SensorType(\"windspeed\",\"m/s\", 0, 50),\n", " SensorType(\"pressure\", \"mmHg\", 800, 1100),\n", " SensorType(\"magnetism\", \"T\", 0, 1000),\n", " SensorType(\"Radiation\", \"mSv\", 0.01, 10000)\n", ")\n", "\n", " " ], 103 | "outputs" : [ { 104 | "name" : "stdout", 105 | "output_type" : "stream", 106 | "text" : "sensorTypes: List[SensorType] = List(SensorType(humidity,%Rh,0.0,100.0), SensorType(temperature,oC,-100.0,100.0), SensorType(brightness,lux,0.0,100000.0), SensorType(rainfall,mm/day,0.0,5000.0), SensorType(windspeed,m/s,0.0,50.0), SensorType(pressure,mmHg,800.0,1100.0), SensorType(magnetism,T,0.0,1000.0), SensorType(Radiation,mSv,0.01,10000.0))\n" 107 | }, { 108 | "metadata" : { }, 109 | "data" : { 110 | "text/html" : "" 111 | }, 112 | "output_type" : "execute_result", 113 | "execution_count" : 4, 114 | "time" : "Took: 0.977s, at 2017-06-30 16:25" 115 | } ] 116 | }, { 117 | "metadata" : { 118 | "trusted" : true, 119 | "input_collapsed" : false, 120 | "collapsed" : false, 121 | "id" : "B3C135CC5E0340928B1567700315991A" 122 | }, 123 | "cell_type" : "code", 124 | "source" : [ "val sensorIds = sparkSession.range(0, sensorCount)" ], 125 | "outputs" : [ { 126 | "name" : "stdout", 127 | "output_type" : "stream", 128 | "text" : "sensorIds: org.apache.spark.sql.Dataset[Long] = [id: bigint]\n" 129 | }, { 130 | "metadata" : { }, 131 | "data" : { 132 | "text/html" : "" 133 | }, 134 | "output_type" : "execute_result", 135 | "execution_count" : 5, 136 | "time" : "Took: 1.714s, at 2017-06-30 16:25" 137 | } ] 138 | }, { 139 | "metadata" : { 140 | "trusted" : true, 141 | "input_collapsed" : false, 142 | "collapsed" : false, 143 | "id" : "91C4D164711748C0A48D06854234213C" 144 | }, 145 | "cell_type" : "code", 146 | "source" : [ "import scala.util.Random\n", "val sensors = sensorIds.map{id => \n", " val sensorType = sensorTypes(Random.nextInt(sensorTypes.size))\n", " SensorReference(id, sensorType.sensorType, sensorType.unit, sensorType.minRange, sensorType.maxRange)\n", " }" ], 147 | "outputs" : [ { 148 | "name" : "stdout", 149 | "output_type" : "stream", 150 | "text" : "import scala.util.Random\nsensors: org.apache.spark.sql.Dataset[SensorReference] = [sensorId: bigint, sensorType: string ... 
3 more fields]\n" 151 | }, { 152 | "metadata" : { }, 153 | "data" : { 154 | "text/html" : "" 155 | }, 156 | "output_type" : "execute_result", 157 | "execution_count" : 6, 158 | "time" : "Took: 1.021s, at 2017-06-30 16:25" 159 | } ] 160 | }, { 161 | "metadata" : { 162 | "trusted" : true, 163 | "input_collapsed" : false, 164 | "collapsed" : false, 165 | "id" : "BB72B2989AE14D628301293C03B135F9" 166 | }, 167 | "cell_type" : "code", 168 | "source" : [ "sensors.show()" ], 169 | "outputs" : [ { 170 | "name" : "stdout", 171 | "output_type" : "stream", 172 | "text" : "+--------+-----------+------+--------+--------+\n|sensorId| sensorType| unit|minRange|maxRange|\n+--------+-----------+------+--------+--------+\n| 0| rainfall|mm/day| 0.0| 5000.0|\n| 1| windspeed| m/s| 0.0| 50.0|\n| 2| magnetism| T| 0.0| 1000.0|\n| 3| rainfall|mm/day| 0.0| 5000.0|\n| 4| Radiation| mSv| 0.01| 10000.0|\n| 5| rainfall|mm/day| 0.0| 5000.0|\n| 6|temperature| oC| -100.0| 100.0|\n| 7| Radiation| mSv| 0.01| 10000.0|\n| 8| pressure| mmHg| 800.0| 1100.0|\n| 9| humidity| %Rh| 0.0| 100.0|\n| 10| Radiation| mSv| 0.01| 10000.0|\n| 11| pressure| mmHg| 800.0| 1100.0|\n| 12| windspeed| m/s| 0.0| 50.0|\n| 13| pressure| mmHg| 800.0| 1100.0|\n| 14| brightness| lux| 0.0|100000.0|\n| 15| brightness| lux| 0.0|100000.0|\n| 16|temperature| oC| -100.0| 100.0|\n| 17|temperature| oC| -100.0| 100.0|\n| 18| humidity| %Rh| 0.0| 100.0|\n| 19| Radiation| mSv| 0.01| 10000.0|\n+--------+-----------+------+--------+--------+\nonly showing top 20 rows\n\n" 173 | }, { 174 | "metadata" : { }, 175 | "data" : { 176 | "text/html" : "" 177 | }, 178 | "output_type" : "execute_result", 179 | "execution_count" : 7, 180 | "time" : "Took: 2.145s, at 2017-06-30 16:25" 181 | } ] 182 | }, { 183 | "metadata" : { 184 | "trusted" : true, 185 | "input_collapsed" : false, 186 | "collapsed" : false, 187 | "id" : "B88A683EFFDF47128101120929470552" 188 | }, 189 | "cell_type" : "code", 190 | "source" : [ "sensors.write.mode(\"overwrite\").parquet(s\"$workDir/$referenceFile\")\n" ], 191 | "outputs" : [ { 192 | "metadata" : { }, 193 | "data" : { 194 | "text/html" : "" 195 | }, 196 | "output_type" : "execute_result", 197 | "execution_count" : 8, 198 | "time" : "Took: 2.235s, at 2017-06-30 16:25" 199 | } ] 200 | }, { 201 | "metadata" : { 202 | "trusted" : true, 203 | "input_collapsed" : false, 204 | "collapsed" : false, 205 | "id" : "5D0779C60E37488FBAC045BCCB34702C" 206 | }, 207 | "cell_type" : "code", 208 | "source" : [ "" ], 209 | "outputs" : [ { 210 | "metadata" : { }, 211 | "data" : { 212 | "text/html" : "" 213 | }, 214 | "output_type" : "execute_result", 215 | "execution_count" : 9, 216 | "time" : "Took: 0.683s, at 2017-06-30 16:25" 217 | } ] 218 | } ], 219 | "nbformat" : 4 220 | } -------------------------------------------------------------------------------- /datasets/NASA-weblogs/nasa_dataset_july_1995.tgz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/stream-processing-with-spark/notebooks/e3703fbf227a7de6ad54221b2a27310917204c5e/datasets/NASA-weblogs/nasa_dataset_july_1995.tgz -------------------------------------------------------------------------------- /extras-twitter/LearningStreaming-GeoTwitter.snb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "id": "8a82bdb1-b108-4ad5-b6ce-842b37ce79d8", 4 | "name": "LearningStreaming-GeoTwitter", 5 | "user_save_timestamp": "1970-01-01T01:00:00.000Z", 6 | "auto_save_timestamp": 
"1970-01-01T01:00:00.000Z", 7 | "language_info": { 8 | "name": "scala", 9 | "file_extension": "scala", 10 | "codemirror_mode": "text/x-scala" 11 | }, 12 | "trusted": true, 13 | "sparkNotebook": null, 14 | "customLocalRepo": null, 15 | "customRepos": null, 16 | "customDeps": [ 17 | "org.apache.spark %% spark-streaming @ 2.1.0", 18 | "org.apache.bahir %% spark-streaming-twitter % 2.1.0" 19 | ], 20 | "customImports": null, 21 | "customArgs": null, 22 | "customSparkConf": null, 23 | "customVars": null 24 | }, 25 | "cells": [ 26 | { 27 | "metadata": { 28 | "id": "E06591AD8AA74F1BBA6FA628F887B714" 29 | }, 30 | "cell_type": "markdown", 31 | "source": "Twitter Geolocation\n===========\nIn this notebook, we are going to explore Spark Streaming and the `DStream` API using Twitter as a stream source.\n" 32 | }, 33 | { 34 | "metadata": { 35 | "trusted": true, 36 | "input_collapsed": false, 37 | "collapsed": false, 38 | "id": "CD2D34DDB9A4465081C1F8D5EED2FEF9" 39 | }, 40 | "cell_type": "code", 41 | "source": "import org.apache.spark.streaming.{Seconds, StreamingContext}\nimport org.apache.spark.SparkContext._\nimport org.apache.spark.streaming.twitter._\n", 42 | "outputs": [] 43 | }, 44 | { 45 | "metadata": { 46 | "id": "66F3FB6D3D2C44789C0A00FBD6EF82F4" 47 | }, 48 | "cell_type": "markdown", 49 | "source": "To use the Twitter API, we require two pairs of credentials that can be created and retrieved from \n[Twitter Applications Console](https://apps.twitter.com/) : \n- consumer key\n- consumer secret\n- access token\n- access token secret\n\nHere, we define a helper function to facilitate setting these keys as `System Properties` expected by the `spark-streaming-twitter` API and the underlying library used (twitter4j)\n" 50 | }, 51 | { 52 | "metadata": { 53 | "trusted": true, 54 | "input_collapsed": false, 55 | "collapsed": false, 56 | "id": "4F95FF7380534E199F9B320776186355" 57 | }, 58 | "cell_type": "code", 59 | "source": "def configureTwitterCredentials(consumerKey: String,\n consumerSecret: String,\n accessToken: String,\n accessTokenSecret: String) {\n val configs = Seq(\"consumerKey\" -> consumerKey,\n \"consumerSecret\" -> consumerSecret,\n \"accessToken\" -> accessToken,\n \"accessTokenSecret\" -> accessTokenSecret).toMap\n val trimmedConfigs = configs.mapValues(_.trim)\n configs.foreach{ case(key, value) =>\n require(value.nonEmpty, s\"\"\"Error setting authentication - value for $key not set\"\"\")\n val fullKey = \"twitter4j.oauth.\" + key\n System.setProperty(fullKey, value) }\n }", 60 | "outputs": [] 61 | }, 62 | { 63 | "metadata": { 64 | "id": "2B0BEF3C97F0476781D6F3EDE9E8470F" 65 | }, 66 | "cell_type": "markdown", 67 | "source": "Set here your own credentials " 68 | }, 69 | { 70 | "metadata": { 71 | "trusted": true, 72 | "input_collapsed": false, 73 | "collapsed": false, 74 | "id": "75E82BE86D83492888DF3F6CC6B4E906" 75 | }, 76 | "cell_type": "code", 77 | "source": "configureTwitterCredentials(\"\", \n \"\",\n \"\",\n \"\") ", 78 | "outputs": [] 79 | }, 80 | { 81 | "metadata": { 82 | "id": "0936863808084DF4815DF39D7CC1F3B2" 83 | }, 84 | "cell_type": "markdown", 85 | "source": "## We create a Streaming Context with a `streamingInterval` of 5 seconds" 86 | }, 87 | { 88 | "metadata": { 89 | "trusted": true, 90 | "input_collapsed": false, 91 | "collapsed": false, 92 | "id": "189742F12F6A4730BE186C8B96898DA2" 93 | }, 94 | "cell_type": "code", 95 | "source": "val ssc = new StreamingContext(sparkContext, Seconds(5))", 96 | "outputs": [] 97 | }, 98 | { 99 | "metadata": { 100 | "id": 
"649D1A10464342588BF3713287A402E4" 101 | }, 102 | "cell_type": "markdown", 103 | "source": "And create a `twitterStream` using the keyword `spark` as subscription filter." 104 | }, 105 | { 106 | "metadata": { 107 | "trusted": true, 108 | "input_collapsed": false, 109 | "collapsed": false, 110 | "id": "A0825ADCBD91483B82732C02E636CFF8" 111 | }, 112 | "cell_type": "code", 113 | "source": "val filters = Array(\"music\")\nval twitterStream = TwitterUtils.createStream(ssc, None, filters)", 114 | "outputs": [] 115 | }, 116 | { 117 | "metadata": { 118 | "id": "52DCA2D36728413DB62D2230FB909F1A" 119 | }, 120 | "cell_type": "markdown", 121 | "source": "We are going to process the data from the Twitter subscription by:\n- extracting the `#hashtags` in the text of the tweet \n- making each `#hashtag` a pair of the form `(#hashtag, 1)` to facilitate a count operation\n- reduce with a `sum` operation to obtain a count, using a window of `60` seconds\n- invert the resulting pairs: `(#hashtag, count)` to `(count, #hashtag)`\n- sort each batch of data (each `RDD` per streaming interval) to obtain a `top-n` score" 122 | }, 123 | { 124 | "metadata": { 125 | "trusted": true, 126 | "input_collapsed": false, 127 | "collapsed": false, 128 | "id": "CF39BFF9E477447486192BCA216EB8A9" 129 | }, 130 | "cell_type": "code", 131 | "source": "import StreamingContext._\nval hashTags = twitterStream.flatMap(status => status.getText.split(\" \").filter(_.startsWith(\"#\")))\n\nval topCounts60 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(60))\n .map{case (topic, count) => (count, topic)}\n .transform(_.sortByKey(false))", 132 | "outputs": [] 133 | }, 134 | { 135 | "metadata": { 136 | "id": "5824BAA2AFBA4B07971E0D42CB53C1C2" 137 | }, 138 | "cell_type": "markdown", 139 | "source": "We declare an Unordered List widget that will let us display the top-n results from the `DStream`" 140 | }, 141 | { 142 | "metadata": { 143 | "trusted": true, 144 | "input_collapsed": false, 145 | "collapsed": false, 146 | "id": "726AF1956669409C9E82C05D50BC4996" 147 | }, 148 | "cell_type": "code", 149 | "source": "@transient val result = ul(10)\nresult", 150 | "outputs": [] 151 | }, 152 | { 153 | "metadata": { 154 | "id": "92C7F04D831F4CEE87A42A4DC11D5488" 155 | }, 156 | "cell_type": "markdown", 157 | "source": "(optional) Use some static test data to test the behaviour of the `Unordered List` widget" 158 | }, 159 | { 160 | "metadata": { 161 | "trusted": true, 162 | "input_collapsed": false, 163 | "collapsed": false, 164 | "id": "1B6F1C3043E14FEA93DF3505436CB23C" 165 | }, 166 | "cell_type": "code", 167 | "source": "result(Array(\"top-1\",\"top-2\",\"top-3\"))", 168 | "outputs": [] 169 | }, 170 | { 171 | "metadata": { 172 | "id": "48D85C80F3664134812B5B1663507559" 173 | }, 174 | "cell_type": "markdown", 175 | "source": "We use the generic output operation `foreachRDD` to retrieve the top-10 hashtags of the corresponding window and update the display by passing this data to the `result` widget." 
176 | }, 177 | { 178 | "metadata": { 179 | "trusted": true, 180 | "input_collapsed": false, 181 | "collapsed": false, 182 | "id": "55EF4D8B55BC43F0833324161BFD57A0" 183 | }, 184 | "cell_type": "code", 185 | "source": "topCounts60.foreachRDD(rdd => {\n val topList = rdd.take(10).toList\n val r = topList.map{case (count, tag) => s\"$tag: $count\"}\n result(r)\n})", 186 | "outputs": [] 187 | }, 188 | { 189 | "metadata": { 190 | "id": "9D8BE7B8B5EB4FA78D77EDA7FBC18A97" 191 | }, 192 | "cell_type": "markdown", 193 | "source": "We declare a `GeoPointsChart` that will display points on a geographical map " 194 | }, 195 | { 196 | "metadata": { 197 | "trusted": true, 198 | "input_collapsed": false, 199 | "collapsed": false, 200 | "id": "A287B7F4744E40AD8D052D4B6F126643" 201 | }, 202 | "cell_type": "code", 203 | "source": "@transient val geo = GeoPointsChart(Seq((0d,0d, \"init\")))\ngeo", 204 | "outputs": [] 205 | }, 206 | { 207 | "metadata": { 208 | "trusted": true, 209 | "input_collapsed": false, 210 | "collapsed": false, 211 | "id": "690B851AF4DD4475AC860FC96BACEEE5" 212 | }, 213 | "cell_type": "code", 214 | "source": "twitterStream.window(Seconds(300), Seconds(15))\n . filter{ s =>\n s.getGeoLocation() != null\n }\n .map{s =>\n (s.getGeoLocation().getLatitude(),\n s.getGeoLocation().getLongitude(),\n s.getText())}\n .foreachRDD{rdd =>\n geo.applyOn(rdd.take(100))\n }", 215 | "outputs": [] 216 | }, 217 | { 218 | "metadata": { 219 | "id": "14C21F354E534D1EB564FBE76BF91D42" 220 | }, 221 | "cell_type": "markdown", 222 | "source": "Finally, we start the Streaming Context" 223 | }, 224 | { 225 | "metadata": { 226 | "trusted": true, 227 | "input_collapsed": false, 228 | "collapsed": false, 229 | "id": "303CDF07F5D945338B0E56FC924AFC68" 230 | }, 231 | "cell_type": "code", 232 | "source": "ssc.start()", 233 | "outputs": [] 234 | }, 235 | { 236 | "metadata": { 237 | "id": "ED021294EB1246FBAFD55D1CB2BD38F4" 238 | }, 239 | "cell_type": "markdown", 240 | "source": "When done, use the `stop` method to stop all streaming processing. We use the option to not stop the underlying sparkContext. That way we can restart the streaming job without having to restart the complete notebook kernel." 
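// Editor's note (illustrative, not part of the original notebook): in a standalone
// application the start/stop cycle shown in the surrounding cells is usually written as a
// single block; the 5-minute timeout below is an arbitrary example value.
ssc.start()
ssc.awaitTerminationOrTimeout(5 * 60 * 1000L)             // block for at most 5 minutes
ssc.stop(stopSparkContext = false, stopGracefully = true) // keep the SparkContext for reuse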
241 | }, 242 | { 243 | "metadata": { 244 | "trusted": true, 245 | "input_collapsed": false, 246 | "collapsed": false, 247 | "id": "439A13CBFDDE43198BAC94B80556A853" 248 | }, 249 | "cell_type": "code", 250 | "source": "ssc.stop(stopSparkContext = false, stopGracefully = true)", 251 | "outputs": [] 252 | }, 253 | { 254 | "metadata": { 255 | "trusted": true, 256 | "input_collapsed": false, 257 | "collapsed": true, 258 | "id": "3C84BF88D08A4020858D633A828A2AE3" 259 | }, 260 | "cell_type": "code", 261 | "source": "", 262 | "outputs": [] 263 | } 264 | ], 265 | "nbformat": 4 266 | } 267 | -------------------------------------------------------------------------------- /extras-twitter/LearningStreaming.snb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "id": "256ac0ef-d7a9-409f-a59a-fd2ce6ee42c1", 4 | "name": "LearningStreaming", 5 | "user_save_timestamp": "1970-01-01T01:00:00.000Z", 6 | "auto_save_timestamp": "1970-01-01T01:00:00.000Z", 7 | "language_info": { 8 | "name": "scala", 9 | "file_extension": "scala", 10 | "codemirror_mode": "text/x-scala" 11 | }, 12 | "trusted": true, 13 | "sparkNotebook": null, 14 | "customLocalRepo": null, 15 | "customRepos": null, 16 | "customDeps": [ 17 | "org.apache.spark %% spark-streaming @ 2.1.0", 18 | "org.apache.bahir %% spark-streaming-twitter % 2.1.0" 19 | ], 20 | "customImports": null, 21 | "customArgs": null, 22 | "customSparkConf": null, 23 | "customVars": null 24 | }, 25 | "cells": [ 26 | { 27 | "metadata": { 28 | "id": "E06591AD8AA74F1BBA6FA628F887B714" 29 | }, 30 | "cell_type": "markdown", 31 | "source": "Twitter Stream top-n HashTags\n================\nIn this notebook, we are going to explore Spark Streaming and the `DStream` API using Twitter as a stream source.\n" 32 | }, 33 | { 34 | "metadata": { 35 | "trusted": true, 36 | "input_collapsed": false, 37 | "collapsed": false, 38 | "id": "CD2D34DDB9A4465081C1F8D5EED2FEF9" 39 | }, 40 | "cell_type": "code", 41 | "source": "import org.apache.spark.streaming.{Seconds, StreamingContext}\nimport org.apache.spark.SparkContext._\nimport org.apache.spark.streaming.twitter._\n", 42 | "outputs": [] 43 | }, 44 | { 45 | "metadata": { 46 | "id": "66F3FB6D3D2C44789C0A00FBD6EF82F4" 47 | }, 48 | "cell_type": "markdown", 49 | "source": "To use the Twitter API, we require two pairs of credentials that can be created and retrieved from \n[Twitter Applications Console](https://apps.twitter.com/) : \n- consumer key\n- consumer secret\n- access token\n- access token secret\n\nHere, we define a helper function to facilitate setting these keys as `System Properties` expected by the `spark-streaming-twitter` API and the underlying library used (twitter4j)\n" 50 | }, 51 | { 52 | "metadata": { 53 | "trusted": true, 54 | "input_collapsed": false, 55 | "collapsed": false, 56 | "id": "4F95FF7380534E199F9B320776186355" 57 | }, 58 | "cell_type": "code", 59 | "source": "def configureTwitterCredentials(consumerKey: String,\n consumerSecret: String,\n accessToken: String,\n accessTokenSecret: String) {\n val configs = Seq(\"consumerKey\" -> consumerKey,\n \"consumerSecret\" -> consumerSecret,\n \"accessToken\" -> accessToken,\n \"accessTokenSecret\" -> accessTokenSecret).toMap\n val trimmedConfigs = configs.mapValues(_.trim)\n configs.foreach{ case(key, value) =>\n require(value.nonEmpty, s\"\"\"Error setting authentication - value for $key not set\"\"\")\n val fullKey = \"twitter4j.oauth.\" + key\n System.setProperty(fullKey, value) }\n }", 60 | "outputs": [] 61 | }, 62 | { 
63 | "metadata": { 64 | "id": "2B0BEF3C97F0476781D6F3EDE9E8470F" 65 | }, 66 | "cell_type": "markdown", 67 | "source": "Set here your own credentials " 68 | }, 69 | { 70 | "metadata": { 71 | "trusted": true, 72 | "input_collapsed": false, 73 | "collapsed": false, 74 | "id": "75E82BE86D83492888DF3F6CC6B4E906" 75 | }, 76 | "cell_type": "code", 77 | "source": "configureTwitterCredentials(\"\", \n \"\",\n \"\",\n \"\") ", 78 | "outputs": [] 79 | }, 80 | { 81 | "metadata": { 82 | "id": "0936863808084DF4815DF39D7CC1F3B2" 83 | }, 84 | "cell_type": "markdown", 85 | "source": "## We create a Streaming Context with a `streamingInterval` of 5 seconds" 86 | }, 87 | { 88 | "metadata": { 89 | "trusted": true, 90 | "input_collapsed": false, 91 | "collapsed": false, 92 | "id": "189742F12F6A4730BE186C8B96898DA2" 93 | }, 94 | "cell_type": "code", 95 | "source": "val ssc = new StreamingContext(sparkContext, Seconds(5))", 96 | "outputs": [] 97 | }, 98 | { 99 | "metadata": { 100 | "id": "649D1A10464342588BF3713287A402E4" 101 | }, 102 | "cell_type": "markdown", 103 | "source": "And create a `twitterStream` using the keyword `spark` as subscription filter." 104 | }, 105 | { 106 | "metadata": { 107 | "trusted": true, 108 | "input_collapsed": false, 109 | "collapsed": false, 110 | "id": "A0825ADCBD91483B82732C02E636CFF8" 111 | }, 112 | "cell_type": "code", 113 | "source": "val filters = Array(\"music\")\nval twitterStream = TwitterUtils.createStream(ssc, None, filters)", 114 | "outputs": [] 115 | }, 116 | { 117 | "metadata": { 118 | "id": "52DCA2D36728413DB62D2230FB909F1A" 119 | }, 120 | "cell_type": "markdown", 121 | "source": "We are going to process the data from the Twitter subscription by:\n- extracting the `#hashtags` in the text of the tweet \n- making each `#hashtag` a pair of the form `(#hashtag, 1)` to facilitate a count operation\n- reduce with a `sum` operation to obtain a count, using a window of `60` seconds\n- invert the resulting pairs: `(#hashtag, count)` to `(count, #hashtag)`\n- sort each batch of data (each `RDD` per streaming interval) to obtain a `top-n` score" 122 | }, 123 | { 124 | "metadata": { 125 | "trusted": true, 126 | "input_collapsed": false, 127 | "collapsed": false, 128 | "id": "CF39BFF9E477447486192BCA216EB8A9" 129 | }, 130 | "cell_type": "code", 131 | "source": "import StreamingContext._\nval hashTags = twitterStream.flatMap(status => status.getText.split(\" \").filter(_.startsWith(\"#\")))\n\nval topCounts60 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(60))\n .map{case (topic, count) => (count, topic)}\n .transform(_.sortByKey(false))", 132 | "outputs": [] 133 | }, 134 | { 135 | "metadata": { 136 | "id": "9D8BE7B8B5EB4FA78D77EDA7FBC18A97" 137 | }, 138 | "cell_type": "markdown", 139 | "source": "We declare an `Unordered List` widget that will let us display the `top-n` results from the `DStream` " 140 | }, 141 | { 142 | "metadata": { 143 | "trusted": true, 144 | "input_collapsed": false, 145 | "collapsed": false, 146 | "id": "A287B7F4744E40AD8D052D4B6F126643" 147 | }, 148 | "cell_type": "code", 149 | "source": "@transient val result = ul(10)\nresult", 150 | "outputs": [] 151 | }, 152 | { 153 | "metadata": { 154 | "id": "92C7F04D831F4CEE87A42A4DC11D5488" 155 | }, 156 | "cell_type": "markdown", 157 | "source": "(optional) Use some static test data to test the behaviour of the `Unordered List` widget" 158 | }, 159 | { 160 | "metadata": { 161 | "trusted": true, 162 | "input_collapsed": false, 163 | "collapsed": false, 164 | "id": "1B6F1C3043E14FEA93DF3505436CB23C" 165 | 
}, 166 | "cell_type": "code", 167 | "source": "result(Array(\"top-1\",\"top-2\",\"top-3\"))", 168 | "outputs": [] 169 | }, 170 | { 171 | "metadata": { 172 | "id": "48D85C80F3664134812B5B1663507559" 173 | }, 174 | "cell_type": "markdown", 175 | "source": "We use the generic output operation `foreachRDD` to retrieve the top-10 hashtags of the corresponding window and update the display by passing this data to the `result` widget." 176 | }, 177 | { 178 | "metadata": { 179 | "trusted": true, 180 | "input_collapsed": false, 181 | "collapsed": false, 182 | "id": "55EF4D8B55BC43F0833324161BFD57A0" 183 | }, 184 | "cell_type": "code", 185 | "source": "topCounts60.foreachRDD(rdd => {\n val topList = rdd.take(10).toList\n val r = topList.map{case (count, tag) => s\"$tag: $count\"}\n result(r)\n})", 186 | "outputs": [] 187 | }, 188 | { 189 | "metadata": { 190 | "id": "14C21F354E534D1EB564FBE76BF91D42" 191 | }, 192 | "cell_type": "markdown", 193 | "source": "Finally, we start the Streaming Context" 194 | }, 195 | { 196 | "metadata": { 197 | "trusted": true, 198 | "input_collapsed": false, 199 | "collapsed": false, 200 | "id": "303CDF07F5D945338B0E56FC924AFC68" 201 | }, 202 | "cell_type": "code", 203 | "source": "ssc.start()", 204 | "outputs": [] 205 | }, 206 | { 207 | "metadata": { 208 | "id": "ED021294EB1246FBAFD55D1CB2BD38F4" 209 | }, 210 | "cell_type": "markdown", 211 | "source": "When done, use the `stop` method to stop all streaming processing. We use the option to *not* stop the underlying `sparkContext`. That way we can restart the streaming job without having to restart the complete notebook kernel." 212 | }, 213 | { 214 | "metadata": { 215 | "trusted": true, 216 | "input_collapsed": false, 217 | "collapsed": false, 218 | "id": "439A13CBFDDE43198BAC94B80556A853" 219 | }, 220 | "cell_type": "code", 221 | "source": "ssc.stop(stopSparkContext = false, stopGracefully = true)", 222 | "outputs": [] 223 | }, 224 | { 225 | "metadata": { 226 | "trusted": true, 227 | "input_collapsed": false, 228 | "collapsed": true, 229 | "id": "3C84BF88D08A4020858D633A828A2AE3" 230 | }, 231 | "cell_type": "code", 232 | "source": "", 233 | "outputs": [] 234 | } 235 | ], 236 | "nbformat": 4 237 | } 238 | --------------------------------------------------------------------------------
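A minimal consumer sketch (editor's addition, not a file in this repository): the chapter-9 generator above produces sensor records into the `iot-data` topic, and the enrichment notebook joins a stream named `iotData` against the Parquet reference data. The snippet below shows how such a consumer could be wired up with Structured Streaming. It assumes the `spark-sql-kafka-0-10` dependency declared in the generator notebook and that the records arrive on the topic as JSON strings; the schema, parsing step, and connection values are illustrative rather than the book's exact code.

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions._
import sparkSession.implicits._

case class SensorData(sensorId: Int, timestamp: Long, value: Double)

// Derive the JSON schema from the same case class used by the generator
val sensorSchema = Encoders.product[SensorData].schema

val iotData = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "172.17.0.2:9092") // same broker as the generator config
  .option("subscribe", "iot-data")                      // topic written by the generator
  .load()
  .select(from_json($"value".cast("string"), sensorSchema) as "record") // parse the JSON payload
  .select("record.*")
  .as[SensorData]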