├── README.md
├── countByWindow.py
├── firstStreamApp.py
├── getStarted.ipynb
├── reduceByKeyAndWindow.py
├── reduceByWindow.py
├── students.csv
├── updateStateByKey.py
└── wordcount.txt

/README.md:
--------------------------------------------------------------------------------
# Get started with Spark Streaming

## Installation
Follow this tutorial to install Spark:
https://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm

## A streaming example
Here we will simply count, in near real time, the number of "ERROR" occurrences arriving on a socket.

### firstStreamApp.py
The example to run with Python:
```python
# Import libs
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Begin
if __name__ == "__main__":
    sc = SparkContext(appName="StreamingErrorCount")
    # 2 is the batch interval: 2 seconds
    ssc = StreamingContext(sc, 2)

    # Checkpoint directory for recovery
    ssc.checkpoint("file:///tmp/spark")

    # Define the socket the system will listen on
    # lines is not a single RDD but a DStream: a sequence of RDDs, constantly changing
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))

    # Counting errors
    counts = lines.flatMap(lambda line: line.split(" "))\
                  .filter(lambda word: "ERROR" in word)\
                  .map(lambda word: (word, 1))\
                  .reduceByKey(lambda a, b: a + b)
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
```
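Before running it as a stream, it helps to see what the transformation chain computes on a single batch. The sketch below is not part of the repository (the sample lines and app name are made up); it applies the same flatMap/filter/map/reduceByKey steps to a static RDD, which is exactly what Spark Streaming repeats on every 2-second micro-batch.
```python
# Minimal sketch (illustrative only): the streaming pipeline applied to one static batch
from pyspark import SparkContext

sc = SparkContext("local[2]", appName="BatchErrorCountSketch")
batch = sc.parallelize([
    "ERROR disk full",
    "INFO all good",
    "ERROR timeout ERROR retry",
])
counts = batch.flatMap(lambda line: line.split(" ")) \
              .filter(lambda word: "ERROR" in word) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
print(counts.collect())  # [('ERROR', 3)]
sc.stop()
```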
### Open a socket
Open a socket on port 9999 using netcat in a shell:
```shell
$ nc -l -p 9999
```

### Check that the port is open
From another shell, connect to port 9999 with netcat:
```shell
$ nc localhost 9999
```

### Submit the Python script
Submit the application with spark-submit, passing the host and port to listen on:
```shell
$ spark-submit firstStreamApp.py localhost 9999
```

### You'll see time slots
While no data arrives, each 2-second batch is reported empty:
```shell
$ spark-submit firstStreamApp.py localhost 9999
17/04/12 10:38:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-------------------------------------------
Time: 2017-04-12 10:38:18
-------------------------------------------

-------------------------------------------
Time: 2017-04-12 10:38:20
-------------------------------------------

-------------------------------------------
Time: 2017-04-12 10:38:22
-------------------------------------------

-------------------------------------------
Time: 2017-04-12 10:38:24
-------------------------------------------

-------------------------------------------
Time: 2017-04-12 10:38:26
-------------------------------------------

-------------------------------------------
Time: 2017-04-12 10:38:28
-------------------------------------------

```

### Test
In the shell where netcat was launched, type some lines containing "ERROR" to check the count:
```shell
ERROR is there
NOTHING HERE
Everything is ok
ERROR AGAIN
A LOT OF ERRORS



ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR




```
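Typing the test lines by hand is fine; as an optional shortcut you can also generate them and pipe them into netcat in one go. This is only a sketch: the line count is arbitrary, and the exact flags depend on your netcat variant (traditional netcat uses `-l -p`, BSD netcat would be `nc -l 9999`).
```shell
# Serve 40 "ERROR" lines on port 9999, then keep reading stdin so you can type more by hand
$ { for i in $(seq 1 40); do echo "ERROR"; done; cat; } | nc -l -p 9999
```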
### The output
Back in the shell where spark-submit is running, the counts appear for each batch that received data:
```shell
...
17/04/12 10:42:12 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
17/04/12 10:42:12 WARN BlockManager: Block input-0-1491986532000 replicated to only 0 peer(s) instead of 1 peers
-------------------------------------------
Time: 2017-04-12 10:42:12
-------------------------------------------

17/04/12 10:42:13 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
17/04/12 10:42:13 WARN BlockManager: Block input-0-1491986533200 replicated to only 0 peer(s) instead of 1 peers
[Stage 473:> (0 + 0) / 2]17/04/12 10:42:14 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
17/04/12 10:42:14 WARN BlockManager: Block input-0-1491986534000 replicated to only 0 peer(s) instead of 1 peers
-------------------------------------------
Time: 2017-04-12 10:42:14
-------------------------------------------
(u'ERROR', 2)

17/04/12 10:42:14 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
17/04/12 10:42:14 WARN BlockManager: Block input-0-1491986534200 replicated to only 0 peer(s) instead of 1 peers
-------------------------------------------

....



-------------------------------------------
Time: 2017-04-12 10:44:10
-------------------------------------------

17/04/12 10:44:10 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
17/04/12 10:44:10 WARN BlockManager: Block input-0-1491986650600 replicated to only 0 peer(s) instead of 1 peers
17/04/12 10:44:11 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
17/04/12 10:44:11 WARN BlockManager: Block input-0-1491986651600 replicated to only 0 peer(s) instead of 1 peers
-------------------------------------------
Time: 2017-04-12 10:44:12
-------------------------------------------
(u'ERROR', 40)

-------------------------------------------
Time: 2017-04-12 10:44:14
-------------------------------------------


```

## UpdateStateByKey
A stateful transformation on DStreams: it keeps a running state per key across batches.
### updateStateByKey.py
The script to submit for a cumulative word count:
```python
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Begin
if __name__ == "__main__":
    sc = SparkContext(appName="StreamingErrorCount")
    # 2 is the batch interval: 2 seconds
    ssc = StreamingContext(sc, 2)

    # Checkpoint directory, required for stateful transformations
    ssc.checkpoint("file:///tmp/spark")

    # Define the socket the system will listen on
    # lines is not a single RDD but a DStream: a sequence of RDDs, constantly changing
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))

    # Update function: receives the list of new values for a key in the
    # current batch and the previous state (None on the first batch)
    def countWords(newValues, lastSum):
        if lastSum is None:
            lastSum = 0
        return sum(newValues, lastSum)

    word_counts = lines.flatMap(lambda line: line.split(" "))\
                       .map(lambda word: (word, 1))\
                       .updateStateByKey(countWords)

    word_counts.pprint()
    ssc.start()
    ssc.awaitTermination()
```

### Launch the netcat utility as previously
```shell
$ nc -l -p 9999
```

### Submit the Python script
Submit the application, again passing the host and port:
```shell
$ spark-submit updateStateByKey.py localhost 9999
```
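To make the state semantics concrete, here is a tiny, self-contained sketch (plain Python, not part of the repository) of the contract behind the update function: for each key, Spark passes the list of new values from the current batch together with the previous state, and stores whatever the function returns.
```python
def countWords(newValues, lastSum):
    if lastSum is None:
        lastSum = 0
    return sum(newValues, lastSum)

# First batch: the word appeared twice, there is no previous state yet
state = countWords([1, 1], None)
print(state)  # 2

# Next batch: three more occurrences are added to the stored state
state = countWords([1, 1, 1], state)
print(state)  # 5
```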
## countByWindow
A stateful, windowed transformation on DStreams.
### countByWindow.py
The script to submit for a windowed line count:
```python
# Import libs
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Begin
if __name__ == "__main__":
    sc = SparkContext(appName="StreamingcountByWindow")
    # 2 is the batch interval: 2 seconds
    ssc = StreamingContext(sc, 2)

    # Checkpoint directory, required for windowed transformations
    ssc.checkpoint("file:///tmp/spark")

    # Define the socket the system will listen on
    # lines is not a single RDD but a DStream: a sequence of RDDs, constantly changing
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))

    ## window length = 10 s, sliding interval = 2 s
    ## (both must be multiples of the 2 s batch interval)
    counts = lines.countByWindow(10, 2)

    ## Display the counts
    ## Start the program
    ## The program will run until manual termination
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
```

### Launch the netcat utility as previously
```shell
$ nc -l -p 9999
```

### Submit the Python script
Submit the application, again passing the host and port:
```shell
$ spark-submit countByWindow.py localhost 9999
```

## reduceByWindow
A stateful, windowed transformation on DStreams.
### reduceByWindow.py
The script to submit for a windowed sum (each incoming line is expected to be an integer in this example):
```python
# Import libs
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Begin
if __name__ == "__main__":
    sc = SparkContext(appName="StreamingreduceByWindow")
    # 2 is the batch interval: 2 seconds
    ssc = StreamingContext(sc, 2)

    # Checkpoint directory, required for windowed transformations
    ssc.checkpoint("file:///tmp/spark")

    # Define the socket the system will listen on
    # lines is not a single RDD but a DStream: a sequence of RDDs, constantly changing
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))

    ## summary function
    ## inverse (reverse) function
    ## window length = 10 s
    ## sliding interval = 2 s
    sum = lines.reduceByWindow(
        lambda x, y: int(x) + int(y),
        lambda x, y: int(x) - int(y),
        10,
        2
    )

    ## Display the sums
    ## Start the program
    ## The program will run until manual termination
    sum.pprint()
    ssc.start()
    ssc.awaitTermination()
```

### Launch the netcat utility as previously
```shell
$ nc -l -p 9999
```

### Submit the Python script
Submit the application, again passing the host and port:
```shell
$ spark-submit reduceByWindow.py localhost 9999
```

## reduceByKeyAndWindow
A stateful, windowed transformation on DStreams.
### reduceByKeyAndWindow.py
The script to submit for a windowed error count (the content of reduceByKeyAndWindow.py):
```python
# Import libs
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Begin
if __name__ == "__main__":
    sc = SparkContext(appName="StreamingreduceByKeyAndWindow")
    # 2 is the batch interval: 2 seconds
    ssc = StreamingContext(sc, 2)

    # Checkpoint directory, required for windowed transformations
    ssc.checkpoint("file:///tmp/spark")

    # Define the socket the system will listen on
    # lines is not a single RDD but a DStream: a sequence of RDDs, constantly changing
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))

    ## summary function, inverse function
    ## window length = 10 s, sliding interval = 2 s
    counts = lines.flatMap(lambda line: line.split(" "))\
                  .filter(lambda word: "ERROR" in word)\
                  .map(lambda word: (word, 1))\
                  .reduceByKeyAndWindow(lambda x, y: int(x) + int(y), lambda x, y: int(x) - int(y), 10, 2)

    ## Display the counts
    ## Start the program
    ## The program will run until manual termination
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
```

### Launch the netcat utility as previously
```shell
$ nc -l -p 9999
```

### Submit the Python script
Submit the application, again passing the host and port:
```shell
$ spark-submit reduceByKeyAndWindow.py localhost 9999
```
--------------------------------------------------------------------------------
/countByWindow.py:
--------------------------------------------------------------------------------
1 | # Import libs
2 | import sys
3 | from pyspark import SparkContext
4 | from pyspark.streaming import StreamingContext
5 | 
6 | # Begin
7 | if __name__ == "__main__":
8 | sc = SparkContext(appName="StreamingcountByWindow");
9 | # 2 is the batch interval : 2 seconds
10 | ssc = StreamingContext(sc, 2)
11 | 
12 | # Checkpoint for backups
13 | ssc.checkpoint("file:///tmp/spark")
14 | 
15 | # Define the socket where the system will listen
16 | # Lines is not a rdd but a sequence of rdd, not static, constantly changing
17 | lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
18 | 
19 | ## window size = 10, sliding interval = 2
20 | counts = lines.countByWindow(10, 2)
21 | 
22 | ## Display the counts
23 | ## Start the program
24 | ## The program will run until manual termination
25 | counts.pprint()
26 | ssc.start()
27 | ssc.awaitTermination()
28 | 
29 | 
--------------------------------------------------------------------------------
/firstStreamApp.py:
--------------------------------------------------------------------------------
1 | # Import libs
2 | import sys
3 | from pyspark import SparkContext
4 | from pyspark.streaming import StreamingContext
5 | 
6 | # Begin
7 | if __name__ == "__main__":
8 | sc = SparkContext(appName="StreamingErrorCount");
9 | # 2 is the batch interval : 2 seconds
10 | ssc = StreamingContext(sc, 2)
11 | 
12 | # Checkpoint for backups
13 | ssc.checkpoint("file:///tmp/spark")
14 | 
15 | # Define the socket where the system will listen
16 | # 
Lines is not a rdd but a sequence of rdd, not static, constantly changing 17 | lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2])) 18 | 19 | 20 | # Counting errors 21 | ## Split errors 22 | ## filter using the condition Error in splits 23 | ## put one for the concerned errors 24 | ## Counts the by accumulating the sum 25 | counts = lines.flatMap(lambda line: line.split(" "))\ 26 | .filter(lambda word:"ERROR" in word)\ 27 | .map(lambda word : (word, 1))\ 28 | .reduceByKey(lambda a, b : a + b) 29 | 30 | ## Display the counts 31 | ## Start the program 32 | ## The program will run until manual termination 33 | counts.pprint() 34 | ssc.start() 35 | ssc.awaitTermination() 36 | 37 | -------------------------------------------------------------------------------- /getStarted.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 29, 6 | "metadata": { 7 | "collapsed": false, 8 | "scrolled": true 9 | }, 10 | "outputs": [ 11 | { 12 | "name": "stdout", 13 | "output_type": "stream", 14 | "text": [ 15 | "[(u'', 4), (u'perspiciatis', 1), (u'cillum', 1), (u'sunt', 2), (u'r\\xe9el', 1), (u'cupidatat', 1), (u'consectetur', 1), (u'quasi', 1), (u'1.10.32', 1), (u'quisquam', 1), (u'vel', 2), (u'architecto', 1), (u'non', 2), (u'odit', 1), (u'quaerat', 1), (u'proident,', 1), (u'laboriosam,', 1), (u'vitae', 1), (u'Quis', 1), (u'natus', 1), (u'Section', 1), (u'ea', 3), (u'sequi', 1), (u'illo', 1), (u'eu', 1), (u'adipisci', 1), (u'aliqua.', 1), (u'veritatis', 1), (u'incidunt', 1), (u'nostrud', 1), (u'aliquid', 1), (u'aliquip', 1), (u'sint', 1), (u'donn\\xe9.', 1), (u'eum', 2), (u'suscipit', 1), (u'unde', 1), (u'voluptas', 2), (u'pas.', 1), (u'Ciceron', 1), (u'temps', 1), (u'voluptatem.', 1), (u'accusantium', 1), (u'opposition', 1), (u'qui', 6), (u'quo', 1), (u'Excepteur', 1), (u'amet,', 2), (u'instant', 1), (u'culpa', 1), (u'(45', 1), (u'dicta', 1), (u'produites', 1), (u'\"De', 1), (u'pratique,', 1), (u'ullam', 1), (u'Storm', 1), (u'totam', 1), (u'quis', 2), (u'molestiae', 1), (u'quia', 4), (u'nesciunt.', 1), (u'officia', 1), (u'connues', 1), (u'eaque', 1), (u'\"Sed', 1), (u'quam', 1), (u'quae', 1), (u'deserunt', 1), (u'continu,', 1), (u'consequuntur', 1), (u'voire', 1), (u'r\\xe9pond', 1), (u'eius', 1), (u'irure', 1), (u'illum', 1), (u'au', 1), (u'par', 1), (u'fugiat', 2), (u'consequatur,', 1), (u'beatae', 1), (u'traiter', 1), (u'Spark', 4), (u'cadence', 1), (u'est,', 1), (u'reprehenderit', 2), (u'pr\\xe9f\\xe8rera', 1), (u'av.', 1), (u'esse', 2), (u'ullamco', 1), (u'eos', 1), (u'si', 2), (u'velit,', 1), (u'exercitationem', 1), (u'laudantium,', 1), (u'secondes,', 1), (u'secondes.', 1), (u'eiusmod', 1), (u'le', 1), (u'la', 2), (u'dolore', 3), (u'do', 1), (u'Neque', 1), (u'de', 8), (u'aperiam,', 1), (u'nihil', 1), (u'du', 1), (u'magnam', 1), (u'ipsum', 2), (u'nisi', 2), (u'occaecat', 1), (u'flux', 1), (u'consequat.', 1), (u'iure', 1), (u'modi', 1), (u'dixi\\xe8mes', 1), (u'en', 2), (u'traitement', 3), (u'consequatur?', 1), (u'donn\\xe9es', 3), (u'ex', 2), (u'quelques', 2), (u'\"Lorem', 1), (u'labore', 2), (u'ratione', 1), (u'nostrum', 1), (u'Streaming', 3), (u'rem', 1), (u'mollit', 1), (u'Malorum\"', 1), (u'Ut', 2), (u'velit', 2), (u'magna', 1), (u'Finibus', 1), (u'aspernatur', 1), (u'ipsa', 1), (u'aliquam', 1), (u'on', 1), (u'En', 1), (u'tempor', 1), (u'fugit,', 1), (u'Bonorum', 1), (u'voluptate', 2), (u'nulla', 2), (u'autem', 1), (u'adapt\\xe9', 1), (u'minima', 1), (u'dolores', 1), (u'laboris', 1), 
(u'elit,', 1), (u'enim', 3), (u'veniam,', 2), (u'ipsam', 1), (u'probl\\xe9matique', 1), (u'voluptatem', 3), (u'magni', 1), (u'aut', 2), (u'\\xe0', 4), (u'et', 4), (u'explicabo.', 1), (u'pariatur.', 1), (u'consectetur,', 1), (u'sera', 1), (u'corporis', 1), (u'ut', 5), (u'des', 2), (u'un', 1), (u'error', 1), (u'J.-C.)', 1), (u'omnis', 1), (u'exercitation', 1), (u'ab', 1), (u'ad', 2), (u'aute', 1), (u'in', 4), (u'id', 1), (u'incididunt', 1), (u'numquam', 1), (u'iste', 1), (u'minim', 1), (u'inventore', 1), (u'ne', 1), (u'tempora', 1), (u'laborum.\"', 1), (u'doloremque', 1), (u'commodo', 1), (u'commodi', 1), (u'porro', 1), (u'minimum', 1), (u'est', 2), (u'permet', 1), (u'sed', 3), (u's\\u2019impose', 1), (u'sit', 4), (u'pariatur?\"', 1), (u'Nemo', 1), (u'Duis', 1), (u'anim', 1), (u'adipiscing', 1), (u'dolor', 3), (u'dolorem', 2)]\n" 16 | ] 17 | } 18 | ], 19 | "source": [ 20 | "students = sc.textFile('students.csv')\n", 21 | "#dir (students)\n", 22 | "# Display the dataset\n", 23 | "#print students.collect()\n", 24 | "\n", 25 | "# Get a limited number\n", 26 | "#students.take(2)\n", 27 | "\n", 28 | "# filter\n", 29 | "someStudents = students.filter(lambda row : 'I' in row)\n", 30 | "\n", 31 | "# print someStudents.collect();\n", 32 | "\n", 33 | "\n", 34 | "# Map\n", 35 | "names = students.map(lambda row : row.split(',')[0])\n", 36 | "#print names.collect()\n", 37 | "\n", 38 | "\n", 39 | "# Map Reduce word count\n", 40 | "lines = sc.textFile('wordcount.txt')\n", 41 | "words = lines.flatMap(lambda line : line.split(' '))\n", 42 | "# print words.collect()\n", 43 | "counts = words.map(lambda word : (word, 1)).reduceByKey(lambda a, b : a + b)\n", 44 | "\n", 45 | "print counts.collect()" 46 | ] 47 | } 48 | ], 49 | "metadata": { 50 | "kernelspec": { 51 | "display_name": "Python 2", 52 | "language": "python", 53 | "name": "python2" 54 | }, 55 | "language_info": { 56 | "codemirror_mode": { 57 | "name": "ipython", 58 | "version": 2 59 | }, 60 | "file_extension": ".py", 61 | "mimetype": "text/x-python", 62 | "name": "python", 63 | "nbconvert_exporter": "python", 64 | "pygments_lexer": "ipython2", 65 | "version": "2.7.13" 66 | } 67 | }, 68 | "nbformat": 4, 69 | "nbformat_minor": 2 70 | } 71 | -------------------------------------------------------------------------------- /reduceByKeyAndWindow.py: -------------------------------------------------------------------------------- 1 | # Import libs 2 | import sys 3 | from pyspark import SparkContext 4 | from pyspark.streaming import StreamingContext 5 | 6 | # Begin 7 | if __name__ == "__main__": 8 | sc = SparkContext(appName="StreamingreduceByKeyAndWindow"); 9 | # 2 is the batch interval : 2 seconds 10 | ssc = StreamingContext(sc, 2) 11 | 12 | # Checkpoint for backups 13 | ssc.checkpoint("file:///tmp/spark") 14 | 15 | # Define the socket where the system will listen 16 | # Lines is not a rdd but a sequence of rdd, not static, constantly changing 17 | lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2])) 18 | 19 | 20 | # Counting errors 21 | ## Split errors 22 | ## filter using the condition Error in splits 23 | ## put one for the concerned errors 24 | ## Counts the by accumulating the sum 25 | 26 | 27 | ## summary function 28 | ## reverse function 29 | ## window size = 10 30 | ## sliding interval = 2 31 | counts = lines.flatMap(lambda line: line.split(" "))\ 32 | .filter(lambda word:"ERROR" in word)\ 33 | .map(lambda word : (word, 1))\ 34 | .reduceByKeyAndWindow(lambda x, y: int(x) + int(y), lambda x, y: int(x) - int(y), 10, 2) 35 | 36 | ## Display the 
counts 37 | ## Start the program 38 | ## The program will run until manual termination 39 | counts.pprint() 40 | ssc.start() 41 | ssc.awaitTermination() 42 | 43 | -------------------------------------------------------------------------------- /reduceByWindow.py: -------------------------------------------------------------------------------- 1 | # Import libs 2 | import sys 3 | from pyspark import SparkContext 4 | from pyspark.streaming import StreamingContext 5 | 6 | # Begin 7 | if __name__ == "__main__": 8 | sc = SparkContext(appName="StreamingreduceByWindow"); 9 | # 2 is the batch interval : 2 seconds 10 | ssc = StreamingContext(sc, 2) 11 | 12 | # Checkpoint for backups 13 | ssc.checkpoint("file:///tmp/spark") 14 | 15 | # Define the socket where the system will listen 16 | # Lines is not a rdd but a sequence of rdd, not static, constantly changing 17 | lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2])) 18 | 19 | ## summary function 20 | ## reverse function 21 | ## window size = 10 22 | ## sliding interval = 2 23 | sum = lines.reduceByWindow( 24 | lambda x, y: int(x) + int(y), 25 | lambda x, y: int(x) - int(y), 26 | 10, 27 | 2 28 | ) 29 | 30 | ## Display the counts 31 | ## Start the program 32 | ## The program will run until manual termination 33 | sum.pprint() 34 | ssc.start() 35 | ssc.awaitTermination() 36 | 37 | -------------------------------------------------------------------------------- /students.csv: -------------------------------------------------------------------------------- 1 | NAME, MARKS 2 | BADO, 12 3 | KAIA, 15 4 | SANOU, 67 5 | ILLY, 56 6 | DUPOND, 89 7 | -------------------------------------------------------------------------------- /updateStateByKey.py: -------------------------------------------------------------------------------- 1 | # Import libs 2 | import sys 3 | from pyspark import SparkContext 4 | from pyspark.streaming import StreamingContext 5 | 6 | # Begin 7 | if __name__ == "__main__": 8 | sc = SparkContext(appName="StreamingErrorCount"); 9 | # 2 is the batch interval : 2 seconds 10 | ssc = StreamingContext(sc, 2) 11 | 12 | # Checkpoint for backups 13 | ssc.checkpoint("file:///tmp/spark") 14 | 15 | # Define the socket where the system will listen 16 | # Lines is not a rdd but a sequence of rdd, not static, constantly changing 17 | lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2])) 18 | 19 | 20 | 21 | # Update function 22 | def countWords(newValues, lastSum): 23 | if lastSum is None : 24 | lastSum = 0 25 | return sum(newValues, lastSum) 26 | 27 | word_counts = lines.flatMap(lambda line: line.split(" "))\ 28 | .map(lambda word : (word, 1))\ 29 | .updateStateByKey(countWords) 30 | 31 | ## Display the counts 32 | ## Start the program 33 | ## The program will run until manual termination 34 | word_counts.pprint() 35 | ssc.start() 36 | ssc.awaitTermination() 37 | 38 | -------------------------------------------------------------------------------- /wordcount.txt: -------------------------------------------------------------------------------- 1 | Spark Streaming répond à la problématique de traitement de données produites en flux continu, par opposition à Spark qui permet de traiter des données connues à un instant donné. 2 | 3 | En pratique, on préfèrera Spark Streaming à Storm si le traitement en temps réel des données ne s’impose pas. Spark Streaming sera adapté si la cadence de traitement est au minimum de quelques dixièmes de secondes, voire de quelques secondes. 
4 | 5 | "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum." 6 | 7 | Section 1.10.32 du "De Finibus Bonorum et Malorum" de Ciceron (45 av. J.-C.) 8 | 9 | "Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur?" --------------------------------------------------------------------------------