├── README.md
├── countByWindow.py
├── firstStreamApp.py
├── getStarted.ipynb
├── reduceByKeyAndWindow.py
├── reduceByWindow.py
├── students.csv
├── updateStateByKey.py
└── wordcount.txt

/README.md:
--------------------------------------------------------------------------------
# Get started with Spark Streaming

## Installation
Follow this tutorial to install Spark:
https://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm

## A streaming example
Here we will simply count, in near real time, the number of "ERROR" occurrences arriving on a socket.

### firstStreamApp.py
The example to run with Python:
```python
# Import libs
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Begin
if __name__ == "__main__":
    sc = SparkContext(appName="StreamingErrorCount")
    # 2 is the batch interval: 2 seconds
    ssc = StreamingContext(sc, 2)

    # Checkpoint directory for recovery
    ssc.checkpoint("file:///tmp/spark")

    # Define the socket the system will listen on
    # lines is not a single RDD but a DStream: a sequence of RDDs, constantly changing
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))

    # Counting errors
    counts = lines.flatMap(lambda line: line.split(" "))\
                  .filter(lambda word: "ERROR" in word)\
                  .map(lambda word: (word, 1))\
                  .reduceByKey(lambda a, b: a + b)
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
```
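Before running it as a stream, it helps to see what the transformation chain computes on a single batch. The sketch below is not part of the repository (the sample lines and app name are made up); it applies the same flatMap/filter/map/reduceByKey steps to a static RDD, which is exactly what Spark Streaming repeats on every 2-second micro-batch.
```python
# Minimal sketch (illustrative only): the streaming pipeline applied to one static batch
from pyspark import SparkContext

sc = SparkContext("local[2]", appName="BatchErrorCountSketch")
batch = sc.parallelize([
    "ERROR disk full",
    "INFO all good",
    "ERROR timeout ERROR retry",
])
counts = batch.flatMap(lambda line: line.split(" ")) \
              .filter(lambda word: "ERROR" in word) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
print(counts.collect())  # [('ERROR', 3)]
sc.stop()
```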
### Open a socket
Open a socket on port 9999 using netcat in a shell:
```shell
$ nc -l -p 9999
```

### Check that the port is open
From another shell, connect to port 9999 with netcat:
```shell
$ nc localhost 9999
```

### Submit the Python script
Submit the application with spark-submit, passing the host and port to listen on:
```shell
$ spark-submit firstStreamApp.py localhost 9999
```

### You'll see time slots
While no data arrives, each 2-second batch is reported empty:
```shell
$ spark-submit firstStreamApp.py localhost 9999
17/04/12 10:38:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-------------------------------------------
Time: 2017-04-12 10:38:18
-------------------------------------------

-------------------------------------------
Time: 2017-04-12 10:38:20
-------------------------------------------

-------------------------------------------
Time: 2017-04-12 10:38:22
-------------------------------------------

-------------------------------------------
Time: 2017-04-12 10:38:24
-------------------------------------------

-------------------------------------------
Time: 2017-04-12 10:38:26
-------------------------------------------

-------------------------------------------
Time: 2017-04-12 10:38:28
-------------------------------------------

```

### Test
In the shell where netcat was launched, type some lines containing "ERROR" to check the count:
```shell
ERROR is there
NOTHING HERE
Everything is ok
ERROR AGAIN
A LOT OF ERRORS



ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR




```
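Typing the test lines by hand is fine; as an optional shortcut you can also generate them and pipe them into netcat in one go. This is only a sketch: the line count is arbitrary, and the exact flags depend on your netcat variant (traditional netcat uses `-l -p`, BSD netcat would be `nc -l 9999`).
```shell
# Serve 40 "ERROR" lines on port 9999, then keep reading stdin so you can type more by hand
$ { for i in $(seq 1 40); do echo "ERROR"; done; cat; } | nc -l -p 9999
```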
### The output
Back in the shell where spark-submit is running, the counts appear for each batch that received data:
```shell
...
17/04/12 10:42:12 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
17/04/12 10:42:12 WARN BlockManager: Block input-0-1491986532000 replicated to only 0 peer(s) instead of 1 peers
-------------------------------------------
Time: 2017-04-12 10:42:12
-------------------------------------------

17/04/12 10:42:13 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
17/04/12 10:42:13 WARN BlockManager: Block input-0-1491986533200 replicated to only 0 peer(s) instead of 1 peers
[Stage 473:> (0 + 0) / 2]17/04/12 10:42:14 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
17/04/12 10:42:14 WARN BlockManager: Block input-0-1491986534000 replicated to only 0 peer(s) instead of 1 peers
-------------------------------------------
Time: 2017-04-12 10:42:14
-------------------------------------------
(u'ERROR', 2)

17/04/12 10:42:14 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
17/04/12 10:42:14 WARN BlockManager: Block input-0-1491986534200 replicated to only 0 peer(s) instead of 1 peers
-------------------------------------------

....



-------------------------------------------
Time: 2017-04-12 10:44:10
-------------------------------------------

17/04/12 10:44:10 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
17/04/12 10:44:10 WARN BlockManager: Block input-0-1491986650600 replicated to only 0 peer(s) instead of 1 peers
17/04/12 10:44:11 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
17/04/12 10:44:11 WARN BlockManager: Block input-0-1491986651600 replicated to only 0 peer(s) instead of 1 peers
-------------------------------------------
Time: 2017-04-12 10:44:12
-------------------------------------------
(u'ERROR', 40)

-------------------------------------------
Time: 2017-04-12 10:44:14
-------------------------------------------


```

## UpdateStateByKey
A stateful transformation on DStreams: it keeps a running state per key across batches.
### updateStateByKey.py
The script to submit for a cumulative word count:
```python
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Begin
if __name__ == "__main__":
    sc = SparkContext(appName="StreamingErrorCount")
    # 2 is the batch interval: 2 seconds
    ssc = StreamingContext(sc, 2)

    # Checkpoint directory, required for stateful transformations
    ssc.checkpoint("file:///tmp/spark")

    # Define the socket the system will listen on
    # lines is not a single RDD but a DStream: a sequence of RDDs, constantly changing
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))

    # Update function: receives the list of new values for a key in the
    # current batch and the previous state (None on the first batch)
    def countWords(newValues, lastSum):
        if lastSum is None:
            lastSum = 0
        return sum(newValues, lastSum)

    word_counts = lines.flatMap(lambda line: line.split(" "))\
                       .map(lambda word: (word, 1))\
                       .updateStateByKey(countWords)

    word_counts.pprint()
    ssc.start()
    ssc.awaitTermination()
```

### Launch the netcat utility as previously
```shell
$ nc -l -p 9999
```

### Submit the Python script
Submit the application, again passing the host and port:
```shell
$ spark-submit updateStateByKey.py localhost 9999
```
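To make the state semantics concrete, here is a tiny, self-contained sketch (plain Python, not part of the repository) of the contract behind the update function: for each key, Spark passes the list of new values from the current batch together with the previous state, and stores whatever the function returns.
```python
def countWords(newValues, lastSum):
    if lastSum is None:
        lastSum = 0
    return sum(newValues, lastSum)

# First batch: the word appeared twice, there is no previous state yet
state = countWords([1, 1], None)
print(state)  # 2

# Next batch: three more occurrences are added to the stored state
state = countWords([1, 1, 1], state)
print(state)  # 5
```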
## countByWindow
A stateful, windowed transformation on DStreams.
### countByWindow.py
The script to submit for a windowed line count:
```python
# Import libs
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Begin
if __name__ == "__main__":
    sc = SparkContext(appName="StreamingcountByWindow")
    # 2 is the batch interval: 2 seconds
    ssc = StreamingContext(sc, 2)

    # Checkpoint directory, required for windowed transformations
    ssc.checkpoint("file:///tmp/spark")

    # Define the socket the system will listen on
    # lines is not a single RDD but a DStream: a sequence of RDDs, constantly changing
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))

    ## window length = 10 s, sliding interval = 2 s
    ## (both must be multiples of the 2 s batch interval)
    counts = lines.countByWindow(10, 2)

    ## Display the counts
    ## Start the program
    ## The program will run until manual termination
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
```

### Launch the netcat utility as previously
```shell
$ nc -l -p 9999
```

### Submit the Python script
Submit the application, again passing the host and port:
```shell
$ spark-submit countByWindow.py localhost 9999
```

## reduceByWindow
A stateful, windowed transformation on DStreams.
### reduceByWindow.py
The script to submit for a windowed sum (each incoming line is expected to be an integer in this example):
```python
# Import libs
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Begin
if __name__ == "__main__":
    sc = SparkContext(appName="StreamingreduceByWindow")
    # 2 is the batch interval: 2 seconds
    ssc = StreamingContext(sc, 2)

    # Checkpoint directory, required for windowed transformations
    ssc.checkpoint("file:///tmp/spark")

    # Define the socket the system will listen on
    # lines is not a single RDD but a DStream: a sequence of RDDs, constantly changing
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))

    ## summary function
    ## inverse (reverse) function
    ## window length = 10 s
    ## sliding interval = 2 s
    sum = lines.reduceByWindow(
        lambda x, y: int(x) + int(y),
        lambda x, y: int(x) - int(y),
        10,
        2
    )

    ## Display the sums
    ## Start the program
    ## The program will run until manual termination
    sum.pprint()
    ssc.start()
    ssc.awaitTermination()
```

### Launch the netcat utility as previously
```shell
$ nc -l -p 9999
```

### Submit the Python script
Submit the application, again passing the host and port:
```shell
$ spark-submit reduceByWindow.py localhost 9999
```

## reduceByKeyAndWindow
A stateful, windowed transformation on DStreams.
### reduceByKeyAndWindow.py
The script to submit for a windowed error count (the content of reduceByKeyAndWindow.py):
```python
# Import libs
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Begin
if __name__ == "__main__":
    sc = SparkContext(appName="StreamingreduceByKeyAndWindow")
    # 2 is the batch interval: 2 seconds
    ssc = StreamingContext(sc, 2)

    # Checkpoint directory, required for windowed transformations
    ssc.checkpoint("file:///tmp/spark")

    # Define the socket the system will listen on
    # lines is not a single RDD but a DStream: a sequence of RDDs, constantly changing
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))

    ## summary function, inverse function
    ## window length = 10 s, sliding interval = 2 s
    counts = lines.flatMap(lambda line: line.split(" "))\
                  .filter(lambda word: "ERROR" in word)\
                  .map(lambda word: (word, 1))\
                  .reduceByKeyAndWindow(lambda x, y: int(x) + int(y), lambda x, y: int(x) - int(y), 10, 2)

    ## Display the counts
    ## Start the program
    ## The program will run until manual termination
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
```

### Launch the netcat utility as previously
```shell
$ nc -l -p 9999
```

### Submit the Python script
Submit the application, again passing the host and port:
```shell
$ spark-submit reduceByKeyAndWindow.py localhost 9999
```
--------------------------------------------------------------------------------
/countByWindow.py:
--------------------------------------------------------------------------------
1 | # Import libs
2 | import sys
3 | from pyspark import SparkContext
4 | from pyspark.streaming import StreamingContext
5 | 
6 | # Begin
7 | if __name__ == "__main__":
8 | sc = SparkContext(appName="StreamingcountByWindow");
9 | # 2 is the batch interval : 2 seconds
10 | ssc = StreamingContext(sc, 2)
11 | 
12 | # Checkpoint for backups
13 | ssc.checkpoint("file:///tmp/spark")
14 | 
15 | # Define the socket where the system will listen
16 | # Lines is not a rdd but a sequence of rdd, not static, constantly changing
17 | lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
18 | 
19 | ## window size = 10, sliding interval = 2
20 | counts = lines.countByWindow(10, 2)
21 | 
22 | ## Display the counts
23 | ## Start the program
24 | ## The program will run until manual termination
25 | counts.pprint()
26 | ssc.start()
27 | ssc.awaitTermination()
28 | 
29 | 
--------------------------------------------------------------------------------
/firstStreamApp.py:
--------------------------------------------------------------------------------
1 | # Import libs
2 | import sys
3 | from pyspark import SparkContext
4 | from pyspark.streaming import StreamingContext
5 | 
6 | # Begin
7 | if __name__ == "__main__":
8 | sc = SparkContext(appName="StreamingErrorCount");
9 | # 2 is the batch interval : 2 seconds
10 | ssc = StreamingContext(sc, 2)
11 | 
12 | # Checkpoint for backups
13 | ssc.checkpoint("file:///tmp/spark")
14 | 
15 | # Define the socket where the system will listen
16 | # 
Lines is not a rdd but a sequence of rdd, not static, constantly changing 17 | lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2])) 18 | 19 | 20 | # Counting errors 21 | ## Split errors 22 | ## filter using the condition Error in splits 23 | ## put one for the concerned errors 24 | ## Counts the by accumulating the sum 25 | counts = lines.flatMap(lambda line: line.split(" "))\ 26 | .filter(lambda word:"ERROR" in word)\ 27 | .map(lambda word : (word, 1))\ 28 | .reduceByKey(lambda a, b : a + b) 29 | 30 | ## Display the counts 31 | ## Start the program 32 | ## The program will run until manual termination 33 | counts.pprint() 34 | ssc.start() 35 | ssc.awaitTermination() 36 | 37 | -------------------------------------------------------------------------------- /getStarted.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 29, 6 | "metadata": { 7 | "collapsed": false, 8 | "scrolled": true 9 | }, 10 | "outputs": [ 11 | { 12 | "name": "stdout", 13 | "output_type": "stream", 14 | "text": [ 15 | "[(u'', 4), (u'perspiciatis', 1), (u'cillum', 1), (u'sunt', 2), (u'r\\xe9el', 1), (u'cupidatat', 1), (u'consectetur', 1), (u'quasi', 1), (u'1.10.32', 1), (u'quisquam', 1), (u'vel', 2), (u'architecto', 1), (u'non', 2), (u'odit', 1), (u'quaerat', 1), (u'proident,', 1), (u'laboriosam,', 1), (u'vitae', 1), (u'Quis', 1), (u'natus', 1), (u'Section', 1), (u'ea', 3), (u'sequi', 1), (u'illo', 1), (u'eu', 1), (u'adipisci', 1), (u'aliqua.', 1), (u'veritatis', 1), (u'incidunt', 1), (u'nostrud', 1), (u'aliquid', 1), (u'aliquip', 1), (u'sint', 1), (u'donn\\xe9.', 1), (u'eum', 2), (u'suscipit', 1), (u'unde', 1), (u'voluptas', 2), (u'pas.', 1), (u'Ciceron', 1), (u'temps', 1), (u'voluptatem.', 1), (u'accusantium', 1), (u'opposition', 1), (u'qui', 6), (u'quo', 1), (u'Excepteur', 1), (u'amet,', 2), (u'instant', 1), (u'culpa', 1), (u'(45', 1), (u'dicta', 1), (u'produites', 1), (u'\"De', 1), (u'pratique,', 1), (u'ullam', 1), (u'Storm', 1), (u'totam', 1), (u'quis', 2), (u'molestiae', 1), (u'quia', 4), (u'nesciunt.', 1), (u'officia', 1), (u'connues', 1), (u'eaque', 1), (u'\"Sed', 1), (u'quam', 1), (u'quae', 1), (u'deserunt', 1), (u'continu,', 1), (u'consequuntur', 1), (u'voire', 1), (u'r\\xe9pond', 1), (u'eius', 1), (u'irure', 1), (u'illum', 1), (u'au', 1), (u'par', 1), (u'fugiat', 2), (u'consequatur,', 1), (u'beatae', 1), (u'traiter', 1), (u'Spark', 4), (u'cadence', 1), (u'est,', 1), (u'reprehenderit', 2), (u'pr\\xe9f\\xe8rera', 1), (u'av.', 1), (u'esse', 2), (u'ullamco', 1), (u'eos', 1), (u'si', 2), (u'velit,', 1), (u'exercitationem', 1), (u'laudantium,', 1), (u'secondes,', 1), (u'secondes.', 1), (u'eiusmod', 1), (u'le', 1), (u'la', 2), (u'dolore', 3), (u'do', 1), (u'Neque', 1), (u'de', 8), (u'aperiam,', 1), (u'nihil', 1), (u'du', 1), (u'magnam', 1), (u'ipsum', 2), (u'nisi', 2), (u'occaecat', 1), (u'flux', 1), (u'consequat.', 1), (u'iure', 1), (u'modi', 1), (u'dixi\\xe8mes', 1), (u'en', 2), (u'traitement', 3), (u'consequatur?', 1), (u'donn\\xe9es', 3), (u'ex', 2), (u'quelques', 2), (u'\"Lorem', 1), (u'labore', 2), (u'ratione', 1), (u'nostrum', 1), (u'Streaming', 3), (u'rem', 1), (u'mollit', 1), (u'Malorum\"', 1), (u'Ut', 2), (u'velit', 2), (u'magna', 1), (u'Finibus', 1), (u'aspernatur', 1), (u'ipsa', 1), (u'aliquam', 1), (u'on', 1), (u'En', 1), (u'tempor', 1), (u'fugit,', 1), (u'Bonorum', 1), (u'voluptate', 2), (u'nulla', 2), (u'autem', 1), (u'adapt\\xe9', 1), (u'minima', 1), (u'dolores', 1), (u'laboris', 1), 
(u'elit,', 1), (u'enim', 3), (u'veniam,', 2), (u'ipsam', 1), (u'probl\\xe9matique', 1), (u'voluptatem', 3), (u'magni', 1), (u'aut', 2), (u'\\xe0', 4), (u'et', 4), (u'explicabo.', 1), (u'pariatur.', 1), (u'consectetur,', 1), (u'sera', 1), (u'corporis', 1), (u'ut', 5), (u'des', 2), (u'un', 1), (u'error', 1), (u'J.-C.)', 1), (u'omnis', 1), (u'exercitation', 1), (u'ab', 1), (u'ad', 2), (u'aute', 1), (u'in', 4), (u'id', 1), (u'incididunt', 1), (u'numquam', 1), (u'iste', 1), (u'minim', 1), (u'inventore', 1), (u'ne', 1), (u'tempora', 1), (u'laborum.\"', 1), (u'doloremque', 1), (u'commodo', 1), (u'commodi', 1), (u'porro', 1), (u'minimum', 1), (u'est', 2), (u'permet', 1), (u'sed', 3), (u's\\u2019impose', 1), (u'sit', 4), (u'pariatur?\"', 1), (u'Nemo', 1), (u'Duis', 1), (u'anim', 1), (u'adipiscing', 1), (u'dolor', 3), (u'dolorem', 2)]\n" 16 | ] 17 | } 18 | ], 19 | "source": [ 20 | "students = sc.textFile('students.csv')\n", 21 | "#dir (students)\n", 22 | "# Display the dataset\n", 23 | "#print students.collect()\n", 24 | "\n", 25 | "# Get a limited number\n", 26 | "#students.take(2)\n", 27 | "\n", 28 | "# filter\n", 29 | "someStudents = students.filter(lambda row : 'I' in row)\n", 30 | "\n", 31 | "# print someStudents.collect();\n", 32 | "\n", 33 | "\n", 34 | "# Map\n", 35 | "names = students.map(lambda row : row.split(',')[0])\n", 36 | "#print names.collect()\n", 37 | "\n", 38 | "\n", 39 | "# Map Reduce word count\n", 40 | "lines = sc.textFile('wordcount.txt')\n", 41 | "words = lines.flatMap(lambda line : line.split(' '))\n", 42 | "# print words.collect()\n", 43 | "counts = words.map(lambda word : (word, 1)).reduceByKey(lambda a, b : a + b)\n", 44 | "\n", 45 | "print counts.collect()" 46 | ] 47 | } 48 | ], 49 | "metadata": { 50 | "kernelspec": { 51 | "display_name": "Python 2", 52 | "language": "python", 53 | "name": "python2" 54 | }, 55 | "language_info": { 56 | "codemirror_mode": { 57 | "name": "ipython", 58 | "version": 2 59 | }, 60 | "file_extension": ".py", 61 | "mimetype": "text/x-python", 62 | "name": "python", 63 | "nbconvert_exporter": "python", 64 | "pygments_lexer": "ipython2", 65 | "version": "2.7.13" 66 | } 67 | }, 68 | "nbformat": 4, 69 | "nbformat_minor": 2 70 | } 71 | -------------------------------------------------------------------------------- /reduceByKeyAndWindow.py: -------------------------------------------------------------------------------- 1 | # Import libs 2 | import sys 3 | from pyspark import SparkContext 4 | from pyspark.streaming import StreamingContext 5 | 6 | # Begin 7 | if __name__ == "__main__": 8 | sc = SparkContext(appName="StreamingreduceByKeyAndWindow"); 9 | # 2 is the batch interval : 2 seconds 10 | ssc = StreamingContext(sc, 2) 11 | 12 | # Checkpoint for backups 13 | ssc.checkpoint("file:///tmp/spark") 14 | 15 | # Define the socket where the system will listen 16 | # Lines is not a rdd but a sequence of rdd, not static, constantly changing 17 | lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2])) 18 | 19 | 20 | # Counting errors 21 | ## Split errors 22 | ## filter using the condition Error in splits 23 | ## put one for the concerned errors 24 | ## Counts the by accumulating the sum 25 | 26 | 27 | ## summary function 28 | ## reverse function 29 | ## window size = 10 30 | ## sliding interval = 2 31 | counts = lines.flatMap(lambda line: line.split(" "))\ 32 | .filter(lambda word:"ERROR" in word)\ 33 | .map(lambda word : (word, 1))\ 34 | .reduceByKeyAndWindow(lambda x, y: int(x) + int(y), lambda x, y: int(x) - int(y), 10, 2) 35 | 36 | ## Display the 
counts 37 | ## Start the program 38 | ## The program will run until manual termination 39 | counts.pprint() 40 | ssc.start() 41 | ssc.awaitTermination() 42 | 43 | -------------------------------------------------------------------------------- /reduceByWindow.py: -------------------------------------------------------------------------------- 1 | # Import libs 2 | import sys 3 | from pyspark import SparkContext 4 | from pyspark.streaming import StreamingContext 5 | 6 | # Begin 7 | if __name__ == "__main__": 8 | sc = SparkContext(appName="StreamingreduceByWindow"); 9 | # 2 is the batch interval : 2 seconds 10 | ssc = StreamingContext(sc, 2) 11 | 12 | # Checkpoint for backups 13 | ssc.checkpoint("file:///tmp/spark") 14 | 15 | # Define the socket where the system will listen 16 | # Lines is not a rdd but a sequence of rdd, not static, constantly changing 17 | lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2])) 18 | 19 | ## summary function 20 | ## reverse function 21 | ## window size = 10 22 | ## sliding interval = 2 23 | sum = lines.reduceByWindow( 24 | lambda x, y: int(x) + int(y), 25 | lambda x, y: int(x) - int(y), 26 | 10, 27 | 2 28 | ) 29 | 30 | ## Display the counts 31 | ## Start the program 32 | ## The program will run until manual termination 33 | sum.pprint() 34 | ssc.start() 35 | ssc.awaitTermination() 36 | 37 | -------------------------------------------------------------------------------- /students.csv: -------------------------------------------------------------------------------- 1 | NAME, MARKS 2 | BADO, 12 3 | KAIA, 15 4 | SANOU, 67 5 | ILLY, 56 6 | DUPOND, 89 7 | -------------------------------------------------------------------------------- /updateStateByKey.py: -------------------------------------------------------------------------------- 1 | # Import libs 2 | import sys 3 | from pyspark import SparkContext 4 | from pyspark.streaming import StreamingContext 5 | 6 | # Begin 7 | if __name__ == "__main__": 8 | sc = SparkContext(appName="StreamingErrorCount"); 9 | # 2 is the batch interval : 2 seconds 10 | ssc = StreamingContext(sc, 2) 11 | 12 | # Checkpoint for backups 13 | ssc.checkpoint("file:///tmp/spark") 14 | 15 | # Define the socket where the system will listen 16 | # Lines is not a rdd but a sequence of rdd, not static, constantly changing 17 | lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2])) 18 | 19 | 20 | 21 | # Update function 22 | def countWords(newValues, lastSum): 23 | if lastSum is None : 24 | lastSum = 0 25 | return sum(newValues, lastSum) 26 | 27 | word_counts = lines.flatMap(lambda line: line.split(" "))\ 28 | .map(lambda word : (word, 1))\ 29 | .updateStateByKey(countWords) 30 | 31 | ## Display the counts 32 | ## Start the program 33 | ## The program will run until manual termination 34 | word_counts.pprint() 35 | ssc.start() 36 | ssc.awaitTermination() 37 | 38 | -------------------------------------------------------------------------------- /wordcount.txt: -------------------------------------------------------------------------------- 1 | Spark Streaming répond à la problématique de traitement de données produites en flux continu, par opposition à Spark qui permet de traiter des données connues à un instant donné. 2 | 3 | En pratique, on préfèrera Spark Streaming à Storm si le traitement en temps réel des données ne s’impose pas. Spark Streaming sera adapté si la cadence de traitement est au minimum de quelques dixièmes de secondes, voire de quelques secondes. 
4 | 5 | "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum." 6 | 7 | Section 1.10.32 du "De Finibus Bonorum et Malorum" de Ciceron (45 av. J.-C.) 8 | 9 | "Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur?" --------------------------------------------------------------------------------