├── README.md
├── image.png
├── iotsimulator.py
└── kafka-direct-iotmsg.py

/README.md:
--------------------------------------------------------------------------------
1 | IoT: Real-time Data Processing and Analytics using Apache Spark / Kafka
2 | =============================================
3 | 
4 | ![ProjectOverview](image.png)
5 | 
6 | ## Table of Contents
7 | 1. [Overview](#1-overview)
8 | 2. [Format of sensor data](#2-format-of-sensor-data)
9 | 3. [Analysis of data](#3-analysis-of-data)
10 | 4. [Results](#4-results)
11 | 
12 | ## 1. Overview
13 | 
14 | ##### Use case
15 | - Analyzing U.S. nationwide temperature from IoT sensors in real-time
16 | 
17 | ##### Project Scenario:
18 | - Multiple temperature sensors are deployed in each U.S. state
19 | - Each sensor regularly sends temperature data to a Kafka server in the AWS Cloud (simulated by feeding 10,000 JSON messages through kafka-console-producer)
20 | - A Kafka client retrieves the streaming data every 3 seconds
21 | - PySpark processes and analyzes the data in real-time with Spark Streaming, and shows the results
22 | 
23 | ##### Key Technologies:
24 | - Apache Spark (Spark Streaming)
25 | - Apache Kafka
26 | - Python/PySpark
27 | 
28 | ## 2. Format of sensor data
29 | 
30 | I used simulated data for this project. ```iotsimulator.py``` generates JSON data in the format below.
31 | 
32 | ```
33 | 
34 | 
35 | {
36 |   "guid": "0-ZZZ12345678-08K",
37 |   "destination": "0-AAA12345678",
38 |   "state": "CA",
39 |   "eventTime": "2016-11-16T13:26:39.447974Z",
40 |   "payload": {
41 |     "format": "urn:example:sensor:temp",
42 |     "data": {
43 |       "temperature": 59.7
44 |     }
45 |   }
46 | }
47 | ```
48 | 
49 | 
50 | 
51 | Field | Description
52 | --- | ---
53 | guid | A globally unique identifier associated with a sensor.
54 | destination | An identifier of the destination that sensors send data to (a single fixed ID is used in this project)
55 | state | A randomly chosen U.S. state. The same guid is always paired with the same state
56 | eventTime | A timestamp of when the data was generated
57 | format | The format of the data
58 | temperature | Calculated by continuously adding a random number (between -1.0 and 1.0) to each state's average annual temperature every time data is generated. https://www.currentresults.com/Weather/US/average-annual-state-temperatures.php
59 | 
60 | 
61 | 
62 | If you need to generate 10,000 sensor data messages:
63 | 
64 | ```
65 | $ ./iotsimulator.py 10000 > testdata.txt
66 | ```
67 | 
68 | ## 3. Analysis of data
69 | In this project, I implemented 4 types of real-time analysis.
70 | - Average temperature by each state (Values sorted in descending order)
71 | - Total messages processed
72 | - Number of sensors by each state (Keys sorted in ascending order)
73 | - Total number of sensors
74 | 
75 | #### (1) Average temperature by each state (Values sorted in descending order)
76 | 
77 | ```python
78 | avgTempByState = jsonRDD.map(lambda x: (x['state'], (x['payload']['data']['temperature'], 1))) \
79 |                         .reduceByKey(lambda x,y: (x[0]+y[0], x[1]+y[1])) \
80 |                         .map(lambda x: (x[0], x[1][0]/x[1][1]))
81 | sortedTemp = avgTempByState.transform(lambda x: x.sortBy(lambda y: y[1], False))
82 | ```
83 | 
84 | - In the first ```.map``` operation, PySpark creates pair RDDs (k, v) where _k_ is the value of the field ```state```, and _v_ is the value of the field ```temperature``` paired with a count of 1
85 | 
86 | ```
87 | 
88 | 
89 | ('StateA', (50.0, 1))
90 | ('StateB', (20.0, 1))
91 | ('StateB', (21.0, 1))
92 | ('StateC', (70.0, 1))
93 | ('StateA', (52.0, 1))
94 | ('StateB', (22.0, 1))
95 | ...
96 | ```
97 | 
98 | - In the next ```.reduceByKey``` operation, PySpark aggregates the values that share the same key and reduces them to a single entry
99 | 
100 | ```
101 | 
102 | 
103 | ('StateA', (102.0, 2))
104 | ('StateB', (63.0, 3))
105 | ('StateC', (70.0, 1))
106 | ...
107 | ```
108 | 
109 | - In the next ```.map``` operation, PySpark calculates the average temperature by dividing the sum of ```temperature``` values by the total count
110 | 
111 | ```
112 | 
113 | 
114 | ('StateA', 51.0)
115 | ('StateB', 21.0)
116 | ('StateC', 70.0)
117 | ...
118 | ```
119 | 
120 | - Finally, PySpark sorts the entries by average temperature in descending order
121 | 
122 | ```
123 | 
124 | 
125 | ('StateC', 70.0)
126 | ('StateA', 51.0)
127 | ('StateB', 21.0)
128 | ...
129 | ```
130 | 
131 | 
132 | #### (2) Total messages processed
133 | 
134 | ```python
135 | messageCount = jsonRDD.map(lambda x: 1) \
136 |                       .reduce(add) \
137 |                       .map(lambda x: "Total number of messages: " + unicode(x))
138 | ```
139 | 
140 | - Simply maps each message to a count of 1, and then sums them all up
141 | 
142 | 
143 | #### (3) Number of sensors by each state (Keys sorted in ascending order)
144 | 
145 | ```python
146 | numSensorsByState = jsonRDD.map(lambda x: (x['state'] + ":" + x['guid'], 1)) \
147 |                            .reduceByKey(lambda a,b: a*b) \
148 |                            .map(lambda x: (re.sub(r":.*", "", x[0]), x[1])) \
149 |                            .reduceByKey(lambda a,b: a+b)
150 | sortedSensorCount = numSensorsByState.transform(lambda x: x.sortBy(lambda y: y[0], True))
151 | ```
152 | 
153 | - In the first ```.map``` operation, PySpark creates pair RDDs (k, v) where _k_ is the values of the fields ```state``` and ```guid``` concatenated with ":", and _v_ is a count of 1
154 | 
155 | ```
156 | 
157 | 
158 | ('StateB:0-ZZZ12345678-28F', 1)
159 | ('StateB:0-ZZZ12345678-30P', 1)
160 | ('StateA:0-ZZZ12345678-08K', 1)
161 | ('StateC:0-ZZZ12345678-60F', 1)
162 | ('StateA:0-ZZZ12345678-08K', 1)
163 | ('StateB:0-ZZZ12345678-30P', 1)
164 | ...
165 | ```
166 | 
167 | - In the next ```.reduceByKey``` operation, PySpark merges entries that share the same key into a single entry; the values stay 1 (since 1 * 1 = 1), leaving one entry per distinct state:guid pair
168 | 
169 | ```
170 | ('StateB:0-ZZZ12345678-28F', 1)
171 | ('StateB:0-ZZZ12345678-30P', 1)
172 | ('StateA:0-ZZZ12345678-08K', 1)
173 | ('StateC:0-ZZZ12345678-60F', 1)
174 | ...
175 | ```
176 | 
177 | - In the next ```.map``` operation, PySpark strips the ":" separator and the guid, leaving only the state
178 | 
179 | ```
180 | 
181 | 
182 | ('StateB', 1)
183 | ('StateB', 1)
184 | ('StateA', 1)
185 | ('StateC', 1)
186 | ...
187 | ```
188 | 
189 | - In the last ```.reduceByKey``` operation, PySpark sums the values that share the same key, producing a sensor count per state
190 | 
191 | ```
192 | 
193 | 
194 | ('StateB', 2)
195 | ('StateA', 1)
196 | ('StateC', 1)
197 | ...
198 | ```
199 | 
200 | - Finally, PySpark sorts the entries by key in ascending order
201 | 
202 | ```
203 | 
204 | 
205 | ('StateA', 1)
206 | ('StateB', 2)
207 | ('StateC', 1)
208 | ...
209 | ```
210 | 
211 | 
212 | #### (4) Total number of sensors
213 | 
214 | ```python
215 | sensorCount = jsonRDD.map(lambda x: (x['guid'], 1)) \
216 |                      .reduceByKey(lambda a,b: a*b) \
217 |                      .map(lambda x: 1).reduce(add) \
218 |                      .map(lambda x: "Total number of sensors: " + unicode(x))
219 | ```
220 | 
221 | - In the first ```.map``` operation, PySpark creates pair RDDs (k, v) where _k_ is the value of the field ```guid```, and _v_ is a count of 1
222 | 
223 | ```
224 | 
225 | 
226 | ('0-ZZZ12345678-08K', 1)
227 | ('0-ZZZ12345678-28F', 1)
228 | ('0-ZZZ12345678-30P', 1)
229 | ('0-ZZZ12345678-60F', 1)
230 | ('0-ZZZ12345678-08K', 1)
231 | ('0-ZZZ12345678-30P', 1)
232 | ...
233 | ```
234 | 
235 | - In the next ```.reduceByKey``` operation, PySpark merges entries that share the same key into a single entry; the values stay 1 (since 1 * 1 = 1), leaving one entry per distinct sensor
236 | 
237 | ```
238 | 
239 | 
240 | ('0-ZZZ12345678-08K', 1)
241 | ('0-ZZZ12345678-28F', 1)
242 | ('0-ZZZ12345678-30P', 1)
243 | ('0-ZZZ12345678-60F', 1)
244 | ...
245 | ```
246 | 
247 | - In the final ```.reduce``` step, PySpark counts the remaining entries, which yields the total number of distinct sensors
248 | 
249 | 
250 | ## 4. Results
251 | 
252 | The console output below shows Spark Streaming processing and analyzing the 10,000 sensor messages in real-time.
253 | 
254 | ```
255 | [ec2-user@ip-172-31-9-184 ~]$ spark-submit --jars spark-streaming-kafka-0-8-assembly_2.11-2.0.0-preview.jar \
256 | ./kafka-direct-iotmsg.py localhost:9092 iotmsgs
257 | 
258 | 
259 | 
260 | -------------------------------------------
261 | Time: 2016-11-21 13:30:06
262 | -------------------------------------------
263 | 
264 | -------------------------------------------
265 | Time: 2016-11-21 13:30:06
266 | -------------------------------------------
267 | 
268 | -------------------------------------------
269 | Time: 2016-11-21 13:30:06
270 | -------------------------------------------
271 | 
272 | -------------------------------------------
273 | Time: 2016-11-21 13:30:06
274 | -------------------------------------------
275 | 
276 | -------------------------------------------
277 | Time: 2016-11-21 13:30:09 <- Average temperature by each state (Values sorted in descending order)
278 | -------------------------------------------
279 | (u'FL', 70.70635838150288)
280 | (u'HI', 70.59879999999998)
281 | (u'LA', 67.0132911392405)
282 | (u'TX', 64.63165467625899)
283 | (u'GA', 64.22095808383233)
284 | (u'AL', 63.29540229885056)
285 | (u'MS', 62.92658730158729)
286 | (u'SC', 62.889361702127644)
287 | (u'AZ', 61.161951219512204)
288 | (u'AR', 60.006074766355134)
289 | (u'CA', 59.56944444444444)
290 | (u'NC', 59.13968253968251)
291 | (u'OK', 59.10108108108111)
292 | (u'DC', 57.916810344827596)
293 | (u'TN', 57.18434782608696)
294 | (u'KY', 56.375510204081664)
295 | (u'DE', 54.6767634854772)
296 | (u'VA', 54.5506726457399)
297 | (u'MD', 54.30196078431374)
298 | (u'KS', 53.60306748466258)
299 | (u'MO', 53.59634146341466)
300 | (u'NM', 53.55384615384617)
301 | (u'NJ', 52.90479452054793)
302 | (u'IN', 52.55497382198954)
303 | (u'IL', 51.9223958333333)
304 | (u'WV', 51.89952380952379)
305 | (u'OH', 50.52346368715085)
306 | (u'NV', 50.38380281690144)
307 | (u'RI', 49.90240963855423)
308 | (u'PA', 49.61223404255321)
309 | (u'UT', 49.00546448087432)
310 | (u'CT', 48.47242990654204)
311 | (u'NE', 47.96193548387097)
312 | (u'OR', 47.908675799086716)
313 | (u'WA', 47.88577777777777)
314 | (u'MA', 47.81961722488036)
315 | (u'IA', 47.54875621890548)
316 | (u'SD', 45.449999999999996)
317 | (u'CO', 45.16935483870966)
318 | (u'NY', 44.81830985915495)
319 | (u'MI', 44.58102564102565)
320 | (u'ID', 44.56483050847461)
321 | (u'NH', 43.39304347826085)
322 | (u'MT', 43.05155709342561)
323 | (u'WY', 42.9689655172414)
324 | (u'VT', 42.668322981366465)
325 | (u'WI', 41.81523809523809)
326 | (u'ME', 41.695061728395046)
327 | (u'MN', 40.348076923076924)
328 | (u'ND', 40.23502538071064)
329 | (u'AK', 26.85450819672129)
330 | 
331 | -------------------------------------------
332 | Time: 2016-11-21 13:30:09 <- Total messages processed
333 | -------------------------------------------
334 | Total number of messages: 10000
335 | 
336 | -------------------------------------------
337 | Time: 2016-11-21 13:30:09 <- Number of sensors by each state (Keys sorted in ascending order)
338 | -------------------------------------------
339 | (u'AK', 53)
340 | (u'AL', 34)
341 | (u'AR', 47)
342 | (u'AZ', 40)
343 | (u'CA', 28)
344 | (u'CO', 37)
345 | (u'CT', 41)
346 | (u'DC', 44)
347 | (u'DE', 50)
348 | (u'FL', 39)
349 | (u'GA', 34)
350 | (u'HI', 50)
351 | (u'IA', 45)
352 | (u'ID', 41)
353 | (u'IL', 42)
354 | (u'IN', 41)
355 | (u'KS', 35)
356 | (u'KY', 42)
357 | (u'LA', 36)
358 | (u'MA', 44)
359 | (u'MD', 43)
360 | (u'ME', 38)
361 | (u'MI', 41)
362 | (u'MN', 42)
363 | (u'MO', 50)
364 | (u'MS', 50)
365 | (u'MT', 57)
366 | (u'NC', 41)
367 | (u'ND', 40)
368 | (u'NE', 33)
369 | (u'NH', 41)
370 | (u'NJ', 34)
371 | (u'NM', 37)
372 | (u'NV', 30)
373 | (u'NY', 26)
374 | (u'OH', 42)
375 | (u'OK', 36)
376 | (u'OR', 47)
377 | (u'PA', 41)
378 | (u'RI', 32)
379 | (u'SC', 39)
380 | (u'SD', 39)
381 | (u'TN', 53)
382 | (u'TX', 34)
383 | (u'UT', 36)
384 | (u'VA', 45)
385 | (u'VT', 38)
386 | (u'WA', 45)
387 | (u'WI', 47)
388 | (u'WV', 44)
389 | (u'WY', 42)
390 | 
391 | -------------------------------------------
392 | Time: 2016-11-21 13:30:09 <- Total number of sensors
393 | -------------------------------------------
394 | Total number of sensors: 2086
395 | 
396 | -------------------------------------------
397 | Time: 2016-11-21 13:30:12
398 | -------------------------------------------
399 | 
400 | -------------------------------------------
401 | Time: 2016-11-21 13:30:12
402 | -------------------------------------------
403 | 
404 | -------------------------------------------
405 | Time: 2016-11-21 13:30:12
406 | -------------------------------------------
407 | 
408 | -------------------------------------------
409 | Time: 2016-11-21 13:30:12
410 | -------------------------------------------
411 | 
412 | 
413 | 
414 | ```
--------------------------------------------------------------------------------
/image.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yugokato/Spark-and-Kafka_IoT-Data-Processing-and-Analytics/03b2573f2845c22957d7c2008c9f2833ba1e6cf4/image.png
--------------------------------------------------------------------------------
/iotsimulator.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | 
3 | '''
4 | To generate JSON data:
5 | $ ./iotsimulator.py <num_msgs>
6 | 
7 | '''
8 | 
9 | import sys
10 | import datetime
11 | import random
12 | from random import randrange
13 | import re
14 | import copy
15 | 
16 | 
17 | # Set number of simulated messages to generate
18 | if len(sys.argv) > 1:
19 |     num_msgs = int(sys.argv[1])
20 | else:
21 |     num_msgs = 1
22 | 
23 | # mapping of a guid and a state {guid: state}
24 | device_state_map = {}
25 | 
26 | # average annual temperature of each state
27 | temp_base = {'WA': 48.3, 'DE': 55.3, 'DC': 58.5, 'WI': 43.1,
28 |              'WV': 51.8, 'HI': 70.0, 'FL': 70.7, 'WY': 42.0,
29 |              'NH': 43.8, 'NJ': 52.7, 'NM': 53.4, 'TX': 64.8,
30 |              'LA': 66.4, 'NC': 59.0, 'ND': 40.4, 'NE': 48.8,
31 |              'TN': 57.6, 'NY': 45.4, 'PA': 48.8, 'CA': 59.4,
32 |              'NV': 49.9, 'VA': 55.1, 'CO': 45.1, 'AK': 26.6,
33 |              'AL': 62.8, 'AR': 60.4, 'VT': 42.9, 'IL': 51.8,
34 |              'GA': 63.5, 'IN': 51.7, 'IA': 47.8, 'OK': 59.6,
35 |              'AZ': 60.3, 'ID': 44.4, 'CT': 49.0, 'ME': 41.0,
36 |              'MD': 54.2, 'MA': 47.9, 'OH': 50.7, 'UT': 48.6,
37 |              'MO': 54.5, 'MN': 41.2, 'MI': 44.4, 'RI': 50.1,
38 |              'KS': 54.3, 'MT': 42.7, 'MS': 63.4, 'SC': 62.4,
39 |              'KY': 55.6, 'OR': 48.4, 'SD': 45.2}
40 | 
41 | # latest temperature measured by sensors {guid: temperature}
42 | current_temp = {}
43 | 
44 | # Fixed values
45 | guid_base = "0-ZZZ12345678-"
46 | destination = "0-AAA12345678"
47 | format = "urn:example:sensor:temp"
48 | 
49 | # Choice for random letter
50 | letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
51 | 
52 | iotmsg_header = """\
53 | { "guid": "%s",
54 |   "destination": "%s",
55 |   "state": "%s", """
56 | 
57 | iotmsg_eventTime = """\
58 |   "eventTime": "%sZ", """
59 | 
60 | iotmsg_payload ="""\
61 |   "payload": {"format": "%s", """
62 | 
63 | iotmsg_data ="""\
64 |     "data": { "temperature": %.1f }
65 |   }
66 | }"""
67 | 
68 | 
69 | ##### Generate JSON output:
70 | if __name__ == "__main__":
71 |     for counter in range(0, num_msgs):
72 |         rand_num = str(random.randrange(0, 9)) + str(random.randrange(0, 9))
73 |         rand_letter = random.choice(letters)
74 |         temp_init_weight = random.uniform(-5, 5)
75 |         temp_delta = random.uniform(-1, 1)
76 | 
77 |         guid = guid_base + rand_num + rand_letter
78 |         state = random.choice(temp_base.keys())
79 | 
80 |         if (not guid in device_state_map):  # first entry
81 |             device_state_map[guid] = state
82 |             current_temp[guid] = temp_base[state] + temp_init_weight
83 | 
84 |         elif (not device_state_map[guid] == state):  # The guid already exists but the randomly chosen state doesn't match
85 |             state = device_state_map[guid]
86 | 
87 |         temperature = current_temp[guid] + temp_delta
88 |         current_temp[guid] = temperature  # update current temperature
89 |         today = datetime.datetime.today()
90 |         datestr = today.isoformat()
91 | 
92 |         print re.sub(r"[\s+]", "", iotmsg_header) % (guid, destination, state),
93 |         print re.sub(r"[\s+]", "", iotmsg_eventTime) % (datestr),
94 |         print re.sub(r"[\s+]", "", iotmsg_payload) % (format),
95 |         print re.sub(r"[\s+]", "", iotmsg_data) % (temperature)
--------------------------------------------------------------------------------
/kafka-direct-iotmsg.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | 
3 | from __future__ import print_function
4 | import sys
5 | import re
6 | from pyspark import SparkContext
7 | from pyspark.streaming import StreamingContext
8 | from pyspark.streaming.kafka import KafkaUtils
9 | from operator import add
10 | import json
11 | 
12 | 
13 | if __name__ == "__main__":
14 |     if len(sys.argv) != 3:
15 |         print("Usage: kafka-direct-iotmsg.py <broker_list> <topic>", file=sys.stderr)
16 |         exit(-1)
17 | 
18 |     sc = SparkContext(appName="IoT")
19 |     ssc = StreamingContext(sc, 3)
20 | 
21 |     brokers, topic = sys.argv[1:]
22 |     kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
23 | 
24 |     # Read in the Kafka Direct Stream into a TransformedDStream
25 |     jsonRDD = kvs.map(lambda (k,v): json.loads(v))
26 | 
27 | 
28 |     ##### Processing #####
29 | 
30 |     # Average temperature in each state
31 |     avgTempByState = jsonRDD.map(lambda x: (x['state'], (x['payload']['data']['temperature'], 1))) \
32 |                             .reduceByKey(lambda a,b: (a[0]+b[0], a[1]+b[1])) \
33 |                             .map(lambda x: (x[0], x[1][0]/x[1][1]))
34 |     sortedTemp = avgTempByState.transform(lambda x: x.sortBy(lambda y: y[1], False))
35 |     sortedTemp.pprint(num=100000)
36 | 
37 |     # total number of messages
38 |     messageCount = jsonRDD.map(lambda x: 1) \
39 |                           .reduce(add) \
40 |                           .map(lambda x: "Total number of messages: " + unicode(x))
41 |     messageCount.pprint()
42 | 
43 | 
44 |     # Number of devices in each state
45 |     numSensorsByState = jsonRDD.map(lambda x: (x['state'] + ":" + x['guid'], 1)) \
46 |                                .reduceByKey(lambda a,b: a*b) \
47 |                                .map(lambda x: (re.sub(r":.*", "", x[0]), x[1])) \
48 |                                .reduceByKey(lambda a,b: a+b)
49 |     sortedSensorCount = numSensorsByState.transform(lambda x: x.sortBy(lambda y: y[0], True))
50 |     sortedSensorCount.pprint(num=10000)
51 | 
52 |     # total number of devices
53 |     sensorCount = jsonRDD.map(lambda x: (x['guid'], 1)) \
54 |                          .reduceByKey(lambda a,b: a*b) \
55 |                          .map(lambda x: 1) \
56 |                          .reduce(add) \
57 |                          .map(lambda x: "Total number of sensors: " + unicode(x))
58 |     sensorCount.pprint(num=10000)
59 | 
60 | 
61 |     ssc.start()
62 |     ssc.awaitTermination()
63 | 
64 | 
--------------------------------------------------------------------------------
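The four aggregations in the README depend only on map/reduceByKey semantics, so their logic can be sanity-checked in plain Python without a Spark cluster. Below is a minimal sketch of that check; the `reduce_by_key` helper and the sample `records` are my own illustrative stand-ins, not part of the project.

```python
import re
# reduce_by_key is a hypothetical helper mimicking RDD.reduceByKey on a list of pairs

def reduce_by_key(pairs, fn):
    """Group (k, v) pairs by key and fold the values with fn."""
    acc = {}
    for k, v in pairs:
        acc[k] = fn(acc[k], v) if k in acc else v
    return list(acc.items())

# A few sample records shaped like the simulator's JSON output
records = [
    {"guid": "0-ZZZ12345678-08K", "state": "StateA",
     "payload": {"data": {"temperature": 50.0}}},
    {"guid": "0-ZZZ12345678-28F", "state": "StateB",
     "payload": {"data": {"temperature": 20.0}}},
    {"guid": "0-ZZZ12345678-30P", "state": "StateB",
     "payload": {"data": {"temperature": 22.0}}},
    {"guid": "0-ZZZ12345678-08K", "state": "StateA",
     "payload": {"data": {"temperature": 52.0}}},
]

# (1) Average temperature by state: sum (temperature, 1) pairs, then divide
pairs = [(r["state"], (r["payload"]["data"]["temperature"], 1)) for r in records]
sums = reduce_by_key(pairs, lambda a, b: (a[0] + b[0], a[1] + b[1]))
avg_by_state = sorted(((k, s / n) for k, (s, n) in sums), key=lambda x: -x[1])

# (2) Total messages processed
msg_count = sum(1 for _ in records)

# (3) Sensors per state: key by "state:guid" so duplicates collapse to one
# entry (1 * 1 = 1), then strip the guid and sum per state
by_state_guid = reduce_by_key(
    [(r["state"] + ":" + r["guid"], 1) for r in records], lambda a, b: a * b)
sensors_by_state = reduce_by_key(
    [(re.sub(r":.*", "", k), v) for k, v in by_state_guid], lambda a, b: a + b)

# (4) Total distinct sensors: dedupe on guid, then count the entries
total_sensors = len(reduce_by_key([(r["guid"], 1) for r in records],
                                  lambda a, b: a * b))
```

The `lambda a, b: a * b` reducer mirrors the project's deduplication trick: multiplying counts of 1 keeps every surviving value at 1, so the number of entries after `reduceByKey` equals the number of distinct keys.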