├── LICENSE
├── README.md
├── dashboard.py
├── push_data_to_kafka.py
├── sensor.py
└── structure_validate_store.py

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2020 Pahul Preet Singh Kohli

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# Real-Time Data Pipeline Using Kafka and Spark

## Data Pipeline Architecture

![](https://lh4.googleusercontent.com/eZykZAZj43p1oYAZFf_X3CINjHx6qz1rRevNptNWWisXYmDYDEae7Fhla7ETWZ2TmGRvTECBlMtFBe6aKHWaVUac7imu_hOXgVLZwFebuvE-_O_FmSZgdb5kBJAFMAxBl3AAgsYD)

- ### API

  - The API mimics water quality sensor data similar to the dataset shared [here](https://data.world/cityofchicago/beach-water-quality-automated-sensors).

  - The API is implemented with the Flask web framework, and a sample response looks as follows:

        2020-02-17T11:12:58.765969 26.04 540.1 13.12 Montrose_Beach 758028

  ![](https://lh6.googleusercontent.com/TDsc79yE-D_GBX7hFNrbgGlnP81TaRvBESeE2JvyEb8VaFzO_h1jNezTLsTg8CRsjfMtJOFrxPJi0EkqTOuRXlpP6U0SwuSMtFg4_rYYzNF5iASjx3MFIM4jKe5fjTKlVbAm4OMK)
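  The response is a single whitespace-separated line, so consumers can split it positionally. A minimal parsing sketch (the variable names are illustrative; the field order follows the sample above):

  ```python
  sample = "2020-02-17T11:12:58.765969 26.04 540.1 13.12 Montrose_Beach 758028"

  # Fields arrive in a fixed order: timestamp, water temperature, turbidity,
  # battery life, beach name, measurement id.
  timestamp, water_temp, turbidity, battery_life, beach, measurement_id = sample.split()

  print(float(water_temp), float(turbidity))  # 26.04 540.1
  ```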
- ### Kafka Producer (Topic: RawSensorData)

  - Each reading from the API stream is pushed by a Kafka producer to the topic `RawSensorData`.

  ![](https://lh6.googleusercontent.com/KqaLvzLkdC2aYar0UeQ9raBgJgf0QXLyGe9GFr6z0uT6O-sx4ZizobVCdgIMTSZ8itXtiHfIThLHc5FoAwXtkA2U_lVZRJDQdLNvcNPKAIfS1Sa6GuiaTcCiABlpSlnhrfoSqn1s)

- ### Apache Spark and Kafka Consumer (Topic: CleanSensorData)

  - The data under the `RawSensorData` topic is streamed through a Kafka consumer and is then structured and validated using Spark.

  - The cleaned data is then pushed to MongoDB and re-published by a Kafka producer to the topic `CleanSensorData`.

  ![](https://lh6.googleusercontent.com/DBMkx3tX90NCtokgNYT4BkjJGujCyeZk08X4w99vo2zfsBN9Yz1YGtb38Tcc3F6_HtMbML9NLVcHPFW310MDSSLWg8G8KoTuo-sC00aApDdNW9ql1ny605pwV6r5DS-Y5D325elU)

- ### MongoDB

  - The structured data is pushed to a MongoDB collection with the following schema:

    ```markdown
    | Keys             | Data Type |
    |------------------|-----------|
    | _id              | Object Id |
    | Beach            | String    |
    | MeasurementID    | long      |
    | BatteryLife      | Double    |
    | RawData          | String    |
    | WaterTemperature | Double    |
    | Turbidity        | Double    |
    | TimeStamp        | timestamp |
    ```
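  A minimal sketch of reading the stored records back with PyMongo, assuming the connection defaults used by `structure_validate_store.py` below (database `RealTimeDB`, collection `RealTimeCollection` on `localhost:27017`):

  ```python
  from pymongo import MongoClient

  client = MongoClient("localhost", 27017)
  collection = client["RealTimeDB"]["RealTimeCollection"]

  # Print the five most recently inserted readings (ObjectIds sort by insertion time).
  for doc in collection.find().sort("_id", -1).limit(5):
      print(doc["TimeStamp"], doc["WaterTemperature"], doc["Turbidity"])
  ```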
- ### Real-Time Dashboard

  - The dashboard is implemented with the Bokeh visualization library; data is streamed through a Kafka consumer subscribed to the topic `CleanSensorData`.

  ![](https://lh5.googleusercontent.com/qtt7B4EC1FCRpqWreTOrk74gAXTDvtJ3TxTKs6KWaAbtB_5MZ5-4-GSJYkbuLGRHMEUK5Gzp4njgEiklshdTs-LbCAhOeI-u96k5g9vf0IU6Av_RQx0CiR1PXY4jbMHkmesMnNhM)

## How to run the code

- #### Start the API (port: 3030)

      python sensor.py

- #### Start Zookeeper

      bash /opt/zookeeper-3.4.14/bin/zkServer.sh start

- #### Start Kafka

      bin/kafka-server-start.sh config/server.properties

- #### Create RawSensorData Topic

      ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic RawSensorData

- #### Create CleanSensorData Topic

      ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic CleanSensorData

- #### Push Data From API Stream to Kafka Topic: RawSensorData

      python push_data_to_kafka.py

- #### Structure and Validate Data, Push to MongoDB and Kafka Topic: CleanSensorData

      ./bin/spark-submit structure_validate_store.py

- #### View RawSensorData Topic

      bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic RawSensorData --from-beginning

- #### View CleanSensorData Topic

      bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic CleanSensorData --from-beginning

- #### Real-Time Dashboard - Visualization

      bokeh serve --show dashboard.py
--------------------------------------------------------------------------------

/dashboard.py:
--------------------------------------------------------------------------------
import ast
from datetime import datetime

import pandas as pd
import pytz
from bokeh.driving import count
from bokeh.layouts import row
from bokeh.models import ColumnDataSource, DatetimeTickFormatter
from bokeh.models.widgets import Div
from bokeh.plotting import curdoc, figure
from kafka import KafkaConsumer

tz = pytz.timezone('Asia/Calcutta')

UPDATE_INTERVAL = 1000  # callback interval in milliseconds
ROLLOVER = 10  # number of displayed data points

source = ColumnDataSource({"x": [], "y": []})
consumer = KafkaConsumer('CleanSensorData',
                         auto_offset_reset='earliest',
                         bootstrap_servers=['localhost:9092'],
                         consumer_timeout_ms=1000)
div = Div(text='', width=120, height=35)


@count()
def update(frame):
    # Pull the next message; skip this tick if the consumer timed out empty.
    msg = next(iter(consumer), None)
    if msg is None:
        return

    values = ast.literal_eval(msg.value.decode("utf-8"))

    # json_util serializes the timestamp as epoch milliseconds under "$date".
    x = datetime.fromtimestamp(values["TimeStamp"]["$date"] / 1000.0, tz).isoformat()
    x = pd.to_datetime(x)

    y = values['WaterTemperature']
    # Skip records where validation replaced the reading with an error string.
    if not isinstance(y, (int, float)):
        return

    div.text = "TimeStamp: " + str(x)
    source.stream({"x": [x], "y": [y]}, ROLLOVER)


p = figure(title="Water Temperature Sensor Data", x_axis_type="datetime", plot_width=1000)
p.line("x", "y", source=source)

p.xaxis.formatter = DatetimeTickFormatter(hourmin=['%H:%M'])
p.xaxis.axis_label = 'Time'
p.yaxis.axis_label = 'Value'
p.title.align = "right"
p.title.text_color = "orange"
p.title.text_font_size = "25px"

doc = curdoc()
doc.add_root(row(children=[div, p]))
doc.add_periodic_callback(update, UPDATE_INTERVAL)
--------------------------------------------------------------------------------

/push_data_to_kafka.py:
--------------------------------------------------------------------------------
import time

import requests
from kafka import KafkaProducer


def get_sensor_data_stream():
    """Fetch one reading from the sensor API; return an error marker on failure."""
    try:
        url = 'http://0.0.0.0:3030/sensordata'
        r = requests.get(url)
        return r.text
    except requests.exceptions.RequestException:
        return "Error in Connection"


producer = KafkaProducer(bootstrap_servers=['localhost:9092'])

# Poll the API once a second and forward each raw reading to Kafka.
while True:
    msg = get_sensor_data_stream()
    producer.send("RawSensorData", msg.encode('utf-8'))
    time.sleep(1)
--------------------------------------------------------------------------------

/sensor.py:
--------------------------------------------------------------------------------
import random
from datetime import datetime

from flask import Flask, Response

app = Flask(__name__)


@app.route('/sensordata')
def get_sensor_data():
    """Return one synthetic water-quality reading as a space-separated line."""
    beach = 'Montrose_Beach'

    timestamp = datetime.now().isoformat()
    water_temperature = str(round(random.uniform(0.0, 31.5), 2))
    turbidity = str(round(random.uniform(0.0, 1683.48), 2))
    battery_life = str(round(random.uniform(4.8, 13.3), 2))
    measurement_id = str(random.randint(10000, 999999))

    response = " ".join([timestamp, water_temperature, turbidity, battery_life, beach, measurement_id])

    return Response(response, mimetype='text/plain')


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=3030)
--------------------------------------------------------------------------------
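A quick way to sanity-check the sensor endpoint before wiring Kafka in: a minimal sketch, assuming `sensor.py` is already running locally on port 3030.

```python
import requests

# One-off smoke test against the local sensor API.
print(requests.get("http://localhost:3030/sensordata").text)
# e.g. 2020-02-17T11:12:58.765969 26.04 540.1 13.12 Montrose_Beach 758028
```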
/structure_validate_store.py:
--------------------------------------------------------------------------------
import json

from bson import json_util
from dateutil import parser
from kafka import KafkaConsumer, KafkaProducer
from pymongo import MongoClient
from pyspark import SparkContext

# MongoDB connection (database and collection are created on first insert).
client = MongoClient('localhost', 27017)
db = client['RealTimeDB']
collection = db['RealTimeCollection']


def timestamp_exist(timestamp):
    """Return True if a record with this TimeStamp is already stored."""
    return collection.count_documents({'TimeStamp': timestamp}) > 0


def structure_validate_data(msg):
    """Split a raw sensor line into typed fields, flagging invalid values."""
    data_dict = {}

    # Distribute the whitespace-separated fields as an RDD, then collect once.
    raw = msg.value.decode("utf-8")
    fields = sc.parallelize(raw.split()).collect()

    data_dict["RawData"] = raw

    # Data validation: each field is parsed independently so that one bad
    # value does not discard the whole record.
    try:
        data_dict["TimeStamp"] = parser.isoparse(fields[0])
    except Exception:
        data_dict["TimeStamp"] = "Error"

    try:
        data_dict["WaterTemperature"] = float(fields[1])
        if data_dict["WaterTemperature"] > 99 or data_dict["WaterTemperature"] < -10:
            data_dict["WaterTemperature"] = "Sensor Malfunctions"
    except Exception:
        data_dict["WaterTemperature"] = "Error"

    try:
        data_dict["Turbidity"] = float(fields[2])
        if data_dict["Turbidity"] > 5000:
            data_dict["Turbidity"] = "Sensor Malfunctions"
    except Exception:
        data_dict["Turbidity"] = "Error"

    try:
        data_dict["BatteryLife"] = float(fields[3])
    except Exception:
        data_dict["BatteryLife"] = "Error"

    try:
        data_dict["Beach"] = str(fields[4])
    except Exception:
        data_dict["Beach"] = "Error"

    try:
        data_dict["MeasurementID"] = int(fields[5])
    except Exception:
        data_dict["MeasurementID"] = "Error"

    return data_dict


sc = SparkContext.getOrCreate()
sc.setLogLevel("WARN")

consumer = KafkaConsumer('RawSensorData',
                         auto_offset_reset='earliest',
                         bootstrap_servers=['localhost:9092'],
                         consumer_timeout_ms=1000)
producer = KafkaProducer(bootstrap_servers=['localhost:9092'])

for msg in consumer:
    if msg.value.decode("utf-8") != "Error in Connection":
        data = structure_validate_data(msg)

        # Deduplicate on TimeStamp, then persist and re-publish the clean record.
        if not timestamp_exist(data['TimeStamp']):
            collection.insert_one(data)
            producer.send("CleanSensorData",
                          json.dumps(data, default=json_util.default).encode('utf-8'))

        print(data)
--------------------------------------------------------------------------------
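For reference, a cleaned record as published to `CleanSensorData` looks roughly like the sketch below (values are illustrative, taken from the sample reading in the README; the stored MongoDB document additionally carries an `_id`). `bson.json_util` serializes the parsed datetime in its legacy `{"$date": <epoch milliseconds>}` form, which is why `dashboard.py` divides by 1000 before calling `datetime.fromtimestamp`:

```python
# Illustrative CleanSensorData payload (values made up to match the README sample):
sample = {
    "RawData": "2020-02-17T11:12:58.765969 26.04 540.1 13.12 Montrose_Beach 758028",
    "TimeStamp": {"$date": 1581937978765},  # epoch milliseconds via bson.json_util
    "WaterTemperature": 26.04,
    "Turbidity": 540.1,
    "BatteryLife": 13.12,
    "Beach": "Montrose_Beach",
    "MeasurementID": 758028,
}
```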